# Data Mining

Data mining is the process of discovering meaningful patterns, trends, and relationships in large datasets using statistical, machine learning, and database management techniques. It goes beyond simple data analysis by automatically extracting hidden knowledge that supports decision-making and helps us understand complex phenomena. Common applications include customer behavior analysis, fraud detection, medical diagnosis, and market trend prediction. By turning raw data into actionable insights, data mining serves as a critical tool in today’s data-driven world.

## Correlation and Causality

**Correlation refers to a statistical relationship between two variables** - when changes in one variable are associated with changes in another. For example, ice cream sales and beach attendance often rise together, showing a positive correlation. **Causality, on the other hand, means that one event directly influences or produces another.** If A causes B, then changing A will lead to a predictable change in B. While correlation can hint at possible causal links, it does not prove them.

The key difference is that correlation simply describes a relationship, while causality explains the underlying mechanism of that relationship. Many correlated events share a common cause or are influenced by other variables (confounders). For instance, both ice cream sales and drowning incidents increase in summer, but the cause is warmer weather - not ice cream itself.
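The confounder effect described above can be reproduced with a small simulation. This is only a sketch on synthetic data: the variable names, coefficients, and noise levels are assumptions made for illustration, not measurements. Temperature drives both quantities, and the correlation between them largely disappears once the linear effect of temperature is removed (a partial correlation).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical daily data: temperature is the common cause (confounder).
temperature = rng.normal(loc=20.0, scale=8.0, size=n)
ice_cream_sales = 5.0 * temperature + rng.normal(scale=20.0, size=n)
drownings = 0.3 * temperature + rng.normal(scale=2.0, size=n)

# Raw correlation: positive, although neither variable influences the other.
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals(y, x):
    """Remove the linear effect of x from y (least-squares line with intercept)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Partial correlation: correlate what remains after controlling for temperature.
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw correlation:     {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```

Controlling for a confounder in this way only works when the confounder is known and measured, which is one reason randomized experiments are preferred for causal claims.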

**A common misconception is assuming that "correlation implies causation".** This error, known as *cum hoc ergo propter hoc* and closely related to the [post hoc fallacy](https://en.wikipedia.org/wiki/Post_hoc_ergo_propter_hoc), can lead to flawed conclusions in research, business, and policy-making. Proper causal inference requires careful experimental design, statistical controls, or methods like randomized controlled trials, not just observational data. In short: correlation can point you toward possible causes, but causality must be established through deeper investigation.

[It is not that hard to find misleading examples!](https://www.tylervigen.com/spurious-correlations)

## Bonferroni’s principle 

Bonferroni’s principle is a statistical caution that says:
> If you keep looking for patterns in data without adjusting your criteria, you’re bound to find "significant" results purely by chance.

It reminds us that if you search a large enough dataset for correlations, patterns, or anomalies without proper statistical controls, you will almost certainly find patterns that are just random noise. This is especially important when working with high-dimensional data, where the number of possible comparisons is huge.
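A quick way to see this effect is to search pure noise for correlations. The sketch below is an illustration under stated assumptions: the sample size, the number of features, and the significance level are arbitrary choices, and the p-values are approximated with the Fisher z-transformation. The adjustment applied at the end (dividing the significance level by the number of tests) is the classical Bonferroni correction, a related remedy that the text above does not spell out.

```python
import math

import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 200, 1_000
alpha = 0.05

# Pure noise: the target is independent of every feature by construction.
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Pearson correlation of each feature column with the target.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# Approximate two-sided p-values via the Fisher z-transformation.
z = np.arctanh(r) * math.sqrt(n_samples - 3)
p_values = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])

# Naive search: every test uses alpha, so about alpha * n_features false hits.
n_naive = int((p_values < alpha).sum())

# Bonferroni correction: divide alpha by the number of tests performed.
n_bonferroni = int((p_values < alpha / n_features).sum())

print(f"'significant' without correction: {n_naive} of {n_features}")
print(f"'significant' with Bonferroni:    {n_bonferroni} of {n_features}")
```

Since every feature here is random noise, each "significant" hit in the uncorrected search is a false discovery.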

# Data

The most convenient way to think of the datasets that the majority
of data mining algorithms operate on is the tabular view. In this
view, the problem at hand can be treated as a (potentially gigantic)
spreadsheet whose rows correspond to data objects and whose
columns contain the observed attributes describing different
aspects of these data objects.

Another important aspect of the datasets we work with is the
measurement scale of the individual columns in the data matrix
(each corresponding to a random variable). The table below gives a
concise summary of the different measurement scales and some of the
most typical statistics that can be calculated for them:

| Type of attribute | Description | Examples | Statistics |
|-------------------|-------------|----------|------------|
| **Categorical**   |             |          |            |
| Nominal           | Values can only be checked for equality | names of cities, hair color | mode, entropy, correlation, χ²-test |
| Ordinal           | Values can be ordered (the `>` relation is meaningful) | grades {fail, pass, excellent} | median, percentiles |
| **Numerical**     |             |          |            |
| Interval          | Differences between values can be formed and interpreted | shoe sizes, dates, °C | mean, standard deviation, significance tests (e.g., F- and t-tests) |
| Ratio             | Ratios of values can be formed and interpreted | age, length, temperature in Kelvin | percent, geometric/harmonic mean, coefficient of variation |
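The table can be made concrete with a few pandas calls. The toy data below is hypothetical and exists only to illustrate which summaries suit which scale; the column names are assumptions, not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, one column per measurement scale.
toy = pd.DataFrame({
    "city": ["Szeged", "Budapest", "Szeged", "Pécs"],   # nominal
    "grade": ["pass", "fail", "excellent", "pass"],     # ordinal
    "temp_celsius": [21.5, 18.0, 25.0, 19.5],           # interval
    "age": [23, 35, 41, 29],                            # ratio
})

# Nominal: only equality is meaningful, so the mode is a sensible summary.
city_mode = toy["city"].mode()[0]

# Ordinal: declare the order explicitly, then use order statistics.
grade_order = ["fail", "pass", "excellent"]
toy["grade"] = pd.Categorical(toy["grade"], categories=grade_order, ordered=True)
# The median of the integer codes maps back to a category label.
median_code = int(toy["grade"].cat.codes.median())
grade_median = grade_order[median_code]

# Interval: differences are meaningful, so mean and standard deviation apply.
temp_mean = toy["temp_celsius"].mean()
temp_std = toy["temp_celsius"].std()

# Ratio: a true zero exists, so ratios and the geometric mean are meaningful.
age_ratio = toy["age"].max() / toy["age"].min()
age_geo_mean = float(np.exp(np.log(toy["age"]).mean()))

print(city_mode, grade_median, temp_mean, round(age_geo_mean, 2))
```

Storing integer codes for a nominal column (as the dataset below does for `season`) does not make operations such as the mean meaningful for it. The scale is a property of what was measured, not of the data type.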

Let's load the [Bike Rental Dataset](https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset). A slightly modified version that we are going to use can be found at `/data/rental.csv`.

Dataset features:
- `season`: The season, either 1: spring, 2: summer, 3: fall or 4: winter.
- `yr`: The year, encoded as 0 or 1 (0: 2011, 1: 2012 in the original dataset).
- `mnth`: The month, `{1,...,12}`.
- `holiday`: Indicator whether the day was a holiday or not.
- `weekday`: The day of the week, `{0,...,6}`.
- `workingday`: Indicator whether the day was a working day (1) or a weekend/holiday (0).
- `weathersit`: The weather situation on that day. One of:
  - 1: clear, few clouds, partly cloudy, cloudy
  - 2: mist + clouds, mist + broken clouds, mist + few clouds, mist
  - 3: light snow, light rain + thunderstorm + scattered clouds, light rain + scattered clouds
  - 4: heavy rain + ice pellets + thunderstorm + mist, snow + mist
- `temp`: Temperature in degrees Celsius.
- `atemp`: Felt temperature in Celsius.
- `hum`: Relative humidity in percent (0 to 100).
- `windspeed`: Wind speed in km per hour.
- `cnt`: Count of bicycles including both casual and registered users. The count is used as the target in the regression task.


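```python
# Load the dataset; the first CSV column holds the row index.
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/ficstamas/data-mining/8babb620f865532b769ff81d2f12ee1ef14084da/data/rental.csv", index_col=0)
df
```

|     | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | cnt |
|-----|--------|----|------|---------|---------|------------|------------|------|-------|-----|-----------|-----|
| 0   | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 24.175849 | 39.999250 | 80.5833 | 10.749882 | 985 |
| 1   | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 25.083466 | 39.346774 | 69.6087 | 16.652113 | 801 |
| 2   | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 17.229108 | 28.500730 | 43.7273 | 16.636703 | 1349 |
| 3   | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 17.400000 | 30.000052 | 59.0435 | 10.739832 | 1562 |
| 4   | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 18.666979 | 31.131820 | 43.6957 | 12.522300 | 1600 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 726 | 1 | 1 | 12 | 0 | 4 | 1 | 2 | 19.945849 | 30.958372 | 65.2917 | 23.458911 | 2114 |
| 727 | 1 | 1 | 12 | 0 | 5 | 1 | 2 | 19.906651 | 32.833036 | 59.0000 | 10.416557 | 3095 |
| 728 | 1 | 1 | 12 | 0 | 6 | 0 | 2 | 19.906651 | 31.998400 | 75.2917 | 8.333661 | 1341 |
| 729 | 1 | 1 | 12 | 0 | 0 | 0 | 1 | 20.024151 | 31.292200 | 48.3333 | 23.500518 | 1796 |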

```python
# Make the train-test split and separate the target variable.
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Every column except `cnt` is a feature; `cnt` is the regression target.
train_X, train_y = train[train.columns.difference(["cnt"])], train[["cnt"]]
test_X, test_y = test[test.columns.difference(["cnt"])], test[["cnt"]]
```

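```python
from sklearn.linear_model import LinearRegression

# Fit an ordinary least-squares model on the training split.
model = LinearRegression()
model.fit(train_X, train_y)
```

Parameters of the fitted `LinearRegression` estimator:

| Parameter | Value |
|-----------|-------|
| `fit_intercept` | `True` |
| `copy_X` | `True` |
| `tol` | `1e-06` |
| `n_jobs` | `None` |
| `positive` | `False` |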

```python
from sklearn.metrics import r2_score

# Predict on the held-out split and score with the coefficient of determination.
predictions = model.predict(test_X)
r2_score(test_y, predictions)
```

0.8276670090367205
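To make the reported score easier to interpret, here is a minimal sketch of the standard definition of the coefficient of determination, R² = 1 − SS_res / SS_tot. The helper name `r2_manual` and the toy numbers are hypothetical. The sketch covers only the single-target case and is not a substitute for `r2_score`.

```python
import numpy as np


def r2_manual(y_true, y_pred):
    """Coefficient of determination for a single target: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Hypothetical toy targets, used only to illustrate the extremes.
y_true = np.array([3.0, 5.0, 7.0, 9.0])

perfect = r2_manual(y_true, y_true)                                # exact predictions
baseline = r2_manual(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean
close = r2_manual(y_true, [2.0, 5.0, 7.0, 10.0])                   # small errors

print(perfect, baseline, close)
```

A score near 1 means the model explains most of the variation in the target, while a score of 0 means it does no better than always predicting the mean.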

# Pre-processing



## Categorical data

### Numeric Mapping

### One-hot encoding

## Numerical data

### Mean centering

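```python
# `%matplotlib widget` needs the optional `ipympl` package; without it the magic
# fails with "RuntimeError: 'widget' is not a recognised GUI loop or backend name".
# The inline backend works out of the box and is redrawn on every widget change.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

@interact
def circle(radius=1.0, linewidth=1, color=['red', 'blue', 'green']):
    # Sample 100 angles around the circle and draw it with the selected style.
    angles = np.linspace(0, 2 * np.pi, 100)
    fig, ax = plt.subplots()
    ax.set_aspect(1)
    ax.set_xlim(-10, 10)
    ax.set_ylim(-10, 10)
    ax.plot(radius * np.cos(angles), radius * np.sin(angles), linewidth=linewidth, color=color)
    plt.show()
```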

### Standardization

### Whitening

### Min-max scaling

### Unit normalization