## MADMO Course

### Seminar 4: extra materials
### Linear Regression for Tabular Data

Let's use the "California Housing" dataset from [sklearn.datasets](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset).

```
This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).
```

For further work, let's do the following:
* Conduct reasonable data preprocessing.
* Divide the data into training, validation and test samples.
* Perform a hyperparameter search and select the best approach among the following:
  - LinearRegression
  - Lasso (L1-regularized linear regression)
  - Ridge (L2-regularized linear regression)

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_california_housing

In [None]:
dataset_dict = fetch_california_housing()
feature_names = dataset_dict["feature_names"]

In [None]:
data_df = pd.DataFrame(dataset_dict["data"], columns=feature_names)

In [None]:
data_df.head()

In [None]:
data_df["target"] = dataset_dict["target"]

### Simple EDA

Check the information about each feature and see whether there are any missing values.

In [None]:
print(f"Total # objects: {len(data_df)}")

In [None]:
print("Statistics for each feature")
data_df.describe()

In [None]:
data_df.info()

Okay, there are no missing values.

Let's build some plots and look at the correlations between features.
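A minimal sketch of an explicit missing-value check. It uses a small hypothetical frame (`toy_df` is an assumption for illustration, not the seminar data); the same two calls could be applied to `data_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing entry, for illustration only
toy_df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# True if the frame contains at least one missing value
has_missing = bool(toy_df.isna().any().any())
print(has_missing)
```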

In [None]:
# plot target distribution
data_df.hist("target", bins=30);

In [None]:
# plot distributions of the other features to estimate their ranges
def visualize_df_columns(used_df, used_col_names):
    # two plots per row; round up so an odd number of columns still fits
    n_rows = (len(used_col_names) + 1) // 2
    fig, ax = plt.subplots(n_rows, 2, figsize=(20, 20), squeeze=False)
    for i, name_feat in enumerate(used_col_names):
        used_df[name_feat].hist(bins=30, ax=ax[i // 2, i % 2])
        ax[i // 2, i % 2].set_title(name_feat)
    plt.show()

In [None]:
visualize_df_columns(data_df, feature_names)

In [None]:
# finally, plot the correlation matrix returned by data_df.corr()
sns.heatmap(data_df.corr(), annot=True, cmap="RdYlGn", linewidths=0.2)
fig = plt.gcf()
fig.set_size_inches(10, 8)
plt.show()

Okay, that was a really brief EDA; we aren't going to focus on it today.

Next steps are:

1. Split the dataset into train / test parts (validation is done later by cross-validation on the train part)
2. Fit some models
3. Evaluate them

### Split Data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(
    data_df.drop(columns="target").values, data_df["target"].values, test_size=0.2, random_state=42
)
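If a separate validation sample is needed (as mentioned in the plan at the top), one common way is to call `train_test_split` twice. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo` are assumptions, not the housing data) and a 60/20/20 split chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 objects, 8 features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 8))
y_demo = rng.normal(size=1000)

# First hold out 20% of all data as the test set
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Then split the remainder: 0.25 * 0.8 = 0.2 of all data goes to validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(X_tr.shape, X_val.shape, X_te.shape)
```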

### Fit Models

Now, we are going to try out 3 models:

`sklearn.linear_model.LinearRegression` - ordinary least squares. It minimizes the sum of squared residuals, which has a closed-form solution (when $X^TX$ is invertible):
$$
\widehat{w} = \arg\min_w \|y - Xw\|_2^2
$$
$$
\widehat{w} = (X^TX)^{-1}X^Ty
$$
`sklearn.linear_model.Ridge` - also a linear model, but with L2 regularization. It also has a closed-form solution: $\widehat{w} = (X^TX + \alpha I)^{-1}X^Ty$

`sklearn.linear_model.Lasso` - a linear model with L1 regularization. Unfortunately, it has no closed-form solution in general, so the model finds it iteratively (scikit-learn uses coordinate descent)
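The analytical solutions for the first two models can be checked numerically. This is a minimal sketch on synthetic data (`X_demo`, `y_demo`, `w_true` are assumptions), with the intercept switched off so that the formulas apply to `X` directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (assumption): 200 objects, 3 features, known weights
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=200)

# OLS: solve (X^T X) w = X^T y instead of inverting the matrix explicitly
w_ols = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Ridge: solve (X^T X + alpha * I) w = X^T y
alpha = 0.5
w_ridge = np.linalg.solve(
    X_demo.T @ X_demo + alpha * np.eye(3), X_demo.T @ y_demo
)

# Compare with sklearn (no intercept, to match the formulas above)
sk_ols = LinearRegression(fit_intercept=False).fit(X_demo, y_demo)
sk_ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo)

print(np.allclose(w_ols, sk_ols.coef_))
print(np.allclose(w_ridge, sk_ridge.coef_))
```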

In [None]:
from sklearn.linear_model import Lasso, LinearRegression, Ridge

In [None]:
lr = LinearRegression()
lr.fit(X_train, Y_train)

alpha = 0.5
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, Y_train)

lasso = Lasso(alpha=alpha)
lasso.fit(X_train, Y_train);
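One practical difference between the two penalties is that L1 tends to set some coefficients exactly to zero, while L2 only shrinks them. A small sketch on synthetic data (an assumption: only the first two of ten features influence the target) that counts the zero coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (assumption): 10 features, only the first two matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=500)

lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge_demo = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
print("Lasso zero coefficients:", n_zero_lasso)
print("Ridge zero coefficients:", n_zero_ridge)
```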

### Evaluate Models

Let's see some metrics for evaluation:

[Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error)

[Mean Absolute Error](https://en.wikipedia.org/wiki/Mean_absolute_error)

[R2 score](https://en.wikipedia.org/wiki/Coefficient_of_determination)
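To make the three definitions concrete, here is a short sketch that computes them by hand with NumPy on a tiny made-up example (`y_true` and `y_pred` are illustrative values) and compares the results with `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 1.0, 2.0, 4.0])
y_pred = np.array([2.5, 1.0, 2.0, 5.0])

# MSE: mean of squared errors
mse = np.mean((y_true - y_pred) ** 2)
# MAE: mean of absolute errors
mae = np.mean(np.abs(y_true - y_pred))
# R2: 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mae, r2)  # 0.3125 0.375 0.75
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```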

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
lr_predict = lr.predict(X_test)
mae_lr = mean_absolute_error(Y_test, lr_predict)
mse_lr = mean_squared_error(Y_test, lr_predict)
r2_lr = r2_score(Y_test, lr_predict)

ridge_predict = ridge.predict(X_test)
mae_ridge = mean_absolute_error(Y_test, ridge_predict)
mse_ridge = mean_squared_error(Y_test, ridge_predict)
r2_ridge = r2_score(Y_test, ridge_predict)

lasso_predict = lasso.predict(X_test)
mae_lasso = mean_absolute_error(Y_test, lasso_predict)
mse_lasso = mean_squared_error(Y_test, lasso_predict)
r2_lasso = r2_score(Y_test, lasso_predict)

print("Linear regression:")
print(f"MSE: {mse_lr}")
print(f"MAE: {mae_lr}")
print(f"R2: {r2_lr}")
print("-" * 10)
print("Ridge regression:")
print(f"MSE: {mse_ridge}")
print(f"MAE: {mae_ridge}")
print(f"R2: {r2_ridge}")
print("-" * 10)
print("Lasso regression:")
print(f"MSE: {mse_lasso}")
print(f"MAE: {mae_lasso}")
print(f"R2: {r2_lasso}")
print("-" * 10)

### Extra tip: preprocessing

Sometimes it is very useful to preprocess your data: center it or rescale it to some range.

`sklearn` has a special module for this.

In [None]:
from sklearn import preprocessing

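In our case it makes almost no difference: the predictions of plain linear regression do not depend on the feature scale (up to numerical precision), and our data is quite okay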

In [None]:
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

model = LinearRegression()
model.fit(X_train_scaled, Y_train)

print("Unscaled")
print("Train MSE:", mean_squared_error(Y_train, model.predict(X_train_scaled)))
print("Test MSE:", mean_squared_error(Y_test, model.predict(X_test_scaled)))

scaler = preprocessing.StandardScaler()
scaler.fit(X_train_scaled)

model = LinearRegression()
model.fit(scaler.transform(X_train_scaled), Y_train)
print("Scaled (Standard scaler)")
print("Train MSE:", mean_squared_error(Y_train, model.predict(scaler.transform(X_train_scaled))))
print("Test MSE:", mean_squared_error(Y_test, model.predict(scaler.transform(X_test_scaled))))

scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train_scaled)


model = LinearRegression()
model.fit(scaler.transform(X_train_scaled), Y_train)
print("Scaled (MinMax scaler)")
print("Train MSE:", mean_squared_error(Y_train, model.predict(scaler.transform(X_train_scaled))))
print("Test MSE:", mean_squared_error(Y_test, model.predict(scaler.transform(X_test_scaled))))

But if we stretch some features by many orders of magnitude and add some noise, scaling becomes very handy

In [None]:
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# stretching + noising
for _ in range(9):
    X_train_scaled[:, 3] *= 150
    X_train_scaled[:, 2] *= 200
    X_test_scaled[:, 3] *= 150
    X_test_scaled[:, 2] *= 200

    X_train_scaled[:, 3] += np.random.normal(size=X_train_scaled.shape[0])
    X_train_scaled[:, 2] += np.random.normal(size=X_train_scaled.shape[0])
    X_test_scaled[:, 3] += np.random.normal(size=X_test_scaled.shape[0])
    X_test_scaled[:, 2] += np.random.normal(size=X_test_scaled.shape[0])


model = LinearRegression()
model.fit(X_train_scaled, Y_train)

print("Unscaled")
print("Train MSE:", mean_squared_error(Y_train, model.predict(X_train_scaled)))
print("Test MSE:", mean_squared_error(Y_test, model.predict(X_test_scaled)))

scaler = preprocessing.StandardScaler()
scaler.fit(X_train_scaled)


model = LinearRegression()
model.fit(scaler.transform(X_train_scaled), Y_train)
print("Scaled (Standard scaler)")
print("Train MSE:", mean_squared_error(Y_train, model.predict(scaler.transform(X_train_scaled))))
print("Test MSE:", mean_squared_error(Y_test, model.predict(scaler.transform(X_test_scaled))))

scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train_scaled)


model = LinearRegression()
model.fit(scaler.transform(X_train_scaled), Y_train)
print("Scaled (MinMax scaler)")
print("Train MSE:", mean_squared_error(Y_train, model.predict(scaler.transform(X_train_scaled))))
print("Test MSE:", mean_squared_error(Y_test, model.predict(scaler.transform(X_test_scaled))))

**Note:** the scaler must be fitted only on the `train` part. Never use `scaler.fit(test)`! Use only `scaler.transform(test)` if you don't want any data leakage in your modeling pipeline
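A small sketch of what the note means in practice, on synthetic arrays (`X_tr`, `X_te` are assumptions): the scaler statistics come from the train part only, so the transformed train part is centered exactly, while the transformed test part is only approximately centered.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train / test parts (assumption), drawn from the same distribution
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(100, 2))
X_te = rng.normal(loc=5.0, scale=2.0, size=(30, 2))

# Fit on train only, then reuse the same statistics for both parts
scaler_demo = StandardScaler().fit(X_tr)
X_tr_std = scaler_demo.transform(X_tr)
X_te_std = scaler_demo.transform(X_te)

# The stored mean is the train mean
print(np.allclose(scaler_demo.mean_, X_tr.mean(axis=0)))
# Train part is centered; the test part generally is not exactly centered
print(np.allclose(X_tr_std.mean(axis=0), 0.0))
print(X_te_std.mean(axis=0))
```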

### Extra tip: hyperparameters search

`sklearn.model_selection` has a special class called `GridSearchCV` for finding the best hyperparameters for your model:
1. It iterates over all possible combinations of the hyperparameters you have given it
2. It runs cross-validation for each combination
3. Based on the cross-validation metrics, it chooses the best parameters
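The three steps above can be sketched by hand with `cross_val_score` and compared with `GridSearchCV`. This is only an illustration on synthetic data (`X_demo`, `y_demo` and the `alphas` grid are assumptions), scored with negative MSE so that higher is better.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -1.0, 2.0, 0.5]) + rng.normal(size=150)

alphas = [0.01, 1.0, 100.0]

# Steps 1-2: for every candidate value, run 3-fold cross-validation
mean_scores = [
    cross_val_score(
        Ridge(alpha=a), X_demo, y_demo, cv=3, scoring="neg_mean_squared_error"
    ).mean()
    for a in alphas
]
# Step 3: pick the candidate with the best (highest) mean score
best_alpha_manual = alphas[int(np.argmax(mean_scores))]

# The same search done by GridSearchCV
gscv_demo = GridSearchCV(
    Ridge(), {"alpha": alphas}, cv=3, scoring="neg_mean_squared_error"
)
gscv_demo.fit(X_demo, y_demo)

print(best_alpha_manual, gscv_demo.best_params_["alpha"])
```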

Let's see how it is done

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
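def train_and_plot(model, parameter_dict):
    """Take a model and a parameter dict, run a grid search and
    plot negative MSE (higher is better) vs. parameter value."""
    # the default scoring for regressors is R^2, so request negative MSE explicitly
    gscv = GridSearchCV(
        model, parameter_dict, cv=3, scoring="neg_mean_squared_error", verbose=1
    )
    gscv.fit(X_train, Y_train)
    plt.errorbar(
        gscv.param_grid["alpha"],
        gscv.cv_results_["mean_test_score"],
        gscv.cv_results_["std_test_score"],
        capsize=5,
        label=type(model).__name__,
    )
    plt.xscale("log")
    plt.xlabel("alpha")
    plt.ylabel("negative MSE")
    plt.grid()
    plt.legend()
    # best_estimator_ is refitted on the whole train part with the best parameters
    return gscv.best_estimator_

Let's try to find the best regularization parameter for `Ridge`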

Let's try to find best regularization parameter for `Ridge`

In [None]:
model = Ridge()
params = {"alpha": np.linspace(0.0001, 20.0)}

best_ridge = train_and_plot(model, params)

*Extra:* see `sklearn.model_selection.RandomizedSearchCV` and try it out

*Extra2:* find best params for `sklearn.linear_model.SGDRegressor`

*Extra3:* Make a `GridSearchCV` pipeline with scaled features (hint: you can't transform your data before passing it to `gscv.fit` because that would result in data leakage; you should use a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html))

### Extra tip: predicting log

Sometimes the logarithm of the target is easier to predict than the target itself. Let's see:

In [None]:
# plot target distribution
plt.title("target")
data_df["target"].hist(bins=30);

In [None]:
# plot log target distribution
plt.title("target (log)")
data_df["target"].apply(np.log).hist(bins=30);

In [None]:
lr = LinearRegression()
lr.fit(X_train, Y_train)

lr_predict = lr.predict(X_test)
mae_lr = mean_absolute_error(Y_test, lr_predict)
mse_lr = mean_squared_error(Y_test, lr_predict)
r2_lr = r2_score(Y_test, lr_predict)

print("Linear regression (raw target):")
print(f"MSE: {mse_lr}")
print(f"MAE: {mae_lr}")
print(f"R2: {r2_lr}")
print("-" * 10)

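lr = LinearRegression()
lr.fit(X_train, np.log(Y_train))

# note: these metrics are computed in log space, so MSE and MAE
# are not directly comparable with the raw-target values above
lr_predict = lr.predict(X_test)
mae_lr = mean_absolute_error(np.log(Y_test), lr_predict)
mse_lr = mean_squared_error(np.log(Y_test), lr_predict)
r2_lr = r2_score(np.log(Y_test), lr_predict)

print("Linear regression (log target, metrics in log space):")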
print(f"MSE: {mse_lr}")
print(f"MAE: {mae_lr}")
print(f"R2: {r2_lr}")
print("-" * 10)
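To compare the two approaches on the same scale, the log-space predictions can be mapped back with `exp` before computing the metrics. One way to do this is `sklearn.compose.TransformedTargetRegressor`, which applies `func` to the target before fitting and `inverse_func` to the predictions. The sketch below runs on synthetic data (an assumption: the log of the target is linear in the features by construction, so the log model is expected to do better here; this says nothing about the housing data).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic positive target whose logarithm is linear in the features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = np.exp(X_demo[:, 0] - 0.5 * X_demo[:, 1] + 0.1 * rng.normal(size=300))

X_tr, X_te = X_demo[:200], X_demo[200:]
y_tr, y_te = y_demo[:200], y_demo[200:]

# Model fitted on the raw target
raw_model = LinearRegression().fit(X_tr, y_tr)

# Model fitted on log(target); predictions are mapped back with exp
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(X_tr, y_tr)

# Both errors are measured on the original target scale
mse_raw = mean_squared_error(y_te, raw_model.predict(X_te))
mse_log = mean_squared_error(y_te, log_model.predict(X_te))
print("MSE (raw target):", mse_raw)
print("MSE (log target, back-transformed):", mse_log)
```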

----