# CAS DML Course Project 1: Exploratory Data Analysis and Regression

In this notebook, we will explore regression using the diamond dataset. By the end, you should get a comprehensive understanding of how to implement and interpret regression models. 

The tasks are marked as **EXERCISE**. Some exercises will ask you to write text (e.g., to interpret a result), while others will ask you to code something.
Coding exercises usually come with a cell like the one below. A line like `# scaler_demo = ...` asks you to define the `scaler_demo` variable:

**DEMO EXERCISE**

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
# scaler_demo = ...

**DEMO SOLUTION**

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler_demo = StandardScaler()

To solve this exercise, we complete the statement by replacing `...` with `StandardScaler()` and uncomment the line. The 4th line of the cell above then reads

`scaler_demo = StandardScaler()`

**END OF DEMO SOLUTION**

Please don't change the variable names, as we often use these in later steps, and the notebook might not run any more if you change the variable name.

The tasks are based on notions and code that you have seen in the lectures and/or tutorials. For any questions, please use the office hours.

## Preparations
Here we import the very commonly used Python libraries. For clarity, we will import more classes and functions as we use them.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Dataset Overview

This notebook will provide an initial exploration of different regression models based on the `diamond` dataset available e.g. from Kaggle: https://www.kaggle.com/datasets/shivam2503/diamonds. Below is a brief description of the dataset columns:

- `price`: Diamond price in US dollars (values in the range of 326 to 18'823)
- `carat`: The weight of the diamond (in the range 0.2 to 5.01)
- `cut`: Quality of the cut (`Fair`, `Good`, `Very Good`, `Premium`, `Ideal`)
- `color`: Diamond color, ranging from `J` (worst) to `D` (best)
- `clarity`: Measurement of how clear the diamond is (`I1`, `SI2`, `SI1`, `VS2`, `VS1`, `VVS2`, `VVS1`, `IF`)
- `depth`: Total diamond depth percentage
- `x`, `y`, `z`: Diamond length, width, depth in mm (in the range 0 to 10.74)
- `table`: Width of the top of the diamond relative to the widest point (in the range 43 to 95)

We use `read_csv` from the `pandas` library to load the data:

In [None]:
# Load the dataset
data = pd.read_csv('diamonds.csv', index_col=0)
data.reset_index(inplace=True, drop=True)

We use a few pandas functions to get a first look at the data.

**EXERCISE**: Write code to get the first few rows for this dataset (1 line). Hint: use the function `head`.

In [None]:
# ...

**EXERCISE**: Write code to get a statistical summary of this dataset (1 line).

In [None]:
# ...

**EXERCISE**: Write code to get a heatmap for this dataset.
Hint: Proceed according to these steps:

1. Use function `corr` to get the correlation for the dataset. Use the argument `numeric_only=True` to limit the correlation calculation to numeric attributes (you will get an error otherwise)
2. Then use function `heatmap` from the `seaborn` library. Use the argument `annot` to annotate the correlation values on the heatmap.
3. Finally use `matplotlib.pyplot` library to show it.

In [None]:
# Correlation convention here is the Pearson correlation coefficient (-1 <= r <= 1)

# ...
plt.show()

**EXERCISE**: Based on the heatmap above, answer the following two questions:

1. Is there a linear relationship between variables `table` and `depth`? Give your explanation.
2. What other useful information can you find in this heatmap? *(Hint: which variables are relevant to `price`?)*

**SOLUTION**

1. ...

2. ...

**EXERCISE**: Write code to visualise the relationship between the variables, and get the distribution of each variable.

Hint: use function `pairplot` from the `seaborn` library by setting `diag_kind='kde'`.


In [None]:
# ...

**EXERCISE**:

1. Is there a linear relationship between `carat` and `price`?
2. If you answered *yes*, is that relationship positive or negative?

**SOLUTION**

1. ...

2. ...

## Preparing the Data Sets

Before we evaluate and compare different regression models, we prepare the dataset. Following common practice, we will denote by `X` the features (or predictors, or independent variables) and by `y` the target value throughout this notebook. Furthermore, we will add additional information to the variable names, e.g., `X_train` should denote the features of the training data.

First, we split the overall data into `X` and `y`. For `X`, we only use numeric data, i.e. the attributes `'carat', 'x', 'y', 'z', 'table', 'depth'`:

In [None]:
X_all = data[['carat', 'x', 'y', 'z', 'table', 'depth']]
y_all = data[['price']]

Next, we set aside 10% of the data for the **final evaluation**. Remember, to avoid any overfitting, we must not use that data for training and validation! We will denote this part of the data as `X_final_test` and `y_final_test`, respectively. The remainder of the data, stored in the variables `X_train_val` and `y_train_val`, will be split further into a training and a validation set to train and compare different models. Note that some of the functions, e.g., `train_test_split`, do not use the notion of a validation set (at least not in their names), but they can be used analogously. It's important to always remember which dataset was or will be used for which task - this is also why we chose the respective names for `X_final_test` and `y_final_test`.

For reproducibility, we add a `random_state` for the shuffling of the samples before the splitting.

In [None]:
from sklearn.model_selection import train_test_split

X_train_val, X_final_test, y_train_val, y_final_test = train_test_split(X_all, y_all, test_size=0.1, random_state=1)

In the following, we will only work with `X_train_val` and `y_train_val` to train and validate models.
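
The two-stage splitting described above can be sketched on toy data. This is only an illustration: the arrays `X_toy` and `y_toy` are made up, and the 20% validation share is an assumption borrowed from the multiple-regression exercise further below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples with 2 features; the target is the row id
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Stage 1: hold out 10% for the final evaluation
X_tv_toy, X_final_toy, y_tv_toy, y_final_toy = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=1)

# Stage 2: split the remainder into training and validation data (assumed 80/20)
X_tr_toy, X_va_toy, y_tr_toy, y_va_toy = train_test_split(
    X_tv_toy, y_tv_toy, test_size=0.2, random_state=1)

split_sizes = (len(y_tr_toy), len(y_va_toy), len(y_final_toy))
print(split_sizes)

# Every sample ends up in exactly one of the three subsets
all_ids = set(y_tr_toy) | set(y_va_toy) | set(y_final_toy)
no_overlap = len(all_ids) == len(y_toy)
print(no_overlap)
```

Because the row id serves as the target, the last check confirms that the three subsets are disjoint and together cover all samples.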

## Linear regression
As a first example of a model, we will use simple linear regression.

Remember: Simple linear regression assumes a linear relationship between 

* one predictor/independent variable `x`
* one response/dependent variable/prediction target `y`

The relationship is modeled by a linear function: $\hat y = w \cdot x + b$, where

* `w` refers to the **slope** of the regression line, in other words: the value with which `y` changes when `x` increases by 1 unit.
* `b` refers to the **intercept** of the regression line, in other words: the value of `y` when `x=0`.
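
To make the roles of `w` and `b` concrete, here is a minimal sketch that computes the least-squares slope and intercept by hand with NumPy. The data points are made up and lie exactly on the line $y = 3x + 2$, so the expected result is known in advance; the variable names are illustrative only.

```python
import numpy as np

# Toy data (hypothetical): points lying exactly on the line y = 3x + 2
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 3.0 * x_toy + 2.0

# Least-squares estimates for simple linear regression:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
x_centered = x_toy - x_toy.mean()
y_centered = y_toy - y_toy.mean()
w_toy = np.sum(x_centered * y_centered) / np.sum(x_centered ** 2)
b_toy = y_toy.mean() - w_toy * x_toy.mean()

# Predictions from the fitted line
y_hat_toy = w_toy * x_toy + b_toy
print(w_toy, b_toy)
```

Increasing `x_toy` by one unit changes the prediction by `w_toy`, and the prediction at `x_toy = 0` equals `b_toy`.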

For this task, we'll predict the price of a diamond based on one of its features.

Examining the correlation matrix using a heatmap, we can see that the `carat` weight has the strongest correlation with `price`.

Therefore, we use `carat` as `x` and `price` as `y`, i.e. we use carat as the predictor in our linear regression model to estimate diamond prices.

Let's import and prepare a linear regression model:

In [None]:
from sklearn.linear_model import LinearRegression

simple_linear_model = LinearRegression()

### Data Preparation
For the simple linear regression, we choose `carat` as the only feature:

In [None]:
X_train_val_single = X_train_val[['carat']]

**EXERCISE**: Use the function `train_test_split` to split the data into training and validation datasets to train and evaluate the model.

In [None]:
from sklearn.model_selection import train_test_split

# X_train, X_val, y_train, y_val = ...

### Transformation
Using a scaler like `StandardScaler`, we transform our features to have zero mean and unit variance. This will be important when doing regularization, and to improve convergence.

To avoid data leakage from the validation (or test) data into the training, we do the scaling after the splitting. We fit the scaler on the training data only and then apply the same transformation to training, validation (and test) data.
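
The idea of fitting on the training data only can be sketched by hand with NumPy. The data below is randomly generated for illustration and the names are hypothetical; the exercise itself should still be solved with `StandardScaler`, which standardizes with the population standard deviation, the same convention as NumPy's default `std`.

```python
import numpy as np

# Randomly generated toy feature (hypothetical), split into two parts
rng = np.random.default_rng(0)
X_tr_toy = rng.normal(loc=50.0, scale=5.0, size=(80, 1))
X_va_toy = rng.normal(loc=50.0, scale=5.0, size=(20, 1))

# "Fit": learn mean and standard deviation from the training part only
mu_toy = X_tr_toy.mean(axis=0)
sigma_toy = X_tr_toy.std(axis=0)

# "Transform": apply the same parameters to both parts
X_tr_toy_scaled = (X_tr_toy - mu_toy) / sigma_toy
X_va_toy_scaled = (X_va_toy - mu_toy) / sigma_toy

print(X_tr_toy_scaled.mean(), X_tr_toy_scaled.std())
print(X_va_toy_scaled.mean(), X_va_toy_scaled.std())
```

The scaled training part has zero mean and unit standard deviation by construction. The validation part is only approximately standardized, which is expected: its own statistics were never used.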

**EXERCISE**: Complete the following code to scale both the training and the validation data.

Hints:
1. Use the `StandardScaler` class from the `sklearn.preprocessing` module to standardize features.
2. Initialize the scaler and fit it on the training data using `fit_transform()` to learn the scaling parameters.
3. Apply the learned scaling to the validation data using `transform()` to ensure both datasets are scaled in the same way.

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
# scaler = ...

# fit the scaler

# Transform both training and validation data
# X_train_scaled = ...
# X_val_scaled = ...

### Model Training

Now we are ready to train our first model:

**EXERCISE**: Write code to train the linear regression model using the training data.

In [None]:
# ...

**EXERCISE:** According to the model, 

1. What is the value of the intercept (b) in this linear regression?
2. What is the value of the slope (w) in this linear regression?
3. How much does the price of a diamond increase with each additional carat?

Write code to answer these questions.

In [None]:
# ...

### Model Evaluation
`scikit-learn` offers predefined functions for measuring regression quality - see slide *Quality Metric in Python / scikit-learn*.
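
As a minimal sketch of what these metrics measure, the block below writes them out in NumPy on four made-up value pairs (`y_true_toy` and `y_pred_toy` are hypothetical and not taken from the diamond data). It illustrates the definitions and is not meant to replace the `scikit-learn` functions.

```python
import numpy as np

# Made-up values for illustration only
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 12.0])

errors = y_true_toy - y_pred_toy

mae = np.mean(np.abs(errors))          # mean absolute error
mse = np.mean(errors ** 2)             # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error
median_ae = np.median(np.abs(errors))  # median absolute error

# R-squared: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, mse, rmse, median_ae, r2)
```

Note that MAE, RMSE and the median absolute error are expressed in the units of the target, whereas MSE is in squared units.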

**EXERCISE**: Write code to obtain predictions on the scaled `X_val` and evaluate the performance of the regression model by computing `root_mean_squared_error`, `mean_absolute_error`, `median_absolute_error` and `r2_score`.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error, root_mean_squared_error, r2_score

# Make predictions using the model
# y_pred = ...

# Calculate R-squared
# r_squared = ...
r_squared = 0

# Calculate Mean Absolute Error
# mean_abs = ...
mean_abs = 0

# Calculate Mean Squared Error
# mean_squ = ...
mean_squ = 0

# Calculate Median Absolute Error
# median_error = ...
median_error = 0

# Print the results
print("R-squared:", r_squared)
print("Mean Absolute Error (MAE):", mean_abs)
print(f"Mean Squared Error (MSE): {mean_squ}")
print(f"Median Absolute Error: {median_error}")

With these formulas, we have quantified/measured the regression model's performance.

But, to actually evaluate our model, we need to answer the question:  how can we interpret how good our model actually is?

**EXERCISE**:
What does the R-squared value indicate about the model's performance?  

**SOLUTION**

...

## Multiple Linear Regression

What if we want to predict diamond price using more than one variable?

If we want to use more variables in our model to predict diamond price, we can use **Multiple Linear Regression**.
Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and **two or more** predictor (independent) variables.
Most real-world regression models involve multiple predictors. We illustrate the structure using four predictor variables, but it generalizes to any number of predictors:

* $y$: Response Variable
* $x_1$: Predictor Variable 1
* $x_2$: Predictor Variable 2
* $x_3$: Predictor Variable 3
* $x_4$: Predictor Variable 4
* ...

Such a model will have the following parameters:

* $b$: intercept
* $w_1$: coefficient of Variable 1
* $w_2$: coefficient of Variable 2
* $w_3$: coefficient of Variable 3
* $w_4$: coefficient of Variable 4
* ...

The equation is given by:

$\hat y = b + w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + w_4 \cdot x_4 + ...$
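
In code, this equation becomes a matrix-vector product. The following sketch uses made-up parameter values and two made-up samples (all names and numbers are hypothetical) to show how the prediction is assembled:

```python
import numpy as np

# Hypothetical parameters for four predictors
b_toy = 10.0
w_toy = np.array([1.0, 0.5, -1.0, 2.0])

# Two made-up samples: one row per sample, one column per predictor
X_toy = np.array([[1.0, 2.0, 3.0, 4.0],
                  [0.0, 1.0, 0.0, 1.0]])

# y_hat = b + w_1 * x_1 + w_2 * x_2 + w_3 * x_3 + w_4 * x_4, for every row at once
y_hat_toy = b_toy + X_toy @ w_toy
print(y_hat_toy)
```

For the first sample this gives $10 + 1 + 1 - 3 + 8 = 17$, and for the second $10 + 0 + 0.5 + 0 + 2 = 12.5$.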

**EXERCISE**: Write code to define, train (fit) and evaluate a multiple linear model that is using all available features to predict the price of a diamond. Proceed following the same structure of the simple linear model:

1. Load the data -> There's nothing to do here, as we already have prepared `X_train_val` and `y_train_val`.
2. Split into train and validation sets. Allocate 20% of the data for validation, and use `random_state=1` to ensure reproducibility of the split.
3. Standardize the datasets: Use the `StandardScaler` class from the `sklearn.preprocessing` module to standardize features.
4. Initialize and train/fit the linear regression model
5. Evaluate the model on validation data

In [None]:
# 2. Split into train and validation sets
from sklearn.model_selection import train_test_split

# X_train, X_val, y_train, y_val = ...

In [None]:
# 3. Standardize the datasets
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
# ... 

# Fit the scaler on the training data 
# ... 

# Transform both training and validation data
# X_train_scaled = ...
# X_val_scaled  = ...

In [None]:
# 4. Initialize and train the linear regression model

# Initialise an instance of the model
# multi_linear_model = ...

# Fit the model
# ... 

In [None]:
# 5. Evaluate the model on validation data

# Use the model to predict on validation data
# ... 

# Calculate R-squared
# r_squared = ...

# Calculate Mean Absolute Error
# mean_abs = ...

# Calculate Mean Squared Error
# mean_squ = ...

# Calculate Median Absolute Error
# median_error = ...

# Print the results
print("R-squared:", r_squared)
print("Mean Absolute Error (MAE):", mean_abs)
print(f"Mean Squared Error (MSE): {mean_squ}")
print(f"Median Absolute Error: {median_error}")

## Preparation for Model Comparison

As we are going to compare several models with each other, we group the steps followed above into functions so that we don't have to copy and paste a lot of code.

In [None]:
def apply_eval_model(model, X, y_true, model_name, do_print=True):
    """
    Function to evaluate a given model on a feature data frame and compute several performance metrics.
    The results are returned as a data frame and optionally printed out.

    Arguments:
    model - the model to be evaluated
    X - a data frame containing the predictors
    y_true - the true target values, used for the performance assessment
    model_name - name of the model (will be added to the result data frame)
    do_print - optional argument to indicate whether the results should be printed 
    """
    y_pred = model.predict(X)
    if do_print:
        print('r2-Score: ' + str(r2_score(y_true, y_pred)))
        print('MSE: ' + str(mean_squared_error(y_true, y_pred)))
        print('RMSE: ' + str(root_mean_squared_error(y_true, y_pred)))
    
    if len(model_name)>0:
        df = pd.DataFrame({'model_name': model_name,
                           'r2_score': r2_score(y_true, y_pred),
                           'MSE': mean_squared_error(y_true, y_pred),
                           'MAE': mean_absolute_error(y_true, y_pred),
                           'RMS': root_mean_squared_error(y_true, y_pred)},
                           index=[model_name])
        return df

In [None]:
def train_apply_eval_model(model, X_train, y_train, X_val, y_val, model_name='model', do_print=True):
    """
    Train a given model on a training data set, and evaluate it on both the training and validation data.

    Arguments:
    model - the model to be evaluated
    X_train - the training predictors
    y_train - the true labels of the training data set
    X_val - the predictors of the validation data set
    y_val - the true labels of the validation data set
    model_name - name of the model (will be added to the result data frame)
    do_print - optional argument to indicate whether the results should be printed 
    """
    
    n_val = X_val.shape[0]
    
    # Fit the model
    model.fit(X_train, y_train)

    if do_print:
        print('Evaluation on Training Data:')
    res_train = apply_eval_model(model, X_train, y_train, model_name, do_print=do_print)
    res_train['dataset'] = 'train'

    if n_val>0:
        if do_print:
            print('\nEvaluation on Validation Data:')
        res_val = apply_eval_model(model, X_val, y_val, model_name, do_print=do_print)
        res_val['dataset'] = 'validation'
        
        return pd.concat([res_train, res_val])
    else:
        return res_train

Let's double-check we get the same results when using the function as we got before:

In [None]:
model_name='all features'

model_all_features = LinearRegression()

all_features_results = train_apply_eval_model(model_all_features, X_train_scaled, y_train, X_val_scaled, y_val, model_name=model_name)

Note also that the scalers return their result in a different data structure than the one we have used so far: we prefer data frames, as they let us access the features by name. The scalers, however, return the scaled values as an array (you'll get to know these terms later in the Python course). We can easily convert the scaled values to a data frame:

In [None]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_val_scaled = pd.DataFrame(X_val_scaled, columns=X_val.columns)

### Trying different predictor variable combinations

Using the function defined above, we can now easily evaluate models based on different combinations of predictor variables. For example, let's start with a model that only uses `x`, `y`, and `z`. Remember that `X_train_scaled` contains all the attributes:

In [None]:
X_train_scaled.head()

We define the set of attributes we want to be used for the prediction as `predictor_features = ['x', 'y', 'z']`. And we give the model a name `model_name = 'only_x_y_z'`.

Then, we create a new linear regression object (which is not yet trained!) that we hand over to the function. The function `train_apply_eval_model` will then train the model on the training data and evaluate it on both the training and validation data:

In [None]:
predictor_features = ['x', 'y', 'z']
model_name = 'only_x_y_z'
model_only_x_y_z = LinearRegression()

only_x_y_z_results = train_apply_eval_model(model_only_x_y_z, X_train_scaled[predictor_features], y_train,
                                            X_val_scaled[predictor_features], y_val, model_name=model_name)

**EXERCISE:** Train and evaluate a linear model based only on the two attributes `depth` and `carat`:

In [None]:
# ...

# only_depth_carat_results = ...

**EXERCISE:** Train and evaluate a linear model based only on the four attributes `carat`, `x`, `depth` and `table`; i.e., all attributes *except* `y` and `z`:

In [None]:
# ...

# all_except_y_and_z_results = ...

Finally, we also define a model that only uses `carat` to predict the `price`:

In [None]:
predictor_features = ['carat']
model_name = 'only_carat'
model_only_carat = LinearRegression()

only_carat_results = train_apply_eval_model(model_only_carat, X_train_scaled[predictor_features], y_train,
                                             X_val_scaled[predictor_features], y_val, model_name=model_name)

Now, we combine all the results to a single data structure:

In [None]:
all_results = pd.concat([only_carat_results, only_depth_carat_results, only_x_y_z_results,
                         all_except_y_and_z_results, all_features_results])

Here is an overview of the performance of the different models:

In [None]:
all_results

We can now plot the performance (measured via r2-Score and RMS) of the different models, both on the training and the validation dataset:

In [None]:
# Plotting the results
metrics =['r2_score', 'RMS']

# Create a figure with subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

for i, metric in enumerate(metrics):
    sns.barplot(data=all_results, x='model_name', y=metric, hue='dataset', ax=axs[i])
    axs[i].tick_params(axis='x', rotation=45)
    axs[i].set_xlabel('Features')
    axs[i].legend(loc='lower center')
    axs[i].grid()

plt.tight_layout()
plt.show()

**EXERCISE:** 

1. Comment on these results. What could be a possible explanation? You might include findings from the correlation heatmap.
2. Compare the model where we only excluded `y` and `z` to the one where we kept all the features (*hint*: Given `x`, did we gain enough information from these variables?)
3. Which model would you choose to automatically predict the price of a new set of diamonds? Explain your choice.
4. How do you estimate how well the model you've chosen will work on new data?

**N.B.** You can add python code cells to illustrate your solution.

**SOLUTION:**

1. ...

2. ...

3. ...

4. ...

## Trying different Regularized Models
We will now use different regularized regression models. We start with `Ridge` Regression and `Lasso`, using the default value for the hyperparameter $\alpha$:

In [None]:
predictor_features = ['carat', 'x', 'y', 'z', 'table', 'depth']

In [None]:
from sklearn.linear_model import Ridge

model_name = 'ridge, default param, all features'

model = Ridge()

ridge_default_all_results = train_apply_eval_model(model, X_train_scaled, y_train, X_val_scaled, y_val, model_name=model_name)

In [None]:
from sklearn.linear_model import Lasso

model_name = 'lasso, default param, all features'

lasso_model = Lasso()

lasso_default_all_results = train_apply_eval_model(lasso_model, X_train_scaled, y_train, X_val_scaled, y_val, model_name=model_name)

We define a function to visualize the coefficient values. We keep it rather general, as we will also use it for other models later.

In [None]:
def visualize_coefficients(coefficient_values, feature_names, model_name='Lasso'):
    """
    Visualize the coefficients in decreasing order of their absolute values in a bar chart

    Arguments:
    coefficient_values - the values of the coefficients to be plotted. An array or series (i.e., a column of a data frame)
    feature_names - the name of the features for which the coefficients are given, in the same order as coefficient_values
    model_name - name of the model (used in the plot title)
    """

    # Get the indices of non-zero coefficient values
    non_zero_indices = np.where(coefficient_values != 0)[0]

    # Sort coefficient_values by absolute value
    sorted_indices = non_zero_indices[np.argsort(np.abs(coefficient_values[non_zero_indices]))[::-1]]

    # Create a bar plot
    plt.bar(range(len(sorted_indices)), coefficient_values[sorted_indices])

    # Set the tick labels to feature names
    plt.xticks(range(len(sorted_indices)), [feature_names[i] for i in sorted_indices], rotation=45, ha='right')

    plt.title(model_name + ': Non-Zero Coefficients')
    plt.xlabel('Features')
    plt.ylabel('Coefficient Value')
    plt.tight_layout()
    plt.grid()
    plt.show()

Now let's look at the feature parameters as estimated by the Lasso regression model:

In [None]:
visualize_coefficients(lasso_model.coef_, predictor_features)

As we can see from the plot, `carat` and `x` are very important according to the lasso model, while the variables `y` and `z` get only a small coefficient.
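
The shrinking effect of the L1 penalty can be reproduced on a small synthetic example. In the sketch below, which is an illustration and not part of the diamond analysis, the target depends only on the first of two random features. All names and the choice `alpha=0.5` are assumptions made for this demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (hypothetical): the target depends on feature 0 only,
# feature 1 is pure noise
rng = np.random.default_rng(0)
n_syn = 1000
X_syn = rng.normal(size=(n_syn, 2))
y_syn = 3.0 * X_syn[:, 0] + rng.normal(scale=0.1, size=n_syn)

# Unregularized fit versus Lasso with a fairly strong penalty
ols_demo = LinearRegression().fit(X_syn, y_syn)
lasso_demo = Lasso(alpha=0.5).fit(X_syn, y_syn)

print('Linear regression coefficients:', ols_demo.coef_)
print('Lasso coefficients:            ', lasso_demo.coef_)
```

The unregularized model assigns a small but non-zero coefficient to the noise feature, while Lasso sets it exactly to zero and also shrinks the coefficient of the informative feature towards zero.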

## Grid Search for Hyperparameters in Shrinkage Models

We now want to optimize the hyperparameter $\alpha$ for both Lasso and Ridge regression. We want to make a choice such that we get the best performance on a **new** data set. As this is a learning step, we need to avoid overfitting, and therefore need to make this selection on a separate data set. 

The class `GridSearchCV` from `scikit-learn` makes the process of finding the best hyperparameter simpler: it internally does a cross-validation and then selects the best hyperparameter value out of a provided list of values. The selection is based on a criterion that we have to provide. Hence, we can use `GridSearchCV` on the (already scaled) `X_train_scaled` dataframe of predictors, and the `fit` function will internally create multiple splits.

For both shrinkage models, we will use the R2-score as criterion.
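
Besides the single best value, a fitted `GridSearchCV` object also stores the cross-validated score of every candidate in its `cv_results_` attribute. The sketch below shows this on synthetic data; the data, the candidate list `alphas_demo` and the explicit `cv=5` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic data (hypothetical): two informative features and one noise feature
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = 2.0 * X_syn[:, 0] - 1.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=200)

# Candidate values for alpha and 5-fold cross-validation with R2 as criterion
alphas_demo = [0.001, 0.01, 0.1, 1, 10]
search_demo = GridSearchCV(Lasso(max_iter=10_000), param_grid={'alpha': alphas_demo},
                           scoring='r2', cv=5)
search_demo.fit(X_syn, y_syn)

# One row per candidate: mean and standard deviation of the score over the folds
cv_table = pd.DataFrame(search_demo.cv_results_)[
    ['param_alpha', 'mean_test_score', 'std_test_score']]
print(cv_table)
print('Best alpha:', search_demo.best_params_['alpha'])
```

Looking at the whole table rather than only the winner helps to judge whether the best value lies at the edge of the candidate range, in which case the range should be extended.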

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error
from sklearn.linear_model import Lasso, Ridge

### Grid Search for Lasso
We start with the Lasso model.

When starting a hyperparameter optimization, it is usually very hard to guess the right range of the hyperparameters. Therefore, we start with a rather wide range of potential values for $\alpha$, and then refine that range after a first indication of the order of magnitude.

In [None]:
alpha_range = [0.000_1, 0.001, 0.01, 0.1, 1, 10, 100, 1_000]
lasso_model = Lasso(max_iter=10000)

In [None]:
grid_search = GridSearchCV(lasso_model, param_grid = {'alpha': alpha_range}, scoring='r2')
grid_search.fit(X_train_scaled, y_train)

In [None]:
# Get the best model
best_model = grid_search.best_estimator_
print(f"Best alpha value: {grid_search.best_params_['alpha']}")

Now that we have found the order of magnitude for $\alpha$, let us optimize the parameter value in more detail in this range:

In [None]:
alpha_range = [0.3, 0.5, 0.8, 1, 2, 3, 5]

grid_search_lasso = GridSearchCV(lasso_model, param_grid = {'alpha': alpha_range}, scoring='r2')
grid_search_lasso.fit(X_train_scaled, y_train)

In [None]:
# Get the best model
print(f"Best alpha value: {grid_search_lasso.best_params_['alpha']}")

Now we evaluate the lasso regression model with the optimal values of the hyperparameter:

In [None]:
from sklearn.linear_model import Lasso

model_name = f"Lasso, alpha={grid_search_lasso.best_params_['alpha']}, all features"
lasso_model = Lasso(alpha=grid_search_lasso.best_params_['alpha'])

lasso_opt_all_results = train_apply_eval_model(lasso_model, X_train_scaled, y_train, X_val_scaled, y_val, model_name=model_name)

### Grid Search for Ridge Regression

**EXERCISE:** Fine-tune the hyperparameter for Ridge regression analogously. Run the ridge regression model with the obtained optimal parameter value, and store its performance.

In [None]:
# ...

In [None]:
# ...

# ridge_opt_all_results = train_apply_eval_model(ridge_model, X_train_scaled, y_train, X_val_scaled, y_val, model_name=model_name)

### Comparison of Models
Next, we compare the different models based on all features, i.e.
* the unconstrained linear regression
* the lasso model with default parameters
* the lasso model with optimized parameters
* the ridge regression with default parameters
* the ridge regression with optimized parameters

As we have done above with the linear regression models with different features, we will create a DataFrame with the results and then generate a plot.

In [None]:
all_features_results

In [None]:
all_results = pd.concat([all_features_results, lasso_default_all_results, ridge_default_all_results,
                         lasso_opt_all_results, ridge_opt_all_results])

In [None]:
all_results

In [None]:
# Plotting the results
metrics =['r2_score', 'RMS']

# Create a figure with subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 3))

for i, metric in enumerate(metrics):
    sns.barplot(data=all_results, y='model_name', x=metric, hue='dataset', ax=axs[i])
    axs[i].legend(loc='lower center')
    axs[i].grid()

plt.tight_layout()
plt.show()

**EXERCISE**: Comment on the results above. In particular, compare the results obtained by the (unconstrained) multiple linear regression, by Lasso and by Ridge, and for both Lasso and Ridge the results with the default and the optimized parameters.

Then, what do you conclude?

**SOLUTION**:

...

## An Experiment: Varying the Amount of Training Data
In the next experiment, we will vary the amount of training data. To do so, we will randomly select a varying amount of data samples from the training data set. To have the same reference, we will always work with the same validation data set. As above, we will save the results in a structure to later plot them:

In [None]:
from numpy.random import random_sample

np.random.seed(42)

In [None]:
n_train_all = X_train_scaled.shape[0]
training_sample_counts = [ 2, 5, 10, 100, 1000, 10000 ]

all_model_results = pd.DataFrame()

for n_train in training_sample_counts:
    # randomly sample the required number of indices
    print('Using ' + str(n_train) + ' samples for training')
    my_train_idx = np.random.choice(range(n_train_all), size=n_train, replace=False)

    # select the respective training data points from the training data set (X and y):
    my_X_train = X_train_scaled.iloc[my_train_idx]
    my_y_train = y_train.iloc[my_train_idx]

    # for model_type in model_list:
    # instantiate model
    model = LinearRegression()
    
    # train and evaluate model performance
    all_features_results = train_apply_eval_model(model, my_X_train, my_y_train, X_val_scaled, y_val, model_name='linear', do_print=False)
    all_features_results['n_train'] = n_train

    # add the result data frame to the overall data frame
    all_model_results = pd.concat([all_model_results, all_features_results])    

Now we can plot the model performance for the varying number of training data samples:

In [None]:
# Plotting the results
metrics =['r2_score', 'RMS']

# Create a figure with subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

for i, metric in enumerate(metrics):
    sns.barplot(data=all_model_results, x='n_train', y=metric, hue='dataset', ax=axs[i])
    axs[i].tick_params(axis='x', rotation=45)
    axs[i].set_xlabel('Training Set Size')
    axs[i].legend(loc='lower center')
    axs[i].grid()

fig.suptitle('Model Performance depending on Training Data Size\nUnregularized Linear Regression', fontsize="x-large")
plt.tight_layout()
plt.show()

**EXERCISE**: Interpret the plots above. What could explain the performance difference between the different training set sizes? Why do the performance values for training and validation set look so different, and also behave differently over the training set size?

*Hint - Advanced:* You might also want to re-run the last two cells (the two cells *after* the cell containing `np.random.seed(42)`) a few times. Can you interpret it?

In [None]:
only_carat_results

**SOLUTION:**

...

In order to get a better understanding of the variability of the results, we sample multiple training data sets from the overall dataset, which will then allow us to get a grasp of the variance:

In [None]:
np.random.seed(42)

number_of_samplings = 10

n_train_all = X_train_scaled.shape[0]
training_sample_counts = [ 2, 5, 10, 100, 1000, 10000 ]

all_linear_params = pd.DataFrame()
all_linear_results = pd.DataFrame()

for n_train in training_sample_counts:
    print('Using ' + str(n_train) + ' samples for training')

    for iter in range(number_of_samplings):
        # randomly sample the required number of indices
        my_train_idx = np.random.choice(range(n_train_all), size=n_train, replace=False)

        # select the respective training data points from the training data set (X and y):
        my_X_train = X_train_scaled.iloc[my_train_idx]
        my_y_train = y_train.iloc[my_train_idx]
    
        # instantiate model
        linear_model = LinearRegression()
        
        # train and evaluate model performance
        all_features_result_linear = train_apply_eval_model(linear_model, my_X_train, my_y_train,
                                                            X_val_scaled, y_val, model_name='linear', do_print=False)

        # book-keeping
        all_features_result_linear['n_train'] = n_train
        all_features_result_linear['iter'] = iter
        
        # add the result data frame to the overall data frame
        all_linear_results = pd.concat([all_linear_results, all_features_result_linear])    

        # add the model parameters to the model parameter structure
        my_params = pd.DataFrame(linear_model.coef_, columns=['carat', 'x', 'y', 'z', 'table', 'depth'])
        my_params['n_train'] = n_train
        my_params['iter'] = iter

        all_linear_params = pd.concat([all_linear_params, my_params])

To plot the results, we now use a boxplot. This allows us to see the typical spread of the results, as well as the outliers:

In [None]:
# Plotting the results
metrics =['r2_score', 'RMS']

# Create a figure with subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

for i, metric in enumerate(metrics):
    sns.boxplot(data=all_linear_results, x='n_train', y=metric, hue='dataset', ax=axs[i])
    axs[i].tick_params(axis='x', rotation=45)
    axs[i].set_xlabel('Training Set Size')
    axs[i].grid()

fig.suptitle('Model Performance depending on Training Data Size\nUnregularized Linear Regression', fontsize="x-large")
plt.tight_layout()
plt.show()
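
To make the elements of a boxplot concrete, the following sketch computes them by hand for a handful of made-up values. It assumes the usual convention (also the default in seaborn and matplotlib, as far as we rely on it here) that points further than 1.5 times the interquartile range away from the box are drawn as outliers.

```python
import numpy as np

# Made-up scores for illustration; the last value is an obvious outlier
scores_demo = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# The box spans the first to the third quartile, the line inside is the median
q1, median, q3 = np.percentile(scores_demo, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores_demo[(scores_demo < lower_fence) | (scores_demo > upper_fence)]

print(f'box from {q1} to {q3}, median {median}, outliers: {outliers}')
```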

In [None]:
all_linear_params_long = pd.melt(all_linear_params, id_vars=['n_train', 'iter'], value_vars = ['carat', 'x', 'y', 'z', 'table', 'depth'])

In [None]:
all_linear_params_long

In [None]:
sns.boxplot(data=all_linear_params_long, x='n_train', y='value', hue='variable')

**EXERCISE**: Review the interpretation above in light of the plots. Can you relate the results to the idea of bias and variance discussed in class?

**SOLUTION:**

...

Next, we want to compare the linear regression with a lasso model under the same settings.

**EXERCISE**: Adapt the above two cells (the one with the sampling, and the one with the plotting) to a lasso model. **You only have to change a few lines for this**.

*Hints:*
* Recall that in `scikit-learn`, all models have the same interface. For example, they all have a `fit` and a `predict` function. Therefore, you can use `train_apply_eval_model` for your lasso model as we did above for the linear regression model.
* Remember that for lasso, there is a hyperparameter (called `alpha`) that we need to optimize. This needs to be done *before* we can do a useful evaluation of the model, i.e. before we call `train_apply_eval_model`. We have done this already earlier in this notebook - hence, browse upwards for inspiration.
* Finally, more on a programming style level: Above we have named the variables that are specific to the linear models with a name that contains `linear`. We recommend you modify these variable names to sharpen your understanding and improve the clarity of the program code.
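
The first hint can be illustrated on synthetic data, without touching the diamond data set. In the sketch below, the data, the `*_demo` names and the value `alpha=0.01` are arbitrary assumptions for illustration (in particular, `alpha` is not tuned here). The point is only that both estimators are driven through the same `fit` and `predict` calls.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

# Synthetic data (assumption, for illustration): y depends linearly on two features
rng = np.random.default_rng(seed=0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Both estimators are used through exactly the same calls
demo_predictions = {}
for demo_model in [LinearRegression(), Lasso(alpha=0.01)]:
    demo_model.fit(X_demo, y_demo)
    demo_predictions[type(demo_model).__name__] = demo_model.predict(X_demo)
    print(type(demo_model).__name__, 'coefficients:', demo_model.coef_)
```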

**Note**: You might get a convergence warning (something like `ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 3.297e+10, tolerance: 5.518e+07`, with potentially different numbers). This means that the optimization algorithm did not converge to a good solution. As mentioned in the warning message, stronger regularization can help here. Since we are searching for the best hyperparameter, we can assume that the candidate values for which these messages occur do not yield good results and will therefore be discarded. To avoid cluttering the output, we switch off the display of convergence warnings with the cell below:

In [None]:
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)
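
Note that `simplefilter` as used above changes the warning filter for the rest of the session. If you prefer to silence warnings only around selected code, the standard library offers `warnings.catch_warnings()` as a context manager. The sketch below shows the idea with a hypothetical stand-in warning class (`DemoConvergenceWarning`) and a hypothetical function (`noisy_fit`), so that it runs without fitting any model; in the notebook you would use `ConvergenceWarning` instead.

```python
import warnings

class DemoConvergenceWarning(UserWarning):
    """Hypothetical stand-in for sklearn.exceptions.ConvergenceWarning."""

def noisy_fit():
    """Hypothetical function that emits a warning, like a non-converging fit."""
    warnings.warn('Objective did not converge.', DemoConvergenceWarning)
    return 'fitted'

# Inside this block, the demo warning is ignored ...
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')   # record every warning by default
    warnings.simplefilter('ignore', category=DemoConvergenceWarning)
    noisy_fit()
n_suppressed = len(caught)

# ... while without the 'ignore' filter, the same call is reported.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    noisy_fit()
n_shown = len(caught)

print('warnings with filter:', n_suppressed, '- without filter:', n_shown)
```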

In [None]:
# put your code here:
# ...

As a further illustration, we generate a boxplot of the estimated coefficients of each feature for the different training set sizes. For better comparability, and as lasso was not trained with fewer than 10 training samples, we also omit these cases for the linear regression:

In [None]:
all_linear_params_long = pd.melt(all_linear_params, id_vars=['n_train', 'iter'], value_vars = ['carat', 'x', 'y', 'z', 'table', 'depth'])
sns.boxplot(data=all_linear_params_long[all_linear_params_long['n_train']>5], x='n_train', y='value', hue='variable')

In [None]:
all_lasso_params_long = pd.melt(all_lasso_params, id_vars=['n_train', 'iter'], value_vars = ['carat', 'x', 'y', 'z', 'table', 'depth'])
sns.boxplot(data=all_lasso_params_long, x='n_train', y='value', hue='variable')

For small training set sizes, the estimated coefficients vary much more in the unregularized linear regression model (note the difference in the y-axis scale!), leading to very different models from one sample to the next. This also explains the large variation in performance. As more training data becomes available, the variation in the coefficients decreases.
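
The shrinking spread of the coefficients can also be reproduced on a toy problem. The sketch below is a minimal illustration under stated assumptions: the data are synthetic, the true coefficients (`true_coef_demo`) are made up, and ordinary least squares is solved with `np.linalg.lstsq` instead of `LinearRegression`. It is not a re-implementation of the experiment above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
true_coef_demo = np.array([2.0, -1.0])   # assumed ground truth of the toy problem

coef_spread_demo = {}
for n_demo in [10, 100, 1000]:
    estimates = []
    for _ in range(50):                  # 50 independent training sets per size
        X_demo = rng.normal(size=(n_demo, 2))
        y_demo = X_demo @ true_coef_demo + rng.normal(size=n_demo)
        # ordinary least squares (no intercept, as the toy data has none)
        coef_hat, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
        estimates.append(coef_hat)
    # standard deviation of each coefficient across the 50 fits
    coef_spread_demo[n_demo] = np.std(estimates, axis=0)
    print(n_demo, coef_spread_demo[n_demo])
```

The printed standard deviations should shrink as the training set grows, which mirrors the narrowing boxes in the plots.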

## Multiple Polynomial Regression

In this last part of the notebook, we will fit a model using multiple polynomial regression. We will generate the polynomial features using the `PolynomialFeatures` class that you have already seen in the practice notebook. Unlike there, where we worked with a single feature of synthetic data, we now apply it to multiple features. In this case, `PolynomialFeatures` also creates all mixed terms up to the given degree. For example, `PolynomialFeatures(degree=3)` will generate the features `carat`, `carat^2` and `carat^3` as well as `x`, `x^2` and `x^3`, but also `carat x` (the product of `carat` and `x`), `carat^2 x` (the product of `carat^2` and `x`) and `carat x^2` (the product of `carat` and `x^2`).

The code below loops over the maximum degree of the polynomial up to `maxdegree` and applies the corresponding feature transformation. It then trains and evaluates the 3 different regression models on these features: linear regression, lasso and ridge regression. For the latter two, the optimal hyperparameter values are searched using `GridSearchCV`. The performance of each model on both the training and validation data sets, as well as the parameters of each model, are stored for later analysis.

*Note*: If you want, you can increase `maxdegree` in the code cell below. However, this will make the cell run much longer. More precisely, with $n_{feat}$ initial features and polynomials up to degree $degree$, the polynomial feature transformation produces $\binom{n_{feat} + degree}{degree}$ features (including the constant term), a number that grows on the order of $(n_{feat})^{degree}$. We have $n_{feat}=6$ features. So, each time you increase `maxdegree` by 1, the number of features is multiplied several times over, and the run time of the cell below grows accordingly. On my (Andreas') laptop, the cell below takes a good minute to run. We add some time measurement.
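
To get a feeling for how quickly the number of features grows, the sketch below counts the monomials of total degree at most `degree` in $n_{feat}$ variables, which is $\binom{n_{feat}+degree}{degree}$ when the constant term is included. This is the count we would expect from `PolynomialFeatures` with its default settings; the `*_demo` names are introduced here for illustration only.

```python
from math import comb

n_feat_demo = 6   # number of input features, as stated above

# Number of polynomial terms up to a given degree, including the constant term
n_terms_demo = []
for degree_demo in range(1, 5):
    n_terms = comb(n_feat_demo + degree_demo, degree_demo)
    n_terms_demo.append(n_terms)
    print(f'degree {degree_demo}: {n_terms} features')
```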

**DEMO EXERCISE**

In [None]:
import time
from sklearn.preprocessing import PolynomialFeatures

In [None]:
maxdegree = 2

alpha_range = [ 0.1, 0.3, 1, 3, 10, 30, 100]
model_list = ['linear', 'lasso', 'ridge']

all_results_poly = pd.DataFrame()
all_params_poly = pd.DataFrame()

for polydeg in range(1, maxdegree+1):
    # Create polynomial features
    print('***********')
    print('preparing polynomial features of degree ', polydeg)
    poly_features = PolynomialFeatures(degree=polydeg)
    poly_features.fit(X_train)
    
    X_train_poly = poly_features.transform(X_train)
    X_val_poly = poly_features.transform(X_val)

    # scaling
    poly_scaler = StandardScaler()
    poly_scaler.fit(X_train_poly)
    
    X_train_poly_scaled = poly_scaler.transform(X_train_poly)
    X_val_poly_scaled  = poly_scaler.transform(X_val_poly)

    # generating names of polynomial features
    poly_names = poly_features.get_feature_names_out(X_train_scaled.columns)
    X_train_poly_scaled = pd.DataFrame(X_train_poly_scaled, columns = poly_names)
    X_val_poly_scaled = pd.DataFrame(X_val_poly_scaled, columns = poly_names)

    # run training, parameter optimization and evaluation over the 3 model types.
    # we do a loop here, as most of the code would have to be copied otherwise.
    for model_type in model_list:
        if model_type == 'linear':
            # linear model: we only need to initialize it.
            model_name = model_type
            print('Initializing linear model')
            model = LinearRegression()
        else:
            # for the other two models (i.e., lasso and ridge):
            # we need to tune the hyperparameters.
            if model_type == 'lasso':
                model = Lasso()
            else:
                model = Ridge()

            # Find optimal value of hyperparameter 
            # Note this is done in the same way for lasso and for ridge regression
            print('Finding optimal hyperparameter value for', model_type)
            time_start = time.time()
            grid_search = GridSearchCV(model, param_grid = {'alpha': alpha_range}, scoring='r2')
            grid_search.fit(X_train_poly_scaled, y_train)
            
            # Get the best hyperparameter
            best_alpha = grid_search.best_params_['alpha']
            time_end = time.time()
            print(f'-> optimization took {(time_end - time_start):.2f} seconds.')
            print(f'-> Best alpha value: {best_alpha}')
                        
            # initialize the model with the best hyperparameter we just found.
            
            if model_type == 'lasso':
                model = Lasso(alpha=best_alpha)
            else:
                model = Ridge(alpha=best_alpha)
            model_name = model_type + ', alpha=' + str(best_alpha)

        # here the if-else condition is over, i.e. we do this for all three models.
        # main training and evaluation, using the function train_apply_eval_model we have used before.
        model_name = model_name + ', features degree ' + str(polydeg)
        time_start = time.time()
        results = train_apply_eval_model(model, X_train_poly_scaled, y_train, 
                                         X_val_poly_scaled, y_val, model_name=model_name, do_print=False)
        all_results_poly = pd.concat([all_results_poly, results])
        time_end = time.time()
        print(f'-> training and evaluation took {(time_end - time_start):.2f} seconds.\n')

        # save model parameters:
        my_params = pd.DataFrame(np.reshape(model.coef_, (1, -1)), columns=X_train_poly_scaled.columns)
        my_params['model_name'] = model_name
        my_params['model_type'] = model_type
        my_params['degree'] = polydeg
        my_params_long = pd.melt(my_params, id_vars=['model_name', 'model_type', 'degree'])
        all_params_poly = pd.concat([all_params_poly, my_params_long])

The code below plots the coefficients of each of the three model types (linear, lasso and ridge) for all degrees that were calculated (i.e., depending on your choice of `maxdegree` above):

In [None]:
for polydeg in range(1, maxdegree+1):
    for model_name in ['linear', 'lasso', 'ridge']:
        rel_params = all_params_poly[(all_params_poly['model_type']==model_name) & (all_params_poly['degree']==polydeg)]
        # use the first row's model name as plot label (position-based, independent of the index)
        visualize_coefficients(rel_params['value'], rel_params['variable'], model_name=rel_params['model_name'].iloc[0])

**EXERCISE**: What do you observe when looking at the model parameters?

**SOLUTION:**

...

Using the same graphics as you already know, we visualize the performance of the different models:

In [None]:
# Plotting the results
metrics =['r2_score', 'RMS']

# Create a figure with subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 3))

for i, metric in enumerate(metrics):
    sns.barplot(data=all_results_poly, y='model_name', x=metric, hue='dataset', ax=axs[i])
    axs[i].legend(loc='lower center')
    axs[i].grid()

plt.tight_layout()
plt.show()

In [None]:
all_results_poly

**EXERCISE:** Among the models listed above, which one do you choose to automatically predict the diamond price? What steps would be necessary to give a dependable estimate of the model performance on new data (you don't have to program it)? Explain why these steps are necessary.