# Assignment 7: Linear Model Selection and Regularization
Derived from MLEARN51-Assignment7-Student_Name.ipynb in Canvas MLEARN 510 Spring 2024.   <br>
Modified by Ernst Henle. Modifications Copyright © 2024 by Ernst Henle<br>
<br>
## Learning Objectives
- Produce a model with l2 regularization, with a statistically significant improvement over a model without regularization.
- Produce a model with l1 regularization, with a statistically significant improvement over a model without regularization.
- Produce a model with both l1 and l2 regularization terms, with a statistically significant improvement over a model without regularization.
- Produce a generalized additive model with a statistically significant improvement over the null model (a model without input variables).

In [None]:
# The following code can be removed
import time
start_time = time.time()

In [None]:
# Packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge, LinearRegression, ElasticNet, Lasso
from sklearn.metrics import mean_squared_error

# Lasso and ElasticNet fits may emit more than 50 ConvergenceWarning messages; suppress all warnings
import warnings
warnings.filterwarnings('ignore')

# our favorite magic
%matplotlib inline

## Get Data and Basic EDA
<br>
Dataset(s) needed:
Kaggle House Prices (https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)<br>

This data is only the training data.  We will not use the actual test data in this exercise.  All "tests" will be validations done on validation data that is taken from the training data.

In [None]:
train = pd.read_csv('../data/House Prices.csv')
print(train.shape)
print(train.dtypes.value_counts())
train.head()

 Question 1.1: Drop the Id column from the data as it is not needed for prediction and may actually lead to overfitting.
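
As a minimal sketch of dropping an identifier column, the example below uses a small made-up frame rather than the assignment data. The name `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "GrLivArea": [1500, 2100, 900],
    "SalePrice": [200000, 310000, 120000],
})

# drop() returns a new frame, so assign the result back to the same name.
toy = toy.drop(columns=["Id"])
print(toy.columns.tolist())
```

The same call pattern should carry over to `train`, assuming the identifier column is named `Id` as the question states.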

In [None]:
# drop id column


 Question 1.2: Create a scatter plot with 'GrLivArea' on the x-axis and 'SalePrice' on the y-axis. Can you spot any outliers?

In [None]:
# Create scatter plot


 Question 1.3: Remove the outliers by dropping all rows where GrLivArea is greater than 4000, then check the scatter plot again.
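
One common way to drop rows by a threshold is a boolean mask. The sketch below shows the idea on invented values; `toy` and `filtered` are hypothetical names, and the numbers are not taken from the real data.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy = pd.DataFrame({
    "GrLivArea": [1500, 4700, 2100, 5600],
    "SalePrice": [200000, 185000, 310000, 160000],
})

# Keep only rows at or below the threshold; reset the index so it stays contiguous.
filtered = toy[toy["GrLivArea"] <= 4000].reset_index(drop=True)
print(len(toy), "rows before,", len(filtered), "rows after")
```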

In [None]:
# Remove outliers for train['GrLivArea']>4000

# Create scatter plot that shows that the 2 outliers were removed


Question 2.1: Convert categorical variables into dummy variables using the pandas `get_dummies` API

Do not use sklearn.  In sklearn you would have to do the following:
1. identify the category columns in the dataframe
2. create a one-hot-encoder object
3. one-hot-encode the category columns of the dataframe and put results in a new dataframe
4. drop the category columns from the original dataframe to create a dataframe of the original numeric variables
5. combine the new dataframe of one-hot-encoded variables with the numeric variables of the original dataframe

<br><br>
Do the following:
1. One-hot-encode using pandas `get_dummies`.  With `get_dummies` you pass the data as the argument and assign the output to the same variable name.
2. Present the shape of the data.  Use `shape` as was done above.  How many columns were added?
3. Present the counts of data types.  Use `dtypes` and `value_counts` as was done above.  How have the data types changed?
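
The sketch below illustrates how `pd.get_dummies` behaves on a tiny made-up frame with one numeric and one categorical column. The frame `toy` is an assumption for illustration; the dtype of the dummy columns depends on the installed pandas version.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (invented values).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Numeric columns pass through unchanged; each category level becomes its own column.
encoded = pd.get_dummies(toy)

print(encoded.shape)
print(encoded.dtypes.value_counts())
```

Here the single `Street` column is replaced by one column per level (`Street_Grvl`, `Street_Pave`), so the column count grows by the number of levels minus one.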

In [None]:
# One-hot encode

# Present shape of data

# Present counts of data type


Question 2.2: Impute missing data by the median of each column.
1. Count the total number of nulls in the data
2. Replace nulls with column medians
3. Count the total number of nulls in the data again to confirm that none remain
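
The three steps above can be sketched on a small made-up frame as follows. The frame `toy` and its values are assumptions for illustration, and all columns are numeric here.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (invented values).
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0],
})

# isnull() gives a boolean frame; summing twice collapses it to a single total.
nulls_before = int(toy.isnull().sum().sum())

# fillna() with a Series of medians fills each column with its own median.
toy = toy.fillna(toy.median())

nulls_after = int(toy.isnull().sum().sum())
print(nulls_before, "->", nulls_after)
```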

In [None]:
# Count the total number of nulls in the data

# Replace nulls with column medians

# Count the total number of nulls in the data again


Question 2.3: Generate a 70/30 train/validation (test) split
1. Create the input variables `X` without 'SalePrice'
2. Create the target variable `y` which is 'SalePrice'
3. Do train-test split to split data into training and validation datasets
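
A minimal sketch of these steps on synthetic data follows. The toy columns, the made-up target formula, and `random_state=0` are assumptions for illustration; only `test_size=0.3` comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (10 rows, invented values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "GrLivArea": rng.integers(800, 3000, size=10),
    "OverallQual": rng.integers(1, 11, size=10),
})
toy["SalePrice"] = 100 * toy["GrLivArea"] + 10000 * toy["OverallQual"]

# Inputs are every column except the target.
X = toy.drop(columns=["SalePrice"])
y = toy["SalePrice"]

# test_size=0.3 holds out 30% of the rows for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(X_train.shape, X_valid.shape)
```

Fixing `random_state` makes the split reproducible, which matters when comparing models across reruns.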

In [None]:
# Create input and target variables X and y


# Do train-test split


Question 3.1: Train a linear regression algorithm to predict `SalePrice` from the remaining features.

In [None]:
# Fit a linear regression model to this data


Question 3.2: Evaluate the model with RMSE. Report the performance on both training and test data. These numbers will serve as our benchmark performance.
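
As a hedged illustration of fitting a benchmark model and computing RMSE, the sketch below uses synthetic data from `make_regression` instead of the housing data. The helper `rmse` is a hypothetical name, and taking the square root of `mean_squared_error` is one of several ways to obtain RMSE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def rmse(model, X, y):
    """Root mean squared error of the model's predictions on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))

ols = LinearRegression().fit(X_train, y_train)
rmse_train = rmse(ols, X_train, y_train)
rmse_valid = rmse(ols, X_valid, y_valid)
print(f"train RMSE: {rmse_train:.2f}, validation RMSE: {rmse_valid:.2f}")
```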

In [None]:
# Compute the RMSE on the training and test data


We now train a regularized version of `LinearRegression` called `Lasso`. `Lasso` has an argument called `alpha`, which is the **shrinkage parameter**.

Question 4.1: Let `alpha = 0.000001` and train a `Lasso` algorithm. Show that the resulting model is practically identical to the one we trained with `LinearRegression`. There are several ways to show this, so you will need to choose one. <span style="color:red; float:right">[2 points]</span>
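
One possible way to compare the two models is to look at the largest absolute difference between their coefficient vectors. The sketch below does this on synthetic data; the tight `tol` and the larger `max_iter` are assumptions chosen so the coordinate descent solver converges closely, not values given in the assignment.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

ols = LinearRegression().fit(X_train, y_train)
# tol and max_iter are illustrative choices to let the solver converge tightly.
lasso = Lasso(alpha=0.000001, tol=1e-8, max_iter=10000).fit(X_train, y_train)

# With a negligible penalty, the two coefficient vectors should nearly coincide.
max_coef_diff = float(np.max(np.abs(ols.coef_ - lasso.coef_)))
max_pred_diff = float(np.max(np.abs(ols.predict(X_valid) - lasso.predict(X_valid))))
print(f"max coefficient difference: {max_coef_diff:.2e}")
print(f"max prediction difference:  {max_pred_diff:.2e}")
```

Comparing train and validation RMSE of the two models is an equally valid check; coefficient differences simply make the near-equivalence visible at the parameter level.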

In [None]:
# Train Lasso with very small alpha


# Compute the RMSE for train and validation (test)


Question 4.2: Iteratively train a new `Lasso` model, letting `alpha` change each time to one of the values given by the suggested `alpha_vals` below.
For each alpha keep track of and store: 
- the performance (RMSE) on the training data
- the performance (RMSE) on the validation (test) data
- the coefficients (`coef_`) of the trained model
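
The bookkeeping described above can be sketched as a loop that appends to three lists. The example runs on synthetic data; the list names and `max_iter=10000` are assumptions, while `alpha_vals` follows the suggested values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

alpha_vals = 10**np.arange(-1, 4, .2)
train_rmse, valid_rmse, coefs = [], [], []

for alpha in alpha_vals:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    train_rmse.append(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
    valid_rmse.append(np.sqrt(mean_squared_error(y_valid, model.predict(X_valid))))
    # Copy so each stored vector is independent of the model object.
    coefs.append(model.coef_.copy())

coefs = np.array(coefs)  # shape: (number of alphas, number of features)
print(coefs.shape)
```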

In [None]:
# The following are suggested alpha values
alpha_vals = 10**np.arange(-1, 4, .2)

# For each alpha, determine train rmse, test rmse, coefficients


Question 4.3: Using a visual, show how the performance (RMSE) on the training and test data changed as we gradually increased `alpha`. Use a line plot where the x-axis is `alpha` and the y-axis is RMSE.  Use a log scale for the x-axis.
<br><br>
Discuss your results:
- From this plot, estimate the best alpha value.
- How does the plot for the training data compare to the lineplot of the validation (test) data?
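
A rough sketch of such a plot, again on synthetic data, is shown below. Reading the best alpha off the minimum of the validation curve is one reasonable heuristic, not the only one; the variable names are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

alpha_vals = 10**np.arange(-1, 4, .2)
train_rmse, valid_rmse = [], []
for alpha in alpha_vals:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    train_rmse.append(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
    valid_rmse.append(np.sqrt(mean_squared_error(y_valid, model.predict(X_valid))))

fig, ax = plt.subplots()
ax.plot(alpha_vals, train_rmse, label="train")
ax.plot(alpha_vals, valid_rmse, label="validation")
ax.set_xscale("log")  # alpha spans several orders of magnitude
ax.set_xlabel("alpha")
ax.set_ylabel("RMSE")
ax.legend()

# The alpha with the lowest validation RMSE is a natural first estimate.
best_alpha = alpha_vals[int(np.argmin(valid_rmse))]
print("estimated best alpha:", best_alpha)
```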

In [None]:
# Create lineplot where the x-axis is alpha and the y axis is rmse


Question 4.4: Using a visual, show how the model's coefficients changed as we gradually increased the shrinkage parameter `alpha`. HINT: They should appear to be shrinking toward zero as you increase `alpha`!  There are too many coefficients to create lineplots for every coefficient.  Present only a subset of the coefficients that make the point.
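
The commented `sns.lineplot` call in the cell that follows expects a long-format frame with columns `alpha`, `coef`, and `col`. The sketch below shows one way such a frame might be built on synthetic data. The feature names `x0`, `x1`, ... and the choice of the five largest coefficients are assumptions, and the model is fit on the full synthetic set for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names

alpha_vals = 10**np.arange(-1, 4, .2)
rows = []
for alpha in alpha_vals:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    # One row per (alpha, feature) pair gives the long format seaborn expects.
    for name, coef in zip(feature_names, model.coef_):
        rows.append({"alpha": alpha, "col": name, "coef": coef})
iter_coefs = pd.DataFrame(rows)

# Keep only the 5 features with the largest |coef| at the smallest alpha.
first = iter_coefs[iter_coefs["alpha"] == iter_coefs["alpha"].min()]
top_cols = first.loc[first["coef"].abs().nlargest(5).index, "col"]
iter_coefs = iter_coefs[iter_coefs["col"].isin(top_cols)]
print(iter_coefs.head())
```

Selecting by coefficient magnitude at the smallest alpha keeps the features whose shrinkage is easiest to see; other subsets may make the point equally well.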

In [None]:
# ax = sns.lineplot(x = 'alpha', y = 'coef', hue = 'col', data = iter_coefs)
# ax.legend(loc = 'center right', bbox_to_anchor = (1.5, 0.5), ncol = 1);

Question 5.1: Repeat the steps in Question 4.2, this time using `Ridge` instead of `Lasso`.
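
The same loop structure applies with `Ridge` swapped in for `Lasso`. The sketch below runs on synthetic data with assumed variable names. It also tracks the norm of the coefficient vector, since ridge shrinks coefficients toward zero but typically does not set them exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0
)

alpha_vals = 10**np.arange(-1, 4, .2)
train_rmse, valid_rmse, coefs = [], [], []

for alpha in alpha_vals:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    train_rmse.append(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
    valid_rmse.append(np.sqrt(mean_squared_error(y_valid, model.predict(X_valid))))
    coefs.append(model.coef_.copy())

coefs = np.array(coefs)
# Euclidean norm of the coefficient vector for each alpha.
coef_norms = np.linalg.norm(coefs, axis=1)
print(coef_norms[0], coef_norms[-1])
```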

Question 5.2: Using a visual, show how the performance (RMSE) on the training and test data changed as we gradually increased `alpha`. Use a line plot where the x-axis is `alpha` and the y-axis is RMSE.  Use a log scale for the x-axis.
<br><br>
Discuss your results:
- From this plot, estimate the best alpha value.
- How does the plot for the training data compare to the validation (test) data?

Question 5.3: Using a visual, show how the model's coefficients changed as we gradually increased the shrinkage parameter `alpha`. HINT: They should appear to be shrinking toward zero as you increase `alpha`!  There are too many coefficients to create lineplots for every coefficient.  Present a subset of the coefficients that make the point.

In [None]:
# ax = sns.lineplot(x = 'alpha', y = 'coef', hue = 'col', data = iter_coefs)
# ax.legend(loc = 'center right', bbox_to_anchor = (1.5, 0.5), ncol = 1);

In [None]:
# The following code can be removed
print("Elapsed time: ", time.time() - start_time)