# Your turn to try out Linear Regression with Test/Train Split & Regularization

If you get stuck at any step, please ask others or ask me!

Execute the following cell to import our libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error

We'll use a dataset from Scikit-Learn: the [California Housing Dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset)

In [None]:
# Execute this cell to import the data
# print a description
# and initialize the Pandas dataframe "ca_housing_df"

from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing(as_frame=True)

print(california_housing.DESCR)

ca_housing_df = california_housing.frame

Use the cell below to look at some sample rows of the `ca_housing_df` dataframe:

Use the `info` method to look at the number of rows & columns, and see whether there are any null values:
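
Here is a minimal sketch of the two inspection steps above. So that it runs on its own, it builds a small synthetic frame called `demo_df` (a hypothetical stand-in with made-up feature names, not the real data); in the notebook you would call the same methods on `ca_housing_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ca_housing_df so this sketch runs without a download.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "feat_a": rng.normal(size=100),
    "feat_b": rng.uniform(0.0, 10.0, size=100),
    "MedHouseVal": rng.uniform(0.5, 5.0, size=100),
})

print(demo_df.head())                     # first five rows
print(demo_df.sample(5, random_state=0))  # five randomly chosen rows

demo_df.info()                            # row count, dtypes, non-null counts per column
null_counts = demo_df.isna().sum()        # explicit count of missing values per column
print(null_counts)
```

`isna().sum()` is not required by the exercise; it is just a second way to confirm what `info` reports about null values.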

Make simple histogram plots of all variables (e.g. with `ca_housing_df.hist()`).
* Are they normally distributed? Bimodal? Mostly normal with a couple of outliers? Uniformly distributed with obvious caps on the allowable range of values?
* If you'd like, tinker with the number of bins for the histogram, zoom in on the ranges, etc.
* You may also find it useful to change the figure size (e.g. `figsize=(12, 10)` as an input parameter to `hist`) or call `plt.tight_layout()` after the plotting command to keep multiple plots from overlapping.
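
The plotting options above could look something like this. The frame `demo_df` and its column names are invented for illustration (one roughly normal variable, one bimodal, one with an artificial cap); swap in `ca_housing_df` in the notebook.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data with three different shapes of distribution.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "roughly_normal": rng.normal(size=1000),
    "bimodal": np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)]),
    "capped": np.clip(rng.exponential(2.0, size=1000), None, 5.0),  # values above 5 pile up at 5
})

# One histogram per column; more bins and a larger figure make the shapes easier to read.
axes = demo_df.hist(bins=40, figsize=(12, 10))
plt.tight_layout()

# Zoom in on part of one variable's range to inspect the pile-up at the cap.
fig, ax = plt.subplots()
demo_df["capped"].hist(bins=40, range=(3.0, 5.0), ax=ax)
ax.set_xlabel("capped")
```

A spike in the last bin of a histogram, as in the `capped` column here, is the kind of pattern the "obvious caps" question is asking you to look for.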

You can get a matrix of correlation coefficients by using the dataframe's `corr()` method.
* Check that out in the cell below
* Which variables are most correlated with the target variable, `MedHouseVal`?
* Which pairs of variables are highly correlated with each other?
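
One possible way to answer both correlation questions is sketched below. It uses a synthetic `demo_df` with hypothetical feature names (`feat_a`, `feat_b`, `feat_c`), where `feat_b` is deliberately built as a noisy copy of `feat_a`; with the real data you would start from `ca_housing_df.corr()` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: feat_b is nearly a copy of feat_a, feat_c is unrelated.
rng = np.random.default_rng(0)
n = 500
feat_a = rng.normal(size=n)
demo_df = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_a + 0.1 * rng.normal(size=n),
    "feat_c": rng.normal(size=n),
    "MedHouseVal": 2.0 * feat_a + 0.5 * rng.normal(size=n),
})

corr = demo_df.corr()

# Correlation of each feature with the target, strongest (by absolute value) first.
target_corr = corr["MedHouseVal"].drop("MedHouseVal").sort_values(key=np.abs, ascending=False)
print(target_corr)

# Feature-feature pairs: keep only the upper triangle so each pair appears once.
feature_corr = corr.drop(index="MedHouseVal", columns="MedHouseVal")
upper = np.triu(np.ones(feature_corr.shape, dtype=bool), k=1)
pairs = feature_corr.where(upper).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Sorting by absolute value matters because a strong negative correlation is just as interesting here as a strong positive one.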

Use `train_test_split` to make a training set and test set, where `MedHouseVal` is your target variable and all other variables are your feature variables.
* You can use `california_housing.data` and `california_housing.target` to get your features and target, or `ca_housing_df.loc[:, ca_housing_df.columns != 'MedHouseVal']` and `ca_housing_df.loc[:, 'MedHouseVal']` (other options work too)
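
A minimal sketch of the split, again on a synthetic `demo_df` with made-up feature names so that it runs without the download. The `test_size=0.2` and `random_state=42` values are assumptions for illustration; the exercise does not prescribe them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for ca_housing_df.
rng = np.random.default_rng(0)
n = 500
feat_a = rng.normal(size=n)
demo_df = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_a + 0.1 * rng.normal(size=n),
    "feat_c": rng.normal(size=n),
    "MedHouseVal": 2.0 * feat_a + 0.5 * rng.normal(size=n),
})

# Features are every column except the target; the target is MedHouseVal.
X = demo_df.loc[:, demo_df.columns != "MedHouseVal"]
y = demo_df.loc[:, "MedHouseVal"]

# Hold out 20% of the rows for testing; a fixed random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```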

In the next cell:
* train `LinearRegression` on your training set
* assess the learned model's performance on the test data using `mean_squared_error`
* make a plot of the coefficient amplitudes
  * the coefficient values are stored in the `coef_` attribute of the variable for your `LinearRegression` object
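
The three steps above might be sketched as follows. The data is a synthetic stand-in (`demo_df`, hypothetical feature names) and the split parameters are assumptions; in the notebook you would reuse your own `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data and split.
rng = np.random.default_rng(0)
n = 500
feat_a = rng.normal(size=n)
demo_df = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_a + 0.1 * rng.normal(size=n),
    "feat_c": rng.normal(size=n),
    "MedHouseVal": 2.0 * feat_a + 0.5 * rng.normal(size=n),
})
X = demo_df.drop(columns="MedHouseVal")
y = demo_df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit on the training set only.
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Score on the held-out test set.
mse = mean_squared_error(y_test, lin_reg.predict(X_test))
print(f"Test MSE: {mse:.3f}")

# One bar per feature, with height equal to the learned coefficient.
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(X.columns, lin_reg.coef_)
ax.set_ylabel("coefficient value")
ax.set_title("LinearRegression coefficients")
```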

Do the same thing, but now with Ridge regularization, using `Ridge(alpha=1000)`.

Repeat again using Lasso with `Lasso(alpha=0.05)`.
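
Since the Ridge and Lasso steps repeat the same fit-and-score pattern, one option is to loop over the models. This sketch uses the `alpha` values given above but a synthetic stand-in dataset (`demo_df`, hypothetical feature names), so the numbers it prints say nothing about the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data and split.
rng = np.random.default_rng(0)
n = 500
feat_a = rng.normal(size=n)
demo_df = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_a + 0.1 * rng.normal(size=n),
    "feat_c": rng.normal(size=n),
    "MedHouseVal": 2.0 * feat_a + 0.5 * rng.normal(size=n),
})
X = demo_df.drop(columns="MedHouseVal")
y = demo_df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1000),
    "lasso": Lasso(alpha=0.05),
}

# Fit each model on the training set and report test error and coefficients.
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:>6}: test MSE = {test_mse[name]:.3f}, coef_ = {np.round(model.coef_, 3)}")
```

Keeping the fitted models in a dictionary makes it easy to plot or compare their `coef_` arrays afterwards.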

Check the following:
* Is it easy to see differences in coefficient amplitudes on the same plot? Between the separate plots?
* Does Ridge reduce the coefficient amplitudes?
* Does Lasso reduce the coefficient amplitudes?
* What is the difference in coefficients resulting from Lasso vs Ridge?
* Look at the pairs of variables that you previously identified as being highly correlated with each other. Has regularization had more of an effect on the coefficient amplitude of one variable in a pair than on the other?
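
To work through the questions above, it can help to put all the coefficients in one table and one grouped bar plot. This is only a sketch on synthetic data (`demo_df` and its feature names are hypothetical, with `feat_a` and `feat_b` built to be highly correlated); the pattern you see on the housing data may differ.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data and split.
rng = np.random.default_rng(0)
n = 500
feat_a = rng.normal(size=n)
demo_df = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_a + 0.1 * rng.normal(size=n),
    "feat_c": rng.normal(size=n),
    "MedHouseVal": 2.0 * feat_a + 0.5 * rng.normal(size=n),
})
X = demo_df.drop(columns="MedHouseVal")
y = demo_df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {"linear": LinearRegression(), "ridge": Ridge(alpha=1000), "lasso": Lasso(alpha=0.05)}

# Rows are features, columns are models, values are fitted coefficients.
coef_df = pd.DataFrame(
    {name: model.fit(X_train, y_train).coef_ for name, model in models.items()},
    index=X.columns,
)
print(coef_df.round(3))

# Count coefficients that each model set exactly to zero.
print((coef_df == 0).sum())

# Grouped bars: one group per feature, one bar per model.
ax = coef_df.plot.bar(figsize=(8, 4))
ax.set_ylabel("coefficient value")
plt.tight_layout()
```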

If you get this far, try some of the following:
* Try out Elastic Net with `ElasticNet(alpha=0.05)` and compare
* Scaling of feature values can be useful for machine learning.
  * Look up Scikit-Learn's `StandardScaler`
  * Use this to scale your feature variables
  * Repeat one of the above ML trainings and look at what effect it has on the coefficients
* Make a bar plot that compares coefficient amplitudes for all your models
* Use cross-validation to find an optimum `alpha` value among some set of possible `alpha` values for one of your regularization models
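
For the scaling and cross-validation items, one possible approach is a pipeline of `StandardScaler` followed by `Ridge`, searched over a grid of `alpha` values with `GridSearchCV`. The candidate `alphas`, the 5-fold setting, and the synthetic `demo_df` (hypothetical feature names, with `feat_c` put on a much larger scale on purpose) are all assumptions made for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; feat_c lives on a scale about 1000x larger than the others.
rng = np.random.default_rng(0)
n = 500
feat_a = rng.normal(size=n)
demo_df = pd.DataFrame({
    "feat_a": feat_a,
    "feat_b": feat_a + 0.1 * rng.normal(size=n),
    "feat_c": 1000.0 * rng.normal(size=n),
    "MedHouseVal": 2.0 * feat_a + 0.5 * rng.normal(size=n),
})
X = demo_df.drop(columns="MedHouseVal")
y = demo_df["MedHouseVal"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling inside the pipeline means the scaler is refit on each CV training fold,
# so no information from the validation fold leaks into the scaling.
pipe = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names each step after its lowercased class, hence "ridge__alpha".
alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
test_mse = mean_squared_error(y_test, search.predict(X_test))
scaled_coefs = search.best_estimator_.named_steps["ridge"].coef_

print(f"best alpha: {best_alpha}, test MSE: {test_mse:.3f}")
print(dict(zip(X.columns, np.round(scaled_coefs, 3))))
```

After scaling, each coefficient is expressed per standard deviation of its feature, which makes amplitudes easier to compare across features. Swapping `Ridge()` for `Lasso()` or `ElasticNet()` should work the same way, with the parameter name changed to `lasso__alpha` or `elasticnet__alpha`.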