# Lec 17 Lab: The Lasso
## CMSE 381 - Fall 2022
## Oct 19, 2022



In this module we are going to test out the ridge/lasso methods we discussed in class from Chapter 6.2, and the PCR ideas from Chapter 6.3.

In [None]:
# Everyone's favorite standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import time


# ML imports we've used previously
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error



# Loading in the data

Ok, here we go, let's play with a baseball data set again. Note this cleanup is all the same as the last lab. 

In [None]:
hitters_df = pd.read_csv('Hitters.csv')

# Print the dimensions of the original Hitters data (322 rows x 20 columns)
print("Dimensions of original data:", hitters_df.shape)

# Drop any rows that contain missing values, along with the player names
hitters_df = hitters_df.dropna().drop('Player', axis=1)

# Replace any categorical variables with dummy variables
hitters_df = pd.get_dummies(hitters_df, drop_first = True)

hitters_df.head()

In [None]:
y = hitters_df.Salary

# Drop the column with the dependent variable (Salary), leaving only the predictors
X = hitters_df.drop(['Salary'], axis = 1).astype('float64')

X.info()

We had an issue last class with normalization. While there are built-in tools in `scikit-learn` to do this, they make the code more complicated, so for now, let's just do it manually. 

In [None]:
# Manually normalizing 

X_Normalized = X/X.std() # X.std() is a Series of per-column standard deviations, so the division is applied columnwise
X_Normalized
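
To see what the manual normalization above is doing, here is a minimal sketch on a made-up two-column DataFrame (the name `toy` and its values are hypothetical, not taken from `Hitters.csv`). Note that dividing by the standard deviation only rescales each column; it does not subtract the mean.

```python
import pandas as pd

# Toy data (hypothetical values, only for illustration)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, 30.0, 50.0, 70.0]})

# toy.std() is a Series indexed by column name, so the division lines up
# on column labels and scales each column by its own standard deviation
toy_normalized = toy / toy.std()

print(toy.std())             # original per-column standard deviations
print(toy_normalized.std())  # after scaling, each column has standard deviation 1
```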

Finally, let's get a simple train/test split to work with (validation style), and a list of $\alpha$s to test for our Lasso.

In [None]:
# Split data into training and test sets
X_train, X_test , y_train, y_test = train_test_split(X_Normalized, y, test_size=0.3, random_state=1)

# List of alphas
alphas = 10**np.linspace(4,-2,100)*0.5
# alphas = np.append(alphas,0)
alphas
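
If the expression for `alphas` looks cryptic, the sketch below unpacks it: it is a log-spaced grid running from large to small penalties. The `np.logspace` version (stored in the hypothetical name `alphas_alt`) is just an alternative spelling of the same grid, not something the lab requires.

```python
import numpy as np

# Same grid as in the cell above: exponents run from 4 down to -2,
# then every value is multiplied by 0.5
alphas = 10**np.linspace(4, -2, 100) * 0.5

print(alphas.shape)           # number of candidate alphas
print(alphas[0], alphas[-1])  # largest and smallest penalty in the grid

# An equivalent way to build the grid using np.logspace
alphas_alt = 0.5 * np.logspace(4, -2, 100)
print(np.allclose(alphas, alphas_alt))
```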

# Lasso 

Thanks to the wonders of `scikit-learn`, now that we know how to do all this with ridge regression, translation to lasso is super easy. 

- [Lasso Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html)
- [LassoCV Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html#sklearn.linear_model.LassoCV)
- [User guide](https://scikit-learn.org/stable/modules/linear_model.html#lasso)
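
For reference, and going by the scikit-learn documentation linked above, `Lasso` minimizes the following objective, where $n$ is the number of training samples and $w$ is the coefficient vector (the intercept is fit separately and is not penalized):

$$\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

Because of the $\frac{1}{2n}$ factor, this $\alpha$ is not on the same scale as the textbook's $\lambda$ in $\text{RSS} + \lambda \sum_j |\beta_j|$; under this form the two are related by $\alpha = \lambda / (2n)$.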



In [None]:
from sklearn.linear_model import Lasso, LassoCV

In [None]:
# Here's a quick lasso fit for a fixed alpha
lasso = Lasso(max_iter = 10000)
# max_iter caps the number of iterations the solver may use to converge.
# In our case, if I leave it at the default of 1000 I was getting convergence
# warnings, so I upped the value. 
lasso.set_params(alpha=1)
lasso.fit(X_train, y_train)

mean_squared_error(y_test,lasso.predict(X_test))


&#9989; **<font color=red>Do this:</font>** Make a few graphs similar to what we did in the previous lab, but using Lasso instead of Ridge. 
- A graph of the coefficients as alpha changes 
- A graph of the test mean squared error as alpha changes

*Note: we did similar things in the last class, you should be able to borrow and modify code from there*

In [None]:
# Your code for a graph of the coefficients goes here 

In [None]:
# Your code for a graph of the MSEs goes here
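
If you don't have the previous lab handy, here is one possible pattern for looping over a grid of $\alpha$ values, shown on synthetic data rather than the Hitters data so that you still need to adapt it. Everything with `toy` in its name, the `make_regression` settings, and the smaller grid are assumptions made for this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the Hitters data)
X_toy, y_toy = make_regression(n_samples=200, n_features=8, n_informative=3,
                               noise=10.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

# A smaller log-spaced grid than the lab's, just to keep the sketch quick
toy_alphas = 10**np.linspace(2, -2, 30)

lasso_toy = Lasso(max_iter=10000)
coefs, errors = [], []
for a in toy_alphas:
    lasso_toy.set_params(alpha=a)
    lasso_toy.fit(Xtr, ytr)
    coefs.append(lasso_toy.coef_.copy())  # one row of coefficients per alpha
    errors.append(mean_squared_error(yte, lasso_toy.predict(Xte)))
coefs = np.array(coefs)

fig, (ax_coef, ax_mse) = plt.subplots(1, 2, figsize=(10, 4))
ax_coef.plot(toy_alphas, coefs)  # one curve per feature
ax_coef.set_xscale('log')
ax_coef.set_xlabel('alpha')
ax_coef.set_ylabel('coefficient value')
ax_mse.plot(toy_alphas, errors)
ax_mse.set_xscale('log')
ax_mse.set_xlabel('alpha')
ax_mse.set_ylabel('test MSE')
fig.tight_layout()
# In a notebook with %matplotlib inline, the figure renders when the cell finishes
```

Storing `coef_.copy()` at each step keeps one row per $\alpha$, so a single `plot` call draws one curve per feature.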


&#9989; **<font color=red>Do this:</font>** Now try what we did with `LassoCV`.  What choice of $\alpha$ does it recommend? 

*I would actually recommend either not passing in any $\alpha$ list or passing explicitly `alphas = None`. `RidgeCV` can't do this, but `LassoCV` will automatically try to find good choices of $\alpha$ for you.*

In [None]:
# Your code here
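
As a rough guide to the `LassoCV` interface, the sketch below fits it on synthetic data and reads off the chosen penalty. The data, the choice of `cv=5`, and the name `lassocv_toy` are assumptions for this example; your own fit should use `X_train` and `y_train` from above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in data (an assumption for illustration, not the Hitters data)
X_toy, y_toy = make_regression(n_samples=200, n_features=8, n_informative=3,
                               noise=10.0, random_state=0)

# No alpha list is passed, so LassoCV builds its own grid of candidates
lassocv_toy = LassoCV(cv=5, max_iter=10000)
lassocv_toy.fit(X_toy, y_toy)

print("Chosen alpha:", lassocv_toy.alpha_)
print("Number of alphas tried:", len(lassocv_toy.alphas_))
print("Shape of the CV error path (alphas x folds):", lassocv_toy.mse_path_.shape)
```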

Now let's take a look at some of the coefficients. 

In [None]:
# Assumes your fitted LassoCV model from the previous cell is named `lassocv`.
# Lasso can set some coefficients to exactly zero; check whether any are here.
pd.Series(lassocv.coef_, index=X.columns)

&#9989; **<font color=red>Q:</font>** We've been repeating over and over that lasso gives us coefficients that are actually 0.  At least in my code, I'm not seeing any that are 0. What happened? Can I change something to get more 0 entries? 

*Your answer here*

In [None]:
# You might also want some code in here to try to figure it out



-----
### Congratulations, we're done!
Written by Dr. Liz Munch, Michigan State University

<a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.