# Linear Regression vs. LASSO

This notebook compares the performance of the ordinary least squares (OLS) linear model and the least absolute shrinkage and selection operator (LASSO) model. Let $\hat{y}$ be the predicted value. With two input variables $x_1$ and $x_2$, both models make the following linear assumption.

$$
\hat{y} = w_0 + w_1 x_1 + w_2 x_2
$$

Although both models assume a linear relation between the input and the output, they minimize different loss functions.

The linear model minimizes
$$
\sum \big(y-\hat{y} \big)^2.
$$

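LASSO minimizes
$$
\sum \big(y-\hat{y} \big)^2 + \alpha \big( |w_1|+|w_2| \big),
$$
where $\alpha \ge 0$ controls the strength of the penalty. With $\alpha = 0$ the penalty vanishes and the LASSO loss reduces to the OLS loss.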
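
To make the difference between the two objectives concrete, here is a minimal sketch that evaluates both losses for a fixed set of weights. The function names `ols_loss` and `lasso_loss`, the toy data, and the weights are assumptions made up for illustration; they do not come from the Titanic data used later.

```python
import numpy as np

def ols_loss(X, y, w0, w):
    """Sum of squared residuals for the prediction w0 + X @ w."""
    y_hat = w0 + X @ w
    return np.sum((y - y_hat) ** 2)

def lasso_loss(X, y, w0, w, alpha):
    """Sum of squared residuals plus alpha times the L1 norm of the slopes.

    The intercept w0 is not penalized, matching the formula in the text.
    """
    return ols_loss(X, y, w0, w) + alpha * np.sum(np.abs(w))

# Toy data (hypothetical): four observations of two inputs
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([3.0, 2.0, 4.0, 7.0])

# Arbitrary weights chosen only to evaluate the losses
w0, w = 0.5, np.array([1.0, -0.5])

print("OLS loss:            ", ols_loss(X, y, w0, w))
print("LASSO loss, alpha=0: ", lasso_loss(X, y, w0, w, alpha=0.0))
print("LASSO loss, alpha=2: ", lasso_loss(X, y, w0, w, alpha=2.0))
```

With `alpha=0.0` the two losses coincide, and a larger `alpha` adds a cost proportional to the size of the slopes. Note that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not numerically identical to the $\alpha$ in the formula above.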

### LASSO Demonstration

In [1]:
# alpha = 0 removes the L1 penalty, so LASSO should reproduce the OLS coefficients
alpha = 0

import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Silence warnings (scikit-learn may warn when Lasso is fitted with alpha=0)
warnings.filterwarnings('ignore')

# Fix the seed so the train/test split is reproducible
np.random.seed(12356)
df = pd.read_csv('titanic.csv')

df = df[df['Age'].notna()]

# Assign input variables
X = df.loc[:,['Fare','SibSp','Parch']]

# Assign target variable
y = df['Age']

# One-hot encode categorical inputs (has no effect when all selected inputs are numeric)
X = pd.get_dummies(X)

x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3)

model = LinearRegression()
model.fit(x_train, y_train)

coef1 = pd.DataFrame({'Variable':x_train.columns, 
                     'Coef':model.coef_,
                      'Model': 'Linear',
                      'Data': 'Original'
                    })

lm_org = r2_score(y_test, model.predict(x_test))
model = linear_model.Lasso(alpha=alpha)
model.fit(x_train, y_train)

coef2 = pd.DataFrame({'Variable':x_train.columns, 
                     'Coef':model.coef_,
                      'Model':'LASSO',
                      'Data': 'Original'
                    })

lasso_org = r2_score(y_test, model.predict(x_test))
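# Put the two coefficient tables side by side (columns become 0..7),
# then keep only the variable name (0), the linear coef (1), and the LASSO coef (5)
coef = pd.concat([coef1, coef2], ignore_index=True, axis=1)
coef = coef.drop([2, 3, 4, 6, 7], axis=1)
coef.columns = ['Variable','Linear Model', 'LASSO']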
coef.round(3)

|   | Variable | Linear Model | LASSO |
|---|----------|--------------|-------|
| 0 | Fare | 0.047 | 0.047 |
| 1 | SibSp | -4.356 | -4.356 |
| 2 | Parch | -2.395 | -2.395 |


### 1. Predicting Age in Titanic

We predict the age of Titanic passengers from `Fare`, `SibSp`, and `Parch` using the linear model and LASSO.

In [2]:
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings('ignore')
from sklearn.metrics import r2_score

np.random.seed(12356)
df = pd.read_csv('titanic.csv')

df = df[df['Age'].notna()]

df = df.dropna()

# Assign input variables
X = df.loc[:,['Fare','SibSp','Parch']]

# Assign target variable
y = df['Age']

In [3]:
# Standardize the inputs to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit_transform returns a NumPy array, so restore the column names afterwards
col_names = X.columns
X = pd.DataFrame(scaler.fit_transform(X), columns=col_names)
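
As a small illustration of what the standardization step computes, the sketch below reproduces `StandardScaler` by hand: each column has its mean subtracted and is divided by its population standard deviation. The array `X_demo` holds made-up values and is only an assumption for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for two numeric input columns
X_demo = np.array([[7.25, 1.0],
                   [71.28, 1.0],
                   [7.92, 0.0],
                   [53.10, 2.0]])

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same transformation via scikit-learn
z_sklearn = StandardScaler().fit_transform(X_demo)

print("Manual and StandardScaler results agree:", np.allclose(z_manual, z_sklearn))
print("Column means after scaling:", z_manual.mean(axis=0).round(6))
```

Standardizing matters for LASSO because the penalty acts on the size of each coefficient, and coefficient size depends on the units of the corresponding input.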

In [4]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.3)

#### Linear Regression

In [5]:
model1 = LinearRegression()
model1.fit(x_train, y_train)
lm_org = r2_score(y_test, model1.predict(x_test))
print('R2 of Linear Model:',lm_org)

R2 of Linear Model: -0.17259310776963477
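
A negative testing R-squared may look surprising. The sketch below, using the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$ and made-up numbers, shows that it simply means the predictions fit the test targets worse than always predicting their mean. The helper `r_squared` is a hypothetical name introduced for this example.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up target values
y_true = np.array([20.0, 30.0, 40.0, 50.0])

# Baseline: always predict the mean of the targets
mean_pred = np.full_like(y_true, y_true.mean())

# A poor model: predictions that run in the opposite direction
bad_pred = np.array([50.0, 40.0, 30.0, 20.0])

print("R2 of mean baseline:", r_squared(y_true, mean_pred))
print("R2 of poor model:   ", r_squared(y_true, bad_pred))
```

The baseline scores exactly zero, and any model whose squared error exceeds the baseline's scores below zero.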


#### LASSO

In [6]:
alpha = .1

model2 = linear_model.Lasso(alpha=alpha)
model2.fit(x_train, y_train)
lasso_org = r2_score(y_test, model2.predict(x_test))
print('R2 of LASSO:',lasso_org)

R2 of LASSO: -0.16678655419990585


#### Compare coefficients

In [7]:
coef1 = pd.DataFrame({'Variable':x_train.columns, 
                     'Coef':model1.coef_,
                      'Model': 'Linear',
                      'Data': 'Original'
                    })

coef2 = pd.DataFrame({'Variable':x_train.columns, 
                     'Coef':model2.coef_,
                      'Model':'LASSO',
                      'Data': 'Original'
                    })

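# Put the two coefficient tables side by side (columns become 0..7),
# then keep only the variable name (0), the linear coef (1), and the LASSO coef (5)
coef = pd.concat([coef1, coef2], ignore_index=True, axis=1)
coef = coef.drop([2, 3, 4, 6, 7], axis=1)
coef.columns = ['Variable','Linear Model', 'LASSO']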
coef.round(3)

|   | Variable | Linear Model | LASSO |
|---|----------|--------------|-------|
| 0 | Fare | 0.261 | 0.075 |
| 1 | SibSp | -1.445 | -1.31 |
| 2 | Parch | -6.153 | -6.002 |


### 2.  Practice

Download an NBA dataset at this [link](https://bryantstats.github.io/math460/data/nba_salary2.csv). Use the `Salary` variable as the target. Do not use the following variables as inputs: `Player`, `Salary`, `NBA_DraftNumber`, `NBA_Country`, `Tm`.

 - Split the data 80:20 for training and testing.
 - Calculate the testing R-squared of the linear model and show its coefficients.
 - Calculate the testing R-squared of the LASSO model with alpha = 1 and show its coefficients. Do any variables no longer have an effect in the LASSO model?
 - Change the value of alpha in the LASSO model and observe its R-squared. What value of alpha makes all coefficients zero?
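
As a starting point for the last question, here is a hedged sketch of sweeping alpha on synthetic data, not on the NBA dataset. The data, the alpha values, and the variable names are assumptions for illustration. The threshold `alpha_max` relies on scikit-learn's scaling of the LASSO objective, where the squared-error term is divided by $2n$; under that scaling, and with centered data, every coefficient is zero once alpha reaches $\max_j |x_j^\top y| / n$.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): only the first two of five inputs matter
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Center X and y, since Lasso fits an unpenalized intercept by default
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Smallest alpha at which all coefficients are zero (scikit-learn scaling)
alpha_max = np.max(np.abs(Xc.T @ yc)) / n

alphas = [0.01, 0.1, 1.0, 1.01 * alpha_max]
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X, y)
    # Count the coefficients that LASSO has not shrunk exactly to zero
    nonzero_counts.append(int(np.sum(model.coef_ != 0)))
    print(f"alpha={alpha:.3f}  nonzero coefficients={nonzero_counts[-1]}")
```

The count of nonzero coefficients should fall as alpha grows. The same loop can be applied to the standardized NBA inputs, where the alpha values worth trying will depend on the scale of `Salary`.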