# LASSO vs. Logistic Regression in Classification

[Notebook](lasso_classification_fa21.ipynb)



## 1. Titanic 
#### a. Logistic Regression

In [1]:
import pandas as pd
import numpy as np
np.random.seed(12356)
df = pd.read_csv('titanic.csv')

# Drop every row that has a missing value in any column
df = df.dropna()

# Assign input variables
X = df.loc[:,['Pclass','Sex','Age','Fare','Embarked','SibSp','Parch']]

# Assign target variable
y = df['Survived']

# Encode categorical variable
X = pd.get_dummies(X)
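
To make the encoding step concrete, here is a minimal sketch on a made-up two-row frame (the toy values are illustrative, not taken from `titanic.csv`). It shows that `pd.get_dummies` leaves numeric columns alone and replaces each text column with one indicator column per category, which is why both `Sex_female` and `Sex_male` show up among the coefficients later.

```python
import pandas as pd

# Toy frame for illustration only (hypothetical values)
toy = pd.DataFrame({'Sex': ['male', 'female'], 'Age': [22, 38]})

# Numeric columns are kept as they are; each category of a text column
# becomes its own indicator column
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
```

Because the two `Sex` indicators always sum to one, they carry the same information; passing `drop_first=True` to `pd.get_dummies` would keep only one of them.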

In [2]:
# Standardize Data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
  
# transform data
col_names = X.columns
X = scaler.fit_transform(X)
X = pd.DataFrame(X)
X.columns = col_names

from sklearn.model_selection import train_test_split

# Hold out 30% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression()
model1.fit(x_train, y_train)

# Accuracy on test data
print("Logistic Regression's Testing Accuracy: ", model1.score(x_test, y_test))

coef1 = pd.DataFrame({'Variable':x_train.columns, 
                     'Coef':model1.coef_.reshape(x_train.columns.shape[0],),
                      'Model': 'Logistic Regression',
                      'Data': 'Original'
                    })

Logistic Regression's Testing Accuracy:  0.7454545454545455


#### b. LASSO

In [3]:
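alpha = .01  # regularization strength of the L1 penalty

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import SGDClassifier
sns.set(style="white")

# l1_ratio=1 turns the elastic net penalty into a pure L1 (LASSO) penalty.
# loss='log_loss' needs scikit-learn >= 1.1; older versions call it 'log'.
model2 = SGDClassifier(loss='log_loss', penalty='elasticnet', alpha=alpha, l1_ratio=1)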
model2.fit(x_train, y_train)

print("LASSO's Testing Accuracy: ", model2.score(x_test, y_test))

coef2 = pd.DataFrame({'Variable':x_train.columns, 
                     'Coef':model2.coef_.reshape(x_train.columns.shape[0],),
                      'Model': 'LASSO',
                      'Data': 'Original'
                    })

LASSO's Testing Accuracy:  0.7454545454545455
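
For reference, the settings used here (logistic loss, elastic net penalty with `l1_ratio=1`) correspond, as far as the scikit-learn SGD documentation describes it, to minimizing an L1-penalized logistic loss. The notation below is introduced only for this sketch: $n$ training rows, labels recoded as $y_i \in \{-1, +1\}$, weights $w$, and intercept $b$.

$$
\min_{w,\,b}\ \frac{1}{n}\sum_{i=1}^{n}\log\left(1+e^{-y_i\,(w^\top x_i+b)}\right)+\alpha\,\lVert w\rVert_1
$$

A larger `alpha` puts more weight on $\lVert w\rVert_1$, which pushes more coefficients to exactly zero.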


#### c. Coefficients

In [4]:
coef = pd.concat([coef1, coef2], ignore_index=True, axis=1)
coef = coef.drop([2, 3, 4, 6, 7], axis=1)
coef.columns = ['Variable','Linear Model', 'LASSO']
coef

|   | Variable | Linear Model | LASSO |
|---|---|---|---|
| 0 | Pclass | -0.479559 | -0.474349 |
| 1 | Age | -0.567665 | -0.470222 |
| 2 | Fare | -0.027017 | 0.0 |
| 3 | SibSp | 0.134773 | 0.0 |
| 4 | Parch | -0.109295 | 0.0 |
| 5 | Sex_female | 0.715005 | 0.66234 |
| 6 | Sex_male | -0.715005 | -0.66234 |
| 7 | Embarked_C | 0.146361 | 0.128287 |
| 8 | Embarked_Q | 0.122903 | 0.0 |
| 9 | Embarked_S | -0.171922 | -0.235345 |
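
One way to read the table above is to list the variables whose LASSO coefficient is exactly zero. The sketch below rebuilds the table by hand from the printed values (so it runs without the fitted models) and filters on the `LASSO` column; the name `dropped` is introduced here for illustration.

```python
import pandas as pd

# Coefficients copied from the table above
coef = pd.DataFrame({
    'Variable': ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch',
                 'Sex_female', 'Sex_male',
                 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
    'Linear Model': [-0.479559, -0.567665, -0.027017, 0.134773, -0.109295,
                     0.715005, -0.715005, 0.146361, 0.122903, -0.171922],
    'LASSO': [-0.474349, -0.470222, 0.0, 0.0, 0.0,
              0.66234, -0.66234, 0.128287, 0.0, -0.235345],
})

# Variables that LASSO removed from the model (coefficient exactly zero)
dropped = coef.loc[coef['LASSO'] == 0, 'Variable'].tolist()
print(dropped)  # ['Fare', 'SibSp', 'Parch', 'Embarked_Q']
```

In this run LASSO zeroed out 4 of the 10 coefficients, all of them variables whose logistic regression coefficients were already small in magnitude.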


## 2. Practice

Use the breast cancer dataset ([link](https://bryantstats.github.io/math460/python/breast_cancer.csv)).

 - Split the data 80:20 for training and testing.
 - Calculate the testing accuracy of the linear model (logistic regression) and show its coefficients.
 - Calculate the testing accuracy of the LASSO model with alpha = 1 and show its coefficients. Do you observe any variables that no longer have an effect in the LASSO model?
 - Change the value of alpha in the LASSO model and observe how its accuracy changes. Give your comments.
 - Plot the coefficients of the linear model and the LASSO model.

