# Challenge: Vanilla vs Ridge vs Lasso logistic regression

## Dataset
The data come from the FBI's [crime rates in 2013](https://ucr.fbi.gov/crime-in-the-u.s/2013/crime-in-the-u.s.-2013/tables/table-8/table_8_offenses_known_to_law_enforcement_by_state_by_city_2013.xls/view) table (offenses known to law enforcement, by state and city). It covers all states, with 9281 entries in total.

First, load the data and do a little cleaning.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load data.
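# header=3 takes the column names from the fourth row of the sheet.
# (skipinitialspace is a read_csv option and is not accepted by recent read_excel.)
df = pd.read_excel('Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2013.xls',
                   header=3)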
df.drop(['State', 'City', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16'], axis=1, inplace=True)

# Cleaning data.
df.columns = ['population', 'violent_crime', 'murder', 'rape_1',
               'rape_2', 'robbery', 'assault', 'property_crime', 'burglary',
               'larceny', 'vehicle_theft', 'arson']

df[['rape_1', 'rape_2', 'arson']] = df[['rape_1', 'rape_2', 'arson']].fillna(0)
df['rape'] = df['rape_1'] + df['rape_2']

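# Drop the last 10 rows, then any remaining rows with missing values.
# The explicit copy avoids SettingWithCopyWarning on the later assignments.
df = df.iloc[:-10].copy()
df.dropna(inplace=True)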
df.drop(['rape_1', 'rape_2'], axis=1, inplace=True)

# Binarize assault as the outcome variable: 0 = no assaults reported, 1 = at least one.
df['assault'] = np.where(df['assault'] == 0, 0, 1)

# Creating training set and test set.
X = df.loc[:, ~(df.columns).isin(['assault'])]
Y = df['assault']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)


## Vanilla logistic regression (using statsmodels)

In [2]:
# Vanilla logistic regression.
X_vlr_train = X_train.copy()
X_vlr_train['intercept'] = 1
Y_vlr_train = pd.DataFrame(Y_train)

vlr = sm.Logit(Y_vlr_train, X_vlr_train)
result = vlr.fit(method='bfgs')
print(result.summary())

# Check accuracy
X_vlr_test = X_test.copy()
X_vlr_test['intercept'] = 1

pred_vlr = result.predict(X_vlr_test)
pred_y_vlr = np.where(pred_vlr < .5, 0, 1)

print('\n Accuracy by assault status')
table = pd.crosstab(pred_y_vlr, Y_test)
print(table)
print('\n Percentage accuracy')
print((table.iloc[0,0] + table.iloc[1,1]) / (table.sum().sum()))


         Current function value: 0.010751
         Iterations: 35
         Function evaluations: 54
         Gradient evaluations: 45
                           Logit Regression Results                           
Dep. Variable:                assault   No. Observations:                 6496
Model:                          Logit   Df Residuals:                     6486
Method:                           MLE   Df Model:                            9
Date:                Sun, 29 Jul 2018   Pseudo R-squ.:                  0.9751
Time:                        16:32:10   Log-Likelihood:                -69.840
converged:                      False   LL-Null:                       -2805.2
                                        LLR p-value:                     0.000
                     coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------
population         0.0003   8.71e-05      2.918      0.004    8.35e-

  return 1/(1+np.exp(-X))
  return np.sum(np.log(self.cdf(q*np.dot(X,params))))


## Ridge logistic regression (sklearn with L2 penalty)

In [3]:
# Ridge logistic regression.
rlr = LogisticRegression(penalty='l2')
rlr_fit = rlr.fit(X_train, Y_train)
print('\nTraining set accuracy: ', rlr.score(X_train, Y_train))
print('\nTest set accuracy: ', rlr.score(X_test, Y_test))

pred_y_test = rlr.predict(X_test)
print('\n Accuracy by assault status')
print(pd.crosstab(pred_y_test, Y_test))



Training set accuracy:  0.958743842364532

Test set accuracy:  0.9583482944344703

 Accuracy by assault status
assault    0     1
row_0             
0        325    22
1         94  2344
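
The crosstab above puts predictions in rows and true labels in columns, so the accuracy within each true class (recall) can be read off column by column. The short computation below uses only the four counts printed above; the variable names are illustrative.

```python
import numpy as np

# Counts copied from the ridge crosstab above:
# rows = predicted class (0, 1), columns = actual class (0, 1).
table = np.array([[325, 22],
                  [94, 2344]])

# Overall accuracy: correct predictions (diagonal) over all test rows.
accuracy = np.trace(table) / table.sum()

# Recall per actual class: correct predictions over each column total.
recall = np.diag(table) / table.sum(axis=0)

print('accuracy:', round(accuracy, 4))
print('recall for assault = 0:', round(recall[0], 3))
print('recall for assault = 1:', round(recall[1], 3))
```

On these counts the model recovers about 99% of the cities with assaults but only about 78% of the cities without, a gap that the single accuracy figure hides.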


## Lasso logistic regression (sklearn with L1 penalty)

In [4]:
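# Lasso logistic regression. The L1 penalty needs a solver that supports it,
# such as liblinear (lbfgs, the default since scikit-learn 0.22, does not).
llr = LogisticRegression(penalty='l1', solver='liblinear')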
llr_fit = llr.fit(X_train, Y_train)
print('\nTraining set accuracy: ', llr.score(X_train, Y_train))
print('\nTest set accuracy: ', llr.score(X_test, Y_test))

pred_y_test = llr.predict(X_test)
print('\n Accuracy by assault status')
print(pd.crosstab(pred_y_test, Y_test))



Training set accuracy:  0.8445197044334976

Test set accuracy:  0.8495511669658887

 Accuracy by assault status
assault    0     1
row_0             
1        419  2366
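
The crosstab above has a single row, which means the lasso model predicted 1 for every test city. Its test accuracy, 2366 / (419 + 2366) ≈ 0.8496, is therefore just the share of class 1 in the test set. One way to see what the L1 penalty does to a model is to count the coefficients it leaves non-zero. The sketch below does this on synthetic data from `make_classification`, not on the crime data, and the value `C=0.05` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 20 features, 3 informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# Same solver and same C for both penalties, so only the penalty differs.
ridge = LogisticRegression(penalty='l2', solver='liblinear', C=0.05).fit(X, y)
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.05).fit(X, y)

# L2 shrinks coefficients toward zero; L1 can set them exactly to zero.
n_nonzero_ridge = np.count_nonzero(ridge.coef_)
n_nonzero_lasso = np.count_nonzero(lasso.coef_)
print('non-zero coefficients, ridge:', n_nonzero_ridge)
print('non-zero coefficients, lasso:', n_nonzero_lasso)

# Majority-class share: the accuracy of always predicting the larger class.
majority_share = max(np.mean(y), 1 - np.mean(y))
print('majority-class baseline accuracy:', majority_share)
```

Comparing a model's accuracy against the majority-class baseline shows whether it has learned anything beyond the class balance.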


__Q: Do we need to define C when using ridge or lasso logistic regression? Is it the same as lambda? What is the best way to determine C?__
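
A partial answer, as far as scikit-learn's documented behavior goes: `C` does not have to be set, because `LogisticRegression` defaults to `C=1.0`. `C` is the inverse of the regularization strength, so it plays the role of roughly 1/lambda (smaller `C` means a stronger penalty), up to the scaling convention used for the loss. A common way to choose it is cross-validation over a grid of candidate values. The sketch below shows one such approach on synthetic data. The grid, the fold count, and the use of `StandardScaler` are assumptions made for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the crime features (an assumption).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Candidate values of C on a log scale, from strong to weak regularization.
candidate_Cs = np.logspace(-3, 3, 13)

# 5-fold cross-validation over the grid, with the default L2 penalty.
search = LogisticRegressionCV(Cs=candidate_Cs, cv=5, scoring='accuracy',
                              max_iter=1000)

# Scale inside the pipeline so the penalty treats all features alike.
model = make_pipeline(StandardScaler(), search)
model.fit(X_train, y_train)

best_C = float(np.ravel(search.C_)[0])
test_accuracy = model.score(X_test, y_test)
print('selected C:', best_C)
print('implied lambda (1 / C):', 1.0 / best_C)
print('test accuracy:', test_accuracy)
```

Because the penalty acts on coefficient size, it is sensitive to feature scale, which is why the sketch standardizes the features before searching over `C`.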