# Linear Models for Classification

# Exercise
Load and preprocess the adult data as before, including dummy encoding and scaling.
Learn a logistic regression model and visualize the coefficients.
Then grid-search the regularization parameter C.
Compare the coefficients of the best model with the coefficients of a model with more regularization.

In [4]:
import pandas as pd
df = pd.read_csv("data/adult.csv", index_col=0)
df.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


First, we can set aside income as our target variable and drop it from the dataframe.

We can also drop `education`, since `education-num` carries the same information as an integer code.

In [5]:
income = df.income

df = df.drop(['education', 'income'], axis=1)
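The redundancy between `education` and `education-num` can be checked rather than assumed. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the adult data) and tests whether each label maps to exactly one code and vice versa; the same two `groupby` calls could be run on the full dataframe before dropping the column.

```python
import pandas as pd

# Toy stand-in for the adult data; these rows are made up for illustration.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "11th", "Bachelors", "HS-grad"],
    "education-num": [13, 9, 7, 13, 9],
})

# Count distinct codes per label and distinct labels per code.
codes_per_label = toy.groupby("education")["education-num"].nunique()
labels_per_code = toy.groupby("education-num")["education"].nunique()

# If both maxima are 1, the two columns encode the same information.
is_one_to_one = bool(codes_per_label.max() == 1 and labels_per_code.max() == 1)
print(is_one_to_one)
```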

## Preprocessing

We'll one-hot encode the categorical features and then standardize the resulting columns:

In [7]:
### one hot encode data
data_one_hot = pd.get_dummies(df)
data_one_hot.head()

Unnamed: 0,age,education-num,capital-gain,capital-loss,hours-per-week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,13,2174,0,40,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,13,0,0,13,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,9,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
3,53,7,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
4,28,13,0,0,40,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [8]:
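### Scaling
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Split first so the scaler never sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(data_one_hot, income)

# Fit the scaler on the training data only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)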
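One caveat with scaling the whole training set up front: when `X_train_scaled` is later passed to cross-validation, each validation fold has already influenced the scaler's mean and variance. A possible alternative, sketched here on made-up data, is to put the preprocessing inside a pipeline so it is refit on every fold. The frame `toy`, its columns, and the target rule are assumptions chosen for illustration; this variant also scales only the numeric columns, which differs from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the adult data (hypothetical values).
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "age": rng.randint(18, 70, size=n),
    "hours-per-week": rng.randint(10, 60, size=n),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], size=n),
})
# Made-up target: older people who work longer hours earn more.
target = np.where(toy["age"] + toy["hours-per-week"] > 80, ">50K", "<=50K")

numeric = ["age", "hours-per-week"]
categorical = ["workclass"]

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
])
pipe = make_pipeline(preprocess, LogisticRegression())

# The whole pipeline is refit on the training part of each fold.
scores = cross_val_score(pipe, toy, target, cv=5)
print(scores.mean())
```

With `handle_unknown="ignore"`, a category that appears only in a validation fold is encoded as all zeros instead of raising an error.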

## Basic Logistic Regression

In [13]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# cross_val_score returns one accuracy value per fold, not a fitted model.
scores = cross_val_score(LogisticRegression(solver='lbfgs'), X_train_scaled, y_train, cv=10)
print(scores.mean())

0.8525803676070314


That is about 85% mean cross-validated accuracy with the default settings (`C=1`), a reasonable starting point.
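The exercise also asks for a look at the coefficients, which requires a fitted model rather than cross-validation scores. The following is a minimal sketch on synthetic data (the feature names `strong`, `weak`, and `noise` are invented for illustration); on the adult data, the column names of `data_one_hot` would presumably serve as the index instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features with made-up names.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["strong", "weak", "noise"])
# Made-up target driven mostly by "strong" and slightly by "weak".
signal = 2.0 * X["strong"] + 0.3 * X["weak"] + rng.normal(scale=0.5, size=300)
y = (signal > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# coef_ has shape (1, n_features) for a binary problem.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs)
```

Because the features are standardized, the coefficient magnitudes are roughly comparable across features. Calling `coefs.plot.barh()` would draw them as a bar chart, assuming matplotlib is installed.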

## Grid Searched Logistic Regression

In [17]:
import numpy as np
from sklearn.model_selection import GridSearchCV

param_grid = {'C': np.logspace(-3, 3, 7)}

grid_lr = GridSearchCV(LogisticRegression(solver='lbfgs'), param_grid,
                       cv=10, return_train_score=True)

grid_lr.fit(X_train_scaled, y_train)

print(f'best params: {grid_lr.best_params_}')
print(f'best score:  {grid_lr.best_score_}')

best params: {'C': 0.1}
best score:  0.8526208026208026
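The last step of the exercise, comparing the best model's coefficients with those of a more heavily regularized model, might look like the sketch below. It runs on synthetic data from `make_classification` rather than the adult data, and it takes `C=0.1` (the value found above) and `C=0.001` (the smallest value in the grid) as the two settings to compare.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training data.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# C is the inverse regularization strength: smaller C means a stronger L2 penalty.
weak_reg = LogisticRegression(C=0.1).fit(X_scaled, y)
strong_reg = LogisticRegression(C=0.001).fit(X_scaled, y)

# The L2 norm summarizes how far the coefficients are from zero overall.
norm_weak = np.linalg.norm(weak_reg.coef_)
norm_strong = np.linalg.norm(strong_reg.coef_)
print(norm_weak, norm_strong)
```

The stronger penalty shrinks the coefficients toward zero, so the second norm comes out smaller. On the adult data, `grid_lr.best_estimator_.coef_` would take the place of `weak_reg.coef_`.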
