# Logistic Regression Implementation

As an example of a binary classification problem, consider a dataset describing credit card customers who did or did not default. We want to do the following:

- Basic EDA: compare the default and non-default groups on each individual feature (boxplots are a good starting point)
- Encode the categorical variables using `pd.get_dummies`
- Split the data into training and test sets
- Fit a `LogisticRegression` to model the likelihood of default based on the `balance` column.
- Cross-validate the model using the values $[0.1, 1, 5, 10, 100]$ for the `C` parameter.
- Incorporate `PolynomialFeatures` into the model and rerun. How did the performance change?
- Repeat for the `student` column.
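
The cross-validation and `PolynomialFeatures` steps in the list above might be sketched roughly as follows. Since `data/default.csv` is not reproduced here, the sketch runs on a small synthetic stand-in for the `balance` column, with made-up numbers. The helper name `make_search`, the added `StandardScaler` step, the `max_iter` value, and the choice of recall as the scoring metric are all assumptions for illustration, not part of the original exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real data (assumption: NOT the default.csv values).
rng = np.random.default_rng(0)
balance = rng.uniform(0, 2500, size=600)
# Higher balances default more often; these coefficients are invented.
p_default = 1 / (1 + np.exp(-(balance - 1800) / 200))
y = (rng.uniform(size=600) < p_default).astype(int)
X = balance.reshape(-1, 1)

# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)


def make_search(degree):
    """Grid-search C for a logistic regression on polynomial features."""
    pipe = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),  # assumption: scaling added to help the solver converge
        LogisticRegression(max_iter=1000),
    )
    # make_pipeline names each step after its lowercased class name.
    param_grid = {"logisticregression__C": [0.1, 1, 5, 10, 100]}
    return GridSearchCV(pipe, param_grid, cv=5, scoring="recall")


results = {}
for degree in (1, 2):
    search = make_search(degree).fit(X_train, y_train)
    # search.score uses the same scorer as the grid search (recall here).
    results[degree] = (search.best_params_, search.score(X_test, y_test))
    print(degree, results[degree])
```

With `degree=1` the pipeline reduces to a plain logistic regression on `balance`, so comparing the two entries in `results` gives one way to answer the "how did the performance change?" question. Swapping `X` for the `student_Yes` column would cover the last item in the list.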

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

In [2]:
df = pd.read_csv('data/default.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
default    10000 non-null object
student    10000 non-null object
balance    10000 non-null float64
income     10000 non-null float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB


In [4]:
df = pd.get_dummies(df, drop_first=True)

In [5]:
df.head()

       balance        income  default_Yes  student_Yes
0   729.526495  44361.625074            0            0
1   817.180407  12106.134700            0            1
2  1073.549164  31767.138947            0            0
3   529.250605  35704.493935            0            0
4   785.655883  38463.495879            0            0


In [6]:
df.columns

Index(['balance', 'income', 'default_Yes', 'student_Yes'], dtype='object')

In [13]:
X = df.drop('default_Yes', axis=1)
y = df.default_Yes
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [8]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [9]:
clf.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0], dtype=uint8)

In [10]:
clf.score(X_test, y_test)

0.9708

In [11]:
pred = clf.predict(X_test)

In [15]:
from sklearn.metrics import accuracy_score, recall_score, classification_report

In [14]:
recall_score(y_test, pred)  # signature is recall_score(y_true, y_pred)

0.0

In [16]:
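# scikit-learn expects (y_true, y_pred). The report shown below was produced with the
# arguments reversed, so its precision and recall columns are swapped and its
# support column counts predicted rather than true labels.
print(classification_report(y_test, pred))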

             precision    recall  f1-score   support

          0       1.00      0.97      0.98      2498
          1       0.00      0.00      0.00         2

avg / total       1.00      0.97      0.98      2500



In [17]:
df.default_Yes.value_counts()

0    9667
1     333
Name: default_Yes, dtype: int64
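
These class counts help explain why an accuracy near 0.97 can coexist with a recall of 0.0. As a quick check using only the counts printed above, here is what a trivial model that always predicts "no default" would score on the full dataset. The test-split proportions may differ slightly, so treat this as a rough baseline rather than an exact comparison.

```python
# Class counts taken from the value_counts() output above.
n_no_default = 9667
n_default = 333
total = n_no_default + n_default

# Accuracy of a trivial model that predicts 0 (no default) for every customer.
baseline_accuracy = n_no_default / total
# Recall on the default class for that same model: it never flags a default.
baseline_recall = 0 / n_default

print(f"baseline accuracy: {baseline_accuracy:.4f}")  # 0.9667
print(f"baseline recall:   {baseline_recall:.1f}")    # 0.0
```

Only about 3.3% of customers default, so accuracy alone says little here. Recall and precision on the default class are more informative for judging the fitted model.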