## Logistic Regression

- What is Logistic Regression?
- How do we do Logistic Regression with Python? statsmodels + scikit-learn
- Validate Data Splits
- Analyzing + Evaluating our models

What is logistic regression?

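- a linear model (like OLS) whose output is passed through the **logistic** (sigmoid) function, the inverse of the **logit** function
- the **logistic** function maps any real number to a number between 0 and 1
- the output is interpreted as the probability of an observation being in the positive class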
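
To make the relationship between the linear score and the probability concrete, here is a minimal sketch using only the standard library. The coefficients `b0` and `b1` are made-up values for illustration, not fitted to any data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds: the inverse of the logistic function, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for a one-feature model: z = b0 + b1 * x
b0, b1 = -1.0, 0.5

for x in (-4, 0, 2, 10):
    z = b0 + b1 * x   # linear part, same form as an OLS prediction
    p = sigmoid(z)    # squashed into a probability
    print(f"x={x:>3}  linear score={z:+.2f}  P(positive)={p:.3f}")
```

A linear score of 0 corresponds to a probability of exactly 0.5, which is why 0.5 is the natural default cutoff between the two classes.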


- Pros:
    - fast to train, very fast to predict
    - we get probabilities of being in the positive class
    - more interpretable than some other classification models
    
    
- Cons:
    - less interpretable than some other classification models
    - assumes the X predictors are not highly correlated with one another (little multicollinearity)
    - multi-class classification is more complicated, but doable (**one-vs-rest**)


- Overall Great Baseline, as a first pass
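
As a rough illustration of the **one-vs-rest** idea mentioned in the cons, the sketch below fits one binary logistic regression per class. It uses scikit-learn's bundled iris data (three classes) rather than the Titanic data, purely as an assumption so the example is self-contained; `max_iter=1000` is an arbitrary choice to give the solver room to converge.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three-class toy dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)

# One binary "this class vs. everything else" model is fit per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

proba = ovr.predict_proba(X)

print("binary models fit:", len(ovr.estimators_))
print("predict_proba shape:", proba.shape)  # one column per class
```

Each row of `proba` holds one score per class, and the predicted class is the one with the highest score.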

## Simple Example

In [10]:
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

from acquire import get_titanic_data
from prepare import prep_titanic

## Mini Exercise

1. Load the `titanic dataset` that you've put together from previous lessons.
2. Split your data into training and test.
3. Fit a logistic regression model on your training data using sklearn's
   linear_model.LogisticRegression class. Use fare and pclass as the
   predictors.
4. Use the model's `.predict` method. What is the output?
5. Use the model's `.predict_proba` method. What is the output? Why do you
   think it is shaped like this?
6. Evaluate your model's predictions on the test data set. How accurate
   is the model? How does changing the threshold affect this?

In [36]:
df = get_titanic_data()
df.head()

Unnamed: 0,passenger_id,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,deck,embark_town,alone
0,0,0,3,male,22.0,1,0,7.25,S,Third,,Southampton,0
1,1,1,1,female,38.0,1,0,71.2833,C,First,C,Cherbourg,0
2,2,1,3,female,26.0,0,0,7.925,S,Third,,Southampton,1
3,3,1,1,female,35.0,1,0,53.1,S,First,C,Southampton,0
4,4,0,3,male,35.0,0,0,8.05,S,Third,,Southampton,1


In [37]:
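# Drops every row with a missing value in ANY column (including deck),
# so only a fraction of the passengers remain for modeling.
df.dropna(inplace=True)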

In [49]:
X = df[['pclass', 'fare']]
y = df['survived']  # 1-D Series; sklearn expects a 1-D target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 123)

X_train.head()

Unnamed: 0,pclass,fare
123,2,13.0
689,1,211.3375
174,1,30.6958
88,1,263.0
712,1,52.0


In [50]:
logit = LogisticRegression(C=1, class_weight={1:2}, random_state = 123, solver='saga')

In [51]:
logit.fit(X_train, y_train)

LogisticRegression(C=1, class_weight={1: 2}, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=123, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

In [52]:
print('Coefficient: \n', logit.coef_)
print('Intercept: \n', logit.intercept_)

Coefficient: 
 [[0.0132979 0.01546  ]]
Intercept: 
 [0.00673706]
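
One way to read these numbers is to plug them into the logistic function by hand. The sketch below copies the coefficients and intercept printed above and assumes they follow the column order of `X` (`pclass`, then `fare`). The helper `predict_probability` is a hypothetical name introduced for illustration.

```python
import math

# Values copied from the printed output; order follows X = df[['pclass', 'fare']]
coef_pclass, coef_fare = 0.0132979, 0.01546
intercept = 0.00673706

def predict_probability(pclass, fare):
    """P(survived = 1) for one passenger, computed from the fitted parameters."""
    z = intercept + coef_pclass * pclass + coef_fare * fare  # log-odds
    return 1.0 / (1.0 + math.exp(-z))                        # logistic function

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratio_fare = math.exp(coef_fare)

p = predict_probability(pclass=1, fare=52.0)
print(f"odds ratio per unit of fare: {odds_ratio_fare:.4f}")
print(f"P(survived) for pclass=1, fare=52.0: {p:.3f}")
```

Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are usually easier to explain than the raw values.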


In [53]:
y_pred = logit.predict(X_train)

In [54]:
y_pred_proba = logit.predict_proba(X_train)

In [55]:
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
     .format(logit.score(X_train, y_train)))

Accuracy of Logistic Regression classifier on training set: 0.64


In [56]:
print(confusion_matrix(y_train, y_pred))

[[ 0 46]
 [ 0 81]]


In [57]:
print(classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        46
           1       0.64      1.00      0.78        81

    accuracy                           0.64       127
   macro avg       0.32      0.50      0.39       127
weighted avg       0.41      0.64      0.50       127
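
The numbers in the report can be reproduced from the confusion matrix by hand. The sketch below copies the four counts from the matrix printed earlier and assumes sklearn's layout (rows are actual classes, columns are predicted classes).

```python
# Counts copied from the confusion matrix: [[ 0 46], [ 0 81]]
tn, fp = 0, 46   # actual 0: predicted 0, predicted 1
fn, tp = 0, 81   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # of everything predicted 1, how much really was 1
recall_1 = tp / (tp + fn)      # of everything really 1, how much we caught
recall_0 = tn / (tn + fp)      # of everything really 0, how much we caught

print(f"accuracy:            {accuracy:.2f}")
print(f"precision (class 1): {precision_1:.2f}")
print(f"recall (class 1):    {recall_1:.2f}")
print(f"recall (class 0):    {recall_0:.2f}")
```

The left column of the matrix is all zeros, so this model predicts class 1 for every training row. Its accuracy is therefore just the share of survivors in the training data.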



In [58]:
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
     .format(logit.score(X_test, y_test)))

Accuracy of Logistic Regression classifier on test set: 0.76


In [67]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = sns.load_dataset('titanic')[['fare', 'pclass', 'survived']]

train, test = train_test_split(df, random_state=123, train_size=.8)

X = train[['pclass', 'fare']]
y = train['survived']  # 1-D Series; sklearn expects a 1-D target

model = LogisticRegression(random_state=123).fit(X, y)

In [68]:
model.predict(X)

array([1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,

In [69]:
model.predict_proba(X)

array([[0.38105535, 0.61894465],
       [0.73684755, 0.26315245],
       [0.7373293 , 0.2626707 ],
       ...,
       [0.73668683, 0.26331317],
       [0.73730638, 0.26269362],
       [0.73684755, 0.26315245]])
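
The two columns are the probabilities of class 0 and class 1, in the order given by `model.classes_`, so each row sums to 1. The sketch below checks this on synthetic data; the features and the rule that generates `y` are assumptions made only so the example runs without the Titanic files.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(123)

# Synthetic stand-in data: two numeric features and a noisy binary target
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)

print("column order:", model.classes_)
print("shape:", proba.shape)  # (n_rows, n_classes)
print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))

# .predict picks the class with the larger probability in each row
manual = model.classes_[proba.argmax(axis=1)]
print("matches .predict:", np.array_equal(manual, model.predict(X)))
```

Selecting `[:, 1]` keeps only the positive-class probability, which is the column to compare against a custom threshold.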

In [70]:
train['yhat'] = model.predict(X)
train['p_survived'] = model.predict_proba(X)[:, 1]

In [75]:
model.score(X, y)

0.672752808988764

In [78]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [79]:
accuracy_score(train.survived, train.yhat)

0.672752808988764

In [80]:
# average=None returns one precision per class: class 0, then class 1
precision_score(train.survived, train.yhat, average=None)

array([0.68498168, 0.63253012])

In [81]:
recall_score(train.survived, train.yhat)

0.37906137184115524

In [83]:
t = .7
train['yhat'] = train.p_survived > t

accuracy_score(train.survived, train.yhat), precision_score(train.survived, train.yhat), recall_score(train.survived, train.yhat)

(0.6264044943820225, 0.6896551724137931, 0.07220216606498195)

In [84]:
t = .25
train['yhat'] = train.p_survived > t

accuracy_score(train.survived, train.yhat), precision_score(train.survived, train.yhat), recall_score(train.survived, train.yhat)

(0.3890449438202247, 0.3890449438202247, 1.0)
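
The two cells above show the usual trade-off: a high threshold favors precision, a low one favors recall. A small sketch of sweeping several thresholds in one loop follows. The labels and probabilities are made up for illustration and are not taken from the Titanic model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up actual labels and predicted positive-class probabilities
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
p_positive = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.65, 0.55, 0.90])

results = {}
for t in (0.25, 0.5, 0.7):
    # Predict the positive class only when the probability exceeds the threshold
    y_hat = (p_positive > t).astype(int)
    results[t] = {
        "accuracy": accuracy_score(y_true, y_hat),
        "precision": precision_score(y_true, y_hat),
        "recall": recall_score(y_true, y_hat),
    }
    print(t, results[t])
```

Which threshold is best depends on whether false positives or false negatives are more costly for the problem at hand.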

## More Complicated Example

- A **validate** data split lets us compare models, tweak hyperparameters, and experiment with thresholds without leaking information from the test split
- Train: fit models -- majority of our data
- Validate: compare models, choose hyperparameters, thresholds -- ~20% of our data
- Test: to estimate our *out-of-sample error* -- ~20% of our data

In [92]:
df = sns.load_dataset('titanic')[['fare', 'sex', 'pclass', 'survived']]
df.head()

Unnamed: 0,fare,sex,pclass,survived
0,7.25,male,3,0
1,71.2833,female,1,1
2,7.925,female,3,1
3,53.1,female,1,1
4,8.05,male,3,0


In [93]:
train, test = train_test_split(df, random_state=123, train_size=.86)
train, validate = train_test_split(train, random_state=123, train_size=.83)

print('    test: %d rows x %d columns' % test.shape)
print('   train: %d rows x %d columns' % train.shape)
print('validate: %d rows x %d columns' % validate.shape)

    test: 125 rows x 4 columns
   train: 635 rows x 4 columns
validate: 131 rows x 4 columns
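
The row counts follow from applying the two `train_size` fractions in sequence. The sketch below repeats the same two calls on a plain array of 891 row numbers, an assumed stand-in for the DataFrame, and reports each split's share of the whole.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(891)  # stand-in for the 891-row Titanic DataFrame

# First cut: hold out the test split
train_and_validate, test = train_test_split(rows, random_state=123, train_size=.86)
# Second cut: carve the validate split out of what remains
train, validate = train_test_split(train_and_validate, random_state=123, train_size=.83)

for name, part in [("train", train), ("validate", validate), ("test", test)]:
    print(f"{name:>8}: {len(part)} rows ({len(part) / len(rows):.0%} of all rows)")
```

With these fractions the validate and test splits each end up at roughly 14 to 15 percent of the data, and the three splits do not overlap.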
