### Guided Practice: Logit Function and Odds

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Statsmodels logistic regression is sm.Logit
import statsmodels.api as sm

In [None]:
# Read in the data

df = pd.read_csv('../../assets/dataset/collegeadmissions.csv')

In [None]:
# Quick peek at the first three rows
df.head(3)

In [None]:
# Check the dimensions of your dataset
df.shape

In [None]:
# Check for datatypes - any objects/strings?
df.dtypes.value_counts()

In [None]:
# Always check for missing values
df.isnull().sum()

In [None]:
# Dummify the rank categorical variable
# dtype=int keeps the dummies numeric (recent pandas versions default to bool)
df = df.join(pd.get_dummies(df['rank'], prefix="rank", dtype=int))

In [None]:
df.head()

In [None]:
# Set our features/predictors to X
# rank_4 is left out so it serves as the baseline category
X = df[['gre', 'gpa', 'rank_1', 'rank_2', 'rank_3']]

# Add the intercept as recommended by statsmodels
# (An intercept is not included by default and should be added by the user.) 
X = sm.add_constant(X)

# Set our target variable to y
y = df['admit']

# Call your Logit function and fit the model
# Note: Order of inputs is important here:
# First y (dependent variable, target variable, endog) then X (features, exog)
lr = sm.Logit(y, X).fit()

# Output your summary of results from your model
lr.summary()

In [None]:
# Use "lr.params" to output just your model coefficients
lr.params

In [None]:
# You can convert those coefficients (log-odds) into odds ratios using np.exp()
np.exp(lr.params)
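
As a reminder of how probability, odds, and the logit relate, here is a minimal sketch. The helper names (`odds`, `logit`, `inverse_logit`) and the coefficient value `b = 0.5` are made up for illustration and are not taken from the fitted model.

```python
import numpy as np

def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)

def logit(p):
    """Log-odds: the quantity logistic regression models as linear in X."""
    return np.log(odds(p))

def inverse_logit(z):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + np.exp(-z))

p = 0.8
print(odds(p))                  # about 4, i.e. four-to-one in favor
print(logit(p))                 # about 1.386
print(inverse_logit(logit(p)))  # recovers 0.8 up to floating-point error

# Adding a coefficient b to the log-odds multiplies the odds by exp(b).
# b = 0.5 is a hypothetical value chosen for this example.
b = 0.5
shifted = inverse_logit(logit(p) + b)
print(odds(shifted) / odds(p))  # about exp(0.5), roughly 1.6487
```

This is why `np.exp` of a coefficient reads as an odds ratio: the multiplicative change in the odds for a one-unit increase in that feature, holding the others fixed.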

In [None]:
from sklearn.metrics import accuracy_score

# Predicted probabilities of admission, converted to classes at a 0.5 cutoff
predicted = lr.predict(X)
threshold = 0.5
predicted_classes = (predicted > threshold).astype(int)
accuracy_score(y, predicted_classes)

In [None]:
# Uncomment to see how a lower threshold changes the accuracy
#predicted = lr.predict(X)
#threshold = 0.3
#predicted_classes = (predicted > threshold).astype(int)
#accuracy_score(y, predicted_classes)

Below is some code to walk through ROC curves, which are built from the confusion-matrix rates (true positive rate and false positive rate). It'll be useful for working through the Titanic problem.

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

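The first ROC curve below is traced out by sweeping the classification threshold over the predicted probabilities. At a very high threshold almost nothing is predicted positive, so both the false positive rate (x-axis) and the true positive rate (y-axis) are near 0; as the threshold drops, both rise toward 1.

The second chart does not vary the threshold. It uses the hard class predictions from the 0.5 cutoff, so it shows the single resulting (FPR, TPR) point joined to (0, 0) and (1, 1).

The first chart is more useful for comparing models and deciding where the decision threshold should sit for the data. The second is a simplification of the first, in case the idea of thresholds is confusing.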
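
To make the role of the threshold concrete, here is a minimal sketch that computes one (FPR, TPR) point per threshold from confusion-matrix counts. The labels and probabilities are made up for illustration, and `y_true`, `y_prob`, and `rates_at` are hypothetical names that do not refer to the admissions data.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.45, 0.4, 0.6, 0.55, 0.7, 0.8, 0.9])

def rates_at(threshold):
    """Return (FPR, TPR) when predicting 1 for probabilities above threshold."""
    y_pred = (y_prob > threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    return fp / (fp + tn), tp / (tp + fn)

# Each threshold contributes one point to the ROC curve
for t in [0.95, 0.5, 0.3, 0.0]:
    fpr, tpr = rates_at(t)
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold moves the point from (0, 0) toward (1, 1); `roc_curve` performs the same sweep over every distinct predicted probability.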

In [None]:
# Compute the curve once; roc_curve returns (fpr, tpr, thresholds)
fpr, tpr, thresholds = roc_curve(df['admit'], predicted)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC curve');

In [None]:
# Hard 0/1 predictions give a single interior point on the curve
fpr, tpr, thresholds = roc_curve(df['admit'], predicted_classes)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC curve sans threshold');

Finally, you can use the `roc_auc_score` function to calculate the area under these curves (AUC).

In [None]:
# AUC of the single-threshold curve; pass `predicted` instead for the AUC of the full ROC curve
roc_auc_score(df['admit'], predicted_classes)
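
The AUC depends on whether probabilities or hard class labels are passed to `roc_auc_score`. The sketch below uses made-up labels and probabilities (not the admissions data) to show the difference; `y_true`, `y_prob`, and `y_class` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.45, 0.4, 0.6, 0.55, 0.7, 0.8, 0.9])
y_class = (y_prob > 0.5).astype(int)  # hard predictions at a 0.5 cutoff

auc_from_probabilities = roc_auc_score(y_true, y_prob)
auc_from_classes = roc_auc_score(y_true, y_class)

print(auc_from_probabilities)  # 0.88: uses the full ranking of the scores
print(auc_from_classes)        # 0.8: only one threshold, so ranking detail is lost
```

With hard classes the AUC reduces to the average of the TPR and the true negative rate at that one cutoff, so the probability-based value is usually the one to report when comparing models.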

### Note: sklearn also has logistic regression:
```
from sklearn.linear_model import LogisticRegression
lm = LogisticRegression()
lm.fit(X, y)
```
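
One difference to keep in mind: sklearn's `LogisticRegression` fits an intercept on its own and applies an L2 penalty by default (`C=1.0`), so its coefficients will generally not match an unpenalized statsmodels fit. The sketch below illustrates the shrinkage on synthetic data; `X_demo`, `y_demo`, and the generating coefficients (2, -1) are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data drawn from a known logistic model (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
p_demo = 1 / (1 + np.exp(-(2 * X_demo[:, 0] - 1 * X_demo[:, 1])))
y_demo = (rng.uniform(size=200) < p_demo).astype(int)

# Default settings: L2 penalty with C=1.0
default_fit = LogisticRegression().fit(X_demo, y_demo)
# A very large C makes the penalty negligible
weak_penalty = LogisticRegression(C=1e6).fit(X_demo, y_demo)

print(default_fit.coef_)   # shrunk toward zero by the penalty
print(weak_penalty.coef_)  # closer to an unpenalized fit
```

Smaller values of `C` mean stronger regularization, which is the knob tuned in the grid search for the Titanic problem.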

### Titanic Problem

**Goals**

1. Spend a few minutes determining which data would be most important to use in the prediction problem. You may need to create new features based on the data available. Consider using a feature selection aid in sklearn. At a minimum, identify one or two strong features that would be useful to include in the model.
2. Spend 1-2 minutes considering which _metric_ makes the most sense to optimize. Accuracy? FPR or TPR? AUC? Given the business problem (understanding survival rate aboard the Titanic), why should you use this metric?
3. Build a tuned Logistic Regression model. Be prepared to explain your design (including regularization), metric, and feature set in predicting survival using the tools necessary (such as a fit chart).

In [None]:
titanic = pd.read_csv('../../assets/dataset/titanic.csv')

In [None]:
titanic.head(3)

In [None]:
# Check your data types
titanic.dtypes.value_counts()

In [None]:
# Check for missing values
# Filtered to just output columns with > 0 missing values
titanic.isnull().sum()[titanic.isnull().sum() != 0]

In [None]:
# Reset the index to be the PassengerID 
titanic.set_index('PassengerId', inplace=True)

# Dummify Pclass (Passenger Class) variable
titanic = titanic.join(pd.get_dummies(titanic.Pclass))


# Encode Sex as a binary indicator (1 = female, 0 = male)
titanic['is_female'] = titanic.Sex.apply(lambda x: 1 if x == 'female' else 0)

In [None]:
titanic.Survived.value_counts()

In [None]:
# Two overlaid histograms (one for Survived=0 and one for Survived=1),
# with Age on the x-axis and the count on the y-axis
titanic.groupby('Survived').Age.hist(grid=False, edgecolor='#000000');

In [None]:
titanic.tail(3)

In [None]:
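# Fill missing ages with the mean age of passengers of the same Sex and Pclass
titanic['Age'] = titanic.groupby(["Sex", 'Pclass']).Age.transform(lambda x: x.fillna(x.mean()))
# Binary indicators for travelling with parents/children or siblings/spouses
titanic['had_parents'] = titanic.Parch.apply(lambda x: 1 if x > 0 else 0)
titanic['had_siblings'] = titanic.SibSp.apply(lambda x: 1 if x > 0 else 0)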

In [None]:
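# grid_search and cross_validation were replaced by sklearn.model_selection
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import LogisticRegression

# 'is_female' is the indicator created above; 1 and 2 are the Pclass dummy columns
feature_set = titanic[['is_female', 1, 2, 'Fare', 'Age', 'had_parents', 'had_siblings']].copy()
# Recent scikit-learn versions require all column names to be strings
feature_set.columns = feature_set.columns.astype(str)

gs = GridSearchCV(
    # max_iter is raised here to reduce convergence warnings on unscaled features
    estimator=LogisticRegression(max_iter=1000),
    param_grid={'C': [10**-i for i in range(-5, 5)], 'class_weight': [None, 'balanced']},
    cv=KFold(n_splits=10),
    scoring='roc_auc')

gs.fit(feature_set, titanic.Survived)
# Mean cross-validated AUC for each parameter combination
pd.DataFrame(gs.cv_results_)[['params', 'mean_test_score']]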

In [None]:
gs.best_estimator_