# JUST CLIMB ON THE DOOR, LEO!

![leo](https://images2.minutemediacdn.com/image/upload/c_crop,h_347,w_620,x_0,y_36/f_auto,q_auto,w_1100/v1555428149/shape/mentalfloss/titanic_large.jpg)

In this activity, we will use actual data from the ill-fated Titanic voyage. You are provided a file, `titanic.csv`, that contains information about the passengers on the ship, including whether or not each passenger survived the sinking of the vessel.

As you can probably imagine, several factors were critical in determining survival, such as gender, age and class. In this notebook, we will build a logistic regression classifier to see how effectively we can predict who survived the disaster.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

Last time around, we wrote our own implementation of gradient descent when performing linear regression. In practice, it's more common (and preferable!) to use a highly optimized machine learning library that already implements the most common algorithms, so that we can focus on the problem to be solved. In this notebook, we'll use the `scikit-learn` library.

In [None]:
# sklearn is the name of the package; we're importing some specific functions and classes that 
# we'll use further below.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

In [None]:
# read in the data
data = pd.read_csv("titanic.csv")

# take a look at the data 
data.tail()

In [None]:
# our targets
y = data["Survived"]

# recoding the gender column to be numeric (unfortunately, the historic dataset uses a limited, 
# binary notion of gender identity)
data['Recoded_Gender'] = data['Gender'].apply(lambda g: 1 if g == 'female' else 0)
# the apply() method is a good one to be familiar with, see documentation -->
#   https://pandas.pydata.org/docs/reference/api/pandas.Series.apply.html
# talk to us if you have questions!

# build the feature matrix -- we'll only use three features in building our model
# confused about the double brackets? talk to us!
X = data[['Fare', 'Recoded_Gender', 'Age']]
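
The two pandas idioms used in the cell above can be tried in isolation on a small made-up frame. This is only a sketch: the `toy` DataFrame and all of its values are invented for illustration and are not taken from `titanic.csv`.

```python
import pandas as pd

# hypothetical toy data, NOT from titanic.csv
toy = pd.DataFrame({
    "Gender": ["female", "male", "male", "female"],
    "Fare": [71.3, 7.25, 8.05, 53.1],
    "Age": [38, 22, 35, 35],
})

# apply() calls the function once per element and returns a new Series
recoded = toy["Gender"].apply(lambda g: 1 if g == "female" else 0)
print(recoded.tolist())

# single brackets select one column and return a Series (1-D)
fare_series = toy["Fare"]

# double brackets pass a *list* of column names and return a DataFrame (2-D),
# even when the list contains only one name
fare_frame = toy[["Fare"]]
two_cols = toy[["Fare", "Age"]]

print(type(fare_series).__name__, fare_series.shape)
print(type(fare_frame).__name__, fare_frame.shape)
print(type(two_cols).__name__, two_cols.shape)
```

scikit-learn estimators expect a 2-D feature matrix, which is why the feature selection uses the double-bracket form.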

It's time for some additional data preprocessing: we're going to perform a train-test split (to get an unbiased estimate of model performance, as discussed in the videos) and then rescale the features (to make it easier for the optimizer to find the best-fit parameters). Note that we perform the split _first_, before doing the scaling. That way, we ensure that we don't leak _any_ information from our test fold into our training fold, including the means and standard deviations of our various features.
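
To make the leakage point concrete, here is a minimal sketch on synthetic data (the array `X_demo` and the `random_state` values are made up for illustration). It suggests that a scaler fitted on the training rows alone stores different statistics than one fitted on all rows; that difference is the information that would leak from the test fold.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))  # synthetic features

X_tr, X_te = train_test_split(X_demo, random_state=0)

# correct: statistics come from the training rows only
scaler_train_only = StandardScaler().fit(X_tr)

# leaky: statistics are computed from training AND test rows
scaler_leaky = StandardScaler().fit(X_demo)

print("train-only means:", scaler_train_only.mean_)
print("all-rows means:  ", scaler_leaky.mean_)

# with train-only scaling, the training columns have mean 0 exactly (up to
# floating point), while the test columns are only approximately centered
X_tr_scaled = scaler_train_only.transform(X_tr)
X_te_scaled = scaler_train_only.transform(X_te)
print("scaled train means:", X_tr_scaled.mean(axis=0))
print("scaled test means: ", X_te_scaled.mean(axis=0))
```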

In [None]:
# a simple usage of the train_test_split function, to divide our data into a
# training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Answer the following questions based on the documentation for this function:
#   https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
#   - Does the data get randomly shuffled before splitting?
#   - What fraction of the data becomes training data and what fraction becomes test data?
# Finally: modify the function call above so that you get an 80-20 train-test split

In [None]:
# Scaler objects in sklearn can be used to rescale data in various ways
# in this case, we're using z-score based scaling
# another common scaler is the MinMaxScaler which performs linear scaling into the range [0, 1]
scaler = StandardScaler()

# all scalers implement a fit and a transform method
#   - fit: computes the appropriate statistics from the data
#   - transform: rescales the data using the statistics computed by fit
#   - fit_transform: a convenience method that applies fit(), followed by transform()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Questions to answer:
#   - why did we call fit_transform on the training data but just transform on the test data?
#   - why did we not call anything on y_train or y_test?
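
For reference, the arithmetic behind the two scalers mentioned above can be checked by hand. This is a small sketch on a single made-up column (`col`), comparing scikit-learn's output with the z-score and min-max formulas written out in NumPy.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

col = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # one made-up feature column

# z-score scaling: subtract the mean, divide by the standard deviation
z_sklearn = StandardScaler().fit_transform(col)
z_manual = (col - col.mean(axis=0)) / col.std(axis=0)  # population std (ddof=0)

# min-max scaling: map the smallest value to 0 and the largest to 1
mm_sklearn = MinMaxScaler().fit_transform(col)
mm_manual = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))

print(np.allclose(z_sklearn, z_manual))
print(np.allclose(mm_sklearn, mm_manual))
print(mm_sklearn.ravel())  # evenly spaced values from 0 to 1
```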

Time to build a model!

In [None]:
# initialize a logistic regression model
model = LogisticRegression()

# every model family in scikit-learn implements a fit and a predict method:
#   - fit: takes the training data and targets and fits the model parameters to it
#   - predict: takes a fitted model and makes predictions on the supplied data
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

# at this point, we have y_test (the true labels) and y_preds (our model's
# predictions for those same data points); these two should have the same
# shape
print(y_test.shape)
print(y_preds.shape)

In [None]:
# time to compute some metrics: first, let's check our accuracy
print("Acc:", accuracy_score(y_test, y_preds))
print()

# now, the F1 score
print("F1:", f1_score(y_test, y_preds))
print()

# we can also see the confusion matrix
print(confusion_matrix(y_test, y_preds))
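
It can help to see how these three metrics relate to one another. The sketch below uses made-up label vectors (`y_true`, `y_hat`), not the model's output, and recomputes accuracy and F1 by hand from the entries of the confusion matrix.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# made-up labels, purely for illustration
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# rows are the true classes, columns are the predicted classes (label order 0, 1)
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(cm)
print("accuracy:", accuracy, "sklearn:", accuracy_score(y_true, y_hat))
print("F1:", f1, "sklearn:", f1_score(y_true, y_hat))
```

Here the matrix has 3 true negatives, 1 false positive, 1 false negative and 3 true positives, so accuracy, precision, recall and F1 all equal 0.75.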

Earlier, we created a logistic regression model by instantiating a `LogisticRegression` object. This uses a sensible set of default hyperparameters, but you will often want to customize these hyperparameters for your specific dataset. You can set them by passing arguments to the `LogisticRegression` constructor. For example, what if you wanted to change the strength of the regularization applied to the loss function? Identify which argument controls this by reading the documentation page:  
[https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)  
Then, retrain a `LogisticRegression` model with that argument set to 0.01 and measure its performance using the F1 score. While you are on the documentation page, note how the argument's value relates to the strength of the regularization.

In [None]:
# YOUR CODE HERE

Let's try to identify the best setting of this regularization strength hyperparameter. As noted in the videos, we need to be a little careful when doing this: you don't want to use the test set for determining the best hyperparameter setting, as you would be "contaminating" the test set, i.e., using it as part of the model fitting process. To ensure the integrity of the test set, we should instead split the training set further -- into a training set and a validation set. We can train a variety of `LogisticRegression` models, with different regularization strengths, on this smaller training set, and evaluate performance on the validation set. Then, once we've identified the best performing hyperparameter value, we can train a "fresh" model on the entire training fold using this value and measure its performance on the test set.
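
The shape of this workflow can be sketched on synthetic data before you apply it to the Titanic features. Everything below is an assumption made for illustration: the data comes from `make_classification`, the split sizes and the grid `candidate_Cs` are arbitrary, and feature scaling is left out for brevity. Treat it as a pattern to adapt, not as the solution to the cells that follow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data; NOT the Titanic features
X_syn, y_syn = make_classification(n_samples=600, n_features=5, random_state=0)

# outer split: hold out a test set that is never used for tuning
X_tr_full, X_te, y_tr_full, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)
# inner split: carve a validation set out of the training data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tr_full, y_tr_full, test_size=0.3, random_state=0
)

candidate_Cs = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]  # arbitrary illustrative grid
val_scores = []
for C in candidate_Cs:
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_tr, y_tr)                                  # smaller training set only
    val_scores.append(f1_score(y_val, clf.predict(X_val)))

# pick the setting with the highest validation F1
best_C = candidate_Cs[int(np.argmax(val_scores))]

# retrain a fresh model on the entire training fold, then use the test set once
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_tr_full, y_tr_full)
test_f1 = f1_score(y_te, final_model.predict(X_te))
print("best C:", best_C, "test F1:", round(test_f1, 3))
```

The test set appears only in the last two lines, after the hyperparameter has already been chosen.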

In [None]:
# you'll write code to implement the above workflow here
# first, split the training data further into a training and validation set

# YOUR CODE HERE

In [None]:
# train a range of LogisticRegression models, using different hyperparameter settings
# on just the training set (use a loop); evaluate each model on the validation set

# YOUR CODE HERE

In [None]:
# create a plot of your results: F1 score vs. C value
# remember to label your axes!

# YOUR CODE HERE

In [None]:
# finally: what's the best C value based on your plot? train a *single* LogisticRegression model 
# on the entire training data using that setting -- how well does this model do on the test set?

# YOUR CODE HERE

<i><p style='text-align: right;'> <b>Authors:</b> Michelle Kuchera, Raghuram Ramanujan </p></i>