<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Titanic-dataset-to-predict-survival-🚢-🥶" data-toc-modified-id="Titanic-dataset-to-predict-survival-🚢-🥶-1">Titanic dataset to predict survival 🚢 🥶</a></span></li><li><span><a href="#Instructions" data-toc-modified-id="Instructions-2">Instructions</a></span></li><li><span><a href="#Load-and-understand-the-data" data-toc-modified-id="Load-and-understand-the-data-3">Load and understand the data</a></span></li><li><span><a href="#Fit-a-first-model" data-toc-modified-id="Fit-a-first-model-4">Fit a first model</a></span></li><li><span><a href="#Fit-your-own-models" data-toc-modified-id="Fit-your-own-models-5">Fit your own models</a></span></li></ul></div>

<center><h2>Titanic dataset to predict survival 🚢 🥶</h2></center>

Let's try different preprocessing techniques to improve prediction on the infamous Titanic dataset.

Instructions
------

- Complete this individually. 

    You'll get the most out of this activity by attempting it on your own.
    <br>
    
- But together.

    I'll place you in breakout rooms so you can ask questions and chat with peers.
    
- Type every command. 

    You learn almost nothing by copy n' pasting code. Typing the commands will build procedural fluency and you will make small typos that will force you to debug common mistakes. Tab complete is awesome, use it!
<br>
- Complete the activity in Deepnote. 

    After completion, send the link as a private message in Zoom to Brian. Time permitting, Brian might give you a quick code review. It also signals who in the class is done.
    <br>
- Any random seed should be set to `42` so we can compare results amongst ourselves.

<center><h2>Load and understand the data</h2></center>

In [2]:
reset -fs

In [3]:
from sklearn.datasets import fetch_openml

titanic = fetch_openml(name='titanic', 
                       version=1,
                       as_frame=True)

In [4]:
print(titanic.DESCR)

**Author**: Frank E. Harrell Jr., Thomas Cason  
**Source**: [Vanderbilt Biostatistics](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html)  
**Please cite**:   

The original Titanic dataset, describing the survival status of individual passengers on the Titanic. The titanic data does not contain information from the crew, but it does contain actual ages of half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

Thomas Cason of UVa has greatly updated and improved the titanic data frame using the Encyclopedia Titanica and created the dataset here. Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variable

https://www.openml.org/d/40945

		
| Feature | Definition | Key |
|:-------|:------|:------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex |  |
| Age | Age in years |  |
| sibsp | # of siblings / spouses aboard the Titanic |  |
| parch | # of parents / children aboard the Titanic |  |
| ticket | Ticket number |  |
| fare | Passenger fare |  |
| cabin | Cabin number |  |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |


		
	

__Variable Notes__:

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...  
Sibling = brother, sister, stepbrother, stepsister  
Spouse = husband, wife (mistresses and fiancés were ignored)  

parch: The dataset defines family relations in this way...  
Parent = mother, father  
Child = daughter, son, stepdaughter, stepson  
Some children travelled only with a nanny, therefore parch=0 for them.  

In [5]:
# Survive or not
titanic.target

0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: survived, Length: 1309, dtype: category
Categories (2, object): ['0', '1']

In [6]:
y = titanic.target

In [7]:
import pandas as pd
# Always look at the raw data
titanic.data.tail()

|  | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 1304 | 3.0 | Zabour, Miss. Hileni | female | 14.5 | 1.0 | 0.0 | 2665 | 14.4542 |  | C |  | 328.0 |  |
| 1305 | 3.0 | Zabour, Miss. Thamine | female |  | 1.0 | 0.0 | 2665 | 14.4542 |  | C |  |  |  |
| 1306 | 3.0 | Zakarian, Mr. Mapriededer | male | 26.5 | 0.0 | 0.0 | 2656 | 7.225 |  | C |  | 304.0 |  |
| 1307 | 3.0 | Zakarian, Mr. Ortin | male | 27.0 | 0.0 | 0.0 | 2670 | 7.225 |  | C |  |  |  |
| 1308 | 3.0 | Zimmerman, Mr. Leo | male | 29.0 | 0.0 | 0.0 | 315082 | 7.875 |  | S |  |  |  |


In [8]:
# Types matter
titanic.data.dtypes

pclass        float64
name           object
sex          category
age           float64
sibsp         float64
parch         float64
ticket         object
fare          float64
cabin          object
embarked     category
boat           object
body          float64
home.dest      object
dtype: object

In [9]:
# Hints about missing data for continuous types
titanic.data.describe()

|  | pclass | age | sibsp | parch | fare | body |
|:--|:--|:--|:--|:--|:--|:--|
| count | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 | 121.0 |
| mean | 2.294882 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 14.4135 | 1.041658 | 0.86556 | 51.758668 | 97.696922 |
| min | 1.0 | 0.1667 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 2.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 72.0 |
| 50% | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 155.0 |
| 75% | 3.0 | 39.0 | 1.0 | 0.0 | 31.275 | 256.0 |
| max | 3.0 | 80.0 | 8.0 | 9.0 | 512.3292 | 328.0 |
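
`describe()` only covers numeric columns, and its `count` row reports non-missing values (1046 ages out of 1309 rows means 263 ages are missing). A direct per-column count can be sketched as below. The small frame is an invented stand-in for `titanic.data`, not real passenger rows, and the names `toy`, `missing_counts`, and `missing_share` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic.data (values are invented for illustration)
toy = pd.DataFrame({
    "age":      [29.0, np.nan, 2.0, np.nan],
    "fare":     [211.3, 151.55, np.nan, 7.25],
    "embarked": ["S", "C", None, "S"],
})

# Number of missing values per column, including non-numeric columns
missing_counts = toy.isna().sum()

# Fraction of rows that are missing in each column
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

Running the same two calls on `titanic.data` should also surface gaps in the categorical and object columns, which `describe()` skips by default.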


<center><h2>Fit a first model</h2></center>

In [10]:
# Select a single feature to keep the modeling simple
X = titanic.data.sibsp
X = X.values.reshape(-1, 1)

In [11]:
# Do the three way data split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test             = train_test_split(X,       y,       random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, random_state=42)
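
With the default `test_size=0.25`, splitting twice leaves roughly 56% of the rows for training, 19% for validation, and 25% for testing. The sketch below checks those sizes on a placeholder array with the same number of rows as the dataset; `X_demo`, `y_demo`, and the other names are illustrative, not part of the activity.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the Titanic frame (1309)
X_demo = np.arange(1309).reshape(-1, 1)
y_demo = np.arange(1309) % 2

# First split holds out the test set; second split carves validation out of train
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_tr, y_tr, random_state=42)

n = len(X_demo)
for name, part in [("train", X_tr), ("validation", X_val), ("test", X_te)]:
    print(f"{name:>10}: {len(part):4d} rows ({len(part) / n:.1%})")
```

Seeding both calls keeps all three subsets identical from run to run, which is what makes results comparable across the class.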

In [12]:
from sklearn.linear_model  import LogisticRegression 
from sklearn.metrics       import accuracy_score
from sklearn.pipeline      import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit model in a pipeline
# For this activity, only use LogisticRegression with default hyperparameters
pipe = Pipeline([('scaler',       StandardScaler()), 
                 ('logistic',     LogisticRegression())])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_validation)

# This is a classification task so accuracy is an acceptable metric
# We'll learn about other classification metrics later
acc = accuracy_score(y_validation, y_pred)
print(f"Accuracy: {acc:.2%}")

Accuracy: 67.48%
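
One way to put an accuracy number in context is to compare it with a classifier that always predicts the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on invented labels. The data and the `*_demo` names are assumptions for illustration, so the printed number says nothing about the Titanic result above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Invented labels: 6 of 10 are class '0' in both sets (illustration only)
y_train_demo = np.array(['0'] * 6 + ['1'] * 4)
y_valid_demo = np.array(['0', '0', '0', '1', '1', '0', '1', '0', '0', '1'])

# The dummy model ignores the features, but fit/predict still expect them
X_train_demo = np.zeros((len(y_train_demo), 1))
X_valid_demo = np.zeros((len(y_valid_demo), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_acc = accuracy_score(y_valid_demo, baseline.predict(X_valid_demo))
print(f"Baseline accuracy: {baseline_acc:.2%}")
```

A preprocessing choice is only helping if the resulting model beats this kind of baseline on the validation set.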


<center><h2>Fit your own models</h2></center>

Now it's your turn to try different techniques from class:

- Imputation of missing data
- Preprocessing categorical data
- Preprocessing continuous data

The goal of this course is to develop your machine learning intuition. PRIMM is a systematic way to develop your intuition:

- Predict - Make an out-loud guess at the start.
- Run - Always run the code. The interpreter is a great teacher.
- Investigate - Why did it turn out that way?
- Modify - Change something and repeat previous steps.
- Make - Apply the same concept to a new context.

For this activity, experiment on one feature at a time. I'll show you how to preprocess combinations of features right after.

Suggested features to start with:

1. fare
2. pclass
3. embarked
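
For a categorical feature such as `embarked`, one possible single-feature pipeline imputes the most frequent category and then one-hot encodes it. This is a sketch on a small invented frame, not the Titanic data; `toy_X`, `toy_y`, and `cat_pipe` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented single-column frame with one missing port (illustration only)
toy_X = pd.DataFrame({"embarked": ["S", "C", "S", np.nan, "Q", "C", "S", "S"]})
toy_y = np.array(["0", "1", "0", "0", "1", "1", "0", "0"])

cat_pipe = Pipeline([
    ("impute",   SimpleImputer(strategy="most_frequent")),  # fill gaps with the mode
    ("onehot",   OneHotEncoder(handle_unknown="ignore")),   # one column per category
    ("logistic", LogisticRegression()),
])

cat_pipe.fit(toy_X, toy_y)
print(cat_pipe.predict(toy_X))
```

Scaling is left out here because one-hot columns are already 0 or 1. Whether that is the best choice for the real data is something to test, not assume.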

Remember you can stack preprocessing steps:

```python
pipe = Pipeline([('impute',       KNNImputer(n_neighbors=2)),
                 ('scaler',       StandardScaler()), 
                 ('logistic',     LogisticRegression())])
```
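
A self-contained version of this stacked pipeline, run on a made-up numeric column with gaps (standing in for something like `fare`), might look like the following. The values and the `*_demo` names are invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented numeric feature with missing entries (illustration only)
X_demo = np.array([[7.25], [71.28], [np.nan], [8.05],
                   [53.10], [np.nan], [51.86], [8.46]])
y_demo = np.array(["0", "1", "0", "0", "1", "1", "1", "0"])

pipe_demo = Pipeline([("impute",   KNNImputer(n_neighbors=2)),
                      ("scaler",   StandardScaler()),
                      ("logistic", LogisticRegression())])

pipe_demo.fit(X_demo, y_demo)

# Inspect what the fitted imputer alone does to the missing entries
imputed = pipe_demo.named_steps["impute"].transform(X_demo)
print(imputed.ravel())
```

With a single column there are no other features to measure neighbor distance on. The scikit-learn documentation describes a fallback to the training average for that feature when no distances are defined, so comparing against `SimpleImputer` is a worthwhile Predict and Investigate exercise.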

In [13]:
import numpy as np

from sklearn.impute import *
from sklearn.preprocessing import *
