# Homework 3: Suggested Solution

# Question 1

# Question 2

In [9]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_titanic_data():
    tarball_path = Path("datasets/titanic.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/titanic.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as titanic_tarball:
            titanic_tarball.extractall(path="datasets")
    return [pd.read_csv(Path("datasets/titanic") / filename) for filename in ("train.csv", "test.csv")]

In [10]:
train_data, test_data = load_titanic_data()

In [11]:
train_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
test_data

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0000,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S
414,1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C
415,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S
416,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S


The goal is to train a classifier that predicts the **Survived** column from the other columns.

Train the best model you can on the training data, then make your predictions on the test data.

However, the test data does *not* contain the labels.

Normally you would upload your predictions to Kaggle to see your final score. Here you have to think about how to validate your model when the test set has no labels.
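Because the test set has no labels, one common option is to hold out part of the labeled training data as a validation set (cross-validation, used later in this solution, is another). A minimal sketch, assuming a small made-up DataFrame named `toy` rather than the real Titanic data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the labeled training set.
toy = pd.DataFrame({
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46, 51.86, 21.08, 11.13, 30.07],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

X = toy[["Fare"]]
y = toy["Survived"]

# Keep 20% of the labeled rows aside; stratify so that both parts keep a
# similar share of survivors.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(len(X_fit), len(X_val))
```

A model fitted on `X_fit` can then be scored on `X_val`, whose labels are known.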

The attributes have the following meaning:
* **PassengerId**: a unique identifier for each passenger
* **Survived**: the target; 0 means the passenger did not survive, 1 means the passenger survived.
* **Pclass**: passenger class.
* **Name**, **Sex**, **Age**: self-explanatory
* **SibSp**: number of siblings and spouses of the passenger aboard the Titanic.
* **Parch**: number of parents and children of the passenger aboard the Titanic.
* **Ticket**: ticket ID
* **Fare**: price paid (in pounds)
* **Cabin**: passenger's cabin number
* **Embarked**: the port where the passenger boarded the Titanic

The goal is to predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and so on.

Let's explicitly set the `PassengerId` column as the index column (why?):

In [14]:
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")
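One possible answer to the question above: as an index, `PassengerId` still identifies each row but is no longer a column, so it cannot be fed to the model as a feature by accident, and pandas aligns rows on it when frames are concatenated. A small sketch with a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})
toy = toy.set_index("PassengerId")

# The identifier is no longer among the feature columns.
print(list(toy.columns))

# The row order differs, but concat(axis=1) aligns rows on the index.
ages = toy[["Age"]]
fares = toy[["Fare"]].iloc[::-1]
combined = pd.concat([ages, fares], axis=1)
print(combined.loc[2, "Fare"])
```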

In [15]:
# Let's get more info to see how much data is missing:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


Think about how to handle the missing values in **Age**, **Cabin**, and **Embarked**.

The first question is whether these columns are necessary at all.

What is your story? That is, does your model need them?

The **Age** attribute has about 20% null values (177 of 891), so we will need to decide what to do with them. Replacing null values with the median age seems reasonable. We could be a bit smarter by predicting the age based on the other columns (for example, the median age is 37 in 1st class, 29 in 2nd class and 24 in 3rd class), but we'll keep things simple and just use the overall median age.
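The class-based idea mentioned above could be sketched as follows. This is not what the solution does; it uses a made-up frame `toy` and a hypothetical column name `Age_filled`:

```python
import pandas as pd

toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, None, 36.0, 20.0, 24.0, None],
})

# Median age within each passenger class, broadcast back to every row.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages with the median of the passenger's own class.
toy["Age_filled"] = toy["Age"].fillna(class_median)
print(toy)
```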

How about **Cabin** and **Embarked**?

In [16]:
# Median age among female passengers only
train_data[train_data["Sex"]=="female"]["Age"].median()

27.0

The **Name** and **Ticket** attributes may have some value, but they will be a bit tricky to convert into useful numbers that a model can consume. So for now, we will ignore them.

Further, note that **Sex** is a string column.

In [17]:
train_data[['Sex']].head()

PassengerId,Sex
1,male
2,female
3,female
4,female
5,male
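A model that only accepts numbers could still use this column if it were encoded first. The following is only a sketch of one simple encoding, with a made-up Series and a hypothetical name `IsFemale`:

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# 1 for female, 0 for male; a two-valued column needs only one indicator.
is_female = (sex == "female").astype(int).rename("IsFemale")
print(is_female.tolist())
```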


Let's see the summary of numerical columns.

In [96]:
train_data.describe()

,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699113,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526507,1.102743,0.806057,49.693429
min,0.0,1.0,0.4167,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


* Only 38% **Survived**! That's close enough to 40%, so accuracy will be a reasonable metric to evaluate our model.
    -    Still, because the target is somewhat skewed, we should also look at precision and recall.
* The mean **Fare** was £32.20, which does not seem so expensive (but it was probably a lot of money back then).
* The mean **Age** was less than 30 years old, but keep in mind that many age values are missing.

In [97]:
# Let's check that the target is indeed 0 or 1
train_data["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64
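These counts also give a baseline for accuracy: a model that always predicts "did not survive" is right for every passenger in the majority class. The arithmetic, using the counts printed above:

```python
# Counts taken from the value_counts() output above.
n_died = 549
n_survived = 342

# Accuracy of always predicting the majority class (0 = did not survive).
baseline_accuracy = n_died / (n_died + n_survived)
print(round(baseline_accuracy, 3))  # 0.616
```

Any accuracy score for a trained model is worth reading against this number.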

In [98]:
# Now let's take a quick look at all the categorical attributes:
train_data["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [99]:
train_data["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [100]:
train_data["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The Embarked attribute tells us where the passenger embarked: C=Cherbourg, Q=Queenstown, S=Southampton.

In [18]:
from sklearn.linear_model import SGDClassifier

Let's apply `SGDClassifier`.

Note that `SGDClassifier` only accepts numerical values, so we need to do at least two things:

1. impute missing values
2. select features
    - transform string columns into numbers

This process is called *feature engineering*.

Let's handle the missing values first by replacing them with a representative value of the corresponding column.

The columns with missing values are **Age**, **Cabin**, and **Embarked**. You can set the rules on your own, but you need to justify them with your intuition. Here is just an example.

1. Age: median
2. Cabin, Embarked: most frequent value
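Rules like these can be written so that the fill values are learned from the training data and then reused on the test data. A minimal pandas sketch on made-up frames (`toy_train`, `toy_test`, and `fill_values` are hypothetical names, not part of the solution):

```python
import pandas as pd

toy_train = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})
toy_test = pd.DataFrame({
    "Age": [None, 30.0],
    "Embarked": [None, "Q"],
})

# Learn the fill values from the training data only.
fill_values = {
    "Age": toy_train["Age"].median(),
    "Embarked": toy_train["Embarked"].mode().iloc[0],
}

# Apply the same values to both sets.
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)
print(fill_values)
```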

In [36]:
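# Impute missing ages. Note that this fills in the median age of female
# passengers (27.0), not the overall median (28.0, see describe() above).
train_data.loc[train_data.Age.isnull(),'Age'] = train_data[train_data["Sex"]=="female"]["Age"].median()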

In [38]:
train_data.Age.isnull().sum()

0

In [40]:
train_data[['Cabin','Embarked']].head(10)

PassengerId,Cabin,Embarked
1,,S
2,C85,C
3,,S
4,C123,S
5,,S
6,,Q
7,E46,S
8,,S
9,,S
10,,C


In [41]:
train_data.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [43]:
# Fill the two missing values with the most frequent port, 'S'
train_data.loc[train_data.Embarked.isnull(),'Embarked'] = 'S'

Now let's think about the **Cabin** column.

In [46]:
train_data.Cabin.value_counts()

B96 B98        4
G6             4
C23 C25 C27    4
C22 C26        3
F33            3
              ..
E34            1
C7             1
C54            1
E36            1
C148           1
Name: Cabin, Length: 147, dtype: int64

The full cabin number is probably too detailed, so let's look at its first letter only.

In [49]:
train_data.Cabin.str[0].value_counts()

C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Cabin, dtype: int64

In [51]:
train_data.Cabin.isnull().sum() / len(train_data) * 100

77.10437710437711

About 77% of the values are missing, and it is hard to find a representative value to replace them with. Hence, let's ignore **Cabin**.
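Dropping the column is the choice made here. An alternative, not used in this solution, would be to keep only whether a cabin was recorded at all, or to treat "missing" as a category of its own. A sketch with a made-up Series (`has_cabin` and `cabin_letter` are hypothetical names):

```python
import pandas as pd

cabin = pd.Series(["C85", None, "E46", None, None], name="Cabin")

# 1 if a cabin is recorded, 0 if the value is missing.
has_cabin = cabin.notna().astype(int)

# The first letter, with missing values kept as a category of their own.
cabin_letter = cabin.str[0].fillna("Unknown")
print(has_cabin.tolist(), cabin_letter.tolist())
```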

Now what's left? 

- Need to set X and y
- Convert string columns into numeric

In [53]:
# The target is simply the Survived column
y_train = train_data.Survived

Let's think about **Pclass** and **Embarked**:

- they are categorical variables (what does that mean?)

In [54]:
X_train_num = train_data[['Age','SibSp','Parch','Fare']]
X_train_cat = train_data[['Pclass','Embarked']]

In [55]:
from sklearn.preprocessing import OneHotEncoder
# sparse=False returns a dense array; in scikit-learn 1.2+ this argument is named sparse_output
enc = OneHotEncoder(sparse=False)

In [56]:
X_train_cat = enc.fit_transform(X_train_cat)
X_train_cat = pd.DataFrame(X_train_cat,index = train_data.index)
X_train_cat.head()

PassengerId,0,1,2,3,4,5
1,0.0,0.0,1.0,0.0,0.0,1.0
2,1.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0,0.0,1.0
5,0.0,0.0,1.0,0.0,0.0,1.0


In [57]:
print(set(train_data.Pclass))
print(set(train_data.Embarked))

{1, 2, 3}
{'C', 'S', 'Q'}
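A Python `set` prints in arbitrary order, so it is not a reliable guide to the column order of the encoded matrix. scikit-learn's `OneHotEncoder` stores the order it actually used in its `categories_` attribute, and with the default settings each column's categories are sorted. A minimal sketch on made-up data (`toy` and `toy_enc` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})

toy_enc = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense.
encoded = toy_enc.fit_transform(toy).toarray()

# Category order used for the output columns.
print(list(toy_enc.categories_[0]))

# The first row ("S") has its 1 in the last column.
print(encoded[0].tolist())
```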


In [58]:
# OneHotEncoder sorts the categories of each column (see enc.categories_),
# so the Embarked columns are ordered C, Q, S, not in the set's print order.
X_train_cat.columns = ['Pclass1','Pclass2','Pclass3','EmbarkedC','EmbarkedQ','EmbarkedS']

In [59]:
X_train_cat.head()

PassengerId,Pclass1,Pclass2,Pclass3,EmbarkedC,EmbarkedQ,EmbarkedS
1,0.0,0.0,1.0,0.0,0.0,1.0
2,1.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,1.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0,0.0,1.0
5,0.0,0.0,1.0,0.0,0.0,1.0


In [60]:
X_train = pd.concat([X_train_num,X_train_cat],axis = 1)

In [61]:
X_train.head()

PassengerId,Age,SibSp,Parch,Fare,Pclass1,Pclass2,Pclass3,EmbarkedC,EmbarkedQ,EmbarkedS
1,22.0,1,0,7.25,0.0,0.0,1.0,0.0,0.0,1.0
2,38.0,1,0,71.2833,1.0,0.0,0.0,1.0,0.0,0.0
3,26.0,0,0,7.925,0.0,0.0,1.0,0.0,0.0,1.0
4,35.0,1,0,53.1,1.0,0.0,0.0,0.0,0.0,1.0
5,35.0,0,0,8.05,0.0,0.0,1.0,0.0,0.0,1.0


In [62]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        891 non-null    float64
 1   SibSp      891 non-null    int64  
 2   Parch      891 non-null    int64  
 3   Fare       891 non-null    float64
 4   Pclass1    891 non-null    float64
 5   Pclass2    891 non-null    float64
 6   Pclass3    891 non-null    float64
 7   EmbarkedC  891 non-null    float64
 8   EmbarkedQ  891 non-null    float64
 9   EmbarkedS  891 non-null    float64
dtypes: float64(8), int64(2)
memory usage: 76.6 KB


In [63]:
sgd_clf = SGDClassifier(random_state=123)
sgd_clf.fit(X_train, y_train)

SGDClassifier(random_state=123)

In [64]:
sgd_clf.score(X_train,y_train)

0.6868686868686869

In [65]:
from sklearn.model_selection import cross_val_score

In [66]:
score = cross_val_score(sgd_clf, X_train, y_train, cv=3)
score.mean()

0.67003367003367

In [67]:
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)

In [68]:
from sklearn.metrics import precision_score, recall_score
print("precision of the model: {:.03f}".format(precision_score(y_train, y_train_pred)))
print("sensitivity of the model: {:.03f}".format(recall_score(y_train, y_train_pred)))

precision of the model: 0.755
sensitivity of the model: 0.208
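Both scores follow directly from counts of true positives, false positives, and false negatives. A small sketch with made-up labels (not the Titanic predictions) that computes them from the definitions:

```python
# Made-up labels and predictions, for illustration only.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # share of predicted survivors who did survive
recall = tp / (tp + fn)     # share of actual survivors the model found

print(precision, recall)
```

A low recall, as in the output above, means the model misses most of the actual survivors.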


We may need to lower the threshold. Why?
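A hint for the question above: a linear classifier such as `SGDClassifier` predicts 1 when its decision score exceeds a threshold (0 by default). Lowering the threshold labels more passengers as survivors, which can only keep or raise recall, usually at some cost in precision. The sketch below uses made-up scores; with scikit-learn, real scores could be obtained with `cross_val_predict(..., method="decision_function")`.

```python
# Made-up decision scores and true labels, for illustration only.
scores = [2.1, 0.4, -0.3, -0.8, -1.5, 1.2, -0.2, -0.9, -2.0, -2.5]
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]


def precision_recall_at(threshold):
    """Return (precision, recall) when predicting 1 for score > threshold."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn)
    return precision, recall


# Compare the default threshold with a lower one.
for threshold in (0.0, -1.0):
    print(threshold, precision_recall_at(threshold))
```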