# Tutorial 10: Supervised Learning

__The goal of this assignment is to create two models that predict the ticket fare and the survival of Titanic passengers.__ 

The prediction models are built using supervised learning, a technique that relies on examples to learn a function.
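
As a minimal sketch of what "learning a function from examples" means, the toy snippet below fits a scikit-learn classifier on four labelled points and then queries it on inputs it has not seen. The data and variable names are invented for illustration and are not part of the Titanic dataset.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy examples (assumed for illustration): one feature, one binary label.
X_train = [[0], [1], [2], [3]]
y_train = [0, 0, 1, 1]

# fit() learns a mapping from features to labels using the examples.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# predict() applies the learned function to inputs it has not seen.
predictions = model.predict([[0.2], [2.8]])
print(predictions)
```

The same `fit` / `predict` pattern is used by every model in the second section.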

To implement these models, you will use pandas and sklearn, two popular Python libraries for data analysis and machine learning.

In the first section, your task is to engineer the features that will be used to train and test the models.

In the second section, you have to train the models and evaluate their performance on testing data.

__Grade scale__: 20 points
- __final model__: 3 points
- __correct answer__: 2 points
- __incorrect answer__: 0 points

__Further documentation__:
* https://docs.python.org/3/
* http://scikit-learn.org/stable/index.html
* http://scikit-learn.org/stable/supervised_learning.html

# Core

__VARIABLE DESCRIPTION__:

- __survival__        Survival (0 = No; 1 = Yes) / used for classification
- __pclass__          Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- __name__            Name
- __sex__             Sex
- __age__             Age
- __sibsp__           Number of Siblings/Spouses Aboard
- __parch__           Number of Parents/Children Aboard
- __ticket__          Ticket Number
- __fare__            Passenger Fare / used for regression
- __cabin__           Cabin
- __embarked__        Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [1]:
# import libraries
import pandas as pd
import inspect
import sklearn

In [2]:
# load the dataset with pandas
df = pd.read_csv("titanic.csv.gz")

In [3]:
# drop passengers without ticket fare or survival
df.dropna(subset=["fare", "survival"], inplace=True)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1308 entries, 0 to 1308
Data columns (total 11 columns):
pclass      1308 non-null int64
survival    1308 non-null int64
name        1308 non-null object
sex         1308 non-null object
age         1045 non-null float64
sibsp       1308 non-null int64
parch       1308 non-null int64
ticket      1308 non-null object
fare        1308 non-null float64
cabin       295 non-null object
embarked    1306 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 122.6+ KB


# Features

__1. Drop the `name`, `ticket` and `cabin` columns from the dataframe__
- _explanation_: these variables are too specific to predict `fare` and `survival`

In [5]:
def Q1(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return df.drop(columns=["name", "ticket", 'cabin'])
    
df1 = Q1(df.copy())

df1.head()

Unnamed: 0,pclass,survival,sex,age,sibsp,parch,fare,embarked
0,1,1,female,29.0,0,0,211.3375,S
1,1,1,male,0.9167,1,2,151.55,S
2,1,0,female,2.0,1,2,151.55,S
3,1,0,male,30.0,1,2,151.55,S
4,1,0,female,25.0,1,2,151.55,S


In [6]:
assert "name" not in df1
assert "cabin" not in df1
assert "ticket" not in df1
assert df1.shape == (1308, 8)

__2. Return the `fare` and `survival` columns in addition to the dataframe without these columns__
- _hint_: you can return multiple variables in Python by separating values with a comma
- _explanation_: we must separate the variables we want to predict from the rest
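
The hint about returning several values can be sketched with plain Python. The function name `min_and_rest` is hypothetical and only serves to show tuple packing and unpacking.

```python
def min_and_rest(values):
    """Return the smallest value and the remaining values as two results."""
    smallest = min(values)
    rest = [v for v in values if v != smallest]
    # Separating values with a comma builds a tuple.
    return smallest, rest

# Tuple unpacking assigns each returned value to its own name.
lowest, others = min_and_rest([3, 1, 2])
print(lowest, others)
```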

In [7]:
def Q2(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    return df["fare"], df["survival"], df.drop(columns=["fare","survival"])
FARE, SURVIVAL, df2 = Q2(df1.copy())

df2.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,embarked
0,1,female,29.0,0,0,S
1,1,male,0.9167,1,2,S
2,1,female,2.0,1,2,S
3,1,male,30.0,1,2,S
4,1,female,25.0,1,2,S


In [8]:
assert "fare" not in df2
assert "survival" not in df2
assert df2.shape == (1308, 6)

assert FARE.equals(df["fare"])
assert isinstance(FARE, pd.Series)
assert SURVIVAL.equals(df["survival"])
assert isinstance(SURVIVAL, pd.Series)

__3. Fill the missing values in the dataframe: `age` with its mean (29) and `embarked` with its mode ('S')__
- _explanation_: most supervised learning algorithms do not support missing values, so they must be handled explicitly
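
The question gives the fill values directly, but they can also be computed from the data. The sketch below does this on a small toy frame with assumed values. Note that `Series.mode()` returns a Series, so its first entry has to be selected.

```python
import pandas as pd

# Toy frame (assumed values) with one missing age and one missing port.
toy = pd.DataFrame({
    "age": [20.0, None, 40.0],
    "embarked": ["S", "S", None],
})

age_mean = toy["age"].mean()          # mean() skips missing values
emb_mode = toy["embarked"].mode()[0]  # mode() returns a Series; take its first entry

# A dict passed to fillna() gives one fill value per column.
filled = toy.fillna({"age": age_mean, "embarked": emb_mode})
print(filled)
```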

In [9]:
def Q3(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    # fill values given in the question
    age_mean = 29  # df.age.mean()
    emb_mode = 'S'  # df.embarked.mode()
    df.age = df.age.fillna(age_mean)
    df.embarked = df.embarked.fillna(emb_mode)
    return df
    
df3 = Q3(df2.copy())
df3.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,embarked
0,1,female,29.0,0,0,S
1,1,male,0.9167,1,2,S
2,1,female,2.0,1,2,S
3,1,male,30.0,1,2,S
4,1,female,25.0,1,2,S


In [10]:
assert df3.shape == (1308, 6)
assert 29 < df3["age"].mean() < 30
assert (df3["embarked"].mode()=='S').all()
assert not df3["age"].isnull().values.any()
assert not df3["embarked"].isnull().values.any()

__4. Replace `sex` and `embarked` string values by numerical values: sex \[male -> 0, female -> 1\], embarked \[C -> 1, S -> 2, Q -> 3\]__
- _explanation_: most supervised learning algorithms require numerical values to operate
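
The integer codes requested here place the ports on a numeric scale (C < S < Q) that has no real meaning. A common alternative, which this assignment does not ask for and which would not pass the assertions below, is one-hot encoding. The sketch uses `pd.get_dummies` on a toy column with assumed values.

```python
import pandas as pd

# Toy column (assumed values) with the three ports of embarkation.
toy = pd.DataFrame({"embarked": ["C", "S", "Q", "S"]})

# One indicator column per port; exactly one of them is set on each row.
one_hot = pd.get_dummies(toy, columns=["embarked"])
print(one_hot)
```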

In [11]:
def Q4(df):
    # YOUR CODE HERE
    # raise NotImplementedError()
    # encode sex as 0 (male) / 1 (female)
    df.sex = (df.sex == 'female').astype(int)
    # encode the port of embarkation as an ordered categorical of integer codes
    codes = df.embarked.map({'C': 1, 'S': 2, 'Q': 3})
    df.embarked = pd.Categorical(codes, ordered=True)
    return df
df4 = Q4(df3.copy())

df4.head()
df4["embarked"].value_counts().to_dict()

{2: 915, 1: 270, 3: 123}

In [12]:
assert df4.shape == (1308, 6)
assert df4["sex"].value_counts().to_dict() == {0: 842, 1: 466}
assert df4["embarked"].value_counts().to_dict() == {1: 270, 2: 915, 3: 123}

__5. Create a function that returns the first 2/3 of a dataframe for training and the last 1/3 for testing__
- _explanation_: it is important to ensure that an algorithm performs well on unseen examples (testing)
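
The split arithmetic can be checked on its own. The sketch below uses a plain list as a stand-in for the 1308 dataframe rows, with variable names chosen for illustration.

```python
n_rows = 1308                    # passengers left after dropping missing targets
threshold = int(n_rows * 2 / 3)  # position where the testing part starts

rows = list(range(n_rows))       # stand-in for the dataframe rows
train, test = rows[:threshold], rows[threshold:]
print(len(train), len(test))
```

This split is positional and does not shuffle. In the outputs shown in this notebook the first rows are first-class passengers and the first testing rows are third-class, which suggests the file is ordered by class, so the two parts may not be drawn from the same mix of passengers.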

In [13]:
def Q5(df):
    nrows = df.shape[0]
    thres = int(nrows * 2/3)
    # YOUR CODE HERE
    # raise NotImplementedError()
    return df.iloc[:thres], df.iloc[thres:]
    

SURVIVAL_TRAIN, SURVIVAL_TEST = Q5(SURVIVAL.copy())
print("SURVIVAL: train = {}, test = {}".format(len(SURVIVAL_TRAIN), len(SURVIVAL_TEST)))

FARE_TRAIN, FARE_TEST = Q5(FARE.copy().astype('int'))
print("FARE: train = {}, test = {}".format(len(FARE_TRAIN), len(FARE_TEST)))

DF_TRAIN, DF_TEST = Q5(df4.copy())
print("DF: train = {}, test = {}".format(len(DF_TRAIN), len(DF_TEST)))

SURVIVAL: train = 872, test = 436
FARE: train = 872, test = 436
DF: train = 872, test = 436


In [14]:
assert SURVIVAL_TRAIN.shape == (872,)
assert SURVIVAL_TEST.shape == (436,)
assert FARE_TRAIN.shape == (872,)
assert FARE_TEST.shape == (436,)
assert DF_TRAIN.shape == (872,6)
assert DF_TEST.shape == (436,6)

# Models

__1. Create and train a baseline model based on LinearSVC that predicts the survival of Titanic passengers.__
- _note_: you should use the default parameters with random_state = 0
- _note_: the score method returns the mean accuracy (% of correct answers)

http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
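
To make the note about `score` concrete, the sketch below recomputes the mean accuracy by hand on toy data (assumed values) and compares it with what `score` returns.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy, linearly separable data (assumed for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LinearSVC(random_state=0).fit(X, y)

# Mean accuracy: fraction of predictions equal to the true labels.
manual_accuracy = (model.predict(X) == y).mean()
print(manual_accuracy, model.score(X, y))
```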

In [15]:
from sklearn.svm import LinearSVC

def M1(features, target):
    # YOUR CODE HERE
    # raise NotImplementedError()
    m=LinearSVC(random_state=0)
    return m.fit(features, target)
    
m1 = M1(DF_TRAIN, SURVIVAL_TRAIN)

print("SCORE: {}".format(m1.score(DF_TEST, SURVIVAL_TEST)))
DF_TEST.assign(target=SURVIVAL_TEST, predicted=m1.predict(DF_TEST))

SCORE: 0.7270642201834863




Unnamed: 0,pclass,sex,age,sibsp,parch,embarked,target,predicted
872,3,1,29.0,0,0,2,1,1
873,3,0,42.0,0,0,2,0,0
874,3,0,29.0,0,0,2,1,0
875,3,0,30.0,0,0,1,0,0
876,3,0,29.0,0,0,2,0,0
877,3,1,27.0,1,0,2,0,1
878,3,1,25.0,1,0,2,0,1
879,3,0,29.0,0,0,2,0,0
880,3,0,29.0,0,0,1,1,0
881,3,0,21.0,0,0,2,1,0


In [16]:
assert m1.score(DF_TEST, SURVIVAL_TEST) > 0.70

__2. Create and train a baseline model based on LogisticRegression that predicts the ticket fare of Titanic passengers.__
- _note_: you should use the default parameters with a random_state = 0
- _note_: the score method returns the mean accuracy (% of correct answers)

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
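
`LogisticRegression` is a classifier, which is why the split cell above casts `FARE` with `astype('int')`: each distinct integer fare then acts as a class label, and the accuracy counts exact matches on those integers. The sketch below shows the effect of the cast on a few fares. The first two values come from the rows displayed earlier, and 7.25 is an assumed extra value.

```python
import pandas as pd

# Two fares from the first rows shown earlier, plus 7.25 as an assumed value.
fares = pd.Series([211.3375, 151.55, 7.25])

# astype("int") drops the fractional part, giving one integer label per fare.
fare_classes = fares.astype("int")
print(fare_classes.tolist())
```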

In [17]:
from sklearn.linear_model import LogisticRegression

def M2(features, target):
    # YOUR CODE HERE
    # raise NotImplementedError()
    m = LogisticRegression(random_state=0)
    return m.fit(features, target)
    
m2 = M2(DF_TRAIN, FARE_TRAIN)

print("SCORE: {}".format(m2.score(DF_TEST, FARE_TEST)))
DF_TEST.assign(target=FARE_TEST, predicted=m2.predict(DF_TEST))



SCORE: 0.45871559633027525


Unnamed: 0,pclass,sex,age,sibsp,parch,embarked,target,predicted
872,3,1,29.0,0,0,2,8,7
873,3,0,42.0,0,0,2,7,7
874,3,0,29.0,0,0,2,7,7
875,3,0,30.0,0,0,1,7,7
876,3,0,29.0,0,0,2,7,7
877,3,1,27.0,1,0,2,7,7
878,3,1,25.0,1,0,2,7,7
879,3,0,29.0,0,0,2,7,7
880,3,0,29.0,0,0,1,7,7
881,3,0,21.0,0,0,2,7,7


In [18]:
assert m2.score(DF_TEST, FARE_TEST) > 0.40

__3. Create and train a classification model that surpasses the current model in predicting passenger survival (M1)__
- _hint_: you can either change the model parameters or the choice of algorithm
- _note_: use a fixed random state to ensure the stability of your predictions
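
If feature scaling is part of the improvement, one option is to put the scaler and the classifier in a single scikit-learn pipeline, so that the scaling learned on the training data is reapplied whenever the model predicts or scores. This is a minimal sketch on toy data with assumed values, not the solution used in the cell below.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Toy data (assumed): the columns stand for age and a 0/1 feature.
X_train = np.array([[20.0, 0], [30.0, 0], [40.0, 1], [50.0, 1]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[25.0, 0], [45.0, 1]])

# The scaler learns min/max on the training data inside fit() and
# reapplies the same transformation in predict() and score().
model = make_pipeline(MinMaxScaler(), LinearSVC(random_state=0, dual=False))
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

With this design the input arrays are left unchanged, and training and testing features always go through the same transformation.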

In [19]:
from sklearn.ensemble import RandomForestClassifier

def M3(features, target):
    # YOUR CODE HERE
    # raise NotImplementedError()
    # min-max scale `age` to [0, 1]; note that this modifies the `features`
    # frame passed in (here DF_TRAIN) and is not applied to DF_TEST
    ma = max(features.age)
    mi = min(features.age)
    features.age = features.age.apply(lambda x: (x-mi)/(ma-mi))
    m=LinearSVC(random_state=0, max_iter=1000000, dual=False)
    #m = RandomForestClassifier(random_state=0, n_estimators=21)
    return m.fit(features, target)
    
m3 = M3(DF_TRAIN, SURVIVAL_TRAIN)

print("SCORE: {}".format(m3.score(DF_TEST, SURVIVAL_TEST)))
DF_TEST.assign(target=SURVIVAL_TEST, predicted=m3.predict(DF_TEST))

SCORE: 0.7408256880733946


Unnamed: 0,pclass,sex,age,sibsp,parch,embarked,target,predicted
872,3,1,29.0,0,0,2,1,0
873,3,0,42.0,0,0,2,0,0
874,3,0,29.0,0,0,2,1,0
875,3,0,30.0,0,0,1,0,0
876,3,0,29.0,0,0,2,0,0
877,3,1,27.0,1,0,2,0,0
878,3,1,25.0,1,0,2,0,0
879,3,0,29.0,0,0,2,0,0
880,3,0,29.0,0,0,1,1,0
881,3,0,21.0,0,0,2,1,0


In [20]:
assert m3.score(DF_TEST, SURVIVAL_TEST) > m1.score(DF_TEST, SURVIVAL_TEST)

__4. Create and train a regression model that surpasses the current model in predicting passenger ticket fare (M2)__
- _hint_: you can either change the model parameters or the choice of algorithm
- _note_: use a fixed random state to ensure the stability of your predictions
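
One point to keep in mind when comparing scores: for scikit-learn regressors, `score` returns the coefficient of determination (R²), not the mean accuracy returned by classifiers such as `LogisticRegression`. The sketch below checks this on toy data with assumed values.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (assumed for illustration).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0])
X_test = np.array([[1.5], [3.5]])
y_test = np.array([15.0, 35.0])

model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# For regressors, score() is R^2, not a fraction of exact matches.
r2 = model.score(X_test, y_test)
same_as_r2_score = np.isclose(r2, r2_score(y_test, model.predict(X_test)))
print(r2, same_as_r2_score)
```

If M4 is a regressor, the final assertion therefore compares an R² value with an accuracy, two metrics that are not on the same scale.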

In [21]:
from sklearn.tree import DecisionTreeRegressor

def M4(features, target):
    # YOUR CODE HERE
    #raise NotImplementedError()
    #m = LogisticRegression(random_state=0, solver='lbfgs', multi_class='auto', max_iter=1000)
    m = DecisionTreeRegressor(random_state=0)
    return m.fit(features, target)

m4 = M4(DF_TRAIN, FARE_TRAIN)

print("SCORE: {}".format(m4.score(DF_TEST, FARE_TEST)))
DF_TEST.assign(target=FARE_TEST, predicted=m4.predict(DF_TEST))

SCORE: 0.5483909089949143


Unnamed: 0,pclass,sex,age,sibsp,parch,embarked,target,predicted
872,3,1,29.0,0,0,2,8,7.0
873,3,0,42.0,0,0,2,7,7.0
874,3,0,29.0,0,0,2,7,7.0
875,3,0,30.0,0,0,1,7,7.0
876,3,0,29.0,0,0,2,7,7.0
877,3,1,27.0,1,0,2,7,14.0
878,3,1,25.0,1,0,2,7,14.0
879,3,0,29.0,0,0,2,7,7.0
880,3,0,29.0,0,0,1,7,7.0
881,3,0,21.0,0,0,2,7,7.0


In [22]:
assert m4.score(DF_TEST, FARE_TEST) > m2.score(DF_TEST, FARE_TEST)