# Tutorial 10: Supervised Learning

__The goal of this assignment is to create two models that predict the ticket fare and the survival of Titanic passengers.__ 

The prediction models are built using supervised learning, a technique that relies on examples to learn a function.

To implement these models, you will use pandas and sklearn, two popular machine learning libraries for Python.

In the first section, your task is to engineer the features that will be used to train and test the models.

In the second section, you have to train the models and evaluate their performance on testing data.

__Grade scale__: 20 points
- __final model__: 3 points
- __correct answer__: 2 points
- __incorrect answer__: 0 points

__Further documentations__:
* https://docs.python.org/3/
* http://scikit-learn.org/stable/index.html
* http://scikit-learn.org/stable/supervised_learning.html

# Core

__VARIABLE DESCRIPTION__:

- __survival__        Survival(0 = No; 1 = Yes) / used for classification
- __pclass__          Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- __name__            Name
- __sex__             Sex
- __age__             Age
- __sibsp__           Number of Siblings/Spouses Aboard
- __parch__           Number of Parents/Children Aboard
- __ticket__          Ticket Number
- __fare__            Passenger Fare / used for regression
- __cabin__           Cabin
- __embarked__        Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [1]:
# import librairies
import pandas as pd
import inspect
import sklearn

In [2]:
# load the dataset with pandas
df = pd.read_csv("titanic.csv.gz")

In [3]:
# drop passengers without ticket fare or survival
df.dropna(subset=["fare", "survival"], inplace=True)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1308 entries, 0 to 1308
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1308 non-null   int64  
 1   survival  1308 non-null   int64  
 2   name      1308 non-null   object 
 3   sex       1308 non-null   object 
 4   age       1045 non-null   float64
 5   sibsp     1308 non-null   int64  
 6   parch     1308 non-null   int64  
 7   ticket    1308 non-null   object 
 8   fare      1308 non-null   float64
 9   cabin     295 non-null    object 
 10  embarked  1306 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 122.6+ KB


# Features

__1. Drop the `name`, `ticket` and `cabin` columns from the dataframe__
- __explaination__: these variables are too specific to predict `fare` and `survival`

In [9]:
def Q1(df):
    ### BEGIN SOLUTION
    df_dropped = df.drop(["name", "ticket", "cabin"], axis=1)
    return df_dropped
    ### END SOLUTION
df_dropped1=Q1(df)
df_dropped1.head()

Unnamed: 0,pclass,survival,sex,age,sibsp,parch,fare,embarked
0,1,1,female,29.0,0,0,211.3375,S
1,1,1,male,0.9167,1,2,151.55,S
2,1,0,female,2.0,1,2,151.55,S
3,1,0,male,30.0,1,2,151.55,S
4,1,0,female,25.0,1,2,151.55,S


__2. Return `survival` and `fare` columns in addition to the dataframe without these columns__
- __hint__: you can return multiple variables in Python by separating values with a comma
- __explaination__: we must separate the variables we want to predict from the rest

In [10]:
def Q2(df):
    ### BEGIN SOLUTION
    return df["fare"], df["survival"], df.drop(["fare", "survival"], axis=1)
    ### END SOLUTION
fare, survival, df_dropped2 = Q2(df_dropped1)

df_dropped2.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,embarked
0,1,female,29.0,0,0,S
1,1,male,0.9167,1,2,S
2,1,female,2.0,1,2,S
3,1,male,30.0,1,2,S
4,1,female,25.0,1,2,S


__3. Fill the missing values from the dataframe: `age` by its mean (29) and `embarked` by its mode ('S'):__
- __explaination__: most supervised learning algorithms do not support missing values, they must be handled explicitly

In [11]:
def Q3(df):
    ### BEGIN SOLUTION
    df["age"] = df["age"].fillna(29)
    df["embarked"] = df["embarked"].fillna('S')
    return df
    ### END SOLUTION
    
df3 = Q3(df_dropped2)

df3.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,embarked
0,1,female,29.0,0,0,S
1,1,male,0.9167,1,2,S
2,1,female,2.0,1,2,S
3,1,male,30.0,1,2,S
4,1,female,25.0,1,2,S


__4. Replace `sex` and `embarked` string values by numerical values: sex \[male -> 0, female -> 1\], embarked \[C -> 1, S -> 2, Q -> 3\]__
- __explaination__: most supervised learning algorithms require numerical values to operate

In [16]:
def Q4(df):
    ### BEGIN SOLUTION
    df["sex"]=df["sex"].replace(['male', 'female'], [0, 1])
    df["embarked"]=df["embarked"].replace(['C', 'S', 'Q'], [1,2,3])
    return df
    ### END SOLUTION
    
df4=Q4(df3)
df4.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,embarked
0,1,1,29.0,0,0,2
1,1,0,0.9167,1,2,2
2,1,1,2.0,1,2,2
3,1,0,30.0,1,2,2
4,1,1,25.0,1,2,2


__5. Create a function that returns the first 2/3 of a dataframe for training and the last 1/3 for testing__
- __explaination__: it is important to ensure that an algorithm is performing well on unseen example (testing)

In [33]:
#def split_data(df, train_perc = 0.8):

#   df['train'] = np.random.rand(len(df)) < train_perc

#   train = df[df.train == 1]

#   test = df[df.train == 0]

#   split_data ={'train': train, 'test': test}

#   return split_data

#https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas

def Q5(df):
    ### BEGIN SOLUTION
    row_count = df.shape[0]
    split_point = int(row_count*(2/3))
    train_data, test_data = df[:split_point], df[split_point:]
    return train_data, test_data
    ### END SOLUTION
    
survival_train_data, survival_test_data = Q5(survival)
print("SURVIVAL: train = {}, test = {}".format(len(survial_train_data), len(survial_test_data)))


fare_train_data, fare_test_data = Q5(fare.astype('int'))
print("FARE: train = {}, test = {}".format(len(fare__train_data), len(fare_test_data)))

df_train_data, df_test_data = Q5(df4)
print("DF: train = {}, test = {}".format(len(df_train_data), len(df_test_data)))

SURVIVAL: train = 872, test = 436
FARE: train = 872, test = 436
DF: train = 872, test = 436


# Models

__1. Create and train a baseline model based on LinearSVC that predicts the survival of Titanic passengers.__
- __note__: you should use the default parameters with random_state = 0
- __note__: the score method returns the mean accuracy (% of correct answers)

http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

In [34]:
### BEGIN SOLUTION
from sklearn.svm import LinearSVC
def Model1(features, target):
    ### BEGIN SOLUTION
    model = LinearSVC(random_state=0)
    model.fit(features, target)
    return model
    ### END SOLUTION
    
model1 = Model1(df_train_data, survival_train_data)

print("SCORE: {}".format(model1.score(df_test_data, survival_test_data)))
df_test_data.assign(target=survival_test_data, predicted=model1.predict(df_test_data))
### END SOLUTION

SCORE: 0.7477064220183486




Unnamed: 0,pclass,sex,age,sibsp,parch,embarked,target,predicted
872,3,1,29.0,0,0,2,1,0
873,3,0,42.0,0,0,2,0,0
874,3,0,29.0,0,0,2,1,0
875,3,0,30.0,0,0,1,0,0
876,3,0,29.0,0,0,2,0,0
...,...,...,...,...,...,...,...,...
1304,3,1,14.5,1,0,1,0,1
1305,3,1,29.0,1,0,1,0,0
1306,3,0,26.5,0,0,1,0,0
1307,3,0,27.0,0,0,1,0,0


__2. Create and train a baseline model based on LogisticRegression that predict the ticket fare of Titanic passengers.__
- __note__: you should use the default parameters with a random_state = 0
- __note__: the score method returns the mean accuracy (% of correct answers)

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [28]:
### BEGIN SOLUTION
from sklearn.linear_model import LogisticRegression

def Model2(features, target):
    ### BEGIN SOLUTION
    model = LogisticRegression(random_state=0)
    model.fit(features, target)
    
    return model
    ### END SOLUTION
    
model2 = Model2(df_train_data, fare_train_data)

print("SCORE: {}".format(model2.score(df_test_data, fare_test_data)))
df_test_data.assign(target=fare_test_data, predicted=model2.predict(df_test_data))
### END SOLUTION

SCORE: 0.45642201834862384


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Unnamed: 0,pclass,sex,age,sibsp,parch,embarked,target,predicted
872,3,1,29.0,0,0,2,8,7
873,3,0,42.0,0,0,2,7,7
874,3,0,29.0,0,0,2,7,7
875,3,0,30.0,0,0,1,7,7
876,3,0,29.0,0,0,2,7,7
...,...,...,...,...,...,...,...,...
1304,3,1,14.5,1,0,1,14,7
1305,3,1,29.0,1,0,1,14,7
1306,3,0,26.5,0,0,1,7,7
1307,3,0,27.0,0,0,1,7,7


__3. Create and train a classification model that surpasses the current model in predicting passenger survival (M1)__
- __hint__: you can either change the model parameters or the choice of algorithm
- __note__: use a fixed random state to ensure the stability of your predictions

In [50]:
#https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier

#from sklearn.linear_model import RidgeClassifier
#from sklearn.linear_model import Perceptron


### BEGIN SOLUTION
from sklearn.linear_model import SGDClassifier

def Model3(features, target):
    ### BEGIN SOLUTION
    model = SGDClassifier(random_state=0)
    model.fit(features, target)
    
    return model
    ### END SOLUTION
    
model3 = Model3(df_train_data, survival_train_data)

print("SCORE: {}".format(model3.score(df_test_data, survival_test_data)))
df_test_data.assign(target=survival_test_data, predicted=model2.predict(df_test_data))
### END SOLUTION

SCORE: 0.7614678899082569


Unnamed: 0,pclass,sex,age,sibsp,parch,embarked,target,predicted
872,3,1,29.0,0,0,2,1,7
873,3,0,42.0,0,0,2,0,7
874,3,0,29.0,0,0,2,1,7
875,3,0,30.0,0,0,1,0,7
876,3,0,29.0,0,0,2,0,7
...,...,...,...,...,...,...,...,...
1304,3,1,14.5,1,0,1,0,7
1305,3,1,29.0,1,0,1,0,7
1306,3,0,26.5,0,0,1,0,7
1307,3,0,27.0,0,0,1,0,7


__4. Create and train a regression model that surpasses the current model in predicting passenger ticket fare (M2)__
- __hint__: you can either change the model parameters or the choice of algorithm
- __note__: use a fixed random state to ensure the stability of your predictions

In [47]:
#https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor
#from sklearn.linear_model import Lars
#from sklearn.linear_model import Ridge

### BEGIN SOLUTION
from sklearn.linear_model import RANSACRegressor

def Model4(features, target):
    ### BEGIN SOLUTION
    model = RANSACRegressor()
    model.fit(features, target)
    
    return model
    ### END SOLUTION
    
model4 = Model4(df_train_data, fare_train_data)

print("SCORE: {}".format(model4.score(df_test_data, fare_test_data)))
df_test_data.assign(target=fare_test_data, predicted=model4.predict(df_test_data))
### END SOLUTION

SCORE: 0.6979757477985247


Unnamed: 0,pclass,sex,age,sibsp,parch,embarked,target,predicted
872,3,1,29.0,0,0,2,8,7.742748
873,3,0,42.0,0,0,2,7,4.876720
874,3,0,29.0,0,0,2,7,4.548815
875,3,0,30.0,0,0,1,7,5.866844
876,3,0,29.0,0,0,2,7,4.548815
...,...,...,...,...,...,...,...,...
1304,3,1,14.5,1,0,1,14,15.044129
1305,3,1,29.0,1,0,1,14,15.409870
1306,3,0,26.5,0,0,1,7,5.778562
1307,3,0,27.0,0,0,1,7,5.791174
