Your goal is now to predict whether a passenger survived, based on what we know about them (age, gender...).

In [None]:
import pandas as pd
import numpy as np

# Load the data as a dataframe
df = pd.read_csv('./../data/titanic_train.csv')

In [None]:
Y_truth = list(1.0*df['Survived'])

In [None]:
def compute_accuracy(Y, Y_pred):
    """
    Compute the score of a prediction, i.e. the fraction of passengers
    whose survival state is predicted correctly.
    Y should be the list of true outcomes.
    Y_pred should be the list of predicted outcomes, in the same order.
    Returns 1.0 if the survival state of every passenger is predicted correctly.
    """
    assert type(Y) is list
    assert type(Y_pred) is list
    assert len(Y) == len(Y_pred)
    n_success = [(1 if round(y_pred) == y else 0) for (y, y_pred) in zip(Y, Y_pred)]
    return 1.0*sum(n_success)/len(Y)

In [None]:
df

## Making simple predictors

We have 891 passengers. For each of them, we are going to attribute:

  * 0 if we predict they are going to die
  * 1 if we predict they are going to survive

The goal is then to compare our prediction against the real results to see how accurate we are.

We are 100% accurate when we predict the correct outcome for everyone, and 0% accurate when every prediction is wrong.
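As a small worked example of this score, here is a sketch on five made-up passengers. The two lists are invented for illustration and do not come from the Titanic file; the counting logic mirrors what `compute_accuracy` does.

```python
# Toy data, invented for illustration (not taken from the Titanic file).
y_true = [1.0, 0.0, 0.0, 1.0, 0.0]  # real outcomes
y_pred = [1.0, 0.0, 1.0, 0.0, 0.0]  # our predictions

# A prediction counts as a success when it equals the true outcome.
n_correct = sum(1 for y, yp in zip(y_true, y_pred) if round(yp) == y)

# Accuracy is the share of successes among all passengers.
accuracy = n_correct / len(y_true)
print(accuracy)
```

Three of the five predictions match, so the accuracy is 3/5 = 0.6.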

The ideal goal is to build the best **prediction model** based on all the available passenger data (age, class, gender...).

### What if we predict everyone dies?

In this case we simply attribute a *0* to everyone.

The output shows *~62%*, because *62%* of passengers actually lost their lives.

In [None]:
Y_pred_everyone_dies = [0.0]*len(df)
# This will return 0.616, as asserting that everybody
# died is true for 62% of the population.
compute_accuracy(Y_truth, Y_pred_everyone_dies)

### What about using the Pclass information?

The data exploration showed a bias linked to the ticket class: first-class passengers were more likely to survive.

Let's therefore build a finer model, predicting a favorable outcome for the first class.

The output now shows a better score of about *68%*!


In [None]:
#We build a Y_pred_with_class list, and for each passenger, we set 0 or 1 into it
Y_pred_with_class = []
for passenger in df.to_dict(orient='records'):
    if passenger['Pclass'] == 1:
        yp = 1
    else:
        yp = 0
    # we then add the passenger prediction to the whole list
    Y_pred_with_class.append(yp)

#we can now check how relevant it is
compute_accuracy(Y_truth, Y_pred_with_class)
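To see why a class-based rule can help, it may be useful to measure the survival rate inside each group before writing the rule. The sketch below does this on a few hypothetical records shaped like the output of `df.to_dict(orient='records')`. The helper name `survival_rate_by` and the sample values are assumptions made for illustration, not part of the notebook.

```python
from collections import defaultdict

# Hypothetical records, shaped like df.to_dict(orient='records').
# The values are invented and do not reflect the real dataset.
passengers = [
    {'Pclass': 1, 'Survived': 1},
    {'Pclass': 1, 'Survived': 1},
    {'Pclass': 1, 'Survived': 0},
    {'Pclass': 2, 'Survived': 1},
    {'Pclass': 2, 'Survived': 0},
    {'Pclass': 3, 'Survived': 0},
    {'Pclass': 3, 'Survived': 0},
    {'Pclass': 3, 'Survived': 1},
    {'Pclass': 3, 'Survived': 0},
]

def survival_rate_by(records, key):
    """Return {group value: share of survivors} for the given column (illustrative helper)."""
    counts = defaultdict(lambda: [0, 0])  # group value -> [survivors, total]
    for record in records:
        counts[record[key]][0] += record['Survived']
        counts[record[key]][1] += 1
    return {value: survivors / total for value, (survivors, total) in counts.items()}

rates = survival_rate_by(passengers, 'Pclass')
print(rates)
```

On the real data, you could pass `df.to_dict(orient='records')` and any column name. A group whose rate is above 0.5 is a natural candidate for predicting 1.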

## Your turn!
### Make a prediction based on gender

In [None]:
Y_pred_with_gender = []

for passenger in df.to_dict(orient='records'):
    ####### insert your condition here, based on the same template as above

    ####### end of your code
    Y_pred_with_gender.append(yp)

#we can now check how relevant it is
compute_accuracy(Y_truth, Y_pred_with_gender)

### Be more imaginative: create your own model!
### This is a contest

The idea could be to combine several factors. The conditions below only illustrate the structure of such a rule:

    ####### insert your condition here, based on the same template as above
    if passenger['Pclass'] >= 2:
        if passenger['Embarked'] == 'S':
            yp = 1
        else:
            yp = 0
    else:
        if passenger['Sex'] == 'female':
            yp = 0
        else:
            yp = 1
    ####### end of your code


In [None]:
Y_pred_my_model_1 = []

for passenger in df.to_dict(orient='records'):
    ####### insert your condition here, based on the same template as above

    ####### end of your code
    Y_pred_my_model_1.append(yp)

#we can now check how relevant it is
compute_accuracy(Y_truth, Y_pred_my_model_1)

You can of course copy and paste the code above into new cells to test other models.
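If you end up testing many models, one possible way to keep them organised is to write each rule as a small function and score them all in one loop. This is only a sketch on invented records: the rule names, the `accuracy` helper and the age threshold are hypothetical choices, not recommended models.

```python
def accuracy(y_true, y_pred):
    """Share of predictions equal to the true outcome (illustrative helper)."""
    return sum(1 for y, yp in zip(y_true, y_pred) if yp == y) / len(y_true)

# Each rule takes one passenger record and returns 0 (dies) or 1 (survives).
def everyone_dies(passenger):
    return 0

def first_class_survives(passenger):
    return 1 if passenger['Pclass'] == 1 else 0

def first_class_or_child(passenger):
    # Hypothetical rule: the threshold of 10 years is arbitrary.
    return 1 if passenger['Pclass'] == 1 or passenger['Age'] < 10 else 0

# Invented records, shaped like df.to_dict(orient='records').
passengers = [
    {'Pclass': 1, 'Age': 38, 'Survived': 1},
    {'Pclass': 3, 'Age': 4, 'Survived': 1},
    {'Pclass': 3, 'Age': 22, 'Survived': 0},
    {'Pclass': 2, 'Age': 30, 'Survived': 0},
    {'Pclass': 1, 'Age': 54, 'Survived': 0},
]
y_true = [p['Survived'] for p in passengers]

models = {
    'everyone_dies': everyone_dies,
    'first_class_survives': first_class_survives,
    'first_class_or_child': first_class_or_child,
}

# Score every rule on the same passengers so the results are comparable.
scores = {
    name: accuracy(y_true, [rule(p) for p in passengers])
    for name, rule in models.items()
}
print(scores)
```

With the real dataframe, `passengers` would be `df.to_dict(orient='records')`, and the existing `compute_accuracy` could replace the helper, provided the predictions are passed as a list.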
