# TPOT tutorial on the Titanic dataset 

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.  The Titanic machine learning competition on [Kaggle](https://www.kaggle.com/c/titanic) is one of the most popular beginner's competitions on the platform. We will use that competition here to demonstrate the implementation of [TPOT](https://epistasislab.github.io/tpot/). 

In [1]:
# Import required libraries
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd 
import numpy as np

In [2]:
# Load the data
titanic = pd.read_csv('data/titanic_train.csv')
titanic.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Dataset Exploration 

In [3]:
titanic.head().T

Unnamed: 0,0,1,2,3,4
PassengerId,1,2,3,4,5
Survived,0,1,1,1,0
Pclass,3,1,3,1,3
Name,"Braund, Mr. Owen Harris","Cumings, Mrs. John Bradley (Florence Briggs Th...","Heikkinen, Miss. Laina","Futrelle, Mrs. Jacques Heath (Lily May Peel)","Allen, Mr. William Henry"
Sex,male,female,female,female,male
Age,22,38,26,35,35
SibSp,1,1,0,1,0
Parch,0,0,0,0,0
Ticket,A/5 21171,PC 17599,STON/O2. 3101282,113803,373450
Fare,7.25,71.2833,7.925,53.1,8.05


In [4]:
titanic.groupby('Sex').Survived.value_counts()

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64

In [5]:
titanic.groupby(['Pclass','Sex']).Survived.value_counts()

Pclass  Sex     Survived
1       female  1            91
                0             3
        male    0            77
                1            45
2       female  1            70
                0             6
        male    0            91
                1            17
3       female  0            72
                1            72
        male    0           300
                1            47
Name: Survived, dtype: int64

In [6]:
ids = pd.crosstab([titanic.Pclass, titanic.Sex], titanic.Survived.astype(float))
ids.div(ids.sum(1).astype(float), 0)

Unnamed: 0_level_0,Survived,0.0,1.0
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1
1,female,0.031915,0.968085
1,male,0.631148,0.368852
2,female,0.078947,0.921053
2,male,0.842593,0.157407
3,female,0.5,0.5
3,male,0.864553,0.135447


## Feature Engineering

The main objective here is to test how good TPOT is, so I won't put a lot of effort into feature engineering.



The first and most important step in using TPOT on any data set is to rename the target class/response variable to `class`.

In [7]:
titanic.rename(columns={'Survived': 'class'}, inplace=True)

At present, TPOT requires all the data to be in numerical format. As we can see below, our data set has 5 categorical variables which contain non-numerical values: `Name`, `Sex`, `Ticket`, `Cabin` and `Embarked`.

In [8]:
titanic.dtypes

PassengerId      int64
class            int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
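
The non-numeric columns can also be listed programmatically instead of being read off the `dtypes` output. A minimal sketch on a toy frame (the `toy` data is made up for illustration; `select_dtypes` is the standard pandas method):

```python
import pandas as pd

# Toy stand-in for the Titanic frame: two numeric and two text columns
toy = pd.DataFrame({
    'Pclass': [3, 1],
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
    'Sex': ['male', 'female'],
    'Fare': [7.25, 7.925],
})

# Columns whose dtype is not numeric still need encoding (or dropping)
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)
```

Applied to `titanic`, the same call should list the five columns named above.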

We then check the number of levels that each of the five categorical variables has.

In [9]:
for cat in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
    print("Number of levels in category '{0}': \b {1:2.2f} ".format(cat, titanic[cat].unique().size))

Number of levels in category 'Name':  891.00 
Number of levels in category 'Sex':  2.00 
Number of levels in category 'Ticket':  681.00 
Number of levels in category 'Cabin':  148.00 
Number of levels in category 'Embarked':  4.00 


As we can see, `Sex` and `Embarked` have few levels. Let's find out what they are.

In [10]:
for cat in ['Sex', 'Embarked']:
    print("Levels for category '{0}': {1}".format(cat, titanic[cat].unique()))

Levels for category 'Sex': ['male' 'female']
Levels for category 'Embarked': ['S' 'C' 'Q' nan]


In [11]:

# Family Size
titanic['FamilySize'] = titanic['SibSp'] + titanic['Parch'] + 1

# Is Alone?
titanic['IsAlone'] = 0
titanic.loc[titanic['FamilySize'] == 1, 'IsAlone'] = 1

Social status could be a factor in survival. It can be inferred from the length of the name and from the title prefix, so let's extract and categorize them.


In [12]:
# Name Length
titanic['Name_length'] = titanic['Name'].apply(len)
# Title: extract the salutation that precedes the period (e.g. "Mr", "Mrs")
titanic['Title'] = titanic.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
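
The title extraction relies on a regular expression that captures the run of letters right before a period. A small sketch on three names taken from the table at the top of the notebook (`expand=False` makes `str.extract` return a Series instead of a one-column DataFrame):

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Th...',
    'Heikkinen, Miss. Laina',
])

# Capture the letters immediately before the first period in each name
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

Only the first match per name is kept, so the truncated `Th...` in the second name does not interfere.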


In [13]:
titanic['Title'].value_counts()/ titanic.shape[0]

Mr          0.580247
Miss        0.204265
Mrs         0.140292
Master      0.044893
Dr          0.007856
Rev         0.006734
Col         0.002245
Major       0.002245
Mlle        0.002245
Lady        0.001122
Ms          0.001122
Capt        0.001122
Sir         0.001122
Don         0.001122
Mme         0.001122
Countess    0.001122
Jonkheer    0.001122
Name: Title, dtype: float64

In [14]:
# Group rare titles into broader categories
titanic['Title'] = titanic['Title'].replace({
    'Mlle': 'Miss', 'Mme': 'Mrs', 'Ms': 'Mrs', 'Lady': 'Mrs', 'Dr': 'scholar', 'Rev': 'scholar',
    'Major': 'Officer', 'Col': 'Officer', 'Capt': 'Officer',
    'Countess': 'Noble', 'Jonkheer': 'Noble', 'Sir': 'Noble', 'Don': 'Noble', 'Master': 'Noble'})

In [15]:
df = titanic.copy()
df.drop(labels=['Name'],axis=1, inplace=True)

In [16]:
# k-nearest-neighbours imputation of missing values (this code needs some refactoring)
from sklearn import neighbors
import numpy as np

class Imputer:
    """Imputer class."""
    def __init__(self):
        pass

    def _fit(self, X, column, k=10, is_categorical=False):
        """Fit a knn classifier for missing column.
        - Args:
                X(numpy.ndarray): input data
                column(int): column id to be imputed
                k(int): number of nearest neighbors, default 10
                is_categorical(boolean): is continuous or categorical feature
        - Returns:
                clf: trained k nearest neighbour classifier
        """
        clf = None
        if not is_categorical:
            clf = neighbors.KNeighborsRegressor(n_neighbors=k)
        else:
            clf = neighbors.KNeighborsClassifier(n_neighbors=k)
        # use column not null to train the kNN classifier
        missing_idxes = np.where(pd.isnull(X[:, column]))[0]
        if len(missing_idxes) == 0:
            return None
        X_copy = np.delete(X, missing_idxes, 0)
        X_train = np.delete(X_copy, column, 1)
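        # if other columns still have missing values, fill with mean (or median)
        if not is_categorical:
            col_mean = np.nanmean(X, 0)
        else:
            col_mean = np.nanmedian(X, 0)
        # drop the target column so the fill values line up with X_train's columns
        col_mean = np.delete(col_mean, column)
        for col_id in range(len(col_mean)):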
            col_missing_idxes = np.where(np.isnan(X_train[:, col_id]))[0]
            if len(col_missing_idxes) == 0:
                continue
            else:
                X_train[col_missing_idxes, col_id] = col_mean[col_id]
        y_train = X_copy[:, column]
        # fit classifier
        clf.fit(X_train, y_train)
        return clf

    def _transform(self, X, column, clf, is_categorical):
        """Impute missing values.
        - Args:
                X(numpy.ndarray): input numpy ndarray
                column(int): index of column to be imputed
                clf: pretrained classifier
                is_categorical(boolean): is continuous or categorical feature
        - Returns:
                X(pandas.dataframe): imputed dataframe
        """
        missing_idxes = np.where(np.isnan(X[:, column]))[0]
        X_test = X[missing_idxes, :]
        X_test = np.delete(X_test, column, 1)
        # if other columns still have missing values, fill with mean (or median)
        if not is_categorical:
            col_mean = np.nanmean(X, 0)
        else:
            col_mean = np.nanmedian(X, 0)
        # drop the target column so the fill values line up with X_test's columns
        col_mean = np.delete(col_mean, column)
        for col_id in range(len(col_mean)):
            col_missing_idxes = np.where(np.isnan(X_test[:, col_id]))[0]
            # if no missing values for current column
            if len(col_missing_idxes) == 0:
                continue
            else:
                X_test[col_missing_idxes, col_id] = col_mean[col_id]
        # predict missing values
        y_test = clf.predict(X_test)
        print(y_test)
        X[missing_idxes, column] = y_test
        return X



    def _check_X_y(self, X, column):
        """Check input, if pandas.dataframe, transform to numpy array.
        - Args:
                X(ndarray/pandas.dataframe): input instances
                column(str/int): column index or column name
        - Returns:
                X(ndarray): input instances
        """
        column_idx = None
        X = X.select_dtypes(include=[np.number])

        if isinstance(X, pd.core.frame.DataFrame):
            if isinstance(column, str):
                # get index of current column
                column_idx = X.columns.get_loc(column)
            else:
                column_idx = column
            X = X.to_numpy()  # as_matrix() was removed in pandas 1.0
        else:
            column_idx = column
        return X, column_idx
    
    
    def knn(self, X, column, k=10, is_categorical=False):
        """Impute missing value with knn.
        - Args:
                X(pandas.dataframe): dataframe
                column(str): column name to be imputed
                k(int): number of nearest neighbors, default 10
                is_categorical(boolean): is continuous or categorical feature
        - Returns:
                X_imputed(pandas.dataframe): imputed pandas dataframe
        """
        X, column_id = self._check_X_y(X, column)
        clf = self._fit(X, column_id, k, is_categorical)
        if clf is None:
            return X
        else:
            X_imputed = self._transform(X, column_id, clf, is_categorical)
            return X_imputed[:,column_id]
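
For comparison, scikit-learn (0.22 and later) ships a `KNNImputer` that covers a similar use case. It is not the same algorithm as the class above: it fills every column in one pass and measures distance with a NaN-aware Euclidean metric, so the imputed values will generally differ. A hedged sketch on made-up data (`toy` and `imputer_sketch` are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up numeric data with gaps in Age and Fare
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3, 2],
    'Age': [38.0, 35.0, 22.0, np.nan, 26.0, np.nan],
    'Fare': [71.28, 53.10, 7.25, 8.05, np.nan, 13.00],
})

# Each missing entry becomes the mean of that feature over the 2 nearest rows
imputer_sketch = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer_sketch.fit_transform(toy), columns=toy.columns)
print(filled)
```

Observed values are left untouched; only the `NaN` entries are replaced.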

In [17]:
imputer = Imputer()

In [18]:
for c in df.columns:
    missing = df[c].isnull().sum()
    if missing > 0:
        print(c, missing)

Age 177
Cabin 687
Embarked 2


In [19]:
titanic['Fare'].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [20]:
df['Fare'] = df['Fare'].replace(0, np.nan)  # treat a fare of 0 as missing

In [21]:
df.head()

Unnamed: 0,PassengerId,class,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize,Name_length,IsAlone,Title
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S,2,23,0,Mr
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C,2,51,0,Mrs
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S,1,22,1,Miss
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S,2,44,0,Mrs
4,5,0,3,male,35.0,0,0,373450,8.05,,S,1,24,1,Mr


In [22]:
for c in [ 'Fare','Age']:
    missing = df[c].isnull().sum()
    if missing > 0:
        df[c] = imputer.knn(df, c, k=20,  )
        


[21.438125 61.78271  30.990415 22.40229  44.4325   15.348745 21.23979
 22.401045 27.56958  24.583125 47.49042  53.356665 16.588125 13.368545
 20.208335]
[26.85   26.2    25.4    25.35   26.05   26.3    31.5    25.525  24.875
 24.975  25.275  22.475  23.775  22.225  22.4665 18.9165 20.8665 25.1165
 25.1165 26.4415 27.8    27.825  30.     29.075  30.55   28.275  26.075
 24.8    29.575  27.25   26.5    29.725  28.375  29.35   26.     27.25
 29.275  29.     24.45   26.525  24.725  30.5    25.75   26.175  29.275
 26.625  29.925  30.225  33.4    34.95   34.6    34.     33.55   31.675
 30.85   29.525  30.375  28.775  29.375  28.125  30.825  29.375  31.596
 29.725  33.925  28.896  31.375  28.5    27.525  29.15   30.1    27.2
 26.2    23.45   24.85   29.1    22.3    23.55   27.1    26.     26.
 27.1    25.3    25.35   24.85   26.4    24.8    32.8    34.5875 31.8375
 34.0875 30.3375 35.0875 33.3375 34.2375 35.0375 37.2375 32.6375 30.2375
 29.7375 30.5    30.95   27.75   32.7    28.775  27.975  2



In [23]:
for c in df.columns:
    missing = df[c].isnull().sum()
    if missing > 0:
        print(c, missing)

Cabin 687
Embarked 2


In [24]:
# Fill NA
# Categoricals Variable
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode().iloc[0])


In [25]:
## Assign Binary to Sex str
df['Sex'] = df['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
# Embarked
df['Embarked'] = df['Embarked'].map( {'Q': 0, 'S': 1, 'C': 2} ).astype(int)



One hot encoding for the Titles.

In [26]:
one_hot = pd.get_dummies(df['Title'], prefix='T')
df = df.join(one_hot)
df.head()

Unnamed: 0,PassengerId,class,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,...,FamilySize,Name_length,IsAlone,Title,T_Miss,T_Mr,T_Mrs,T_Noble,T_Officer,T_scholar
0,1,0,3,0,22.0,1,0,A/5 21171,7.25,,...,2,23,0,Mr,0,1,0,0,0,0
1,2,1,1,1,38.0,1,0,PC 17599,71.2833,C85,...,2,51,0,Mrs,0,0,1,0,0,0
2,3,1,3,1,26.0,0,0,STON/O2. 3101282,7.925,,...,1,22,1,Miss,1,0,0,0,0,0
3,4,1,1,1,35.0,1,0,113803,53.1,C123,...,2,44,0,Mrs,0,0,1,0,0,0
4,5,0,3,0,35.0,0,0,373450,8.05,,...,1,24,1,Mr,0,1,0,0,0,0


In [27]:
# drop columns
# Cabin and Ticket probably contain useful information, but that is a battle for another day
df = df.drop(['PassengerId','Ticket','class', 'Title', 'Cabin'], axis=1)

In [28]:
np.isnan(df).any()

Pclass         False
Sex            False
Age            False
SibSp          False
Parch          False
Fare           False
Embarked       False
FamilySize     False
Name_length    False
IsAlone        False
T_Miss         False
T_Mr           False
T_Mrs          False
T_Noble        False
T_Officer      False
T_scholar      False
dtype: bool

The final dataset is now a fully numeric DataFrame. We can check its number of rows and features as follows.

In [29]:
df.shape

(891, 16)

In [32]:
df.head().T

Unnamed: 0,0,1,2,3,4
Pclass,3.0,1.0,3.0,1.0,3.0
Sex,0.0,1.0,1.0,1.0,0.0
Age,22.0,38.0,26.0,35.0,35.0
SibSp,1.0,1.0,0.0,1.0,0.0
Parch,0.0,0.0,0.0,0.0,0.0
Fare,7.25,71.2833,7.925,53.1,8.05
Embarked,1.0,2.0,1.0,1.0,1.0
FamilySize,2.0,2.0,1.0,2.0,1.0
Name_length,23.0,51.0,22.0,44.0,24.0
IsAlone,0.0,0.0,1.0,0.0,1.0


Finally we store the class labels, which we need to predict, in a separate variable. 

In [30]:
titanic_class = titanic['class'].values

## Using TPOT

To begin our analysis, we need to divide our training data into training and validation sets. The validation set is just to give us an idea of the test set error. Model selection and tuning are entirely taken care of by TPOT, so if we want to, we can skip creating this validation set.

In [33]:
training_indices, validation_indices = train_test_split(df.index, stratify=titanic_class, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size

(668, 223)
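
The `stratify` argument keeps the class balance roughly equal in both parts of the split. A small sketch with made-up labels (the 40/60 class ratio and `random_state=0` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 40 positives out of 100 samples
labels = np.array([1] * 40 + [0] * 60)
indices = np.arange(labels.size)

train_idx, valid_idx = train_test_split(
    indices, stratify=labels, train_size=0.75, test_size=0.25, random_state=0
)

# With stratification, the positive rate should be close to 0.4 in both parts
print(labels[train_idx].mean(), labels[valid_idx].mean())
```

Without `stratify`, the two rates can drift apart, which matters more on small datasets such as this one.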

In [34]:
training_indices

Int64Index([654, 668, 534, 435, 791,  53,  75, 614, 734, 427,
            ...
            368, 147, 465, 446, 445, 270, 625, 237, 250, 400],
           dtype='int64', length=668)

In [35]:
df.iloc[training_indices].head(3)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,FamilySize,Name_length,IsAlone,T_Miss,T_Mr,T_Mrs,T_Noble,T_Officer,T_scholar
654,3,1,18.0,0,0,6.75,0,1,28,1,1,0,0,0,0,0
668,3,0,43.0,0,0,8.05,1,1,15,1,0,1,0,0,0,0
534,3,1,30.0,0,0,8.6625,1,1,19,1,1,0,0,0,0,0


After that, we proceed to calling the `fit`, `score` and `export` functions on our training dataset. To get a better idea of how these functions work, refer to the TPOT documentation [here](http://epistasislab.github.io/tpot/api/).

An important TPOT parameter is the number of generations. On a standard laptop with 4GB RAM, a generation takes roughly 5 minutes, and each added generation adds about 5 minutes more; for the default value of 100, the total run time would be roughly 8 hours. Since our aim is just to illustrate the use of TPOT, the call below caps the total search time at 25 minutes with `max_time_mins` instead of fixing the number of generations.
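
The run-time figure is simple arithmetic on the numbers quoted above. A quick sketch, assuming a constant cost per generation (`estimated_runtime_hours` is a hypothetical helper, not part of TPOT):

```python
def estimated_runtime_hours(generations, minutes_per_generation=5.0):
    """Rough wall-clock estimate assuming a constant cost per generation."""
    return generations * minutes_per_generation / 60.0

# 5 generations for a quick demo versus the default of 100
for generations in (5, 100):
    print(generations, round(estimated_runtime_hours(generations), 2))
```

Actual timings depend on the hardware, the population size and the pipelines being evaluated, so this is only a planning aid.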

In [36]:
tpot = TPOTClassifier(verbosity=2, max_time_mins=25, max_eval_time_mins=0.1, population_size=100, scoring='roc_auc', n_jobs=-1)
tpot.fit(df.iloc[training_indices], titanic_class[training_indices])



Optimization Progress: 204pipeline [01:06,  2.05pipeline/s]                   

Generation 1 - Current best internal CV score: 0.8670957087371669


Optimization Progress: 308pipeline [02:01,  1.41s/pipeline]                   

Generation 2 - Current best internal CV score: 0.868879579680603


Optimization Progress: 410pipeline [03:06,  1.87s/pipeline]                   

Generation 3 - Current best internal CV score: 0.8706223193181067


Optimization Progress: 514pipeline [04:10,  1.34s/pipeline]                   

Generation 4 - Current best internal CV score: 0.8706223193181067


Optimization Progress: 619pipeline [05:11,  1.98s/pipeline]                   

Generation 5 - Current best internal CV score: 0.8745792794841212


Optimization Progress: 726pipeline [06:28,  2.70s/pipeline]                   

Generation 6 - Current best internal CV score: 0.8745792794841212


Optimization Progress: 833pipeline [07:45,  1.77s/pipeline]                   

Generation 7 - Current best internal CV score: 0.8745792794841212


Optimization Progress: 951pipeline [09:06,  3.07s/pipeline]                   

Generation 8 - Current best internal CV score: 0.8745792794841212


Optimization Progress: 1056pipeline [10:15,  1.82s/pipeline]                   

Generation 9 - Current best internal CV score: 0.8760565050179749


Optimization Progress: 1161pipeline [11:28,  1.68s/pipeline]                    

Generation 10 - Current best internal CV score: 0.8760565050179749


Optimization Progress: 1274pipeline [12:46,  1.74s/pipeline]                    

Generation 11 - Current best internal CV score: 0.8762857078906066


Optimization Progress: 1378pipeline [14:06,  1.75s/pipeline]                    

Generation 12 - Current best internal CV score: 0.8762857078906066


Optimization Progress: 1493pipeline [15:37,  3.48s/pipeline]                    

Generation 13 - Current best internal CV score: 0.8762857078906066


Optimization Progress: 1607pipeline [17:07,  2.33s/pipeline]                    

Generation 14 - Current best internal CV score: 0.8767886245345581


Optimization Progress: 1716pipeline [18:31,  3.24s/pipeline]

Generation 15 - Current best internal CV score: 0.8767886245345581


Optimization Progress: 1828pipeline [19:57,  2.22s/pipeline]

Generation 16 - Current best internal CV score: 0.8767886245345581


Optimization Progress: 1939pipeline [21:21,  2.73s/pipeline]

Generation 17 - Current best internal CV score: 0.8767886245345581


Optimization Progress: 2047pipeline [22:47,  2.24s/pipeline]

Generation 18 - Current best internal CV score: 0.8767886245345581


Optimization Progress: 2155pipeline [24:13,  2.23s/pipeline]

Generation 19 - Current best internal CV score: 0.8767886245345581


                                                            


25.2789158 minutes have elapsed. TPOT will close down.
TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: ExtraTreesClassifier(BernoulliNB(input_matrix, alpha=0.1, fit_prior=False), bootstrap=True, criterion=gini, max_features=0.7000000000000001, min_samples_leaf=11, min_samples_split=9, n_estimators=100)


TPOTClassifier(config_dict={'sklearn.naive_bayes.GaussianNB': {}, 'sklearn.naive_bayes.BernoulliNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.naive_bayes.MultinomialNB': {'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], 'fit_prior': [True, False]}, 'sklearn.tree.DecisionT....3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
       0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])}}}},
        crossover_rate=0.1, cv=5, disable_update_check=False,
        early_stop=None, generations=1000000, max_eval_time_mins=0.1,
        max_time_mins=25, memory=None, mutation_rate=0.9, n_jobs=12,
        offspring_size=100, periodic_checkpoint_folder=None,
        population_size=100, random_state=None, scoring=None,
        subsample=1.0, verbosity=2, warm_start=False)

It looks like the score improvement reaches a plateau after a few generations: the internal CV score does not change after generation 14.

In [37]:
tpot.score(df.iloc[validation_indices], titanic.loc[validation_indices, 'class'].values)

0.8737056526905449

In [38]:
# you can export the model
tpot.export('tpot_bestmodel.py')

True

Let's have a look at the generated code. As we can see, an `ExtraTreesClassifier` stacked on top of a `BernoulliNB` estimator performed best among the pipelines that TPOT evaluated on this dataset. Running TPOT for more generations might improve the score further.

In [40]:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator



In [41]:
# Score on the training set was:0.8758258062600254
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=BernoulliNB(alpha=0.1, fit_prior=False)),
    ExtraTreesClassifier(bootstrap=True, criterion="gini", max_features=0.7000000000000001, min_samples_leaf=11, min_samples_split=9, n_estimators=100)
)

exported_pipeline.fit(df, titanic['class'])



Pipeline(memory=None,
     steps=[('stackingestimator', StackingEstimator(estimator=BernoulliNB(alpha=0.1, binarize=0.0, class_prior=None, fit_prior=False))), ('extratreesclassifier', ExtraTreesClassifier(bootstrap=True, class_weight=None, criterion='gini',
           max_depth=None, max_features=0.7000000000000001,
         ...imators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False))])

### Make predictions on the submission data 

In [42]:
# Read in the submission dataset
titanic_sub = pd.read_csv('data/titanic_test.csv')
titanic_sub.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [43]:
# Family Size
titanic_sub['FamilySize'] = titanic_sub['SibSp'] + titanic_sub['Parch'] + 1
# Name Length
titanic_sub['Name_length'] = titanic_sub['Name'].apply(len)
# Is Alone?
titanic_sub['IsAlone'] = 0
titanic_sub.loc[titanic_sub['FamilySize'] == 1, 'IsAlone'] = 1

An important step here is to check for new levels in the categorical variables of the submission dataset that are absent from the training set, since the model does not know what to do with them. In this notebook the only such case is the title `Dona`, which produces a `T_Dona` column after one-hot encoding; that column is discarded at prediction time, when only the training columns are selected.
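
For one-hot encoded columns, one way to handle levels that appear only in the submission data is to reindex the submission dummies to the training columns. A minimal sketch on made-up titles, not the notebook's actual frames:

```python
import pandas as pd

# Made-up title columns: 'Dona' only occurs in the submission data
train_titles = pd.Series(['Mr', 'Mrs', 'Miss', 'Noble'])
test_titles = pd.Series(['Mr', 'Dona', 'Miss'])

train_dummies = pd.get_dummies(train_titles, prefix='T')
test_dummies = pd.get_dummies(test_titles, prefix='T')

# Keep exactly the training columns: unseen levels are dropped,
# levels missing from the submission data are added as all-zero columns
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned.columns.tolist())
```

A row whose title was unseen ends up with zeros in every title column, which is what selecting the training columns at prediction time also produces.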

In [44]:
# Extract the salutations and group them exactly as for the training data
titanic_sub['Title'] = titanic_sub.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
titanic_sub['Title'] = titanic_sub['Title'].replace({
    'Mlle': 'Miss', 'Mme': 'Mrs', 'Ms': 'Mrs', 'Lady': 'Mrs', 'Dr': 'scholar', 'Rev': 'scholar',
    'Major': 'Officer', 'Col': 'Officer', 'Capt': 'Officer',
    'Countess': 'Noble', 'Jonkheer': 'Noble', 'Sir': 'Noble', 'Don': 'Noble', 'Master': 'Noble'})

In [45]:
titanic_sub['Fare'] = titanic_sub['Fare'].replace(0, np.nan)  # treat a fare of 0 as missing

In [46]:
for c in titanic_sub.columns:
    missing = titanic_sub[c].isnull().sum()
    if missing > 0:
        print(c, missing)

Age 86
Fare 3
Cabin 327


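We then carry out the same data munging steps as for the training dataset. Note that the `Embarked` mapping in the next cell (`{'S': 0, 'C': 1, 'Q': 2}`) differs from the one used for training (`{'Q': 0, 'S': 1, 'C': 2}`); the two should be identical so that the encoded feature means the same thing to the model.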

In [47]:
titanic_sub['Sex'] = titanic_sub['Sex'].map({'male':0,'female':1})
titanic_sub['Embarked'] = titanic_sub['Embarked'].map({'S':0,'C':1,'Q':2})
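
One way to keep the training and submission encodings identical is to define the mappings once and apply them to both frames. A minimal sketch: `encode_categoricals`, `SEX_MAP` and `EMBARKED_MAP` are hypothetical names, the mapping values are the ones used in the training cell earlier, and the two toy frames are made up.

```python
import pandas as pd

# Single source of truth for the encodings (values from the training cell)
SEX_MAP = {'female': 1, 'male': 0}
EMBARKED_MAP = {'Q': 0, 'S': 1, 'C': 2}

def encode_categoricals(frame):
    """Return a copy of `frame` with Sex and Embarked mapped to integers."""
    encoded = frame.copy()
    encoded['Sex'] = encoded['Sex'].map(SEX_MAP)
    encoded['Embarked'] = encoded['Embarked'].map(EMBARKED_MAP)
    return encoded

train_toy = pd.DataFrame({'Sex': ['male', 'female'], 'Embarked': ['S', 'C']})
sub_toy = pd.DataFrame({'Sex': ['female', 'male'], 'Embarked': ['Q', 'S']})

train_enc = encode_categoricals(train_toy)
sub_enc = encode_categoricals(sub_toy)
print(train_enc)
print(sub_enc)
```

With a shared helper, the port `S` maps to the same integer in both frames, so the fitted model sees a consistent feature.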

In [48]:
one_hot = pd.get_dummies(titanic_sub['Title'], prefix='T')
dftest = titanic_sub.join(one_hot)
dftest.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,...,Name_length,IsAlone,Title,T_Dona,T_Miss,T_Mr,T_Mrs,T_Noble,T_Officer,T_scholar
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,...,16,1,Mr,0,0,1,0,0,0,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,...,32,0,Mrs,0,0,0,1,0,0,0
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,...,25,1,Mr,0,0,1,0,0,0,0
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,...,16,1,Mr,0,0,1,0,0,0,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,...,44,0,Mrs,0,0,0,1,0,0,0


In [49]:
for c in [ 'Fare','Age']:
    missing = dftest[c].isnull().sum()
    if missing > 0:
        dftest[c] = imputer.knn(dftest, c, k=12)

[57.46700833 22.75764167 61.76665833]
[29.41666667 37.5        36.04166667 33.04166667 28.125      34.5
 32.08333333 27.91666667 28.25       22.75       28.875      22.33333333
 22.58333333 23.66666667 23.         22.         21.66666667 20.70833333
 24.75       24.33333333 22.25       22.16666667 24.5        26.33333333
 26.25       26.5        28.25       27.45833333 35.95833333 27.25
 26.375      26.16666667 26.08333333 27.33333333 26.66666667 26.66666667
 29.70833333 34.79166667 27.45833333 25.06916667 28.5        21.56916667
 24.91666667 28.04166667 28.04166667 33.4025     24.79166667 24.45833333
 24.         32.45166667 25.625      22.33333333 18.535      22.20833333
 23.02083333 22.54166667 20.60416667 23.20833333 18.97916667 19.72916667
 21.5625     22.0625     22.8125     22.47916667 24.8125     21.38166667
 23.81916667 26.73583333 27.06916667 22.56916667 26.         26.16666667
 36.79166667 26.43083333 28.91666667 28.91666667 38.29166667 31.41666667
 29.58333333 27.41666667 2



In [50]:
df.columns

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked',
       'FamilySize', 'Name_length', 'IsAlone', 'T_Miss', 'T_Mr', 'T_Mrs',
       'T_Noble', 'T_Officer', 'T_scholar'],
      dtype='object')

In [51]:
titanic_sub.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,FamilySize,Name_length,IsAlone,Title
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,2,1,16,1,Mr
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,0,2,32,0,Mrs
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,2,1,25,1,Mr
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,0,1,16,1,Mr
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,0,3,44,0,Mrs


In [52]:
titanic_test= dftest.drop(['PassengerId','Ticket', 'Title', 'Cabin', 'Name'], axis=1)
titanic_test.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,FamilySize,Name_length,IsAlone,T_Dona,T_Miss,T_Mr,T_Mrs,T_Noble,T_Officer,T_scholar
0,3,0,34.5,0,0,7.8292,2,1,16,1,0,0,1,0,0,0,0
1,3,1,47.0,1,0,7.0,0,2,32,0,0,0,0,1,0,0,0
2,2,0,62.0,0,0,9.6875,2,1,25,1,0,0,1,0,0,0,0
3,3,0,27.0,0,0,8.6625,0,1,16,1,0,0,1,0,0,0,0
4,3,1,22.0,1,1,12.2875,0,3,44,0,0,0,0,1,0,0,0


In [53]:
# Generate the predictions
submission = exported_pipeline.predict(titanic_test[df.columns])

In [54]:
# Create the submission file
final = pd.DataFrame({'PassengerId': titanic_sub['PassengerId'], 'Survived': submission})
final.to_csv('data/submission.csv', index = False)

In [55]:
final.shape

(418, 2)

In [56]:
final.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


# Results & Conclusions

There we go! I was very excited about this prediction. I did not get very high in the ranking, but with not a lot of effort the prediction lands around the 50% mark of the leaderboard. I tried this process a couple of times, using different parameters to generate the models, and I always got similar results. Although having a good model with good hyperparameters is important, this is another example of how critical it is to put serious effort into data preprocessing. Extracting more information from `Cabin` and scaling the features would probably help improve the results.
Anyway, I think I may use TPOT in the future.

### To do
- scale the features
- extract features from Cabin
- add indicator columns to flag rows with missing values
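
Two of the items above can be sketched in a few lines. This is only an illustration on made-up rows: the `_missing` suffix, the `Deck` column and the `'U'` placeholder are hypothetical choices, and whether they improve the score has not been tested here.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', 'C123'],
})

# Flag rows where a value was missing, before any imputation overwrites it
for col in ['Age', 'Cabin']:
    toy[col + '_missing'] = toy[col].isnull().astype(int)

# First letter of the cabin code; 'U' (unknown) is an arbitrary placeholder
toy['Deck'] = toy['Cabin'].str[0].fillna('U')
print(toy)
```

The flags need to be computed before imputation; afterwards the information about which values were missing is gone.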