TPOT tutorial on the Titanic dataset
=================

In [80]:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

In [81]:
# Load data
train_data = pd.read_csv('./train.csv')
train_data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Data Exploration

In [82]:
train_data.groupby('Sex').Survived.value_counts()

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64

In [83]:
train_data.groupby(['Pclass', 'Sex']).Survived.value_counts()

Pclass  Sex     Survived
1       female  1            91
                0             3
        male    0            77
                1            45
2       female  1            70
                0             6
        male    0            91
                1            17
3       female  0            72
                1            72
        male    0           300
                1            47
Name: Survived, dtype: int64

In [84]:
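# Cross-tabulate passenger counts by (Pclass, Sex) against Survived
survival_counts = pd.crosstab([train_data.Pclass, train_data.Sex], train_data.Survived.astype(float))
# Divide each row by its total to get survival proportions per group
survival_counts.div(survival_counts.sum(1).astype(float), 0)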

Unnamed: 0_level_0,Survived,0.0,1.0
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1
1,female,0.031915,0.968085
1,male,0.631148,0.368852
2,female,0.078947,0.921053
2,male,0.842593,0.157407
3,female,0.5,0.5
3,male,0.864553,0.135447
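
The row-wise division above can also be written with the `normalize` argument of `pd.crosstab`. The sketch below uses a small made-up frame (`toy`, not the Titanic data) to check that the two forms agree; the names `toy`, `manual` and `builtin` are illustrative only.

```python
import pandas as pd

# Tiny made-up sample, only to illustrate the normalisation step.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

counts = pd.crosstab([toy.Pclass, toy.Sex], toy.Survived)

# Divide each row by its own total, as in the cell above.
manual = counts.div(counts.sum(axis=1), axis=0)

# Same proportions using crosstab's built-in row normalisation.
builtin = pd.crosstab([toy.Pclass, toy.Sex], toy.Survived, normalize="index")

print(manual)
```

Either form gives rows that sum to 1, so each row reads directly as the survival rate for that class and sex.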


## Data Munging
The first and most important step in using TPOT on any dataset is to rename the class/response variable to `class`.

In [85]:
train_data.rename(columns={'Survived': 'class'}, inplace=True)

At present, TPOT requires all the data to be in numerical format. As we can see below, our dataset has 5 categorical variables which contain non-numerical values: Name, Sex, Ticket, Cabin and Embarked.

In [86]:
train_data.dtypes

PassengerId      int64
class            int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

We then check the number of levels that each of the five categorical variables has.

In [87]:
for cat in ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']:
    print("Number of levels in category '{0}': {1}".format(cat, train_data[cat].unique().size))

Number of levels in category 'Name': 891
Number of levels in category 'Sex': 2
Number of levels in category 'Ticket': 681
Number of levels in category 'Cabin': 148
Number of levels in category 'Embarked': 4


In [88]:
for cat in ['Sex', 'Embarked']:
    print("Levels for category '{0}': {1}".format(cat, train_data[cat].unique()))

Levels for category 'Sex': ['male' 'female']
Levels for category 'Embarked': ['S' 'C' 'Q' nan]


In [89]:
train_data['Sex'] = train_data['Sex'].map({'male':0,'female':1})
train_data['Embarked'] = train_data['Embarked'].map({'S':0,'C':1,'Q':2})

In the cell above, we code these levels manually into numerical values. For nan, i.e. the missing values, we simply replace them with a placeholder value (-999). In fact, we perform this replacement for the entire dataset.

In [90]:
train_data = train_data.fillna(-999)
pd.isnull(train_data).any()

PassengerId    False
class          False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool
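
To make the two steps concrete, here is a minimal sketch on a made-up column (`embarked` is illustrative, not the real data). It shows that any value absent from the mapping, including a missing one, comes out of `map` as NaN and is then caught by `fillna`.

```python
import pandas as pd

# Made-up column containing one missing value.
embarked = pd.Series(["S", "C", None, "Q"])

# Values absent from the mapping (here the missing one) become NaN...
coded = embarked.map({"S": 0, "C": 1, "Q": 2})

# ...and are then replaced by the placeholder.
filled = coded.fillna(-999)
print(filled.tolist())
```

Note that a placeholder such as -999 is only a convenience: tree-based models can usually split it off, but scale-sensitive models may be distorted by it.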

Since Name and Ticket have so many levels, we drop them from our analysis for the sake of simplicity. For Cabin, we encode the levels as binary indicator columns using scikit-learn's MultiLabelBinarizer and treat them as new features.


```python
>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> mlb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
       [0, 0, 1]])
>>> mlb.classes_
array([1, 2, 3])
>>> # The binarizer also accepts sets of string labels:
>>> mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])
array([[0, 1, 1],
       [1, 0, 0]])
>>> list(mlb.classes_)
['comedy', 'sci-fi', 'thriller']

```

In [91]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
CabinTrans = mlb.fit_transform([{str(val)} for val in train_data['Cabin'].values])

In [92]:
CabinTrans

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

In [93]:
# Drop the unused features from the dataset.
train_data_new = train_data.drop(['Name','Ticket','Cabin','class'], axis=1)

In [94]:
assert (len(train_data['Cabin'].unique()) == len(mlb.classes_)), "Not Equal" #check correct encoding done

We then add the encoded features to form the final dataset to be used with TPOT.

In [95]:
train_data_new = np.hstack((train_data_new.values,CabinTrans))

In [96]:
train_data_new.shape

(891, 156)

In [97]:
np.isnan(train_data_new).any()

False

Keeping in mind that the final dataset is in the form of a numpy array, we can check the number of features in the final dataset as follows.

In [98]:
train_data_new[0].size

156
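
The figure of 156 is consistent with the earlier steps: 8 columns remain after dropping Name, Ticket, Cabin and class, and Cabin has 148 levels, so 8 + 148 = 156. A minimal sketch of the same horizontal stacking on made-up arrays (`numeric_part` and `indicator_part` are illustrative names):

```python
import numpy as np

# Made-up stand-ins: 3 rows of 2 numeric features and 3 rows of 4 indicator columns.
numeric_part = np.array([[1.0, 22.0], [2.0, 38.0], [3.0, 26.0]])
indicator_part = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]])

# hstack keeps the row count and concatenates the columns side by side.
combined = np.hstack((numeric_part, indicator_part))
print(combined.shape)  # (3, 6)
```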

Finally we store the class labels, which we need to predict, in a separate variable.

In [99]:
Target_class = train_data['class'].values

## Data Analysis using TPOT

To begin our analysis, we need to divide our training data into training and validation sets. The validation set is just to give us an idea of the test set error. Model selection and tuning are entirely taken care of by TPOT, so if we want to, we can skip creating this validation set.

In [21]:
training_indices, validation_indices = train_test_split(train_data.index, stratify=Target_class, train_size=0.75, test_size=0.25)
training_indices.size, validation_indices.size

(668, 223)
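
The `stratify` argument keeps the class balance the same in both parts of the split. The sketch below illustrates this on made-up labels (`labels`, `train_idx` and `valid_idx` are illustrative names, and `random_state=0` is added only to make the sketch repeatable).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 60 negatives and 40 positives.
labels = np.array([0] * 60 + [1] * 40)
indices = np.arange(labels.size)

train_idx, valid_idx = train_test_split(
    indices, stratify=labels, train_size=0.75, test_size=0.25, random_state=0
)

# Stratification keeps the positive rate the same in both parts.
print(labels[train_idx].mean(), labels[valid_idx].mean())
```

Splitting the index rather than the data itself, as the tutorial does, lets the same indices select rows from both the feature array and the label array.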

After that, we proceed to calling the `fit`, `score` and `export` functions on our training dataset. To get a better idea of how these functions work, refer to the TPOT documentation [here](http://epistasislab.github.io/tpot/api/).

An important TPOT parameter to set is the number of generations. Since our aim is just to illustrate the use of TPOT, the cell below runs 20 generations with a population size of 80 (`generations=20`, `population_size=80`) instead of the default of 100 generations. On a standard laptop with 4 GB of RAM, one generation takes roughly 5 minutes, and each added generation adds about 5 minutes more. At the default value of 100 generations, the total run time would therefore be roughly 8 hours.
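
As a rough check on these numbers: according to the TPOT documentation, a run evaluates `population_size + generations * offspring_size` pipelines, with `offspring_size` defaulting to `population_size`. The sketch below applies that to the settings passed to `TPOTClassifier` in the next cell, and repeats the run-time estimate; the 5 minutes per generation is the tutorial's own rough figure, not a measurement.

```python
# Settings passed to TPOTClassifier in this tutorial.
generations = 20
population_size = 80
offspring_size = population_size  # TPOT's default when offspring_size is None

# Pipelines evaluated over the whole run.
total_pipelines = population_size + generations * offspring_size
print(total_pipelines)  # 1680

# Run-time estimate from the text: about 5 minutes per generation.
minutes_per_generation = 5
default_generations = 100
hours = default_generations * minutes_per_generation / 60
print(round(hours, 1))  # 8.3
```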

In [22]:
tpot = TPOTClassifier(generations=20, population_size=80, verbosity=2)
tpot.fit(train_data_new[training_indices], Target_class[training_indices])


Generation 1 - Current best internal CV score: 0.8308719560094266

Generation 2 - Current best internal CV score: 0.8308719560094266

Generation 3 - Current best internal CV score: 0.8308719560094266

Generation 4 - Current best internal CV score: 0.8308719560094266

Generation 5 - Current best internal CV score: 0.8383570867467174

Generation 6 - Current best internal CV score: 0.8383570867467174

Generation 7 - Current best internal CV score: 0.8383570867467174

Generation 8 - Current best internal CV score: 0.8383570867467174

Generation 9 - Current best internal CV score: 0.8383570867467174

Generation 10 - Current best internal CV score: 0.8383570867467174

Generation 11 - Current best internal CV score: 0.8383570867467174

Generation 12 - Current best internal CV score: 0.8383570867467174

Generation 13 - Current best internal CV score: 0.8383570867467174

Generation 14 - Current best internal CV score: 0.8383570867467174

Generation 15 - Current best internal CV score: 0.842812

TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=20,
               log_file=None, max_eval_time_mins=5, max_time_mins=None,
               memory=None, mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=80,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)

In [23]:
tpot.score(train_data_new[validation_indices], train_data.loc[validation_indices, 'class'].values)

0.8026905829596412

In [24]:
tpot.export('tpot_titanic_pipeline.py')

In [None]:
# %load tpot_titanic_pipeline.py
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from tpot.builtins import StackingEstimator

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

# Average CV score on the training set was: 0.8443496801705755
exported_pipeline = make_pipeline(
    StandardScaler(),
    StackingEstimator(estimator=GaussianNB()),
    StandardScaler(),
    RandomForestClassifier(bootstrap=False, criterion="entropy", max_features=0.5, min_samples_leaf=1, min_samples_split=19, n_estimators=100)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)


Let's have a look at the generated code. As we can see, the best pipeline TPOT found is not a single model: a Gaussian naive Bayes estimator, wrapped in `StackingEstimator`, passes its predictions on as extra features to a random forest classifier, which makes the final prediction. If we ran TPOT for more generations, the score might improve further.
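
A fitted or unfitted scikit-learn pipeline can also be inspected programmatically. The sketch below builds a simplified stand-in for the exported pipeline (the `StackingEstimator` step is left out so that it runs without TPOT installed) and lists its steps; `pipeline`, `step_names` and `final_estimator` are illustrative names.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplified stand-in for the exported pipeline (no StackingEstimator step).
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(bootstrap=False, criterion="entropy", max_features=0.5,
                           min_samples_split=19, n_estimators=100),
)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# The last step is the estimator that produces the predictions.
final_estimator = pipeline.steps[-1][1]
print(type(final_estimator).__name__, final_estimator.min_samples_split)
```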


## Make predictions on the submission data

In [100]:
# Read in the submission dataset
test_data = pd.read_csv('./test.csv')
test_data.describe()

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,3.0,39.0,1.0,0.0,31.5
max,1309.0,3.0,76.0,8.0,9.0,512.3292


In [101]:
for var in ['Cabin']:    #,'Name','Ticket']:
    new = list(set(test_data[var]) - set(train_data[var]))
    test_data.loc[test_data[var].isin(new), var] = -999
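
The cell above replaces every Cabin value that never occurred in the training data with the placeholder. A minimal sketch of the same idea on made-up values (`train_cabin`, `test_cabin` and `unseen` are illustrative names):

```python
import pandas as pd

# Made-up cabin values; 'Z9' appears only in the submission column.
train_cabin = pd.Series(["C85", "C123", "E46"])
test_cabin = pd.Series(["C85", "Z9", "E46", "Z9"])

# Levels seen in the submission data but never in training.
unseen = set(test_cabin) - set(train_cabin)

# Use object dtype so the column can hold both strings and the numeric placeholder.
test_cabin = test_cabin.astype(object)

# Overwrite the unseen levels with the placeholder used for missing values.
test_cabin.loc[test_cabin.isin(list(unseen))] = -999
print(test_cabin.tolist())
```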

We then carry out the same data munging steps as we did earlier for the training dataset.

In [102]:
test_data['Sex'] = test_data['Sex'].map({'male':0,'female':1})
test_data['Embarked'] = test_data['Embarked'].map({'S':0,'C':1,'Q':2})

In [103]:
test_data = test_data.fillna(-999)
pd.isnull(test_data).any()

PassengerId    False
Pclass         False
Name           False
Sex            False
Age            False
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin          False
Embarked       False
dtype: bool

When calling MultiLabelBinarizer for the submission dataset, we first fit on the training set again to learn the levels and then transform the submission dataset values. This ensures that only levels present in the training dataset are encoded. If new levels are still found in the submission dataset, the transform will either raise an error or warn that the unknown classes are ignored, depending on the scikit-learn version. In that case we need to go back and check our earlier step of replacing new levels with the placeholder value.

In [104]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
SubCabinTrans = mlb.fit([{str(val)} for val in train_data['Cabin'].values]).transform([{str(val)} for val in test_data['Cabin'].values])
test_data = test_data.drop(['Name','Ticket','Cabin'], axis=1)
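
Fitting on the training levels and only transforming the submission values is what keeps the indicator columns aligned between the two datasets. A minimal sketch with made-up levels (`train_levels` and `test_levels` are illustrative; `'-999'` stands for missing or unseen cabins):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up levels; '-999' stands for missing or unseen cabins.
train_levels = ["C85", "-999", "E46", "-999"]
test_levels = ["E46", "-999", "C85"]

mlb = MultiLabelBinarizer()
mlb.fit([{level} for level in train_levels])

# The column order is fixed by the (sorted) training levels.
print(list(mlb.classes_))

# Each submission row gets a 1 in the column of its level.
encoded = mlb.transform([{level} for level in test_levels])
print(encoded)
```

Because the columns come from the training fit, the submission matrix has the same width as the training matrix even when some training levels never occur in the submission data.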

In [105]:
# Form the new submission data set
test_data_new = np.hstack((test_data.values,SubCabinTrans))

In [110]:
test_data_new[0]

array([892.    ,   3.    ,   0.    ,  34.5   ,   0.    ,   0.    ,
         7.8292,   2.    ,   1.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0.    ,
         0.    ,   0.    ,   0.    ,   0.    ,   0.    ,   0. 

In [111]:
np.any(np.isnan(test_data_new))

False

In [112]:
# Ensure equal number of features in both the final training and submission dataset
assert (train_data_new.shape[1] == test_data_new.shape[1]), "Not Equal" 

In [113]:
# Generate the predictions
submission = tpot.predict(test_data_new)

In [114]:
# Create the submission file
final = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': submission})
final.to_csv('TPOT_RES.csv', index = False)

In [147]:
final.shape

(418, 2)

## EpistasisLab-tpot-example

In [145]:
print('train_data_new: ', train_data_new.shape)
print('test_data_new: ', test_data_new.shape)
print('Target_class: ', Target_class.shape)
print('features: ', features.shape)

train_data_new:  (891, 156)
test_data_new:  (418, 156)
Target_class:  (891,)
features:  (891, 156)


In [402]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
# tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
# features = tpot_data.drop('target', axis=1)

features = train_data_new
training_features, testing_features, training_classes, testing_classes = \
            train_test_split(features, Target_class, random_state=None)

exported_pipeline1 = RandomForestClassifier(bootstrap=False, max_features=0.4,
                                            min_samples_leaf=1, min_samples_split=9)
exported_pipeline1.fit(training_features, training_classes)
results1 = exported_pipeline1.predict(test_data_new)

print(exported_pipeline1.score(testing_features, testing_classes))
print(exported_pipeline1.score(train_data_new, Target_class))
print(exported_pipeline1.score(train_data_new[validation_indices], 
      train_data.loc[validation_indices, 'class'].values))

# Create the submission file
final = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': results1})
final.to_csv('TPOT_RES1.csv', index = False)

0.8026905829596412
0.9326599326599326
0.9461883408071748
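
Only the first of the three scores above is measured on rows the model did not see during fitting. The other two are computed on the full dataset and on the earlier validation indices, both of which overlap with this cell's training rows, so they are likely to be optimistic. The sketch below illustrates the effect on synthetic data from `make_classification`; it is not the Titanic data, and the names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only to illustrate the effect.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(bootstrap=False, n_estimators=50, random_state=0)
model.fit(X_train, y_train)

held_out_score = model.score(X_test, y_test)  # rows the model never saw
seen_score = model.score(X_train, y_train)    # rows used for fitting
print(held_out_score, seen_score)
```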


## My TPOT-example

In [198]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from tpot.builtins import StackingEstimator

# NOTE: `features` and `Target_class` are the arrays built in the cells above.

training_features, testing_features, training_target, testing_target = \
            train_test_split(features, Target_class, random_state=None)

# Average CV score on the training set was: 0.8443496801705755
exported_pipeline2 = make_pipeline(
    StandardScaler(),
    StackingEstimator(estimator=GaussianNB()),
    StandardScaler(),
    RandomForestClassifier(bootstrap=False, criterion="entropy", max_features=0.5, 
                           min_samples_leaf=1, min_samples_split=20, n_estimators=90)
)

exported_pipeline2.fit(training_features, training_target)
results2 = exported_pipeline2.predict(test_data_new)

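# Score against testing_target, the labels that belong to this cell's split.
# The first score printed below comes from a run that passed testing_classes
# from the previous cell, so it compared predictions with mismatched labels.
print(exported_pipeline2.score(testing_features, testing_target))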
print(exported_pipeline2.score(train_data_new, Target_class))
print(exported_pipeline2.score(train_data_new[validation_indices], 
      train_data.loc[validation_indices, 'class'].values))

# Create the submission file
final = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': results2})
final.to_csv('TPOT_RES2.csv', index = False)

0.5336322869955157
0.9169472502805837
0.9192825112107623
