# EvalML Titanic Demo:
This demo shows how to build machine learning models with EvalML on the classic Titanic dataset.
The Titanic dataset includes information on each passenger, such as fare, sex, age, and passenger class,
and, most importantly, whether they survived the sinking.

Data from: https://www.kaggle.com/c/titanic

In [1]:
# import libraries
import os
import evalml
import numpy as np
import pandas as pd

In [2]:
# import titanic dataset
titanic_train = pd.read_csv('https://featuretools-static.s3.amazonaws.com/evalml/Titanic/train.csv')

display(titanic_train.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Missing data:
Some passengers have missing values, so we fill them in before training.
Here we fill missing ages with the mean age and missing embarkation ports with the mode. We also drop `Cabin`, `Name`, `Ticket`, and `PassengerId`, which this demo does not use as features.

In [3]:
# fill missing data (assign back rather than calling fillna in place on a column selection)
titanic_train["Age"] = titanic_train["Age"].fillna(titanic_train["Age"].mean(skipna=True))
titanic_train["Embarked"] = titanic_train["Embarked"].fillna(titanic_train["Embarked"].mode()[0])

# drop columns that are not used as features
titanic_train = titanic_train.drop(columns=['Cabin', 'Name', 'Ticket', 'PassengerId'])

In [4]:
# check whether any row still contains a NaN
print(titanic_train.isnull().any(axis=1).any())

False
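
As a minimal sketch of the same imputation on a toy frame (the values below are made up for illustration and are not taken from the Titanic data), mean-filling a numeric column and mode-filling a categorical one might look like this:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [20.0, None, 40.0],
    "Embarked": ["S", None, "S"],
})

# Numeric column: replace missing values with the column mean
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Categorical column: replace missing values with the most frequent value
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Same check as in the notebook: does any row still contain a NaN?
has_missing = toy.isnull().any(axis=1).any()
print(toy)
print(has_missing)
```

Assigning the result of `fillna` back to the column avoids relying on in-place modification of a column selection, which newer pandas versions warn about.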


## Encoding Categorical Variables
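Many machine learning algorithms cannot work with categorical values directly. Here we encode the categorical variables as numeric dummy (one-hot) variables. Because `Sex` takes only two values in this dataset, `Sex_female` is redundant given `Sex_male`, so we drop it.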

In [5]:
# create encodings
titanic_train = pd.get_dummies(titanic_train, columns=["Pclass"])
titanic_train = pd.get_dummies(titanic_train, columns=["Embarked"])
titanic_train = pd.get_dummies(titanic_train, columns=["Sex"])
titanic_train.drop('Sex_female', axis=1, inplace=True)

In [6]:
X_train = titanic_train.drop('Survived', axis=1)
y_train = titanic_train['Survived']

display(X_train.head())

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Sex_male
0,22.0,1,0,7.25,0,0,1,0,0,1,1
1,38.0,1,0,71.2833,1,0,0,1,0,0,0
2,26.0,0,0,7.925,0,0,1,0,0,1,0
3,35.0,1,0,53.1,1,0,0,0,0,1,0
4,35.0,0,0,8.05,0,0,1,0,0,1,1
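
To make the encoding step concrete, here is a small sketch on invented data (the rows are hypothetical, not Titanic passengers). It assumes a pandas version that accepts the `dtype` argument of `pd.get_dummies`; passing `dtype=int` keeps the 0/1 columns shown above, since recent pandas versions default to booleans.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category value; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(toy, columns=["Embarked", "Sex"], dtype=int)

# With two categories, one indicator fully determines the other, so drop one
encoded = encoded.drop(columns=["Sex_female"])

print(encoded.columns.tolist())
print(encoded)
```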


## Model Training:
After creating a train/holdout split to validate our model, we use EvalML to automatically search up to 50 pipelines, looking for both the best model type and the best parameters for it under the chosen objective (precision here). It is as easy as calling `clf.fit()` on your data!

In [7]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X_train, y_train, test_size=.2)
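
The idea behind a holdout split can be sketched in plain Python as follows. This is only an illustration of the concept: `holdout_split` is a hypothetical helper, and how `split_data` works internally (for example, whether it stratifies by class) is not shown in this demo.

```python
import random

def holdout_split(rows, labels, test_size=0.2, seed=0):
    """Shuffle indices and hold out a fraction of them (hypothetical helper)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_holdout = int(round(len(indices) * test_size))
    holdout_idx, train_idx = indices[:n_holdout], indices[n_holdout:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, holdout_idx),
            pick(labels, train_idx), pick(labels, holdout_idx))

# Toy data, invented for illustration only
rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]

X_tr, X_ho, y_tr, y_ho = holdout_split(rows, labels, test_size=0.2)
print(len(X_tr), len(X_ho))
```

The holdout rows are never seen during the pipeline search, so a score on them gives a more honest estimate than the scores used to rank pipelines.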

In [8]:
# use evalml
clf = evalml.AutoClassifier(objective="precision",
                            max_pipelines=50)

In [9]:
# fit using AutoClassifier
clf.fit(X_train, y_train)

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Precision. Greater score is better.

Searching up to 50 pipelines. No time limit is set. Set one using max_time parameter.

Possible model types: random_forest, xgboost, linear_model

Testing Random Forest w/ imputation: 100%|██████████| 50/50 [05:34<00:00,  6.69s/it]               

✔ Optimization finished


In [10]:
clf.rankings

Unnamed: 0,id,pipeline_name,score,high_variance_cv,parameters
0,29,XGBoostPipeline,0.775,False,"{'eta': 0.35950790057378607, 'min_child_weight..."
1,37,XGBoostPipeline,0.768293,False,"{'eta': 0.009150580761498219, 'min_child_weigh..."
2,7,XGBoostPipeline,0.758621,False,"{'eta': 0.6481718720511973, 'min_child_weight'..."
3,34,XGBoostPipeline,0.75,False,"{'eta': 0.18166317747775154, 'min_child_weight..."
4,38,XGBoostPipeline,0.75,False,"{'eta': 0.0038098661671809317, 'min_child_weig..."
5,45,XGBoostPipeline,0.75,False,"{'eta': 0.023692548385559904, 'min_child_weigh..."
6,2,XGBoostPipeline,0.74359,False,"{'eta': 0.5928446182250184, 'min_child_weight'..."
7,39,XGBoostPipeline,0.741176,False,"{'eta': 0.1422726331411999, 'min_child_weight'..."
8,5,XGBoostPipeline,0.733333,False,"{'eta': 0.38438170729269994, 'min_child_weight..."
9,42,RFClassificationPipeline,0.717391,False,"{'n_estimators': 970, 'max_depth': 422, 'imput..."


### Here we can see how the top-ranked model scored in terms of precision on the holdout set!

In [12]:
pipeline = clf.best_pipeline
print("Best model score: {}".format(pipeline.score(X_holdout, y_holdout)))

Best model score: 0.7777777777777778
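
For reference, precision is the fraction of passengers predicted to survive who actually survived, TP / (TP + FP). The sketch below computes it by hand on made-up labels; `precision` is a hypothetical helper written for illustration and is not EvalML's implementation.

```python
def precision(y_true, y_pred):
    """Fraction of predicted positives that are actually positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    predicted_pos = true_pos + false_pos
    # Convention chosen here: return 0.0 when nothing is predicted positive
    return true_pos / predicted_pos if predicted_pos else 0.0

# Toy labels, invented for illustration only (1 = survived, 0 = did not survive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

score = precision(y_true, y_pred)
print(score)  # 3 true positives, 1 false positive -> 0.75
```

Precision ignores positives the model missed (the fourth passenger above), so a high precision score alone does not mean every survivor was identified.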
