<a href="https://colab.research.google.com/github/codeforhk/python_practitioner/blob/master/ml_example_titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://i.ytimg.com/vi/cMVi953awHQ/maxresdefault.jpg">

# **An Interactive Data Science Tutorial**


*[Based on the Titanic competition on Kaggle](https://www.kaggle.com/c/titanic)*

*by Helge Bjorland & Stian Eide*

*January 2017*

---

## Content


 1. Business Understanding (5 min)
     * Objective
     * Description
 2. Data Understanding (15 min)
    * Import Libraries
    * Load data
    * Statistical summaries and visualisations
    * Exercises
 3. Data Preparation (5 min)
    * Missing values imputation
    * Feature Engineering
 4. Modeling (5 min)
     * Build the model
 5. Evaluation (25 min)
     * Model performance
     * Feature importance
     * Who gets the best performing model?
 6. Deployment  (5 min)
     * Submit result to Kaggle leaderboard     

[*Adapted from Cross Industry Standard Process for Data Mining (CRISP-DM)*](http://www.sv-europe.com/crisp-dm-methodology/)

![CRISP-DM](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/220px-CRISP-DM_Process_Diagram.png "Process diagram showing the relationship between the different phases of CRISP-DM")

# 1. Business Understanding

## 1.1 Objective
Predict survival on the Titanic

## 1.2 Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

**Before going further, what do you think are the most important reasons passengers survived the Titanic sinking?**

[Description from Kaggle](https://www.kaggle.com/c/titanic)

<img src="https://typeset-beta.imgix.net/2017/1/31/cd2dd3d6-bf72-4a05-93b5-d952c499f335.jpg">
<img src="https://i.imgflip.com/r89z2.jpg?a418296">
<img src="https://i.ytimg.com/vi/Ho1x0c86RrU/hqdefault.jpg">

# 2. Data Understanding

## 2.1 Exercise: Import Libraries 
First off, some preparation. We need to import the Python libraries that provide the functionality we will use.

*Question 1: Import the numpy and pandas libraries (10 seconds)*

*Question 2: Import the sklearn classes for KNN, Random Forest & Gradient Boosting. The decision tree class is imported for you (60 seconds)*
<br>hints: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree

*Question 3: Import the visualization libraries pyplot, seaborn & pylab. Seaborn is imported for you (30 seconds)*

*Question 4: Import the sklearn helper function train_test_split* (60 seconds)
<br>hints: http://scikit-learn.org/stable/modules/cross_validation.html

In [0]:
#Q1
#hints: import the pandas library as pd?

#Q2
#hints: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree
from sklearn.tree import DecisionTreeClassifier

#Q3
import seaborn as sns

#Q4
#http://scikit-learn.org/stable/modules/cross_validation.html
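# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Normalizer , scale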
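
One possible set of imports for Questions 1, 2 and 4 is sketched below. It also pulls in the other classifiers that section 4.1 instantiates. The plotting imports for Question 3 follow the same pattern and are shown only as comments, on the assumption that matplotlib and seaborn are installed in your environment.

```python
# Q1: numerical and tabular libraries
import numpy as np
import pandas as pd

# Q2: classifiers used later in section 4.1
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Q4: helper for splitting data into training and validation parts
from sklearn.model_selection import train_test_split

# Q3 follows the same pattern (left as comments here):
#   import matplotlib.pyplot as plt
#   import matplotlib.pylab as pylab
#   import seaborn as sns
```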

## 2.2 Setup helper Functions
There is no need to understand this code. Just run it to simplify the code later in the tutorial.

*Simply run the cell below by selecting it and pressing the play button.*

In [0]:
def plot_histograms( df , variables , n_rows , n_cols ):
    fig = plt.figure( figsize = ( 16 , 12 ) )
    for i, var_name in enumerate( variables ):
        ax=fig.add_subplot( n_rows , n_cols , i+1 )
        df[ var_name ].hist( bins=10 , ax=ax )
        ax.set_title( 'Skew: ' + str( round( float( df[ var_name ].skew() ) , ) ) ) # + ' ' + var_name ) #var_name+" Distribution")
        ax.set_xticklabels( [] , visible=False )
        ax.set_yticklabels( [] , visible=False )
    fig.tight_layout()  # Improves appearance a bit.
    plt.show()

def plot_distribution( df , var , target , **kwargs ):
    row = kwargs.get( 'row' , None )
    col = kwargs.get( 'col' , None )
    facet = sns.FacetGrid( df , hue=target , aspect=4 , row = row , col = col )
    facet.map( sns.kdeplot , var , shade= True )
    facet.set( xlim=( 0 , df[ var ].max() ) )
    facet.add_legend()

def plot_categories( df , cat , target , **kwargs ):
    row = kwargs.get( 'row' , None )
    col = kwargs.get( 'col' , None )
    facet = sns.FacetGrid( df , row = row , col = col )
    facet.map( sns.barplot , cat , target )
    facet.add_legend()

def plot_correlation_map( df ):
    corr = df.select_dtypes( include = 'number' ).corr()  # use the df argument, numeric columns only
    _ , ax = plt.subplots( figsize =( 12 , 10 ) )
    cmap = sns.diverging_palette( 220 , 10 , as_cmap = True )
    _ = sns.heatmap(
        corr, 
        cmap = cmap,
        square=True, 
        cbar_kws={ 'shrink' : .9 }, 
        ax=ax, 
        annot = True, 
        annot_kws = { 'fontsize' : 12 }
    )

def describe_more( df ):
    var = [] ; l = [] ; t = []
    for x in df:
        var.append( x )
        l.append( len( df[ x ].value_counts() ) )  # number of distinct non-missing values
        t.append( df[ x ].dtypes )
    levels = pd.DataFrame( { 'Variable' : var , 'Levels' : l , 'Datatype' : t } )
    levels.sort_values( by = 'Levels' , inplace = True )
    return levels

def plot_variable_importance( X , y ):
    tree = DecisionTreeClassifier( random_state = 99 )
    tree.fit( X , y )
    plot_model_var_imp( tree , X , y )
    
def plot_model_var_imp( model , X , y ):
    imp = pd.DataFrame( 
        model.feature_importances_  , 
        columns = [ 'Importance' ] , 
        index = X.columns 
    )
    imp = imp.sort_values( [ 'Importance' ] , ascending = True )
    imp[ : 10 ].plot( kind = 'barh' )
    print (model.score( X , y ))
    

## 2.3 Exercise: Load data
Now that our packages are loaded, let's read in and take a peek at the data.

*Question 5a: Read the train & test csv & define them as variable $train$ & $test$.* (1 min)
<br>
train csv: https://storage.googleapis.com/bwdb/acceleratehk/10%20-%20kaggle%20class/train.csv
<br>
test csv: https://storage.googleapis.com/bwdb/acceleratehk/10%20-%20kaggle%20class/test.csv

*Question 5b: Append/concatenate the 2 csv as a variable $full$.* (5 mins)

*Question 5c: Then select the first 891 rows of full and define a new variable called $titanic$* (1 min)

*Question 5d: Describe the size of the variables using the "shape" attribute (1 min)*

In [0]:
# get titanic & test csv files as a DataFrame
# train csv: https://storage.googleapis.com/bwdb/acceleratehk/10%20-%20kaggle%20class/train.csv
# test csv: https://storage.googleapis.com/bwdb/acceleratehk/10%20-%20kaggle%20class/test.csv
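
A minimal sketch of Questions 5a to 5d follows. To keep it runnable offline it reads two tiny made-up CSV snippets from memory instead of the real URLs; in the notebook you would pass the train and test URLs to `pd.read_csv` instead. The row counts here (3 and 2) stand in for the real 891 and 418.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for train.csv and test.csv (made-up rows).
train_csv = io.StringIO("PassengerId,Survived,Name,Age\n1,0,A,22\n2,1,B,38\n3,1,C,26\n")
test_csv = io.StringIO("PassengerId,Name,Age\n4,D,35\n5,E,\n")

train = pd.read_csv(train_csv)  # in the notebook: pd.read_csv(<train csv URL>)
test = pd.read_csv(test_csv)    # in the notebook: pd.read_csv(<test csv URL>)

# Stack the rows; test has no Survived column, so those rows get NaN there.
full = pd.concat([train, test], ignore_index=True)

# The first len(train) rows are the labelled ones (891 in the real data).
titanic = full[:len(train)]

print(full.shape, titanic.shape)
```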



*Silly question: Check if Jack and Rose were on the Titanic. They seem too real to be fictional characters. How would you do it? Try a list comprehension. Hint: try `'Rose' in "Rose DeWitt Bukater"`, which returns True. How would you check whether there is a "Rose" in the "Name" column of the dataframe "full"?* (5 mins)

In [0]:
Most_important_people_on_titanic = ['Jack Dawson',  "Rose DeWitt Bukater"]
#Let's check if they are on Titanic?
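
One way to approach the silly question, sketched on a few made-up names rather than the real `full` dataframe: a list comprehension with the `in` operator, plus the equivalent pandas string method. Note that substring matching also catches surnames such as "Rosen".

```python
import pandas as pd

# Made-up names in the same "Surname, Title. Given names" layout as the dataset.
names = pd.Series(
    ["Smith, Mr. John", "Rosen, Mrs. Leah", "Doe, Miss. Rose"], name="Name"
)

# List comprehension: keep every name that contains the substring.
with_rose = [name for name in names if "Rose" in name]
with_jack = [name for name in names if "Jack" in name]

# pandas equivalent: a boolean mask over the column.
rose_mask = names.str.contains("Rose")

print(with_rose, with_jack, int(rose_mask.sum()))
```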



## 2.4 Exercise: Statistical summaries and visualisations

To understand the data we are now going to consider some key facts about various variables including their relationship with the target variable, i.e. survival.



*Question 6: We start by looking at a few lines of the data. How do you do it? (10 seconds)*

In [0]:
#Q6


**VARIABLE DESCRIPTIONS:**

We've got a sense of our variables, their class type, and the first few observations of each. We know we're working with 1309 observations of 12 variables. To make things a bit more explicit since a couple of the variable names aren't 100% illuminating, here's what we've got to deal with:


**Variable Description**

 - Survived: Survived (1) or died (0)
 - Pclass: Passenger's class
 - Name: Passenger's name
 - Sex: Passenger's sex
 - Age: Passenger's age
 - SibSp: Number of siblings/spouses aboard
 - Parch: Number of parents/children aboard
 - Ticket: Ticket number
 - Fare: Fare
 - Cabin: Cabin
 - Embarked: Port of embarkation

[More information on the Kaggle site](https://www.kaggle.com/c/titanic/data)

### 2.4.1 Exercise: Look at some key information about the variables
A numeric variable is one whose values are integers or real numbers, while a categorical variable can take on one of a limited, and usually fixed, number of possible values, such as blood type.

Notice especially what type of variable each is, how many observations there are and some of the variable values.

One interesting observation is the minimum age of 0.42. Do you know why this is?

*Select the cell below and run it by pressing the play button.*

*Question 7: How do you quickly describe the dataframe? What does the output mean? Try to explain every row to yourself* (1 min)

In [0]:
#Q7
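
As a sketch of Questions 6 and 7, `head` shows the first rows and `describe` summarises the numeric columns. The small dataframe below is made up for illustration; in the notebook you would call the same methods on `titanic`.

```python
import pandas as pd

# Made-up stand-in for the titanic dataframe.
titanic = pd.DataFrame(
    {
        "Age": [22.0, 38.0, None, 35.0],
        "Fare": [7.25, 71.28, 7.92, 53.10],
        "Sex": ["male", "female", "female", "female"],
    }
)

first_rows = titanic.head(2)  # Q6: peek at the first two rows
summary = titanic.describe()  # Q7: count, mean, std, min, quartiles, max

print(first_rows)
print(summary)
```

By default `describe` covers only the numeric columns, and `count` excludes missing values, which is why Age reports one observation fewer than Fare here.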


### 2.4.2 Exercise: A heat map of correlation may give us an understanding of which variables are important
*Question 8: Find the correlation matrix of titanic. (5 mins)*

hints: Check the helper functions. Alternatively, check https://stackoverflow.com/questions/29432629/correlation-matrix-using-pandas

In [0]:
#Q8
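
A minimal sketch of Question 8 on made-up data: select the numeric columns and call `corr`. Selecting numeric columns first is an assumption made here so the call behaves the same across pandas versions when the dataframe also holds text columns.

```python
import pandas as pd

# Made-up stand-in for the titanic dataframe, including one text column.
titanic = pd.DataFrame(
    {
        "Survived": [0, 1, 1, 0, 1, 0],
        "Pclass": [3, 1, 2, 3, 1, 3],
        "Fare": [7.25, 71.28, 13.0, 8.05, 53.1, 7.9],
        "Sex": ["male", "female", "female", "male", "female", "male"],
    }
)

# Pairwise Pearson correlation between the numeric columns.
corr = titanic.select_dtypes(include="number").corr()

print(corr)
```

A value near +1 or -1 in the Survived row points to a variable that moves with survival; a value near 0 suggests little linear relationship.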


### 2.4.3 Let's further explore the relationship between the features and survival of passengers 
*Question 9: Can you plot the relationship between age and survival? Try using the helper functions.*

In [0]:
#Q9 Plot distributions of Age of passengers who survived or did not survive


Consider the graphs above. Differences in survival between different values of a variable are what the model uses to separate the target variable (survival in this case). If the two lines had been about the same, the variable would not have been a good one for our predictive model.

Consider some key questions, such as: at what ages do males/females have a higher or lower probability of survival?

### 2.4.3 Exercise: Investigating numeric variables
It's time to get your hands dirty and do some coding! 

*Question 10: Try to plot the distributions of Fare of passengers who survived or did not survive. Then consider if this could be a good predictive variable.*

*Hint: use the code from the previous cell as a starting point.*

In [0]:
#Q10 Plot distributions of Fare of passengers who survived or did not survive




### 2.4.4 Embarked
We can also look at categorical variables like Embarked and their relationship with survival.

- C = Cherbourg  
- Q = Queenstown
- S = Southampton

In [0]:
#Q11 Plot survival rate by Embarked. Hint: Look into the helper function?
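
The bar plot from the helper function shows the mean of Survived per category. As a numeric companion to the plot (not a replacement for it), the same quantity can be computed with a group-by. The data below is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the titanic dataframe.
titanic = pd.DataFrame(
    {
        "Embarked": ["S", "C", "S", "Q", "C", "S"],
        "Survived": [0, 1, 1, 0, 1, 0],
    }
)

# Mean of a 0/1 column per group is the survival rate of that group.
survival_rate = titanic.groupby("Embarked")["Survived"].mean()

print(survival_rate)
```

The same pattern works for Sex, Pclass, SibSp and Parch by changing the column passed to `groupby`.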




### 2.4.4 Questions 12-15: Investigating categorical variables
Even more coding practice! Try to plot the survival rate by Sex, Pclass, SibSp and Parch below.

*Hint: use the code from the previous cell as a starting point.*

After considering these graphs, which variables do you expect to be good predictors of survival? 

In [0]:
# Q12
# Plot survival rate by Sex


In [0]:
# Q13
# Plot survival rate by Pclass


In [0]:
# Q14
# Plot survival rate by SibSp


In [0]:
# Q15
# Plot survival rate by Parch


# 3. Data Preparation

## 3.1 Categorical variables need to be transformed to numeric variables
The variables *Embarked*, *Pclass* and *Sex* are treated as categorical variables. Some of our model algorithms can only handle numeric values, so we need to create a new variable (dummy variable) for every unique value of each categorical variable.

This variable has the value 1 if the row has that particular value and 0 if not. *Sex* is a dichotomy (old school gender theory) and will be encoded as one binary variable (0 or 1).

*Select the cells below and run them by pressing the play button.*

*Question 16: Write a function that transforms male to 1 and female to 0. You can write a function that takes a list as input, or loop over a simple function that maps male to 1 and anything else to 0. Convert $full.Sex$ to a series called $sex$.* (10 mins)

In [0]:
#Q16 Transform Sex into binary values 0 and 1
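
One possible answer to Question 16, sketched on a made-up `full` dataframe. It uses `np.where` rather than an explicit loop; a plain function applied with `map` would work just as well.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the full dataframe.
full = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# 1 where the value is "male", 0 for anything else.
sex = pd.Series(np.where(full.Sex == "male", 1, 0), name="Sex")

print(sex.tolist())
```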


*Question 17: Convert $full.Embarked$ to a dataframe of dummy variables and call it "embarked". Hint: Check out pd.get_dummies*

In [0]:
#Q17 Create a new variable for every unique value of Embarked


*Question 18: Convert $full.Pclass$ to a dataframe of dummy variables and call it "pclass". Hint: Check out pd.get_dummies*

In [0]:
#Q18 Create a new variable for every unique value of pclass
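
A sketch of Questions 17 and 18 using `pd.get_dummies` on made-up data. The `prefix` argument is a choice made here so the new column names show which variable they came from.

```python
import pandas as pd

# Made-up stand-in for the full dataframe.
full = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"], "Pclass": [3, 1, 2, 3]})

# One indicator column per distinct value.
embarked = pd.get_dummies(full.Embarked, prefix="Embarked")
pclass = pd.get_dummies(full.Pclass, prefix="Pclass")

print(embarked.columns.tolist())
print(pclass.columns.tolist())
```

Each row has exactly one indicator set, so the row sums of either dataframe are all 1.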


## 3.2 Fill missing values in variables
Most machine learning algorithms require all variables to have values before they can be used for training the model. The simplest method is to fill missing values with the average of the variable across all observations in the training set.

*Question 19: Create an empty dataframe called "imputed"* (10 mins)
<br>*Question: replace all the NA values in $full.Age$ with the mean*
<br>*Question: replace all the NA values in $full.Fare$ with the mean*

In [0]:
#Q19
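
A minimal sketch of mean imputation with `fillna`, on made-up data. It assumes the two columns to fill are Age and Fare, and it takes the mean over all rows of `full`, as the exercise asks.

```python
import pandas as pd

# Made-up stand-in for the full dataframe, with one gap per column.
full = pd.DataFrame({"Age": [22.0, None, 26.0], "Fare": [7.25, 8.05, None]})

imputed = pd.DataFrame()

# mean() skips missing values, so the gaps do not distort the average.
imputed["Age"] = full.Age.fillna(full.Age.mean())
imputed["Fare"] = full.Fare.fillna(full.Fare.mean())

print(imputed)
```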


## 3.3 Feature Engineering &ndash; Creating new variables
Credit: http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html

### 3.3.1 Extract titles from passenger names
Titles reflect social status and may predict survival probability

*Select the cell below and run it by pressing the play button.*

*Question 20: Create an empty dataframe called "title"*
<br>*Check out full[ 'Name' ]*
<br>*Question: Write a function that converts 'Gilbert, Mr. William' to 'Mr' and 'McGowan, Miss. Anna "Annie"' to 'Miss', apply it to all the names in full.Name, and put the results in title.Title*
<br>*Question: Try to count how many Mr, Miss, Dr, etc. were on the boat. How do you do it? Check out value_counts()*

In [0]:
#Q20
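
One way to extract the title, sketched under the assumption that every name follows the "Surname, Title. Given names" layout: take the part after the first comma and before the following full stop. The function name `extract_title` is made up for this example.

```python
import pandas as pd

# A few names in the dataset's layout.
full = pd.DataFrame(
    {
        "Name": [
            "Gilbert, Mr. William",
            'McGowan, Miss. Anna "Annie"',
            "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
        ]
    }
)


def extract_title(name):
    """Return the text between the first comma and the next full stop."""
    return name.split(",")[1].split(".")[0].strip()


title = pd.DataFrame()
title["Title"] = full["Name"].map(extract_title)

# How many passengers carry each title.
counts = title["Title"].value_counts()

print(title["Title"].tolist())
```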



In [0]:
Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"

                    }


*Question 21: Use the mapping table to merge the categories into broader categories, e.g. Ms & Mrs to "Mrs". Convert all the titles in title['Title']. How do you do it? Check out the function "map"*
<br>*Question: Convert title.Title to dummy variables. You should know how to do it!!*
<br>*Question: Concatenate the new dataframe to the old one. How do you do it? Hint: axis = 1*

In [0]:
#Q21
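
A sketch of the three steps in Question 21 on made-up titles. The `mapping` dictionary below is only an excerpt of `Title_Dictionary`, repeated here so the example runs on its own.

```python
import pandas as pd

title = pd.DataFrame({"Title": ["Mr", "Ms", "Mlle", "Dr"]})

# Excerpt of Title_Dictionary, enough for the four rows above.
mapping = {"Mr": "Mr", "Ms": "Mrs", "Mlle": "Miss", "Dr": "Officer"}

# Step 1: replace each raw title with its broader category.
title["Title"] = title["Title"].map(mapping)

# Step 2: one indicator column per category.
title_dummies = pd.get_dummies(title["Title"])

# Step 3: place the indicator columns next to the original column.
title = pd.concat([title, title_dummies], axis=1)

print(title.columns.tolist())
```

`axis=1` joins side by side (more columns); the default `axis=0` would stack the two dataframes on top of each other instead.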



### 3.3.2 Extract Cabin category information from the Cabin number

*Select the cell below and run it by pressing the play button.*

*Question 22: Create an empty dataframe called 'cabin'*
<br>*Question: replace the NA values in full.Cabin with 'U', and save the result in cabin['Cabin']*
<br>*Question: map each Cabin value to its cabin letter, for example C243 to "C" & D4 to "D"*
<br>*Question: again, convert the categorical data to dummy variables*

In [0]:
#Q22
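
A possible answer to Question 22, sketched on made-up cabin values. The `prefix` argument is a naming choice made here, not something the exercise requires.

```python
import pandas as pd

# Made-up stand-in for the full dataframe; None marks an unknown cabin.
full = pd.DataFrame({"Cabin": ["C243", None, "D4", "B57 B59"]})

cabin = pd.DataFrame()

# Unknown cabins become "U".
cabin["Cabin"] = full.Cabin.fillna("U")

# Keep only the first character, which is the deck letter.
cabin["Cabin"] = cabin["Cabin"].map(lambda c: c[0])

# One indicator column per deck letter.
cabin = pd.get_dummies(cabin["Cabin"], prefix="Cabin")

print(cabin.columns.tolist())
```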




### 3.3.3 Extract ticket class from ticket number

*Question 23: Create a function that cleans and standardizes the ticket number. For example, convert "C.A. 2673" & "C.A. 30769" to "CA", or "A/5. 2151" & "A/5. 2161" to "A5". Also convert all purely numerical tickets to "XXX". How would you do it? (This should be the hardest question today)*
<br>*Question: create an empty dataframe called "ticket" & write the cleaned full.Ticket into a new column called "Ticket"*
<br>*Question: create dummy variables from ticket.Ticket, and call the result "ticket"*

In [0]:
#Q23


# a function that extracts each prefix of the ticket, returns 'XXX' if no prefix (i.e the ticket is a digit)
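
A sketch of such a function, under the assumption that the prefix is the first space-separated token that is not purely digits once dots and slashes are removed. The name `clean_ticket` is made up for this example, and the four tickets are only illustrative inputs.

```python
import pandas as pd


def clean_ticket(ticket):
    """Return the ticket prefix without '.' and '/', or 'XXX' if there is none."""
    ticket = ticket.replace(".", "").replace("/", "")
    # Keep only the tokens that are not plain numbers.
    prefixes = [part for part in ticket.split() if not part.isdigit()]
    return prefixes[0] if prefixes else "XXX"


full = pd.DataFrame(
    {"Ticket": ["C.A. 2673", "A/5. 2151", "113803", "STON/O2. 3101282"]}
)

ticket = pd.DataFrame()
ticket["Ticket"] = full["Ticket"].map(clean_ticket)

# One indicator column per cleaned prefix.
ticket = pd.get_dummies(ticket["Ticket"], prefix="Ticket")

print(ticket.columns.tolist())
```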

Unnamed: 0,Ticket_A,Ticket_A4,Ticket_A5,Ticket_AQ3,Ticket_AQ4,Ticket_AS,Ticket_C,Ticket_CA,Ticket_CASOTON,Ticket_FC,...,Ticket_SOTONO2,Ticket_SOTONOQ,Ticket_SP,Ticket_STONO,Ticket_STONO2,Ticket_STONOQ,Ticket_SWPP,Ticket_WC,Ticket_WEP,Ticket_XXX
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


### 3.3.4 Create family size and category for family size
The two variables *Parch* and *SibSp* are used to create the family size variable.

*Question 24: Create an empty dataframe called "family"*

<br>*Let's create a new feature: the size of families (including the passenger). Let's call it "FamilySize". How would you do it?*
<br>*Let's create another feature called Family_Single, i.e. family size is 1. Put it as a column in family.*
<br>*Let's create another feature called Family_small, i.e. family size is between 2 & 4. Put it as a column in family.*
<br>*Let's create another feature called Family_Large, i.e. family size is > 4. Put it as a column in family.*

In [0]:
#Q24
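
A sketch of Question 24 on made-up values. Family size is Parch plus SibSp plus one for the passenger, and each category is a boolean comparison cast to 0/1.

```python
import pandas as pd

# Made-up stand-in for the full dataframe.
full = pd.DataFrame({"Parch": [0, 1, 2, 0], "SibSp": [0, 1, 3, 1]})

family = pd.DataFrame()

# Parents/children + siblings/spouses + the passenger.
family["FamilySize"] = full["Parch"] + full["SibSp"] + 1

# Size categories as 0/1 indicators; between() includes both endpoints.
family["Family_Single"] = (family["FamilySize"] == 1).astype(int)
family["Family_small"] = family["FamilySize"].between(2, 4).astype(int)
family["Family_Large"] = (family["FamilySize"] > 4).astype(int)

print(family)
```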






## 3.4 Assemble final datasets for modelling

Split dataset by rows into test and train in order to have a holdout set to do model evaluation on. The dataset is also split by columns in a matrix (X) containing the input data and a vector (y) containing the target (or labels).

### 3.4.1 Variable selection
Select which features/variables to include in the dataset from the list below:

 - imputed 
 - embarked
 - pclass
 - sex
 - family
 - cabin
 - ticket

*Include the variables you would like to use in the cell below, separated by commas, then run the cell*

*Question 25: We have created a number of dataframes. Try to select sex, family and imputed, concatenate them by column, and create a new dataframe called full_X. How would you do it? Check out pd.concat. Which axis will you use?*

In [0]:
#Q25 Select which features/variables to include in the dataset from the list below:
# imputed , embarked , pclass , sex , family , cabin , ticket
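
A sketch of Question 25 with tiny made-up versions of `imputed`, `family` and `sex`. Column-wise concatenation relies on the three objects sharing the same row index.

```python
import pandas as pd

# Made-up stand-ins for the feature blocks built in section 3.
imputed = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [7.25, 71.28, 7.92]})
family = pd.DataFrame({"FamilySize": [1, 3, 2]})
sex = pd.Series([1, 0, 0], name="Sex")

# axis=1 places the blocks side by side, matching rows by index.
full_X = pd.concat([imputed, family, sex], axis=1)

print(full_X.head())
```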



### 3.4.2 Create datasets
Below we will separate the data into training and test datasets.

*Select the cell below and run it by pressing the play button.*

*Question 26: Remember that our "full" dataframe combines both the train and test sets? Let's split it back into train_valid_X, train_valid_y and test_X, and then split the labelled part into train_X, valid_X, train_y and valid_y.*

<br>*Create a dataframe called 'train_valid_X' by selecting rows 0:891 from full_X*
<br>*Create a series called 'train_valid_y' by selecting the Survived label from titanic*
<br>*Create a dataframe called 'test_X' by selecting row 891 onwards from full_X*
<br>*Using "train_test_split", split train_valid_X and train_valid_y into training and validation sets by setting train_size = .7*
In [0]:
#Q26 Create all datasets that are necessary to train, validate and test models
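
A sketch of Question 26 on synthetic data. The sizes `n_labelled = 20` and 5 unlabelled rows stand in for the real 891 and 418, and `random_state` is set only to make the split repeatable; both are assumptions of this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_labelled = 20  # 891 in the real data

# Synthetic stand-ins: 25 rows of features, labels for the first 20 only.
full_X = pd.DataFrame({"Sex": np.arange(25) % 2, "Age": np.linspace(1, 70, 25)})
titanic = pd.DataFrame({"Survived": np.arange(n_labelled) % 2})

# Labelled rows are used for training and validation.
train_valid_X = full_X[0:n_labelled]
train_valid_y = titanic.Survived

# Unlabelled rows are what we eventually predict for Kaggle.
test_X = full_X[n_labelled:]

# Hold out 30% of the labelled rows as a validation set.
train_X, valid_X, train_y, valid_y = train_test_split(
    train_valid_X, train_valid_y, train_size=0.7, random_state=0
)

print(train_X.shape, valid_X.shape, test_X.shape)
```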



print (full_X.shape , train_X.shape , valid_X.shape , train_y.shape , valid_y.shape , test_X.shape)

### 3.4.3 Feature importance
Selecting the optimal features in the model is important. 
We will now try to evaluate what the most important variables are for the model to make the prediction.

*Try using the helper function plot_variable_importance on train_X & train_y to see which features are the most important.*

In [0]:
plot_variable_importance(train_X, train_y)

# 4. Modeling
We will now select a model we would like to try, use the training dataset to train it, and then check its performance on the validation set.

## 4.1 Model Selection
There are several models to choose from. A good starting point is logistic regression.

**Select ONLY the model you would like to try below and run the corresponding cell by pressing the play button.**

### 4.1.1 Random Forests Model
Try a random forest model by running the cell below. 

In [0]:
model = RandomForestClassifier(n_estimators=100)

### 4.1.2 Support Vector Machines
Try a Support Vector Machines model by running the cell below. 

In [0]:
model = SVC()

### 4.1.3 Gradient Boosting Classifier
Try a Gradient Boosting Classifier model by running the cell below. 

In [0]:
model = GradientBoostingClassifier()

### 4.1.4 K-nearest neighbors
Try a k-nearest neighbors model by running the cell below. 

In [0]:
model = KNeighborsClassifier(n_neighbors = 3)

### 4.1.5 Gaussian Naive Bayes
Try a Gaussian Naive Bayes model by running the cell below. 

In [0]:
model = GaussianNB()

### 4.1.6 Logistic Regression
Try a Logistic Regression model by running the cell below. 

In [0]:
model = LogisticRegression()

## 4.2 Train the selected model
Once you have selected a dataset with the features you want and a model you would like to try, it is time to train the model. After all our preparation, model training takes just one line.

*Question 27: How do you fit a model? Try model.fit with train_X and train_y*

In [0]:
#Q27






# 5. Evaluation
Now we are going to evaluate model performance and the feature importance.

## 5.1 Model performance
We can evaluate the accuracy of the model by using the validation set, where we know the actual outcome. This data set has not been used for training the model, so it's completely new to the model.

We then compare this accuracy score with the accuracy when using the model on the training data. If the difference between these is large, this is an indication of overfitting. We try to avoid this because it means the model will not generalize well to new data and is expected to perform poorly.

*Question 28: How do you evaluate the model performance? Try model.score on train_X and train_y, and also on valid_X and valid_y.*

In [0]:
#Q28
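
A sketch of Questions 27 and 28 on synthetic data, with a decision tree standing in for whichever model you selected. The noisy labels are generated here only to make the gap between training and validation accuracy visible.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature plus noise.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] + 0.3 * rng.randn(200) > 0.5).astype(int)

train_X, valid_X = X[:140], X[140:]
train_y, valid_y = y[:140], y[140:]

# Q27: training is a single call to fit.
model = DecisionTreeClassifier(random_state=0)
model.fit(train_X, train_y)

# Q28: accuracy on data the model has seen vs. data it has not.
train_score = model.score(train_X, train_y)
valid_score = model.score(valid_X, valid_y)

print(train_score, valid_score)
```

An unpruned tree memorises its training rows, so the training score is perfect here while the validation score is not; that gap is the overfitting signal described above.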





## 5.2 Feature importance - selecting the optimal features in the model
We will now try to evaluate which variables are most important for the model when it makes its prediction. The helper function only works for models that expose `feature_importances_` (decision trees and tree ensembles such as random forests and gradient boosting), so if that's the kind of model you chose, you can use it to see the feature importance.

*Question 29: Try plot_model_var_imp to check the variable importance of your model.*

In [0]:
#Q29
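
The helper `plot_model_var_imp` plots the `feature_importances_` attribute of a fitted model. As a plot-free illustration of what that attribute holds, the sketch below fits a tree on synthetic data in which only one column (named `signal` here, a made-up name) determines the label.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label is fully determined by the "signal" column.
rng = np.random.RandomState(0)
X = pd.DataFrame({"signal": rng.rand(200), "noise": rng.rand(200)})
y = (X["signal"] > 0.5).astype(int)

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# One importance per column, summing to 1; larger means more useful splits.
importance = pd.Series(model.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)

print(importance)
```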





### 5.2.1 Automagic
It's also possible to automatically select the optimal number of features and visualize this. The reporting and plotting code is commented out and can be tried in the competition part of the tutorial.

*Select the cell below and run it by pressing the play button.*

In [0]:
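# RFECV needs an estimator that exposes coef_ or feature_importances_
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# StratifiedKFold now takes the number of folds; the labels are passed at fit time
rfecv = RFECV( estimator = model , step = 1 , cv = StratifiedKFold( n_splits = 2 ) , scoring = 'accuracy' )
rfecv.fit( train_X , train_y )

#print (rfecv.score( train_X , train_y ) , rfecv.score( valid_X , valid_y ))
#print( "Optimal number of features : %d" % rfecv.n_features_ )

# Plot number of features VS. cross-validation scores
# (older scikit-learn versions expose these scores as rfecv.grid_scores_)
#scores = rfecv.cv_results_[ 'mean_test_score' ]
#plt.figure()
#plt.xlabel( "Number of features selected" )
#plt.ylabel( "Cross validation score (nb of correct classifications)" )
#plt.plot( range( 1 , len( scores ) + 1 ) , scores )
#plt.show()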

## 5.3 Competition time!
It's now time for you to get your hands even dirtier and go at it all by yourself in a `challenge`! 

1. Try the other models in step 4.1 and compare their results
    * Do this by running the cell of the model you want to try, then retraining and rescoring it
2. Try adding new features in step 3.4.1
    * Do this by adding them to the feature selection in that section.


**The winner is the one who gets the highest-scoring model on the validation set**

*Final Question: Write a loop that goes over all the models and finds out which one has the highest score.*

In [0]:
#Final question
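
One way to structure the final question: put the candidate models from section 4.1 in a dictionary, fit each on the training part and record its validation accuracy. The data below comes from `make_classification` purely so the sketch runs on its own, and `random_state` and `max_iter` are choices made here for repeatability and convergence; in the notebook you would use your own train_X, train_y, valid_X and valid_y.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the prepared Titanic features and labels.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, train_size=0.7, random_state=0
)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC": SVC(),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Fit every model and keep its accuracy on the validation set.
scores = {}
for name, candidate in models.items():
    candidate.fit(train_X, train_y)
    scores[name] = candidate.score(valid_X, valid_y)

best = max(scores, key=scores.get)
print(best, scores[best])
```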





# 6. Deployment

Deployment in this context means publishing the resulting prediction from the model to the Kaggle leaderboard. To do this do the following:

 1. Select the cell below and run it by pressing the play button.
 2. Press the `Publish` button in top right corner.
 3. Select `Output` on the notebook menubar
 4. Select the result dataset and press `Submit to Competition` button

In [0]:
test_Y = model.predict( test_X ).astype( int )  # cast to integer 0/1 labels, as in Kaggle's sample submission
passenger_id = full[891:].PassengerId
test = pd.DataFrame( { 'PassengerId': passenger_id , 'Survived': test_Y } )
test.shape
test.head()
test.to_csv( 'titanic_pred.csv' , index = False )