# Titanic: Machine Learning from Disaster
*Predict survival on the Titanic and get familiar with ML basics*

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Step 1: Define the Problem

The problem statement is given to us on a golden plater, develop an algorithm to predict the survival outcome of passengers on the Titanic.
## Step 2: Gather the Data

The test and train data have been found at Kaggle's Titanic: Machine Learning from Disaster.(https://www.kaggle.com/c/titanic)

## Step 3 : Data Wrangling 
* Implementing data architectures for storage and processing
* Developing data governance standards for quality and control
* Data extraction (i.e. ETL and web scraping)
* data cleaning to identify aberrant, missing, or outlier data points.

### 3.1: Importing Libraries

In [1]:
#collection of functions for data processing 
import pandas as pd

#foundational package for scientific computing
import numpy as np

#collection of functions for scientific computing and advance mathematics
import scipy as sp

import IPython
from IPython import display #pretty printing of dataframes in Jupyter notebook

import sklearn #collection of machine learning algorithms
import random
import time

import warnings
warnings.filterwarnings('ignore')

#### 3.1.1: Load Data Modelling Libraries

We will use the popular scikit-learn library to develop our machine learning algorithms. In sklearn, algorithms are called Estimators and implemented in their own classes. For data visualization, we will use the matplotlib and seaborn library.

In [12]:
#common Model algoritm
from sklearn import svm,tree,linear_model,neighbors, naive_bayes, ensemble, discriminant_analysis,gaussian_process
from xgboost import  XGBClassifier

#common MOdel Helpers
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn import feature_selection, model_selection, metrics

#Visualization
import matplotlib as mlp
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
from pandas.tools.plotting import scatter_matrix
#shows the plot in jupiter notebook browser
%matplotlib inline
mlp.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] =12,8

* `The Survived` variable is our outcome or dependent variable. It is a binary nominal datatype of 1 for survived and 0 for did not survive. All other variables are potential predictor or independent variables. It's important to note, more predictor variables do not make a better model, but the right variables.
* `The PassengerID` and `Ticket` variables are assumed to be random unique identifiers, that have no impact on the outcome variable. Thus, they will be excluded from analysis.
* `The Pclass` variable is an ordinal datatype for the ticket class, a proxy for socio-economic status (SES), representing 1 = upper class, 2 = middle class, and 3 = lower class.
* `The Name` variable is a nominal datatype. It could be used in feature engineering to derive the gender from title, family size from surname, and SES from titles like doctor or master. Since these variables already exist, we'll make use of it to see if title, like master, makes a difference.
* `The Sex` and `Embarked` variables are a nominal datatype. They will be converted to dummy variables for mathematical calculations.
* `The Age` and `Fare` variable are continuous quantitative datatypes.
* `The SibSp` represents number of related siblings/spouse aboard and Parch represents number of related parents/children aboard. Both are discrete quantitative datatypes. This can be used for feature engineering to create a family size and is alone variable.
* `The Cabin` variable is a nominal datatype that can be used in feature engineering for approximate position on ship when the incident occurred and SES from deck levels. However, since there are many null values, it does not add value and thus is excluded from analysis.

#### 3.1.2. Import Data


In [14]:
from subprocess import check_output
print(check_output(["ls", "dataset"]).decode("utf8"))
#a dataset should be broken into 3 splits: train, test, and (final) validation
#the test file provided is the validation file for competition submission
#we will split the train set into train and test data in future sections

gender_submission.csv
test.csv
train.csv



In [62]:
train_data = pd.read_csv('dataset/train.csv')
test_data = pd.read_csv('dataset/train.csv')

In [63]:
#to play with our data we'll create a copy
copy_train_data = train_data.copy(deep=True)

#because we can clean both datasets at once
data_cleaner =[train_data, test_data]

In [64]:
print('*'*80)
print(train_data.info())
print('*'*80)
train_data.head(10)

********************************************************************************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
None
********************************************************************************


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [65]:
#LET'S CHECK THE NULL VALUES 
print('*'*100)
print('Train data set:\n', train_data.isnull().sum())
print('*'*100)
print('Test data set:\n', test_data.isnull().sum())
print('*'*100)

****************************************************************************************************
Train data set:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
****************************************************************************************************
Test data set:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
****************************************************************************************************


In [66]:
#LET'S CHECK THE SUMMARY STATISTIC
print('*'*100)
print(train_data.describe(include='all'))
print('*'*100)

****************************************************************************************************
        PassengerId    Survived      Pclass                            Name  \
count    891.000000  891.000000  891.000000                             891   
unique          NaN         NaN         NaN                             891   
top             NaN         NaN         NaN  Holverson, Mr. Alexander Oskar   
freq            NaN         NaN         NaN                               1   
mean     446.000000    0.383838    2.308642                             NaN   
std      257.353842    0.486592    0.836071                             NaN   
min        1.000000    0.000000    1.000000                             NaN   
25%      223.500000    0.000000    2.000000                             NaN   
50%      446.000000    0.000000    3.000000                             NaN   
75%      668.500000    1.000000    3.000000                             NaN   
max      891.000000    1.00000

#### 3.1.3 Clean Data
Now that we know what to clean, let's execute our code.
* **'Age':** We have 177 missing values in age column, We are going to fill missing values with median
* **'Embarked':** We have 2 missing values. We are going to use mode() function which is fill the values with most common element from the given list.
* **'PassengerId','Cabin', 'Ticket'** drop this columsn from train data set

In [67]:
###COMPLETING: complete or delete missing values in train and test/validation dataset
for data in data_cleaner:
    data['Age'].fillna(data['Age'].median(), inplace=True)
    data['Embarked'].fillna(data['Embarked'].mode()[0], inplace=True)
#delete the cabin feature/column and others previously stated to exclude in train dataset
drop_columns = ['PassengerId','Cabin', 'Ticket']
train_data.drop(drop_columns, axis=1, inplace=True)    

In [71]:
#LET'S CHECK THE NULL VALUES 
print('*'*100)
print('Trian data set: \n',train_data.isnull().sum())
print('*'*100)
print('Test data set:\n', test_data.isnull().sum())
print('*'*100)

****************************************************************************************************
Trian data set: 
 Survived    0
Pclass      0
Name        0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
dtype: int64
****************************************************************************************************
Test data set:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
****************************************************************************************************


**Feature Engineering for train and test/validation dataset**<br>
* **Family_size:** Create a family size column adding the SibSp(siblings/spouse) and Parch(parents/children) and +1 for themsellf
* **IsAlone:** Create a IsAlone column, first inilize the column to 1 and then update to 0 if family size is greater than 1
* **Title:** Split title from name column. You may found the example the name column to extract the Title we first separate the "," and then "." then take the second as Title as you can see from the example.

    * Braund, Mr. Owen Harris
    * Moran, Mr. James
    * Bonnell, Miss. Elizabeth
    * Rice, Master. Eugene
    * Rice, Mrs. William (Margaret Norton)

* **FareBin:** Create a FareBin column uisng qcut or frequency on Fare column to categorize the fare column
* **AgeBin:** Create a AgeBin column using qcut() function on Age Column.



In [97]:
#CREATE: Feature Engineering for train and test/validation dataset
for data in data_cleaner:
    #Create the Family size column using sibling and parent information from the data
    data['Family_Size'] = data['SibSp'] + data['Parch'] +1
    
    #initialize to yes/1 is alone
    data['IsAlone'] = 1
    data['IsAlone'].loc[data['Family_Size']>1] =0 #updating if family size more then 1 
    
    #Title from name column
    data['Title'] = data['Name'].str.split(', ', expand=True)[1].str.split('. ', expand=True)[0]
    
    #Fare Bins/Buckets using qcut
    data['FareBin'] = pd.qcut(data['Fare'],4)
    
    #Age bins using qcut
    data['AgeBin'] = pd.qcut(data['Age'].astype(int) ,5, duplicates='drop')
    

In [98]:
train_data.head(10)

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Family_Size,IsAlone,Title,FareBin,AgeBin
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,2,0,Mr,"(-0.001, 7.91]","(20.0, 28.0]"
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,2,0,Mrs,"(31.0, 512.329]","(28.0, 38.0]"
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,1,1,Miss,"(7.91, 14.454]","(20.0, 28.0]"
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S,2,0,Mrs,"(31.0, 512.329]","(28.0, 38.0]"
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S,1,1,Mr,"(7.91, 14.454]","(28.0, 38.0]"
5,0,3,"Moran, Mr. James",male,28.0,0,0,8.4583,Q,1,1,Mr,"(7.91, 14.454]","(20.0, 28.0]"
6,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,51.8625,S,1,1,Mr,"(31.0, 512.329]","(38.0, 80.0]"
7,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,21.075,S,5,0,Master,"(14.454, 31.0]","(-0.001, 20.0]"
8,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,11.1333,S,3,0,Mrs,"(7.91, 14.454]","(20.0, 28.0]"
9,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,30.0708,C,2,0,Mrs,"(14.454, 31.0]","(-0.001, 20.0]"


In [102]:
#LET"S COUNT THE TITLE 
print(train_data['Title'].value_counts())

Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Major         2
Col           2
Mlle          2
Don           1
Lady          1
Capt          1
Ms            1
Jonkheer      1
Sir           1
th            1
Mme           1
Name: Title, dtype: int64


In [114]:
#apply and lambda functions are quick and dirty code to find and replace with fewer lines of code
#If title names count less than 10 we can replace with Misc 
title_names = train_data['Title'].value_counts() < 10
for data in data_cleaner:
    data['Title'] = data['Title'].apply(lambda x: 'Misc' if title_names.loc[x] == True else x)
print(train_data['Title'].value_counts())

Mr        517
Miss      182
Mrs       125
Master     40
Misc       27
Name: Title, dtype: int64
