# Kaggle Challenge: Machine Learning from Disaster

This is the first Kaggle challenge I am participating in.

### Reading the data

In [70]:
import pandas as pd
import numpy as np
from IPython.display import display

In [71]:
# Load the training data
data = pd.read_csv("train.csv")

# Look at the first five rows
display(data.head())

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We see that our dataset has 12 variables. 
1. <b>PassengerId:</b> Identifies the passenger by a unique number
2. <b>Survived:</b> 1 if the passenger survived the sinking, 0 otherwise. This is the variable we will predict for our submission, also known as the dependent variable.
3. <b>Pclass:</b> The passengers are distributed into three socio-economic classes, 1 being upper, 2 being middle and 3 being lower class.
4. <b>Name:</b> Name of the Passenger
5. <b>Sex:</b> Gender of the Passenger
6. <b>Age:</b> Age of the Passenger
7. <b>SibSp:</b> Total number of siblings and spouses the passenger had on board
8. <b>Parch:</b> Total number of parents and children the passenger had on board
9. <b>Ticket:</b> Ticket Number
10. <b>Fare:</b> Fare passenger paid for that ticket
11. <b>Cabin:</b> Cabin number of the passenger
12. <b>Embarked:</b> Port of embarkation. The Titanic took on passengers at three ports: C (Cherbourg), Q (Queenstown) and S (Southampton).
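
A quick way to check these twelve columns against the descriptions above is to split them by dtype. The sketch below rebuilds the five displayed rows by hand so that it runs without `train.csv`; the names `sample`, `numeric_cols` and `text_cols` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hand-copied from the five rows displayed above (the second name is
# truncated in the display and is kept truncated here).
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Split the columns into numeric and non-numeric ones.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(sample.shape)
print("numeric:", numeric_cols)
print("text:   ", text_cols)
```

On the full dataset the same two `select_dtypes` calls can be run on `data` directly.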

Next we will divide our dataset into X and Y: a set of features and the dependent variable. We will ignore PassengerId, Name, Ticket and Cabin.

In [72]:
# The DataFrame was loaded as `data` above
data_X = data[['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']]
data_Y = data[['Survived']]

Let us start exploring the features one by one, and see how they affect the chance of surviving the sinking. For this purpose, we will group the features based on their type.

1. Nominal Features: These are categorical features that do not have any order. 
2. Ordinal Features: These are categorical or numeric features that have an order or a rank. 
3. Continuous Features: These features are numeric in nature, and can take any continuous real value. 
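
The difference between the three types shows up in how pandas can represent them. The sketch below is a minimal illustration on a three-row toy frame (the names `toy`, `sex` and `pclass` are hypothetical): a nominal feature becomes an unordered categorical, an ordinal feature becomes an ordered one, and a continuous feature stays a float column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
})

# Nominal: categories without any order.
sex = pd.Categorical(toy["Sex"])

# Ordinal: categories with a rank, here lower (3) < middle (2) < upper (1).
pclass = pd.Categorical(toy["Pclass"], categories=[3, 2, 1], ordered=True)

print(sex.ordered, pclass.ordered)

# min/max follow the declared rank, not the numeric value of the label.
print(pclass.min(), pclass.max())

# Continuous: arithmetic such as a mean is meaningful.
print(toy["Age"].mean())
```

Comparisons and `min`/`max` are only defined for the ordered categorical, which is what distinguishes an ordinal feature from a nominal one in practice.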


In [73]:
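nominal_features = data[['Sex','Embarked']]
# Pclass is a rank; SibSp and Parch are integer counts, so they are ordered but not continuous
ordinal_features = data[['Pclass','SibSp','Parch']]
# Age and Fare can take any real value
continuous_features = data[['Age','Fare']]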

Let us explore if we have any missing values in our dataset. We will need to impute these values before we can apply any machine learning techniques. 

In [74]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We see that 687 values for Cabin, 177 values for Age and 2 values for Embarked are missing. Since only two values are missing for the port of embarkation, we can impute them with the port where the most passengers embarked. For Age, we will assess our imputation technique in the next steps.
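
To put these counts in proportion, the arithmetic below converts them to percentages. It assumes a total of 891 rows, which is the 177 missing ages plus the 714 non-null ages reported by `describe()` further down; the names `missing`, `n_rows` and `share` are illustrative.

```python
import pandas as pd

# Missing counts copied from the isnull().sum() output above.
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})

# Assumed total: 177 missing ages + 714 non-null ages.
n_rows = 177 + 714

# Percentage of rows with a missing value, per column.
share = (missing / n_rows * 100).round(1)
print(share)
```

On the loaded frame, `data.isnull().mean() * 100` should give the same percentages without hard-coding the counts.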

In [75]:
#Imputing Embarked
data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [78]:
# Fill the two missing ports with the most common one, "S"
data['Embarked'] = data['Embarked'].fillna("S")
print(data.isnull().sum())
print(data['Embarked'].value_counts())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
S    646
C    168
Q     77
Name: Embarked, dtype: int64
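
The fill value "S" was read off the counts by hand. As a minimal sketch, the same value can be derived with `Series.mode()`, shown here on a small made-up series so that it runs on its own (`toy`, `most_common` and `filled` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy stand-in for data['Embarked'], with two missing entries.
toy = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores NaN by default and returns the most frequent value(s).
most_common = toy.mode()[0]

# Replace the missing entries with that value.
filled = toy.fillna(most_common)

print(most_common)
print(filled.value_counts())
```

Deriving the value this way avoids a hard-coded constant if the same step is later applied to a different file.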


Since we are ignoring the Cabin variable, we do not need to impute it. Let us analyze Age now.

In [79]:
data['Age'].describe()


count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
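
Note that `count` is 714 because `describe()` skips the missing ages. As one possible baseline, and not necessarily the technique chosen in the following steps, the sketch below fills missing ages with the median on a made-up series (`ages`, `median_age` and `filled` are hypothetical names) and checks how the summary changes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data['Age'] with two missing values.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0, np.nan], name="Age")

# count() and median() both ignore NaN, like describe().
print(ages.count())
median_age = ages.median()

# Candidate imputation: replace missing ages with the median.
filled = ages.fillna(median_age)

print(median_age, filled.count(), filled.median())
print(ages.std(), filled.std())
```

A median fill leaves the median unchanged but shrinks the standard deviation, which is one reason to compare it against alternatives before settling on it.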