<a href="https://colab.research.google.com/github/ashishkumar26/Kaggle/blob/master/Titanic_CaseStudy/titanic_survival_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

                       
## **Kernel Goals**


This kernel has two primary goals:
* To carry out a statistical and exploratory data analysis, using visualization, of why some groups of passengers survived at higher rates than others.
* To build machine learning models that can predict a passenger's chance of survival.

# <font color='blue'>1: Importing Necessary Libraries and datasets</font>
***
<a id="import_libraries**"></a>
### 1.1. Libraries

In [0]:
# Import necessary modules/libraries for data analysis and data visualization. 
# Data analysis modules
import pandas as pd
import numpy as np

# Visualization libraries
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

## Machine learning libraries
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import cross_val_score,GridSearchCV


## Ignore warning
# import warnings
# warnings.filterwarnings('ignore')



***
<a id="load_data"></a>
### 1.2. Load datasets


In [0]:
## Importing the training and test datasets
train = pd.read_csv("https://raw.githubusercontent.com/ashishkumar26/Datasets/master/CaseStudy/titanic_train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/ashishkumar26/Datasets/master/CaseStudy/titanic_test.csv")

### 1.3. Analyze the Dataset
<a id="analyzethedataset"></a>
***

In [0]:
# Look for the top 5 rows
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [0]:
# Describe the dataset
train.describe()


Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [0]:
# Shape of the dataset
train.shape # Training Dataset has 891 rows and 12 columns


(891, 12)

# <font color='blue'>2: Overview and Cleaning the Data</font>
<a id="cleaningthedata"></a>
***

In [0]:
## saving passenger id in advance in order to submit later. 
passengerid = test.PassengerId
ticket = test.Ticket


In [0]:
## We will drop PassengerId and Ticket from both datasets, since they will not be useful for our analysis.
train.drop(['PassengerId'], axis=1, inplace=True)
test.drop(['PassengerId'], axis=1, inplace=True)
train.drop(['Ticket'], axis= 1, inplace=True)
test.drop(['Ticket'], axis=1, inplace = True )
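
As a side note, the four `drop` calls above could also be written as one call per frame. The following is a minimal sketch on a toy frame (the name `toy` and the `errors="ignore"` choice are assumptions for illustration, not part of the original kernel). Returning a new frame instead of using `inplace=True` also makes the cell safe to re-run.

```python
import pandas as pd

# Toy frame standing in for train/test (three columns only, for illustration)
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Ticket": ["A/5 21171", "PC 17599"],
    "Fare": [7.25, 71.2833],
})

# drop() returns a new frame; errors="ignore" skips columns that are already gone,
# so running the same line twice does not raise a KeyError
cleaned = toy.drop(columns=["PassengerId", "Ticket"], errors="ignore")
print(cleaned.columns.tolist())
```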

This dataset is almost clean. However, before we jump into visualization and machine learning models, let's analyze what we have here.


In [0]:
print ('Train '.center(50, "*"))
print (train.info())
print ('Test '.center(50, "*"))
print (test.info())

**********************Train **********************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(4)
memory usage: 69.7+ KB
None
**********************Test ***********************
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 9 columns):
Pclass      418 non-null int64
Name        418 non-null object
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        417 non-null float64
Cabin       91 non-null object
Embarked    418 non-null object
dtypes: float64(2

It looks like the features neither have the same number of entries nor contain only numerical (int, float) values, which suggests:
* We may have missing values in our features.
* We may have categorical features. 
* We may have alphanumerical or/and text features. 
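
The three observations above can be checked in a single table. This is a minimal sketch, assuming a hypothetical helper `summarize_columns` and a toy frame with made-up values. It reports the dtype, the missing count, and the missing percentage for each column.

```python
import numpy as np
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and missing percentage for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(2),
    })

# Toy frame with made-up values (not the Titanic data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0],
    "Sex": ["male", "female", "female", "male"],
    "Cabin": [None, "C85", None, None],
})

summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(train)` would give the same kind of overview for the real data.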

### 2.1. Dealing with Missing values
<a id="dealwithnullvalues"></a>
***

In [0]:
print ('Train '.center(20, "*"))
print (train.isnull().sum())
print ('Test  '.center(20, "*"))
print (test.isnull().sum())

*******Train *******
Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64
*******Test  *******
Pclass        0
Name          0
Sex           0
Age          86
SibSp         0
Parch         0
Fare          1
Cabin       327
Embarked      0
dtype: int64


We see that both the train and test datasets have missing values. Let's fix them.

### Embarked feature
***

In [0]:
# dropna=False includes null values in the counts; normalize=True returns relative frequencies
print (train.Embarked.value_counts(dropna=False))
print (train.Embarked.value_counts(dropna=False, normalize=True)*100)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
S      72.278339
C      18.855219
Q       8.641975
NaN     0.224467
Name: Embarked, dtype: float64
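
To see what the `dropna` and `normalize` parameters change, here is a small sketch on a toy Series (the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Toy Series: three known ports and one missing value
ports = pd.Series(["S", "S", "C", np.nan])

counts_without_nan = ports.value_counts()            # NaN is skipped by default
counts_with_nan = ports.value_counts(dropna=False)   # NaN gets its own row
shares_pct = ports.value_counts(dropna=False, normalize=True) * 100  # percentages

print(counts_with_nan)
print(shares_pct)
```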


It looks like there are only two null values (~0.22%) in the Embarked feature. Since this is less than 1%, we can replace them with the mode value, "S".

In [0]:
train[train.Embarked.isnull()]


Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
61,1,1,"Icard, Miss. Amelie",female,38.0,0,0,80.0,B28,
829,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,80.0,B28,


In [0]:
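# Assign the result back to the column; calling fillna(inplace=True) on train.Embarked
# is a chained operation that newer pandas versions warn about.
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])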
train.Embarked.isnull().sum()


0
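
The mode imputation used for Embarked can be sketched on a toy frame (made-up values, for illustration only). Note that `mode()` skips NaN by default and may return several values when there is a tie, so `[0]` simply takes the first of them.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

most_common = toy["Embarked"].mode()[0]   # most frequent non-null value
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(most_common)
print(toy["Embarked"].isnull().sum())
```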

### Cabin Feature
***

In [0]:
print(train.Cabin.isnull().sum()/len(train.Cabin)*100)
print(test.Cabin.isnull().sum()/len(test.Cabin)*100)

77.10437710437711
78.22966507177034


Approximately 77% of the Cabin feature is missing in the training data (about 78% in the test data). We have two choices: we can either drop the whole feature, or we can brainstorm a little and find an appropriate way to put it to use.
* We may say passengers with cabin records had a higher socio-economic status than others.
* We may also say passengers with cabin records were more likely to be taken into consideration for the rescue mission.

I believe it would be wise to keep the data. We will assign all the null values as **"N"** for now and will put the Cabin column to good use in the feature engineering section.
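
One way to probe the two hypotheses above is to compare survival rates for passengers with and without a cabin record. This is a minimal sketch, assuming a hypothetical helper `survival_rate_by_cabin_record` and a toy frame with made-up values; it says nothing about the real result.

```python
import pandas as pd

def survival_rate_by_cabin_record(df: pd.DataFrame) -> pd.Series:
    """Mean of Survived for rows with a known Cabin vs. a missing Cabin."""
    record = df["Cabin"].notnull().map({True: "known", False: "missing"})
    return df.groupby(record)["Survived"].mean()

# Toy frame with made-up values (not the Titanic data)
toy = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
    "Cabin": ["C85", "B28", "E46", "C123", None, None, None, None],
})

rates = survival_rate_by_cabin_record(toy)
print(rates)
```

On the real data, the helper would need to be called on `train` while the missing cabins are still null, that is, before they are replaced with "N".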

In [0]:
# Assign through the DataFrame itself; the chained form train.Cabin[mask] = 'N'
# triggers pandas' SettingWithCopyWarning.
train['Cabin'] = train['Cabin'].fillna('N')
test['Cabin'] = test['Cabin'].fillna('N')


All the cabin names start with an English letter followed by digits. We can group these cabins by their first letter.

In [0]:
train.Cabin = [i[0] for i in train.Cabin]
test.Cabin = [i[0] for i in test.Cabin]
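
The list comprehension above can also be written with the vectorized string accessor. Here is a minimal sketch on a toy Series (the variable names `cabins` and `decks` are assumptions for illustration); `.str[0]` takes the first character of each value.

```python
import pandas as pd

# Toy cabin values, with "N" standing in for a missing cabin
cabins = pd.Series(["C85", "B28", "N", "E46", "C123"])

decks = cabins.str[0]          # first character of every entry
counts = decks.value_counts()  # passengers per cabin letter

print(decks.tolist())
print(counts)
```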

In [0]:
print (train.isnull().sum())
print(''.center(15,'*'))
print(test.isnull().sum())

Survived      0
Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin         0
Embarked      0
dtype: int64
***************
Pclass       0
Name         0
Sex          0
Age         86
SibSp        0
Parch        0
Fare         1
Cabin        0
Embarked     0
dtype: int64
