### This notebook deals with improvements to the Titanic survival prediction

# 1. Setup data

In [1]:
import pandas as pd
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

# 2. Data Exploration and pre-processing

The data describes an accident. Intuitively, a passenger's 'Name' should have no effect on survival, so 'Name' is dropped.


In [10]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
features_train = train_df.drop(['Name'], axis = 1)
features_test = test_df.drop(['Name'], axis = 1)

### 2.1 Missing data

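Age could be a factor that affects survival. It is missing in about 19% of the training records, so we simply drop those records.
Cabin is missing in almost 77% of the training records. We treat the missing values as a separate category, 'NA'.
The 2 training records with a missing Embarked value can be dropped.
In the test set, missing Age and Fare values are filled with the column mean instead of being dropped.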

In [11]:
total1 = features_train.isnull().sum().sort_values(ascending = False)
percent1 = (features_train.isnull().sum()/features_train.isnull().count()).sort_values(ascending = False)
missingData_df_train = pd.concat([total1,percent1], axis = 1)
missingData_df_train

Unnamed: 0,0,1
Embarked,0,0.0
Cabin,0,0.0
Fare,0,0.0
Ticket,0,0.0
Parch,0,0.0
SibSp,0,0.0
Age,0,0.0
Sex,0,0.0
Pclass,0,0.0
Survived,0,0.0


In [5]:
total2 = features_test.isnull().sum().sort_values(ascending = False)
percent2 = (features_test.isnull().sum()/features_test.isnull().count()).sort_values(ascending = False)
missingData_df_test = pd.concat([total2,percent2], axis = 1)
missingData_df_test

Unnamed: 0,0,1
Cabin,327,0.782297
Age,86,0.205742
Fare,1,0.002392
Embarked,0,0.0
Ticket,0,0.0
Parch,0,0.0
SibSp,0,0.0
Sex,0,0.0
Pclass,0,0.0
PassengerId,0,0.0
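
The two cells above repeat the same computation for the training and test sets. A minimal sketch of a reusable helper is shown here; the name `missing_summary`, the column labels `total` and `fraction`, and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most-missing first."""
    total = df.isnull().sum()
    # The mean of a boolean mask is the fraction of True values, i.e. the missing share
    fraction = df.isnull().mean()
    summary = pd.concat([total, fraction], axis=1, keys=['total', 'fraction'])
    return summary.sort_values('total', ascending=False)

# Toy frame for illustration only (not the Titanic data)
toy_df = pd.DataFrame({
    'Age': [22.0, None, 26.0, None],
    'Sex': ['male', 'female', 'female', 'male'],
})
toy_summary = missing_summary(toy_df)
print(toy_summary)
```

Under these assumptions, `missing_summary(features_train)` and `missing_summary(features_test)` would give the same tables as above, with named columns in place of `0` and `1`.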


In [6]:
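# Training set: drop records with a missing Age
features_train.dropna(subset=['Age'], inplace=True)
# Test set: fill missing Age with the column mean (mean() must be called, not passed as a method)
features_test['Age'] = features_test['Age'].fillna(features_test['Age'].mean())

# Treat a missing Cabin as its own category, 'NA'
features_train['Cabin'] = features_train['Cabin'].fillna('NA')
features_test['Cabin'] = features_test['Cabin'].fillna('NA')

# Training set: drop records with a missing Embarked; test set: fill missing Fare with the mean
features_train.dropna(subset=['Embarked'], inplace=True)
features_test['Fare'] = features_test['Fare'].fillna(features_test['Fare'].mean())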

### 2.2 Ensuring uniformity of unique values in categorical variables

For SibSp (number of siblings/spouses), one record in the test set has a value of 8. This value is not present in the training set.
We treat this value as the highest, so we replace it with the highest value present in the training set.
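
One way to apply this update is to cap the test column at the training maximum. The sketch below assumes that capping is what is meant by updating to the highest training value; the toy frames and the names `toy_train`, `toy_test`, and `train_max` are hypothetical stand-ins for `features_train` and `features_test`.

```python
import pandas as pd

# Toy stand-ins for the real frames (values chosen for illustration only)
toy_train = pd.DataFrame({'SibSp': [0, 1, 2, 5]})
toy_test = pd.DataFrame({'SibSp': [0, 1, 8]})

# Highest SibSp value observed in the training data
train_max = toy_train['SibSp'].max()

# Cap any test value above the training maximum; smaller values are unchanged
toy_test['SibSp'] = toy_test['SibSp'].clip(upper=train_max)
print(toy_test['SibSp'].tolist())
```

Using `clip` rather than replacing the literal value 8 also handles any other test value above the training range, which may or may not be desired.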


In [14]:
# Note: .index on the boolean mask returns every column label, not only the object-typed ones
(features_train.dtypes == 'object').index

Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [12]:
features_train[features_train.dtypes == 'object']

  """Entry point for launching an IPython kernel.


IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
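
The error arises because `features_train.dtypes == 'object'` is a boolean Series indexed by column names, while `features_train[...]` with a boolean Series tries to select rows. A minimal sketch of three ways to pick the object-typed columns follows, using a toy frame; `toy_df` and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame; dtype=object is set explicitly because newer pandas versions
# may infer a dedicated string dtype for text columns
toy_df = pd.DataFrame({
    'Sex': pd.Series(['male', 'female'], dtype=object),
    'Age': [22.0, 38.0],
    'Cabin': pd.Series(['NA', 'C85'], dtype=object),
})

# Column-wise boolean mask: True where the column dtype is object
is_object = toy_df.dtypes == 'object'

# 1. Names of the object-typed columns: filter the dtypes Series with the mask
object_cols = toy_df.dtypes[is_object].index.tolist()

# 2. Sub-frame of object-typed columns via select_dtypes
object_df = toy_df.select_dtypes(include='object')

# 3. Same sub-frame via .loc, applying the mask to the column axis
object_df_loc = toy_df.loc[:, is_object]

print(object_cols)
```

On the real data, the same calls would be applied to `features_train` in place of `toy_df`.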