### Kaggle Titanic Dataset

In [1]:
# Data analysis and wrangling
import numpy as np
import pandas as pd

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
%matplotlib inline
mpl.style.use('ggplot')
sns.set_style('white')
pylab.rcParams['figure.figsize'] = 12,8

# Scikit-learn
from sklearn import datasets, feature_selection, linear_model, metrics, model_selection
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler

import warnings
warnings.filterwarnings('ignore')

The test dataset excludes the target variable: 'Survived'

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [3]:
train_df.info()
print('-'*40)
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null

In [4]:
train_df.describe(include='all')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891,891,714.0,891.0,891.0,891,891.0,204,889
unique,,,,891,2,,,,681,,147,3
top,,,,"Hippach, Miss. Jean Gertrude",male,,,,CA. 2343,,B96 B98,S
freq,,,,1,577,,,,7,,4,644
mean,446.0,0.383838,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,257.353842,0.486592,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,0.0,1.0,,,0.42,0.0,0.0,,0.0,,
25%,223.5,0.0,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,446.0,0.0,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,668.5,1.0,3.0,,,38.0,1.0,0.0,,31.0,,
max,891.0,1.0,3.0,,,80.0,8.0,6.0,,512.3292,,


#### Describe function tips:

Features with dtype 'object' (strings) are categorical, so the statistics defined for continuous data do not apply to them: mean, standard deviation, minimum, quartiles and maximum.

Conversely, unique, top and freq are reported for categorical features only, not for continuous ones.

The count statistic is reported for every feature and is a quick way to spot null values: any count below the number of rows (891 in the training set) means that values are missing.
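As a rough illustration of reading null values off the count, the sketch below uses a small made-up frame (`toy` and its values are assumptions, not rows from the Titanic data) and subtracts the per-column count from the number of rows.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the values are invented for illustration
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

# count() gives the non-null entries per column, so rows minus count = nulls
missing = len(toy) - toy.count()

# Share of missing values per column
missing_share = missing / len(toy)
print(missing_share)
```

The same numbers come out of `toy.isnull().sum()`, which is the form used in the cleaning section.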

To get separate describe tables for continuous and categorical features:

In [5]:
# continuous (numeric) features only
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [6]:
#categorical features only
train_df.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Hippach, Miss. Jean Gertrude",male,CA. 2343,B96 B98,S
freq,1,577,7,4,644


#### Getting to know your features:

1. PassengerId is a unique identifier with no predictive value, so it can be excluded from the analysis
2. Survived is the dependent (target) variable
3. Pclass is an ordinal datatype that can be used as a proxy for socio-economic status (SES)
4. Name is a nominal datatype that may be used in feature engineering
5. Sex is a nominal datatype that will be converted into dummy variables
6. Age is a continuous quantitative variable
7. SibSp represents the number of siblings or spouses on board
8. Parch represents the number of parents or children on board
9. Ticket is treated as an arbitrary identifier and is excluded from the analysis
10. Fare is a continuous variable that may indicate the section of the ship the passenger was situated in
11. Cabin is a nominal datatype that might have been useful but contains a high number of null values
12. Embarked is a nominal datatype with two null values that need to be dealt with
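
The list mentions converting nominal features such as Sex into dummy variables. One possible way to do that is `pd.get_dummies`; the sketch below runs on a made-up frame (`toy` is an assumption) and is not necessarily the encoding used later in this notebook.

```python
import pandas as pd

# Toy frame with the same column names as the Titanic data; values are invented
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category; drop_first=True removes one redundant
# level per feature (female for Sex, C for Embarked)
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Pclass is left as an integer here because it is ordinal; whether to one-hot encode it as well is a modelling choice.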

### Cleaning

#### Null Values

In [7]:
print('Train columns with null values:\n',train_df.isnull().sum())
print('-'*40)
print('Test columns with null values:\n',test_df.isnull().sum())

Train columns with null values:
 PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
----------------------------------------
Test columns with null values:
 PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64


To apply the same cleaning steps to the train and test dataframes, both are placed in a list that each step loops over.

In [8]:
# collect both dataframes in a list (references, not copies) so changes made in the loops apply to train_df and test_df
entire_dataset = [train_df,test_df]

Replacing Null Values:

In [9]:
for dataset in entire_dataset:
    # Age: fill missing values with the median age of that dataframe
    dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())

    # Embarked: fill missing values with the mode (most frequent port) of that dataframe
    dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])

    # Fare: fill missing values with the median fare of that dataframe
    dataset['Fare'] = dataset['Fare'].fillna(dataset['Fare'].median())
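
The loop above fills each dataframe with its own median or mode. A common alternative, shown here only as a sketch on made-up data (`train_toy`, `test_toy` and `fill_values` are hypothetical names), is to compute the fill values on the training set and reuse them for the test set, so that no test-set statistics enter the preprocessing.

```python
import numpy as np
import pandas as pd

# Invented stand-ins for train_df and test_df
train_toy = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Embarked": ["S", "S", None, "C"]})
test_toy = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Embarked": ["Q", None]})

# Compute the fill values from the training split only
fill_values = {
    "Age": train_toy["Age"].median(),
    "Embarked": train_toy["Embarked"].mode()[0],
}

# DataFrame.fillna accepts a dict mapping column name to fill value
train_filled = train_toy.fillna(fill_values)
test_filled = test_toy.fillna(fill_values)
print(test_filled)
```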

    

Deleting columns unused in the analysis:

In [10]:
drop_columns = ['PassengerId','Ticket','Cabin']

for dataset in entire_dataset:
    # inplace=True modifies the dataframe directly and returns None, so the result is not reassigned
    dataset.drop(columns=drop_columns, inplace=True)
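
A short sketch of how `drop` behaves with and without `inplace=True`, on a made-up frame (`toy` is an assumption). It shows why the in-place form is what lets the loop change `train_df` and `test_df` through the list.

```python
import pandas as pd

# Invented frame with a few of the Titanic column names
toy = pd.DataFrame({"PassengerId": [1, 2],
                    "Ticket": ["A/5", "PC 1"],
                    "Age": [22.0, 38.0]})

# With inplace=True the frame itself is modified and the call returns None
result = toy.drop(columns=["PassengerId"], inplace=True)
print(result)
print(toy.columns.tolist())

# Without inplace, drop returns a new frame and leaves the original untouched
slimmer = toy.drop(columns=["Ticket"])
print(slimmer.columns.tolist())
```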

### Feature Engineering

FamilySize feature = SibSp (siblings or spouses) + Parch (parents or children). The passenger is not counted, so a value of 0 means the passenger travelled alone.

In [11]:
for dataset in entire_dataset:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch']

IsAlone feature: numpy's where function assigns 1 where FamilySize is 0 and 0 otherwise.

In [12]:
for dataset in entire_dataset:
    dataset['IsAlone'] = np.where(dataset['FamilySize']==0,1,0)
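
To make the two features concrete, here is a minimal sketch on three invented passengers (`toy` is an assumption). It also shows a boolean-cast form that should give the same result as `np.where`.

```python
import numpy as np
import pandas as pd

# Three made-up passengers: one with a spouse, one alone, one in a large family
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Relatives on board, not counting the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# 1 where the passenger has no relatives on board, 0 otherwise
toy["IsAlone"] = np.where(toy["FamilySize"] == 0, 1, 0)

# Equivalent form: cast the boolean mask to int
is_alone_alt = (toy["FamilySize"] == 0).astype(int)
print(toy)
```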

Title feature: extract the title from the Name column, then replace titles that appear fewer than 10 times with 'Misc'.

In [13]:
for dataset in entire_dataset:
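    # Keep the text between the comma and the first period; the result retains a leading space (e.g. ' Mr')
    dataset['Title'] = dataset['Name'].str.split(',',expand=True)[1].str.split('.',expand=True)[0]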
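
The split-based extraction keeps the space that follows the comma, so the titles come out as ' Mr', ' Miss' and so on. As an alternative sketch (not the method used in this notebook), a single regular expression with `str.extract` returns the title without surrounding whitespace. The sample names follow the Name column's "Surname, Title. Given names" format. Adopting this form would also mean dropping the leading spaces from any later comparison strings.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Split-based extraction as in the cell above: keeps a leading space
split_titles = names.str.split(',', expand=True)[1].str.split('.', expand=True)[0]

# Regex alternative: skip the comma and any whitespace, capture up to the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)

print(split_titles.tolist())
print(titles.tolist())
```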

In [14]:
train_df['Title'].value_counts()

 Mr              517
 Miss            182
 Mrs             125
 Master           40
 Dr                7
 Rev               6
 Mlle              2
 Col               2
 Major             2
 Sir               1
 Jonkheer          1
 Lady              1
 Don               1
 the Countess      1
 Ms                1
 Mme               1
 Capt              1
Name: Title, dtype: int64

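Titles that occur fewer than ten times in the training data are replaced with 'Misc' in both the train and test dataframes. The comparison strings keep the leading space left by the extraction step.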

In [15]:
for dataset in entire_dataset:
    dataset['Title'] = dataset['Title'].apply(lambda x: x if x in [' Mr', ' Miss', ' Mrs', ' Master'] else 'Misc')

In [16]:
train_df['Title'].value_counts()

 Mr        517
 Miss      182
 Mrs       125
 Master     40
Misc        27
Name: Title, dtype: int64
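
The four titles kept above were read off the value counts by hand. A sketch of deriving them from a threshold instead, using a made-up Series and a hypothetical `min_count` parameter:

```python
import pandas as pd

# Invented title column: two common titles and two rare ones
titles = pd.Series(["Mr"] * 5 + ["Miss"] * 3 + ["Dr", "Rev"])

min_count = 3  # hypothetical threshold; the notebook uses 10 on the real data

# Titles that occur at least min_count times
counts = titles.value_counts()
common = counts[counts >= min_count].index

# Keep common titles, replace everything else with 'Misc'
grouped = titles.where(titles.isin(common), "Misc")
print(grouped.value_counts())
```

With separate train and test dataframes, `common` would presumably be computed once on the training titles and reused for the test set, so that both end up with the same categories.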