
## Titanic Dataset

https://www.kaggle.com/c/titanic


The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.


Variable Notes:

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.


The attributes have the following meaning:

Survived: the target; 0 means the passenger did not survive, 1 means the passenger survived.
Pclass: passenger class.
Name, Sex, Age: self-explanatory
SibSp: how many siblings & spouses of the passenger aboard the Titanic.
Parch: how many children & parents of the passenger aboard the Titanic.
Ticket: ticket id
Fare: price paid (in pounds)
Cabin: passenger's cabin number
Embarked: where the passenger embarked the Titanic
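
As a quick illustration of the encodings described above, the sketch below builds a tiny hand-made frame and maps the `Pclass` codes to the socio-economic labels from the variable notes. The rows are invented, not taken from the dataset, and the `ses_labels` dictionary and `SES` column are names introduced here for illustration only.

```python
import pandas as pd

# toy rows, invented for illustration only (not real passengers)
toy = pd.DataFrame({
    'Pclass': [1, 3, 2],
    'Age': [0.75, 28.5, 40.0],   # 0.75: infant under 1; 28.5: estimated age (xx.5)
    'SibSp': [0, 1, 0],          # siblings / spouses aboard
    'Parch': [2, 0, 0],          # parents / children aboard
})

# Pclass is a proxy for socio-economic status (labels from the variable notes)
ses_labels = {1: 'Upper', 2: 'Middle', 3: 'Lower'}
toy['SES'] = toy['Pclass'].map(ses_labels)

print(toy)
```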

In [12]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import mean_squared_error

from sklearn.base import BaseEstimator, TransformerMixin 

from sklearn.pipeline import Pipeline

from sklearn.impute import SimpleImputer

from google.colab import files
import io

In [2]:
# google colab uploader
uploaded = files.upload()

Saving train.csv to train.csv
Saving test.csv to test.csv
Saving gender_submission.csv to gender_submission.csv


In [3]:
# uploading training data
train_data = pd.read_csv(io.StringIO(uploaded['train.csv'].decode('utf-8')))

In [4]:
# uploading testing data
test_data = pd.read_csv(io.StringIO(uploaded['test.csv'].decode('utf-8')))
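
`files.upload()` returns a dictionary that maps each filename to the raw bytes of the file, which is why the cells above decode the bytes and wrap them in `io.StringIO`. A minimal sketch of the same pattern outside Colab, where the `uploaded` dictionary is faked with a few made-up bytes (an assumption for illustration, not the real file):

```python
import io
import pandas as pd

# stand-in for the dict returned by files.upload(): filename -> raw bytes
# (contents are made up for illustration)
uploaded = {'train.csv': b'PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n'}

# same pattern as the cells above: decode to text, then wrap in StringIO
df_text = pd.read_csv(io.StringIO(uploaded['train.csv'].decode('utf-8')))

# read_csv also accepts a binary buffer, so BytesIO skips the explicit decode
df_bytes = pd.read_csv(io.BytesIO(uploaded['train.csv']))

print(df_text.equals(df_bytes))
```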

## Lil EDA


In [5]:
# checking training head
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
# checking train shape
train_data.shape

(891, 12)

In [7]:
# checking train columns
train_data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [8]:
# describing training data
train_data.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [9]:
# checking training data info
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Target Col

In [15]:
# checking target col
train_data['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
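
The raw counts are easier to interpret as proportions. A minimal sketch, rebuilding a `Survived` series from the counts printed above rather than from `train_data`, so that it runs on its own:

```python
import pandas as pd

# rebuild a Survived column from the counts shown above: 549 zeros, 342 ones
survived = pd.Series([0] * 549 + [1] * 342, name='Survived')

# normalize=True returns the share of each class instead of the raw count
rates = survived.value_counts(normalize=True)
print(rates.round(3))
```

The share of survivors, about 0.384, matches the mean of `Survived` in the `describe()` output, so a model that always predicts 0 would already reach roughly 61.6% accuracy on the training data.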

Numerical Cols

In [53]:
# checking Parch col
train_data['Parch'].value_counts()

0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64

In [52]:
# checking SibSp col
train_data['SibSp'].value_counts()

0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

In [51]:
# checking Fare col
train_data['Fare'].value_counts()

8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
           ..
8.4583      1
9.8375      1
8.3625      1
14.1083     1
17.4000     1
Name: Fare, Length: 248, dtype: int64

In [48]:
# checking Pclass col
train_data['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [49]:
# checking Sex col
train_data['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [50]:
# checking embarked col
# S = Southampton, C = Cherbourg, Q = Queenstown
train_data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [10]:
# checking null training values
# helper that prints every column containing NaN values
def nan_val(data):
  for col in data:
    if data[col].isnull().sum() != 0:
      print('TRAINING DATA {} column has {} missing data points'.format(col, data[col].isnull().sum()))
  print('\n')

nan_val(train_data)

TRAINING DATA Age column has 177 missing data points
TRAINING DATA Cabin column has 687 missing data points
TRAINING DATA Embarked column has 2 missing data points




In [11]:
# checking null testing values
# helper that prints every column containing NaN values
def nan_val(data):
  for col in data:
    if data[col].isnull().sum() != 0:
      print('TESTING DATA {} column has {} missing data points'.format(col, data[col].isnull().sum()))
  print('\n')

nan_val(test_data)

TESTING DATA Age column has 86 missing data points
TESTING DATA Fare column has 1 missing data points
TESTING DATA Cabin column has 327 missing data points
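
The same per-column check can be written without an explicit loop. A minimal sketch on a toy frame (the values are invented for illustration), using `isnull().sum()` and a boolean filter:

```python
import numpy as np
import pandas as pd

# toy frame with gaps, invented for illustration
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
    'Cabin': [np.nan, 'C85', np.nan, 'C123'],
})

# count missing values per column, then keep only columns that have any
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

Because the result is a Series indexed by column name, it can be reused later, for example to decide which columns need an imputer.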




## Preprocessing Pipeline(s)

In [55]:
# creating BaseEstimator, TransformerMixin to ease pipeline integration
class DataFrameSelector(BaseEstimator, TransformerMixin):
  def __init__(self, attribute_names):
    self.attribute_names = attribute_names
  def fit(self, X, y = None):
    return self
  def transform(self, X):
    return X[self.attribute_names]
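
A short usage sketch for the selector, on a toy frame invented for illustration. The class is repeated so the block runs on its own; inheriting from `TransformerMixin` is what provides `fit_transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# same selector as defined above, repeated so this sketch runs on its own
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

# toy frame, invented for illustration
toy = pd.DataFrame({
    'Age': [22.0, 38.0],
    'Sex': ['male', 'female'],
    'Fare': [7.25, 71.28],
})

# TransformerMixin supplies fit_transform, which calls fit and then transform
selected = DataFrameSelector(['Age', 'Fare']).fit_transform(toy)
print(selected.columns.tolist())
```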

In [60]:
# create preprocessing pipeline for numerical attributes
num_pipeline = Pipeline([
                         ('select_numeric', DataFrameSelector(['Age', 'SibSp', 'Parch', 'Fare'])),
                         ('imputer', SimpleImputer(strategy='median')),
  ])
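
To see what the median imputation step does, here is a minimal sketch on a toy frame (values invented for illustration). It uses only the imputer step so that it stays self-contained; `pipe` and `filled` are names introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# toy numeric columns with one gap each, invented for illustration
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 38.0],
    'Fare': [7.25, 71.28, np.nan, 53.10],
})

pipe = Pipeline([('imputer', SimpleImputer(strategy='median'))])

# fit learns one median per column; transform fills the gaps with it
filled = pipe.fit_transform(toy)
print(filled)

# the learned medians are stored on the fitted imputer
print(pipe.named_steps['imputer'].statistics_)
```

The result is a NumPy array rather than a DataFrame. Since the medians are learned in `fit`, calling `pipe.transform` on the test set fills its gaps with the training medians.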

In [None]:
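# create preprocessing pipeline for categorical attributes
# CONFIGURE
# imputer: object columns get their most frequent value, other columns their mean
# (the class name DataFrameImputer is a placeholder; the original fragment had no class header)
class DataFrameImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)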
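
One possible way to finish the categorical side, sketched with scikit-learn's built-in transformers instead of the hand-written imputer above: `SimpleImputer(strategy='most_frequent')` followed by `OneHotEncoder`. This is an alternative under stated assumptions, not the notebook's own method; `cat_pipeline_sketch` and the toy frame are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy categorical columns, invented for illustration
# dtype=object keeps np.nan as the missing-value marker
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', np.nan, 'S'],
}, dtype=object)

cat_pipeline_sketch = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),  # NaN -> most common value
    ('encoder', OneHotEncoder(handle_unknown='ignore')),   # one 0/1 column per category
])

# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense
encoded = cat_pipeline_sketch.fit_transform(toy).toarray()
print(encoded.shape)
```

The missing `Embarked` value is filled with `'S'`, the most frequent port in the toy data, before encoding.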


## OLD CODE BELOW

In [None]:
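# Age has 177 missing values in training data and 86 in testing data;
# look for correlated features to guide the imputation
# numeric_only=True skips the text columns (needs pandas >= 1.5)
corr_matrix = train_data.corr(numeric_only=True).abs()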
corr_matrix_ = corr_matrix.unstack()

# sorting correlation matrix
corr_matrix_sort = corr_matrix_.sort_values(kind='quicksort', ascending=False).reset_index()

# creating new descriptive columns from sorted correlation
corr_matrix_resort = corr_matrix_sort.rename(columns={'level_0':'feature 1', 'level_1':'feature 2', 0:'corr'})

# which feature is most correlated to age?
print(corr_matrix_resort[corr_matrix_resort['feature 1'] == 'Age'])
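
The same question, which feature is most correlated with `Age`, can also be answered by slicing one column of the correlation matrix. A minimal sketch on a toy numeric frame (values invented for illustration):

```python
import pandas as pd

# toy numeric frame, invented for illustration
toy = pd.DataFrame({
    'Age':    [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    'Pclass': [3, 1, 3, 1, 1, 3],
    'SibSp':  [1, 1, 0, 1, 0, 3],
    'Fare':   [7.25, 71.28, 7.93, 53.10, 51.86, 21.08],
})

# take the Age column of the correlation matrix, drop the trivial
# self-correlation, and rank the rest by absolute strength
age_corr = toy.corr()['Age'].drop('Age').abs().sort_values(ascending=False)
print(age_corr)
```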



In [None]:
# heatmap of feature correlations
plt.figure(figsize = (8,6))
sns.heatmap(corr_matrix, annot=True, cbar=True, linewidths=0.3, linecolor='black')
plt.title('Feature correlation', fontsize=15)
plt.xlabel('Feature 1', labelpad=-18)
plt.xticks(rotation=45, fontsize=10)
plt.ylabel('Feature 2', labelpad=-5)
plt.yticks(rotation=45, fontsize=10)
plt.show()

In [None]:
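# median age per passenger class (candidate fill values for missing Age)
# all_data was never defined; assumed here to be train and test stacked together
all_data = pd.concat([train_data, test_data], sort=False).reset_index(drop=True)
age_by_pclass = all_data.groupby('Pclass')['Age'].median()

for pclass in range(1, 4):
    print('Median age of Pclass {} is: {}'.format(pclass, age_by_pclass[pclass]))


# TODO: handle 687 missing Cabin values in training data and 327 in testing data
# TODO: handle 2 missing Embarked values in training data and 1 missing Fare value in testing data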
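
Once the per-class medians are known, one way to use them is to fill each missing `Age` with the median of that passenger's class. A minimal sketch on toy data (invented for illustration), using `groupby(...).transform('median')`; the `Age_filled` column is a name introduced here.

```python
import numpy as np
import pandas as pd

# toy frame with one missing Age per class, invented for illustration
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [38.0, np.nan, 54.0, 22.0, 26.0, np.nan],
})

# median Age within each class, broadcast back to the original row order
class_median = toy.groupby('Pclass')['Age'].transform('median')

# fill only the missing entries; known ages are left untouched
toy['Age_filled'] = toy['Age'].fillna(class_median)
print(toy)
```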

In [None]:
# women survival rate
women = train_data.loc[train_data.Sex == 'female']['Survived']
women_rate = sum(women)/len(women)
# print('The % of women that survived is:', women_rate*100)

# men survival rate
men = train_data.loc[train_data.Sex == 'male']['Survived']
men_rate = sum(men)/len(men)
# print('The % of men that survived is:', men_rate*100)



In [None]:
# class 1 survival rate
pclass1 = train_data.loc[train_data.Pclass == 1]['Survived']
rate_pclass1 = sum(pclass1)/len(pclass1)
# print('The % of First Class that survived is:', rate_pclass1*100)

# class 2 survival rate
pclass2 = train_data.loc[train_data.Pclass == 2]['Survived']
rate_pclass2 = sum(pclass2)/len(pclass2)
# print('The % of Second Class that survived is:', rate_pclass2*100)

# class 3 survival rate
pclass3 = train_data.loc[train_data.Pclass == 3]['Survived']
rate_pclass3 = sum(pclass3)/len(pclass3)
# print('The % of Third Class that survived is:', rate_pclass3*100)
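
Because `Survived` is coded 0/1, its mean within a group is that group's survival rate, so the per-sex and per-class rates above can each be computed in one `groupby` call. A minimal sketch on toy rows (invented for illustration, not real passengers):

```python
import pandas as pd

# toy rows, invented for illustration
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'Pclass': [3, 1, 3, 1, 3, 1],
    'Survived': [0, 1, 0, 1, 1, 0],
})

# mean of a 0/1 column per group = survival rate of that group
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
rate_by_class = toy.groupby('Pclass')['Survived'].mean()

print(rate_by_sex)
print(rate_by_class)
```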



In [None]:
# scikit-learn random forest
# y variable
y = train_data['Survived']


# variables looking into
features = ['Pclass', 'Sex', 'SibSp', 'Parch']

# indicator train variables
X = pd.get_dummies(train_data[features])

# indicator test variables
X_test = pd.get_dummies(test_data[features])

# model
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

# fitting the model
fit_model = model.fit(X, y)

# prediction
prediction = fit_model.predict(X_test)

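# Kaggle submissions need PassengerId next to the predicted Survived value
output_pred = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': prediction})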

from sklearn.model_selection import cross_val_score

# 10-fold cross-validation; for 0/1 labels the MSE equals the misclassification rate
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)

# scores are negated MSE values, hence the minus sign before taking the root
tree_rmse_scores = np.sqrt(-scores)

def display_scores(scores):
    print('Scores:', scores)
    print('Mean:', scores.mean())
    print('Standard deviation:', scores.std())


# display_scores(tree_rmse_scores)
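
Since the Titanic competition is scored on accuracy, it may be more natural to cross-validate with `scoring='accuracy'`. A minimal sketch on synthetic data, so that it runs without the CSV files; `X_demo`, `y_demo` and `model_demo` are stand-ins invented for this example, not the notebook's features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for the feature matrix and 0/1 target
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)

model_demo = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=1)

# one accuracy value per fold (fraction of correct predictions)
acc_scores = cross_val_score(model_demo, X_demo, y_demo, scoring='accuracy', cv=10)

print('Mean accuracy:', acc_scores.mean())
print('Standard deviation:', acc_scores.std())
```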