# Predict Survival on Titanic

The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage, sinking to the bottom of the Atlantic Ocean after hitting an iceberg, bringing down 1,502 of 2,224 passengers onboard.

In this project, a supervised learning technique-Logistic Regression model would be applied to predict which passengers survived the sinking of the Titanic, based on features like age and class.

In [655]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [656]:
passengers = pd.read_csv('train.csv')
passengers.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [657]:
passengers.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


What is the distribution of categorical features?

- Names are unique across the dataset (count=unique=891)
- Sex variable as two possible values with 65% male (top=male, freq=577/count=891).
- Cabin values have several dupicates across samples. Alternatively several passengers shared a cabin.
- Embarked takes three possible values. S port used by most passengers (top=S)
- Ticket feature has high ratio (22%) of duplicate values (unique=681).

In [658]:
passengers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Notice that the age, cabin and embarked columns have null values. Also we apparently have some free-loaders because the minimum fare is 0. We might think that these are babies, so let’s check that:

In [659]:
#passengers[['Age','Fare']][passengers['Fare']<5]
passengers.loc[passengers['Fare']<5,['Age','Fare']]

Unnamed: 0,Age,Fare
179,36.0,0.0
263,40.0,0.0
271,25.0,0.0
277,,0.0
302,19.0,0.0
378,20.0,4.0125
413,,0.0
466,,0.0
481,,0.0
597,49.0,0.0


In [660]:
passengers['Fare'] = passengers['Fare'].apply(lambda x: np.nan if x==0 else x)

In [661]:
passengers_mean_fare = passengers.groupby(['Pclass']).Fare.mean().reset_index()
passengers_mean_fare

Unnamed: 0,Pclass,Fare
0,1,86.148874
1,2,21.358661
2,3,13.787875


In [662]:
# passengers['Fare'] = passengers[['Fare','Pclass']].apply(lambda x: passengers_mean_fare[x['Fare']] if pd.isnull(x['Fare']) \
#                     else x['Fare'], axis= 1)
# def pclass(a):
#    #return passengers_mean_fare.loc[a,['Fare']]
#    return passengers_mean_fare.loc[passengers_mean_fare['Pclass'].isin(a),'Fare']


# passengers['Fare'] = passengers['Fare'].apply(lambda x: pclass(passengers['Pclass']) if pd.isnull(x) \
#                     else passengers['Fare']).reset_index()

In [663]:
# classmeans = passengers.pivot_table('Fare', index='Pclass', aggfunc='mean')
# classmeans

In [664]:
#passengers.Fare = passengers.apply(lambda x: classmeans[x['Pclass']] if pd.isnull(x['Fare']) else x['Fare'], axis=1 )

Given the saying, “women and children first,” Sex and Age seem like good features to predict survival. Let’s map the text values in the Sex column to a numerical value. 

In [665]:
passengers['Sex']=passengers['Sex'].map({'male':0,'female':1})
passengers.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S


 Fill all the empty Age values in passengers with the mean age.

In [666]:
#passengers['Age'].fillna(value=np.mean(passengers['Age']),inplace=True)
passengers['Age'].fillna(value=passengers['Age'].mean(),inplace=True)

Create a first class column

In [667]:
passengers['FirstClass']=passengers['Pclass'].apply(lambda x: 1 if x==1 else 0)

Create a second class column

In [668]:
passengers['SecondClass']=passengers['Pclass'].apply(lambda x: 1 if x==2 else 0)

Now that we have cleaned our data, let’s select the feature columns we want to build our model on. 

In [669]:
features=passengers[['Sex','Age','FirstClass','SecondClass']]
survival=passengers['Survived']

## Model Training and Evaluation

In [670]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, MaxAbsScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV

In [671]:
X_train,X_test,y_train,y_test=train_test_split(features, survival)

In [672]:
print(X_train.shape)
print(X_test.shape)

print(y_train.shape)
print(y_test.shape)

(668, 4)
(223, 4)
(668,)
(223,)


Scale the feature data so it has mean = 0 and standard deviation = 1

In [673]:
pipe_logistic = Pipeline([
    ('scaler',StandardScaler()),
    ('selector',VarianceThreshold()),
    ('logistic',LogisticRegression())
    ])

In [674]:
parameters = {'scaler':[StandardScaler(),MinMaxScaler(),Normalizer(),MaxAbsScaler()],
            'selector':[0,0.001,0.01]
            

}

grid = GridSearchCV(pipe_logistic,parameters,cv=2).fit(X_train,y_train)

24 fits failed out of a total of 24.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
8 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\chyij\OneDrive\Python_project\Biodiversity_project\env\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\chyij\OneDrive\Python_project\Biodiversity_project\env\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "c:\Users\chyij\OneDrive\Python_project\Biodiversity_project\env\lib\site-packages\sklearn\pipeline.py", line 316, in _fit
    self._validate_steps()
  File "c:\Users\chyij\OneDrive\Python_project\

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '0' (type <class 'int'>) doesn't

In [None]:
#pipe_logistic.fit(X_train,y_train)

Pipeline(steps=[('scaler', StandardScaler()),
                ('logistic', LogisticRegression())])

In [None]:
print(f'Training score:{grid.score(X_train,y_train)}')
print(f'Test score:{grid.score(X_test,y_test)}')

Training score:0.7949101796407185
Test score:0.8026905829596412


In [None]:
# scaler=StandardScaler()
# X_train=scaler.fit_transform(X_train)
# X_test=scaler.transform(X_test)
# len(X_train)

In [None]:
# model=LogisticRegression()
# model.fit(X_train,y_train)

In [None]:
# model.score(X_train,y_train)

In [None]:
# model.score(X_test, y_test)

In [None]:
print(list(zip(['Sex','Age','FirstClass','SecondClass'],model.coef_[0])))

[('Sex', 1.2375293868111386), ('Age', -0.3785990378633057), ('FirstClass', 0.8830828715306674), ('SecondClass', 0.4360781777943924)]


In [None]:
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
You = np.array([0.5,0.0,1.0,0.0])

sample_passengers=np.array([Jack,Rose,You])

In [None]:
sample_passengers=scaler.fit_transform(sample_passengers)
print(sample_passengers)

[[-1.22474487  0.87056284 -1.41421356  0.        ]
 [ 1.22474487  0.52990781  0.70710678  0.        ]
 [ 0.         -1.40047065  0.70710678  0.        ]]


In [None]:
model.predict(sample_passengers)

array([0, 1, 1], dtype=int64)

In [None]:
# Probability of surviving
model.predict_proba(sample_passengers)[:,1]


array([0.02173949, 0.77328312, 0.60876523])

In [None]:
test_passengers=pd.read_csv('test.csv')
test_passengers.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [None]:
test_passengers['Sex']=test_passengers['Sex'].map({'male':0,'female':1})
test_passengers['Age'].fillna(value=test_passengers['Age'].mean(),inplace=True)
test_passengers['FirstClass']=test_passengers['Pclass'].apply(lambda x: 1 if x==1 else 0)
test_passengers['SecondClass']=test_passengers['Pclass'].apply(lambda x: 1 if x==2 else 0)

In [None]:
test_dataset=test_passengers[['Sex','Age','FirstClass','SecondClass']]

In [None]:
test_dataset=scaler.fit_transform(test_dataset)
model.predict(test_dataset)

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,