Titanic survival prediction using logistic regression

In [77]:
# Let's first import the required libraries
import pandas as pd
from sklearn import preprocessing


Load data from CSV file

In [78]:
df = pd.read_csv(r"C:\Users\Anne Marie\Documents\train.csv")
test =  pd.read_csv(r"C:\Users\Anne Marie\Documents\test.csv")

In [79]:
print(df.shape)
print(test.shape)

(891, 12)
(418, 11)


In [80]:
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [81]:
test.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Let's see if there are any null values in our dataset.

In [82]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [83]:
test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

Data pre-processing and selection

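Let's select some features for modeling and exclude any training rows with a null value in Embarked. We keep only Pclass, Sex, and Embarked (plus the Survived target for the training set), so the missing values in Age, Cabin, and Fare do not affect the selected columns. The categorical columns are converted to numerical form further below.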

In [121]:
df = df.loc[df.Embarked.notna(),['Survived', 'Pclass', 'Sex', 'Embarked' ]]
test = test.loc[:,['Pclass', 'Sex', 'Embarked']]

In [85]:
print(df.shape)
print(test.shape)

(889, 4)
(418, 3)


In [86]:
df.head()

   Survived  Pclass     Sex Embarked
0         0       3    male        S
1         1       1  female        C
2         1       3  female        S
3         1       1  female        S
4         0       3    male        S


In [87]:
test.head()

   Pclass     Sex Embarked
0       3    male        Q
1       3  female        S
2       2    male        Q
3       3    male        S
4       3  female        S


In [88]:
x = df.loc[:, ['Pclass']]
y = df.Survived

In [89]:
x.shape

(889, 1)

The Sex and Embarked features are categorical, so they will be converted to numerical form later to make them usable by our algorithm. For now, let's confirm the shape of the target.

In [90]:
y.shape

(889,)

Let's use logistic regression, because it is well suited to binary classification.

In [91]:
from sklearn.linear_model import LogisticRegression

In [92]:
logreg = LogisticRegression(solver='lbfgs')

Let's evaluate our model. Here we cross-validate our logreg model using a single feature, Pclass, with 5-fold cross-validation. The output is the mean accuracy across the 5 folds.

In [93]:
from sklearn.model_selection import cross_val_score

In [99]:
cross_val_score(logreg, x, y, cv=5, scoring='accuracy').mean()

0.6783406335301212
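
To make the 5-fold procedure concrete, here is a minimal sketch of what cross_val_score does for a classifier when cv is an integer: split the rows into 5 stratified folds, fit on 4 of them, score on the held-out fold, and average the 5 scores. It runs on a small synthetic dataset (the names x_demo and y_demo are hypothetical stand-ins, not the Titanic data), so its numbers are not comparable to the result above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration, not the Titanic set):
# one ordinal feature with values 1-3 and a binary target loosely tied to it.
rng = np.random.RandomState(0)
x_demo = rng.randint(1, 4, size=(200, 1))
y_demo = (rng.rand(200) < 0.7 - 0.15 * x_demo[:, 0]).astype(int)

model = LogisticRegression(solver='lbfgs')

# For an integer cv and a classifier, cross_val_score uses
# unshuffled stratified folds, which we reproduce here by hand.
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in folds.split(x_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(x_demo[train_idx], y_demo[train_idx])
    # score() on a classifier returns accuracy on the held-out fold
    manual_scores.append(fold_model.score(x_demo[val_idx], y_demo[val_idx]))

manual_mean = float(np.mean(manual_scores))
library_mean = float(
    cross_val_score(model, x_demo, y_demo, cv=5, scoring='accuracy').mean())
print(manual_mean, library_mean)
```

The two printed values should agree, which shows that the single number reported above is an average over five separate fit-and-score rounds.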

Let's check how this compares to the null accuracy, that is, the accuracy we would get by always predicting the most frequent class. This step is optional.

In [100]:
y.value_counts(normalize=True)

0    0.617548
1    0.382452
Name: Survived, dtype: float64
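
The null accuracy is simply the largest class proportion, about 0.62 here, so the Pclass-only model at about 0.68 is a modest improvement over it. As a rough sketch, the same baseline can be obtained from scikit-learn's DummyClassifier. The data below is a hypothetical target with a similar imbalance, not the real Survived column.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical target with a similar class imbalance (not the real data).
y_demo = pd.Series([0] * 62 + [1] * 38, name='Survived')
x_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Null accuracy straight from the class proportions.
null_accuracy = float(y_demo.value_counts(normalize=True).max())

# The same number from a baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(x_demo, y_demo)
baseline_accuracy = float(baseline.score(x_demo, y_demo))

print(null_accuracy, baseline_accuracy)
```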

Let's convert our categorical features into numerical values.

In [101]:
from sklearn.preprocessing import OneHotEncoder
# Note: scikit-learn 1.2 renamed this parameter to sparse_output, and 1.4 removed sparse.
ohe = OneHotEncoder(sparse=False)

In [102]:
x = df.drop('Survived', axis='columns')

In [103]:
x.head()

   Pclass     Sex Embarked
0       3    male        S
1       1  female        C
2       3  female        S
3       1  female        S
4       3    male        S


Let's specify which columns to encode using a column transformer. In this case, we one-hot encode the Sex and Embarked columns while passing the Pclass column through unchanged.

In [104]:
from sklearn.compose import make_column_transformer

In [105]:
column_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough')

In [106]:
column_trans.fit_transform(x)

array([[0., 1., 0., 0., 1., 3.],
       [1., 0., 1., 0., 0., 1.],
       [1., 0., 0., 0., 1., 3.],
       ...,
       [1., 0., 0., 0., 1., 3.],
       [0., 1., 1., 0., 0., 1.],
       [0., 1., 0., 1., 0., 3.]])
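
The six columns of this array are, in order, the one-hot columns for Sex, the one-hot columns for Embarked, and finally the passed-through Pclass. The sketch below checks that ordering on a tiny hypothetical frame (x_demo and demo_trans are illustrative names). It assumes the default step name 'onehotencoder' that make_column_transformer assigns, and it sets sparse_threshold=0 only to force a dense array for printing.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Tiny hypothetical frame with the same three columns as x.
x_demo = pd.DataFrame({
    'Pclass': [3, 1, 2],
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

demo_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough',
    sparse_threshold=0)  # always return a dense array

encoded = demo_trans.fit_transform(x_demo)

# Categories learned for each encoded column, in sorted order.
encoder = demo_trans.named_transformers_['onehotencoder']
categories = [c.tolist() for c in encoder.categories_]
first_row = encoded[0].tolist()

print(categories)  # [['female', 'male'], ['C', 'Q', 'S']]
print(first_row)   # male, embarked at S, Pclass 3
```

Reading the first row against the categories: the 1 in the second position marks male, the 1 in the fifth position marks S, and the last value is Pclass.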

In [107]:
test.head()

   Pclass     Sex Embarked
0       3    male        Q
1       3  female        S
2       2    male        Q
3       3    male        S
4       3  female        S


In [108]:
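x_test = test
# Reuse the encoder already fitted on the training data; refitting on the
# test set could learn a different set of categories.
column_trans.transform(x_test)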

array([[0., 1., 0., 1., 0., 3.],
       [1., 0., 0., 0., 1., 3.],
       [0., 1., 0., 1., 0., 2.],
       ...,
       [0., 1., 0., 0., 1., 3.],
       [0., 1., 0., 0., 1., 3.],
       [0., 1., 1., 0., 0., 3.]])
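
A transformer fitted on the training data should generally be reused with transform on the test data. The Titanic test set happens to contain the same Sex and Embarked categories as the training set, so the column layout matches either way, but that is not guaranteed in general. The sketch below uses hypothetical frames (train_demo and test_demo) in which the test frame has no 'Q' passengers, to illustrate how refitting can silently change the number of columns.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frames: the "test" frame happens to contain no 'Q' passengers.
train_demo = pd.DataFrame({
    'Pclass': [3, 1, 2, 3],
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})
test_demo = pd.DataFrame({
    'Pclass': [2, 3],
    'Sex': ['male', 'female'],
    'Embarked': ['S', 'C'],
})

demo_trans = make_column_transformer(
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder='passthrough',
    sparse_threshold=0)  # dense output, for easy inspection

train_encoded = demo_trans.fit_transform(train_demo)  # learns C, Q, S
test_ok = demo_trans.transform(test_demo)             # reuses C, Q, S

# Refitting an unfitted copy on the test frame learns only C and S,
# so one Embarked column goes missing.
test_refit = clone(demo_trans).fit_transform(test_demo)

print(train_encoded.shape, test_ok.shape, test_refit.shape)
```

A model trained on six columns cannot sensibly consume the five-column array, which is one reason to wrap the transformer and the model in a single pipeline.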

In [109]:
from sklearn.pipeline import make_pipeline

In [110]:
# A pipeline chains steps together. In this case, our pipeline first transforms
# the specified columns and then builds our logreg model.
pipe = make_pipeline(column_trans, logreg)
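
As a rough illustration of what the chaining buys us, the sketch below fits the same two steps twice on a small hypothetical dataset: once by hand (encode, then fit) and once through a pipeline. All names here (x_demo, y_demo, build_transformer, demo_pipe) are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows (not the real Titanic data).
x_demo = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 2, 1, 3],
    'Sex': ['male', 'female', 'female', 'female',
            'male', 'male', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'C', 'Q'],
})
y_demo = pd.Series([0, 1, 1, 1, 0, 0, 1, 0])


def build_transformer():
    """Return a fresh, unfitted column transformer."""
    return make_column_transformer(
        (OneHotEncoder(), ['Sex', 'Embarked']),
        remainder='passthrough',
        sparse_threshold=0)


# Two explicit steps: encode, then fit the classifier on the encoded array.
manual_trans = build_transformer()
manual_model = LogisticRegression(solver='lbfgs')
manual_model.fit(manual_trans.fit_transform(x_demo), y_demo)
manual_pred = manual_model.predict(manual_trans.transform(x_demo))

# The same two steps chained in a pipeline: one fit call, one predict call.
demo_pipe = make_pipeline(build_transformer(),
                          LogisticRegression(solver='lbfgs'))
demo_pipe.fit(x_demo, y_demo)
pipe_pred = demo_pipe.predict(x_demo)

same_predictions = bool(np.array_equal(manual_pred, pipe_pred))
print(same_predictions)
```

The predictions match, but the pipeline version takes the raw frame directly. When a pipeline is passed to cross_val_score, the encoder is also refitted on each fold's training portion rather than on all rows.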

In [111]:
cross_val_score(pipe, x, y, cv=4, scoring= 'accuracy').mean()

0.7739162929745889

In [112]:
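# With Sex and Embarked added, our mean accuracy has improved from about 0.68
# to about 0.77 (note that this run uses 4 folds rather than 5).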

Let's fit the pipeline on the full training data and use it to predict on a sample of the test data set.

In [113]:
x_test = x_test.sample(5, random_state=99)
x_test

     Pclass   Sex Embarked
50        1  male        S
288       3  male        C
7         2  male        S
61        2  male        S
260       3  male        S


In [114]:
pipe.fit(x, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['Sex', 'Embarked'])],
                                   verbose=False)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                         

In [115]:
pipe.predict(x_test)

array([0, 0, 0, 0, 0], dtype=int64)
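
All five sampled passengers are male, and the model predicts 0 for each of them. To see how confident a fitted pipeline is rather than only its hard labels, predict_proba can be used. The sketch below is a minimal example on hypothetical rows (demo_pipe, x_demo, and new_rows are illustrative names), not a rerun of the model above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows (not the real Titanic data).
x_demo = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 2, 1, 3],
    'Sex': ['male', 'female', 'female', 'female',
            'male', 'male', 'female', 'male'],
    'Embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'C', 'Q'],
})
y_demo = pd.Series([0, 1, 1, 1, 0, 0, 1, 0])

demo_pipe = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(), ['Sex', 'Embarked']),
        remainder='passthrough'),
    LogisticRegression(solver='lbfgs'))
demo_pipe.fit(x_demo, y_demo)

# Hypothetical unseen passengers.
new_rows = pd.DataFrame({
    'Pclass': [1, 3],
    'Sex': ['male', 'female'],
    'Embarked': ['S', 'C'],
})

# One row per passenger, one column per class, in the order given by classes_.
class_order = demo_pipe.named_steps['logisticregression'].classes_.tolist()
probabilities = demo_pipe.predict_proba(new_rows)
labels = demo_pipe.predict(new_rows)

print(class_order)
print(probabilities.round(3))
print(labels)
```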