<a href="https://colab.research.google.com/github/athayadhiya/data_science_portfolio/blob/main/Titanic_ML_LogisticRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Titanic: Survivor Prediction ⚓
---
In an earlier notebook, we did an [exploratory analysis](https://github.com/athayadhiya/data_science_portfolio/blob/main/Titanic_EDA.ipynb) that answered the question: “*what sorts of people were more likely to survive?*”

In this notebook, we'll build a classification model to predict passengers' survival.

## Training Data Preparation

In this section, we'll clean the data (feature selection, feature encoding and normalization) so that it is ready for the machine learning model.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
import numpy as np

df_train = pd.read_csv('/content/drive/MyDrive/Project/titanic_train.csv')
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
df_train = df_train.drop(['Name', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis = 1)
df_train.head()

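| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 |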


In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 48.9+ KB


In [32]:
df_train.duplicated().sum()

0

In [5]:
df_train['Sex'] = df_train['Sex'].map({"male" : 0, "female" : 1})
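
One thing to keep in mind with `map` and a dictionary: any label that is not a key in the dictionary becomes `NaN` rather than raising an error. The sketch below uses made-up values (not the Titanic file) with one deliberately inconsistent label to show how such cases could be counted after encoding.

```python
import pandas as pd

# Toy data, not the Titanic file; the last label is deliberately inconsistent.
sex = pd.Series(["male", "female", "Female"])

encoded = sex.map({"male": 0, "female": 1})

# Labels missing from the dictionary are turned into NaN, so count them.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist(), n_unmapped)
```

A count of zero after encoding is a quick confirmation that every row received a valid code.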

In [6]:
df_train['Age'] = df_train['Age'].fillna(df_train['Age'].mean())
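
Filling missing ages with the mean keeps the column's mean unchanged but reduces its spread, because every filled row sits exactly at the centre. Here 177 of the 891 ages (891 - 714) are filled this way. A small sketch with toy values (assumed for illustration) shows the effect:

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries (illustrative values only).
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan])

# Series.mean() skips NaN by default, so this is the mean of the known ages.
filled = age.fillna(age.mean())

mean_before, mean_after = age.mean(), filled.mean()
std_before, std_after = age.std(), filled.std()
print(mean_before, mean_after, std_before, std_after)
```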

In [7]:
df_train['Companion'] = df_train['SibSp'] + df_train['Parch']

In [11]:
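def companion(row):
  # Label a passenger by whether any sibling, spouse, parent or child travelled with them.
  if row['Companion'] >= 1:
    return 'not alone'
  else:
    return 'alone'

# Apply the labelling function to each row (axis = 1).
df_train['Companion Def'] = df_train.apply(companion, axis = 1)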

In [12]:
df_train['Companion Def'] = df_train['Companion Def'].map({"alone" : 0, "not alone" : 1})
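
The row-wise function and the follow-up mapping can also be written as a single vectorized comparison. This is an alternative sketch on a toy frame that reuses the notebook's column names, not the code that produced the outputs shown here:

```python
import pandas as pd

# Toy rows using the same column names as the notebook.
df = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# 1 = travelling with at least one relative, 0 = alone.
df["Companion Def"] = ((df["SibSp"] + df["Parch"]) >= 1).astype(int)
print(df)
```

The boolean comparison is evaluated for the whole column at once, which avoids both the intermediate text labels and the per-row Python function call.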

In [14]:
df_train = df_train.drop(['SibSp', 'Parch', 'Companion'], axis = 1)

In [15]:
df_train.head()

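| | PassengerId | Survived | Pclass | Sex | Age | Companion Def |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 0 | 22.0 | 1 |
| 1 | 2 | 1 | 1 | 1 | 38.0 | 1 |
| 2 | 3 | 1 | 3 | 1 | 26.0 | 0 |
| 3 | 4 | 1 | 1 | 1 | 35.0 | 1 |
| 4 | 5 | 0 | 3 | 0 | 35.0 | 0 |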


In [16]:
df_train.describe()

| | PassengerId | Survived | Pclass | Sex | Age | Companion Def |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 0.352413 | 29.699118 | 0.397306 |
| std | 257.353842 | 0.486592 | 0.836071 | 0.47799 | 13.002015 | 0.489615 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 0.42 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 0.0 | 22.0 | 0.0 |
| 50% | 446.0 | 0.0 | 3.0 | 0.0 | 29.699118 | 0.0 |
| 75% | 668.5 | 1.0 | 3.0 | 1.0 | 35.0 | 1.0 |
| max | 891.0 | 1.0 | 3.0 | 1.0 | 80.0 | 1.0 |


In [22]:
from sklearn.preprocessing import MinMaxScaler

df_train['Age'] = MinMaxScaler().fit_transform(df_train['Age'].values.reshape(len(df_train), 1))
df_train['Pclass'] = MinMaxScaler().fit_transform(df_train['Pclass'].values.reshape(len(df_train), 1))
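
With its default range of 0 to 1, `MinMaxScaler` applies `(x - min) / (max - min)` to each column. The sketch below reproduces that formula by hand with NumPy; the helper name `min_max` is made up for illustration, and the age inputs are just the minimum, lower quartile and maximum from the `describe()` output above.

```python
import numpy as np

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

# The three passenger classes map to 0, 0.5 and 1.
pclass_scaled = min_max([1, 2, 3])

# Minimum, 25% quartile and maximum age from the describe() output.
age_scaled = min_max([0.42, 22.0, 80.0])
print(pclass_scaled, age_scaled)
```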

## Train the Machine Learning Model

In this section, we'll split the prepared data into a training set (80%) and a held-out test set (20%), then fit the model on the training set.

We use logistic regression because the target is binary (0 for not survived, 1 for survived).
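
As a rough illustration of how logistic regression turns features into a binary label: it computes a weighted sum of the features, squashes it through the sigmoid function into a probability, and predicts class 1 when that probability exceeds 0.5. The weights, bias and feature values below are hypothetical and are not the coefficients learned in this notebook.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return the predicted probability of class 1 and the resulting 0/1 label."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p > threshold)

# Hypothetical coefficients for two features (for example Sex and scaled Pclass).
weights = [2.5, -1.0]
bias = -1.0

p, label = predict_label([1, 0.5], weights, bias)
print(round(p, 3), label)
```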

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

x = df_train.drop(['Survived'], axis = 1)
y = df_train['Survived']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 123)
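
With 891 rows, `test_size = 0.2` does not divide evenly, so scikit-learn rounds the test share up. The sketch below uses placeholder arrays of the same length (not the Titanic features) to show the resulting sizes; 179 test rows is consistent with the total of the confusion matrix reported later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the training file.
X = np.arange(891).reshape(-1, 1)
y = np.arange(891) % 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

print(len(X_train), len(X_test))  # 712 179
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.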

In [25]:
log_reg = LogisticRegression(solver = 'lbfgs', max_iter = 500)
log_reg.fit(x_train, y_train)
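
Note that `x` above drops only `Survived`, so `PassengerId` is still passed to the model as a feature even though it is just a row identifier. A possible variation, shown here on a toy frame and not the configuration that produced the results in this notebook, is to set the identifier aside before fitting:

```python
import pandas as pd

# Toy frame with the same columns as the prepared training data.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [1.0, 0.0, 1.0],
    "Sex": [0, 1, 1],
    "Age": [0.27, 0.47, 0.32],
    "Companion Def": [1, 1, 0],
})

# Keep the identifier for reporting, but exclude it from the features.
ids = df["PassengerId"]
x = df.drop(columns=["Survived", "PassengerId"])
y = df["Survived"]
print(list(x.columns))
```

Any accuracy obtained this way would have to be measured again, since the figures reported in this notebook come from a model that saw `PassengerId`.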

In [26]:
y_pred = log_reg.predict(x_test)

In [29]:
from sklearn.metrics import confusion_matrix, accuracy_score

confusion = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", confusion)
print("Accuracy: %.3f%%" % (accuracy * 100.0))

Confusion Matrix:
 [[96 18]
 [18 47]]
Accuracy: 79.888%


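Our model's accuracy on the held-out split is **79.9%** (143 of 179 passengers classified correctly).

For comparison, always predicting "not survived" would score 63.7% on this split (114 of 179), so the model performs clearly better than that baseline and is a reasonable starting point for further analysis.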
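
Accuracy alone hides how the errors are distributed between the two classes. The confusion matrix above (rows are true labels, columns are predicted labels) can be unpacked into precision and recall for the "survived" class, as in this sketch that starts from the printed counts:

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
confusion = np.array([[96, 18],
                      [18, 47]])

# For labels [0, 1] the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = confusion.ravel()

accuracy = (tp + tn) / confusion.sum()   # share of all correct predictions
precision = tp / (tp + fp)               # predicted survivors who did survive
recall = tp / (tp + fn)                  # actual survivors the model found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Here precision and recall are both 47/65, roughly 0.72, so the model is noticeably weaker on survivors than the overall accuracy suggests.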

## Test Data Preparation

In this section, we'll repeat the same preparation steps on the separate test file (`titanic_test.csv`, which has no `Survived` column) so that it is ready for the model.
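
Since the same steps are applied twice, one option is to wrap them in a single function and call it for both files. The sketch below is an assumption about how that could look: `prepare_features` and `age_fill` are hypothetical names, and the toy frame only mimics the real columns.

```python
import pandas as pd

DROPPED = ["Name", "Ticket", "Fare", "Cabin", "Embarked"]

def prepare_features(df, age_fill):
    """Apply the notebook's cleaning steps and return a new frame."""
    out = df.drop(columns=DROPPED, errors="ignore").copy()
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Age"] = out["Age"].fillna(age_fill)
    out["Companion Def"] = ((out["SibSp"] + out["Parch"]) >= 1).astype(int)
    return out.drop(columns=["SibSp", "Parch"])

# Toy rows with a subset of the raw columns (illustrative values only).
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [22.0, None],
    "SibSp": [1, 0],
    "Parch": [0, 0],
})

prepared = prepare_features(toy, age_fill=toy["Age"].mean())
print(prepared)
```

Passing the fill value in as an argument makes it explicit which data set the mean comes from, for instance the training mean when preparing the test file.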

In [31]:
df_test = pd.read_csv('/content/drive/MyDrive/Project/titanic_test.csv')
df_test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [33]:
df_test = df_test.drop(['Name', 'Ticket', 'Fare', 'Cabin', 'Embarked'], axis = 1)
df_test.head()

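| | PassengerId | Pclass | Sex | Age | SibSp | Parch |
|---|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34.5 | 0 | 0 |
| 1 | 893 | 3 | female | 47.0 | 1 | 0 |
| 2 | 894 | 2 | male | 62.0 | 0 | 0 |
| 3 | 895 | 3 | male | 27.0 | 0 | 0 |
| 4 | 896 | 3 | female | 22.0 | 1 | 1 |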


In [34]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Sex          418 non-null    object 
 3   Age          332 non-null    float64
 4   SibSp        418 non-null    int64  
 5   Parch        418 non-null    int64  
dtypes: float64(1), int64(4), object(1)
memory usage: 19.7+ KB


In [35]:
df_test.duplicated().sum()

0

In [37]:
df_test['Sex'] = df_test['Sex'].map({"male" : 0, "female" : 1})

In [38]:
df_test['Age'] = df_test['Age'].fillna(df_test['Age'].mean())

In [39]:
df_test['Companion'] = df_test['SibSp'] + df_test['Parch']

In [40]:
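def companion(row):
  # Label a passenger by whether any sibling, spouse, parent or child travelled with them.
  if row['Companion'] >= 1:
    return 'not alone'
  else:
    return 'alone'

# Apply the labelling function to each row (axis = 1).
df_test['Companion Def'] = df_test.apply(companion, axis = 1)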

In [41]:
df_test['Companion Def'] = df_test['Companion Def'].map({"alone" : 0, "not alone" : 1})

In [42]:
df_test = df_test.drop(['SibSp', 'Parch', 'Companion'], axis = 1)

In [43]:
df_test.head()

| | PassengerId | Pclass | Sex | Age | Companion Def |
|---|---|---|---|---|---|
| 0 | 892 | 3 | 0 | 34.5 | 0 |
| 1 | 893 | 3 | 1 | 47.0 | 1 |
| 2 | 894 | 2 | 0 | 62.0 | 0 |
| 3 | 895 | 3 | 0 | 27.0 | 0 |
| 4 | 896 | 3 | 1 | 22.0 | 1 |


In [44]:
df_test.describe()

| | PassengerId | Pclass | Sex | Age | Companion Def |
|---|---|---|---|---|---|
| count | 418.0 | 418.0 | 418.0 | 418.0 | 418.0 |
| mean | 1100.5 | 2.26555 | 0.363636 | 30.27259 | 0.394737 |
| std | 120.810458 | 0.841838 | 0.481622 | 12.634534 | 0.48938 |
| min | 892.0 | 1.0 | 0.0 | 0.17 | 0.0 |
| 25% | 996.25 | 1.0 | 0.0 | 23.0 | 0.0 |
| 50% | 1100.5 | 3.0 | 0.0 | 30.27259 | 0.0 |
| 75% | 1204.75 | 3.0 | 1.0 | 35.75 | 1.0 |
| max | 1309.0 | 3.0 | 1.0 | 76.0 | 1.0 |


In [45]:
df_test['Age'] = MinMaxScaler().fit_transform(df_test['Age'].values.reshape(len(df_test), 1))
df_test['Pclass'] = MinMaxScaler().fit_transform(df_test['Pclass'].values.reshape(len(df_test), 1))
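
Fitting a new `MinMaxScaler` on the test file rescales it with the test file's own minimum and maximum (0.17 and 76.0 for age) instead of the training ones (0.42 and 80.0), so the same age ends up with a slightly different scaled value in each data set. A common alternative is to fit the scaler once on the training data and reuse it with `transform`. The sketch below compares the two approaches using only ages taken from the `describe()` and `head()` outputs above; it is not the code that produced this notebook's predictions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Training minimum, lower quartile and maximum age from describe().
train_age = np.array([[0.42], [22.0], [80.0]])
# Test minimum, first passenger's age and maximum from the test outputs.
test_age = np.array([[0.17], [34.5], [76.0]])

# Option A: reuse the scaler fitted on the training data.
scaler = MinMaxScaler().fit(train_age)
test_scaled_shared = scaler.transform(test_age)

# Option B: refit on the test data, as in the cell above.
test_scaled_refit = MinMaxScaler().fit_transform(test_age)

print(test_scaled_shared.ravel())
print(test_scaled_refit.ravel())
```

With the shared scaler, age 34.5 maps to about 0.428 rather than the 0.452723 obtained by refitting, and test values outside the training range fall slightly outside 0 to 1.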

## Deploy the Machine Learning Model

Here we apply the model trained above to the prepared test data.
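
`predict` expects the same feature columns the model was fitted on. One defensive pattern is to select those columns by name before predicting, which also keeps the call safe to rerun once extra columns (such as a prediction column) have been added to the frame. The list `feature_cols` below is an assumption that mirrors the training columns shown earlier, and the toy frame holds illustrative values only.

```python
import pandas as pd

# Assumed to match the columns the model was fitted on.
feature_cols = ["PassengerId", "Pclass", "Sex", "Age", "Companion Def"]

# Toy frame with shuffled column order and one extra column.
toy_test = pd.DataFrame({
    "Survival Pred": [0, 1],
    "Age": [0.45, 0.62],
    "Sex": [0, 1],
    "Companion Def": [0, 1],
    "Pclass": [1.0, 1.0],
    "PassengerId": [892, 893],
})

# Select and order the feature columns explicitly.
X_new = toy_test[feature_cols]
print(list(X_new.columns))
# X_new could then be passed to a fitted model, e.g. log_reg.predict(X_new).
```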

In [46]:
predictions = log_reg.predict(df_test)
predictions

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [48]:
df_test['Survival Pred'] = predictions
df_test.head()

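| | PassengerId | Pclass | Sex | Age | Companion Def | Survival Pred |
|---|---|---|---|---|---|---|
| 0 | 892 | 1.0 | 0 | 0.452723 | 0 | 0 |
| 1 | 893 | 1.0 | 1 | 0.617566 | 1 | 1 |
| 2 | 894 | 0.5 | 0 | 0.815377 | 0 | 0 |
| 3 | 895 | 1.0 | 0 | 0.353818 | 0 | 0 |
| 4 | 896 | 1.0 | 1 | 0.287881 | 1 | 1 |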


In [51]:
df_test['Survival Pred'].value_counts()

0    258
1    160
Name: Survival Pred, dtype: int64

Based on the model's predictions, **160** of the 418 test passengers are predicted to survive and **258** are predicted not to survive.