This notebook only trains and saves a machine learning model. The goal of the project is to test the deployment of the model, not to build the best possible model, so there is little data exploration and no sophisticated model tuning.

In [72]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt

## Import the data

In [73]:
data = pd.read_csv('data/train.csv')
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Cleaning the data
The data cleaning here is deliberately rough; a real project would do it more thoroughly. Since the quality of the model is not that important for this project, I stick with "sledgehammer" methods.

In [74]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
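
As a rough guide to the cleaning decisions that follow, the non-null counts in the `data.info()` output can be turned into missing-value shares. The snippet below is only a sketch of that arithmetic: the counts are copied by hand from the output above, and the names `non_null` and `missing_share` are illustrative, not part of the original notebook.

```python
# Counts copied from the data.info() output above (891 rows in total)
n_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Share of missing values per column
missing_share = {col: (n_rows - count) / n_rows for col, count in non_null.items()}

for col, share in sorted(missing_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{col}: {share:.1%} missing")
# Cabin: 77.1% missing
# Age: 19.9% missing
# Embarked: 0.2% missing
```

On the loaded DataFrame itself, `data.isna().mean()` should give the same shares without copying numbers by hand.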


In [75]:
# drop Cabin because most of its values are missing
# drop Name, Ticket and PassengerId: free text or identifiers that are not used as features
data.drop(['Cabin', 'Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)

In [76]:
# encode 'Sex' as a binary column: 1 for 'male', 0 otherwise
data['Male'] = (data['Sex'] == 'male').astype('int64')
data.drop('Sex', axis=1, inplace=True)

In [77]:
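# make dummies from 'Embarked'; drop_first drops the first category ('C') as the baseline
# note: prefix 'Embarked_' plus the default separator '_' yields names like 'Embarked__Q'
# note: the 2 rows with missing Embarked get 0 in both dummies, the same encoding as 'C'
data = data.join(pd.get_dummies(data.Embarked, prefix='Embarked_', drop_first=True))
data.drop('Embarked', axis=1, inplace=True)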
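
To make the behaviour of the `pd.get_dummies` call above easier to see, here is a small sketch on a toy Series (the four values are made up for illustration, not taken from the dataset). It suggests two things worth keeping in mind: the prefix `'Embarked_'` combined with the default separator produces a double underscore in the column names, and a missing value ends up with zeros in every dummy column, which makes it indistinguishable from the dropped baseline category.

```python
import pandas as pd

# Toy data: three known ports plus one missing value (illustrative only)
embarked = pd.Series(["S", "C", "Q", None], name="Embarked")

# Same arguments as in the notebook cell above
dummies = pd.get_dummies(embarked, prefix="Embarked_", drop_first=True)

# Cast to int so the result looks the same whether pandas returns bool or uint8
dummies = dummies.astype(int)
print(dummies)
# Rows for "C" (the dropped baseline) and for the missing value are both all zeros
```

If the two cases need to be told apart, `dummy_na=True` is one option, at the cost of an extra column.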

In [90]:
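# replace missing values for age with the mean age
# (assign the result back; calling fillna(inplace=True) on data.Age is unreliable in recent pandas)
data['Age'] = data['Age'].fillna(data['Age'].mean())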
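
Since the model is meant to be deployed, it may help to keep the imputation value around so that new passengers can be filled with the same number. The following is a minimal sketch of mean imputation on a toy Series; the ages and the names `train_age` and `age_mean` are assumptions for illustration only.

```python
import pandas as pd

# Toy ages with one missing value (illustrative, not the real column)
train_age = pd.Series([22.0, None, 26.0, 35.0])

# mean() skips missing values by default: (22 + 26 + 35) / 3
age_mean = train_age.mean()

# Fill the gap and assign the result to a new Series
filled = train_age.fillna(age_mean)
print(filled)
```

At prediction time the stored `age_mean` would be reused rather than recomputed from the incoming data.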

In [91]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Survived     891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Age          891 non-null    float64
 3   SibSp        891 non-null    int64  
 4   Parch        891 non-null    int64  
 5   Fare         891 non-null    float64
 6   Male         891 non-null    int64  
 7   Embarked__Q  891 non-null    uint8  
 8   Embarked__S  891 non-null    uint8  
dtypes: float64(2), int64(5), uint8(2)
memory usage: 50.6 KB
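
Because the deployed model will have to receive inputs in exactly this shape, one possible way to package the cleaning steps is a single function that can be reused at prediction time. This is only a sketch under assumptions: the function name `clean_passengers`, the `FEATURES` list, and the placeholder mean age of 30.0 are hypothetical and do not appear in the original notebook. The dummy columns are built by explicit comparison rather than `pd.get_dummies`, so that a single incoming row still yields every expected column.

```python
import pandas as pd

# Feature columns in the order shown by data.info() above (without the target)
FEATURES = ["Pclass", "Age", "SibSp", "Parch", "Fare",
            "Male", "Embarked__Q", "Embarked__S"]


def clean_passengers(raw: pd.DataFrame, age_mean: float) -> pd.DataFrame:
    """Apply the notebook's cleaning steps to raw passenger rows (hypothetical helper)."""
    out = pd.DataFrame(index=raw.index)
    out["Pclass"] = raw["Pclass"]
    out["Age"] = raw["Age"].fillna(age_mean)  # mean imputation with a stored value
    out["SibSp"] = raw["SibSp"]
    out["Parch"] = raw["Parch"]
    out["Fare"] = raw["Fare"]
    out["Male"] = (raw["Sex"] == "male").astype("int64")
    # 'C' and missing values both map to 0 in the two dummy columns
    out["Embarked__Q"] = (raw["Embarked"] == "Q").astype("uint8")
    out["Embarked__S"] = (raw["Embarked"] == "S").astype("uint8")
    return out[FEATURES]


# Two example rows; the second has a missing age
example = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [22.0, None],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Fare": [7.25, 71.2833],
    "Embarked": ["S", "C"],
})

cleaned = clean_passengers(example, age_mean=30.0)  # 30.0 is a placeholder, not the real mean
print(cleaned)
```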
