# **Predicting Survival on the Titanic Based on Passenger Characteristics**

This project is based on Kaggle's [Titanic - Machine Learning from Disaster Competition](https://www.kaggle.com/competitions/titanic/overview).

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

# **Setup**
The next cell imports all the Python libraries needed for the project.

In [1]:
import pandas as pd
from sklearn import svm

# **Import datasets**
Each dataset is previewed by displaying its first five rows.

In [2]:
training_data = pd.read_csv("../data/train.csv")
training_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
test_data = pd.read_csv("../data/test.csv")
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


# **Data Preprocessing**
Check both datasets for missing values, fill them in, and convert the categorical variables to numerical ones.

In [4]:
# Checking if there are any NaN values in the training dataset

train_nan_count = training_data.isna().sum()
train_nan_count

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The training set has `NaN` values in three columns: `Age` (177), `Cabin` (687), and `Embarked` (2).
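With 891 training rows, those counts mean roughly 20% of `Age` and 77% of `Cabin` are missing, which is why the two columns are handled differently below. A minimal sketch of how the share of missing values per column could be computed; the `toy` frame and its values are made up for illustration and are not the Titanic data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for training_data (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() gives a boolean frame; its column means are the missing fractions
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Looking at fractions rather than raw counts makes it easier to judge whether a column is worth imputing or is better reduced to a simple indicator.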

In [5]:
# For the Age column, we fill in the missing values with the mean age (truncated to an integer).

training_data['Age'] = training_data['Age'].fillna(int(training_data['Age'].mean()))
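Note that wrapping the mean in `int()` truncates it, so the fill value is a whole number of years rather than the exact mean. A small sketch of that behavior on a made-up Series (the ages below are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy ages with one missing entry
ages = pd.Series([22.0, 38.0, 26.0, np.nan], name="Age")

exact_mean = ages.mean()       # NaN is skipped: (22 + 38 + 26) / 3
fill_value = int(exact_mean)   # int() truncates toward zero

# Assigning the result back avoids relying on inplace=True on a column selection
filled = ages.fillna(fill_value)
print(exact_mean, fill_value, filled.tolist())
```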

In [6]:
# For Cabin, we replace the value with a flag: 1 if a cabin is recorded for the passenger, 0 otherwise.

training_data['Cabin'] = training_data['Cabin'].notna().astype(int)

In [7]:
# For Embarked, we handle NaN values by replacing them with the most frequent category.

training_data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [8]:
# S is the most common, so we fill all NaN values with it.

training_data['Embarked'] = training_data['Embarked'].fillna("S")
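The most frequent category can also be looked up programmatically instead of being read off the counts by hand. A minimal sketch using `Series.mode()`, on a made-up Series of ports (the name `most_common` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy port-of-embarkation values with one missing entry
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S"], name="Embarked")

# mode() ignores NaN by default and returns a Series; take its first entry
most_common = ports.mode()[0]
filled = ports.fillna(most_common)
print(most_common, filled.tolist())
```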

In [9]:
training_data.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64

All `NaN` values have been filled in. Next, we convert the categorical columns, `Embarked` and `Sex`, to numerical variables so the model can use them.

In [10]:
# Get dummies for the embarked column

embarked_dummies = pd.get_dummies(training_data["Embarked"], dtype=int)
embarked_dummies

Unnamed: 0,C,Q,S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
886,0,0,1
887,0,0,1
888,0,0,1
889,1,0,0


In [11]:
# Replace values in sex with 1 for male, 0 for female

training_data.loc[training_data["Sex"] == "male", "Sex"] = 1
training_data.loc[training_data["Sex"] == "female", "Sex"] = 0

In [12]:
# Drop original embarked column and concatenate dummy values

training_data = training_data.drop(["Embarked"], axis='columns')
training_data = pd.concat([training_data, embarked_dummies], axis=1)
training_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,C,Q,S
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.2500,0,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,1,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.9250,0,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1000,1,0,0,1
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.0500,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",1,27.0,0,0,211536,13.0000,0,0,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",0,19.0,0,0,112053,30.0000,1,0,0,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,29.0,1,2,W./C. 6607,23.4500,0,0,0,1
889,890,1,1,"Behr, Mr. Karl Howell",1,26.0,0,0,111369,30.0000,1,1,0,0


We can also combine the `SibSp` (siblings and spouses) and `Parch` (parents and children) columns into a single column, since both simply count family members aboard.

In [13]:
# Replace SibSp and Parch with a sum of the two values as Total Family Members

training_data['Total Family Members'] = training_data.loc[:, "SibSp":"Parch"].sum(axis=1)
training_data = training_data.drop(["SibSp", "Parch"], axis='columns')
training_data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,C,Q,S,Total Family Members
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,A/5 21171,7.2500,0,0,0,1,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,PC 17599,71.2833,1,1,0,0,1
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,STON/O2. 3101282,7.9250,0,0,0,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,113803,53.1000,1,0,0,1,1
4,5,0,3,"Allen, Mr. William Henry",1,35.0,373450,8.0500,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",1,27.0,211536,13.0000,0,0,0,1,0
887,888,1,1,"Graham, Miss. Margaret Edith",0,19.0,112053,30.0000,1,0,0,1,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,29.0,W./C. 6607,23.4500,0,0,0,1,3
889,890,1,1,"Behr, Mr. Karl Howell",1,26.0,111369,30.0000,1,1,0,0,0


All of the features we plan to use are now numerical. We now apply the same preprocessing steps to the test data.
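Since the same steps are applied to both datasets, one option is to collect them in a single function so the two stay in sync. The sketch below is one possible arrangement, not the notebook's own code: `preprocess` is a hypothetical helper, the toy rows are made up, and it fills `Embarked` with the frame's most frequent value rather than a hard-coded `"S"`.

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the notebook's preprocessing steps."""
    out = df.copy()
    # Missing ages: truncated mean age of this frame
    out["Age"] = out["Age"].fillna(int(out["Age"].mean()))
    # Missing fares: mean fare of this frame
    out["Fare"] = out["Fare"].fillna(out["Fare"].mean())
    # Cabin: 1 if a cabin is recorded, 0 otherwise
    out["Cabin"] = out["Cabin"].notna().astype(int)
    # Missing ports: most frequent port in this frame
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    # Sex: 1 for male, 0 for female
    out["Sex"] = out["Sex"].map({"male": 1, "female": 0})
    # One-hot encode Embarked and drop the original column
    dummies = pd.get_dummies(out["Embarked"], dtype=int)
    out = pd.concat([out.drop(columns="Embarked"), dummies], axis=1)
    # Family size: siblings/spouses plus parents/children
    out["Total Family Members"] = out["SibSp"] + out["Parch"]
    return out.drop(columns=["SibSp", "Parch"])

# Toy rows for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 2, 0],
    "Fare": [7.25, np.nan, 7.925, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})
clean = preprocess(toy)
print(clean)
```

Calling the same function on the training and test frames would replace the duplicated cells, at the cost of hiding the intermediate outputs that the step-by-step cells display.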

In [14]:
# Checking if there are any NaN values in the test dataset

test_nan_count = test_data.isna().sum()
test_nan_count

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [15]:
# For the Age column, we fill in the missing values with the mean age (truncated to an integer).

test_data['Age'] = test_data['Age'].fillna(int(test_data['Age'].mean()))

In [16]:
# For Cabin, we replace the value with a flag: 1 if a cabin is recorded for the passenger, 0 otherwise.

test_data['Cabin'] = test_data['Cabin'].notna().astype(int)

In [17]:
# For the Fare column, we fill in the missing value with the mean fare.

test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())

In [18]:
# Replace values in sex with 1 for male, 0 for female

test_data.loc[test_data["Sex"] == "male", "Sex"] = 1
test_data.loc[test_data["Sex"] == "female", "Sex"] = 0

In [19]:
# Get dummies for the embarked column

embarked_dummies = pd.get_dummies(test_data["Embarked"], dtype=int)
test_data = test_data.drop(["Embarked"], axis='columns')
test_data = pd.concat([test_data, embarked_dummies], axis=1)
test_data

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,C,Q,S
0,892,3,"Kelly, Mr. James",1,34.5,0,0,330911,7.8292,0,0,1,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",0,47.0,1,0,363272,7.0000,0,0,0,1
2,894,2,"Myles, Mr. Thomas Francis",1,62.0,0,0,240276,9.6875,0,0,1,0
3,895,3,"Wirz, Mr. Albert",1,27.0,0,0,315154,8.6625,0,0,0,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22.0,1,1,3101298,12.2875,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",1,30.0,0,0,A.5. 3236,8.0500,0,0,0,1
414,1306,1,"Oliva y Ocana, Dona. Fermina",0,39.0,0,0,PC 17758,108.9000,1,1,0,0
415,1307,3,"Saether, Mr. Simon Sivertsen",1,38.5,0,0,SOTON/O.Q. 3101262,7.2500,0,0,0,1
416,1308,3,"Ware, Mr. Frederick",1,30.0,0,0,359309,8.0500,0,0,0,1


In [20]:
# Replace SibSp and Parch with a sum of the two values as Total Family Members

test_data['Total Family Members'] = test_data.loc[:, "SibSp":"Parch"].sum(axis=1)
test_data = test_data.drop(["SibSp", "Parch"], axis='columns')
test_data

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,C,Q,S,Total Family Members
0,892,3,"Kelly, Mr. James",1,34.5,330911,7.8292,0,0,1,0,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",0,47.0,363272,7.0000,0,0,0,1,1
2,894,2,"Myles, Mr. Thomas Francis",1,62.0,240276,9.6875,0,0,1,0,0
3,895,3,"Wirz, Mr. Albert",1,27.0,315154,8.6625,0,0,0,1,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22.0,3101298,12.2875,0,0,0,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",1,30.0,A.5. 3236,8.0500,0,0,0,1,0
414,1306,1,"Oliva y Ocana, Dona. Fermina",0,39.0,PC 17758,108.9000,1,1,0,0,0
415,1307,3,"Saether, Mr. Simon Sivertsen",1,38.5,SOTON/O.Q. 3101262,7.2500,0,0,0,1,0
416,1308,3,"Ware, Mr. Frederick",1,30.0,359309,8.0500,0,0,0,1,0


In [21]:
test_nan_count = test_data.isna().sum()
test_nan_count

PassengerId             0
Pclass                  0
Name                    0
Sex                     0
Age                     0
Ticket                  0
Fare                    0
Cabin                   0
C                       0
Q                       0
S                       0
Total Family Members    0
dtype: int64

Both the training and test datasets have now been preprocessed and share the same feature columns; the only difference is that the training data also contains the `Survived` target.
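Here the dummy columns match because all three ports appear in both files. If a category were missing from one dataset, `pd.get_dummies` would produce fewer columns for it. A hedged sketch of one way to guard against that, using `reindex` on made-up port values (the variable names are assumptions for illustration):

```python
import pandas as pd

# Toy port values; the test sample deliberately has no "Q"
train_ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")
test_ports = pd.Series(["S", "S", "C"], name="Embarked")

train_dummies = pd.get_dummies(train_ports, dtype=int)
test_dummies = pd.get_dummies(test_ports, dtype=int)

# Force the test dummies onto the training columns; absent categories become 0
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_dummies)
```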

# **Training and Testing Model**
I will be using a Support Vector Machine (SVM) classifier to predict whether a passenger survived the Titanic. Two kernels were tried, `linear` and `rbf`; the cells below show the `rbf` run.

In [22]:
# Selecting features and prediction target

features = ["Pclass", "Sex", "Age", "Fare", "Cabin", "C", "Q", "S", "Total Family Members"]
X_train = training_data[features]
X_train

Unnamed: 0,Pclass,Sex,Age,Fare,Cabin,C,Q,S,Total Family Members
0,3,1,22.0,7.2500,0,0,0,1,1
1,1,0,38.0,71.2833,1,1,0,0,1
2,3,0,26.0,7.9250,0,0,0,1,0
3,1,0,35.0,53.1000,1,0,0,1,1
4,3,1,35.0,8.0500,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...
886,2,1,27.0,13.0000,0,0,0,1,0
887,1,0,19.0,30.0000,1,0,0,1,0
888,3,0,29.0,23.4500,0,0,0,1,3
889,1,1,26.0,30.0000,1,1,0,0,0


In [23]:
y_train = training_data["Survived"]
y_train

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [24]:
X_test = test_data[features]
X_test

Unnamed: 0,Pclass,Sex,Age,Fare,Cabin,C,Q,S,Total Family Members
0,3,1,34.5,7.8292,0,0,1,0,0
1,3,0,47.0,7.0000,0,0,0,1,1
2,2,1,62.0,9.6875,0,0,1,0,0
3,3,1,27.0,8.6625,0,0,0,1,0
4,3,0,22.0,12.2875,0,0,0,1,2
...,...,...,...,...,...,...,...,...,...
413,3,1,30.0,8.0500,0,0,0,1,0
414,1,0,39.0,108.9000,1,1,0,0,0
415,3,1,38.5,7.2500,0,0,0,1,0
416,3,1,30.0,8.0500,0,0,0,1,0


In [25]:
# Training SVM model with the RBF kernel (use kernel='linear' for the linear-kernel run)

svm_model = svm.SVC(kernel='rbf')
svm_model.fit(X_train, y_train)
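Because the test set has no `Survived` labels, accuracy cannot be measured locally on it. One common way to get an estimate before submitting is k-fold cross-validation on the training data. The sketch below assumes scikit-learn's `cross_val_score` and uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, so its numbers say nothing about the Titanic results:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (not Titanic data)
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

results = {}
for kernel in ("linear", "rbf"):
    # 5-fold cross-validated accuracy for each kernel
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5, scoring="accuracy")
    results[kernel] = scores
    print(kernel, round(scores.mean(), 3))
```

With the real data, `X` and `y` would be replaced by `X_train` and `y_train`, and the kernel with the better mean score would be the natural one to submit.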

In [26]:
# Generating dataframe for predictions

predictions = svm_model.predict(X_test)
test_data["Survived"] = predictions
test_data

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,C,Q,S,Total Family Members,Survived
0,892,3,"Kelly, Mr. James",1,34.5,330911,7.8292,0,0,1,0,0,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",0,47.0,363272,7.0000,0,0,0,1,1,0
2,894,2,"Myles, Mr. Thomas Francis",1,62.0,240276,9.6875,0,0,1,0,0,0
3,895,3,"Wirz, Mr. Albert",1,27.0,315154,8.6625,0,0,0,1,0,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0,22.0,3101298,12.2875,0,0,0,1,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",1,30.0,A.5. 3236,8.0500,0,0,0,1,0,0
414,1306,1,"Oliva y Ocana, Dona. Fermina",0,39.0,PC 17758,108.9000,1,1,0,0,0,1
415,1307,3,"Saether, Mr. Simon Sivertsen",1,38.5,SOTON/O.Q. 3101262,7.2500,0,0,0,1,0,0
416,1308,3,"Ware, Mr. Frederick",1,30.0,359309,8.0500,0,0,0,1,0,0


In [27]:
test_data["Survived"].value_counts()

0    341
1     77
Name: Survived, dtype: int64

In [28]:
# Export the final predictions to a CSV file.

test_data.to_csv("../data/blurridge_submission_rbf.csv", index=False, columns=["PassengerId", "Survived"])
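Before uploading, it can be worth checking that the exported file has only the `PassengerId` and `Survived` columns and that every prediction is 0 or 1. A minimal sketch of such a check, writing to an in-memory buffer instead of disk; the `toy_predictions` frame and its values are made up for illustration:

```python
import io
import pandas as pd

# Toy stand-in for test_data after prediction (IDs and labels are made up)
toy_predictions = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Name": ["a", "b", "c"],
    "Survived": [0, 1, 0],
})

# Same to_csv arguments as the export cell, but into a string buffer
buffer = io.StringIO()
toy_predictions.to_csv(buffer, index=False, columns=["PassengerId", "Survived"])

# Read the CSV back and inspect its shape and contents
submission = pd.read_csv(io.StringIO(buffer.getvalue()))
only_binary = bool(submission["Survived"].isin([0, 1]).all())
print(submission.shape, list(submission.columns), only_binary)
```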

# **Conclusion**

The predictions reached an accuracy of 76.55% with the linear-kernel SVM and 66.27% with the RBF-kernel SVM, so the simpler linear kernel performed better on this feature set.
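One possible contributor to the weaker RBF result, not verified in this notebook, is feature scale: the RBF kernel depends on distances between samples, so a wide-ranging column such as `Fare` can outweigh the 0/1 columns. A hedged sketch of how standardizing the features first could be tried, using scikit-learn's `StandardScaler` in a pipeline and synthetic data in place of the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training features (not Titanic data)
X, y = make_classification(n_samples=200, n_features=9, random_state=0)
X[:, 0] *= 100.0  # exaggerate one feature's scale, loosely like Fare vs. 0/1 flags

raw_rbf = SVC(kernel="rbf")
# The scaler is fit inside each CV fold, so no information leaks from held-out rows
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

raw_acc = cross_val_score(raw_rbf, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_rbf, X, y, cv=5).mean()
print(f"unscaled: {raw_acc:.3f}  scaled: {scaled_acc:.3f}")
```

Whether scaling closes the gap on the actual Titanic features would need to be checked by rerunning the notebook with such a pipeline.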