# Titanic Dataset

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. There were not enough lifeboats for everyone on board, resulting in the deaths of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, some groups of people were more likely to survive than others.

The objective of this notebook is to build a predictive model that determines which kinds of people were more likely to survive the Titanic disaster. The dataset contains passenger features such as name, age, sex, and socio-economic class; the model below uses the numeric and categorical ones, while name, ticket, and cabin are dropped during cleaning. The goal is to identify the key factors that influenced survival and to build an accurate model that predicts survival from these features.

To achieve this, the notebook uses a Random Forest: a machine learning method that trains many decision trees and then combines their outputs to make its final prediction.
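
To make the "many trees, one combined prediction" idea concrete, here is a minimal sketch on synthetic data (generated with `make_classification`, not the Titanic data). It assumes scikit-learn's `RandomForestClassifier`, which combines trees by averaging their predicted class probabilities; the names `X`, `y`, `forest`, and `manual_pred` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the Titanic dataset).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# An odd number of trees avoids exact ties between the two classes.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Ask every individual tree for its class probabilities, then average them.
tree_probs = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
avg_probs = tree_probs.mean(axis=0)

# The class with the highest averaged probability is the combined prediction.
manual_pred = forest.classes_[avg_probs.argmax(axis=1)]

print("Matches forest.predict:", bool((manual_pred == forest.predict(X)).all()))
```

Averaging probabilities rather than counting hard votes is a scikit-learn design choice; other Random Forest implementations may use majority voting instead.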

#### Import Libraries

In [555]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

#### Load and Display Training Dataset

In [556]:
data_train = pd.read_csv('train_titanic.csv')
data_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [557]:
data_train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [558]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


#### Load and Display Testing Dataset

In [559]:
data_test = pd.read_csv('test_titanic.csv')
data_test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [560]:
data_test.describe()

| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 418.0 | 418.0 | 332.0 | 418.0 | 418.0 | 417.0 |
| mean | 1100.5 | 2.26555 | 30.27259 | 0.447368 | 0.392344 | 35.627188 |
| std | 120.810458 | 0.841838 | 14.181209 | 0.89676 | 0.981429 | 55.907576 |
| min | 892.0 | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 |
| 25% | 996.25 | 1.0 | 21.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 1100.5 | 3.0 | 27.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1204.75 | 3.0 | 39.0 | 1.0 | 0.0 | 31.5 |
| max | 1309.0 | 3.0 | 76.0 | 8.0 | 9.0 | 512.3292 |


In [561]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


#### Cleaning the Training Dataset

In [562]:
# Drop the free-text and mostly-empty columns that are not used as features
data_train.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
data_train

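| | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | male | 27.0 | 0 | 0 | 13.0000 | S |
| 887 | 888 | 1 | 1 | female | 19.0 | 0 | 0 | 30.0000 | S |
| 888 | 889 | 0 | 3 | female | NaN | 1 | 2 | 23.4500 | S |
| 889 | 890 | 1 | 1 | male | 26.0 | 0 | 0 | 30.0000 | C |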


In [563]:
# Fill missing values in 'Embarked' with the mode
most_common_embarked = data_train['Embarked'].mode()[0]
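# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame
data_train['Embarked'] = data_train['Embarked'].fillna(most_common_embarked)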

In [564]:
encoder_sex = OneHotEncoder(sparse_output=False)
encoder_embarked = OneHotEncoder(sparse_output=False)

# Fit and transform 'Sex' column
encoded_sex = encoder_sex.fit_transform(data_train[['Sex']])
encoded_sex_df = pd.DataFrame(encoded_sex, columns=encoder_sex.get_feature_names_out(['Sex']))

# Fit and transform 'Embarked' column
encoded_embarked = encoder_embarked.fit_transform(data_train[['Embarked']])
encoded_embarked_df = pd.DataFrame(encoded_embarked, columns=encoder_embarked.get_feature_names_out(['Embarked']))

# Concatenate the original DataFrame (excluding 'Sex' and 'Embarked' columns) with the encoded DataFrames
columns_to_encode = ['Sex', 'Embarked']
data_train = pd.concat([data_train.drop(columns=columns_to_encode), encoded_sex_df, encoded_embarked_df], axis=1)

# Display the result
print(data_train.head())

   PassengerId  Survived  Pclass   Age  SibSp  Parch     Fare  Sex_female  \
0            1         0       3  22.0      1      0   7.2500         0.0   
1            2         1       1  38.0      1      0  71.2833         1.0   
2            3         1       3  26.0      0      0   7.9250         1.0   
3            4         1       1  35.0      1      0  53.1000         1.0   
4            5         0       3  35.0      0      0   8.0500         0.0   

   Sex_male  Embarked_C  Embarked_Q  Embarked_S  
0       1.0         0.0         0.0         1.0  
1       0.0         1.0         0.0         0.0  
2       0.0         0.0         0.0         1.0  
3       0.0         0.0         0.0         1.0  
4       1.0         0.0         0.0         1.0  


In [565]:
# Fill missing 'Age' values with the median age
age_median = data_train['Age'].median()
data_train['Age'] = data_train['Age'].fillna(age_median)
print(data_train.head())

   PassengerId  Survived  Pclass   Age  SibSp  Parch     Fare  Sex_female  \
0            1         0       3  22.0      1      0   7.2500         0.0   
1            2         1       1  38.0      1      0  71.2833         1.0   
2            3         1       3  26.0      0      0   7.9250         1.0   
3            4         1       1  35.0      1      0  53.1000         1.0   
4            5         0       3  35.0      0      0   8.0500         0.0   

   Sex_male  Embarked_C  Embarked_Q  Embarked_S  
0       1.0         0.0         0.0         1.0  
1       0.0         1.0         0.0         0.0  
2       0.0         0.0         0.0         1.0  
3       0.0         0.0         0.0         1.0  
4       1.0         0.0         0.0         1.0  
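
Median imputation can also be expressed with scikit-learn's `SimpleImputer`, which remembers the median it learned so the same value can be applied to other data later. This is a sketch of an alternative, not what the notebook does: the toy frames and the choice to reuse the training median on the second frame are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames with missing ages (hypothetical values).
toy_train = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, 26.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0]})

# Learn the median from the training column and fill its gaps.
imputer = SimpleImputer(strategy="median")
toy_train["Age"] = imputer.fit_transform(toy_train[["Age"]]).ravel()

# Apply the same learned median to the second frame (transform only).
toy_test["Age"] = imputer.transform(toy_test[["Age"]]).ravel()

print("Learned median:", imputer.statistics_[0])
print(toy_test)
```

Whether to impute test data with its own median (as in this notebook) or with the training median is a modelling choice; the second keeps preprocessing independent of the data being predicted.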


In [566]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Age          891 non-null    float64
 4   SibSp        891 non-null    int64  
 5   Parch        891 non-null    int64  
 6   Fare         891 non-null    float64
 7   Sex_female   891 non-null    float64
 8   Sex_male     891 non-null    float64
 9   Embarked_C   891 non-null    float64
 10  Embarked_Q   891 non-null    float64
 11  Embarked_S   891 non-null    float64
dtypes: float64(7), int64(5)
memory usage: 83.7 KB


In [567]:
print(data_train.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
Sex_female     0
Sex_male       0
Embarked_C     0
Embarked_Q     0
Embarked_S     0
dtype: int64


#### Cleaning Testing Dataset

In [568]:
# Drop the same columns that were dropped from the training dataset
data_test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
data_test

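| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
| 1 | 893 | 3 | female | 47.0 | 1 | 0 | 7.0000 | S |
| 2 | 894 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
| 3 | 895 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
| 4 | 896 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 3 | male | NaN | 0 | 0 | 8.0500 | S |
| 414 | 1306 | 1 | female | 39.0 | 0 | 0 | 108.9000 | C |
| 415 | 1307 | 3 | male | 38.5 | 0 | 0 | 7.2500 | S |
| 416 | 1308 | 3 | male | NaN | 0 | 0 | 8.0500 | S |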


In [569]:
# Reuse the encoders already fitted on the training data (In [564]) so that
# the testing data is encoded with exactly the same categories and columns

# Transform 'Sex' column
encoded_sex = encoder_sex.transform(data_test[['Sex']])
encoded_sex_df = pd.DataFrame(encoded_sex, columns=encoder_sex.get_feature_names_out(['Sex']))

# Transform 'Embarked' column
encoded_embarked = encoder_embarked.transform(data_test[['Embarked']])
encoded_embarked_df = pd.DataFrame(encoded_embarked, columns=encoder_embarked.get_feature_names_out(['Embarked']))

# Concatenate the original DataFrame (excluding 'Sex' and 'Embarked' columns) with the encoded DataFrames
columns_to_encode = ['Sex', 'Embarked']
data_test = pd.concat([data_test.drop(columns=columns_to_encode), encoded_sex_df, encoded_embarked_df], axis=1)

# Display the result
print(data_test.head())

   PassengerId  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  \
0          892       3  34.5      0      0   7.8292         0.0       1.0   
1          893       3  47.0      1      0   7.0000         1.0       0.0   
2          894       2  62.0      0      0   9.6875         0.0       1.0   
3          895       3  27.0      0      0   8.6625         0.0       1.0   
4          896       3  22.0      1      1  12.2875         1.0       0.0   

   Embarked_C  Embarked_Q  Embarked_S  
0         0.0         1.0         0.0  
1         0.0         0.0         1.0  
2         0.0         1.0         0.0  
3         0.0         0.0         1.0  
4         0.0         0.0         1.0  


In [570]:
# Fill missing 'Age' and 'Fare' values with the median of each column
age_median = data_test['Age'].median()
fare_median = data_test['Fare'].median()
data_test['Age'] = data_test['Age'].fillna(age_median)
data_test['Fare'] = data_test['Fare'].fillna(fare_median)
print(data_test.head())

   PassengerId  Pclass   Age  SibSp  Parch     Fare  Sex_female  Sex_male  \
0          892       3  34.5      0      0   7.8292         0.0       1.0   
1          893       3  47.0      1      0   7.0000         1.0       0.0   
2          894       2  62.0      0      0   9.6875         0.0       1.0   
3          895       3  27.0      0      0   8.6625         0.0       1.0   
4          896       3  22.0      1      1  12.2875         1.0       0.0   

   Embarked_C  Embarked_Q  Embarked_S  
0         0.0         1.0         0.0  
1         0.0         0.0         1.0  
2         0.0         1.0         0.0  
3         0.0         0.0         1.0  
4         0.0         0.0         1.0  


In [571]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Age          418 non-null    float64
 3   SibSp        418 non-null    int64  
 4   Parch        418 non-null    int64  
 5   Fare         418 non-null    float64
 6   Sex_female   418 non-null    float64
 7   Sex_male     418 non-null    float64
 8   Embarked_C   418 non-null    float64
 9   Embarked_Q   418 non-null    float64
 10  Embarked_S   418 non-null    float64
dtypes: float64(7), int64(4)
memory usage: 36.1 KB


In [572]:
print(data_test.isnull().sum())

PassengerId    0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
Sex_female     0
Sex_male       0
Embarked_C     0
Embarked_Q     0
Embarked_S     0
dtype: int64


#### Model

In [573]:
print(data_train.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
       'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')


In [574]:
print(data_test.columns)

Index(['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_female',
       'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')


In [575]:
y_train = data_train['Survived']
x_train = data_train.drop(columns=['Survived'])
x_test = data_test

In [576]:
model = RandomForestClassifier(random_state=0)
model.fit(x_train,y_train)
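
Since one stated goal is to identify the factors that influenced survival, a fitted forest's `feature_importances_` attribute is one way to rank features. The sketch below uses a small synthetic frame, not the Titanic data: the column names merely echo this notebook, and the label is deliberately constructed to depend mostly on `Sex_female`, so nothing here is a finding about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 400

# Synthetic features (assumption: illustrative only, not real passengers).
toy = pd.DataFrame({
    "Sex_female": rng.integers(0, 2, n).astype(float),
    "Pclass": rng.integers(1, 4, n),
    "Noise": rng.normal(size=n),
})

# Synthetic label: equals Sex_female, with roughly 10% of labels flipped.
toy_y = ((toy["Sex_female"] == 1) ^ (rng.random(n) < 0.1)).astype(int)

forest = RandomForestClassifier(random_state=0).fit(toy, toy_y)

# Impurity-based importances: one non-negative weight per feature, summing to 1.
importances = pd.Series(forest.feature_importances_, index=toy.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

Impurity-based importances tend to favour continuous or high-cardinality columns (an identifier such as `PassengerId` would be an example), so they are best read as a rough guide rather than a definitive ranking.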

In [577]:
# Make predictions on the training data
y_train_pred = model.predict(x_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
train_f1_score = f1_score(y_train, y_train_pred)
print('Accuracy for training data:', train_accuracy)
print('F1 score for training data:', train_f1_score)

Accuracy for training data: 1.0
F1 score for training data: 1.0
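
A training accuracy of 1.0 is measured on the same rows the model was fitted on, so it says little about how the model performs on unseen passengers. Cross-validation gives a less optimistic estimate by scoring each fold with a model that never saw it. The sketch below shows the idea on synthetic, deliberately noisy data (an assumption made to keep it self-contained); with the real data, the notebook's `x_train` and `y_train` would take the place of `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 20% label noise (not the Titanic dataset).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, flip_y=0.2, random_state=0
)

# Accuracy on the data the model was trained on.
model = RandomForestClassifier(random_state=0).fit(X, y)
train_acc = model.score(X, y)

# 5-fold cross-validation: each fold is scored by a model that did not see it.
cv_scores = cross_val_score(
    RandomForestClassifier(random_state=0), X, y, cv=5, scoring="accuracy"
)

print("Training accuracy:", train_acc)
print("Cross-validated accuracy:", cv_scores.mean())
```

A large gap between the two numbers is a sign of overfitting, and the cross-validated figure is usually the better guide to what a held-out test set will show.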


In [578]:
# Predict survival for the testing data and build the submission table
y_test_pred = model.predict(x_test)
sub = pd.DataFrame({'PassengerId': data_test['PassengerId'], 'Survived': y_test_pred})

In [579]:
sub.to_csv('submission.csv', index=False)

### The model scores an accuracy of 0.78947 on the Kaggle test set

![Kaggle Submission](titanic_submission.png)
