# Titanic Data Exploration & Prediction
This notebook's main objective is to extract relevant information from the Titanic dataset, gain insight from it, and predict the probability of survival based on the attributes/features of individual passengers combined with other relevant information.

## **Table of contents**
1. Importing relevant libraries & loading the dataset
2. Data cleaning & exploration
    - Column content
    - Column type
    - Removing NaN values
3. Data insight (survivor percentage by attribute)
    - Survivors by sex
    - Survivors by age
    - Survivors by class
    - Survivors by port of embarkation
    - Each insight is measured by:
      - Survived / total onboard
      - Survived / total survived
      - Survival rate within the column itself
4. Encoding data for modeling
5. Feature scaling
6. Model training
7. Model evaluation
8. Conclusion
9. Feedback

## **Importing libraries**

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Loading dataset

In [2]:
train_df = pd.read_csv('Train_Titanic.csv')
test_df = pd.read_csv('Test_Titanic.csv')
combine = [train_df,test_df]

Columns and their content 

In [None]:
train_df.nunique()

In [None]:
test_df.nunique()

Total Columns and rows

In [None]:
train_df.shape

In [None]:
test_df.shape

Data type

In [None]:
train_df.info()

In [None]:
test_df.info()

## **Data Cleaning & Insight**



Removing unnecessary columns


In [5]:
train_df = train_df.drop(['Name','Ticket','Cabin','PassengerId'], axis =1)
test_df = test_df.drop(['Name','Ticket','Cabin','PassengerId'], axis =1)

Dropping NaN values

In [6]:
train_df = train_df.dropna()
test_df = test_df.dropna()
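
Before dropping rows, it can help to see how much data `dropna()` removes. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative assumptions, not taken from the Titanic files) to count missing values per column and compare the row count before and after.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train_df; the values are made up
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
    'Embarked': ['S', 'C', 'S', None, 'Q'],
})

# Number of missing values in each column
missing_per_column = toy_df.isna().sum()

# Rows kept and rows lost when every row with a missing value is dropped
rows_before = len(toy_df)
rows_after = len(toy_df.dropna())
rows_lost = rows_before - rows_after

print(missing_per_column)
print('dropna() keeps %s of %s rows (%s lost)' % (rows_after, rows_before, rows_lost))
```

Running the same two checks on `train_df` and `test_df` before the cell above would show how many passengers are excluded from the rest of the analysis.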

In [None]:
pd.concat([train_df['Fare'], train_df['Survived']], axis=1).sort_values(['Fare'],ascending = False).reset_index()
# train_df['Fare'].sort_values(ascending = False).reset_index()

# Survivor Percentage

## **Survivors by sex** 

Total number of male/female onboard 

In [None]:
# Collecting the total number of male and female onboard 
male1 = len(train_df.loc[train_df['Sex'] == 'male'])
female1 = len(train_df.loc[train_df['Sex'] == 'female'])

In [79]:
# Total male and female onboard
print('The number of male and female onboard are %s and %s' % (male1, female1))

The number of male and female onboard are 453 and 259


Total number of **survived** male/female onboard

In [96]:
def finding(a,b,c):
    return pd.DataFrame(([[round((a/c*100)),round((b/c*100))]]),index=['Percentage%'], columns= ['Male','Female'])   

In [81]:
# Removed the non survivors
survived = train_df[train_df['Survived'] == 1]

In [88]:
male = len(survived.loc[survived['Sex'] == 'male'])
female = len(survived.loc[survived['Sex'] == 'female'])
total_survivor_sex = len(survived['Sex'])
total_people_onboard = len(train_df['Sex'])


In [330]:
# Number of female and male who survived
print('The number of male and female who survived are %s and %s' % (male, female))

The number of male and female who survived are 93 and 195


In [118]:
# Survivor Percentage comparing male and female
finding(male,female,total_survivor_sex)

Unnamed: 0,Male,Female
Percentage%,32,68


In [119]:
# Percentage who survived out of everyone onboard
finding(male,female,total_people_onboard)

Unnamed: 0,Male,Female
Percentage%,13,27


**Findings**

The formula used for this calculation takes the number of female/male **survivors** and divides it by the total number of people who **survived**. As the result shows, females make up 68% of the survivors compared to 32% for males, even though there are more male passengers onboard (453 versus 259). Measured within each group, 195 of 259 females (75%) survived against 93 of 453 males (21%), so females had a much higher chance of survival. Well, now you know why you can't get rid of your wife.
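
The three measures listed in the table of contents can be checked by hand from the counts printed above. This is a minimal sketch in plain Python; the helper name `percent` is an assumption introduced for illustration, and the numbers are the ones reported by the cells in this section.

```python
# Counts reported by the cells above
onboard = {'Male': 453, 'Female': 259}
survivors = {'Male': 93, 'Female': 195}

total_onboard = sum(onboard.values())      # 712
total_survivors = sum(survivors.values())  # 288

def percent(part, whole):
    """Return part/whole as a rounded percentage (hypothetical helper)."""
    return round(part / whole * 100)

# 1. Survived / total onboard
share_of_onboard = {sex: percent(n, total_onboard) for sex, n in survivors.items()}
# 2. Survived / total survived
share_of_survivors = {sex: percent(n, total_survivors) for sex, n in survivors.items()}
# 3. Survival rate within each sex
rate_within_group = {sex: percent(survivors[sex], onboard[sex]) for sex in onboard}

print(share_of_onboard)    # {'Male': 13, 'Female': 27}
print(share_of_survivors)  # {'Male': 32, 'Female': 68}
print(rate_within_group)   # {'Male': 21, 'Female': 75}
```

The first two results match the two `finding(...)` tables; the third is the within-group rate, which is the measure that speaks most directly to an individual's chance of survival.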



## **Survivors by age**

In [120]:
# Mean age of people onboard
train_df['Age'].mean()

29.64209269662921

In [121]:
# Mean of age of people who survived
survived['Age'].mean()

28.19329861111111

In [None]:
pd.concat([train_df['Age'],train_df['Survived']],axis=1).dropna().head()

In [None]:
pd.concat([survived['Age'],survived['Survived']],axis=1).dropna().head()

In [None]:
above_30 = []
below_30 = []
for x in survived['Age']:
    if x >= 30:
        above_30.append(x)
        
    else:
        below_30.append(x)
sa30 = len(above_30)
sb30 = len(below_30)
total_sur = len(train_df['Age'])
survival_rate_above30 = round(sa30/total_sur *100)
survival_rate_below30 = round(sb30/total_sur *100)
total_age_percentage = pd.DataFrame(([[round(train_df['Age'].mean()),survival_rate_above30,survival_rate_below30]]), columns= ['Age~mean','Percentage above 30','Percentage below 30'])

In [None]:
above_28 = []
below_28 = []
for x in survived['Age']:
    if x >= 28:
        above_28.append(x)
        
    else:
        below_28.append(x)
sa28 = len(above_28)
sb28 = len(below_28)
total_sur = len(survived['Age'])
survival_rate_above28 = round(sa28/total_sur *100)
survival_rate_below28 = round(sb28/total_sur *100)
survive_age_percentage = pd.DataFrame(([[round(survived['Age'].mean()),survival_rate_above28,survival_rate_below28]]), columns= ['Age~mean','Percentage above 28','Percentage below 28'])

In [122]:
# Survivors aged 30 and above / below 30, as a percentage of everyone onboard
total_age_percentage

Unnamed: 0,Age~mean,Percentage above 30,Percentage below 30
0,30,19,22


In [123]:
# Comparison of people who survived above/below 28
survive_age_percentage

Unnamed: 0,Age~mean,Percentage above 28,Percentage below 28
0,28,51,49


**Findings**

The average age of everyone onboard (30) is taken as the first benchmark. We then compare survivors aged 30 and above with survivors below 30, each as a percentage of everyone onboard: 19% against 22%.

For the people who survived, the average age is 28. Splitting the survivors at that age, 51% are 28 or older and 49% are younger, so the survivors are divided almost evenly around their own mean age. Both tables count survivors only, so neither one shows by itself whether an older or a younger passenger had a better chance of surviving.
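
To compare the chance of survival on either side of an age cut-off, the survival rate has to be computed within each age band. A minimal sketch of one way to do that without explicit loops, assuming a small made-up frame (`toy_df`, its values, and the `threshold` variable are illustrative, not results from the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for train_df; the values are made up
toy_df = pd.DataFrame({
    'Age':      [4, 19, 22, 27, 28, 31, 40, 45, 58, 63],
    'Survived': [1,  0,  1,  0,  1,  1,  0,  0,  0,  0],
})

threshold = 28  # same cut-off as the survivor mean age used above

# Label each passenger by age band; >= threshold matches the loops above
age_band = toy_df['Age'].ge(threshold).map({True: '28 and above', False: 'below 28'})

# The mean of a 0/1 column is the fraction of 1s, i.e. the survival rate in each band
rate_by_band = toy_df.groupby(age_band)['Survived'].mean().mul(100).round()

print(rate_by_band)
```

Applied to `train_df`, the same three lines would give the within-band survival rates that the two tables above do not report.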



## **Survivor by class**

In [39]:
# People distribution by class
# sort_index() orders the counts by class label (1, 2, 3) instead of by frequency,
# so no manual swapping of list positions is needed
pclass = train_df['Pclass'].value_counts().sort_index().tolist()
pd.DataFrame([pclass],index=['Total Number of People'], columns=['Class 1','Class 2','Class 3'])

Unnamed: 0,Class 1,Class 2,Class 3
Total Number of People,184,173,355


In [48]:
# People who survived by class
# sort_index() orders the counts by class label (1, 2, 3) instead of by frequency
spclass = survived['Pclass'].value_counts().sort_index().tolist()
pd.DataFrame([spclass],index=['People who survived'], columns=['Class 1','Class 2','Class 3'])

Unnamed: 0,Class 1,Class 2,Class 3
People who survived,120,83,85


In [98]:
# Total number of people onboard across the three classes
total_number_people = pclass[0] + pclass[1] + pclass[2]
# Survivors of each class as a percentage of everyone onboard
# Argument order: class 1, class 2, total, class 3
def percentage_calculation(a,b,c,d):
    return pd.DataFrame(([[round(a/c*100),round(b/c*100),round(d/c*100)]]),index=['Percentage%'], columns= ['Class1','Class2','Class3'])
percentage_calculation(spclass[0],spclass[1],total_number_people,spclass[2])

Unnamed: 0,Class1,Class2,Class3
Percentage%,17,12,12


In [99]:
# Total number of survivors across the three classes
total_number_cpeople = spclass[0] + spclass[1] + spclass[2]
# Survivors of each class as a percentage of all survivors
# (reuses percentage_calculation from the previous cell)
percentage_calculation(spclass[0],spclass[1],total_number_cpeople,spclass[2])

Unnamed: 0,Class1,Class2,Class3
Percentage%,42,29,30


**Findings**

Class 1 has the highest survival rate even though there are more people in Class 3: 120 of 184 Class 1 passengers survived (65%), against 83 of 173 in Class 2 (48%) and 85 of 355 in Class 3 (24%). A plausible explanation is that wealthier passengers, who were categorized as the highest class, had a better chance of getting on a lifeboat.

Out of everyone who survived, Class 1 accounts for 42%.
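
Survivors, passengers and the within-class survival rate can also be produced in one step with `groupby`. The sketch below is one possible approach, assuming a made-up frame (`toy_df` and its values are illustrative); the output column names `survived`, `onboard` and `rate` are chosen here for readability.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df; the values are made up
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# One pass over the data: survivors, passengers and survival rate for every class.
# groupby sorts the class labels, so the rows come out in the order 1, 2, 3.
by_class = toy_df.groupby('Pclass')['Survived'].agg(survived='sum', onboard='count', rate='mean')
by_class['rate'] = (by_class['rate'] * 100).round()

print(by_class)
```

Because the result is indexed by class label, each figure stays attached to its class and the order never has to be fixed by hand.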

## **Survivor by Embarked**

In [101]:
train_df['Embarked'] = train_df['Embarked'].replace({"C": "Cherbourg", "Q": "Queenstown","S": "Southampton"})

In [109]:
# People distribution by embarked
# value_counts() sorts by frequency, which here gives Southampton, Cherbourg, Queenstown
embarked = train_df['Embarked'].value_counts().tolist()
pd.DataFrame([embarked],index=['Total Number of People'], columns=['Southampton','Cherbourg','Queenstown'])

Unnamed: 0,Southampton,Cherbourg,Queenstown
Total Number of People,554,130,28


In [None]:
# People who survived by Embarked
# Count survivors from the Embarked column, in the same port order as the totals above
port_order = train_df['Embarked'].value_counts().index
sembarked = (train_df.loc[train_df['Survived'] == 1, 'Embarked']
             .value_counts()
             .reindex(port_order, fill_value=0)
             .tolist())
pd.DataFrame([sembarked],index=['People who survived'], columns=port_order)


In [111]:
# Embarked percentage
total_number_people = embarked[0] + embarked[1] + embarked[2]
# Share of everyone onboard who embarked at each port
# Argument order: first port, second port, total, third port
def percentage_calculations(a,b,c,d):
    return pd.DataFrame(([[round(a/c*100),round(b/c*100),round(d/c*100)]]),index=['Percentage%'], columns= ['Southampton','Cherbourg','Queenstown'])
percentage_calculations(embarked[0],embarked[1],total_number_people,embarked[2])

Unnamed: 0,Southampton,Cherbourg,Queenstown
Percentage%,78,18,4


In [None]:
# Total number of people who survived based on embarked
total_number_cpeople = sembarked[0] + sembarked[1] + sembarked[2]
# Share of all survivors who embarked at each port
percentage_calculations(sembarked[0],sembarked[1],total_number_cpeople,sembarked[2])


**Findings**

Most passengers (78%) embarked at Southampton, so that port can be expected to contribute many of the survivors simply because of its size. To tell which port had the highest survival rate, divide the survivors from each port by the number of passengers who embarked there; the share of survivors alone does not answer that question.
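
The survival rate within each port can be read directly from a normalized cross-tabulation. A minimal sketch, assuming a made-up frame (`toy_df` and its values are illustrative and say nothing about the real ports):

```python
import pandas as pd

# Hypothetical toy data standing in for train_df; the values are made up
toy_df = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'S', 'C', 'C', 'Q', 'Q'],
    'Survived': [0, 0, 1, 0, 1, 0, 0, 1],
})

# Rows are ports, columns are Survived (0/1); normalize='index' turns each row into fractions
rate_by_port = pd.crosstab(toy_df['Embarked'], toy_df['Survived'], normalize='index')

# Column 1 holds the fraction of passengers from each port who survived
print((rate_by_port[1] * 100).round())
```

Each row sums to 1, so a port with few passengers is compared on equal terms with a port that has many.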

Additional Info for thought 

In [None]:
# Port of embarkation broken down by class
train_df.groupby('Pclass')['Embarked'].value_counts()

# Encoding data for model training 

Encoding the data for the 'Sex' and 'Embarked' columns

Converting the strings to integers for computation

In [219]:
# Change each column to the category type, then replace it with its integer codes
train_df['Sex'] = train_df['Sex'].astype('category').cat.codes
train_df['Embarked'] = train_df['Embarked'].astype('category').cat.codes
# get_dummies returns a new DataFrame; it is only displayed here, and train_df keeps the integer codes
pd.get_dummies(train_df, columns=["Embarked"]).head(1)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked_0,Embarked_1,Embarked_2
0,0,3,1,22.0,1,0,7.25,0,0,1


In [220]:
# Same encoding for the test set: category type first, then integer codes
test_df['Embarked'] = test_df['Embarked'].astype('category').cat.codes
test_df['Sex'] = test_df['Sex'].astype('category').cat.codes
# get_dummies returns a new DataFrame; it is only displayed here, and test_df keeps the integer codes
pd.get_dummies(test_df, columns=["Sex"]).head(1)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Embarked,Sex_0,Sex_1
0,3,34.5,0,0,7.8292,1,0,1
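
The two encodings used above behave differently: `cat.codes` replaces a column with integers, while `pd.get_dummies` returns a new frame and leaves its input untouched. The sketch below illustrates both on a made-up frame (`toy_df`, `code_to_label` and `encoded` are names assumed for this example).

```python
import pandas as pd

# Hypothetical toy data standing in for train_df; the values are made up
toy_df = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Fare': [7.25, 71.28, 7.75],
})

# Integer codes follow the sorted order of the categories
sex_categories = toy_df['Sex'].astype('category')
code_to_label = dict(enumerate(sex_categories.cat.categories))
print(code_to_label)  # {0: 'female', 1: 'male'}

# get_dummies does not modify its input, so the result has to be assigned to be kept
encoded = pd.get_dummies(toy_df, columns=['Embarked'], dtype=int)
print(encoded.columns.tolist())
```

Keeping `code_to_label` around makes it possible to translate the integer codes back into readable labels later.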


# Feature scaling

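Standardizing the features. 'Age' and 'Fare' have much larger ranges than the other columns; the code below scales every feature column to zero mean and unit variance.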

In [137]:
train_df.head(1)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,1,22.0,1,0,7.25,2


In [221]:
X_test = test_df.iloc[:,:]
X_train = train_df.iloc[:,1:]
# Select the target as a 1-D Series, which is the shape scikit-learn expects for y
y_train = train_df.iloc[:,0]


In [222]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
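
What `fit_transform` and `transform` do can be written out by hand. The following is a sketch with NumPy only, assuming a few made-up rows in place of the real 'Age' and 'Fare' columns; it mirrors the per-column formula z = (x - mean) / std, with the statistics learned from the training rows.

```python
import numpy as np

# Hypothetical toy values standing in for the Age and Fare columns
train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
test = np.array([[34.5, 7.83], [47.0, 7.00]])

# Statistics are learned from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# z = (x - mean) / std, applied with the same training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6))  # approximately 0 for each column
print(train_scaled.std(axis=0).round(6))   # approximately 1 for each column
```

Reusing the training mean and standard deviation for the test rows is why the cell above calls `fit_transform` on `X_train` but only `transform` on `X_test`.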

# Model Training 
1. Logistic Regression
2. K-Nearest Neighbors
3. Support Vector Machines
4. Naive Bayes classifier
5. Decision Tree
6. Random Forest


**Logistic Regression** 

In [248]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Accuracy on the training set (the same data the model was fit on)
lr_accuracy = round(classifier.score(X_train, y_train) * 100, 2)
lr_accuracy

80.2

**K-Nearest Neighbour**

In [247]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2) #choosing the metric to select the distance. P = 2 means the distance selected is the euclidean distance
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Accuracy on the training set (the same data the model was fit on)
knn_accuracy = round(classifier.score(X_train, y_train) * 100, 2)
knn_accuracy

86.52

**Support Vector Machine**

In [246]:
# Fitting SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0) #choose the kernel
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Accuracy on the training set (the same data the model was fit on)
svm_accuracy = round(classifier.score(X_train, y_train) * 100, 2)
svm_accuracy

77.95

**Naive Bayes**

In [245]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Accuracy on the training set (the same data the model was fit on)
nb_accuracy = round(classifier.score(X_train, y_train) * 100, 2)
nb_accuracy

79.49

**Decision Tree**

In [244]:
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)


# Accuracy on the training set (the same data the model was fit on)
dt_accuracy = round(classifier.score(X_train, y_train) * 100, 2)
dt_accuracy

98.6

**Random Forest**

In [243]:
# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Accuracy on the training set (the same data the model was fit on)
rf_accuracy = round(classifier.score(X_train, y_train) * 100, 2)
rf_accuracy

97.33

# **Model Evaluation**

In [250]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'K-Nearest Neighbour', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Decision Tree'],
    'Score': [svm_accuracy, knn_accuracy, lr_accuracy, 
              rf_accuracy, nb_accuracy, dt_accuracy]})
models.sort_values(by='Score', ascending=False)

Unnamed: 0,Model,Score
5,Decision Tree,98.6
3,Random Forest,97.33
1,K-Nearest Neighbour,86.52
2,Logistic Regression,80.2
4,Naive Bayes,79.49
0,Support Vector Machines,77.95
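
The scores in this table come from `classifier.score(X_train, y_train)`, so they measure fit to the training data. One common way to estimate performance on unseen passengers is k-fold cross-validation. A minimal sketch, assuming synthetic data from `make_classification` in place of the real features (the data, `tree`, and the choice of 5 folds are assumptions for illustration, not results for this dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled Titanic features; not the real data
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           flip_y=0.2, random_state=0)

tree = DecisionTreeClassifier(criterion='entropy', random_state=0)

# Training accuracy: the tree is scored on the rows it was fit on
train_accuracy = tree.fit(X, y).score(X, y)

# 5-fold cross-validation: each fold is scored by a tree that never saw it
cv_accuracy = cross_val_score(tree, X, y, cv=5).mean()

print('training accuracy: %.2f' % train_accuracy)
print('cross-validated accuracy: %.2f' % cv_accuracy)
```

An unpruned decision tree can memorize its training rows, so a large gap between the two numbers is a sign of overfitting rather than of a better model. Swapping `tree` for any of the other five classifiers gives a like-for-like comparison.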


# **Conclusion**

Based on the Titanic dataset we were able to extract and explore the data and gain insight. From the data we know that females have a higher survival rate even though there are more males onboard. With the limited information available we cannot explain the cause, but the result is consistent with the saying "Men die first". When studying age, we also need to consider **class**. As unfair as it sounds, people who were more financially capable had a higher chance of seeing tomorrow's daylight. The speculation is that help was offered to them first, before anyone of a lower class. Older people may also tend to be more financially capable, although this is not always true, so any link between age and survival may partly reflect class. Finally, Southampton is where most passengers embarked, but the embarkation tables alone do not show that boarding there improved the chance of survival.

The Decision Tree has the highest score (98.6%), but every score in the table is accuracy on the training data, so it shows how well each model fits the passengers it has already seen rather than how well it predicts new ones. A held-out validation set or cross-validation is needed before naming the best model for prediction.
The models are prepared to predict the outcome of an event with a setting similar to the Titanic; let's hope nothing that drastic repeats itself.
