# **Background**

The sinking of the Titanic is one of the most famous maritime incidents, and its passenger data has been analyzed many times. In this notebook I build **machine learning models** and compare them to find which one is best suited to **predicting which passengers survived and which did not**.

## **Import Library**

In [30]:
# Import Library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

import warnings
warnings.filterwarnings("ignore")

## **Import Dataset**

In [3]:
titanic = pd.read_csv('/Purwadhika/Modul 3 Machine Learning/Dataset/titanic/train.csv')
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


## **Data Cleaning**

In [4]:
# Check Missing Value
 
titanic.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

The **Age**, **Cabin**, and **Embarked** columns contain missing values. **Cabin** and **Embarked** are dropped as columns later on, and the rows with a missing **Age** are removed.
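
As a minimal sketch of how missing values can be measured and then removed, the example below drops a mostly empty column and afterwards drops the rows that still contain `NaN`. The frame `toy` and its values are made up for illustration and are not rows from `train.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (an assumption, not taken from train.csv)
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex":   ["male", "female", "female", "male"],
})

# Share of missing values per column (isna().sum() would give counts instead)
missing_share = toy.isna().mean()
print(missing_share)

# Drop the mostly empty column first, then drop rows that still contain NaN
cleaned = toy.drop(columns=["Cabin"]).dropna()
print(cleaned.shape)
```

Dropping the sparse column before calling `dropna()` matters: calling `dropna()` first would discard every row that lacks a `Cabin` value.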

**Check Unique Value Count**

In [5]:
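# Collect the number of unique values per column
# (named unique_counts to avoid shadowing the built-in `list`)
unique_counts = []
for col in titanic.columns:
    unique_counts.append([col, titanic[col].nunique()])

pd.DataFrame(unique_counts, columns = ['Column', 'Unique Count'])

Unnamed: 0,Column,Unique Count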
0,PassengerId,891
1,Survived,2
2,Pclass,3
3,Name,891
4,Sex,2
5,Age,88
6,SibSp,7
7,Parch,7
8,Ticket,681
9,Fare,248
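
The same per-column count can probably be obtained without an explicit loop, since `DataFrame.nunique()` returns one count per column. A minimal sketch on a made-up frame (`toy` is an illustrative assumption):

```python
import pandas as pd

# Hypothetical toy data (an assumption, not taken from train.csv)
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
})

# One count per column; NaN values are excluded by default
counts = toy.nunique()
print(counts)
```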


**List the Unique Values**

In [6]:
# Collect the unique values themselves for each column
unique_values = []
for col in titanic.columns:
    unique_values.append([col, titanic[col].unique()])

pd.DataFrame(unique_values, columns = ['Column', 'Unique Values'])

Unnamed: 0,Column,Unique Values
0,PassengerId,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14..."
1,Survived,"[0, 1]"
2,Pclass,"[3, 1, 2]"
3,Name,"[Braund, Mr. Owen Harris, Cumings, Mrs. John B..."
4,Sex,"[male, female]"
5,Age,"[22.0, 38.0, 26.0, 35.0, nan, 54.0, 2.0, 27.0,..."
6,SibSp,"[1, 0, 3, 4, 2, 5, 8]"
7,Parch,"[0, 1, 2, 5, 3, 4, 6]"
8,Ticket,"[A/5 21171, PC 17599, STON/O2. 3101282, 113803..."
9,Fare,"[7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51...."


**Drop the columns that are not used**

In [7]:
titanic.drop(['Name', 'Ticket', 'Cabin', 'Fare', 'Embarked'], inplace = True, axis = 1)

**Drop the rows with remaining missing values**

In [8]:
titanic.dropna(inplace = True)

Inspect the updated data

In [9]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 890
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  714 non-null    int64  
 1   Survived     714 non-null    int64  
 2   Pclass       714 non-null    int64  
 3   Sex          714 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        714 non-null    int64  
 6   Parch        714 non-null    int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 44.6+ KB


In [10]:
# titanic['Age'].astype()

## **Create Dummy Variables**

In [11]:
titanic_dummies = pd.get_dummies(titanic, columns = ['Sex'], dtype = int, drop_first = True)
titanic_dummies

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Sex_male
0,1,0,3,22.0,1,0,1
1,2,1,1,38.0,1,0,0
2,3,1,3,26.0,0,0,0
3,4,1,1,35.0,1,0,0
4,5,0,3,35.0,0,0,1
...,...,...,...,...,...,...,...
885,886,0,3,39.0,0,5,0
886,887,0,2,27.0,0,0,1
887,888,1,1,19.0,0,0,0
889,890,1,1,26.0,0,0,1
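
To illustrate what `drop_first = True` does, here is a small sketch on made-up data (`toy` is an assumption for illustration). With two categories, one indicator column is enough, so the first category in sorted order is dropped.

```python
import pandas as pd

# Hypothetical toy data (an assumption, not taken from train.csv)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

# 'female' sorts first and is dropped, leaving a single Sex_male column
# where 1 means male and 0 means female
encoded = pd.get_dummies(toy, columns=["Sex"], dtype=int, drop_first=True)
print(encoded)
```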


## **Train Test Split**

In [12]:
# Train Test Split

xtrain, xtest, ytrain, ytest = train_test_split(
    titanic_dummies.drop('Survived', axis = 1),
    titanic_dummies['Survived'],
    test_size = 0.2,
    random_state = 2023,
    stratify = titanic_dummies['Survived']
)
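
The `stratify` argument keeps the class proportions the same in the training and test parts. A minimal sketch with made-up labels (the arrays `X` and `y` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 30 positives out of 100 samples
y = np.array([1] * 30 + [0] * 70)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2023, stratify=y
)

# Both parts keep the 30% positive rate of the full data
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive rate in a small test set can drift away from the rate in the full data, which makes accuracy figures harder to compare.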

**K-Nearest Neighbors Classifier**

In [13]:
# Try odd values of K from 3 to 19 and record the test accuracy for each
knn_scores = []
for k in range(3, 21, 2):
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(xtrain, ytrain)
    # Predict on the DataFrame itself so the feature names match those seen in fit
    pred_knn = knn.predict(xtest)
    knn_scores.append([k, round(accuracy_score(ytest, pred_knn), 3)])

In [14]:
pd.DataFrame(knn_scores, columns = ['K', 'Accuracy']).sort_values(by = 'Accuracy', ascending = False).head(3)

Unnamed: 0,K,Accuracy
6,15,0.608
7,17,0.601
8,19,0.566


**Among the values tried, the best K for `n_neighbors` is 15, with an accuracy of 0.608**

In [15]:
# Best KNN Model

knn = KNeighborsClassifier(n_neighbors = 15)
knn.fit(xtrain, ytrain)
pred_knn = knn.predict(xtest)
print(classification_report(ytest, pred_knn))
print(accuracy_score(ytest, pred_knn))

              precision    recall  f1-score   support

           0       0.62      0.89      0.73        85
           1       0.55      0.19      0.28        58

    accuracy                           0.61       143
   macro avg       0.58      0.54      0.51       143
weighted avg       0.59      0.61      0.55       143

0.6083916083916084
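
KNN is distance-based, so a feature with a large numeric range can dominate the distance. In the split above, `PassengerId` is still among the features, and that may be one reason for the low accuracy, although this is an assumption that was not verified on the Titanic data. The sketch below uses synthetic data (all names are illustrative) to show how standardizing features inside a pipeline can change the KNN result.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): one informative small-scale feature
# and one uninformative ID-like feature with a large range
rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
row_id = rng.permutation(n).astype(float)
X = np.column_stack([row_id, signal])
y = (signal > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# KNN on raw features: distances are dominated by row_id
raw_knn = KNeighborsClassifier(n_neighbors=15).fit(X_tr, y_tr)

# KNN on standardized features: the scaler is fit on the training part only
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=15)
).fit(X_tr, y_tr)

acc_raw = raw_knn.score(X_te, y_te)
acc_scaled = scaled_knn.score(X_te, y_te)
print(f"raw: {acc_raw:.3f}  scaled: {acc_scaled:.3f}")
```

Putting the scaler in a pipeline keeps the test data out of the scaling statistics. Dropping an identifier column entirely would be another option.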


**Decision Tree Classifier**

In [31]:
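md = np.arange(1, 21)        # candidate max_depth values
crit = ['gini', 'entropy']   # split criteria
ms = np.arange(2, 21)        # candidate min_samples_leaf values
acc_score = []               # accuracy per fitted tree
krit = []                    # (max_depth, criterion, min_samples_leaf) per fitted tree
fs = []                      # F1 score per fitted tree
for i in md:
    for j in crit:
        for k in ms:
            tree = DecisionTreeClassifier(criterion=j, max_depth=i, min_samples_leaf=k)
            tree.fit(xtrain, ytrain)
            pred = tree.predict(xtest)
            krit.append((i, j, k))
            acc_score.append(accuracy_score(ytest, pred))
            fs.append(f1_score(ytest, pred))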

In [38]:
pd.DataFrame({
    'Criterion' : krit,
    'Accuracy Score' : acc_score,
    'F1 Score' : fs
}).sort_values(by = 'Accuracy Score', ascending= False).head()

Unnamed: 0,Criterion,Accuracy Score,F1 Score
597,"(16, entropy, 10)",0.86014,0.824561
749,"(20, entropy, 10)",0.86014,0.824561
747,"(20, entropy, 8)",0.86014,0.824561
294,"(8, entropy, 11)",0.86014,0.824561
293,"(8, entropy, 10)",0.86014,0.824561


**Each tuple is (max_depth, criterion, min_samples_leaf). The best accuracy is obtained with the entropy criterion; several settings tie at 0.86, and the one used here is max_depth 16 with min_samples_leaf 10**

In [18]:
# Best DT Model

tree = DecisionTreeClassifier(criterion='entropy', max_depth=16, min_samples_leaf=10)
tree.fit(xtrain, ytrain)
pred_tree = tree.predict(xtest)
print(classification_report(ytest, pred_tree))
print(accuracy_score(ytest, pred_tree))

              precision    recall  f1-score   support

           0       0.87      0.89      0.88        85
           1       0.84      0.81      0.82        58

    accuracy                           0.86       143
   macro avg       0.86      0.85      0.85       143
weighted avg       0.86      0.86      0.86       143

0.8601398601398601
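
The nested loops above score every parameter combination on the test set. A common alternative, which this notebook does not use, is a cross-validated grid search on the training part only, so the test set is kept for a final check. The sketch below uses synthetic stand-in data and a smaller grid; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; in the notebook this would be xtrain/ytrain)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [2, 5, 10],
}

# 5-fold cross-validation on the training part; the test part stays untouched
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))  # accuracy of the refitted best model on the test part
```

Fixing `random_state` on the tree also makes tied splits reproducible between runs.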


**Support Vector Machine**

In [22]:
lists = []
kernels = ['linear', 'poly', 'rbf', 'sigmoid']
for k in kernels :
    svc = SVC(kernel=k)
    svc.fit(xtrain, ytrain)
    pred_svc = svc.predict(xtest)
    lists.append([k, round(accuracy_score(ytest, pred_svc), 3)])

In [24]:
pd.DataFrame(lists, columns = ['Kernel', 'Accuracy']).sort_values(by = 'Accuracy', ascending = False).head()

Unnamed: 0,Kernel,Accuracy
0,linear,0.825
1,poly,0.594
2,rbf,0.594
3,sigmoid,0.517


**The kernel with the highest accuracy is linear, with an accuracy of 0.825**

In [36]:
# Best SVC Kernel Model

svc = SVC(kernel = 'linear')
svc.fit(xtrain, ytrain)
pred_svc = svc.predict(xtest)
print(classification_report(ytest, pred_svc))
print(accuracy_score(ytest, pred_svc))

              precision    recall  f1-score   support

           0       0.85      0.86      0.85        85
           1       0.79      0.78      0.78        58

    accuracy                           0.83       143
   macro avg       0.82      0.82      0.82       143
weighted avg       0.82      0.83      0.82       143

0.8251748251748252


In [27]:
print(f"Accuracy score of DT = {accuracy_score(ytest, pred_tree)}")
print(f"Accuracy score of KNN = {accuracy_score(ytest, pred_knn)}")
print(f"Accuracy score of SVC = {accuracy_score(ytest, pred_svc)}")

Accuracy score of DT = 0.8601398601398601
Accuracy score of KNN = 0.6083916083916084
Accuracy score of SVC = 0.8251748251748252
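
For a side-by-side view, the three scores can also be collected into one ranked table. This is a minimal sketch; the accuracy values are copied from the printed results above, and the model labels are illustrative.

```python
import pandas as pd

# Accuracy values copied from the printed results of this notebook
results = pd.DataFrame({
    "Model": ["Decision Tree", "KNN", "SVC"],
    "Accuracy": [0.8601398601398601, 0.6083916083916084, 0.8251748251748252],
})

# Rank the models from highest to lowest accuracy
ranked = results.sort_values(by="Accuracy", ascending=False).reset_index(drop=True)
print(ranked)
```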


# **Conclusion**
Based on the test results of the models used, the model with the highest accuracy in predicting whether a Titanic passenger survived is the **Decision Tree Classifier, with an accuracy of 0.86**.