### **Background**

The sinking of the Titanic is one of the most famous incidents in history, and it has been the subject of many data analyses. Here I want to build **machine learning models** to find out which model is best suited to **predicting which passengers survive and which do not**.

In [12]:
# Import libraries

import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

### **Import Dataset**

In [3]:
# Load the Titanic dataset bundled with seaborn

titanic = sns.load_dataset('titanic')
titanic.head()

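|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |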


In [4]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


### **Data Cleaning**

Check Data Anomalies

In [5]:
titanic.isna().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

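There are missing values in the age, deck, embarked, and embark_town columns.

The deck column is missing in 688 of 891 rows, so it is dropped together with the other columns that are not used. The age column is kept as a feature; the 177 rows where it is missing are dropped instead.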
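When deciding whether to drop a column or only its incomplete rows, the share of missing values is often easier to read than the raw counts. The sketch below shows one way to compute it; the `toy` frame and the name `missing_share` are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "sex": ["male", "female", "female", "male"],
})

# isna() marks missing cells as True; mean() turns that into a fraction per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

With the counts reported by `titanic.isna().sum()`, deck is missing in about 77% of rows (688 of 891) and age in about 20% (177 of 891).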

**DROP COLUMN**

In [6]:
titanic.drop(['fare', 'embarked', 'deck', 'pclass', 'who', 'embark_town', 'alive', 'adult_male', 'sibsp', 'parch'], axis = 1, inplace = True)

Drop the columns that are not used.

In [7]:
titanic.dropna(inplace = True)

Drop the rows that still contain missing values.

In [8]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
Index: 714 entries, 0 to 890
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   survived  714 non-null    int64   
 1   sex       714 non-null    object  
 2   age       714 non-null    float64 
 3   class     714 non-null    category
 4   alone     714 non-null    bool    
dtypes: bool(1), category(1), float64(1), int64(1), object(1)
memory usage: 23.8+ KB


### **Create Dummy Variable**

In [9]:
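# Create dummy variables: one-hot encode the categorical columns,
# dropping the first level of each so the remaining columns are not redundant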

titanic_dummies = pd.get_dummies(titanic, columns = ['sex', 'class', 'alone'], dtype = int, drop_first = True)
titanic_dummies

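|   | survived | age | sex_male | class_Second | class_Third | alone_True |
|---|---|---|---|---|---|---|
| 0 | 0 | 22.0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 38.0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 26.0 | 0 | 0 | 1 | 1 |
| 3 | 1 | 35.0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 35.0 | 1 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 885 | 0 | 39.0 | 0 | 0 | 1 | 0 |
| 886 | 0 | 27.0 | 1 | 1 | 0 | 1 |
| 887 | 1 | 19.0 | 0 | 0 | 0 | 1 |
| 889 | 1 | 26.0 | 1 | 0 | 0 | 1 |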
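The effect of `drop_first = True` is easiest to see on a tiny example. The sketch below uses a hypothetical one-column frame (`toy`, not the Titanic data) and compares the encoding with and without the option.

```python
import pandas as pd

# Hypothetical frame with a single three-level categorical column.
toy = pd.DataFrame({"class": ["First", "Second", "Third", "First"]})

# Without drop_first: one 0/1 column per level.
full = pd.get_dummies(toy, columns=["class"], dtype=int)

# With drop_first: the first level ("First") is dropped and becomes the baseline,
# encoded implicitly as all remaining columns being 0.
reduced = pd.get_dummies(toy, columns=["class"], dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

This is why the table above has `class_Second` and `class_Third` but no `class_First`: a first-class passenger is the row where both columns are 0.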


### **Train Test Split**

In [10]:
# Train Test Split

# Features: every column except the target
x = titanic_dummies.drop('survived', axis = 1)
# Target: 1 = survived, 0 = did not survive
y = titanic_dummies['survived']

xtrain, xtest, ytrain, ytest = train_test_split(
    x, 
    y,
    test_size = 0.2,
    random_state = 2023,
    stratify = y
)
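The `stratify = y` argument keeps the class proportions the same in the training and test sets. A minimal sketch of that behaviour on synthetic labels follows; the arrays `X_demo` and `y_demo` are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 60 negatives and 40 positives (not the Titanic data).
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# The share of positives is preserved in both parts of the split.
print(y_demo.mean(), y_tr.mean(), y_te.mean())
```

In this notebook the same mechanism produces the test set of 143 rows with 85 non-survivors and 58 survivors that appears in the `support` column of the reports below.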

### **Machine Learning Modeling**

**Logistic Regression**

In [13]:
# Build Model

logreg = LogisticRegression()
logreg.fit(xtrain, ytrain)
# Predict

pred = logreg.predict(xtest)
print(classification_report(ytest, pred))
print(accuracy_score(ytest, pred))
print(confusion_matrix(ytest, pred))

              precision    recall  f1-score   support

           0       0.85      0.85      0.85        85
           1       0.78      0.78      0.78        58

    accuracy                           0.82       143
   macro avg       0.81      0.81      0.81       143
weighted avg       0.82      0.82      0.82       143

0.8181818181818182
[[72 13]
 [13 45]]
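The numbers in the classification report can be recomputed by hand from the confusion matrix, which helps when reading the later reports. The sketch below does this for class 1 (survived) using the Logistic Regression matrix above; in scikit-learn's layout, rows are true labels and columns are predicted labels.

```python
# Confusion matrix from the Logistic Regression output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1).
tn, fp = 72, 13
fn, tp = 13, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)  # of the predicted survivors, how many survived
recall_1 = tp / (tp + fn)     # of the actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
# prints: 0.82 0.78 0.78 0.78
```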


**Decision Tree Classifier**

In [14]:
dt = DecisionTreeClassifier()
dt.fit(xtrain, ytrain)
pred_tree = dt.predict(xtest)
print(classification_report(ytest, pred_tree))
print(accuracy_score(ytest, pred_tree))
print(confusion_matrix(ytest, pred_tree))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87        85
           1       0.80      0.83      0.81        58

    accuracy                           0.85       143
   macro avg       0.84      0.84      0.84       143
weighted avg       0.85      0.85      0.85       143

0.8461538461538461
[[73 12]
 [10 48]]


**K-Nearest Neighbors Classifier**

In [15]:
knn = KNeighborsClassifier()
knn.fit(xtrain, ytrain)
pred_knn = knn.predict(xtest)
print(classification_report(ytest, pred_knn))
print(accuracy_score(ytest, pred_knn))
print(confusion_matrix(ytest, pred_knn))

              precision    recall  f1-score   support

           0       0.83      0.88      0.86        85
           1       0.81      0.74      0.77        58

    accuracy                           0.83       143
   macro avg       0.82      0.81      0.82       143
weighted avg       0.82      0.83      0.82       143

0.8251748251748252
[[75 10]
 [15 43]]


**Support Vector Classifier**

In [16]:
svc = SVC()
svc.fit(xtrain, ytrain)
pred_svc = svc.predict(xtest)
print(classification_report(ytest, pred_svc))
print(accuracy_score(ytest, pred_svc))
print(confusion_matrix(ytest, pred_svc))

              precision    recall  f1-score   support

           0       0.61      0.94      0.74        85
           1       0.55      0.10      0.17        58

    accuracy                           0.60       143
   macro avg       0.58      0.52      0.46       143
weighted avg       0.58      0.60      0.51       143

0.6013986013986014
[[80  5]
 [52  6]]
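One possible reason for the low recall on class 1 is feature scale. `SVC()` uses an RBF kernel by default, which is distance-based, and `age` spans a much wider range than the 0/1 dummy columns. This is an assumption and was not tested in this notebook. A common remedy is to standardize the features inside a pipeline, sketched here on synthetic data; `X_demo`, `y_demo`, and `scaled_svc` are illustrative names.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the features: one wide-range column (like age)
# and one 0/1 column (like a dummy variable). Not the Titanic data.
wide = rng.uniform(0, 80, size=200)
flag = rng.integers(0, 2, size=200)
X_demo = np.column_stack([wide, flag])
y_demo = flag  # here the label depends only on the 0/1 column

# The scaler is fitted on the training rows only, then applied before the SVC.
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_demo[:150], y_demo[:150])

demo_score = scaled_svc.score(X_demo[150:], y_demo[150:])
print(demo_score)
```

The same pipeline could be fitted with `xtrain` and `ytrain` in place of the synthetic arrays. Whether it improves the SVC result on this data would need to be checked by running it.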


In [17]:
print(f"Accuracy score for LogReg = {accuracy_score(ytest, pred)}")
print(f"Accuracy score for DT = {accuracy_score(ytest, pred_tree)}")
print(f"Accuracy score for KNN = {accuracy_score(ytest, pred_knn)}")
print(f"Accuracy score for SVC = {accuracy_score(ytest, pred_svc)}")
print(f"Confusion matrix for LogReg =\n{confusion_matrix(ytest, pred)}")
print(f"Confusion matrix for DT =\n{confusion_matrix(ytest, pred_tree)}")
print(f"Confusion matrix for KNN =\n{confusion_matrix(ytest, pred_knn)}")
print(f"Confusion matrix for SVC =\n{confusion_matrix(ytest, pred_svc)}")

Accuracy score for LogReg = 0.8181818181818182
Accuracy score for DT = 0.8461538461538461
Accuracy score for KNN = 0.8251748251748252
Accuracy score for SVC = 0.6013986013986014
Confusion matrix for LogReg =
[[72 13]
 [13 45]]
Confusion matrix for DT =
[[73 12]
 [10 48]]
Confusion matrix for KNN =
[[75 10]
 [15 43]]
Confusion matrix for SVC =
[[80  5]
 [52  6]]


### **Conclusion**


Results on the test set for the four models:

**The Decision Tree model has the highest accuracy at 85% (0.846), with a macro-average precision of 84%.**
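This ranking comes from a single test split of 143 rows, and `DecisionTreeClassifier()` was created without a `random_state`, so the figures may shift slightly between runs. A more stable comparison could use k-fold cross-validation. The sketch below shows the idea on synthetic data from `make_classification`; the `models` and `cv_scores` dictionaries are illustrative and the scores say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the Titanic data).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
}

# 5-fold cross-validation: each model gets five accuracy scores, one per fold.
cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    print(name, cv_scores[name].mean().round(3))
```

Comparing the mean and spread of the fold scores gives a better sense of whether a gap such as 0.846 versus 0.825 is meaningful or within run-to-run noise.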