<a href="https://colab.research.google.com/github/aka-gera/Data_Classification/blob/main/Titanic_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [71]:
from google.colab import drive
drive.mount('/content/drive')

!pwd
%cd /content/drive/MyDrive/ML2023/data-analysis
# !ls

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/ML2023/data-analysis
/content/drive/MyDrive/ML2023/data-analysis


# **CLASSIFIER**

*This notebook searches a set of classification models and reports which one performs best on a given dataset.*

# Import the helper classes

In [72]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import math
from plotly.subplots import make_subplots

from aka_data_analysis.aka_plot import aka_plot, aka_correlation_analysis
from aka_data_analysis.aka_learning import aka_learn,aka_clean,aka_filter

# Instantiate the helpers (each name is rebound from the class to an instance)
aka_plot = aka_plot()
aka_corr_an = aka_correlation_analysis()
aka_clean = aka_clean()
aka_learn = aka_learn()
aka_filter = aka_filter()

In [73]:
# import warnings
# from sklearn.exceptions import FitFailedWarning
# # Filter out the FitFailedWarning
# warnings.filterwarnings("ignore", category=FitFailedWarning)
# warnings.filterwarnings("ignore", category=UserWarning)

In [74]:

import matplotlib.pyplot as plt

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import plotly.figure_factory as ff
import plotly.express as px
import numpy as np
import pandas as pd


# Dataset Information




The data is provided by: https://www.kaggle.com/datasets/sakshisatre/titanic-dataset

This dataset is widely used for predictive modeling and statistical analysis of the factors associated with survival during the Titanic disaster. Typical analyses relate variables such as socio-economic status, age, and gender to survival outcomes in order to understand how demographic and situational factors influenced survival rates among passengers.


| Variable  | Description                                                                                               |
|-----------|-----------------------------------------------------------------------------------------------------------|
| Pclass    | Ticket class indicating the socio-economic status of the passenger. 1 = Upper, 2 = Middle, 3 = Lower.    |
| Survived  | A binary indicator showing whether the passenger survived (1) or not (0) during the Titanic disaster.    |
| Name      | The full name of the passenger, including title (e.g., Mr., Mrs., etc.).                                 |
| Sex       | The gender of the passenger, denoted as either male or female.                                            |
| Age       | The age of the passenger in years.                                                                        |
| SibSp     | The number of siblings or spouses aboard the Titanic for the respective passenger.                        |
| Parch     | The number of parents or children aboard the Titanic for the respective passenger.                        |
| Ticket    | The ticket number assigned to the passenger.                                                              |
| Fare      | The fare paid by the passenger for the ticket.                                                            |
| Cabin     | The cabin number assigned to the passenger, if available.                                                  |
| Embarked  | The port of embarkation for the passenger. It can take one of three values: C = Cherbourg, Q = Queenstown, S = Southampton. |
| Boat      | If the passenger survived, this column contains the identifier of the lifeboat they were rescued in.      |
| Body      | If the passenger did not survive, this column contains the identification number of their recovered body, if applicable. |
| Home.dest | The destination or place of residence of the passenger.                                                   |


# Import Dataset

In [75]:
df = aka_clean.df_get('Titanic_Dataset/Titanic Dataset.csv')

# Clean Data

In [76]:
df.head(5)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


## Move the target to the last column

In [77]:
df = aka_clean.swap_features(df,1)
df.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,survived
0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO",1
1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON",1
2,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",0
3,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON",0
4,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",0
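The output above shows `survived` moved to the last position while the other columns keep their order. `aka_clean.swap_features` comes from the author's package and its implementation is not shown here, so the following is only a plain-pandas sketch of the same reordering. The function name `move_target_last` and the toy frame are hypothetical.

```python
import pandas as pd


def move_target_last(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Return df with the `target` column moved to the last position.

    Hypothetical helper for illustration; not part of aka_data_analysis.
    """
    cols = [c for c in df.columns if c != target] + [target]
    return df[cols]


# Toy frame standing in for the Titanic data
toy = pd.DataFrame(
    {"pclass": [1, 3], "survived": [1, 0], "sex": ["female", "male"]}
)
reordered = move_target_last(toy, "survived")
print(list(reordered.columns))  # ['pclass', 'sex', 'survived']
```

Keeping the target in the last column lets later steps split features and target by position (`df.iloc[:, :-1]` and `df.iloc[:, -1]`).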


## Drop feature(s)


In [78]:
feat = []
df = aka_clean.drop_feature(df,feat)
df.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,survived
0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO",1
1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON",1
2,1,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",0
3,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON",0
4,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",0


In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   name       1309 non-null   object 
 2   sex        1309 non-null   object 
 3   age        1046 non-null   float64
 4   sibsp      1309 non-null   int64  
 5   parch      1309 non-null   int64  
 6   ticket     1309 non-null   object 
 7   fare       1308 non-null   float64
 8   cabin      295 non-null    object 
 9   embarked   1307 non-null   object 
 10  boat       486 non-null    object 
 11  body       121 non-null    float64
 12  home.dest  745 non-null    object 
 13  survived   1309 non-null   int64  
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [80]:
df.describe()

Unnamed: 0,pclass,age,sibsp,parch,fare,body,survived
count,1309.0,1046.0,1309.0,1309.0,1308.0,121.0,1309.0
mean,2.294882,29.881138,0.498854,0.385027,33.295479,160.809917,0.381971
std,0.837836,14.413493,1.041658,0.86556,51.758668,97.696922,0.486055
min,1.0,0.17,0.0,0.0,0.0,1.0,0.0
25%,2.0,21.0,0.0,0.0,7.8958,72.0,0.0
50%,3.0,28.0,0.0,0.0,14.4542,155.0,0.0
75%,3.0,39.0,1.0,0.0,31.275,256.0,1.0
max,3.0,80.0,8.0,9.0,512.3292,328.0,1.0


##  Convert categorical variables into numerical representations

In [81]:
mapping,swapMapping = aka_clean.CleaningVar(df)
df = aka_clean.CleaningDF(df,mapping)
df.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,survived
0,1,0,0,29.0,0,0,0,211.3375,0.0,0.0,0.0,1.0,0.0,1
1,1,1,1,0.92,1,2,1,151.55,1.0,0.0,1.0,1.0,1.0,1
2,1,2,0,2.0,1,2,1,151.55,1.0,0.0,16.0,1.0,1.0,0
3,1,3,1,30.0,1,2,1,151.55,1.0,0.0,16.0,135.0,1.0,0
4,1,4,0,25.0,1,2,1,151.55,1.0,0.0,16.0,1.0,1.0,0
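`CleaningVar` and `CleaningDF` are helpers from the author's package. The sketch below shows one common way to get a similar integer encoding together with a reverse mapping, using `pandas.factorize`. It is an assumption about the general technique, not the package's actual implementation; in particular, it encodes missing values as `-1`, which may differ from what the helpers do. The function name `encode_categoricals` is hypothetical.

```python
import pandas as pd


def encode_categoricals(df: pd.DataFrame):
    """Encode non-numeric columns as integer codes.

    Returns the encoded copy, a value -> code mapping and a code -> value
    mapping per column. Missing values receive the code -1 (pandas default).
    """
    encoded = df.copy()
    mapping, inverse = {}, {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # numeric columns are left unchanged
        codes, uniques = pd.factorize(df[col])
        encoded[col] = codes
        mapping[col] = {value: code for code, value in enumerate(uniques)}
        inverse[col] = {code: value for code, value in enumerate(uniques)}
    return encoded, mapping, inverse


toy = pd.DataFrame(
    {"sex": ["female", "male", "female"], "age": [29.0, 0.92, 2.0]}
)
encoded, mapping, inverse = encode_categoricals(toy)
print(encoded["sex"].tolist())  # [0, 1, 0]
print(inverse["sex"][1])        # male
```

The reverse mapping plays the role that `swapMapping` plays later in this notebook: it turns predicted codes back into the original labels.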


## Clean Dataset

In [82]:

confidence_interval_limit = 3              # Limit m of the confidence interval [-m, m] used to eliminate outliers

correlation_percentage_threshold = 0.8     # Correlation threshold above which one of two correlated features is removed

df_filtered,corr_tmp = aka_learn.filter_drop_corr_df(df,confidence_interval_limit,correlation_percentage_threshold)

diff_shape = (df.shape[0]-df_filtered.shape[0],df.shape[1]-df_filtered.shape[1])
diff_shape

(195, 2)
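The tuple above means that 195 rows and 2 columns were removed. The internals of `filter_drop_corr_df` are not shown in this notebook, so the following is only a sketch of the general technique, under the assumption that outliers are identified by z-score and that one feature of each highly correlated pair is dropped. The function name and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd


def filter_outliers_and_correlated(df, z_limit=3.0, corr_threshold=0.8):
    """Drop outlier rows, then drop redundant (highly correlated) features.

    The last column is treated as the target and is never dropped.
    Rows containing missing values also fail the z-score test and are removed.
    """
    features = df.iloc[:, :-1]
    std = features.std(ddof=0).replace(0, 1.0)  # avoid division by zero
    z = (features - features.mean()) / std
    kept = df[(z.abs() <= z_limit).all(axis=1)]

    # Inspect only the upper triangle so each pair is considered once
    corr = kept.iloc[:, :-1].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return kept.drop(columns=to_drop), to_drop


rng = np.random.default_rng(0)
a = rng.normal(size=200)
a[0] = 50.0                                   # one obvious outlier
toy = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + rng.normal(scale=0.01, size=200),  # nearly a copy of a
        "c": rng.normal(size=200),
        "target": rng.integers(0, 2, size=200),
    }
)
filtered, dropped = filter_outliers_and_correlated(toy)
print(dropped)                                 # ['b']
print(toy.shape[0] - filtered.shape[0], toy.shape[1] - filtered.shape[1])
```

The last line prints the same kind of `(rows removed, columns removed)` pair as `diff_shape`.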

## Graph the features that are highly correlated


In [83]:
fig = aka_corr_an.Plot_Correlate_Features(df,list(corr_tmp),400,500,3)
if fig is not None:
    fig.show()

## Visualize the distribution of the filtered dataset

In [84]:
aka_plot.Plot_box_Features(df,df_filtered,400,500,3,range(df_filtered.shape[1]))

# Search for the most effective ML algorithm to learn the dataset

## Choose the search parameters

<center>


<font size="45">

**Machine learning algorithms available for the search**

| Key | ML name |
|-----|---------|
| LGC | Logistic Regression |
| DTC | Decision Tree Classifier |
| KNN | K-Nearest Neighbors |
| SVC | Support Vector Classification |
| GNB | Gaussian Naive Bayes |
| SGD | Stochastic Gradient Descent |
| ABC | AdaBoost Classifier |
| RFC | Random Forest Classifier |
| GBC | Gradient Boosting Classifier |

</font>

</center>

In [85]:

mls = ['DTC', 'RFC', 'KNN', 'ABC']            # Keys of the ML algorithms to include in the search (see the table above)


pre_proc = 'NM'                               # Choose between 'XY' to standardize both 'X' and 'Y',
                                              #                'X' to standardize only 'X',
                                              #                'Y' to standardize only 'Y',
                                              #                'none' to not standardize the dataset
disp_dash = 'all'                             # Choose between 'all' to display all ML reports,
                                              #                'sup' to display the most significant report

mach = 'adv'                                  # Choose between 'adv' to use advanced parameters in the ML model,
                                              #                'none' to use the default parameters

file_name = 'data'                            # Name of the output data file for the report

file_name_scre = 0.85                         # Minimum ML score for a result to be saved in the report

In [86]:
aka_learn.Search_ML_2(df_filtered,mls,mach,pre_proc,confidence_interval_limit,correlation_percentage_threshold,diff_shape,disp_dash,file_name,file_name_scre)

conf_inter  corr_per  size_removed  ML   score      MSE    simul_time(min)
___________________________________________________________________________
  3      0.8     (195, 2)     DTC     93.731     0.063      0.03 
  3      0.8     (195, 2)     RFC     92.537     0.075      0.18 
  3      0.8     (195, 2)     KNN     76.119     0.239      0.03 
  3      0.8     (195, 2)     ABC     93.731     0.063      0.50 
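For 0/1 labels the mean squared error equals the misclassification rate, which agrees with the table above (for DTC, 1 − 0.93731 ≈ 0.063). `Search_ML_2` belongs to the author's package and its internals, including what `mach = 'adv'` tunes, are not shown here. The sketch below only illustrates the general idea of such a search with plain scikit-learn, using synthetic data and default hyperparameters, both of which are assumptions.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for df_filtered
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same keys as in the table of algorithms
models = {
    "DTC": DecisionTreeClassifier(random_state=0),
    "RFC": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNN": KNeighborsClassifier(),
    "ABC": AdaBoostClassifier(random_state=0),
}

results = {}
for key, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    results[key] = {
        "score": 100 * accuracy_score(y_test, y_hat),  # accuracy in percent
        "MSE": mean_squared_error(y_test, y_hat),      # error rate for 0/1 labels
        "seconds": time.perf_counter() - start,
    }

best = max(results, key=lambda k: results[k]["score"])
for key, res in results.items():
    print(f"{key}  score={res['score']:.3f}  MSE={res['MSE']:.3f}")
print("best:", best)
```

A single train/test split is used here for brevity; cross-validation would give a more stable ranking of the models.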


In [87]:
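std_inter = [-3,3]     # Interval used to eliminate outliers
corr_per = 0.9         # Correlation threshold for removing features
ml = 'RFC'             # Key of the ML algorithm to train
pre_proc = 'none'      # Do not standardize the dataset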

clf,scre,MSE_,corr_tmp,df_,y_test,y_pred = aka_learn.ML(df,std_inter,corr_per,pre_proc,ml)



## Confusion Matrix

In [88]:
y_pred_ = aka_clean.swap_map(y_pred,swapMapping)
y_test_ = aka_clean.swap_map(y_test,swapMapping)
Label = [ str(un) for un in np.unique(pd.concat([y_pred_, y_test_]))]

In [89]:
shw = 1
fig2 =  aka_plot.plot_confusion_matrix(y_test_,y_pred_,Label,shw)
fig2.show()
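`aka_plot.plot_confusion_matrix` draws the figure; the underlying numbers can be obtained directly from scikit-learn. A minimal sketch with made-up labels (the arrays below are illustrative and are not results from this notebook):

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels: 0 = did not survive, 1 = survived
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
# [[3 1]
#  [1 3]]
print(tn, fp, fn, tp)  # 3 1 1 3
```

The diagonal counts correct predictions; the off-diagonal entries are false positives (top right) and false negatives (bottom left).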

## Classification Report

In [90]:
shw = 1
fig3 =  aka_plot.plot_classification_report(y_test_,y_pred_,Label,shw)
fig3.show()
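The values shown in such a report (precision, recall, F1 score per class) can also be computed as a dictionary with scikit-learn's `classification_report`. A minimal sketch with made-up labels (illustrative only, not results from this notebook):

```python
from sklearn.metrics import classification_report

# Illustrative labels: 0 = did not survive, 1 = survived
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 0, 1, 0]

report = classification_report(
    y_true, y_hat, target_names=["0", "1"], output_dict=True
)
print(report["1"]["precision"])  # 0.75
print(report["1"]["recall"])     # 0.75
print(report["accuracy"])        # 0.75
```

With `output_dict=True` the report can be turned into a DataFrame or passed to a plotting function instead of being printed as text.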