<a href="https://colab.research.google.com/github/ahrimhan/data_anonymization/blob/master/ML_titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Machine Learning Models Using Titanic Data**


---


In this project, I aim to gain hands-on experience de-identifying sensitive data with various anonymization techniques and to observe how anonymization affects the accuracy of machine learning models.

The anonymization process is explained and implemented [here](https://github.com/ahrimhan/anonymization/blob/master/anonymization_titanic.ipynb).

We built machine learning models using the following classification techniques.  
* Logistic Regression
* Support Vector Machines (SVM)
* Random Forest
* Decision Tree
* Stochastic Gradient Descent
* Gaussian Naive Bayes
* K-Nearest Neighbors (KNN)


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [0]:
pd.set_option('display.max_columns', None)

In [0]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

# machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [0]:
origin_df = pd.read_csv('./drive/My Drive/data_anonymization/data/anony_org_titanic.csv', sep='\t', encoding='utf-8')
origin_df.drop("Unnamed: 0", axis=1, inplace=True)

In [6]:
origin_df.head()

| | Survived | Pclass | Sex | Age | Fare | Embarked | Age_man_bin5 | Age_man_bin8 | Fare_bin3 | FamilySize | FamilySize_bin | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 7 | S | Young_Adult | 21-30 | Cheap | 2 | Small | Mr |
| 1 | 1 | 1 | female | 38.0 | 71 | C | Middel_Aged_Adult | 31-40 | Moderate | 2 | Small | Mrs |
| 2 | 1 | 3 | female | 26.0 | 7 | S | Young_Adult | 21-30 | Cheap | 1 | Alone | Miss |
| 3 | 1 | 1 | female | 35.0 | 53 | S | Middel_Aged_Adult | 31-40 | Moderate | 2 | Small | Mrs |
| 4 | 0 | 3 | male | 35.0 | 8 | S | Middel_Aged_Adult | 31-40 | Cheap | 1 | Alone | Mr |


In [7]:
origin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Survived        889 non-null    int64  
 1   Pclass          889 non-null    int64  
 2   Sex             889 non-null    object 
 3   Age             889 non-null    float64
 4   Fare            889 non-null    int64  
 5   Embarked        889 non-null    object 
 6   Age_man_bin5    889 non-null    object 
 7   Age_man_bin8    889 non-null    object 
 8   Fare_bin3       889 non-null    object 
 9   FamilySize      889 non-null    int64  
 10  FamilySize_bin  889 non-null    object 
 11  Title           889 non-null    object 
dtypes: float64(1), int64(4), object(7)
memory usage: 83.5+ KB


In [0]:
origin_df['Pclass'] = origin_df["Pclass"].astype("category").cat.as_ordered()

In [9]:
categorical_feature = origin_df.select_dtypes(include=['category', 'object']).columns
categorical_feature

Index(['Pclass', 'Sex', 'Embarked', 'Age_man_bin5', 'Age_man_bin8',
       'Fare_bin3', 'FamilySize_bin', 'Title'],
      dtype='object')

In [0]:
train_df = pd.read_csv('./drive/My Drive/data_anonymization/data/anony_encod_titanic.csv', sep='\t', encoding='utf-8')

In [11]:
#class imbalance check
train_df.groupby(['Survived'], as_index=False).size()

Survived
0    549
1    340
dtype: int64
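
The class counts above also give a useful reference point: a classifier that always predicts the majority class is already right for every row of that class. The short sketch below computes that baseline from the counts printed above; the variable names are illustrative and are not used elsewhere in the notebook.

```python
# Class counts copied from the groupby output above (0 = did not survive, 1 = survived).
class_counts = {0: 549, 1: 340}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class, in percent.
baseline_acc = round(class_counts[majority_class] / total * 100, 2)

print(f"Rows: {total}, majority class: {majority_class}, baseline accuracy: {baseline_acc}%")
```

Any model accuracy reported later is easier to interpret when compared against this baseline.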

In [0]:
train_df.drop("Unnamed: 0", axis=1, inplace=True)

In [13]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Survived        889 non-null    int64  
 1   Pclass          889 non-null    int64  
 2   Sex             889 non-null    int64  
 3   Age             889 non-null    float64
 4   Fare            889 non-null    int64  
 5   Embarked        889 non-null    int64  
 6   Age_man_bin5    889 non-null    int64  
 7   Age_man_bin8    889 non-null    int64  
 8   Fare_bin3       889 non-null    int64  
 9   FamilySize      889 non-null    int64  
 10  FamilySize_bin  889 non-null    int64  
 11  Title           889 non-null    int64  
dtypes: float64(1), int64(11)
memory usage: 83.5 KB


In [0]:
gen_df = pd.read_csv('./drive/My Drive/data_anonymization/data/anony_gen_titanic.csv', sep='\t', encoding='utf-8')

In [0]:
gen_df.drop("Unnamed: 0", axis=1, inplace=True)

In [16]:
gen_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Pclass          1000 non-null   int64 
 1   Sex             1000 non-null   object
 2   Embarked        1000 non-null   object
 3   Age_man_bin5    1000 non-null   object
 4   Age_man_bin8    1000 non-null   object
 5   Fare_bin3       1000 non-null   object
 6   FamilySize_bin  1000 non-null   object
 7   Title           1000 non-null   object
 8   Survived        1000 non-null   int64 
dtypes: int64(2), object(7)
memory usage: 70.4+ KB


## **Label Encoding**

* "anony_org_titanic.csv" : original data
* "anony_encod_titanic.csv" : encoded data
* "anony_gen_titanic.csv": synthetic data generated from the distribution of the original data (categorical features only, 1000 rows)

For a more detailed explanation, see anonymization_titanic.ipynb.

In [0]:
train_df[categorical_feature] = train_df[categorical_feature].astype("category")
gen_df[categorical_feature] = gen_df[categorical_feature].astype("category")

In [0]:
train_df['Survived'] = train_df["Survived"].astype("category")
gen_df['Survived'] = gen_df["Survived"].astype("category")

In [19]:
gen_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Pclass          1000 non-null   category
 1   Sex             1000 non-null   category
 2   Embarked        1000 non-null   category
 3   Age_man_bin5    1000 non-null   category
 4   Age_man_bin8    1000 non-null   category
 5   Fare_bin3       1000 non-null   category
 6   FamilySize_bin  1000 non-null   category
 7   Title           1000 non-null   category
 8   Survived        1000 non-null   category
dtypes: category(9)
memory usage: 10.3 KB


In [0]:
lab = LabelEncoder()

In [0]:
Y_gen = gen_df["Survived"]

In [0]:
gen_df.drop("Survived", axis=1, inplace=True)

In [0]:
X_gen_lab = gen_df.apply(lab.fit_transform)

In [0]:
ohe = OneHotEncoder()

In [0]:
X_gen_ohe = ohe.fit_transform(gen_df)
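
The cells above prepare two encodings of the same categorical columns. As a minimal sketch of how they differ, the example below encodes a small hypothetical frame (`toy_df`, not part of the project data) both ways. Label encoding maps each category to an integer, which implies an order between categories; one-hot encoding creates one binary column per category instead.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy frame; the values mimic the 'Sex' and 'Embarked' columns.
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Label encoding: one integer code per category, fitted column by column.
toy_lab = toy_df.apply(lambda col: LabelEncoder().fit_transform(col))

# One-hot encoding: one binary column per (feature, category) pair.
toy_encoder = OneHotEncoder()
toy_ohe = toy_encoder.fit_transform(toy_df).toarray()

print(toy_lab)
print(toy_encoder.categories_)
print(toy_ohe.shape)  # 2 categories for Sex + 3 for Embarked = 5 columns
```

The synthetic-data cell later in this notebook uses the label-encoded frame (`X_gen_lab`).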

## **Classification Machine Learning Models**

In [0]:
# classification models

# Logistic Regression
logreg = LogisticRegression(random_state=42, max_iter=1000)

# Support Vector Machines (SVM)
svc = SVC()

# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)

# Decision Tree
decision_tree = DecisionTreeClassifier()

# Stochastic Gradient Descent
sgd = SGDClassifier()

# Gaussian Naive Bayes
gaussian = GaussianNB()

# K-Nearest Neighbors (KNN)
knn = KNeighborsClassifier(n_neighbors = 3)

In [0]:
#"acc_x": measured using testing data with x model
#"acc_x_score": measured using training data with x model

# Logistic Regression
acc_log = [] #accuracy_score(Y_pred, Y_test)
acc_log_score = [] #logreg.score(X_train, Y_train)

# Support Vector Machines (SVM)
acc_svc = []
acc_svc_score = []

# Random Forest
acc_random_forest = []
acc_random_forest_score = []

# Decision Tree
acc_decision_tree = []
acc_decision_tree_score = []

# Stochastic Gradient Descent
acc_sgd = []
acc_sgd_score = []

# Gaussian Naive Bayes
acc_gaussian = []
acc_gaussian_score = []

# K-Nearest Neighbors (KNN)
acc_knn = [] 
acc_knn_score = []

In [0]:
def getLogReg(X_train, X_test, Y_train, Y_test):
  logreg.fit(X_train, Y_train)
  Y_pred = logreg.predict(X_test)
  # acc_log.append(round(logreg.score(X_test, Y_test) * 100, 2)) #same with accuracy_score(Y_pred, Y_test)
  acc_log.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_log_score.append(round(logreg.score(X_train, Y_train) * 100, 2))

In [0]:
def getSVC(X_train, X_test, Y_train, Y_test):
  svc.fit(X_train, Y_train)
  Y_pred = svc.predict(X_test)
  acc_svc.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_svc_score.append(round(svc.score(X_train, Y_train) * 100, 2))

In [0]:
def getRandomForest(X_train, X_test, Y_train, Y_test):
  random_forest.fit(X_train, Y_train)
  Y_pred = random_forest.predict(X_test)
  acc_random_forest.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_random_forest_score.append(round(random_forest.score(X_train, Y_train) * 100, 2))

In [0]:
def getDecisionTree(X_train, X_test, Y_train, Y_test):
  decision_tree.fit(X_train, Y_train)
  Y_pred = decision_tree.predict(X_test)
  acc_decision_tree.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_decision_tree_score.append(round(decision_tree.score(X_train, Y_train) * 100, 2))

In [0]:
def getStochasticGradientDescent(X_train, X_test, Y_train, Y_test):
  sgd.fit(X_train, Y_train)
  Y_pred = sgd.predict(X_test)
  acc_sgd.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_sgd_score.append(round(sgd.score(X_train, Y_train) * 100, 2))

In [0]:
def getGaussianNaiveBayes(X_train, X_test, Y_train, Y_test):
  gaussian.fit(X_train, Y_train)
  Y_pred = gaussian.predict(X_test)
  acc_gaussian.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_gaussian_score.append(round(gaussian.score(X_train, Y_train) * 100, 2))

In [0]:
def getKNN(X_train, X_test, Y_train, Y_test, scaler_bool=False):

  if scaler_bool:
    # Fit the scaler on the training split only, then apply it to both splits.
    scaler = MinMaxScaler()
    scaler.fit(X_train)

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
  
  knn.fit(X_train, Y_train)
  Y_pred = knn.predict(X_test)
  acc_knn.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_knn_score.append(round(knn.score(X_train, Y_train) * 100, 2))

In [0]:
def buildClfModels(X_train, X_test, Y_train, Y_test):
  # X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size)

  # print("Training data size:", X_train.shape[0])
  # print("Testing data size:", X_test.shape[0])

  #Logistic Regression
  getLogReg(X_train, X_test, Y_train, Y_test)

  # Support Vector Machines (SVM)
  getSVC(X_train, X_test, Y_train, Y_test)

  # Random Forest
  getRandomForest(X_train, X_test, Y_train, Y_test)

  # Decision Tree
  getDecisionTree(X_train, X_test, Y_train, Y_test)

  # Stochastic Gradient Descent
  getStochasticGradientDescent(X_train, X_test, Y_train, Y_test)

  # Gaussian Naive Bayes
  getGaussianNaiveBayes(X_train, X_test, Y_train, Y_test)

  # K-Nearest Neighbors (KNN), with min-max scaling
  getKNN(X_train, X_test, Y_train, Y_test, scaler_bool=True)
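
The seven `get...` helpers above follow the same fit, predict, and record pattern. As an alternative sketch (not the code that produced the results below), the same pattern can be written once as a loop over a dictionary of models. The function name `evaluate_models`, the stand-in data from `make_classification`, and the use of a `Pipeline` for the KNN scaling step are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, X_train, X_test, Y_train, Y_test):
    """Fit each model; return {name: (test accuracy %, train accuracy %)}."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, Y_train)
        test_acc = round(accuracy_score(Y_test, model.predict(X_test)) * 100, 2)
        train_acc = round(model.score(X_train, Y_train) * 100, 2)
        results[name] = (test_acc, train_acc)
    return results


# Stand-in data so the sketch runs on its own; this is not the Titanic data.
X_demo, Y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_demo, Y_demo, test_size=0.2, random_state=0)

demo_models = {
    "log": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    # The pipeline fits the scaler on the training split only, like getKNN does.
    "knn": make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=3)),
}

demo_results = evaluate_models(demo_models, X_tr, X_te, Y_tr, Y_te)
print(demo_results)
```

Adding or removing a classifier then only requires changing the dictionary.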

## **Data Set**

To vary the anonymization level, we transformed the numerical features ("Age", "Fare", and "FamilySize") into categorical features at fine-grained and coarse-grained levels.
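
The sketch below illustrates this kind of generalization with `pd.cut`, using the five ages from the `head()` output above. The bin edges and labels are hypothetical; the actual binning is defined in the anonymization notebook and may differ.

```python
import pandas as pd

# Ages of the first five passengers, as shown in the head() output above.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0], name="Age")

# Hypothetical bin edges and labels, chosen only to illustrate fine vs. coarse binning.
fine_edges = [0, 10, 20, 30, 40, 50, 60, 70, 120]
fine_labels = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71+"]
coarse_edges = [0, 12, 20, 30, 60, 120]
coarse_labels = ["Child", "Teen", "Young_Adult", "Adult", "Senior"]

# pd.cut uses right-closed intervals by default, e.g. (20, 30] -> "21-30".
age_fine = pd.cut(ages, bins=fine_edges, labels=fine_labels)
age_coarse = pd.cut(ages, bins=coarse_edges, labels=coarse_labels)

print(pd.DataFrame({"Age": ages, "fine (8 bins)": age_fine, "coarse (5 bins)": age_coarse}))
```

With fewer bins, each bucket covers a wider range of ages, so more of the original detail is hidden.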

In [0]:
age = ["Age", "Age_man_bin8", "Age_man_bin5"]
fare = ["Fare", "Fare_bin3"]
familysize = ["FamilySize", "FamilySize_bin"]

The following features stay the same in every data set; we call them the basic features: ['Pclass', 'Sex', 'Embarked', 'Title'].

In [0]:
X_basic = train_df.drop(["Survived"]+ age + fare + familysize, axis=1)

In [38]:
X_basic.columns.values

array(['Pclass', 'Sex', 'Embarked', 'Title'], dtype=object)

Each data set has 889 rows and 7 columns.
* 889 rows in total: training (711) + testing (178)
* 7 columns in total: ['Pclass', 'Sex', 'Embarked', 'Title'] + one variant each of [age, fare, familysize]

In [0]:
X = X_basic.copy()
Y = train_df["Survived"]

In [0]:
data_index_dict = {}
data_all_index_dict = {}

i = 0
for c in fare: 
  X[c] = train_df[c]

  for c1 in familysize:
    X[c1] = train_df[c1]

    for c2 in age:
      X[c2] = train_df[c2]
      # data index dictionary (variations of Age, Fare, and FamilySize)
      data_index_dict[i] = X.columns.drop(X_basic.columns).values
      data_all_index_dict[i] = X.columns.values
      i+=1
      X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)  #split train data 80% and test data 20%

      # print("Training data size:", X_train.shape[0])
      # print("Testing data size:", X_test.shape[0])

      buildClfModels(X_train, X_test, Y_train, Y_test)
      X = X.drop(c2, axis=1)
    
    X = X.drop(c1, axis=1)
  X = X.drop(c, axis=1)

**12 different data sets.**    
Each data set combines the basic features 'Pclass', 'Sex', 'Embarked', and 'Title' with one variant each of "Age", "Fare", and "FamilySize".

The numbers below are the data set indices (the keys of `data_index_dict`).
Groups listed first keep more of the original precision; groups listed later are more abstracted (more anonymized and categorized).

* All numeric features: 0
* Two numeric features and one anonymized feature: 1 < 2, 3, 6
* One numeric feature and two anonymized features: 4 < 5, 7 < 8, 9
* All anonymized features: 10 < 11

**Please note that data sets with the same number of numeric and anonymized features can still differ in their level of abstraction.**

For example, compare the features of the following two data sets.  
dataset 1: ['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize', **'Age_man_bin8'**] and  
dataset 2: ['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize', **'Age_man_bin5'**].  
All features are the same except 'Age_man_bin8' and 'Age_man_bin5'.
'Age_man_bin8' splits the age range into 8 bins, while 'Age_man_bin5' splits it into 5. In other words, **each 'Age_man_bin5' bucket covers a wider age range than an 'Age_man_bin8' bucket; therefore 'Age_man_bin5' is the more abstracted generalization and loses more information than 'Age_man_bin8'.**
We write this as **dataset 1 < dataset 2**, meaning that **dataset 2 is more abstracted and anonymized**.
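
The index assignment described above can be checked with a short sketch that enumerates the same combinations with `itertools.product`, in the same nesting order as the loop (fare outermost, then family size, then age). The names `numeric` and `variants` are introduced here for illustration only.

```python
from itertools import product

age = ["Age", "Age_man_bin8", "Age_man_bin5"]
fare = ["Fare", "Fare_bin3"]
familysize = ["FamilySize", "FamilySize_bin"]

# The untransformed (numeric) version of each feature.
numeric = {"Age", "Fare", "FamilySize"}

# Same nesting order as the loop above: fare (outer), familysize, age (inner).
variants = dict(enumerate(product(fare, familysize, age)))

for index, combo in variants.items():
    n_numeric = sum(col in numeric for col in combo)
    print(index, list(combo), "numeric features:", n_numeric)
```

The printed counts reproduce the grouping in the list above: index 0 has three numeric features, indices 1, 2, 3, and 6 have two, indices 4, 5, 7, 8, and 9 have one, and indices 10 and 11 have none.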

In [41]:
data_all_index_dict

{0: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize', 'Age'],
       dtype=object),
 1: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize',
        'Age_man_bin8'], dtype=object),
 2: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize',
        'Age_man_bin5'], dtype=object),
 3: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize_bin',
        'Age'], dtype=object),
 4: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize_bin',
        'Age_man_bin8'], dtype=object),
 5: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize_bin',
        'Age_man_bin5'], dtype=object),
 6: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare_bin3', 'FamilySize',
        'Age'], dtype=object),
 7: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare_bin3', 'FamilySize',
        'Age_man_bin8'], dtype=object),
 8: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare_bin3', 'FamilySize',
        'Age_man_bin5'], dtype=object),
 9

## **Results**

### Score

"Score" is measured on the training data.

**It shows how well the trained model fits the data it was trained on.**

In [42]:
result_models_score_dict = {
               "acc_log_score": acc_log_score,
               "acc_svc_score": acc_svc_score,
               "acc_random_forest_score": acc_random_forest_score,
               "acc_decision_tree_score": acc_decision_tree_score,
               "acc_sgd_score": acc_sgd_score,
               "acc_gaussian_score": acc_gaussian_score,
               "acc_knn_score": acc_knn_score
              }

result_models_score = pd.DataFrame(result_models_score_dict)
result_models_score

| | acc_log_score | acc_svc_score | acc_random_forest_score | acc_decision_tree_score | acc_sgd_score | acc_gaussian_score | acc_knn_score |
|---|---|---|---|---|---|---|---|
| 0 | 79.89 | 69.34 | 97.47 | 97.47 | 74.26 | 79.89 | 88.47 |
| 1 | 79.32 | 66.95 | 94.66 | 94.66 | 61.32 | 80.31 | 86.36 |
| 2 | 79.18 | 66.67 | 93.53 | 93.53 | 76.23 | 79.75 | 87.2 |
| 3 | 81.43 | 69.06 | 96.91 | 96.91 | 71.73 | 79.75 | 87.9 |
| 4 | 81.01 | 68.64 | 93.39 | 93.39 | 62.87 | 79.04 | 86.92 |
| 5 | 82.14 | 67.09 | 93.53 | 93.53 | 72.01 | 79.61 | 88.33 |
| 6 | 79.89 | 65.4 | 94.37 | 94.37 | 75.11 | 79.61 | 87.06 |
| 7 | 78.48 | 82.28 | 87.62 | 87.62 | 68.92 | 79.47 | 85.09 |
| 8 | 78.62 | 82.56 | 87.48 | 87.48 | 66.53 | 79.32 | 85.65 |
| 9 | 81.86 | 70.6 | 94.66 | 94.66 | 80.03 | 80.03 | 88.33 |


### Accuracy

"Accuracy" is measured on the testing data.

**It reflects how accurately the model predicts new data.**
Please note that the data set has 889 rows, split into 711 training rows and 178 testing rows (80% and 20%).

A **single train/test split is not stable**: the result depends on which rows happen to land in the test set, and a split can over-represent specific groups. The reported accuracy can therefore be misleading and can hide **overfitting**. To obtain a reliable "Accuracy", the **model should be cross-validated** (e.g., **k-fold cross validation**). In this project, however, the aim is to investigate the trade-off between anonymization (privacy) and prediction model accuracy (utility), so a single split is used.

I think "Accuracy" gives a more realistic picture than "Score".
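
As a minimal sketch of the k-fold cross validation mentioned above, the example below evaluates a logistic regression with 5 stratified folds using scikit-learn's `cross_val_score`. The data comes from `make_classification` as a stand-in with the same shape as one data set (889 rows, 7 features); it is not the Titanic data, and the fold count is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data with the same shape as one data set variant (889 rows, 7 features).
X_demo, Y_demo = make_classification(n_samples=889, n_features=7, random_state=42)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, Y_demo, cv=cv, scoring="accuracy"
)

print("Fold accuracies (%):", (cv_scores * 100).round(2))
print("Mean: %.2f, std: %.2f" % (cv_scores.mean() * 100, cv_scores.std() * 100))
```

Reporting the mean and standard deviation over folds shows how much a single split could vary.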

In [43]:
result_models_acc_dict = {
               "acc_log": acc_log,
               "acc_svc": acc_svc, 
               "acc_random_forest": acc_random_forest,
               "acc_decision_tree": acc_decision_tree,
               "acc_sgd": acc_sgd,
               "acc_gaussian": acc_gaussian,
               "acc_knn": acc_knn
              }

result_models = pd.DataFrame(result_models_acc_dict)
result_models

| | acc_log | acc_svc | acc_random_forest | acc_decision_tree | acc_sgd | acc_gaussian | acc_knn |
|---|---|---|---|---|---|---|---|
| 0 | 78.65 | 65.73 | 78.09 | 78.09 | 76.4 | 78.09 | 82.02 |
| 1 | 80.34 | 66.29 | 80.34 | 79.78 | 63.48 | 79.21 | 79.78 |
| 2 | 79.21 | 73.6 | 84.27 | 83.71 | 78.65 | 78.09 | 82.02 |
| 3 | 78.65 | 65.17 | 82.58 | 82.02 | 73.03 | 76.97 | 84.83 |
| 4 | 83.15 | 68.54 | 80.34 | 78.65 | 64.61 | 79.21 | 80.9 |
| 5 | 80.34 | 63.48 | 77.53 | 76.4 | 70.22 | 75.84 | 77.53 |
| 6 | 82.02 | 71.35 | 79.78 | 78.09 | 76.4 | 82.58 | 82.02 |
| 7 | 82.02 | 85.39 | 84.27 | 84.27 | 64.04 | 81.46 | 81.46 |
| 8 | 83.15 | 85.39 | 83.15 | 80.34 | 67.98 | 83.71 | 82.58 |
| 9 | 75.84 | 69.1 | 77.53 | 76.4 | 73.03 | 75.28 | 79.78 |


We summarize the best score and the best accuracy of each classification model.

In [0]:
best_models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machines', 'Random Forest',
              'Decision Tree', 'Stochastic Gradient Decent', 'Gaussian Naive Bayes', 
              'K-Nearest Neighbors'],
    'Score': result_models_score.max().to_list(),
    'Accuracy': result_models.max()
              })

Ordered by "Score" measured using training data.

In [45]:
best_models.sort_values(by='Score', ascending=False)

| | Model | Score | Accuracy |
|---|---|---|---|
| acc_random_forest | Random Forest | 97.47 | 84.27 |
| acc_decision_tree | Decision Tree | 97.47 | 84.27 |
| acc_knn | K-Nearest Neighbors | 88.47 | 84.83 |
| acc_svc | Support Vector Machines | 84.39 | 85.39 |
| acc_log | Logistic Regression | 82.14 | 83.15 |
| acc_gaussian | Gaussian Naive Bayes | 80.31 | 83.71 |
| acc_sgd | Stochastic Gradient Decent | 80.03 | 78.65 |


Ordered by "Accuracy" measured using testing data.

In [46]:
best_models.sort_values(by='Accuracy', ascending=False)

| | Model | Score | Accuracy |
|---|---|---|---|
| acc_svc | Support Vector Machines | 84.39 | 85.39 |
| acc_knn | K-Nearest Neighbors | 88.47 | 84.83 |
| acc_random_forest | Random Forest | 97.47 | 84.27 |
| acc_decision_tree | Decision Tree | 97.47 | 84.27 |
| acc_gaussian | Gaussian Naive Bayes | 80.31 | 83.71 |
| acc_log | Logistic Regression | 82.14 | 83.15 |
| acc_sgd | Stochastic Gradient Decent | 80.03 | 78.65 |
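
The difference between "Score" and "Accuracy" in the table above gives a rough indication of how much each model overfits. The sketch below computes that gap from the values copied out of the table; the `summary` frame and the `Gap` column are introduced here for illustration. Keep in mind that both columns are maxima over all data sets, so the two numbers in a row may come from different data sets.

```python
import pandas as pd

# Values copied from the best_models table above.
summary = pd.DataFrame({
    "Model": ["Random Forest", "Decision Tree", "K-Nearest Neighbors",
              "Support Vector Machines", "Logistic Regression",
              "Gaussian Naive Bayes", "Stochastic Gradient Descent"],
    "Score": [97.47, 97.47, 88.47, 84.39, 82.14, 80.31, 80.03],
    "Accuracy": [84.27, 84.27, 84.83, 85.39, 83.15, 83.71, 78.65],
})

# Positive gap: the model does better on training data than on testing data.
summary["Gap"] = (summary["Score"] - summary["Accuracy"]).round(2)

print(summary.sort_values(by="Gap", ascending=False))
```

The two tree-based models show by far the largest gap (about 13 points), which is consistent with them fitting the training data much more closely than the testing data.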


**We summarized the best models in each data set.**

In [0]:
best_data = pd.DataFrame({
    'Data': data_index_dict,
    'Score(Train Set)': result_models_score.max(axis=1),
    'Score_Model': result_models_score.idxmax(axis=1),
    'Accuracy(Test Set)': result_models.max(axis=1),
    'Accuracy_Model':result_models.idxmax(axis=1)
              })

Ordered by "Score" measured using training data.

In [48]:
best_data.sort_values(by='Score(Train Set)', ascending=False)

| | Data | Score(Train Set) | Score_Model | Accuracy(Test Set) | Accuracy_Model |
|---|---|---|---|---|---|
| 0 | [Fare, FamilySize, Age] | 97.47 | acc_random_forest_score | 82.02 | acc_knn |
| 3 | [Fare, FamilySize_bin, Age] | 96.91 | acc_random_forest_score | 84.83 | acc_knn |
| 1 | [Fare, FamilySize, Age_man_bin8] | 94.66 | acc_random_forest_score | 80.34 | acc_log |
| 9 | [Fare_bin3, FamilySize_bin, Age] | 94.66 | acc_random_forest_score | 79.78 | acc_knn |
| 6 | [Fare_bin3, FamilySize, Age] | 94.37 | acc_random_forest_score | 82.58 | acc_gaussian |
| 2 | [Fare, FamilySize, Age_man_bin5] | 93.53 | acc_random_forest_score | 84.27 | acc_random_forest |
| 5 | [Fare, FamilySize_bin, Age_man_bin5] | 93.53 | acc_random_forest_score | 80.34 | acc_log |
| 4 | [Fare, FamilySize_bin, Age_man_bin8] | 93.39 | acc_random_forest_score | 83.15 | acc_log |
| 7 | [Fare_bin3, FamilySize, Age_man_bin8] | 87.62 | acc_random_forest_score | 85.39 | acc_svc |
| 11 | [Fare_bin3, FamilySize_bin, Age_man_bin5] | 87.62 | acc_random_forest_score | 80.34 | acc_log |


Ordered by "Accuracy" measured using testing data.

In [49]:
best_data.sort_values(by='Accuracy(Test Set)', ascending=False)

| | Data | Score(Train Set) | Score_Model | Accuracy(Test Set) | Accuracy_Model |
|---|---|---|---|---|---|
| 7 | [Fare_bin3, FamilySize, Age_man_bin8] | 87.62 | acc_random_forest_score | 85.39 | acc_svc |
| 8 | [Fare_bin3, FamilySize, Age_man_bin5] | 87.48 | acc_random_forest_score | 85.39 | acc_svc |
| 3 | [Fare, FamilySize_bin, Age] | 96.91 | acc_random_forest_score | 84.83 | acc_knn |
| 2 | [Fare, FamilySize, Age_man_bin5] | 93.53 | acc_random_forest_score | 84.27 | acc_random_forest |
| 10 | [Fare_bin3, FamilySize_bin, Age_man_bin8] | 87.34 | acc_random_forest_score | 84.27 | acc_svc |
| 4 | [Fare, FamilySize_bin, Age_man_bin8] | 93.39 | acc_random_forest_score | 83.15 | acc_log |
| 6 | [Fare_bin3, FamilySize, Age] | 94.37 | acc_random_forest_score | 82.58 | acc_gaussian |
| 0 | [Fare, FamilySize, Age] | 97.47 | acc_random_forest_score | 82.02 | acc_knn |
| 1 | [Fare, FamilySize, Age_man_bin8] | 94.66 | acc_random_forest_score | 80.34 | acc_log |
| 5 | [Fare, FamilySize_bin, Age_man_bin5] | 93.53 | acc_random_forest_score | 80.34 | acc_log |


## Synthetic Data

**Two more data sets: synthetic data (1000 generated rows).**

* Index 12: ['Age_man_bin8', 'Fare_bin3', 'FamilySize_bin', 'Pclass', 'Sex', 'Embarked', 'Title']
* Index 13: ['Age_man_bin5', 'Fare_bin3', 'FamilySize_bin', 'Pclass', 'Sex', 'Embarked', 'Title']

In [50]:
# Synthetic data

dup_cols = ["Age_man_bin8", "Age_man_bin5"]

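for c in dup_cols:
  # Keep age column c and drop the other age variant, so the features match the label below.
  X_gen = X_gen_lab.drop([d for d in dup_cols if d != c], axis=1)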

  data_index_dict[i] = "Synthetic data: ['"+ c +"', 'Fare_bin3', 'FamilySize_bin']"
  data_all_index_dict[i] = [c, 'Fare_bin3', 'FamilySize_bin'] + ['Pclass', 'Sex', 'Embarked', 'Title'] 
  print(data_all_index_dict[i])
  i+=1

  X_train, X_test, Y_train, Y_test = train_test_split(X_gen, Y_gen, test_size=0.2)  #split train data 80% and test data 20%
  buildClfModels(X_train, X_test, Y_train, Y_test)

['Age_man_bin8', 'Fare_bin3', 'FamilySize_bin', 'Pclass', 'Sex', 'Embarked', 'Title']
['Age_man_bin5', 'Fare_bin3', 'FamilySize_bin', 'Pclass', 'Sex', 'Embarked', 'Title']


In [0]:
result_models_score = pd.DataFrame(result_models_score_dict)
result_models = pd.DataFrame(result_models_acc_dict)

In [0]:
best_models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machines', 'Random Forest',
              'Decision Tree', 'Stochastic Gradient Decent', 'Gaussian Naive Bayes', 
              'K-Nearest Neighbors'],
    'Score': result_models_score.max().to_list(),
    'Accuracy': result_models.max()
              })

best_data = pd.DataFrame({
    'Data': data_index_dict,
    'Score(Train Set)': result_models_score.max(axis=1),
    'Score_Model': result_models_score.idxmax(axis=1),
    'Accuracy(Test Set)': result_models.max(axis=1),
    'Accuracy_Model':result_models.idxmax(axis=1)
              })

In [57]:
best_data.iloc[12:]

| | Data | Score(Train Set) | Score_Model | Accuracy(Test Set) | Accuracy_Model |
|---|---|---|---|---|---|
| 12 | Synthetic data: ['Age_man_bin8', 'Fare_bin3', ... | 79.5 | acc_random_forest_score | 62.0 | acc_log |
| 13 | Synthetic data: ['Age_man_bin5', 'Fare_bin3', ... | 82.25 | acc_random_forest_score | 61.0 | acc_sgd |
