<a href="https://colab.research.google.com/github/ahrimhan/data_anonymization/blob/master/ML_titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Machine Learning Models Using Titanic Data**


---

In this project, I aim to gain hands-on experience in de-identifying sensitive data with various anonymization techniques and to observe how anonymization affects the accuracy of machine learning models.

The anonymization process is explained and implemented [here](https://github.com/ahrimhan/data_anonymization/blob/master/anonymization_titanic.ipynb).

We built machine learning models using classification techniques.  
* Logistic Regression
* Support Vector Machines (SVM)
* Random Forest
* Decision Tree
* Stochastic Gradient Descent
* Gaussian Naive Bayes
* K-Nearest Neighbors (KNN)


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [0]:
pd.set_option('display.max_columns', None)

In [0]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

# machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [0]:
origin_df = pd.read_csv('./drive/My Drive/data_anonymization/data/anony_org_titanic.csv', sep='\t', encoding='utf-8')
origin_df.drop("Unnamed: 0", axis=1, inplace=True)

In [6]:
origin_df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Age_man_bin5,Age_man_bin8,Fare_bin3,FamilySize,FamilySize_bin,Title
0,0,3,male,22.0,7,S,Young_Adult,21-30,Cheap,2,Small,Mr
1,1,1,female,38.0,71,C,Middel_Aged_Adult,31-40,Moderate,2,Small,Mrs
2,1,3,female,26.0,7,S,Young_Adult,21-30,Cheap,1,Alone,Miss
3,1,1,female,35.0,53,S,Middel_Aged_Adult,31-40,Moderate,2,Small,Mrs
4,0,3,male,35.0,8,S,Middel_Aged_Adult,31-40,Cheap,1,Alone,Mr


In [7]:
origin_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Survived        889 non-null    int64  
 1   Pclass          889 non-null    int64  
 2   Sex             889 non-null    object 
 3   Age             889 non-null    float64
 4   Fare            889 non-null    int64  
 5   Embarked        889 non-null    object 
 6   Age_man_bin5    889 non-null    object 
 7   Age_man_bin8    889 non-null    object 
 8   Fare_bin3       889 non-null    object 
 9   FamilySize      889 non-null    int64  
 10  FamilySize_bin  889 non-null    object 
 11  Title           889 non-null    object 
dtypes: float64(1), int64(4), object(7)
memory usage: 83.5+ KB


In [0]:
origin_df['Pclass'] = origin_df["Pclass"].astype("category").cat.as_ordered()

In [9]:
categorical_feature = origin_df.select_dtypes(include=['category', 'object']).columns
categorical_feature

Index(['Pclass', 'Sex', 'Embarked', 'Age_man_bin5', 'Age_man_bin8',
       'Fare_bin3', 'FamilySize_bin', 'Title'],
      dtype='object')

In [0]:
train_df = pd.read_csv('./drive/My Drive/data_anonymization/data/anony_encod_titanic.csv', sep='\t', encoding='utf-8')

In [11]:
#class imbalance check
train_df.groupby(['Survived'], as_index=False).size()

Survived
0    549
1    340
dtype: int64
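The class counts above give a reference point for the accuracy figures reported later. As a minimal sketch (the variable names are illustrative and not part of the notebook), the accuracy of a classifier that always predicts the majority class can be computed directly from the counts 549 and 340:

```python
# Class counts taken from the groupby output above
class_counts = {0: 549, 1: 340}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy (in percent) of always predicting the majority class
baseline_acc = round(class_counts[majority_class] / total * 100, 2)
print(majority_class, baseline_acc)
```

A model whose test accuracy is close to this baseline has likely learned little beyond the class distribution.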

In [0]:
train_df.drop("Unnamed: 0", axis=1, inplace=True)

In [13]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Survived        889 non-null    int64  
 1   Pclass          889 non-null    int64  
 2   Sex             889 non-null    int64  
 3   Age             889 non-null    float64
 4   Fare            889 non-null    int64  
 5   Embarked        889 non-null    int64  
 6   Age_man_bin5    889 non-null    int64  
 7   Age_man_bin8    889 non-null    int64  
 8   Fare_bin3       889 non-null    int64  
 9   FamilySize      889 non-null    int64  
 10  FamilySize_bin  889 non-null    int64  
 11  Title           889 non-null    int64  
dtypes: float64(1), int64(11)
memory usage: 83.5 KB


In [0]:
gen_df = pd.read_csv('./drive/My Drive/data_anonymization/data/anony_gen_titanic.csv', sep='\t', encoding='utf-8')

In [0]:
gen_df.drop("Unnamed: 0", axis=1, inplace=True)

In [16]:
gen_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Pclass          1000 non-null   int64 
 1   Sex             1000 non-null   object
 2   Embarked        1000 non-null   object
 3   Age_man_bin5    1000 non-null   object
 4   Age_man_bin8    1000 non-null   object
 5   Fare_bin3       1000 non-null   object
 6   FamilySize_bin  1000 non-null   object
 7   Title           1000 non-null   object
 8   Survived        1000 non-null   int64 
dtypes: int64(2), object(7)
memory usage: 70.4+ KB


## **Label Encoding**

* "anony_org_titanic.csv" : original data
* "anony_encod_titanic.csv" : encoded data
* "anony_gen_titanic.csv": artificially generated data that takes the feature distributions into account (categorical features only, 1000 rows)

For a more detailed explanation, please see anonymization_titanic.ipynb.

In [0]:
train_df[categorical_feature] = train_df[categorical_feature].astype("category")
gen_df[categorical_feature] = gen_df[categorical_feature].astype("category")

In [0]:
train_df['Survived'] = train_df["Survived"].astype("category")
gen_df['Survived'] = gen_df["Survived"].astype("category")

In [19]:
gen_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   Pclass          1000 non-null   category
 1   Sex             1000 non-null   category
 2   Embarked        1000 non-null   category
 3   Age_man_bin5    1000 non-null   category
 4   Age_man_bin8    1000 non-null   category
 5   Fare_bin3       1000 non-null   category
 6   FamilySize_bin  1000 non-null   category
 7   Title           1000 non-null   category
 8   Survived        1000 non-null   category
dtypes: category(9)
memory usage: 10.3 KB


In [0]:
lab = LabelEncoder()

In [0]:
Y_gen = gen_df["Survived"]

In [0]:
gen_df.drop("Survived", axis=1, inplace=True)

In [0]:
X_gen_lab = gen_df.apply(lab.fit_transform)

In [0]:
ohe = OneHotEncoder()

In [0]:
X_gen_ohe = ohe.fit_transform(gen_df)
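The two encodings prepared above behave differently: label encoding maps each category to a single integer, while one-hot encoding creates one binary column per category. The following sketch illustrates the difference on a toy column; the values `"S"`, `"C"`, and `"Q"` and the variable names are assumptions made for the example, not data taken from the notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy categorical column (illustrative values only)
embarked = np.array(["S", "C", "S", "Q"])

# Label encoding: one integer per category, assigned in sorted order
label_codes = LabelEncoder().fit_transform(embarked)

# One-hot encoding: one binary column per category (expects 2-D input)
onehot = OneHotEncoder().fit_transform(embarked.reshape(-1, 1)).toarray()

print(label_codes)
print(onehot.shape)
```

Label encoding keeps the feature count small but imposes an arbitrary numeric order on the categories, which distance-based and linear models may interpret as meaningful; one-hot encoding avoids that at the cost of more columns.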

## **Classification Machine Learning Models**

In [0]:
# classification models

# Logistic Regression
logreg = LogisticRegression(random_state=42, max_iter=1000)

# Support Vector Machines (SVM)
svc = SVC()

# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)

# Decision Tree
decision_tree = DecisionTreeClassifier()

# Stochastic Gradient Descent
sgd = SGDClassifier()

# Gaussian Naive Bayes
gaussian = GaussianNB()

# K-Nearest Neighbors (KNN)
knn = KNeighborsClassifier(n_neighbors = 3)

In [0]:
#"acc_x": measured using testing data with x model
#"acc_x_score": measured using training data with x model

# Logistic Regression
acc_log = [] #accuracy_score(Y_pred, Y_test)
acc_log_score = [] #logreg.score(X_train, Y_train)

# Support Vector Machines (SVM)
acc_svc = []
acc_svc_score = []

# Random Forest
acc_random_forest = []
acc_random_forest_score = []

# Decision Tree
acc_decision_tree = []
acc_decision_tree_score = []

# Stochastic Gradient Descent
acc_sgd = []
acc_sgd_score = []

# Gaussian Naive Bayes
acc_gaussian = []
acc_gaussian_score = []

# K-Nearest Neighbors (KNN)
acc_knn = [] 
acc_knn_score = []

In [0]:
def getLogReg(X_train, X_test, Y_train, Y_test):
  logreg.fit(X_train, Y_train)
  Y_pred = logreg.predict(X_test)
  # acc_log.append(round(logreg.score(X_test, Y_test) * 100, 2)) #same with accuracy_score(Y_pred, Y_test)
  acc_log.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_log_score.append(round(logreg.score(X_train, Y_train) * 100, 2))

In [0]:
def getSVC(X_train, X_test, Y_train, Y_test):
  svc.fit(X_train, Y_train)
  Y_pred = svc.predict(X_test)
  acc_svc.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_svc_score.append(round(svc.score(X_train, Y_train) * 100, 2))

In [0]:
def getRandomForest(X_train, X_test, Y_train, Y_test):
  random_forest.fit(X_train, Y_train)
  Y_pred = random_forest.predict(X_test)
  acc_random_forest.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_random_forest_score.append(round(random_forest.score(X_train, Y_train) * 100, 2))

In [0]:
def getDecisionTree(X_train, X_test, Y_train, Y_test):
  decision_tree.fit(X_train, Y_train)
  Y_pred = decision_tree.predict(X_test)
  acc_decision_tree.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_decision_tree_score.append(round(decision_tree.score(X_train, Y_train) * 100, 2))

In [0]:
def getStochasticGradientDescent(X_train, X_test, Y_train, Y_test):
  sgd.fit(X_train, Y_train)
  Y_pred = sgd.predict(X_test)
  acc_sgd.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_sgd_score.append(round(sgd.score(X_train, Y_train) * 100, 2))

In [0]:
def getGaussianNaiveBayes(X_train, X_test, Y_train, Y_test):
  gaussian.fit(X_train, Y_train)
  Y_pred = gaussian.predict(X_test)
  acc_gaussian.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_gaussian_score.append(round(gaussian.score(X_train, Y_train) * 100, 2))

In [0]:
def getKNN(X_train, X_test, Y_train, Y_test, scaler_bool=False):
  # KNN is distance-based, so optionally rescale every feature to [0, 1];
  # the scaler is fit on the training split only and then applied to both splits
  if scaler_bool:
    scaler = MinMaxScaler()
    scaler.fit(X_train)

    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

  knn.fit(X_train, Y_train)
  Y_pred = knn.predict(X_test)
  acc_knn.append(round(accuracy_score(Y_pred, Y_test) * 100, 2))
  acc_knn_score.append(round(knn.score(X_train, Y_train) * 100, 2))

In [0]:
def buildClfModels(X_train, X_test, Y_train, Y_test):
  # print("Training data size:", X_train.shape[0])
  # print("Testing data size:", X_test.shape[0])

  #Logistic Regression
  getLogReg(X_train, X_test, Y_train, Y_test)

  # Support Vector Machines (SVM)
  getSVC(X_train, X_test, Y_train, Y_test)

  # Random Forest
  getRandomForest(X_train, X_test, Y_train, Y_test)

  # Decision Tree
  getDecisionTree(X_train, X_test, Y_train, Y_test)

  # Stochastic Gradient Descent
  getStochasticGradientDescent(X_train, X_test, Y_train, Y_test)

  # Gaussian Naive Bayes
  getGaussianNaiveBayes(X_train, X_test, Y_train, Y_test)

  # K-Nearest Neighbors (KNN)
  getKNN(X_train, X_test, Y_train, Y_test, scaler_bool=True)

## **12 Different Data Sets**

Depending on the anonymization level, we transformed the numerical features (e.g., "Age", "Fare", and "FamilySize") into categorical data at fine-grained or coarse-grained levels.
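To make the fine-grained versus coarse-grained distinction concrete, here is a minimal binning sketch. The bin edges and the helper `to_bin` are hypothetical and chosen only for illustration; the actual edges are defined in anonymization_titanic.ipynb and may differ.

```python
from bisect import bisect_left

# Hypothetical bin edges (inclusive upper bounds); not the notebook's real edges
EDGES_8 = [10, 20, 30, 40, 50, 60, 70]   # 8 bins
EDGES_5 = [12, 20, 30, 60]               # 5 bins

def to_bin(age, edges):
    """Return the index of the bin that contains `age`."""
    return bisect_left(edges, age)

ages = [22, 35, 45]
bins_8 = [to_bin(a, EDGES_8) for a in ages]
bins_5 = [to_bin(a, EDGES_5) for a in ages]
print(bins_8)
print(bins_5)
```

Under these assumed edges, ages 35 and 45 fall into different bins in the 8-bin scheme but share one bin in the 5-bin scheme, which is the information loss that coarser generalization introduces.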

In [0]:
age = ["Age", "Age_man_bin8", "Age_man_bin5"]
fare = ["Fare", "Fare_bin3"]
familysize = ["FamilySize", "FamilySize_bin"]

The following features remain the same across all data sets, and we call them the basic features: ['Pclass', 'Sex', 'Embarked', 'Title'].

In [0]:
X_basic = train_df.drop(["Survived"]+ age + fare + familysize, axis=1)

In [38]:
X_basic.columns.values

array(['Pclass', 'Sex', 'Embarked', 'Title'], dtype=object)

For each data set, there are 889 rows and 7 columns.

* Total 889 rows: 711 for training and 178 for testing (80% and 20%).
* Total 7 columns: ['Pclass', 'Sex', 'Embarked', 'Title'] + variations of [age, fare, familysize]
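The 711/178 split sizes can be checked with a small sketch on placeholder data of the same shape and class counts. The `random_state` and `stratify` arguments are additions for this example (they make the split reproducible and preserve the class ratio); the notebook's own `train_test_split` call passes neither.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape and class counts as the Titanic set
X_demo = np.zeros((889, 7))
y_demo = np.array([0] * 549 + [1] * 340)

# random_state and stratify are assumptions of this sketch, not notebook settings
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

print(X_tr.shape, X_te.shape)
print(round(y_te.mean(), 3))
```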

In [0]:
X = X_basic.copy()
Y = train_df["Survived"]

In [0]:
data_index_dict = {}
data_all_index_dict = {}

i = 0
for c in fare: 
  X[c] = train_df[c]

  for c1 in familysize:
    X[c1] = train_df[c1]

    for c2 in age:
      X[c2] = train_df[c2]
      # data index dictionary (variations of Age, Fare, and FamilySize)
      data_index_dict[i] = X.columns.drop(X_basic.columns).values
      data_all_index_dict[i] = X.columns.values
      i+=1
      X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)  #split train data 80% and test data 20%

      # print("Training data size:", X_train.shape[0])
      # print("Testing data size:", X_test.shape[0])

      buildClfModels(X_train, X_test, Y_train, Y_test)
      X = X.drop(c2, axis=1)
    
    X = X.drop(c1, axis=1)
  X = X.drop(c, axis=1)

For each data set, one variant of each of "Age", "Fare", and "FamilySize" is chosen in addition to the basic features 'Pclass', 'Sex', 'Embarked', 'Title'.

The following numbers are the data set indices (the keys of `data_index_dict`).  
**The lower the index, the more precise the data; the higher the index, the more abstract (more anonymized) the data.**

We write **dataset 1 ~> dataset 2** to mean that **dataset 2 is more abstracted and anonymized than dataset 1.**  
1. All numeric features: 0
2. Two numeric and one anonymized features: 1 ~> 2, 3, 6
3. One numeric and two anonymized features: 4 ~> 5, 7 ~> 8, 9
4. All anonymized features: 10 ~> 11

Please note that even among data sets with the same number of numeric and anonymized features, the level of abstraction can differ.  
For example, consider the following two feature sets:
* ['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize', **'Age_man_bin8'**] 
* ['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize', **'Age_man_bin5'**]

All features are the same except 'Age_man_bin8' and 'Age_man_bin5'.
'Age_man_bin8' cuts the age range into 8 bins, while 'Age_man_bin5' cuts it into 5. In other words, **'Age_man_bin5' has wider age ranges in each bucket than 'Age_man_bin8'; therefore, the generalization in 'Age_man_bin5' is more abstracted (loses more information) than the one in 'Age_man_bin8'.**
In this case, we denote this relation as [..., 'Age_man_bin8'] ~> [..., 'Age_man_bin5'].
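The index numbering described above can be reproduced without the data. The sketch below enumerates the 12 feature combinations in the same order as the nested loops and groups the indices by the number of anonymized features; the names `numeric`, `datasets`, and `levels` are introduced here for illustration only.

```python
from itertools import product

fare = ["Fare", "Fare_bin3"]
familysize = ["FamilySize", "FamilySize_bin"]
age = ["Age", "Age_man_bin8", "Age_man_bin5"]
numeric = {"Fare", "FamilySize", "Age"}

# product() varies the last argument fastest, matching the nested loops above
datasets = dict(enumerate(product(fare, familysize, age)))

# Group the data set indices by how many of the three features are anonymized
levels = {}
for idx, features in datasets.items():
    n_anonymized = sum(f not in numeric for f in features)
    levels.setdefault(n_anonymized, []).append(idx)

print(levels)
```

The resulting groups match the four levels listed above: index 0 has no anonymized feature, indices 1, 2, 3, and 6 have one, indices 4, 5, 7, 8, and 9 have two, and indices 10 and 11 have three.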

In [41]:
data_all_index_dict

{0: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize', 'Age'],
       dtype=object),
 1: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize',
        'Age_man_bin8'], dtype=object),
 2: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize',
        'Age_man_bin5'], dtype=object),
 3: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize_bin',
        'Age'], dtype=object),
 4: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize_bin',
        'Age_man_bin8'], dtype=object),
 5: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare', 'FamilySize_bin',
        'Age_man_bin5'], dtype=object),
 6: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare_bin3', 'FamilySize',
        'Age'], dtype=object),
 7: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare_bin3', 'FamilySize',
        'Age_man_bin8'], dtype=object),
 8: array(['Pclass', 'Sex', 'Embarked', 'Title', 'Fare_bin3', 'FamilySize',
        'Age_man_bin5'], dtype=object),
 9

## **Results**

### Score

"Score" is measured using training data. 

**This represents how well the model fits the training data.**

In [42]:
result_models_score_dict = {
               "acc_log_score": acc_log_score,
               "acc_svc_score": acc_svc_score,
               "acc_random_forest_score": acc_random_forest_score,
               "acc_decision_tree_score": acc_decision_tree_score,
               "acc_sgd_score": acc_sgd_score,
               "acc_gaussian_score": acc_gaussian_score,
               "acc_knn_score": acc_knn_score
              }

result_models_score = pd.DataFrame(result_models_score_dict)
result_models_score

Unnamed: 0,acc_log_score,acc_svc_score,acc_random_forest_score,acc_decision_tree_score,acc_sgd_score,acc_gaussian_score,acc_knn_score
0,80.31,68.35,97.19,97.19,75.81,79.89,87.34
1,79.61,68.21,93.95,93.95,77.22,79.47,87.06
2,80.45,66.81,93.81,93.81,62.31,80.59,88.61
3,80.73,67.79,96.91,96.91,67.37,78.48,88.05
4,81.43,68.21,93.53,93.53,62.45,80.59,86.5
5,81.15,68.64,92.83,92.83,77.36,79.18,86.36
6,79.61,63.43,94.09,94.09,73.0,79.18,87.06
7,79.18,83.83,89.03,89.03,77.07,80.59,87.2
8,79.61,83.97,88.33,88.33,70.75,80.03,85.51
9,80.73,64.98,93.67,93.67,56.26,79.32,87.76


### Accuracy

"Accuracy" is measured using testing data.

**This reflects the model accuracy on new data.**
Please note that the data set has 889 rows in total, split into 711 training rows and 178 test rows (80% and 20%).

**A single train/test split is not stable**: the measured accuracy depends on which rows happen to fall into the test set, and one split may draw the test rows mostly from specific groups. An **overfitted** model can therefore look better or worse than it really is. To obtain a reliable "Accuracy", **the model should be cross-validated** (e.g., **k-fold cross validation**). In this project, however, the aim is to investigate the relationship between anonymization (privacy) and prediction model accuracy (utility), so only a single split is used.

I think this gives a more realistic result than "Score".
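For reference, k-fold cross validation could look like the following sketch. It runs on synthetic stand-in data from `make_classification` rather than the Titanic files, and the fold count, seeds, and variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the Titanic features: 889 rows, 7 columns
X_demo, y_demo = make_classification(n_samples=889, n_features=7, random_state=42)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold; every row is used for testing exactly once
fold_acc = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(fold_acc.round(3), fold_acc.mean().round(3), fold_acc.std().round(3))
```

Reporting the mean and standard deviation over the folds shows how much of the difference between two data sets might be explained by the split alone.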

In [43]:
result_models_acc_dict = {
               "acc_log": acc_log,
               "acc_svc": acc_svc, 
               "acc_random_forest": acc_random_forest,
               "acc_decision_tree": acc_decision_tree,
               "acc_sgd": acc_sgd,
               "acc_gaussian": acc_gaussian,
               "acc_knn": acc_knn
              }

result_models = pd.DataFrame(result_models_acc_dict)
result_models

Unnamed: 0,acc_log,acc_svc,acc_random_forest,acc_decision_tree,acc_sgd,acc_gaussian,acc_knn
0,78.65,66.85,80.34,82.58,78.09,79.21,83.15
1,81.46,67.42,84.27,82.58,76.97,82.02,78.09
2,76.4,69.66,79.78,82.58,57.87,77.53,81.46
3,83.71,69.66,84.27,84.27,70.79,82.58,84.27
4,79.21,64.04,82.02,83.71,58.99,79.21,82.02
5,78.65,67.98,84.27,85.96,75.28,77.53,75.28
6,79.78,69.66,80.9,78.65,78.09,80.34,80.34
7,78.65,80.34,76.4,77.53,73.6,78.65,77.53
8,76.97,80.9,80.34,79.78,69.1,79.21,79.21
9,80.9,60.11,79.21,79.21,57.87,79.21,84.83


We summarize the best score and accuracy for each classification model.

In [0]:
best_models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machines', 'Random Forest',
              'Decision Tree', 'Stochastic Gradient Decent', 'Gaussian Naive Bayes', 
              'K-Nearest Neighbors'],
    'Score': result_models_score.max().to_list(),
    'Accuracy': result_models.max()
              })

Ordered by "Score" measured using training data.

In [45]:
best_models.sort_values(by='Score', ascending=False)

Unnamed: 0,Model,Score,Accuracy
acc_random_forest,Random Forest,97.19,85.39
acc_decision_tree,Decision Tree,97.19,85.96
acc_knn,K-Nearest Neighbors,88.61,84.83
acc_svc,Support Vector Machines,83.97,87.08
acc_log,Logistic Regression,81.43,83.71
acc_gaussian,Gaussian Naive Bayes,80.59,83.15
acc_sgd,Stochastic Gradient Decent,77.36,80.34


Ordered by "Accuracy" measured using testing data.

In [46]:
best_models.sort_values(by='Accuracy', ascending=False)

Unnamed: 0,Model,Score,Accuracy
acc_svc,Support Vector Machines,83.97,87.08
acc_decision_tree,Decision Tree,97.19,85.96
acc_random_forest,Random Forest,97.19,85.39
acc_knn,K-Nearest Neighbors,88.61,84.83
acc_log,Logistic Regression,81.43,83.71
acc_gaussian,Gaussian Naive Bayes,80.59,83.15
acc_sgd,Stochastic Gradient Decent,77.36,80.34


**We summarize the best models for each data set.**

In the above description of the 12 different data sets, we explained that the lower the index, the more precise the data, and the higher the index, the more abstract (more anonymized) the data.

1. All numeric features: 0
2. Two numeric and one anonymized features: 1 ~> 2, 3, 6
3. One numeric and two anonymized features: 4 ~> 5, 7 ~> 8, 9
4. All anonymized features: 10 ~> 11

We write dataset 1 ~> dataset 2 to mean that dataset 2 is more abstracted and anonymized than dataset 1.


In [0]:
best_data = pd.DataFrame({
    'Data': data_index_dict,
    'Score(Train Set)': result_models_score.max(axis=1),
    'Score_Model': result_models_score.idxmax(axis=1),
    'Accuracy(Test Set)': result_models.max(axis=1),
    'Accuracy_Model':result_models.idxmax(axis=1)
              })

Ordered by "Score" measured using training data.

In [48]:
best_data.sort_values(by='Score(Train Set)', ascending=False)

Unnamed: 0,Data,Score(Train Set),Score_Model,Accuracy(Test Set),Accuracy_Model
0,"[Fare, FamilySize, Age]",97.19,acc_random_forest_score,83.15,acc_knn
3,"[Fare, FamilySize_bin, Age]",96.91,acc_random_forest_score,84.27,acc_random_forest
6,"[Fare_bin3, FamilySize, Age]",94.09,acc_random_forest_score,80.9,acc_random_forest
1,"[Fare, FamilySize, Age_man_bin8]",93.95,acc_random_forest_score,84.27,acc_random_forest
2,"[Fare, FamilySize, Age_man_bin5]",93.81,acc_random_forest_score,82.58,acc_decision_tree
9,"[Fare_bin3, FamilySize_bin, Age]",93.67,acc_random_forest_score,84.83,acc_knn
4,"[Fare, FamilySize_bin, Age_man_bin8]",93.53,acc_random_forest_score,83.71,acc_decision_tree
5,"[Fare, FamilySize_bin, Age_man_bin5]",92.83,acc_random_forest_score,85.96,acc_decision_tree
7,"[Fare_bin3, FamilySize, Age_man_bin8]",89.03,acc_random_forest_score,80.34,acc_svc
8,"[Fare_bin3, FamilySize, Age_man_bin5]",88.33,acc_random_forest_score,80.9,acc_svc


**==> Results:**

**Results from the "Score".**  
The "Score" is measured using training data.
It represents how well the model fits the training data.

In this case, we could observe the trend that
**data sets with a lower index (more precise data) produce higher training accuracy, while data sets with a higher index (more anonymized, lossier data) produce lower training accuracy.**  

For example, the data set of index 0 (all numeric data for fare, family size, and age) built the random forest model with a score of 97.19% (the highest among all data sets).
The second to fifth highest scores came from the data sets of index 1, 2, 3, and 6, which have two numeric features and one categorical feature.
The data sets of index 10 and index 11 consist of all categorical (anonymized) data and produce the lowest scores (86.78% and 85.79%, respectively).

**If all the other variables are the same and only one variable differs in abstraction level, the rank order of the scores is maintained in all cases.** 

For instance, 'Age_man_bin5' is more abstracted than 'Age_man_bin8', and the models containing 'Age_man_bin5' instead of 'Age_man_bin8' have lower scores (e.g., for the data sets of index: 1 > 2, 4 > 5, 7 > 8, and 10 > 11).
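The pairwise claim can be verified from the reported numbers. The sketch below copies the best training score per data set index from the result tables and the text in this section and checks each pair that differs only in the age feature; the dictionary and variable names are illustrative.

```python
# Best training score per data set index, copied from the reported results
best_score = {1: 93.95, 2: 93.81, 4: 93.53, 5: 92.83,
              7: 89.03, 8: 88.33, 10: 86.78, 11: 85.79}

# Each pair differs only in the age feature: (Age_man_bin8, Age_man_bin5)
pairs = [(1, 2), (4, 5), (7, 8), (10, 11)]

# True when the less abstracted data set has the higher score
order_kept = {pair: best_score[pair[0]] > best_score[pair[1]] for pair in pairs}
print(order_kept)
```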

Ordered by "Accuracy" measured using testing data.

In [49]:
best_data.sort_values(by='Accuracy(Test Set)', ascending=False)

Unnamed: 0,Data,Score(Train Set),Score_Model,Accuracy(Test Set),Accuracy_Model
10,"[Fare_bin3, FamilySize_bin, Age_man_bin8]",86.78,acc_random_forest_score,87.08,acc_svc
5,"[Fare, FamilySize_bin, Age_man_bin5]",92.83,acc_random_forest_score,85.96,acc_decision_tree
11,"[Fare_bin3, FamilySize_bin, Age_man_bin5]",85.79,acc_random_forest_score,85.96,acc_svc
9,"[Fare_bin3, FamilySize_bin, Age]",93.67,acc_random_forest_score,84.83,acc_knn
1,"[Fare, FamilySize, Age_man_bin8]",93.95,acc_random_forest_score,84.27,acc_random_forest
3,"[Fare, FamilySize_bin, Age]",96.91,acc_random_forest_score,84.27,acc_random_forest
4,"[Fare, FamilySize_bin, Age_man_bin8]",93.53,acc_random_forest_score,83.71,acc_decision_tree
0,"[Fare, FamilySize, Age]",97.19,acc_random_forest_score,83.15,acc_knn
2,"[Fare, FamilySize, Age_man_bin5]",93.81,acc_random_forest_score,82.58,acc_decision_tree
6,"[Fare_bin3, FamilySize, Age]",94.09,acc_random_forest_score,80.9,acc_random_forest


**==> Results:**

**Results from the "Accuracy".**  
The "Accuracy" is calculated using the data that were split off and reserved for testing. Therefore, it reflects how well the model predicts new data.
The "Accuracy" represents more realistic results.

In this project, the test data is 178 of the 889 rows, only a single split is used, and the model is not cross-validated.
The data set is quite small, so the resulting rank order may not be reliable.
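A rough way to quantify this uncertainty is the normal-approximation confidence interval for a proportion. The sketch below applies it to the best reported test accuracy (87.08% on 178 test rows); the choice of this approximation and the variable names are assumptions of the example, not part of the original analysis.

```python
import math

n_test = 178          # size of the test split
accuracy = 0.8708     # best reported test accuracy (data set index 10)

# Normal-approximation standard error of a proportion
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Half-width of an approximate 95% confidence interval, in percentage points
half_width = 1.96 * std_err * 100
print(round(half_width, 2))
```

The interval is roughly plus or minus 5 percentage points, which is wider than most of the gaps between data sets in the table above, so small differences in rank should not be over-interpreted.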

<!-- However, we could observe a few things. As similar to "Score" results, **in most cases, the rank order of accuracy is maintained when only one variable is different** (e.g., model accuracy of dataset of index: 4 > 5, 7 > 8, and 10 > 11). -->

**We cautiously hypothesize that anonymized data can fit certain types of prediction models better and may produce highly accurate results.**  

The data sets of index 10 and 11, consisting of all three anonymized features, produce the best and the third-ranked "Accuracy" results, respectively (index 11 ties with index 5 at 85.96%).  
The data set of index 11, with the feature 'Age_man_bin5', is more abstracted and produces lower accuracy than the data set of index 10 with 'Age_man_bin8'.  
Both results come from the SVM models. 

Still, we need to apply the analysis to larger data to draw a more concrete conclusion from the "Accuracy" results.

## Synthetic Data

**Two more data sets: synthetic data (1000 generated rows).**

* Index 12: ['Age_man_bin8', 'Fare_bin3', 'FamilySize_bin', 'Pclass', 'Sex', 'Embarked', 'Title']
* Index 13: ['Age_man_bin5', 'Fare_bin3', 'FamilySize_bin', 'Pclass', 'Sex', 'Embarked', 'Title']

**==> Results:**  

The synthetic data seem to build poor-quality prediction models.
For building prediction models, **I recommend using other anonymization techniques and NOT using synthetic data.**

In [50]:
# Synthetic data

dup_cols = ["Age_man_bin8", "Age_man_bin5"]

for c in dup_cols:
  # keep the age binning named by c and drop the other one,
  # so that the features match the label recorded below
  X_gen = X_gen_lab.drop([a for a in dup_cols if a != c], axis=1)

  data_index_dict[i] = "Synthetic data: ['"+ c +"', 'Fare_bin3', 'FamilySize_bin']"
  data_all_index_dict[i] = [c, 'Fare_bin3', 'FamilySize_bin'] + ['Pclass', 'Sex', 'Embarked', 'Title'] 
  print(data_all_index_dict[i])
  i+=1

  X_train, X_test, Y_train, Y_test = train_test_split(X_gen, Y_gen, test_size=0.2)  #split train data 80% and test data 20%
  buildClfModels(X_train, X_test, Y_train, Y_test)

['Age_man_bin8', 'Fare_bin3', 'FamilySize_bin', 'Pclass', 'Sex', 'Embarked', 'Title']
['Age_man_bin5', 'Fare_bin3', 'FamilySize_bin', 'Pclass', 'Sex', 'Embarked', 'Title']


In [0]:
result_models_score = pd.DataFrame(result_models_score_dict)
result_models = pd.DataFrame(result_models_acc_dict)

In [0]:
best_models = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machines', 'Random Forest',
              'Decision Tree', 'Stochastic Gradient Decent', 'Gaussian Naive Bayes', 
              'K-Nearest Neighbors'],
    'Score': result_models_score.max().to_list(),
    'Accuracy': result_models.max()
              })

best_data = pd.DataFrame({
    'Data': data_index_dict,
    'Score(Train Set)': result_models_score.max(axis=1),
    'Score_Model': result_models_score.idxmax(axis=1),
    'Accuracy(Test Set)': result_models.max(axis=1),
    'Accuracy_Model':result_models.idxmax(axis=1)
              })

In [53]:
best_data.iloc[12:]

Unnamed: 0,Data,Score(Train Set),Score_Model,Accuracy(Test Set),Accuracy_Model
12,"Synthetic data: ['Age_man_bin8', 'Fare_bin3', ...",78.62,acc_random_forest_score,62.5,acc_sgd
13,"Synthetic data: ['Age_man_bin5', 'Fare_bin3', ...",82.5,acc_decision_tree_score,61.5,acc_log
