<a href="https://colab.research.google.com/github/bayarra/cs598-dlh/blob/main/DLH_240_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Paper 240. Application of deep and machine learning techniques for multi-label classification performance on psychotic disorder diseases



##PART 1: Accessing the Study and its Dataset.

###How to access the Study

The PDF version of this study can be accessed at [this link](https://www.sciencedirect.com/science/article/pii/S2352914821000356).

###How to access the Dataset

To access this dataset, you will need to download the CSV from [this](https://www.sciencedirect.com/science/article/pii/S2352340917303487) link. This is an additional study that is cited in our study. 

The link to download our dataset is present in the section "*Appendix A. Supplementary material*"

Clicking on this link in this section will give you the option to locally download the dataset that is being used here in the CSV format. 

###How to Mount (Add) this Dataset to the Google Colaboratory Notebook

In order to import this dataset into Google Colaboratory, you will need to upload the CSV into your Google Drive account, and then execute the following code block.

####NOTE: This code block is already in the Colab Notebook in PART 3. You do NOT need to do this again.
```
from google.colab import drive
import pandas as pd
drive.mount('/content/drive/')
path = "/content/drive/MyDrive/mmc240_c.csv"
df = pd.read_csv(path)
```
This code block does the following steps:

1. Mounts your Google Drive to this Colaboratory notebook.

2. Reads the CSV file at the given path with pandas and converts it to a dataframe.
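
As a rough sketch of the loading step outside Colab, the read can be wrapped in a small helper that fails with a clear message when the file is missing. The `load_dataset` name and the temporary demo file are assumptions made for illustration; only `pd.read_csv` comes from the notebook.

```python
import os
import tempfile

import pandas as pd


def load_dataset(path):
    """Read the CSV at `path` into a DataFrame (hypothetical helper)."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"CSV not found at {path}; upload it to Drive first.")
    return pd.read_csv(path)


# Demo on a tiny temporary file rather than the real dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w") as handle:
        handle.write("sex,age\nM,18\nF,30\n")
    demo_df = load_dataset(demo_path)

print(demo_df.shape)
```

In the notebook itself, the same helper could be called with the Drive path once the drive is mounted.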


##PART 2: Dependency Description.

The following is a list of all of the packages used in this Colaboratory Notebook to run this project. A brief description of each dependency and its purpose is below.

Basic libraries that allow for dataset modification

```
#pandas allows us to easily modify and one hot encode the values in the dataset.
import pandas as pd

# numpy is used to convert the values in our dataset into a tensor for the Neural Network. 
import numpy as np

# train_test_split is used to split the dataset into the correct ratios as specified in the original study.
from sklearn.model_selection import train_test_split


"""
This library allows the Google Colab Notebook to access Google Drive and mount it, so that the CSV file can be accessed.
"""
from google.colab import drive

"""
SMOTE is an algorithm that is used to balance the dataset. This helps us get a more realistic picture of how accurate the Neural Networks are.
"""
from imblearn.over_sampling import SMOTE

"""
sklearn LabelEncoder is an encoder that converts strings and characters into integers. This is helpful when we want to convert basic attributes of our dataset into simple binary values (such as 0 and 1).
"""
from sklearn.preprocessing import LabelEncoder

```
Classic Machine Learning Models

```
"""
MLPClassifier, LinearSVC, RandomForestClassifier, and DecisionTreeClassifier from Sklearn allow us to easily run the MLP, SVM, RF, and
DT Machine Learning Algorithms as a baseline to compare the Neural Networks to without requiring us to have to build each
algorithm from scratch.
"""
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
```

Keras API and its dependencies (used to build the Neural Networks in this study)
```
"""
Because the Keras API uses Tensors, this is required to convert our pandas dataframe to a tensor
""" 
import tensorflow as tf

"""
This is the basic Keras API which is used to set up the Neural Network Model
"""
from tensorflow import keras
"""
These are the basic types of Keras Neural Network models that we use in order to build the Neural Network
"""
from tensorflow.keras.models import Sequential, Model, load_model

"""
These represent the various layers that we use in the Neural Network to build what is described in the original study.
"""
from tensorflow.keras.layers import Dense,Input,Dropout,Flatten,Conv2D,MaxPool2D

"""
This part of Keras allows us to convert a dataframe into a numpy array that can then be converted into a tensor. This is important in Part 12, when we are required to one-hot encode the target, which produces a pandas dataframe. This library helps convert that dataframe into a numpy array, which can then be converted directly into a tensor for use in the Neural Network model.
"""
from keras.utils import np_utils
```



##PART 3: Installing Basic Dependencies, Taking a look at the dataset in raw form.

In this section, we will import the basic dependencies for this project and then load and view the CSV file via the instructions mentioned above.

To do this, we will run these two blocks, which import all of our dependencies and load the raw CSV dataset into the Colaboratory Notebook environment.

In [None]:
import numpy as np
import pandas as pd
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE

In [None]:
drive.mount('/content/drive/')
path = "/content/drive/MyDrive/mmc240_c.csv"
df = pd.read_csv(path)
df

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


Unnamed: 0,sex,age,faNoily_status,religion,occupation,genetic,status,loss_of_parent,divorse,Injury,Spiritual_consult,Insominia,shizopherania,vascula_demetia,MBD,Bipolar,agecode
0,M,18,Yes,C,STUDENT,Yes,S,Yes,No,No,Yes,N,P,P,P,N,1
1,F,30,Yes,M,ARTISAN,Yes,S,Yes,No,Yes,Yes,P,P,P,N,N,1
2,M,22,Yes,C,STUDENT,No,S,No,No,No,Yes,P,P,P,N,P,1
3,M,35,No,M,ARTISAN,No,M,No,No,No,Yes,P,P,N,N,P,2
4,M,30,Yes,M,ARTISAN,Yes,M,No,No,No,Yes,P,P,P,P,P,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,F,73,Yes,M,RETIRED,Yes,S,Yes,No,No,Yes,P,N,P,N,P,3
496,F,50,No,M,ARTISAN,No,M,Yes,No,No,No,P,P,N,P,P,2
497,F,32,No,C,FORCE,No,M,No,No,No,Yes,N,P,P,P,N,2
498,M,13,Yes,C,STUDENT,No,S,Yes,No,No,No,N,P,N,N,N,1


## PART 4: Label encode the CSV data.
Now that we have the basics imported into the Colaboratory Notebook, we need to encode the values, as Machine Learning models typically cannot handle strings or characters as input.

To do this, we use sklearn's LabelEncoder, as this allows us to easily convert basic values, like Yes and No, to simple binary values.

These values are first transformed, and then we drop the original column from the pandas dataframe that represents our dataset.

After the original column is dropped, the new column that holds the encoded values is added to the dataframe (in most cases under the same name).
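
The same transform-and-replace pattern can be sketched as a loop over a toy dataframe. The rows below are made up for illustration and the loop is an assumed, more compact variant of the step-by-step cell that follows. `LabelEncoder` assigns integers in sorted label order, so `F`/`No` become 0 and `M`/`Yes` become 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows that imitate three of the dataset's binary columns.
toy = pd.DataFrame({
    "sex": ["M", "F", "M"],
    "genetic": ["Yes", "Yes", "No"],
    "status": ["S", "M", "S"],
})

encoded = toy.copy()
for column in ["sex", "genetic", "status"]:
    # A fresh encoder per column keeps each mapping independent.
    encoded[column] = LabelEncoder().fit_transform(toy[column])

print(encoded)
```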

In [None]:
from sklearn.preprocessing import LabelEncoder

"""Create the Label Encoder"""
label_encoder = LabelEncoder()

"""Convert Sex values (M/F) to binary values"""
sex_label = label_encoder.fit_transform(df["sex"])
Data = df.drop("sex", axis='columns')
Data["sex"] = sex_label

"""Convert faNoily_status column (Yes/No) to binary values"""
family_status = label_encoder.fit_transform(Data["faNoily_status"])
Data2 = Data.drop("faNoily_status", axis='columns')
Data2["family_status"] = family_status

"""Convert genetic column (Yes/No) to binary values"""
genetic_status = label_encoder.fit_transform(Data2["genetic"])
Data3 = Data2.drop("genetic", axis='columns')
Data3["genetic"] = genetic_status

"""Convert status column (S/M) to binary values"""
status = label_encoder.fit_transform(Data3["status"])
Data3 = Data3.drop("status", axis='columns')
Data3["status"] = status

"""Loss of Parent"""
loss_of_parent = label_encoder.fit_transform(Data3["loss_of_parent"])
Data4 = Data3.drop("loss_of_parent", axis='columns')
Data4["loss_of_parent"] = loss_of_parent

"""Divorce"""
divorce = label_encoder.fit_transform(Data4["divorse"])
Data5 = Data4.drop("divorse", axis='columns')
Data5["divorce"] = divorce

"""Injury"""
injury = label_encoder.fit_transform(Data5["Injury"])
Data6 = Data5.drop("Injury", axis='columns')
Data6["Injury"] = injury

"""Spiritual Consult"""
spiritual_consult = label_encoder.fit_transform(Data6["Spiritual_consult"])
Data7 = Data6.drop("Spiritual_consult", axis='columns')
Data7["Spiritual_consult"] = spiritual_consult


Now that the basic columns have been converted, the more complex columns with more than two values (such as religion, occupation, etc.) need to be one-hot encoded using pandas' get_dummies function. The get_dummies function converts every unique value into a separate column with binary values. For example, the religion column has 3 unique values, 'C', 'M', and 'O', so it is converted into three columns called religion_C, religion_M, and religion_O, each of which holds binary values. Numeric encoding is not useful for values that have no ordinal relation to each other. We use the feature name as the prefix for these conversions.
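
A minimal sketch of this conversion on a made-up three-row dataframe is shown here. The `dtype=int` argument is an addition that is not in the notebook; it keeps the 0/1 integers shown in the outputs, since newer pandas versions return booleans by default.

```python
import pandas as pd

toy = pd.DataFrame({"age": [18, 30, 22], "religion": ["C", "M", "O"]})

# dtype=int keeps 0/1 integers; newer pandas versions default to booleans.
one_hot = pd.get_dummies(toy["religion"], prefix="religion", dtype=int)
toy_encoded = toy.drop("religion", axis="columns").join(one_hot)

print(toy_encoded)
```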

In [None]:
"""religion"""
one_hot = pd.get_dummies(Data7['religion'], prefix='religion')
Data8 = Data7.drop("religion", axis='columns')
Data8 = Data8.join(one_hot)

one_hot = pd.get_dummies(Data8['occupation'], prefix='occupation')
Data9 = Data8.drop("occupation", axis='columns')
Data9 = Data9.join(one_hot)

one_hot = pd.get_dummies(Data9['agecode'], prefix='agecode')
Data9 = Data9.drop("agecode", axis='columns')
Data9 = Data9.join(one_hot)

Data9

Unnamed: 0,age,Insominia,shizopherania,vascula_demetia,MBD,Bipolar,sex,family_status,genetic,status,...,religion_O,occupation_ARTISAN,occupation_C/SERVANT,occupation_FORCE,occupation_RETIRED,occupation_STUDENT,occupation_UNEMPLYD,agecode_1,agecode_2,agecode_3
0,18,N,P,P,P,N,1,1,1,1,...,0,0,0,0,0,1,0,1,0,0
1,30,P,P,P,N,N,0,1,1,1,...,0,1,0,0,0,0,0,1,0,0
2,22,P,P,P,N,P,1,1,0,1,...,0,0,0,0,0,1,0,1,0,0
3,35,P,P,N,N,P,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
4,30,P,P,P,P,P,1,1,1,0,...,0,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,73,P,N,P,N,P,0,1,1,1,...,0,0,0,0,1,0,0,0,0,1
496,50,P,P,N,P,P,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
497,32,N,P,P,P,N,0,0,0,0,...,0,0,0,1,0,0,0,0,1,0
498,13,N,P,N,N,N,1,1,0,1,...,0,0,0,0,0,1,0,1,0,0


Next, the Y columns need to be converted. Some of them contain more than two distinct values; for example, vascula_demetia has 4 values: 'N', 'N       ­>', 'P', and 'P       ­>'. We cleaned up those stray unicode characters and kept only the values N and P. After the cleanup, the five dependent variables (target values) are converted into binary values using the label_encoder.
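
The cleanup code itself is not shown in this notebook, so the following is only one possible way to do it, under the assumption that every raw value starts with `N` or `P`. The sample values are made up to imitate the stray trailing characters.

```python
import pandas as pd

# Made-up raw values that imitate the stray trailing characters.
raw = pd.Series(["N", "N       \u00ad>", "P", "P       \u00ad>"])

# Keep only the leading N or P and discard whatever follows it.
cleaned = raw.str.extract(r"^\s*([NP])", expand=False)

print(cleaned.tolist())
```

Values that do not start with `N` or `P` would come back as missing values, which makes unexpected entries easy to spot before label encoding.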

Additionally, in later sections of this study, we want to predict multiple labels at the same time. Generally, we could just pull out all of the columns that we want to use at runtime and pass them in as Y, but in our specific use case this is not possible because SMOTE does not support multi-label (multi-column) values of Y.

Because of this limitation, the original study concatenates all five targets (dependent variables) into a single string and places it into a single column called "target".

In order to remove rare combinations and to improve accuracy, the original study removes any 'target' value (the string that represents the five dependent variables) that does not occur more than 6 times.
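
The combine-and-filter idea can be sketched on a toy dataframe. For brevity this example uses only two label columns and a threshold of 2 instead of 6; the data and the `min_count` name are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Insominia": [0, 0, 0, 1, 1],
    "Bipolar":   [1, 1, 1, 0, 1],
})

# Join the label columns into one string per row, e.g. 0 and 1 -> "01".
toy["target"] = toy["Insominia"].astype(str) + toy["Bipolar"].astype(str)

# Keep only combinations seen more than `min_count` times (the notebook uses 6).
min_count = 2
kept = toy.groupby("target").filter(lambda group: len(group) > min_count)

print(kept["target"].value_counts())
```

Here only the combination `"01"` survives, because `"10"` and `"11"` each occur once.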

This data is then later used for the Multi-Label Neural Network and comparing it to Classical Machine Learning models. 

Additionally, the data is then split into multiple dataframes called data, data_orig, data2, data_feat, and data3. An explanation of each is below.

```
data/data_orig: Dataframe that contains the individual dependent variables, does not contain target. Used for testing the Single-label Neural Network.

data2/data3/data_feat: Dataframe that has removed the individual dependent variable and combined them into one column called 'target'. Used for the multi-label Neural Network.
```

In [None]:
#Insominia N|P
insominia = label_encoder.fit_transform(Data9["Insominia"])
data = Data9.drop("Insominia", axis='columns')
data["Insominia"] = insominia

#shizopherania N|P
shizopherania = label_encoder.fit_transform(data["shizopherania"])
data = data.drop("shizopherania", axis='columns')
data["shizopherania"] = shizopherania

#vascula demetia N|P
vascula_demetia = label_encoder.fit_transform(data["vascula_demetia"])
data = data.drop("vascula_demetia", axis='columns')
data["vascula_demetia"] = vascula_demetia

#MBD N|P
MBD = label_encoder.fit_transform(Data9["MBD"])
data = data.drop("MBD", axis='columns')
data["MBD"] = MBD

#Bipolar N|P
Bipolar = label_encoder.fit_transform(Data9["Bipolar"])
data = data.drop("Bipolar", axis='columns')
data["Bipolar"] = Bipolar

data_orig = data.copy()
data_feat = data.copy()
del data_feat['Insominia']
del data_feat['shizopherania']
del data_feat['vascula_demetia']
del data_feat['MBD']
del data_feat['Bipolar']

# Concatenate 5 target features into 1 single column
data['target']=data['Insominia'].astype(str)+data['shizopherania'].astype(str)+data['vascula_demetia'].astype(str)+data['MBD'].astype(str)+data['Bipolar'].astype(str)

# Exclude the combinations that occur 6 times or fewer to get a better balance before using the SMOTE algorithm.
data = data.groupby('target').filter(lambda x: len(x) > 6)

data3 = data.copy()

data2 = data.copy()

del data['target']


# remove these values since they are now represented by 'target'
del data2['Insominia']
del data2['shizopherania']
del data2['vascula_demetia']
del data2['MBD']
del data2['Bipolar']


data2

Unnamed: 0,age,sex,family_status,genetic,status,loss_of_parent,divorce,Injury,Spiritual_consult,religion_C,...,occupation_ARTISAN,occupation_C/SERVANT,occupation_FORCE,occupation_RETIRED,occupation_STUDENT,occupation_UNEMPLYD,agecode_1,agecode_2,agecode_3,target
0,18,1,1,1,1,1,0,0,1,1,...,0,0,0,0,1,0,1,0,0,01110
2,22,1,1,0,1,0,0,0,1,1,...,0,0,0,0,1,0,1,0,0,11101
3,35,1,0,0,0,0,0,0,1,0,...,1,0,0,0,0,0,0,1,0,11001
4,30,1,1,1,0,0,0,0,1,0,...,1,0,0,0,0,0,1,0,0,11111
5,86,0,1,0,0,1,0,0,1,1,...,0,0,0,1,0,0,0,0,1,01100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,73,0,1,1,1,1,0,0,1,0,...,0,0,0,1,0,0,0,0,1,10101
496,50,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,1,0,11011
497,32,0,0,0,0,0,0,0,1,1,...,0,0,1,0,0,0,0,1,0,01110
498,13,1,1,0,1,1,0,0,0,1,...,0,0,0,0,1,0,1,0,0,01000


##PART 5: Create Test/Training Set from Data
Now that we have properly encoded our dataset, we will give an example of how to split the dataset using sklearn's train_test_split function.

In the example below, we take out Insomnia as our Y value (the value that we want to predict) while splitting the dataset into the respective test and training sets. An important note is that we explicitly specify that the test set is 30% of the values and the training set is 70% of the values. This specific ratio is used in order to best replicate what was done in the original study.
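
For readers who want repeatable splits, here is a small sketch on synthetic data. The `random_state` and `stratify` arguments are assumptions added for illustration; the notebook cell does not set them, so its splits differ from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 features, 60/40 class split.
X_demo = np.arange(400).reshape(100, 4)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.3,      # 30% test, 70% train, as in the study
    random_state=42,    # assumption: fixed seed for repeatable splits
    stratify=y_demo,    # assumption: keep the class ratio in both parts
)

print(X_tr.shape, X_te.shape)
```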

In [None]:
from sklearn.model_selection import train_test_split
y = data.pop('Insominia')
X = data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_train

Unnamed: 0,age,sex,family_status,genetic,status,loss_of_parent,divorce,Injury,Spiritual_consult,religion_C,...,occupation_RETIRED,occupation_STUDENT,occupation_UNEMPLYD,agecode_1,agecode_2,agecode_3,shizopherania,vascula_demetia,MBD,Bipolar
6,68,1,0,0,0,1,0,0,1,0,...,1,0,0,0,0,1,1,1,1,1
133,19,0,1,0,1,1,0,0,1,0,...,0,0,0,1,0,0,1,1,1,1
304,19,1,0,1,1,0,0,0,1,0,...,0,1,0,1,0,0,1,1,0,0
380,21,0,0,0,1,1,0,0,1,1,...,0,1,0,1,0,0,1,1,0,0
335,56,0,0,0,0,1,1,1,1,0,...,0,0,0,0,0,1,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
321,32,1,1,0,1,1,0,0,1,0,...,0,0,0,0,1,0,1,1,1,0
186,40,0,0,0,0,0,1,0,0,1,...,0,0,1,0,1,0,1,1,0,1
369,34,1,1,1,0,1,0,0,1,0,...,0,0,0,0,1,0,1,1,0,0
400,43,0,0,0,0,1,0,0,0,1,...,0,0,0,0,1,0,1,0,0,0


To better understand why we are balancing the dataset using SMOTE, let's take a look at the ratio of the respective values in the Insomnia column.

In [None]:
df['Insominia'].value_counts()/len(df)

N    0.594
P    0.406
Name: Insominia, dtype: float64

As we can see, there is a significantly larger number of N values (negative) than P values (positive). In an ideal dataset, we want to have the same number of N samples and P samples, as this prevents our Machine Learning models from just guessing N because of the larger share of N values.
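
To make the "just guessing N" concern concrete, this sketch computes the accuracy of a model that always predicts the most common class. The label list is synthetic and only reproduces the 59.4% / 40.6% ratio from the output above.

```python
from collections import Counter

# Synthetic labels with the same 59.4% / 40.6% ratio as the output above.
labels = ["N"] * 594 + ["P"] * 406

counts = Counter(labels)
ratios = {label: count / len(labels) for label, count in counts.items()}

# Accuracy of a model that always predicts the most common class.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(ratios, majority_label, baseline_accuracy)
```

Any classifier for this column should be judged against that baseline rather than against 50%.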

##PART 6: Balance Insomnia dataset with SMOTE
The following is an example of how we will use the SMOTE algorithm later to balance the data, and what the overall dataset looks like after it is properly balanced. The new dataset now has roughly 400 samples, because this is the total number of samples required for an equal ratio between the values of the dependent variable in this dataset.
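
The core idea behind SMOTE is to create synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. The sketch below is a simplified NumPy illustration of that idea, not the imblearn implementation used in this notebook; the `smote_like` function and its parameters are hypothetical.

```python
import numpy as np


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from row i to every minority row.
        dist = np.linalg.norm(minority - minority[i], axis=1)
        # Indices of the k nearest neighbours, skipping row i itself
        # (assumes there are no duplicate rows).
        neighbours = np.argsort(dist)[1:k + 1]
        j = rng.choice(neighbours)
        # Pick a random point on the segment between row i and row j.
        gap = rng.random()
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)


minority_rows = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]])
new_rows = smote_like(minority_rows, n_new=6)
print(new_rows.shape)
```

Because every synthetic row lies on a segment between two real minority rows, it stays inside the region the minority class already occupies.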


In [None]:
#TODO: How many random samples do we need?
smote = SMOTE(random_state=101)
X_train_new, y_train_new = smote.fit_resample(X_train, y_train.ravel())
X_train_new

Unnamed: 0,age,sex,family_status,genetic,status,loss_of_parent,divorce,Injury,Spiritual_consult,religion_C,...,occupation_RETIRED,occupation_STUDENT,occupation_UNEMPLYD,agecode_1,agecode_2,agecode_3,shizopherania,vascula_demetia,MBD,Bipolar
0,68,1,0,0,0,1,0,0,1,0,...,1,0,0,0,0,1,1,1,1,1
1,19,0,1,0,1,1,0,0,1,0,...,0,0,0,1,0,0,1,1,1,1
2,19,1,0,1,1,0,0,0,1,0,...,0,1,0,1,0,0,1,1,0,0
3,21,0,0,0,1,1,0,0,1,1,...,0,1,0,1,0,0,1,1,0,0
4,56,0,0,0,0,1,1,1,1,0,...,0,0,0,0,0,1,1,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397,55,0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,1,0,1,0,1
398,34,0,1,0,0,0,0,0,0,0,...,0,0,1,0,1,0,1,1,1,1
399,56,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,1,1,1,0,1
400,56,0,0,0,0,1,0,0,1,0,...,0,0,0,0,0,1,1,1,0,1


##PART 7: Multi-Label Classification with non-balanced data using Classic ML models
In this section, we begin to evaluate the original study by running the standard machine learning algorithms on non-balanced datasets to check their accuracy values. This is done by splitting the dataset into a training and testing set. Since the original study used MLP, Support Vector Machine (SVM), RandomForest, and DecisionTree, we use the same Machine Learning models.

In this part, the models are first trained with the training set and then evaluated with the test set, and their accuracy values are printed. These are the accuracy values for the Multi-Label unbalanced data. The best of these algorithms will then be compared against the Multi-Label Neural Network on unbalanced data in Part 11.

In our testing, on average, MLP was the most accurate algorithm. With that being said, your results may vary, as MLP was not the most accurate on every single run.
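
Since the scores change from run to run, one way to compare models more fairly is to average the accuracy over several seeded splits. The sketch below does this for a decision tree on synthetic data; the data, the number of seeds, and the variable names are assumptions, and the real notebook would use `X_5` and `y_5` instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real notebook would use X_5 and y_5.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

scores = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=5, random_state=seed)
    model.fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(f"mean accuracy: {np.mean(scores):.3f} (std {np.std(scores):.3f})")
```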

In [None]:
data_temp = data2.copy()
y_5 = data_temp.pop('target')
X_5 = data_temp
X_train, X_test, y_train, y_test = train_test_split(X_5, y_5, test_size=0.2)

result_table = []
MLP = MLPClassifier(alpha=1, max_iter=1000)
MLP.fit(X_train, y_train)
score_MLP = MLP.score(X_test, y_test)
print("MLP: ", score_MLP)
result_table.append(f'MLP,Non-Balanced,{score_MLP}')

svc = LinearSVC(C=0.025, max_iter=10000)
svc.fit(X_train, y_train)
score_svm = svc.score(X_test, y_test)
print("SVM: ", score_svm)
result_table.append(f'SVM,Non-Balanced,{score_svm}')

#RF = RandomForestClassifier(max_depth=5, n_estimators=100, max_features=1)
RF = RandomForestClassifier(max_depth=5, n_estimators=100)
RF.fit(X_train, y_train)
score_RF = RF.score(X_test, y_test)
print("RandomForest: ", score_RF)
result_table.append(f'RandomForest,Non-Balanced,{score_RF}')

decistion_tree = DecisionTreeClassifier(max_depth=5)
decistion_tree.fit(X_train, y_train)
score_decistion_tree = decistion_tree.score(X_test, y_test)
print("DecisionTree: ", score_decistion_tree)
result_table.append(f'DecisionTree,Non-Balanced,{score_decistion_tree}')


MLP:  0.3711340206185567
SVM:  0.3711340206185567
RandomForest:  0.4020618556701031
DecisionTree:  0.4020618556701031


##PART 8: Multi-Label Classification with balanced data using Classic ML models
This part is very similar to Part 7, but we now train the classical Machine Learning models on balanced data instead of unbalanced data. The best model from this section will be compared against the Multi-Label Neural Network from Part 12.

In our testing, on average, the Random Forest model was the most accurate. With that being said, your results may vary, as Random Forest was not the most accurate on every single run.


In [None]:
data_temp = data2.copy()
#print(data2)
y_6 = data_temp.pop('target')
X_6 = data_temp
X_train, X_test, y_train, y_test = train_test_split(X_6, y_6, test_size=0.2)

"""TODO: Fix smote here"""
smote = SMOTE(random_state=101)
X_train_new, y_train_new = smote.fit_resample(X_train, y_train.ravel())

MLP = MLPClassifier(alpha=1, max_iter=1000)
MLP.fit(X_train_new, y_train_new)
score_MLP = MLP.score(X_test, y_test)
print(f"MLP score is : ", score_MLP)
result_table.append(f'MLP,Balanced,{score_MLP}')

svc = LinearSVC(C=0.025, max_iter=10000)
svc.fit(X_train_new, y_train_new)
score_svm = svc.score(X_test, y_test)
print(f"SVM score is: ", score_svm)
result_table.append(f'SVM,Balanced,{score_svm}')

#RF = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
RF = RandomForestClassifier(max_depth=5, n_estimators=100)
RF.fit(X_train_new, y_train_new)
score_RF = RF.score(X_test, y_test)
print(f"RandomForest score is: ", score_RF)
result_table.append(f'RandomForest,Balanced,{score_RF}')

decistion_tree = DecisionTreeClassifier(max_depth=5)
decistion_tree.fit(X_train_new, y_train_new)
score_decistion_tree = decistion_tree.score(X_test, y_test)
print("DecisionTree score is: ", score_decistion_tree)
result_table.append(f'DecisionTree,Balanced,{score_decistion_tree}')


MLP score is :  0.35051546391752575
SVM score is:  0.36082474226804123
RandomForest score is:  0.3711340206185567
DecisionTree score is:  0.35051546391752575


##PART 9: Single Label Classification Neural Network with unbalanced data
In this part we build and evaluate the Single Label Neural Network that is mentioned in the original study. In this case, we do not use target, since we want the accuracy of only ONE dependent variable at a time, not all five.

The Neural Network model is defined in the function called NeuralNetwork_A. Our code calls this function once per dependent variable (target PDD); each call builds a new model, fits it with the training samples, and evaluates it on the validation and test data.


In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense,Input,Dropout,Flatten,Conv2D,MaxPool2D

target_names = ['Insominia', 'shizopherania', 'vascula_demetia', 'MBD', 'Bipolar']

def NeuralNetwork_A(X, y):
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

  X_train_test, X_validation, y_train_test, y_validation = train_test_split(X_train, y_train, test_size=0.3)



  model = Sequential()
  model.add(Dense(15, input_dim=25, activation='relu', name='Input'))
  model.add(Dropout(0.4))
  model.add(Dense(20, activation='relu', name='Hidden1'))
  model.add(Dropout(0.4))
  model.add(Dense(40, activation='relu', name='Hidden2'))
  model.add(Dropout(0.4))
  model.add(Dense(50, activation='relu', name='Hidden3'))
  model.add(Dropout(0.4))
  model.add(Dense(1, activation='sigmoid', name='Output'))

  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

  X_train_np = np.asarray(X_train_test).astype('int')

  X_train_tensor = tf.convert_to_tensor(X_train_np)
  y_train_tensor = tf.convert_to_tensor(y_train_test)

  #print(X_train_tensor.shape)
  #print(y_train_tensor.shape)

  model.fit(X_train_tensor, y_train_tensor, epochs=40, batch_size=10)


  # evaluate the model with the validation data
  X_validate_np = np.asarray(X_validation).astype('int')

  X_validate_tensor = tf.convert_to_tensor(X_validate_np)
  y_validate_tensor = tf.convert_to_tensor(y_validation)

  r_val = model.evaluate(X_validate_tensor, y_validate_tensor, batch_size=10)

  #evaluate the model with test data
  X_test_np = np.asarray(X_test).astype('int')

  X_test_tensor = tf.convert_to_tensor(X_test_np)
  y_test_tensor = tf.convert_to_tensor(y_test)

  r_test = model.evaluate(X_test_tensor, y_test_tensor, batch_size=10)

  return r_val, r_test

def prep_data(X, target):
  y = X.pop(target)
  return X, y

results1 = []
result_table2 = []
for t in target_names:
  x_i = data_orig.copy()
  x_i, y_i = prep_data(x_i, t)
  r_val, r_test = NeuralNetwork_A(x_i, y_i)
  results1.append(f"{t} - Validation loss & accuracy: {r_val}")
  results1.append(f"{t} - Test loss & accuracy: {r_test}")
  result_table2.append(f'{t},Non-Balanced,Validation,{r_val[1]}')
  result_table2.append(f'{t},Non-Balanced,Test,{r_test[1]}')

for r in results1:
  print(r)


Epoch 1/40
...
Epoch 40/40
...

##PART 10: Single Label Classification with balanced data using a Neural Network
This part is very similar to Part 9, though we now use SMOTE to balance our data before passing it into the Neural Network. The Neural Network model here has the same layers as in Part 9 and operates in the same fashion. Our code loops through each dependent variable (target PDD) and calls our function, which creates a new Neural Network. The network is then fitted to the training data and evaluated on the validation and test data.

In [None]:
def NeuralNetwork_B(X, y):
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

  X_train_test, X_validation, y_train_test, y_validation = train_test_split(X_train, y_train, test_size=0.3)

  # SMOTE the training data
  smote = SMOTE(random_state=101)
  X_train_new, y_train_new = smote.fit_resample(X_train_test, y_train_test.ravel())

  # SMOTE the validation data
  X_validate_new, y_validate_new = smote.fit_resample(X_validation, y_validation.ravel())


  model = Sequential()
  model.add(Dense(15, input_dim=25, activation='relu', name='Input'))
  model.add(Dropout(0.4))
  model.add(Dense(20, activation='relu', name='Hidden1'))
  model.add(Dropout(0.4))
  model.add(Dense(40, activation='relu', name='Hidden2'))
  model.add(Dropout(0.4))
  model.add(Dense(50, activation='relu', name='Hidden3'))
  model.add(Dropout(0.4))
  model.add(Dense(1, activation='sigmoid', name='Output'))

  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

  # convert the data to the proper tensor type
  X_train_np = np.asarray(X_train_new).astype('int')

  X_train_tensor = tf.convert_to_tensor(X_train_np)
  y_train_tensor = tf.convert_to_tensor(y_train_new)

  #train the model
  model.fit(X_train_tensor, y_train_tensor, epochs=40, batch_size=10)

  # evaluate the model with the validation data
  X_validate_np = np.asarray(X_validate_new).astype('int')

  X_validate_tensor = tf.convert_to_tensor(X_validate_np)
  y_validate_tensor = tf.convert_to_tensor(y_validate_new)

  r_val = model.evaluate(X_validate_tensor, y_validate_tensor, batch_size=10)

  #evaluate the model with test data
  X_test_np = np.asarray(X_test).astype('int')

  X_test_tensor = tf.convert_to_tensor(X_test_np)
  y_test_tensor = tf.convert_to_tensor(y_test)

  r_test = model.evaluate(X_test_tensor, y_test_tensor, batch_size=10)

  return r_val, r_test

results2 = []
for t in target_names:
  x_i = data3.copy()
  del x_i['target']
  x_i, y_i = prep_data(x_i, t)
  r_val, r_test = NeuralNetwork_B(x_i, y_i)
  results2.append(f"{t} - Validation loss & accuracy: {r_val}")
  results2.append(f"{t} - Test loss & accuracy: {r_test}")
  result_table2.append(f'{t},Balanced,Validation,{r_val[1]}')
  result_table2.append(f'{t},Balanced,Test,{r_test[1]}')
for r in results2:
  print(r)


Epoch 1/40
...
Epoch 40/40
...

##PART 11: Multi-Label classification with non-balanced data using Neural Network

In this part, we build and evaluate the Multi-Label Neural Network described in the original study, and then get its accuracy on unbalanced data. Since this is a multi-label Neural Network, the target is provided as an [N, 5] matrix, where N is the number of samples and each of the 5 columns is one dependent variable (class). The output of the model is 5 values, one per target, each of which is the network's prediction for that target.
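
To illustrate how a five-value sigmoid output relates to the five targets, the sketch below thresholds made-up probabilities at 0.5 and scores them in two common ways: per label, and as an exact match across all five labels. The numbers are invented and are not results from this notebook.

```python
import numpy as np

# Made-up sigmoid outputs for 3 samples and 5 labels.
probabilities = np.array([
    [0.9, 0.2, 0.7, 0.1, 0.6],
    [0.4, 0.8, 0.3, 0.9, 0.2],
    [0.6, 0.6, 0.1, 0.3, 0.8],
])
y_true = np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
])

# Threshold each output independently at 0.5.
y_pred = (probabilities >= 0.5).astype(int)

per_label_accuracy = (y_pred == y_true).mean(axis=0)
exact_match_accuracy = (y_pred == y_true).all(axis=1).mean()

print(per_label_accuracy, exact_match_accuracy)
```

Exact-match accuracy is the stricter of the two, because a sample only counts as correct when all five labels are right.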

This model works by first taking the 5 dependent variables out of the dataset, and then splitting the dataset into train, validation, and test sets.
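
The two chained calls to `train_test_split` (20% test, then 30% of the remainder for validation) give an overall split of about 56% train, 24% validation, and 20% test. A small sketch on synthetic data shows the resulting sizes; the data and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(1000).reshape(500, 2)
y_demo = np.arange(500) % 2

# First hold out 20% as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
# Then hold out 30% of the remainder as the validation set.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.3, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))
```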

From here, the Neural Network is created via the keras API, and the layers are defined as described in the original study. 

Because the Neural Network uses tensors, we need to convert the X and y values into tensors before they can be passed into the model. The training set is passed into the fit function, and the validation and test sets are passed into the evaluate function.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense,Input,Dropout,Flatten,Conv2D,MaxPool2D

data_temp_2 = data3.copy()

y = pd.DataFrame([data_temp_2.pop(x) for x in ['Insominia', 'shizopherania', 'vascula_demetia', 'MBD', 'Bipolar']]).T

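# NOTE: data3 still holds the concatenated 'target' column, so X has 22 columns
# (21 features plus 'target') and the label string is fed to the network as a feature.
X = data_temp_2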

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train_test, X_validation, y_train_test, y_validation = train_test_split(X_train, y_train, test_size=0.3)


model = Sequential()
model.add(Dense(15, input_dim=22, activation='relu', name='Input'))
model.add(Dropout(0.03))
model.add(Dense(20, activation='relu', name='Hidden1'))
model.add(Dropout(0.03))
model.add(Dense(20, activation='relu', name='Hidden2'))
model.add(Dropout(0.03))
model.add(Dense(40, activation='relu', name='Hidden3'))
model.add(Dropout(0.03))
model.add(Dense(5, activation='sigmoid', name='Output'))

model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), metrics=['accuracy'])

X_train_np = np.asarray(X_train_test).astype('int')

X_train_tensor = tf.convert_to_tensor(X_train_np)
y_train_tensor = tf.convert_to_tensor(y_train_test)

model.fit(X_train_tensor, y_train_tensor, epochs=40, batch_size=10)


# evaluate the model with the validation data
X_validate_np = np.asarray(X_validation).astype('int')

X_validate_tensor = tf.convert_to_tensor(X_validate_np)
y_validate_tensor = tf.convert_to_tensor(y_validation)

res_val_multi = model.evaluate(X_validate_tensor, y_validate_tensor, batch_size=10)
print(f"Validation Loss and Accuracy is {res_val_multi}")
result_table2.append(f'Multi-Label,Non-Balanced,Validation,{res_val_multi[1]}')

#evaluate the model with test data
X_test_np = np.asarray(X_test).astype('int')

X_test_tensor = tf.convert_to_tensor(X_test_np)
y_test_tensor = tf.convert_to_tensor(y_test)

res_test_multi = model.evaluate(X_test_tensor, y_test_tensor, batch_size=10)
print(f"Test Loss and Accuracy is {res_test_multi}")
result_table2.append(f'Multi-Label,Non-Balanced,Test,{res_test_multi[1]}')


Epoch 1/40
...
Epoch 40/40
Validation Loss and Accuracy is [0.38085827231407166, 0.5299145579338074]
Test Loss and Accuracy is [0.4045514464378357, 0.5567010045051575]


#PART 12: Multi-Label classification with balanced data using Neural Network

In the final section, we build the Multi-Label Neural Network model for multi-label classification on the balanced dataset, as described in the original study. This section has 3 main steps.
1. Balance the dataset with the SMOTE algorithm by providing a single 'target' column in which all 5 binary targets are concatenated.
2. Split the 'target' column of the balanced data created in step 1 back into 5 separate classes.
3. Pass the dataset and the targets into the Neural Network model and evaluate it, following the same structure as Part 11.
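The round trip between the 5 binary targets and the single concatenated 'target' string in steps 1 and 2 can be sketched as follows. The helper names `encode_target` and `decode_target` are hypothetical, and the example row is made up. The column names are the ones used in this notebook.

```python
LABELS = ['Insominia', 'shizopherania', 'vascula_demetia', 'MBD', 'Bipolar']

def encode_target(row):
    """Concatenate 5 binary values into one string, e.g. [1, 0, 0, 1, 0] -> '10010'."""
    return ''.join(str(int(value)) for value in row)

def decode_target(target):
    """Split the concatenated string back into a list of 5 integers."""
    return [int(char) for char in target]

row = [1, 0, 0, 1, 0]            # made-up example row
target = encode_target(row)
print(target)                    # 10010
print(dict(zip(LABELS, decode_target(target))))
```

Treating each distinct string as one class lets a single-label resampler such as SMOTE balance label combinations rather than individual labels.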

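The Neural Network has the same layer structure as in Part 11. The differences are the pre-work required to balance the dataset and, in the code below, an input dimension of 21 instead of 22 and a dropout rate of 0.1 instead of 0.03.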
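The balancing pre-work relies on SMOTE, which creates synthetic rows for under-represented classes by interpolating between a row and one of its nearest neighbours of the same class. The sketch below illustrates only that interpolation idea in NumPy. It is a simplification written for this explanation, not the imbalanced-learn implementation used in the notebook, and the function name `smote_like_samples`, its parameters, and the toy data are assumptions.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows from the rows of one minority class.

    Each synthetic row lies on the line segment between a randomly chosen
    row and one of its k nearest neighbours (requires more than k rows).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances; the diagonal is excluded from the search
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic[i] = X[base] + gap * (X[neighbour] - X[base])
    return synthetic

# Made-up minority-class rows with two features
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(smote_like_samples(X_minority, n_new=3).shape)  # (3, 2)
```

Because synthetic rows are built from existing ones, resampling before the train/test split means the test set can contain rows derived from training rows, which is worth keeping in mind when reading the balanced results.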

In [None]:
data_temp = data2.copy()

y = data_temp.pop('target')

X = data_temp

# Apply SMOTE to the full dataset (this happens before the train/test split)
smote = SMOTE(random_state=101)
X_new, y_new = smote.fit_resample(X, y.ravel())

# Split the target class into 5 separate columns, one character per label.
y_new2 = [list(val) for val in y_new.ravel()]

y_new3 = pd.DataFrame(y_new2, columns = ['Insominia', 'shizopherania', 'vascula_demetia', 'MBD', 'Bipolar']).astype('int')

# Split into train/test

X_train, X_test, y_train, y_test = train_test_split(X_new, y_new3, test_size=0.2)

X_train_test, X_validation, y_train_test, y_validation = train_test_split(X_train, y_train, test_size=0.3)


# The NN model
model = Sequential()
model.add(Dense(15, input_dim=21, activation='relu', name='Input'))
model.add(Dropout(0.1))
model.add(Dense(20, activation='relu', name='Hidden1'))
model.add(Dropout(0.1))
model.add(Dense(20, activation='relu', name='Hidden2'))
model.add(Dropout(0.1))
model.add(Dense(40, activation='relu', name='Hidden3'))
model.add(Dropout(0.1))
model.add(Dense(5, activation='sigmoid', name='Output'))

model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), metrics=['accuracy'])

X_train_np = np.asarray(X_train_test).astype('int')

X_train_tensor = tf.convert_to_tensor(X_train_np)
y_train_tensor = tf.convert_to_tensor(y_train_test)

model.fit(X_train_tensor, y_train_tensor, epochs=40, batch_size=10)


# evaluate the model with the validation data
X_validate_np = np.asarray(X_validation).astype('int')

X_validate_tensor = tf.convert_to_tensor(X_validate_np)
y_validate_tensor = tf.convert_to_tensor(y_validation)

res_val_multi = model.evaluate(X_validate_tensor, y_validate_tensor, batch_size=10)
print(f'Validation Loss and Accuracy is {res_val_multi}')
result_table2.append(f'Multi-Label,Balanced,Validation,{res_val_multi[1]}')

X_test_np = np.asarray(X_test).astype('int')

X_test_tensor = tf.convert_to_tensor(X_test_np)
y_test_tensor = tf.convert_to_tensor(y_test)

res_test_multi = model.evaluate(X_test_tensor, y_test_tensor, batch_size=10)
print(f'Test Loss and Accuracy is {res_test_multi}')
result_table2.append(f'Multi-Label,Balanced,Test,{res_test_multi[1]}')

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40
Validation Loss and Accuracy is [0.5269338488578796, 0.29209622740745544]
Test Loss and Accuracy is [0.5487619042396545, 0.32510289549827576]


In [None]:
print('Machine learning models accuracy\n')
for r in result_table:
  l = r.split(",")
  #print(l)
  print ("{:<15} {:<15} {:<15}".format(l[0], l[1], l[2]))
  #print(' '.join(map(str, l)))
print('\n\nDeep learning models accuracy\n')
for r in result_table2:
  l = r.split(",")
  print ("{:<15} {:<15} {:<15} {:<15}".format(l[0], l[1], l[2], l[3]))


Machine learning models accuracy

MLP             Non-Balanced    0.3711340206185567
SVM             Non-Balanced    0.3711340206185567
RandomForest    Non-Balanced    0.4020618556701031
DecisionTree    Non-Balanced    0.4020618556701031
MLP             Balanced        0.35051546391752575
SVM             Balanced        0.36082474226804123
RandomForest    Balanced        0.3711340206185567
DecisionTree    Balanced        0.35051546391752575


Deep learning models accuracy

Insominia       Non-Balanced    Validation      0.5916666388511658
Insominia       Non-Balanced    Test            0.6200000047683716
shizopherania   Non-Balanced    Validation      0.8083333373069763
shizopherania   Non-Balanced    Test            0.8600000143051147
vascula_demetia Non-Balanced    Validation      0.75           
vascula_demetia Non-Balanced    Test            0.699999988079071
MBD             Non-Balanced    Validation      0.5916666388511658
MBD             Non-Balanced    Test            0.5      