<a href="https://colab.research.google.com/github/arvynathaniel/Python/blob/main/Disease_Prediction_(Artificial_Neural_Network).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##**Disease Prediction**

In this project, we look at a pair of datasets containing the symptoms of diseases and their prognoses. The main objective is to predict which disease is most likely, given a set of observed symptoms. To do so, we feed the 'train' dataset to a machine learning model (an artificial neural network) so that it can learn the patterns, then test the model with the 'test' dataset.

The main steps performed in this project are:
1.   Calling in the libraries and datasets
2.   Data overview
3.   Model building

Our thanks to the provider of the original datasets.
Source: https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning

My data cleaning process for the original 'training' dataset:
https://colab.research.google.com/drive/1zioB8m0Xr5aJKFe0pc6qXyCbKP4ORF8i?usp=sharing

##**I. Calling in the Libraries and Datasets**

###Ia. Libraries

In [None]:
# pandas to help us view and manipulate the data in tabular form
import pandas as pd

# numpy to help us with mathematical operations
import numpy as np

# sklearn to help us in the model building part
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()

# tensorflow to help us in the model building part
import tensorflow as tf
from tensorflow import keras

###Ib. Datasets


In [None]:
train = pd.read_csv('Training Dataset (Cleaned) - Disease Prediction.csv')
test = pd.read_csv('Testing.csv')

##**II. Data Overview**

As a quick recap of what the data looks like, we display the first rows and the summary information of the dataset as follows:

In [None]:
train.head()

Unnamed: 0.1,Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
1,1,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
2,2,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
3,3,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
4,4,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection


In [None]:
# Dropping the unique identifier column
train.drop('Unnamed: 0', axis = 1, inplace = True)

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4920 entries, 0 to 4919
Columns: 133 entries, itching to prognosis
dtypes: int64(132), object(1)
memory usage: 5.0+ MB


The 'train' dataset consists of 4920 entries and 133 columns: 132 integer symptom columns and the 'prognosis' column.

##**III. Model Building**

####IIIa. Splitting the 'train' and 'test' datasets 

In [None]:
# X_train and X_test contain the set of symptoms
# y_train and y_test contain the prognosis, which is the answer to the symptoms
X_train = train.drop('prognosis', axis = 1)
y_train = train['prognosis']
X_test = test.drop('prognosis', axis = 1)
y_test = test['prognosis']

In [None]:
# Transforming the categorical values of y_train and y_test into numerical values
lb = LabelEncoder()
y_train_num = lb.fit_transform(y_train)
# Reuse the mapping learned from y_train so both sets share the same encoding
y_test_num = lb.transform(y_test)
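
The label-to-integer mapping is only guaranteed to match between the two sets when it is learned once and then reused. The plain-Python sketch below mimics the sorted-label mapping that `LabelEncoder` builds; the helper names and the toy labels are hypothetical and are not part of the notebook.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order (hypothetical helper)."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def encode(labels, mapping):
    """Replace each label with its integer code."""
    return [mapping[label] for label in labels]


# Toy data (made up): the test labels happen to miss one class
toy_train = ["Malaria", "Acne", "Dengue", "Acne"]
toy_test = ["Malaria", "Dengue"]

train_mapping = fit_label_mapping(toy_train)  # learned from the training labels
refit_mapping = fit_label_mapping(toy_test)   # learned again from the test labels

reused = encode(toy_test, train_mapping)  # codes consistent with training
refit = encode(toy_test, refit_mapping)   # codes shifted by the missing class

print(reused, refit)
```

When both sets contain exactly the same classes the two mappings coincide, but reusing the training mapping does not depend on that.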

####IIIb. Artificial Neural Network (ANN) parameter and model

In [None]:
# Shape of X_train to use as the input_shape
X_train.shape

(4920, 132)

In [None]:
# Number of possible outcomes (number of unique diseases)
len(y_train.unique())

41

The artificial neural network model will learn from a dataset that has 132 features and 41 possible outputs.

Considering the relatively large number of features involved in this learning process, we will try using 3 hidden layers in the model. The layers are illustrated as follows:

![Picture](https://drive.google.com/uc?export=view&id=1otK5OiExTq2yEqw8PLXmT3qyPF8DPhul)

Aside from the number of hidden layers, we also need to determine the number of neurons in each hidden layer. For this, we take the geometric mean of the sizes of the neighbouring layers: n = sqrt(input * output), where n0 = 132 is the input layer and n4 = 41 is the output layer. Applying the formula to each hidden layer gives:

*   n1 = sqrt (n0 * n2)
*   n2 = sqrt (n1 * n3)
*   n3 = sqrt (n2 * n4)

We then obtain:

*   n1 = 98.5431 ≈ 99
*   n2 = 73.5663 ≈ 74
*   n3 = 54.9201 ≈ 55
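
The three equations above can be solved directly, because together they make the layer sizes a geometric sequence between the input and output sizes. A minimal sketch of that arithmetic, assuming 132 inputs and 41 outputs as reported by `X_train.shape` and `len(y_train.unique())` (the variable names are illustrative):

```python
import math

n_input = 132   # n0: number of symptom features
n_output = 41   # n4: number of unique diseases

# Combining the three equations gives n2 = sqrt(n0 * n4),
# after which n1 and n3 follow from their own equations.
n2 = math.sqrt(n_input * n_output)
n1 = math.sqrt(n_input * n2)
n3 = math.sqrt(n2 * n_output)

# Round to whole neurons
hidden_sizes = [round(n) for n in (n1, n2, n3)]
print(hidden_sizes)  # [99, 74, 55]
```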

To connect the layers, we will use the Dense layer type, which is a regular fully connected neural network layer and one of the most commonly used in practice.

For the activation functions, we will try using the most commonly used ones, which are:

*   Rectified Linear Unit (ReLU) for the hidden layers
*   Softmax for the output layer
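
To make the two activation functions concrete, here is a plain-Python sketch of what each one computes. It is an illustration only, not the Keras implementation, and the input values are made up.

```python
import math


def relu(values):
    """ReLU keeps positive values and replaces negative ones with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Softmax turns raw scores into probabilities that sum to 1."""
    largest = max(values)
    # Subtracting the largest score avoids overflow and leaves the result unchanged
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


hidden = relu([-1.5, 0.0, 2.0])   # negative input is clipped to zero
probs = softmax([2.0, 1.0, 0.1])  # the largest score gets the largest probability

print(hidden)
print(probs, sum(probs))
```

Softmax suits the output layer because each output can be read as the probability of one disease, while ReLU is a cheap non-linearity for the hidden layers.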








In [None]:
# Model Building
model = keras.Sequential([# hidden layer 1
                          keras.layers.Dense(units = 99, input_shape = (132,)),
                          keras.layers.Activation('relu'),
                          # hidden layer 2 
                          keras.layers.Dense(units = 74),
                          keras.layers.Activation('relu'),
                          # hidden layer 3
                          keras.layers.Dense(units = 56),
                          keras.layers.Activation('relu'),
                          # output layer
                          keras.layers.Dense(42),
                          keras.layers.Activation('softmax')])

# Note: some parameters differ from the values derived above
# (56 units in hidden layer 3 instead of 55, 42 output units instead of 41)
# This is due to parameter tuning in order to increase model accuracy

# Model shape
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_12 (Dense)            (None, 99)                13167     
                                                                 
 activation_12 (Activation)  (None, 99)                0         
                                                                 
 dense_13 (Dense)            (None, 74)                7400      
                                                                 
 activation_13 (Activation)  (None, 74)                0         
                                                                 
 dense_14 (Dense)            (None, 56)                4200      
                                                                 
 activation_14 (Activation)  (None, 56)                0         
                                                                 
 dense_15 (Dense)            (None, 42)               

After its structure is built, the model needs to be compiled. Compilation requires a few more parameters:

*   'adam' for the optimizer, which is often used as a default optimizer because it usually converges quickly
*   'Sparse Categorical Cross Entropy' for the loss function, which calculates the cross-entropy loss between the integer-encoded labels and the predicted class probabilities
*   'accuracy' for the metrics, which calculates how often the predictions equal the labels
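
As a worked illustration of the loss and the metric, the sketch below computes both by hand for two made-up samples with three classes. It is a simplified plain-Python version, not the Keras implementation, and all names and numbers are hypothetical.

```python
import math


def sparse_categorical_crossentropy(true_labels, predicted_probs):
    """Mean of -log(probability assigned to the true class)."""
    losses = [-math.log(probs[label])
              for label, probs in zip(true_labels, predicted_probs)]
    return sum(losses) / len(losses)


def accuracy(true_labels, predicted_probs):
    """Share of samples whose most probable class equals the true label."""
    predicted = [probs.index(max(probs)) for probs in predicted_probs]
    hits = sum(1 for t, p in zip(true_labels, predicted) if t == p)
    return hits / len(true_labels)


toy_labels = [0, 2]                  # integer-encoded true classes
toy_probs = [[0.7, 0.2, 0.1],        # softmax output for sample 1
             [0.1, 0.3, 0.6]]        # softmax output for sample 2

toy_loss = sparse_categorical_crossentropy(toy_labels, toy_probs)
toy_accuracy = accuracy(toy_labels, toy_probs)
print(toy_loss, toy_accuracy)
```

The "sparse" variant takes integer labels such as those in `y_train_num` directly, so no one-hot encoding of the prognosis is needed.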


In [None]:
# Compiling the built model
model.compile(optimizer = 'adam',
              loss = keras.losses.SparseCategoricalCrossentropy(),
              metrics = ['accuracy'])

# Fitting the 'train' data into the model
model.fit(X_train, 
          y_train_num, 
          epochs = 5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f5a5ea44e90>

####IIIc. Model testing and scoring

Now, let us try predicting the results for the 'test' dataset with the model we have built.

In [None]:
# Predicting using the built model
modelpred = model.predict(X_test)

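# Calculating the loss and accuracy of the model on the 'test' dataset
# (evaluate returns [loss, accuracy] for the metrics compiled above)
loss, acc = model.evaluate(X_test, y_test_num, steps = 5)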
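
The output of `model.predict` is one row of class probabilities per sample, not a disease name. A minimal sketch of how such rows can be turned back into labels, using small made-up arrays as stand-ins for `modelpred` and the encoder's classes:

```python
import numpy as np

# Stand-in for model.predict(X_test): one row per sample, one column per class
toy_pred = np.array([[0.1, 0.7, 0.2],
                     [0.8, 0.1, 0.1]])

# Stand-in for the sorted class names held by the label encoder
toy_classes = np.array(["Acne", "Dengue", "Malaria"])

# Pick the most probable class in each row, then look up its name
predicted_index = np.argmax(toy_pred, axis=1)
predicted_name = toy_classes[predicted_index]
print(predicted_name)
```

With the notebook's own objects, the equivalent step would presumably be `lb.inverse_transform(np.argmax(modelpred, axis = 1))`.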



With the artificial neural network model that we have built, we predicted the outcomes of the 'test' dataset with 100% accuracy.