
##**Disease Prediction**

In this project, we look at a pair of datasets that map sets of symptoms to a prognosis. The main objective is to predict which disease is most likely, given the symptoms that occur. To do so, we use several machine learning algorithms: the 'train' dataset is fed to each algorithm so it can learn the patterns, and the resulting model is then evaluated on the 'test' dataset.

The project follows three main steps:
1.   Calling in the libraries and datasets
2.   Performing a light data exploration on the datasets
3.   Building the prediction models


Our thanks to the provider of this pair of datasets.
Source: https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning

##**I. Calling in the Libraries and Datasets**

###Ia. Libraries

In [3]:
# pandas to help us view and manipulate the data in tabular form
import pandas as pd

# sklearn to help us in the model building part
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# termcolor to help us color printed strings
from termcolor import colored

###Ib. Datasets


In [4]:
train = pd.read_csv('Training.csv')
test = pd.read_csv('Testing.csv')

##**II. Light Data Exploration**

###IIa. General overview

The provider of the datasets gives the following information:
1. Each CSV file has 133 columns. 132 of these columns are symptoms that a person experiences, and the last column is the prognosis.

2. These symptoms are mapped to 42 diseases, which are the classes a set of symptoms can be assigned to.

In [5]:
train.head()

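|   | itching | skin_rash | nodal_skin_eruptions | continuous_sneezing | shivering | chills | joint_pain | stomach_pain | acidity | ulcers_on_tongue | ... | scurring | skin_peeling | silver_like_dusting | small_dents_in_nails | inflammatory_nails | blister | red_sore_around_nose | yellow_crust_ooze | prognosis | Unnamed: 133 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |
| 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |
| 3 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |
| 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fungal infection | NaN |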


It seems that each symptom is stored as a boolean value, where 1 stands for True and 0 for False. Each symptom feature should therefore have only two unique values, '1' and '0'. We will check this later in section IIc.

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4920 entries, 0 to 4919
Columns: 134 entries, itching to Unnamed: 133
dtypes: float64(1), int64(132), object(1)
memory usage: 5.0+ MB


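The 'train' dataset consists of 4920 entries and 134 columns. That is one more than the 133 columns described by the provider; the extra column is 'Unnamed: 133'.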

###IIb. Missing values

In [7]:
pd.set_option('display.max_rows',None)
train.isna().sum()

itching                              0
skin_rash                            0
nodal_skin_eruptions                 0
continuous_sneezing                  0
shivering                            0
chills                               0
joint_pain                           0
stomach_pain                         0
acidity                              0
ulcers_on_tongue                     0
muscle_wasting                       0
vomiting                             0
burning_micturition                  0
spotting_ urination                  0
fatigue                              0
weight_gain                          0
anxiety                              0
cold_hands_and_feets                 0
mood_swings                          0
weight_loss                          0
restlessness                         0
lethargy                             0
patches_in_throat                    0
irregular_sugar_level                0
cough                                0
high_fever               

The report above shows that there are no missing values, except in the 'Unnamed: 133' feature, which has 4920 missing values (it is entirely empty). We will drop this feature since it does not provide any insight.

In [8]:
train.drop('Unnamed: 133', axis = 1, inplace = True)
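
As an alternative to naming the column by hand, the same cleanup can be sketched with `dropna`, which removes every column that is entirely empty. The `toy` frame below is a hypothetical stand-in for 'train', not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for 'train' (the real data comes from Training.csv)
toy = pd.DataFrame({
    'itching': [1, 0, 1],
    'skin_rash': [1, 1, 0],
    'prognosis': ['Fungal infection', 'Allergy', 'Fungal infection'],
    'Unnamed: 133': [np.nan, np.nan, np.nan],
})

# Names of the columns in which every value is missing
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop all fully empty columns without naming them explicitly
cleaned = toy.dropna(axis=1, how='all')

print(empty_cols)
print(cleaned.columns.tolist())
```

This approach keeps working if a dataset has more than one empty column, at the cost of being less explicit about what is removed.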

###IIc. Unique values

####Symptom features

In [9]:
# Getting the feature names in a list
listtrainCols = list(train.columns.values[:-1])

# Calculating the number of each feature's unique values
ListUVcount = []
for i in train.columns[:-1]:
    UVcount = len(train[i].unique())
    ListUVcount.append(UVcount)

# Storing the unique values of each feature in a list
ListUV = []
for i in train.columns[:-1]:
    UV = list(train[i].unique())
    ListUV.append(UV)

# Creating the unique values table
Table = pd.DataFrame(list(zip(listtrainCols, ListUVcount, ListUV)))
Table.columns = ['Feature', 'Number of Unique Values', 'Unique Values']

# Displaying the report
print('Unique values in each feature:')
print(Table.to_string(index = False))

Unique values in each feature:
                       Feature  Number of Unique Values Unique Values
                       itching                        2        [1, 0]
                     skin_rash                        2        [1, 0]
          nodal_skin_eruptions                        2        [1, 0]
           continuous_sneezing                        2        [0, 1]
                     shivering                        2        [0, 1]
                        chills                        2        [0, 1]
                    joint_pain                        2        [0, 1]
                  stomach_pain                        2        [0, 1]
                       acidity                        2        [0, 1]
              ulcers_on_tongue                        2        [0, 1]
                muscle_wasting                        2        [0, 1]
                      vomiting                        2        [0, 1]
           burning_micturition                        2    

Each symptom feature has only the two unique values '0' and '1', which indicates that there are no abnormal values, except for the 'fluid_overload' feature. Let's take a closer look at the values in the 'fluid_overload' feature.

In [10]:
# Checking the value count of the 'fluid_overload' feature
print("Count of value '0' in the 'fluid_overload' feature : " + str((train['fluid_overload'] == 0).sum()))

Count of value '0' in the 'fluid_overload' feature : 4920


The 'fluid_overload' feature has 4920 entries with the value '0', which means that all of its values are '0' and it indeed has only 1 unique value. A constant feature like this cannot help a model tell the prognoses apart. Let's move on to the next part.
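
The two checks in this section (binary values only, and constant features) can also be sketched in a vectorized form. The `toy` frame is a hypothetical stand-in for 'train' and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for 'train'
toy = pd.DataFrame({
    'itching': [1, 0, 1, 0],
    'fluid_overload': [0, 0, 0, 0],
    'chills': [0, 1, 1, 0],
    'prognosis': ['Allergy', 'GERD', 'Allergy', 'GERD'],
})

# Keep only the symptom features
symptoms = toy.drop(columns='prognosis')

# Number of unique values per symptom feature
n_unique = symptoms.nunique()

# Features with a single unique value carry no information
constant_cols = n_unique[n_unique == 1].index.tolist()

# True only if every cell in every symptom feature is 0 or 1
is_binary = bool(symptoms.isin([0, 1]).all().all())

print(constant_cols)
print(is_binary)
```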

####'prognosis' feature

In [11]:
# Storing the unique values of 'prognosis' feature in a list
listUV = train['prognosis'].unique()

# Reporting the number of the unique values
print("Number of unique values in 'prognosis' feature : " + str(len(listUV)) + '\n')

# Reporting the unique values and each of its count
listUVcount = []
for i in listUV:
    UVcount = (train['prognosis'] == i).sum()
    listUVcount.append(UVcount)

# Creating a table for the report:
table = pd.DataFrame(list(zip(listUV, listUVcount)))
table.columns = ['Prognosis', 'Count']

print("Unique values of the 'prognosis' feature and each of its count : ")
print(table)

Number of unique values in 'prognosis' feature : 41

Unique values of the 'prognosis' feature and each of its count : 
                                  Prognosis  Count
0                          Fungal infection    120
1                                   Allergy    120
2                                      GERD    120
3                       Chronic cholestasis    120
4                             Drug Reaction    120
5                       Peptic ulcer diseae    120
6                                      AIDS    120
7                                 Diabetes     120
8                           Gastroenteritis    120
9                          Bronchial Asthma    120
10                            Hypertension     120
11                                 Migraine    120
12                     Cervical spondylosis    120
13             Paralysis (brain hemorrhage)    120
14                                 Jaundice    120
15                                  Malaria    120
16             

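Every prognosis has exactly 120 entries, which is a good thing: the classes are perfectly balanced and each one has an adequate number of samples. Note that the check finds 41 unique prognoses in the 'train' dataset, one fewer than the 42 diseases stated by the provider.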
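
The same class-balance report can be sketched more compactly with `value_counts`. The `labels` series below is a hypothetical stand-in for `train['prognosis']`.

```python
import pandas as pd

# Hypothetical labels standing in for train['prognosis']
labels = pd.Series(['Allergy', 'GERD', 'Allergy', 'GERD', 'AIDS', 'AIDS'],
                   name='prognosis')

# Count of entries per prognosis
counts = labels.value_counts()

# Number of distinct prognoses
n_classes = labels.nunique()

# The classes are balanced if every prognosis has the same count
is_balanced = counts.nunique() == 1

print(counts)
print(n_classes, is_balanced)
```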

##**III. Model Building**

###IIIa. Splitting the 'train' and 'test' datasets

In [12]:
# X_train and X_test contain the set of symptoms
# y_train and y_test contain the prognosis, which is the answer to the symptoms
X_train = train.drop('prognosis', axis = 1)
y_train = train['prognosis']
X_test = test.drop('prognosis', axis = 1)
y_test = test['prognosis']
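
Because the two datasets come from separate files, it can be worth confirming that they expose the same symptom features in the same order before fitting. The sketch below uses hypothetical toy frames rather than the real 'train' and 'test' data.

```python
import pandas as pd

# Hypothetical toy frames; note that the column order differs between them
toy_train = pd.DataFrame({'itching': [1, 0], 'chills': [0, 1],
                          'prognosis': ['Allergy', 'GERD']})
toy_test = pd.DataFrame({'chills': [1], 'itching': [0],
                         'prognosis': ['GERD']})

X_train_toy = toy_train.drop(columns='prognosis')
X_test_toy = toy_test.drop(columns='prognosis')

# Features present in one set but not in the other
missing_in_test = set(X_train_toy.columns) - set(X_test_toy.columns)
extra_in_test = set(X_test_toy.columns) - set(X_train_toy.columns)

# Reorder the test features to match the training features
X_test_toy = X_test_toy[X_train_toy.columns]

print(missing_in_test, extra_in_test)
print(list(X_test_toy.columns))
```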

###IIIb. Decision Tree Classifier

In [28]:
# Setting up the model
DTClassifier = DecisionTreeClassifier(criterion = 'entropy',
                                      min_samples_leaf = 2)

# Fitting the 'train' data into the model
DTClassifier.fit(X_train, y_train)

# Predicting the 'test' dataset
DTpred = DTClassifier.predict(X_test)

# Getting the accuracy of the model
acc = DTClassifier.score(X_test, y_test)
print('Decision Tree Classifier model accuracy: {:.2f}%'.format(acc*100))

Decision Tree Classifier model accuracy: 100.00%


###IIIc. Random Forest

In [29]:
# Setting up the model
RFClassifier = RandomForestClassifier(criterion = 'entropy', 
                                      min_samples_leaf = 2)

# Fitting the 'train' data into the model
RFClassifier.fit(X_train, y_train)

# Predicting the 'test' dataset
RFpred = RFClassifier.predict(X_test)

# Getting the accuracy of the model
acc = RFClassifier.score(X_test, y_test)

print('Random Forest Classifier model accuracy: {:.2f}%'.format(acc*100))

Random Forest Classifier model accuracy: 100.00%
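
A random forest is built with random sampling, so repeated runs can produce slightly different models. A minimal sketch, assuming reproducibility is wanted, is to fix `random_state`. The toy data, the seed value, and the `fit_predict` helper are hypothetical and only serve the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 60 rows, 5 binary features, 3 classes
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(60, 5))
y_toy = np.repeat(['A', 'B', 'C'], 20)

def fit_predict(seed):
    """Fit a small forest with a fixed seed and predict on the toy data."""
    model = RandomForestClassifier(criterion='entropy',
                                   min_samples_leaf=2,
                                   n_estimators=20,
                                   random_state=seed)
    return model.fit(X_toy, y_toy).predict(X_toy)

# Two runs with the same seed give identical predictions
same_seed_match = np.array_equal(fit_predict(42), fit_predict(42))
print(same_seed_match)
```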


###IIId. Naive Bayes

In [30]:
# Setting up the model
gnb = GaussianNB()

# Fitting the 'train' data into the model
gnb.fit(X_train, y_train)

# Predicting the 'test' dataset
gnbpred = gnb.predict(X_test)

# Getting the accuracy of the model
acc = gnb.score(X_test, y_test)

print('GaussianNB model accuracy: {:.2f}%'.format(acc*100))

GaussianNB model accuracy: 100.00%
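
A perfect score is worth a sanity check. One possible check, sketched here on hypothetical toy frames, is to count how many test rows also appear verbatim in the training data, since such rows would make the test easier than it looks. This is only a diagnostic sketch and makes no claim about the actual datasets.

```python
import pandas as pd

# Hypothetical toy frames standing in for 'train' and 'test'
toy_train = pd.DataFrame({'itching': [1, 0, 1], 'chills': [0, 1, 1],
                          'prognosis': ['Allergy', 'GERD', 'AIDS']})
toy_test = pd.DataFrame({'itching': [1, 0], 'chills': [0, 0],
                         'prognosis': ['Allergy', 'GERD']})

# Left-merge on all shared columns; '_merge' marks rows found in both frames.
# drop_duplicates keeps each test row from being repeated in the result.
overlap = toy_test.merge(toy_train.drop_duplicates(), how='left', indicator=True)

# Number of test rows that also exist in the training data
n_seen = int((overlap['_merge'] == 'both').sum())

print('{} of {} test rows also appear in train'.format(n_seen, len(toy_test)))
```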


###IIIe. Models Accuracy Report

In [31]:
# Storing the predictions of each model in a list
predictions = [DTpred, RFpred, gnbpred]

# Storing the name of the models in a list
models = ['Decision Tree Classifier', 'Random Forest Classifier', 'Naive Bayes']

# Calculating the score of the models
for i, j in zip(predictions, models):
    report = classification_report(y_test, i)
    print(colored((j + ' Model Score'), 'blue', attrs = ['bold']))
    print(report)

[1m[34mDecision Tree Classifier Model Score[0m
                                         precision    recall  f1-score   support

(vertigo) Paroymsal  Positional Vertigo       1.00      1.00      1.00         1
                                   AIDS       1.00      1.00      1.00         1
                                   Acne       1.00      1.00      1.00         1
                    Alcoholic hepatitis       1.00      1.00      1.00         1
                                Allergy       1.00      1.00      1.00         1
                              Arthritis       1.00      1.00      1.00         1
                       Bronchial Asthma       1.00      1.00      1.00         1
                   Cervical spondylosis       1.00      1.00      1.00         1
                            Chicken pox       1.00      1.00      1.00         1
                    Chronic cholestasis       1.00      1.00      1.00         1
                            Common Cold       1.00      1.

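All three models reach 100% accuracy on the 'test' dataset. Note that the classification report shows a support of 1 for each listed prognosis, so this score is based on very few test samples per disease.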
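
Since the test set is small, a complementary estimate could come from cross-validation on the training data. The following is a minimal sketch on hypothetical random toy data, using scikit-learn's `StratifiedKFold` and `cross_val_score`; the number of folds and the seed are assumptions, and this is not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 3 classes with 20 rows each, 5 binary symptom features
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.integers(0, 2, size=(60, 5)),
                     columns=['symptom_{}'.format(k) for k in range(5)])
y_toy = pd.Series(np.repeat(['A', 'B', 'C'], 20))

# Stratified folds keep the class proportions the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Same settings as the decision tree above, plus a fixed seed
model = DecisionTreeClassifier(criterion='entropy',
                               min_samples_leaf=2,
                               random_state=0)

# One accuracy value per fold
scores = cross_val_score(model, X_toy, y_toy, cv=cv)

print('Fold accuracies:', np.round(scores, 2))
print('Mean accuracy: {:.2f}'.format(scores.mean()))
```

On the real data, `X_toy` and `y_toy` would be replaced by `X_train` and `y_train`; the spread of the fold accuracies gives a rough idea of how stable the score is.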