<a href="https://colab.research.google.com/github/arvynathaniel/Python/blob/main/Disease_Prediction_(Support_Vector_Machine).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Disease Prediction**

In this project, we look at a pair of datasets that map sets of symptoms to a prognosis. The main objective is to predict which disease a patient is likely to have, based on the symptoms that occur. To do so, we use a machine learning algorithm, the Support Vector Machine (SVM). We feed the 'train' dataset to the algorithm so it can learn the patterns, then test the resulting model on the 'test' dataset.

The main work sequence in this project:
1.   Calling in the libraries and datasets
2.   Data overview
3.   Model building

Our thanks to the provider of the original datasets.
Source: https://www.kaggle.com/datasets/kaushil268/disease-prediction-using-machine-learning

My data cleaning process for the original 'training' dataset:
https://colab.research.google.com/drive/1zioB8m0Xr5aJKFe0pc6qXyCbKP4ORF8i?usp=sharing

#**I. Calling in the Libraries and Datasets**

##Ia. Libraries

In [55]:
# pandas to help us view and manipulate the data in tabular form
import pandas as pd

# numpy to help us with mathematical operations
import numpy as np

# sklearn to help us in the model building part
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()

##Ib. Datasets


In [None]:
train = pd.read_csv('Training Dataset (Cleaned) - Disease Prediction.csv')
test = pd.read_csv('Testing.csv')

#**II. Data Overview**

As a quick recap of what the data looks like, we display the first rows and the summary information of the dataset as follows:

In [None]:
train.head()

Unnamed: 0.1,Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,0,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
1,1,0,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
2,2,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
3,3,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
4,4,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection


In [None]:
# Dropping the unique identifier column
train.drop('Unnamed: 0', axis = 1, inplace = True)
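
A column named `Unnamed: 0` usually appears when a DataFrame's index was written to CSV and then read back as ordinary data. As a minimal sketch (using a small in-memory CSV as a stand-in, not the project's actual file), passing `index_col=0` to `pd.read_csv` treats that column as the row index, so a separate drop is not needed:

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV that was saved together with its index
csv_text = (
    ",itching,skin_rash,prognosis\n"
    "0,1,1,Fungal infection\n"
    "1,0,1,Fungal infection\n"
)

# Default read: the unnamed first column shows up as 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 uses that column as the row index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```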

In [None]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4920 entries, 0 to 4919
Columns: 133 entries, itching to prognosis
dtypes: int64(132), object(1)
memory usage: 5.0+ MB


The 'train' dataset consists of 4920 entries and 133 columns: 132 integer symptom columns and 1 'prognosis' column.

#**III. Model Building**

##IIIa. Splitting the 'train' and 'test' datasets into features and labels

In [None]:
# X_train and X_test contain the set of symptoms
# y_train and y_test contain the prognosis, which is the answer to the symptoms
X_train = train.drop('prognosis', axis = 1)
y_train = train['prognosis']
X_test = test.drop('prognosis', axis = 1)
y_test = test['prognosis']
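
The model is fitted on the training features and then applied to the test features, so both should contain the same columns in the same order. A small sanity check along these lines can catch a mismatch before fitting. The sketch below uses tiny made-up frames (`demo_train`, `demo_test`) rather than the real files, and reordering the test columns to follow the training columns is only one possible way to align them:

```python
import pandas as pd

# Made-up stand-ins for the real datasets; note the different column order
demo_train = pd.DataFrame({'itching': [1, 0],
                           'skin_rash': [1, 1],
                           'prognosis': ['A', 'B']})
demo_test = pd.DataFrame({'skin_rash': [1],
                          'itching': [0],
                          'prognosis': ['B']})

X_demo_train = demo_train.drop('prognosis', axis=1)
X_demo_test = demo_test.drop('prognosis', axis=1)

# Columns that appear in one frame but not in the other
missing_in_test = set(X_demo_train.columns) - set(X_demo_test.columns)
extra_in_test = set(X_demo_test.columns) - set(X_demo_train.columns)
print('Missing in test:', missing_in_test)
print('Extra in test:', extra_in_test)

# Reorder the test columns so they follow the training column order
X_demo_test = X_demo_test[X_demo_train.columns]
print(list(X_demo_test.columns))
```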

##IIIb. Support Vector Machine parameter tuning and selection

###Parameter grid

In this SVM model, a few important parameters need to be set:

*   kernel: the function used to compare data points; it implicitly maps the data into a higher-dimensional space where the classes are easier to separate
*   C: the regularization parameter; a small C tolerates more misclassified training points in exchange for a wider margin, while a large C penalizes errors more heavily
*   gamma: the kernel coefficient for the 'poly', 'rbf', and 'sigmoid' kernels; for 'rbf', it controls how far the influence of a single training point reaches, with larger values meaning a shorter reach
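
To make the role of `gamma` concrete, the RBF kernel scores the similarity of two points as $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The sketch below evaluates this by hand on two made-up symptom vectors and compares the result with scikit-learn's `rbf_kernel`. The vectors are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two made-up binary symptom vectors (one row each)
x1 = np.array([[1.0, 1.0, 0.0, 0.0]])
x2 = np.array([[1.0, 0.0, 1.0, 0.0]])

# Squared Euclidean distance between the two vectors
sq_dist = np.sum((x1 - x2) ** 2)

for gamma in [0.001, 0.01, 0.1, 1.0]:
    manual = np.exp(-gamma * sq_dist)                # formula evaluated by hand
    library = rbf_kernel(x1, x2, gamma=gamma)[0, 0]  # scikit-learn's result
    print(f'gamma={gamma}: manual={manual:.4f}, sklearn={library:.4f}')
```

The similarity shrinks as `gamma` grows, which is why a large `gamma` lets each training point influence only its close neighbours.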



In [59]:
# kernel
kernel = ['poly', 'rbf', 'sigmoid']

# C
C = [0.1, 10, 100, 1000]

# gamma
gamma = [0.001, 0.01, 0.1, 1.0]

# Assigning the parameters into a parameter grid
parameter_grid = {'kernel': kernel,
                  'C': C,
                  'gamma': gamma}
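
As a quick check on the size of the search, the number of combinations in a grid like this one can be counted with scikit-learn's `ParameterGrid`. The sketch below repeats the grid under a hypothetical name (`demo_grid`) and assumes the default of 5 cross-validation folds, which is what `GridSearchCV` uses when `cv` is not set:

```python
from sklearn.model_selection import ParameterGrid

# Same values as the parameter grid above, repeated so the sketch is self-contained
demo_grid = {'kernel': ['poly', 'rbf', 'sigmoid'],
             'C': [0.1, 10, 100, 1000],
             'gamma': [0.001, 0.01, 0.1, 1.0]}

n_candidates = len(ParameterGrid(demo_grid))  # 3 kernels x 4 C values x 4 gamma values
n_folds = 5                                   # assumed default number of folds
print('Candidates:', n_candidates)
print('Total fits:', n_candidates * n_folds)
```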

###Optimal set of parameters search

From the parameter grid just created, we search for the best-performing set of parameters, or at least one of the better ones. The grid search fits the model with every combination of parameters and scores each combination with cross-validation.

In [61]:
# The machine learning model that will be used
svm = SVC()

# Setting up the grid search over the parameter grid
# (despite its name, svm_random is an exhaustive grid search, not a random one)
svm_random = GridSearchCV(estimator = svm,
                          param_grid = parameter_grid,
                          refit = True,
                          verbose = 2)

# Fitting the grid search on the training data
svm_random.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] END ....................C=0.1, gamma=0.001, kernel=poly; total time=   3.8s
[CV] END ....................C=0.1, gamma=0.001, kernel=poly; total time=   2.8s
[CV] END ....................C=0.1, gamma=0.001, kernel=poly; total time=   2.4s
[CV] END ....................C=0.1, gamma=0.001, kernel=poly; total time=   2.5s
[CV] END ....................C=0.1, gamma=0.001, kernel=poly; total time=   2.4s
[CV] END .....................C=0.1, gamma=0.001, kernel=rbf; total time=   3.6s
[CV] END .....................C=0.1, gamma=0.001, kernel=rbf; total time=   3.6s
[CV] END .....................C=0.1, gamma=0.001, kernel=rbf; total time=   4.5s
[CV] END .....................C=0.1, gamma=0.001, kernel=rbf; total time=   3.6s
[CV] END .....................C=0.1, gamma=0.001, kernel=rbf; total time=   3.6s
[CV] END .................C=0.1, gamma=0.001, kernel=sigmoid; total time=   3.3s
[CV] END .................C=0.1, gamma=0.001, k

GridSearchCV(estimator=SVC(),
             param_grid={'C': [0.1, 10, 100, 1000],
                         'gamma': [0.001, 0.01, 0.1, 1.0],
                         'kernel': ['poly', 'rbf', 'sigmoid']},
             verbose=2)

In [62]:
# Getting the best set of parameters found by the grid search
svm_random.best_params_

{'C': 0.1, 'gamma': 0.001, 'kernel': 'poly'}
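
`best_params_` reports only a single winner. To see how close the other candidates were, and whether several of them tie on the mean cross-validated score, the full results can be inspected through `cv_results_`. The sketch below uses the iris dataset and a reduced grid as stand-ins so that it runs quickly; it is not the search from this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data and a reduced grid, chosen only to keep the example fast
X_demo, y_demo = load_iris(return_X_y=True)
demo_search = GridSearchCV(SVC(), {'kernel': ['poly', 'rbf'], 'C': [0.1, 10]})
demo_search.fit(X_demo, y_demo)

# One row per candidate, sorted so the best-ranked candidates come first
results = pd.DataFrame(demo_search.cv_results_)
summary = (results[['params', 'mean_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(summary.to_string(index=False))
```

If several rows share rank 1, the parameters in `best_params_` are only one of several equally scoring choices.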

##IIIc. Support Vector Machine model, prediction, and accuracy

We now plug the best set of parameters found by the grid search into the SVM model.

In [65]:
# Setting up the model using the set of parameters found above
svm = SVC(kernel = 'poly',
          C = 0.1,
          gamma = 0.001)

# Fitting the training data so the model can learn
svm.fit(X_train, y_train)

# Predicting the 'test' dataset
svmpred = svm.predict(X_test)

# Getting the accuracy of the model
acc = svm.score(X_test, y_test)

# General accuracy report
print('Support Vector Machine model accuracy: {:.2f}%'.format(acc*100))

Support Vector Machine model accuracy: 100.00%
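
Because the grid search was created with `refit = True`, the fitted search object already holds a model retrained on the full training data with the best parameters, so retyping them by hand is optional. A minimal sketch on the iris dataset (a stand-in, with an arbitrary train/test split and a reduced grid):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data, split arbitrarily into training and test parts
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

demo_search = GridSearchCV(SVC(), {'kernel': ['poly', 'rbf'], 'C': [0.1, 10]},
                           refit=True)
demo_search.fit(X_tr, y_tr)

# The refitted model is available directly on the search object
best_model = demo_search.best_estimator_
acc_from_model = best_model.score(X_te, y_te)

# Scoring through the search object delegates to the same refitted model
acc_from_search = demo_search.score(X_te, y_te)
print('Accuracy: {:.2f}%'.format(acc_from_model * 100))
```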


The SVM model reaches an overall accuracy of 100% on the 'test' dataset! Let us take a brief look at a more detailed report using 'classification_report'.

In [64]:
print(classification_report(y_test, svmpred))

                                         precision    recall  f1-score   support

(vertigo) Paroymsal  Positional Vertigo       1.00      1.00      1.00         1
                                   AIDS       1.00      1.00      1.00         1
                                   Acne       1.00      1.00      1.00         1
                    Alcoholic hepatitis       1.00      1.00      1.00         1
                                Allergy       1.00      1.00      1.00         1
                              Arthritis       1.00      1.00      1.00         1
                       Bronchial Asthma       1.00      1.00      1.00         1
                   Cervical spondylosis       1.00      1.00      1.00         1
                            Chicken pox       1.00      1.00      1.00         1
                    Chronic cholestasis       1.00      1.00      1.00         1
                            Common Cold       1.00      1.00      1.00         1
                           

For reference, the columns in the report mean the following:
(source: https://www.statology.org/sklearn-classification-report/)

*   precision: Percentage of correct positive predictions relative to total positive predictions
*   recall: Percentage of correct positive predictions relative to total actual positives
*   f1-score: 2 x (precision x recall) / (precision + recall)
*   support: Number of actual samples of each class in the 'test' dataset
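
The definitions above can be checked by hand on a small example. The sketch below uses made-up labels (not the project's predictions), counts true positives, false positives, and false negatives for one class, and compares the results with scikit-learn's metric functions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up true and predicted labels, for illustration only
y_true = ['Acne', 'Acne', 'Allergy', 'Allergy', 'Allergy', 'AIDS']
y_pred = ['Acne', 'Allergy', 'Allergy', 'Allergy', 'Acne', 'AIDS']

label = 'Allergy'
pairs = list(zip(y_true, y_pred))
tp = sum(t == label and p == label for t, p in pairs)  # predicted Allergy, was Allergy
fp = sum(t != label and p == label for t, p in pairs)  # predicted Allergy, was not
fn = sum(t == label and p != label for t, p in pairs)  # was Allergy, predicted otherwise

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

# The same metrics from scikit-learn, restricted to the chosen class
sk_precision = precision_score(y_true, y_pred, labels=[label], average=None)[0]
sk_recall = recall_score(y_true, y_pred, labels=[label], average=None)[0]
sk_f1 = f1_score(y_true, y_pred, labels=[label], average=None)[0]

print('precision: {:.3f} (sklearn: {:.3f})'.format(precision, sk_precision))
print('recall:    {:.3f} (sklearn: {:.3f})'.format(recall, sk_recall))
print('f1-score:  {:.3f} (sklearn: {:.3f})'.format(f1, sk_f1))
```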

The report shows that the model predicts the correct disease for every sample in the 'test' dataset, based on the symptoms the patient has. The support column shows only one test sample for each class listed, so this result rests on a small test set.

