# Diabetic or Not

Ref: [Sidhardhan's ML projects](https://www.youtube.com/watch?v=fiz1ORTBGpY&list=PLfFghEzKVmjvuSA67LszN1dZ-Dd_pkus6)

Author: Dathabase

> ### **Aim:**
To predict whether a female patient has diabetes or not, using an SVM (Support Vector Machine) model

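SVM algorithm:
- A widely used supervised learning model for classification
- An SVM finds the **hyperplane** that separates the classes in the training data with the widest possible margin. Any new data point introduced to the model is then assigned to a class according to the side of the hyperplane it falls on.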
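
To make the hyperplane idea concrete, here is a minimal sketch on made-up 2-D points (the toy data and the names `X_toy`, `y_toy`, `toy_model` are illustrative assumptions, not part of the diabetes data). With a linear kernel, the fitted model exposes the hyperplane as `coef_` (the weights `w`) and `intercept_` (the offset `b`), and a new point is classified by the sign of `w . x + b`.

```python
import numpy as np
from sklearn import svm

# Toy 2-D data (made up for illustration): two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = svm.SVC(kernel='linear')
toy_model.fit(X_toy, y_toy)

# The separating hyperplane is the set of points where w . x + b = 0
w = toy_model.coef_[0]
b = toy_model.intercept_[0]

# New points are assigned to a class by the side of the hyperplane they fall on
new_points = np.array([[1.2, 1.5], [6.8, 6.2]])
toy_predictions = toy_model.predict(new_points)
side_of_hyperplane = (new_points @ w + b > 0).astype(int)

print(toy_predictions)      # class assigned by the model
print(side_of_hyperplane)   # same answer, computed from w and b directly
```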


> ### **Workflow:**

1. Get diabetes data (PIMA dataset*) and respective labels
1. Data pre-processing
1. Split the data into training and testing sets
1. Use a Support Vector Machine classifier (supervised learning)
1. Train the model to predict whether a person is diabetic/non-diabetic, then evaluate it on the testing set


*This dataset was originally provided by the National Institute of Diabetes and Digestive and Kidney Diseases. It is used to predict whether a patient (all patients are females of Pima Indian heritage, at least 21 years old) is likely to get diabetes based on Age, Glucose, Blood pressure, Insulin, BMI, etc. [Ref: Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/07/diabetes-prediction-with-pycaret/#:~:text=Diabetes%20Pedigree%20Function%3A%20indicates%20the,%3D%20yes%2C%200%20%3D%20no)

In [1]:
# importing required modules

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler # to standardise each feature to zero mean and unit variance
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

## **Data Collection and Pre-processing**

In [2]:
diabetes_df = pd.read_csv("diabetes_data.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Columns
- `Pregnancies`: number of pregnancies
- `Glucose`: plasma glucose concentration
- `BloodPressure`: diastolic blood pressure [mm Hg]
- `SkinThickness`: triceps skinfold thickness [mm] (indicates the amount of fat under the skin)
- `Insulin`: serum insulin level [μU/mL]
- `BMI`: Body Mass Index $\bigg(\frac{weight}{height^2}\bigg)$ [kg/m^2]
- `DiabetesPedigreeFunction`: scores likelihood of diabetes based on family history
- `Age`: age of the female patient
- `Outcome`: 1 = diabetic; 0 = not diabetic

In [3]:
print(f"The dataframe has {diabetes_df.shape[0]} rows and {diabetes_df.shape[1]} columns")
print("Here are some basic stats regarding the numerical data:")
diabetes_df.describe()

The dataframe has 768 rows and 9 columns
Here are some basic stats regarding the numerical data:


Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [4]:
# number of patients that are diabetic (value = 1) and non-diabetic (value = 0)
diabetes_df.loc[:,'Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

In [5]:
# mean value of all categories, for a diabetic vs non-diabetic
diabetes_df.groupby('Outcome').mean() 

Unnamed: 0_level_0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
Outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


Separate the input data, X i.e. **features** (columns 0 to 7) and the data to be predicted, y i.e. **labels** (column 8)

In [6]:
X = diabetes_df.drop('Outcome', axis=1)
Y = diabetes_df.iloc[:,8]

The columns span very different numerical ranges (compare `Insulin` with `DiabetesPedigreeFunction`), so features with large values would dominate the model and make good predictions difficult. The data therefore needs to be standardised...

In [7]:
# create an instance
scaler = StandardScaler()

# # fit then transform
# scaler.fit(X)
# X_standardised = scaler.transform(X)

# fit and transform the data so that every column has mean 0 and standard deviation 1
X_standardised = scaler.fit_transform(X)
X = X_standardised
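
As a quick check of what standardisation does, the sketch below compares `StandardScaler` with the z-score formula $z = \frac{x - \mu}{\sigma}$ applied column by column. The array `demo` holds a few made-up values on two very different scales (an assumption for illustration). Standardised values are z-scores, so they are not confined to a fixed interval: each column ends up with mean 0 and standard deviation 1, and individual values can exceed 1 in magnitude.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for two columns on very different scales
demo = np.array([[85.0, 0.351],
                 [148.0, 0.627],
                 [183.0, 0.672],
                 [89.0, 0.167]])

demo_scaled = StandardScaler().fit_transform(demo)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(np.allclose(demo_scaled, manual))   # the two approaches agree
print(np.abs(demo_scaled).max())          # largest magnitude after scaling
```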

## **Train, Test, Split**

In [8]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, 
                                                    random_state = 2,
                                                    test_size = 0.2,
                                                    stratify = Y)

Parameters:
- `X_train` = training data; `Y_train` = corresponding labels of the training data
- `X_test` = testing data; `Y_test` = corresponding labels of the testing data
- 20% of data will be in the testing set => `test_size = 0.2`
- keep the same proportion of diabetic (1) and non-diabetic (0) patients in the training and testing sets as in the full data set => `stratify = Y`
- split the data in the same way every time the function is called => `random_state = 2`
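
The effect of `stratify` can be seen in a small sketch. It uses synthetic labels with the same class counts as the `Outcome` column (500 zeros, 268 ones) and placeholder features. The names `labels` and `features` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the Outcome column
labels = np.array([0] * 500 + [1] * 268)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(features, labels,
                                    test_size=0.2,
                                    random_state=2,
                                    stratify=labels)

# With stratify, the share of class 1 is (almost) identical in every subset
print(f"full data: {labels.mean():.3f}")
print(f"train:     {y_tr.mean():.3f}")
print(f"test:      {y_te.mean():.3f}")
```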

## **Model Training and Evaluation**

In [9]:
# SVC = Support Vector Classifier
diabetic_svm_model = svm.SVC(kernel='linear')
# linear kernel: the decision boundary is a single hyperplane in feature space (this assumes the classes are roughly linearly separable)

In [10]:
# fit training data
diabetic_svm_model.fit(X_train, Y_train)

SVC(kernel='linear')

In [11]:
# accuracy on training data
X_train_pred = diabetic_svm_model.predict(X_train)
accuracy_X_train_pred = accuracy_score(Y_train, X_train_pred) # signature is (y_true, y_pred)
print(f"Accuracy on Training data: {accuracy_X_train_pred*100:.1f}%")

Accuracy on Training data: 78.7%


In [12]:
# accuracy on testing data
X_test_pred = diabetic_svm_model.predict(X_test)
accuracy_X_test_pred = accuracy_score(Y_test, X_test_pred) # signature is (y_true, y_pred)
print(f"Accuracy on Testing data: {accuracy_X_test_pred*100:.1f}%")

Accuracy on Testing data: 77.3%
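
For context, it can help to compare these accuracies with a trivial baseline that always predicts the majority class. The sketch below computes that baseline from the class counts reported earlier for the full data set (500 non-diabetic, 268 diabetic). The arrays `y_true` and `y_majority` are synthetic stand-ins, not the actual test split.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic labels with the class counts of the full data set
y_true = np.array([0] * 500 + [1] * 268)

# A "model" that always predicts the majority class (non-diabetic)
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)
print(f"Majority-class baseline: {baseline_accuracy*100:.1f}%")  # prints 65.1%
```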


## **Making a Predictive System**

Choose a random row from the data set (that is not in the training data!) as input data and see if the model correctly predicts the result. 

`StandardScaler` returns a NumPy array, so the standardised data no longer carries the DataFrame's row index. Hence we have to match each row of the training set against the full data set to generate a list of training-row indices...

In [13]:
from random import choice

def rand_row_idx(X, X_train):
  '''
  function that accepts the full standardised dataset and the standardised 
  training set and returns the row indices of the rows used in the training set 
  and a random row index whose values haven't been used in that training set
  '''
  train_idxs = []
  for row in X_train:
    # a row of X matches only if ALL of its values equal the training row
    row_idx = np.where((X == row).all(axis=1))[0]
    if len(row_idx) != 0:
      train_idxs.append(row_idx[0])
    else:
      train_idxs.append(-1)
  remaining_idxs = [i for i in range(0,X.shape[0]) if i not in train_idxs] # remaining row indices
  return train_idxs, choice(remaining_idxs) # selects a single row index at random


In [14]:
# get random row index
row_idx = rand_row_idx(X, X_train)[1]
row_idx

260

In [15]:
# check whether row index choice is in the list of indices of the training set
if row_idx in rand_row_idx(X, X_train)[0]:
  print(True)
else:
  print(False)

False
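
An alternative that avoids the row-matching search is to pass an array of row positions through `train_test_split` alongside the data, since the function splits any number of equally long arrays consistently. The sketch below uses synthetic stand-ins (`X_demo`, `y_demo`, `row_ids` are assumed names for illustration), not the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the standardised features and the labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = np.array([0, 1] * 10)
row_ids = np.arange(len(X_demo))  # original row positions

# Every array passed in is split with the same shuffling
X_tr, X_te, y_tr, y_te, ids_tr, ids_te = train_test_split(
    X_demo, y_demo, row_ids,
    test_size=0.2, random_state=2, stratify=y_demo)

# Any index drawn from ids_te is guaranteed to be outside the training set
held_out_idx = rng.choice(ids_te)
print(held_out_idx in ids_tr)
```

Here `ids_te` directly lists the rows that were held out, so a random test row can be drawn without comparing floating-point values.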


Get a set of values corresponding to the random row index and use your model to make a prediction

In [16]:
# reshape input data
new_input = np.asarray(X[row_idx]).reshape(1,-1)

def diabetic_or_not(X):
  '''function that takes a 2D array of standardised medical parameters 
  (one row per person) and predicts whether that person is diabetic (1) or not (0)'''
  return diabetic_svm_model.predict(X)

In [18]:
# test with new data
if diabetic_or_not(new_input) == 1:
  print("The person HAS diabetes")
elif diabetic_or_not(new_input) == 0:
  print("The person DOESN'T HAVE diabetes")
else:
  print("Model has failed to predict correctly")

if diabetic_or_not(new_input) == Y[row_idx]:
  print("This is a correct prediction :)")
else:
  print("This is an incorrect prediction :(")

The person HAS diabetes
This is an incorrect prediction :(
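
The prediction above works on a row that has already been standardised. For raw measurements from a new patient, the same fitted scaler would have to be applied before calling `predict`. One way to keep the two steps together is a scikit-learn `Pipeline`. The sketch below is an illustration only: it fits on the five rows shown by `head()` earlier, which is far too little data for a real model, and `raw_patient` is a made-up input.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# The five rows shown by head(), used here only to make the sketch runnable
X_raw = np.array([[6, 148, 72, 35,   0, 33.6, 0.627, 50],
                  [1,  85, 66, 29,   0, 26.6, 0.351, 31],
                  [8, 183, 64,  0,   0, 23.3, 0.672, 32],
                  [1,  89, 66, 23,  94, 28.1, 0.167, 21],
                  [0, 137, 40, 35, 168, 43.1, 2.288, 33]])
y_raw = np.array([1, 0, 1, 0, 1])

# The pipeline standardises the input and then applies the linear SVM
pipeline = make_pipeline(StandardScaler(), SVC(kernel='linear'))
pipeline.fit(X_raw, y_raw)

# Made-up raw measurements for one patient, in the same column order
raw_patient = np.array([[2, 120, 70, 30, 80, 30.0, 0.4, 35]])
prediction = pipeline.predict(raw_patient)
print("diabetic" if prediction[0] == 1 else "non-diabetic")
```

Fitting the pipeline on the training split only would also keep the scaler from seeing the test rows.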
