# Linear Discriminant Analysis on Diabetes Data

## Mohammad Abdul Wahed

## Contents


*   Objective
*   Description of Diabetes Dataset
*   Importing Libraries
*   Loading Data
*   Replacing '0' values with NaN in the Glucose, BloodPressure, SkinThickness, Insulin and BMI columns
*   Counting the number of null values
*   Imputing missing values using Multiple Imputation by Chained Equations (MICE)
*   Splitting the data into train and test set using Twinning technique
*   Fitting a model using Linear Discriminant Analysis
*   Using the model to predict diabetes using test dataset
*   Model evaluation and accuracy











## Objective

The objective is to develop a model that predicts, based on diagnostic measurements, whether a patient has diabetes.

## Description of Diabetes Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.


Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.



*   Pregnancies: Number of times pregnant
*   Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
*   BloodPressure: Diastolic blood pressure (mm Hg)
*   SkinThickness: Triceps skin fold thickness (mm)
*   Insulin: 2-Hour serum insulin (mu U/ml)
*   BMI: Body mass index (weight in kg/(height in m)^2)
*   DiabetesPedigreeFunction: Diabetes pedigree function
*   Age: Age (years)
*   Outcome: Class variable (0 or 1)




## Importing Libraries

In [30]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

## Loading Data

In [2]:
data = pd.read_csv("diabetes.csv")

In [3]:
data.head()

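|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |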


In [4]:
data.shape

(768, 9)

The dataset has 768 rows and 9 columns.

In [5]:
data.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


As we can see, the minimum value of Glucose, BloodPressure, SkinThickness, Insulin and BMI is 0. A value of 0 is not physiologically plausible for these measurements, which indicates that the zeros are actually missing values. We replace these zeros with NaN and impute the missing values using MICE.

We also see that the mean of Outcome is about 0.35, which means the dataset is imbalanced: only about 35% of the patients have outcome '1'. The F1 score is especially valuable for classification models trained on imbalanced data, so we will use it later in this notebook.
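As a quick check on the imbalance, the class counts can be recovered from the summary statistics alone, since the mean of a 0/1 column is the share of ones. The short sketch below uses only the row count and the Outcome mean reported by `describe()`; the variable names are illustrative.

```python
# Recover approximate class counts from the summary statistics above.
n_rows = 768
outcome_mean = 0.348958  # mean of the 0/1 Outcome column from describe()

n_positive = round(n_rows * outcome_mean)  # patients with Outcome == 1
n_negative = n_rows - n_positive           # patients with Outcome == 0

print(n_positive, n_negative)
print(f"positive share: {n_positive / n_rows:.1%}")
```

Roughly one patient in three is in the positive class, which is why accuracy alone can be misleading here.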

## Replacing '0' values with NaN

In [6]:
cols_with_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
data[cols_with_missing] = data[cols_with_missing].replace(0, np.nan)  # np.NaN was removed in NumPy 2.0

## Counting the number of null values

In [7]:
data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
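To put these counts in perspective, it may help to express them as a share of the 768 rows. The sketch below simply divides the counts printed above by the row count; the dictionary is typed in by hand from that output.

```python
# Share of missing values per column, from the counts printed above.
n_rows = 768
missing_counts = {
    "Glucose": 5,
    "BloodPressure": 35,
    "SkinThickness": 227,
    "Insulin": 374,
    "BMI": 11,
}

missing_share = {col: n / n_rows for col, n in missing_counts.items()}

# Print columns from most to least affected.
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<14} {share:6.1%}")
```

Nearly half of the Insulin values and almost 30% of the SkinThickness values are missing, so the imputation step has a substantial influence on these two columns.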

## Imputing missing values using Multiple Imputation by Chained Equations (MICE)

Let's first install the `miceforest` package.

In [8]:
!pip install miceforest --no-cache-dir


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting miceforest
  Downloading miceforest-5.6.3-py3-none-any.whl (57 kB)
Collecting dill
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting blosc
  Downloading blosc-1.11.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)
Installing collected packages: dill, blosc, miceforest
Successfully installed blosc-1.11.1 dill-0.3.6 miceforest-5.6.3


Installing the latest version of `miceforest` from GitHub:

In [9]:
!pip install git+https://github.com/AnotherSamWilson/miceforest.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/AnotherSamWilson/miceforest.git
  Cloning https://github.com/AnotherSamWilson/miceforest.git to /tmp/pip-req-build-e1rfx4_h
  Running command git clone --filter=blob:none --quiet https://github.com/AnotherSamWilson/miceforest.git /tmp/pip-req-build-e1rfx4_h
  Resolved https://github.com/AnotherSamWilson/miceforest.git to commit d9359a89204e3b5f10cc02e7e621a22c213e5453
  Preparing metadata (setup.py) ... done


In [10]:
import miceforest as mf

We have the original dataset with missing values (NaN) in `data`. Let's impute the missing values with `miceforest`.

In [11]:
# Create kernel. 
kds = mf.ImputationKernel(
  data,
  save_all_iterations=True,
  random_state=100
)

# Run the MICE algorithm for 5 iterations
kds.mice(5)

# Return the completed dataset.
data_imputed = kds.complete_data()

In [12]:
data_imputed.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

No missing values remain, so the imputation is complete.
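For readers who want to experiment with the chained-equations idea without `miceforest`, the sketch below uses scikit-learn's `IterativeImputer` on synthetic data. This is a different implementation from the one used in this notebook (it fits a regression model per column rather than the models `miceforest` uses), and the data, column relationships and missingness rates are all made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer

# Synthetic data with three correlated columns (assumed, not the diabetes data).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=n)
x3 = x1 - x2 + rng.normal(scale=0.1, size=n)
X_true = np.column_stack([x1, x2, x3])

# Knock out roughly 20% of the values in two of the columns.
X_missing = X_true.copy()
X_missing[rng.random(n) < 0.2, 1] = np.nan
X_missing[rng.random(n) < 0.2, 2] = np.nan

# Each column with missing values is modelled from the others, in rounds.
imputer = IterativeImputer(max_iter=5, random_state=0)
X_filled = imputer.fit_transform(X_missing)

print("remaining NaNs:", int(np.isnan(X_filled).sum()))
```

Only the missing entries are filled in; observed values pass through unchanged.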

## Splitting the data into training and test sets using the Twinning technique
The Twinning technique partitions a dataset into statistically similar disjoint sets, termed twins.

Let's install the `twinning` package.

In [13]:
!pip install git+https://github.com/avkl/twinning.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/avkl/twinning.git
  Cloning https://github.com/avkl/twinning.git to /tmp/pip-req-build-kbz8sbe_
  Running command git clone --filter=blob:none --quiet https://github.com/avkl/twinning.git /tmp/pip-req-build-kbz8sbe_
  Resolved https://github.com/avkl/twinning.git to commit 8c6ffdd73531039733a52f0f8cf67efe4f38383f
  Preparing metadata (setup.py) ... done
Processing //tmp/pip-req-build-kbz8sbe_/twinning_cpp
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: twinning, twinning_cpp
  Building wheel for twinning (setup.py) ... done
  Created wheel for twinning: filename=twinning-1.0-py3-none-any.whl size=9453 sha256=980b831717b29b8f0de415b3df62c8ea4fb6fdd67d5deb9919d87b3112dffc09

In [14]:
from twinning import twin

The following code generates an 80-20 partition of the imputed dataset `data_imputed`. `twin()` accepts a NumPy ndarray as the dataset and an integer parameter `r` representing the inverse of the partitioning ratio, i.e., for an 80-20 split, r = 1 / 0.2 = 5. The function returns the indices of the smaller twin.
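The arithmetic behind `r` can be sketched as follows. The helper `twinning_ratio` is a hypothetical name introduced here for illustration and is not part of the `twinning` package.

```python
def twinning_ratio(test_fraction: float) -> int:
    """Hypothetical helper: integer inverse of the desired test fraction."""
    return round(1 / test_fraction)


n_rows = 768
r = twinning_ratio(0.2)           # 80-20 split
approx_test_rows = n_rows / r     # expected size of the smaller twin

print(r, approx_test_rows)
```

With 768 rows and r = 5, the smaller twin should hold roughly 154 rows and the bigger twin roughly 614.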

In [15]:
twin_idx = twin(data_imputed.to_numpy(), r=5)

In [16]:
twin_idx

array([365,  49, 733, 553, 482,   1, 196, 752, 694, 288, 156, 421, 446,
       172, 431, 615, 765, 257, 158, 368, 601, 639, 407, 418,  55, 742,
        96, 253,  74, 483, 556, 665, 201, 452, 726, 315, 524, 373, 572,
       241, 570, 567, 583, 350, 652,  29, 160, 517, 669, 436,  44, 555,
        63, 605, 324, 682, 490, 117,  62, 503, 439, 380, 538, 428, 178,
       349, 337, 219, 712,  37, 386, 451, 326, 356, 419,  17,  25, 749,
       114, 427, 189, 164,  31, 675, 545,  11, 663, 603, 152, 408,  53,
       355, 298, 618, 330, 371, 534, 576, 393, 325, 340, 688, 169, 633,
       364, 711,  42, 582, 123, 361, 670,  40, 738, 190, 232, 336, 699,
       379, 536, 560, 480, 391, 484,  99,  38, 539, 270, 622, 153, 753,
       227, 647, 569,   6, 197, 308, 130, 392, 519, 115, 212, 440, 691,
       363, 674, 559,  86, 111, 294, 335,  18, 661, 177, 434],
      dtype=uint64)

The hard-coded indices below form the smaller twin, which serves as the test set. Dropping them from `data_imputed` leaves the bigger twin, which is used to train the model. Note that these indices differ from the `twin_idx` output displayed above, so they appear to come from a separate run of `twin()`; the results in the rest of this notebook correspond to the indices listed here.

In [17]:
# Row positions of the smaller twin (test set), defined once and reused below
test_idx = [300, 326, 356, 535,  17, 214, 754, 502, 175, 417, 761,  48, 276,
        25, 298, 590, 314, 683, 164, 539,  16, 746, 732, 399, 427, 766,
       269, 646, 110,  64, 161, 302, 463, 704, 496, 601, 752, 694,  60,
        89, 307, 101, 686, 167, 224, 202, 137, 610, 142, 423, 624,  74,
       457, 168, 411, 138, 470, 305,  59, 127, 467, 426,  87, 226, 591,
       343, 628, 568, 641, 652, 251,  91, 135, 482, 450,  90, 650, 367,
       639, 239,   1, 158, 720, 352, 462, 505, 281, 222,  98, 760, 644,
       564, 354, 410, 345, 285, 295, 405,  18, 599, 530, 346, 279, 434,
       477, 500, 728, 139, 178, 211, 267, 534, 449, 466, 242, 586,  11,
       236, 749, 361,  30, 582, 194, 668, 286, 753, 412, 379, 536, 739,
       303, 485,   6, 258, 227, 186, 115, 140, 129, 743, 245, 487, 123,
       330, 659, 395, 370, 159, 558, 212, 672, 662, 532, 254]

# Bigger twin: every row that is not in the test set
data_imputed_train = data_imputed.drop(data_imputed.index[test_idx])

Splitting the data into train and test sets

In [18]:
# Features are all columns except the last; the label is the last column (Outcome)
X_train = data_imputed_train.iloc[:, :-1].values
Y_train = data_imputed_train.iloc[:, -1].values
X_test = data_imputed.iloc[test_idx, :-1].values
Y_test = data_imputed.iloc[test_idx, -1].values


## Fitting a model using Linear Discriminant Analysis

In [19]:
# Fit the LDA model and report its accuracy on the training data
model = LinearDiscriminantAnalysis()
model.fit(X_train, Y_train)
model.score(X_train, Y_train)

0.7833876221498371

The model achieves an accuracy of 78.34% on the training data.
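The score above is measured on the same rows the model was fitted on, so it is likely an optimistic estimate. The notebook imports `RepeatedStratifiedKFold` and `cross_val_score` but does not use them; one way they could give a less optimistic estimate is sketched below. The sketch runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), not on the diabetes data, and the fold and repeat counts are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced stand-in for the training data (assumed).
X, y = make_classification(
    n_samples=300,
    n_features=8,
    n_informative=4,
    weights=[0.65, 0.35],
    random_state=0,
)

# 5 stratified folds, repeated 3 times: 15 held-out accuracy estimates.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, scoring="accuracy", cv=cv)

print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

In the notebook itself, `X` and `y` would be replaced by `X_train` and `Y_train`.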

## Using the model to predict diabetes using test dataset

We now use the fitted model to predict the outcome for the test set with the `.predict` method.

In [20]:
Y_pred = model.predict(X_test)

## Model evaluation and accuracy

Since our dataset is imbalanced, we use the F1 score as our performance metric.

The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0.



In [27]:
confusion_matrix(Y_test, Y_pred)

array([[88, 15],
       [25, 26]])

In [28]:
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.78      0.85      0.81       103
           1       0.63      0.51      0.57        51

    accuracy                           0.74       154
   macro avg       0.71      0.68      0.69       154
weighted avg       0.73      0.74      0.73       154
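The figures in the report can be reproduced by hand from the confusion matrix above, which may help in reading it. The sketch below is plain Python; `class_metrics` is an illustrative helper, not a scikit-learn function.

```python
# Confusion matrix from above: rows are true classes, columns are predictions.
cm = [[88, 15],
      [25, 26]]


def class_metrics(cm, k):
    """Precision, recall and F1 for class k, treating k as the positive class."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm)) if i != k)  # predicted k, truly other
    fn = sum(cm[k][j] for j in range(len(cm)) if j != k)  # truly k, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for k in (0, 1):
    p, r, f = class_metrics(cm, k)
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
print(f"accuracy={accuracy:.2f}")
```

The recall of 0.51 for class 1 means that about half of the diabetic patients in the test set are missed, which the overall accuracy alone does not reveal.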



Computing the F1 score

We use `average='weighted'`, which averages the per-class F1 scores weighted by the number of true instances of each class and so accounts for label imbalance.

In [29]:
f1_score(Y_test, Y_pred, average='weighted')

0.7321559278081017
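The weighted average can be checked by hand from the confusion matrix and the class supports reported earlier. The sketch below also computes the unweighted (macro) average for comparison; the variable names are illustrative.

```python
# Counts taken from the confusion matrix [[88, 15], [25, 26]].
support = {0: 103, 1: 51}  # number of true instances per class

# Per-class F1 written as 2*TP / (2*TP + FP + FN).
f1_per_class = {
    0: 2 * 88 / (2 * 88 + 25 + 15),
    1: 2 * 26 / (2 * 26 + 15 + 25),
}

total = sum(support.values())
weighted_f1 = sum(f1_per_class[k] * support[k] for k in support) / total
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

print(f"weighted F1: {weighted_f1:.4f}")
print(f"macro F1:    {macro_f1:.4f}")
```

The weighted average is pulled toward the majority class, so it is higher than the macro average, which treats both classes equally.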

The weighted F1 score on the test set is 0.73, and the overall test accuracy is 74%.