# Linear Discriminant Analysis on Diabetes Data

## Mohammad Abdul Wahed

## Contents


*   Objective
*   Description of Diabetes Dataset
*   Importing Libraries
*   Loading Data
*   Replacing '0' values with NaN in the Glucose, BloodPressure, SkinThickness, Insulin and BMI columns
*   Counting the number of null values
*   Imputing missing values using Multiple Imputation by Chained Equations (MICE)
*   Scaling the data
*   Splitting the data into train and test set using Twinning technique
*   Fitting a model using Linear Discriminant Analysis
*   Using the model to predict diabetes using test dataset
*   Model evaluation and accuracy











## Objective

The objective is to develop a model that predicts, from diagnostic measurements, whether a patient has diabetes.

## Description of Diabetes Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.


Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.



*   Pregnancies: Number of times pregnant
*   Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
*   BloodPressure: Diastolic blood pressure (mm Hg)
*   SkinThickness: Triceps skin fold thickness (mm)
*   Insulin: 2-Hour serum insulin (mu U/ml)
*   BMI: Body mass index (weight in kg/(height in m)^2)
*   DiabetesPedigreeFunction: Diabetes pedigree function
*   Age: Age (years)
*   Outcome: Class variable (0 or 1)




## Importing Libraries

In [20]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis 
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

## Loading Data

In [2]:
data = pd.read_csv("diabetes.csv")

In [3]:
data.head()

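|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |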


In [4]:
data.shape

(768, 9)

The dataset has 768 rows and 9 columns.

In [5]:
data.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


The minimum value of Glucose, BloodPressure, SkinThickness, Insulin and BMI is 0. A value of 0 is not physiologically plausible for these measurements, so these zeros are treated as missing values. We replace them with NaN and then impute the missing values using Multiple Imputation by Chained Equations (MICE).

The mean of Outcome is about 0.35, so roughly 35% of patients are labelled '1' and 65% are labelled '0': the dataset is imbalanced. The F1 score is especially useful for classification models trained on imbalanced data, because accuracy alone can look good when a model favours the majority class. We compute it later in this notebook.
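To make the imbalance concrete, here is a minimal sketch that rebuilds the class shares from the summary statistics alone. It assumes the counts implied by 768 rows and a mean of 0.348958 (268 positives, 500 negatives); the array `outcome` is a stand-in constructed for illustration, not the `data` frame itself.

```python
import numpy as np

# Counts implied by the describe() output: 768 rows, mean(Outcome) = 0.348958,
# i.e. 268 positive cases and 500 negative cases (stand-in array, not `data`).
outcome = np.array([0] * 500 + [1] * 268)

positive_share = outcome.mean()
negative_share = 1 - positive_share

# A model that always predicts the majority class (0) would already reach
# an accuracy equal to the negative share, which is why accuracy alone misleads.
majority_baseline_accuracy = negative_share

print(f"positive: {positive_share:.3f}, negative: {negative_share:.3f}")
```

A classifier that always predicts '0' would therefore be right about 65% of the time, which is the baseline any reported accuracy should be compared against.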

## Replacing '0' values with NaN

In [6]:
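# Columns in which a value of 0 really means "not recorded"
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
data[zero_as_missing] = data[zero_as_missing].replace(0, np.nan)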

## Counting the number of null values

In [7]:
data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

## Imputing missing values using Multiple Imputation by Chained Equations (MICE)

Let's first install the `miceforest` package.

In [None]:
!pip install miceforest --no-cache-dir


Alternatively, install the latest development version of `miceforest` directly from GitHub:

In [None]:
!pip install git+https://github.com/AnotherSamWilson/miceforest.git

In [10]:
import miceforest as mf

The original dataset, with its missing values (NaN), is stored in `data`. We now impute those missing values with `miceforest`.

In [11]:
# Create kernel. 
kds = mf.ImputationKernel(
  data,
  save_all_iterations=True,
  random_state=100
)

# Run the MICE algorithm for 5 iterations
kds.mice(5)

# Return the completed dataset.
data_imputed = kds.complete_data()

In [12]:
data_imputed.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

All missing values have now been imputed.
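For readers who cannot install `miceforest`, the chained-equations idea can be sketched with scikit-learn's `IterativeImputer`. This is a different implementation from the one used above and will not reproduce the same imputed values; the toy array `X_toy` is hypothetical and only illustrates the mechanics of modelling each incomplete column from the others over several rounds.

```python
import numpy as np
# IterativeImputer is still flagged experimental, so it must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy data (three columns, two missing entries); not the diabetes data.
X_toy = np.array([
    [148.0, 72.0, 33.6],
    [85.0, 66.0, 26.6],
    [183.0, np.nan, 23.3],
    [89.0, 66.0, 28.1],
    [137.0, 40.0, np.nan],
    [116.0, 74.0, 25.6],
])

# Each column with gaps is regressed on the other columns, repeated for 5 rounds.
imputer = IterativeImputer(max_iter=5, random_state=100)
X_filled = imputer.fit_transform(X_toy)

print(np.isnan(X_filled).sum())  # number of missing entries left after imputation
```

Observed entries are left unchanged; only the NaN positions are filled in.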

## Scaling the data

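LDA assumes that the predictors are approximately normally distributed within each class and that the classes share a common covariance matrix. Standardizing each variable to mean 0 and standard deviation 1 does not enforce that assumption, but it puts the predictors on a comparable scale, which makes the fitted coefficients easier to compare.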

In [73]:
scale = StandardScaler()
data_imputed_X = data_imputed[['Pregnancies', 'Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']]
scaled_data_imputed_X = scale.fit_transform(data_imputed_X)

In [74]:
#checking mean of the scaled data
scaled_data_imputed_X.mean(axis=0)

array([-6.47630098e-17,  1.54968631e-16,  3.77013235e-16, -1.13335267e-16,
       -9.02056208e-17,  6.01370805e-17,  2.45174251e-16,  1.93132547e-16])

In [75]:
#checking standard deviation of the scaled data
scaled_data_imputed_X.std(axis=0)

array([1., 1., 1., 1., 1., 1., 1., 1.])

The data has been scaled to mean 0 and standard deviation 1.
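As a minimal sketch of what `StandardScaler` does, the same result can be computed by hand as z = (x - mean) / std for each column, using the population standard deviation (`ddof=0`). The array `X_toy` below is a small hypothetical example, not the imputed dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column example (values chosen only for illustration).
X_toy = np.array([
    [6.0, 148.0],
    [1.0, 85.0],
    [8.0, 183.0],
    [1.0, 89.0],
])

# Column-wise z-score; NumPy's std() defaults to ddof=0, matching StandardScaler.
z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(z_manual, z_sklearn))
```

The column means printed earlier are on the order of 1e-16 rather than exactly 0; that residue is floating-point rounding error.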

In [76]:
scaled_data_imputed_X

array([[ 0.63994726,  0.86428946, -0.03929522, ...,  0.16899856,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.20027638, -0.52867995, ..., -0.84567228,
        -0.36506078, -0.19067191],
       [ 1.23388019,  2.01127048, -0.6918082 , ..., -1.3240171 ,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 , -0.02052447, -0.03929522, ..., -0.90365347,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.14332996, -1.01806469, ..., -0.33833686,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.93810929, -0.20242346, ..., -0.29485096,
        -0.47378505, -0.87137393]])

In [77]:
scaled_data_imputed_X.shape

(768, 8)

In [78]:
scaled_data_imputed_X = pd.DataFrame(scaled_data_imputed_X)

In [79]:
scaled_data_imputed_X.head()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.639947 | 0.864289 | -0.039295 | 0.588726 | 0.629324 | 0.168999 | 0.468492 | 1.425995 |
| 1 | -0.844885 | -1.200276 | -0.52868 | -0.005676 | -1.043159 | -0.845672 | -0.365061 | -0.190672 |
| 2 | 1.23388 | 2.01127 | -0.691808 | -0.897278 | 1.136406 | -1.324017 | 0.604397 | -0.105584 |
| 3 | -0.844885 | -1.069193 | -0.52868 | -0.600077 | -0.562765 | -0.628243 | -0.920763 | -1.041549 |
| 4 | -1.141852 | 0.50381 | -2.649347 | 0.588726 | 0.095553 | 1.546052 | 5.484909 | -0.020496 |


In [80]:
data_Y = data[['Outcome']]
data_Y

|   | Outcome |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |
| ... | ... |
| 763 | 0 |
| 764 | 0 |
| 765 | 0 |
| 766 | 1 |


In [81]:
data_Y.shape

(768, 1)

In [82]:
data_Y = pd.DataFrame(data_Y)

In [83]:
data_Y.head()

|   | Outcome |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 1 |


In [88]:
scaled_data_imputed_X.isnull().sum()

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
dtype: int64

In [86]:
data_Y.isnull().sum()

Outcome    0
dtype: int64

## Splitting the data into training and test sets using the Twinning technique
The Twinning technique partitions a dataset into statistically similar, disjoint subsets called twins.

Let's install the `twinning` package.

In [None]:
!pip install git+https://github.com/avkl/twinning.git

In [14]:
from twinning import twin

The following code generates an 80-20 partition of the scaled predictors `scaled_data_imputed_X`. `twin()` accepts a NumPy ndarray as the dataset and an integer parameter `r`, the inverse of the fraction assigned to the smaller twin: for an 80-20 split, r = 1 / 0.2 = 5. The function returns the indices of the smaller twin, which we use as the test set.

In [89]:
twin_idx = twin(scaled_data_imputed_X.to_numpy(), r=5)

In [90]:
twin_idx

array([751, 447, 150, 435,  16,  20, 179, 523, 693, 349,  29, 625, 504,
        77, 217, 533, 737, 583, 719, 406, 765, 581, 601, 463, 257, 632,
       137, 210,   1, 196, 729, 208, 233, 135, 181, 441, 432, 224, 202,
       565, 741,  80, 609, 430, 407, 450, 288, 494, 368, 500, 297,  66,
       380, 326, 414, 356, 216, 127, 318, 402, 116, 568, 419, 122, 700,
       144, 364,  48, 522, 281, 143, 194, 717, 582, 355, 209, 195, 335,
       452, 467, 365, 733, 567, 569, 566, 112, 290, 352, 369, 506, 427,
       399, 415, 412, 308, 434, 265, 439, 740, 744,  12, 743, 560, 327,
       404,  11, 648, 306, 495,   0,  31,  94, 325, 130, 393, 527, 320,
       392, 220, 111, 236, 298, 559, 680, 508, 331, 438,  42, 734, 537,
       243,  46, 213, 173, 699, 575, 689, 379, 303, 558, 363, 662, 212,
        93,   8, 221,  58, 228,   2, 464, 251, 459, 177, 621],
      dtype=uint64)

Splitting the data into training and test sets:

In [93]:
# Reuse the indices returned by twin() instead of hard-coding them;
# the cast converts the uint64 indices to plain integers for pandas.
test_idx = twin_idx.astype(int)
scaled_data_imputed_X_train = scaled_data_imputed_X.drop(scaled_data_imputed_X.index[test_idx])
data_Y_train = data_Y.drop(data_Y.index[test_idx])

In [103]:
# The rows selected by twin() form the test set
X_test = scaled_data_imputed_X.loc[test_idx].values
Y_test = data_Y.loc[test_idx].values


In [104]:
scaled_data_imputed_X_train.shape

(614, 8)

In [105]:
data_Y_train.shape

(614, 1)
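An equivalent way to obtain the training rows is to compute the complement of the test indices explicitly. The sketch below uses a hypothetical 10-row example with two test indices standing in for `twin_idx`; it does not call `twin()` itself.

```python
import numpy as np

n_rows = 10
# Hypothetical stand-in for twin_idx (twin() returned uint64 indices above).
test_idx = np.array([7, 2], dtype=np.uint64)

# Training indices are all row positions that are not in the test set.
train_idx = np.setdiff1d(np.arange(n_rows), test_idx.astype(np.int64))

print(train_idx)
```

Applied to this notebook, the 154 indices in the smaller twin leave 768 - 154 = 614 training rows, which matches the shapes shown above.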

## Fitting a model using Linear Discriminant Analysis

In [111]:
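# Fit the LDA model. ravel() flattens the (n, 1) target to the 1-D shape
# scikit-learn expects, which avoids a DataConversionWarning.
model = LinearDiscriminantAnalysis()
y_train = data_Y_train.values.ravel()
model.fit(scaled_data_imputed_X_train, y_train)
model.score(scaled_data_imputed_X_train, y_train)
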


0.7703583061889251

The model achieves an accuracy of 77.04% on the training set.

---



## Using the model to predict diabetes using test dataset

We now use the fitted model's `.predict` method to classify the patients in the test set.

In [107]:
Y_pred = model.predict(X_test)

## Model evaluation and accuracy

Since our dataset is imbalanced, we use the F1 score as our performance metric.

The F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0.
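For reference, with TP, FP and FN denoting the true positives, false positives and false negatives for a given class, the standard definitions are:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$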



In [108]:
confusion_matrix(Y_test, Y_pred)

array([[85, 16],
       [21, 32]])

In [109]:
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.80      0.84      0.82       101
           1       0.67      0.60      0.63        53

    accuracy                           0.76       154
   macro avg       0.73      0.72      0.73       154
weighted avg       0.76      0.76      0.76       154



Computing the F1 score

We use `average='weighted'`, which averages the per-class F1 scores weighted by the number of true instances of each class. This accounts for label imbalance.

In [110]:
f1_score(Y_test, Y_pred, average='weighted')

0.7566949241508001

The weighted F1 score on the test set is about 0.757, and the overall test accuracy from the classification report is 0.76.
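As a cross-check, the figures in the classification report can be reproduced by hand from the confusion matrix shown earlier. This is a minimal sketch using only NumPy; the variable names are chosen for illustration, and rows are taken as true classes and columns as predicted classes, following scikit-learn's convention.

```python
import numpy as np

# Confusion matrix from the test set: rows = true class, columns = predicted class.
cm = np.array([[85, 16],
               [21, 32]])

support = cm.sum(axis=1)                  # true instances per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / all predictions of that class
recall = np.diag(cm) / support            # correct predictions / all true instances of that class
f1 = 2 * precision * recall / (precision + recall)

accuracy = np.trace(cm) / cm.sum()
weighted_f1 = np.average(f1, weights=support)

print(np.round(f1, 2), round(accuracy, 4), round(weighted_f1, 4))
```

The per-class F1 values round to 0.82 and 0.63, matching the report, and their support-weighted average matches the value returned by `f1_score` with `average='weighted'`.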