# **Diabetes Prediction Project**

This project predicts whether a patient has diabetes from diagnostic measurements. The dataset, sourced from the National Institute of Diabetes and Digestive and Kidney Diseases, contains medical predictor variables such as the number of pregnancies, BMI, insulin level, and age. All patients in the dataset are female, at least 21 years old, and of Pima Indian heritage.

**Overview of the Process**

1. Data Collection and Analysis:

 * Loaded the dataset and explored its structure and summary statistics.
 * Identified class imbalance in the target variable, with more non-diabetic than diabetic instances.

2. Handling Imbalanced Dataset:

 * Used under-sampling to balance the dataset by randomly selecting as many non-diabetic samples as there are diabetic samples (268 each).

3. Data Preprocessing:

 * Separated features (X) and labels (Y).
 * Standardized the features using StandardScaler to ensure all variables have a mean of 0 and a standard deviation of 1.

4. Parameter Selection with GridSearchCV:

 * Used GridSearchCV with 5-fold cross-validation to find the best `C` and `kernel` for the SVM classifier.

5. Model Training:

 * Split the dataset into training and testing sets (80/20) with stratification.
 * Trained an SVM classifier with the best parameters.

6. Model Evaluation:

 * Evaluated the model's performance, achieving a training accuracy of 76.87% and a test accuracy of 77.78%.

7. Predictive System:

 * Implemented a predictive function that takes input data, preprocesses it, and uses the trained model to predict whether a person is diabetic or not.

Importing the Dependencies

In [None]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

### **1. Data Collection and Analysis**



In [None]:
# loading the diabetes dataset to a pandas DataFrame
diabetes_dataset = pd.read_csv('/content/diabetes.csv')

In [None]:
# printing the first 5 rows of the dataset
diabetes_dataset.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [None]:
# number of rows and Columns in this dataset
diabetes_dataset.shape

(768, 9)

In [None]:
# getting the statistical measures of the data
diabetes_dataset.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [None]:
diabetes_dataset['Outcome'].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

0 --> Non-Diabetic

1 --> Diabetic

The dataset is imbalanced: 500 non-diabetic instances versus 268 diabetic ones.
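The imbalance can also be expressed as class proportions. The following is a minimal sketch that rebuilds the label column from the counts printed above instead of reading the CSV; the `outcome` Series is an illustrative stand-in for `diabetes_dataset['Outcome']`, not part of the original notebook.

```python
import pandas as pd

# Illustrative stand-in for diabetes_dataset['Outcome'], rebuilt from the printed counts
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# normalize=True returns class fractions instead of raw counts
proportions = outcome.value_counts(normalize=True)
print(proportions.round(3))

# Ratio of majority-class size to minority-class size
imbalance_ratio = 500 / 268
print(round(imbalance_ratio, 2))
```

The minority fraction here matches the `mean` of `Outcome` in the `describe()` output, since the mean of a 0/1 column is the share of ones.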

### **2. Handling Imbalanced Dataset**

In [None]:
# separating the diabetic and non-diabetic patients
non_diabetic = diabetes_dataset[diabetes_dataset.Outcome == 0]
diabetic = diabetes_dataset[diabetes_dataset.Outcome == 1]

In [None]:
print(non_diabetic.shape)
print(diabetic.shape)

(500, 9)
(268, 9)


In [None]:
# randomly keep as many non-diabetic rows as there are diabetic rows (268)
non_diabetic_sample = non_diabetic.sample(n=len(diabetic), random_state=1)

In [None]:
new_dataset = pd.concat([non_diabetic_sample, diabetic], axis=0)

In [None]:
new_dataset.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
482,4,85,58,22,49,27.8,0.306,28,0
528,0,117,66,31,188,30.8,0.493,22,0
80,3,113,44,13,0,22.4,0.14,22,0
105,1,126,56,29,152,28.7,0.801,21,0
733,2,106,56,27,165,29.0,0.426,22,0


In [None]:
new_dataset.tail()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
755,1,128,88,39,110,36.5,1.057,37,1
757,0,123,72,0,0,36.3,0.258,52,1
759,6,190,92,0,0,35.5,0.278,66,1
761,9,170,74,31,0,44.0,0.403,43,1
766,1,126,60,0,0,30.1,0.349,47,1


In [None]:
new_dataset['Outcome'].value_counts()

Outcome
0    268
1    268
Name: count, dtype: int64
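The under-sampling steps above can be wrapped in a small reusable helper. This is a sketch under assumptions: the function name `undersample` and the toy DataFrame are hypothetical and only serve to show the idea on data that is not the diabetes dataset.

```python
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str, random_state: int = 1) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, axis=0)

# Toy frame (not the diabetes data): 6 negatives and 3 positives
toy = pd.DataFrame({"feature": range(9), "Outcome": [0] * 6 + [1] * 3})
balanced = undersample(toy, "Outcome")
print(balanced["Outcome"].value_counts())
```

Unlike the notebook cells, this helper also shuffles the rows of the minority class, because `sample` is applied to every group. Under-sampling discards majority-class rows (232 of them in this project), which is the price paid for a balanced training set.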

### **3. Data Preprocessing**

In [None]:
# features: every column except the target
X = new_dataset.drop(columns='Outcome')
Y = new_dataset['Outcome']

In [None]:
print(X)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
482            4       85             58             22       49  27.8   
528            0      117             66             31      188  30.8   
80             3      113             44             13        0  22.4   
105            1      126             56             29      152  28.7   
733            2      106             56             27      165  29.0   
..           ...      ...            ...            ...      ...   ...   
755            1      128             88             39      110  36.5   
757            0      123             72              0        0  36.3   
759            6      190             92              0        0  35.5   
761            9      170             74             31        0  44.0   
766            1      126             60              0        0  30.1   

     DiabetesPedigreeFunction  Age  
482                     0.306   28  
528                     0.493   22  


In [None]:
print(Y)

482    0
528    0
80     0
105    0
733    0
      ..
755    1
757    1
759    1
761    1
766    1
Name: Outcome, Length: 536, dtype: int64


Data Standardization

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
print(standardized_data)

[[-1.61760846e-03 -1.22337621e+00 -5.96621030e-01 ... -6.74299515e-01
  -5.64246834e-01 -4.98007237e-01]
 [-1.15766845e+00 -2.40457494e-01 -1.77225788e-01 ... -2.72642312e-01
  -1.82380614e-02 -1.02484647e+00]
 [-2.90630319e-01 -3.63322334e-01 -1.33056270e+00 ... -1.39728248e+00
  -1.04893911e+00 -1.02484647e+00]
 ...
 [ 5.76407813e-01  2.00182583e+00  1.18580875e+00 ...  3.56620640e-01
  -6.46002158e-01  2.83864125e+00]
 [ 1.44344595e+00  1.38750163e+00  2.42169454e-01 ...  1.49464938e+00
  -2.81023032e-01  8.19090851e-01]
 [-8.68655741e-01  3.59883952e-02 -4.91772220e-01 ... -3.66362326e-01
  -4.38694015e-01  1.17031701e+00]]
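To make the transformation concrete, the sketch below compares `StandardScaler` with the formula it applies per column, `(x - mean) / std`, where the standard deviation is the population one (`ddof=0`). The small `toy` array and the names `scaler_demo`, `by_sklearn`, and `by_hand` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: 3 samples, 2 features on very different scales
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaler_demo = StandardScaler().fit(toy)
by_sklearn = scaler_demo.transform(toy)

# Same computation by hand; np.std defaults to the population std (ddof=0)
by_hand = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(by_sklearn, by_hand))
print(by_sklearn.mean(axis=0).round(6), by_sklearn.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, which keeps large-valued features such as `Insulin` from dominating the SVM's distance computations.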


In [None]:
X = standardized_data
Y = np.asarray(new_dataset['Outcome'])

In [None]:
print(X)
print(Y)

[[-1.61760846e-03 -1.22337621e+00 -5.96621030e-01 ... -6.74299515e-01
  -5.64246834e-01 -4.98007237e-01]
 [-1.15766845e+00 -2.40457494e-01 -1.77225788e-01 ... -2.72642312e-01
  -1.82380614e-02 -1.02484647e+00]
 [-2.90630319e-01 -3.63322334e-01 -1.33056270e+00 ... -1.39728248e+00
  -1.04893911e+00 -1.02484647e+00]
 ...
 [ 5.76407813e-01  2.00182583e+00  1.18580875e+00 ...  3.56620640e-01
  -6.46002158e-01  2.83864125e+00]
 [ 1.44344595e+00  1.38750163e+00  2.42169454e-01 ...  1.49464938e+00
  -2.81023032e-01  8.19090851e-01]
 [-8.68655741e-01  3.59883952e-02 -4.91772220e-01 ... -3.66362326e-01
  -4.38694015e-01  1.17031701e+00]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

### **4. Parameter Selection with GridSearchCV:**

In [None]:
params={
    'C':[1,5,10,20],
    'kernel':['linear','poly','rbf','sigmoid']
}

In [None]:
classifier = GridSearchCV(SVC(),params,cv=5)

In [None]:
classifier.fit(X,Y)

In [None]:
result = pd.DataFrame(classifier.cv_results_)

In [None]:
result.head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.008713,0.002159,0.001429,6.6e-05,1,linear,"{'C': 1, 'kernel': 'linear'}",0.759259,0.738318,0.747664,0.766355,0.747664,0.751852,0.009833,2
1,0.006315,0.00054,0.001583,0.000121,1,poly,"{'C': 1, 'kernel': 'poly'}",0.685185,0.682243,0.719626,0.775701,0.738318,0.720215,0.034859,12
2,0.006413,0.000248,0.002583,0.000449,1,rbf,"{'C': 1, 'kernel': 'rbf'}",0.731481,0.682243,0.766355,0.738318,0.775701,0.73882,0.032782,6
3,0.009512,0.000847,0.00251,0.000534,1,sigmoid,"{'C': 1, 'kernel': 'sigmoid'}",0.694444,0.663551,0.71028,0.71028,0.728972,0.701506,0.021903,13
4,0.015152,0.002548,0.001509,3.6e-05,5,linear,"{'C': 5, 'kernel': 'linear'}",0.759259,0.719626,0.747664,0.775701,0.747664,0.749983,0.018329,3


In [None]:
result = result[['param_C','param_kernel','mean_test_score']]

In [None]:
result

Unnamed: 0,param_C,param_kernel,mean_test_score
0,1,linear,0.751852
1,1,poly,0.720215
2,1,rbf,0.73882
3,1,sigmoid,0.701506
4,5,linear,0.749983
5,5,poly,0.723849
6,5,rbf,0.740706
7,5,sigmoid,0.699637
8,10,linear,0.749983
9,10,poly,0.723884


In [None]:
classifier.best_params_

{'C': 20, 'kernel': 'linear'}

In [None]:
print(round(classifier.best_score_*100,2),'%')

75.19 %


Highest mean cross-validation accuracy = 75.19%

Best Parameters = {'C': 20, 'kernel': 'linear'}
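In this notebook the scaler is fitted and the grid search is run on all 536 rows before the train/test split, so the rows later used for testing have already influenced the scaling statistics and the chosen hyperparameters. One common alternative, shown here only as a sketch and not as what the notebook does, is to split first and place the scaler inside a `Pipeline` so that it is refitted within each cross-validation fold. The data comes from `make_classification` as a synthetic stand-in, and the names `pipe`, `param_grid`, and `search` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the balanced dataset (536 rows, 8 features)
X_demo, y_demo = make_classification(n_samples=536, n_features=8, random_state=0)

# Hold out the test set before any fitting takes place
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# The scaler is a pipeline step, so each CV fold fits it on that fold's training part only
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Same grid as above; the "svc__" prefix routes each parameter to the SVC step
param_grid = {
    "svc__C": [1, 5, 10, 20],
    "svc__kernel": ["linear", "poly", "rbf", "sigmoid"],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te) * 100, 2), "%")
```

With this arrangement the held-out score is computed on rows that neither the scaler nor the hyperparameter search has seen.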

### **5. Model Training**

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, stratify=Y, random_state=2)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

(536, 8) (428, 8) (108, 8)
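The effect of `stratify=Y` can be checked by counting the labels on each side of the split. The sketch below uses placeholder features and a label vector with the same 268/268 balance as `new_dataset`; the names `labels`, `features`, `y_tr`, and `y_te` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Label vector with the same class balance as the under-sampled dataset
labels = np.array([0] * 268 + [1] * 268)
# Placeholder feature column; only the labels matter for this check
features = np.arange(len(labels)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# Per-class counts in the training and test parts
print(np.bincount(y_tr), np.bincount(y_te))
```

Both parts keep the 50/50 class ratio, so accuracy on the test set is not skewed by an unlucky draw of one class.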


Training the Model

In [None]:
classifier = SVC(C=20,kernel='linear')

In [None]:
# training the support vector machine classifier
classifier.fit(X_train, Y_train)

### **6. Model Evaluation**

Accuracy Score

In [None]:
# accuracy score on the training data
X_train_prediction = classifier.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy score of the training data : ', round(training_data_accuracy*100,2),'%')

Accuracy score of the training data :  76.87 %


In [None]:
# accuracy score on the test data
X_test_prediction = classifier.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy score of the test data : ', round(test_data_accuracy*100,2),'%')

Accuracy score of the test data :  77.78 %
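Accuracy alone does not show which kind of mistake the model makes, and in a medical setting a missed diabetic case (false negative) usually matters more than a false alarm. A confusion matrix and per-class report make this visible. The labels below are hypothetical values chosen for illustration; they are not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels for illustration only; not the notebook's predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("false negatives:", fn, "false positives:", fp)

print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["Non-Diabetic", "Diabetic"]))
```

In the notebook, the same calls could be applied to `Y_test` and `X_test_prediction`.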


### **7. Predictive System**

In [None]:


def predict(input_data):
  # changing the input_data to a numpy array
  input_data_as_numpy_array = np.asarray(input_data)

  # reshape the array as we are predicting for one instance
  input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

  # standardize the input data with the scaler fitted earlier
  std_data = scaler.transform(input_data_reshaped)

  # predict the class label (0 = non-diabetic, 1 = diabetic)
  prediction = classifier.predict(std_data)

  if prediction[0] == 0:
    print('The person is not diabetic')
  else:
    print('The person is diabetic')

In [None]:
input_data = (9,171,110,24,240,45.4,0.721,54)
predict(input_data)

The person is diabetic
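The `predict` function above depends on the global `scaler` and `classifier` and prints its result. A variant that receives both objects as arguments and returns the label is easier to reuse and test. The sketch below is one possible shape, not the notebook's implementation: `predict_label`, `FEATURES`, and the stand-in model fitted on random numbers are all assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Column order of the dataset's eight predictor variables
FEATURES = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

def predict_label(input_data, scaler, classifier):
    """Return 'diabetic' or 'not diabetic' for one patient's measurements."""
    if len(input_data) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} values, got {len(input_data)}")
    # A one-row DataFrame keeps the column names the scaler was fitted with
    row = pd.DataFrame([input_data], columns=FEATURES)
    prediction = classifier.predict(scaler.transform(row))
    return "diabetic" if prediction[0] == 1 else "not diabetic"

# Stand-in scaler and model fitted on random numbers, only so the sketch runs end to end
rng = np.random.default_rng(0)
X_fake = pd.DataFrame(rng.normal(size=(40, 8)), columns=FEATURES)
y_fake = np.array([0, 1] * 20)
scaler_demo = StandardScaler().fit(X_fake)
clf_demo = SVC(kernel="linear").fit(scaler_demo.transform(X_fake), y_fake)

print(predict_label((9, 171, 110, 24, 240, 45.4, 0.721, 54), scaler_demo, clf_demo))
```

Because the stand-in model is trained on noise, its output here carries no meaning; with the notebook's fitted `scaler` and `classifier` passed in, the function would follow the same decision as `predict`.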


