# Diabetes Prediction Using SVM


The objective of this project is to classify whether a person has diabetes, using a support vector machine (SVM) trained on diagnostic measurements.


## Importing Necessary Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import mean_squared_error
from math import sqrt
import warnings
warnings.filterwarnings("ignore")

## Loading the data

In [2]:
df=pd.read_csv("D:/DATASCIENCE/MLQuest/diabetes.csv")
df

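| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |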


About the dataset:-
* Pregnancies :- Number of times a woman has been pregnant
* Glucose :- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* BloodPressure :- Diastolic blood pressure (mm Hg)
* SkinThickness :- Triceps skin fold thickness (mm)
* Insulin :- 2-hour serum insulin (mu U/ml)
* BMI :- Body mass index (weight in kg / (height in m)^2)
* Age :- Age (years)
* DiabetesPedigreeFunction :- A score for the likelihood of diabetes based on family history
* Outcome :- 0 (doesn't have diabetes) or 1 (has diabetes)

The dataset consists of several medical variables (independent) and one outcome variable (dependent). The independent variables are 'Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', and 'Age'.

## Data Exploration

In [3]:
type(df) #to know the type of data

pandas.core.frame.DataFrame

In [4]:
df.shape #getting the number of rows and columns

(768, 9)

In [5]:
df.columns #getting column names in the data

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [6]:
df.head() #display the top 5 data records

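| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |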


In [7]:
df.info()  #concise summary of DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [8]:
df.describe() #basic statistics of the data

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


We observe that the minimum value of some columns is 0, which is not medically possible for measurements such as glucose, blood pressure, skin thickness, insulin, and BMI.

In [9]:
df.isnull().sum() #finding the number of missing values

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Here, we can see that there are no null values in the data. Even so, some cleaning is needed before fitting the model. The features Glucose, BloodPressure, SkinThickness, Insulin, and BMI have minimum values of 0, which most likely stand in for missing measurements. To handle this, the zeros in each of those five features are converted to null values and then replaced with the column mean.

## Data Cleaning

In [10]:
# Checking for 0 values in 5 columns
print(df[df['BloodPressure']==0].shape[0])
print(df[df['Glucose']==0].shape[0])
print(df[df['SkinThickness']==0].shape[0])
print(df[df['Insulin']==0].shape[0])
print(df[df['BMI']==0].shape[0])

35
5
227
374
11


In [11]:
# Replace zero values in columns where zero is not a valid measurement
zero_not_allowed = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]

for column in zero_not_allowed:
    df[column] = df[column].replace(0, np.nan)      # mark zeros as missing
    mean = int(df[column].mean(skipna = True))      # column mean, truncated to an integer
    df[column] = df[column].replace(np.nan, mean)   # fill the missing values with that mean
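
To make the effect of this loop concrete, here is a minimal sketch on a made-up three-row frame. The values and the name `toy` are illustrative assumptions, not rows from the dataset. Zeros become NaN, the mean is taken over the remaining values, truncated to an integer as in the cell above, and written back.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the zero in each column is treated as missing
toy = pd.DataFrame({"Glucose": [0, 100, 121], "BMI": [30.0, 0.0, 25.0]})

for column in ["Glucose", "BMI"]:
    toy[column] = toy[column].replace(0, np.nan)   # mark zeros as missing
    fill_value = int(toy[column].mean())           # mean() skips NaN by default
    toy[column] = toy[column].fillna(fill_value)   # impute with the truncated mean

print(toy)
```

Note that truncating the mean loses a little precision (110.5 becomes 110 for the toy Glucose column); using the untruncated mean or the median would be equally reasonable choices.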

In [12]:
df["Outcome"].value_counts() #count the number of occurrences of each unique value

0    500
1    268
Name: Outcome, dtype: int64
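
The classes are imbalanced, so it helps to know what accuracy a trivial model would reach. The short sketch below computes the majority-class baseline from the counts shown above; the variable names are illustrative.

```python
# Class counts taken from the value_counts() output above
n_negative, n_positive = 500, 268

# A classifier that always predicts the majority class (0) reaches this accuracy
baseline_accuracy = n_negative / (n_negative + n_positive)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should be judged against this figure of roughly 65% rather than against 50%.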

In [13]:
df.groupby("Outcome").mean() #grouping the DataFrame by a categorical variable

| Outcome | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.298 | 110.706 | 70.92 | 27.726 | 141.952 | 30.8802 | 0.429734 | 31.19 |
| 1 | 4.865672 | 142.160448 | 75.123134 | 31.686567 | 180.149254 | 35.381343 | 0.5505 | 37.067164 |


## Data Splitting

In [14]:
# Split the data into features (x) and target variable (y)
x=df.drop(columns="Outcome")
y=df["Outcome"]

In [15]:
x

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 155.0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 155.0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183.0 | 64.0 | 29.0 | 155.0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101.0 | 76.0 | 48.0 | 180.0 | 32.9 | 0.171 | 63 |
| 764 | 2 | 122.0 | 70.0 | 27.0 | 155.0 | 36.8 | 0.340 | 27 |
| 765 | 5 | 121.0 | 72.0 | 23.0 | 112.0 | 26.2 | 0.245 | 30 |
| 766 | 1 | 126.0 | 60.0 | 29.0 | 155.0 | 30.1 | 0.349 | 47 |


In [16]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

## Data Standardization

In [17]:
Scaler=StandardScaler()

In [19]:
standardized_data=Scaler.fit_transform(x) # Fit the scaler on the data and transform it

In [20]:
standardized_data

array([[ 0.63994726,  0.86525364, -0.03198993, ...,  0.16724016,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.20601255, -0.5283186 , ..., -0.85155088,
        -0.36506078, -0.19067191],
       [ 1.23388019,  2.01595708, -0.69376149, ..., -1.33183808,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 , -0.02243187, -0.03198993, ..., -0.90976751,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.14195434, -1.02464727, ..., -0.34215536,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.94299462, -0.19743282, ..., -0.29849289,
        -0.47378505, -0.87137393]])
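
For readers unfamiliar with what `StandardScaler` computes, the sketch below reproduces it by hand on a small made-up matrix (the `features` array is an assumption for illustration). Each column has its mean subtracted and is divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix: 4 samples, 2 features
features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

scaled = StandardScaler().fit_transform(features)

# Same computation by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses
manual = (features - features.mean(axis=0)) / features.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and unit variance, which keeps features with large raw ranges (such as Insulin) from dominating the SVM's margin.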

In [21]:
x=standardized_data
y=df["Outcome"]

In [22]:
x

array([[ 0.63994726,  0.86525364, -0.03198993, ...,  0.16724016,
         0.46849198,  1.4259954 ],
       [-0.84488505, -1.20601255, -0.5283186 , ..., -0.85155088,
        -0.36506078, -0.19067191],
       [ 1.23388019,  2.01595708, -0.69376149, ..., -1.33183808,
         0.60439732, -0.10558415],
       ...,
       [ 0.3429808 , -0.02243187, -0.03198993, ..., -0.90976751,
        -0.68519336, -0.27575966],
       [-0.84488505,  0.14195434, -1.02464727, ..., -0.34215536,
        -0.37110101,  1.17073215],
       [-0.84488505, -0.94299462, -0.19743282, ..., -0.29849289,
        -0.47378505, -0.87137393]])

In [23]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [24]:
# Split the data into training and testing sets
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.1,stratify=y,random_state=2)

In [25]:
print(x.shape,x_train.shape,x_test.shape)

(768, 8) (691, 8) (77, 8)
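
In the cells above, the scaler is fitted on all 768 rows before the split, so the test rows contribute to the scaling statistics. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data from `make_classification`, used here as a stand-in for the diabetes features; the names `X_demo`, `y_demo`, and `model` are assumptions and this is not the notebook's own method.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as the diabetes features (assumption)
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=2)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=2
)

# fit() scales using training statistics only, then trains the linear SVM
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)

print(X_tr.shape, X_te.shape)
print("test accuracy:", model.score(X_te, y_te))
```

With this arrangement, `model.predict` also applies the same scaling to any new raw sample automatically.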


In [26]:
classifier=svm.SVC(kernel="linear")

## Model Training

In [27]:
classifier.fit(x_train,y_train)

In [28]:
# Prediction on training data
train_prediction=classifier.predict(x_train)

In [29]:
RMSE_model_train = sqrt(mean_squared_error(y_train, train_prediction))
print("RMSE for Training Data: ", RMSE_model_train)

RMSE for Training Data:  0.47817792134027715


In [30]:
training_data_accuracy=accuracy_score(train_prediction,y_train)
print("accuracy on training data:",training_data_accuracy)

accuracy on training data: 0.7713458755426917


## Model Evaluation

In [31]:
# Prediction on test data
test_prediction=classifier.predict(x_test)

In [32]:
RMSE_model_test = sqrt(mean_squared_error(y_test, test_prediction))
print("RMSE for Testing Data: ", RMSE_model_test)

RMSE for Testing Data:  0.4698714938993648


In [33]:
test_data_accuracy=accuracy_score(test_prediction,y_test)
print("accuracy on test data:",test_data_accuracy)

accuracy on test data: 0.7792207792207793


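* The RMSE values on the training data (0.478) and the test data (0.470) are close, which suggests the model is not overfitting. Closeness alone does not rule out underfitting, so it should be read alongside the accuracy scores.

* The accuracy is about 77% on the training data and 78% on the test data. For reference, always predicting the majority class (500 of 768 samples are non-diabetic) would give about 65% accuracy, so the model performs moderately better than that baseline.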
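
For 0/1 labels, each squared error is 1 for a misclassified sample and 0 otherwise, so the mean squared error equals the misclassification rate and RMSE is simply the square root of (1 - accuracy). The sketch below checks this against the accuracies reported above; it carries no information beyond what accuracy already provides.

```python
from math import sqrt, isclose

# Accuracies reported in the cells above
train_accuracy = 0.7713458755426917
test_accuracy = 0.7792207792207793

# With 0/1 labels, mean squared error is the misclassification rate
rmse_train = sqrt(1 - train_accuracy)
rmse_test = sqrt(1 - test_accuracy)

print(rmse_train, rmse_test)
```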

In [34]:
conf_matrix = confusion_matrix(y_test,test_prediction)
print(f"Confusion Matrix:\n{conf_matrix}")

Confusion Matrix:
[[46  4]
 [13 14]]
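
The confusion matrix can be broken down into per-class metrics. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the matrix above reads as 46 true negatives, 4 false positives, 13 false negatives, and 14 true positives. A minimal sketch of the arithmetic:

```python
import numpy as np

# Confusion matrix from the output above: rows are true labels, columns are predictions
conf_matrix = np.array([[46, 4], [13, 14]])
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()
precision = tp / (tp + fp)      # of predicted diabetics, how many are diabetic
recall = tp / (tp + fn)         # of true diabetics, how many were found
specificity = tn / (tn + fp)    # of true non-diabetics, how many were cleared

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```

The recall of about 0.52 means that 13 of the 27 diabetic cases in the test set were missed, which the overall accuracy figure does not reveal.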


In [35]:
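# Prediction on sample data
# The classifier was trained on standardized features, so a raw sample such as
# this one should be passed through Scaler.transform before predict.
# It is left unscaled here.
input_data=(6,148,72,35,0,33.6,0.627,50)
input_data_as_numpy_array=np.asarray(input_data)
input_data_reshaped=input_data_as_numpy_array.reshape(1,-1)
prediction=classifier.predict(input_data_reshaped)
print(prediction)
# Outcome 1 means the person has diabetes, 0 means they do not
if(prediction[0]==1):
    print("The person is diabetic")
else:
    print("The person is not diabetic")

[1]
The person is diabetic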
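
Because the classifier is trained on standardized features, a new raw sample should go through the same fitted scaler before prediction. The sketch below shows that pattern on synthetic stand-in data. The helper `predict_label` and the names `demo_scaler` and `demo_classifier` are hypothetical; in this notebook the fitted `Scaler` and `classifier` objects would take their place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption), with 8 features like the diabetes set
X_raw, y_raw = make_classification(n_samples=200, n_features=8, random_state=0)
demo_scaler = StandardScaler().fit(X_raw)
demo_classifier = SVC(kernel="linear").fit(demo_scaler.transform(X_raw), y_raw)

def predict_label(sample, scaler, model):
    """Scale one raw sample with the fitted scaler, then classify it."""
    sample_2d = np.asarray(sample, dtype=float).reshape(1, -1)
    return int(model.predict(scaler.transform(sample_2d))[0])

label = predict_label(X_raw[0], demo_scaler, demo_classifier)
# Label 1 is mapped to "diabetic", matching the Outcome encoding
print("diabetic" if label == 1 else "not diabetic")
```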
