### Navigating Health: Detecting Diabetes Groups with Naive Bayes

**Introduction:**<br>

In the vast landscape of data-driven healthcare, machine learning serves as a powerful tool for predicting, diagnosing, and understanding various medical conditions. In this blog post, we'll explore the application of Naive Bayes, a probabilistic classification algorithm, in detecting diabetes groups. By leveraging the features provided by the dataset, we aim to build a model that can aid in identifying individuals at risk of diabetes.<br>


**Understanding Diabetes:**<br>
Diabetes, a chronic medical condition, affects millions of people worldwide. Early detection and intervention are crucial for managing the disease effectively. Machine learning models, such as Naive Bayes, can assist in identifying patterns within medical data to predict the likelihood of diabetes.<br>

**The Naive Bayes Algorithm:**<br>
Naive Bayes is a probabilistic classifier based on Bayes' theorem, which gives the probability of a hypothesis (here, a diabetes group) given the observed evidence (the patient's measurements). It is called "naive" because it assumes the features are independent of one another given the class. That assumption rarely holds exactly, yet Naive Bayes often performs well in practice, especially with limited data.<br>
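
To make the idea concrete, here is a minimal sketch of the Gaussian variant computed by hand and compared against scikit-learn. The toy data, the class names `A` and `B`, and the helper `gaussian_nb_log_posterior` are made up for illustration and are not part of the diabetes dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (made up for illustration): two features, two classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
              [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]])
y = np.array(["A", "A", "A", "B", "B", "B"])


def gaussian_nb_log_posterior(X_train, y_train, x_new):
    """Return the classes and their unnormalized log posteriors for one sample."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        prior = len(Xc) / len(X_train)          # P(class)
        mean = Xc.mean(axis=0)                  # per-feature mean
        var = Xc.var(axis=0) + 1e-9             # per-feature variance, kept above zero
        # Naive assumption: per-feature log likelihoods simply add up.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x_new - mean) ** 2 / var)
        scores.append(np.log(prior) + log_lik)
    return classes, np.array(scores)


x_new = np.array([1.1, 2.1])
classes, scores = gaussian_nb_log_posterior(X, y, x_new)
manual_prediction = classes[np.argmax(scores)]
sklearn_prediction = GaussianNB().fit(X, y).predict([x_new])[0]
print(manual_prediction, sklearn_prediction)
```

The hand-rolled version and `GaussianNB` should agree on the predicted class here, because both pick the class with the highest prior times the product of per-feature Gaussian likelihoods.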

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
df = pd.read_csv("Dataset of Diabetes .csv")

In [4]:
df.head()

Unnamed: 0,ID,No_Pation,Gender,AGE,Urea,Cr,HbA1c,Chol,TG,HDL,LDL,VLDL,BMI,CLASS
0,502,17975,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N
1,735,34221,M,26,4.5,62,4.9,3.7,1.4,1.1,2.1,0.6,23.0,N
2,420,47975,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N
3,680,87656,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N
4,504,34223,M,33,7.1,46,4.9,4.9,1.0,0.8,2.0,0.4,21.0,N


In [6]:
df1 = pd.read_csv("Diabetes.csv")

In [7]:
df1.head()

Unnamed: 0.1,Unnamed: 0,relwt,glufast,glutest,instest,sspg,group
0,1,0.81,80,356,124,55,Normal
1,2,0.95,97,289,117,76,Normal
2,3,0.94,105,319,143,105,Normal
3,4,1.04,90,356,199,108,Normal
4,5,1.0,90,323,240,143,Normal


In [8]:
df1 = df1.drop("Unnamed: 0", axis = 1)
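
A column named `Unnamed: 0` usually appears when a DataFrame's row index was written into the CSV. As a small sketch (the CSV text below is made up, not the real file), passing `index_col=0` to `pd.read_csv` reads that column back as the index, so no drop is needed afterwards.

```python
import io

import pandas as pd

# Made-up CSV text whose first, unnamed column is a saved row index.
csv_text = ",relwt,glufast,group\n1,0.81,80,Normal\n2,0.95,97,Normal\n"

with_index_column = pd.read_csv(io.StringIO(csv_text))
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_index_column.columns))     # includes 'Unnamed: 0'
print(list(without_index_column.columns))  # only the real columns
```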

In [9]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 145 entries, 0 to 144
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   relwt    145 non-null    float64
 1   glufast  145 non-null    int64  
 2   glutest  145 non-null    int64  
 3   instest  145 non-null    int64  
 4   sspg     145 non-null    int64  
 5   group    145 non-null    object 
dtypes: float64(1), int64(4), object(1)
memory usage: 6.9+ KB


In [10]:
df1.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
relwt,145.0,0.97731,0.129235,0.71,0.88,0.98,1.08,1.2
glufast,145.0,121.986207,63.930408,70.0,90.0,97.0,112.0,353.0
glutest,145.0,543.613793,316.950863,269.0,352.0,413.0,558.0,1568.0
instest,145.0,186.117241,120.935158,10.0,118.0,156.0,221.0,748.0
sspg,145.0,184.206897,106.029863,29.0,100.0,159.0,257.0,480.0


In [11]:
df1.isnull().sum()

relwt      0
glufast    0
glutest    0
instest    0
sspg       0
group      0
dtype: int64

In [12]:
x = df1.drop("group", axis=1)
y = df1["group"]

In [13]:
x

Unnamed: 0,relwt,glufast,glutest,instest,sspg
0,0.81,80,356,124,55
1,0.95,97,289,117,76
2,0.94,105,319,143,105
3,1.04,90,356,199,108
4,1.00,90,323,240,143
...,...,...,...,...,...
140,1.05,353,1428,41,480
141,0.91,180,923,77,150
142,0.90,213,1025,29,209
143,1.11,328,1246,124,442


In [14]:
type(y)

pandas.core.series.Series

In [15]:
df1["group"].head()

0    Normal
1    Normal
2    Normal
3    Normal
4    Normal
Name: group, dtype: object

In [16]:
y.value_counts()

Normal               76
Chemical_Diabetic    36
Overt_Diabetic       33
Name: group, dtype: int64

In [17]:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=42)

In [18]:
xtrain.shape

(116, 5)

In [19]:
ytrain.shape

(116,)

In [20]:
xtest.shape

(29, 5)

In [21]:
ytest.shape

(29,)
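
With only 145 rows and three unevenly sized groups, a plain random split can leave the 29-row test set with a different class mix than the training set. One option, sketched below, is the `stratify` argument of `train_test_split`. The labels are rebuilt from the class counts shown above, and the single feature column is a placeholder, so this is an illustration rather than the split used in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the class counts above; the feature is a placeholder.
labels = np.array(["Normal"] * 76 + ["Chemical_Diabetic"] * 36 + ["Overt_Diabetic"] * 33)
features = np.arange(len(labels)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, stratify=labels
)

# Each class appears in the test set in roughly its overall proportion.
test_classes, test_counts = np.unique(y_te, return_counts=True)
print(dict(zip(test_classes, test_counts)))
```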

In [22]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
GNB = GaussianNB()
GNB.fit(xtrain, ytrain)

GaussianNB()

In [23]:
print("The test score of the GNB model is ", GNB.score(xtest,ytest))

The test score of the GNB model is  0.9655172413793104


In [24]:
MNB = MultinomialNB()
MNB.fit(xtrain, ytrain)


MultinomialNB()

In [25]:
print("The test score of the MNB model is ", MNB.score(xtest,ytest))

The test score of the MNB model is  0.8275862068965517


In [26]:
BNB = BernoulliNB()
BNB.fit(xtrain,ytrain)

BernoulliNB()

In [27]:
print("The test score of the BNB model is ", BNB.score(xtest,ytest))

The test score of the BNB model is  0.5172413793103449
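
A plausible reason for the gap between the three variants is what each one assumes about its inputs. `GaussianNB` models continuous features, `MultinomialNB` is designed for counts, and `BernoulliNB` binarizes every feature at a threshold (`binarize=0.0` by default). When all measurements are positive, every feature becomes 1 for every row, so `BernoulliNB` has nothing to separate the classes with. The sketch below shows this on synthetic data, which stands in for the real measurements.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in: strictly positive features, two well-separated classes.
X = np.vstack([
    rng.normal(100, 10, size=(50, 3)),
    rng.normal(200, 10, size=(50, 3)),
])
X = np.clip(X, 1, None)  # guarantee every value is positive
y = np.array([0] * 50 + [1] * 50)

# With the default threshold of 0.0, every row binarizes to all ones.
bnb_preds = BernoulliNB().fit(X, y).predict(X)
gnb_preds = GaussianNB().fit(X, y).predict(X)

print("Distinct BernoulliNB predictions:", np.unique(bnb_preds))
print("GaussianNB accuracy:", (gnb_preds == y).mean())
```

`BernoulliNB` returns the same class for every row here, while `GaussianNB` uses the actual values. Setting the `binarize` parameter to a meaningful threshold, or sticking with `GaussianNB` for continuous lab values, avoids the problem.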


In [28]:
info1 = [0.94, 105, 19, 143, 105]  # relwt, glufast, glutest, instest, sspg

In [29]:
len(info1)

5

In [30]:
# Convert the list to a 2D array of shape (1, 5): one sample, five features
info1 = np.array([info1])

In [31]:
info1

array([[  0.94, 105.  ,  19.  , 143.  , 105.  ]])

In [32]:
GNB.predict(info1)

array(['Overt_Diabetic'], dtype='<U17')

In [33]:
info2 = [0.95, 105, 310, 142, 105]  # relwt, glufast, glutest, instest, sspg

In [34]:
# Convert the list to a 2D array of shape (1, 5): one sample, five features
info2 = np.array([info2])

In [35]:
GNB.predict(info2)

array(['Normal'], dtype='<U17')
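
A single label hides how confident the model is. As a hedged sketch, `predict_proba` returns one probability per class, and passing a one-row DataFrame with the training column names keeps the feature order explicit. The six training rows and the `low`/`high` labels below are made up for illustration; only the column names and the new sample come from this post.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

cols = ["relwt", "glufast", "glutest", "instest", "sspg"]

# Made-up training rows (not the real dataset), labelled "low" or "high".
train = pd.DataFrame(
    [[0.8, 80, 350, 120, 60], [0.9, 95, 300, 130, 80], [1.0, 90, 320, 200, 140],
     [1.0, 300, 1300, 50, 400], [0.9, 200, 1000, 60, 200], [1.1, 330, 1250, 120, 440]],
    columns=cols,
)
labels = ["low", "low", "low", "high", "high", "high"]
model = GaussianNB().fit(train, labels)

# One new sample, built with the same column names as the training data.
new_patient = pd.DataFrame([[0.95, 105, 310, 142, 105]], columns=cols)
proba = model.predict_proba(new_patient)[0]
predicted = model.predict(new_patient)[0]

for cls, p in zip(model.classes_, proba):
    print(f"{cls}: {p:.3f}")
```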

In [36]:
pred = GNB.predict(info1)

In [37]:
if pred[0] == 'Normal':
    print("Patient is Normal")
elif pred[0] == 'Chemical_Diabetic':
    print("Patient is Chemical_Diabetic")
else:
    print("Patient is Overt_Diabetic")

Patient is Overt_Diabetic


In [38]:
pred = list(pred)

In [39]:
pred.extend(list(GNB.predict(info2)))

In [40]:
pred

['Overt_Diabetic', 'Normal']

In [41]:
if pred[1] == 'Normal':
    print("Patient is Normal")
elif pred[1] == 'Chemical_Diabetic':
    print("Patient is Chemical_Diabetic")
else:
    print("Patient is Overt_Diabetic")

Patient is Normal


In [45]:
n = int(input("Enter index: "))
if pred[n] == 'Normal':
    print("Patient is Normal")
elif pred[n] == 'Chemical_Diabetic':
    print("Patient is Chemical_Diabetic")
else:
    print("Patient is Overt_Diabetic")

Enter index: 1
Patient is Normal
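
Repeating the same `if`/`elif` chain for every index is easy to get wrong. One possible alternative is a small helper that takes the label itself. The function name `describe_prediction` and the `KNOWN_GROUPS` set are hypothetical and not part of the original notebook.

```python
# The three group labels that appear in this dataset.
KNOWN_GROUPS = {"Normal", "Chemical_Diabetic", "Overt_Diabetic"}


def describe_prediction(label):
    """Return a readable message for one predicted group label."""
    if label not in KNOWN_GROUPS:
        raise ValueError(f"Unexpected label: {label!r}")
    return f"Patient is {label}"


predictions = ["Overt_Diabetic", "Normal"]
messages = [describe_prediction(label) for label in predictions]
for message in messages:
    print(message)
```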


In [46]:
df

Unnamed: 0,ID,No_Pation,Gender,AGE,Urea,Cr,HbA1c,Chol,TG,HDL,LDL,VLDL,BMI,CLASS
0,502,17975,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N
1,735,34221,M,26,4.5,62,4.9,3.7,1.4,1.1,2.1,0.6,23.0,N
2,420,47975,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N
3,680,87656,F,50,4.7,46,4.9,4.2,0.9,2.4,1.4,0.5,24.0,N
4,504,34223,M,33,7.1,46,4.9,4.9,1.0,0.8,2.0,0.4,21.0,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,200,454317,M,71,11.0,97,7.0,7.5,1.7,1.2,1.8,0.6,30.0,Y
996,671,876534,M,31,3.0,60,12.3,4.1,2.2,0.7,2.4,15.4,37.2,Y
997,669,87654,M,30,7.1,81,6.7,4.1,1.1,1.2,2.4,8.1,27.4,Y
998,99,24004,M,38,5.8,59,6.7,5.3,2.0,1.6,2.9,14.0,40.5,Y


In [47]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,1000.0,340.5,240.3977,1.0,125.75,300.5,550.25,800.0
No_Pation,1000.0,270551.408,3380758.0,123.0,24063.75,34395.5,45384.25,75435657.0
AGE,1000.0,53.528,8.799241,20.0,51.0,55.0,59.0,79.0
Urea,1000.0,5.124743,2.935165,0.5,3.7,4.6,5.7,38.9
Cr,1000.0,68.943,59.98475,6.0,48.0,60.0,73.0,800.0
HbA1c,1000.0,8.28116,2.534003,0.9,6.5,8.0,10.2,16.0
Chol,1000.0,4.86282,1.301738,0.0,4.0,4.8,5.6,10.3
TG,1000.0,2.34961,1.401176,0.3,1.5,2.0,2.9,13.8
HDL,1000.0,1.20475,0.6604136,0.2,0.9,1.1,1.3,9.9
LDL,1000.0,2.60979,1.115102,0.3,1.8,2.5,3.3,9.9


In [48]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         1000 non-null   int64  
 1   No_Pation  1000 non-null   int64  
 2   Gender     1000 non-null   object 
 3   AGE        1000 non-null   int64  
 4   Urea       1000 non-null   float64
 5   Cr         1000 non-null   int64  
 6   HbA1c      1000 non-null   float64
 7   Chol       1000 non-null   float64
 8   TG         1000 non-null   float64
 9   HDL        1000 non-null   float64
 10  LDL        1000 non-null   float64
 11  VLDL       1000 non-null   float64
 12  BMI        1000 non-null   float64
 13  CLASS      1000 non-null   object 
dtypes: float64(8), int64(4), object(2)
memory usage: 109.5+ KB


In [49]:
df["CLASS"].value_counts()

Y    844
N    103
P     53
Name: CLASS, dtype: int64
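
These counts are heavily imbalanced, so plain accuracy needs a reference point. The sketch below rebuilds the labels from the counts above and scores a "model" that always predicts `Y`. It is a baseline over the full dataset rather than the test split used later, so treat the numbers as a rough yardstick.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Class counts taken from the value_counts() output above.
counts = {"Y": 844, "N": 103, "P": 53}
y_true = np.repeat(list(counts.keys()), list(counts.values()))

# A trivial baseline that predicts the majority class for everyone.
y_majority = np.full(y_true.shape, "Y")

baseline_accuracy = accuracy_score(y_true, y_majority)
baseline_balanced = balanced_accuracy_score(y_true, y_majority)

print(f"Majority-class accuracy: {baseline_accuracy:.3f}")
print(f"Majority-class balanced accuracy: {baseline_balanced:.3f}")
```

Always answering `Y` already reaches 0.844 accuracy while finding none of the `N` or `P` cases, which is why per-class metrics such as balanced accuracy or a confusion matrix are worth reporting alongside the overall score.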

In [50]:
df.isna().sum()

ID           0
No_Pation    0
Gender       0
AGE          0
Urea         0
Cr           0
HbA1c        0
Chol         0
TG           0
HDL          0
LDL          0
VLDL         0
BMI          0
CLASS        0
dtype: int64

In [52]:
from methods import preprocessing
Xnew = preprocessing(df.drop("CLASS",axis=1))


In [53]:
#df = df.drop("ID", axis = 1)

In [54]:
#df = preprocessing(df)

In [55]:
#ynew = df[["CLASS_N", "CLASS_P", "CLASS_Y"]]

In [56]:
Xnew

Unnamed: 0,ID,No_Pation,AGE,Urea,Cr,HbA1c,Chol,TG,HDL,LDL,VLDL,BMI,Gender_F,Gender_M,Gender_f
0,0.672140,-0.074747,-0.401144,-0.144781,-0.382672,-1.334983,-0.509436,-1.035084,1.810756,-1.085457,-0.369958,-1.124622,1,0,0
1,1.641852,-0.069940,-3.130017,-0.212954,-0.115804,-1.334983,-0.893730,-0.678063,-0.158692,-0.457398,-0.342649,-1.326239,0,1,0
2,0.330868,-0.065869,-0.401144,-0.144781,-0.382672,-1.334983,-0.509436,-1.035084,1.810756,-1.085457,-0.369958,-1.124622,1,0,0
3,1.412950,-0.054126,-0.401144,-0.144781,-0.382672,-1.334983,-0.509436,-1.035084,1.810756,-1.085457,-0.369958,-1.124622,1,0,0
4,0.680463,-0.069939,-2.334096,0.673299,-0.382672,-1.334983,0.028576,-0.963680,-0.613180,-0.547121,-0.397267,-1.729472,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,-0.584741,0.054384,1.986619,2.002680,0.467970,-0.505840,2.026906,-0.463850,-0.007196,-0.726566,-0.342649,0.085078,0,1,0
996,1.375493,0.179334,-2.561502,-0.724254,-0.149162,1.586758,-0.586295,-0.106828,-0.764676,-0.188229,3.699116,1.536719,0,1,0
997,1.367170,-0.054127,-2.675205,0.673299,0.201102,-0.624289,-0.586295,-0.892276,-0.007196,-0.188229,1.705543,-0.439125,0,1,0
998,-1.005088,-0.072963,-1.765581,0.230173,-0.165842,-0.624289,0.336011,-0.249637,0.598788,0.260385,3.316787,2.202054,0,1,0
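
The `preprocessing` function comes from a local `methods` module that is not shown in this post, so its exact behaviour is unknown. Judging by the output, it standardizes the numeric columns and one-hot encodes `Gender`. The sketch below is an assumption about what such a step could look like, with two deliberate differences: it normalizes the case of `Gender` so that `f` and `F` do not become separate columns, and it drops the identifier columns `ID` and `No_Pation`, which carry no clinical information. The name `preprocess_sketch` and the toy rows are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess_sketch(frame, id_columns=("ID", "No_Pation")):
    """Hypothetical stand-in for the unseen `methods.preprocessing`."""
    out = frame.drop(columns=[c for c in id_columns if c in frame.columns])
    num_cols = [c for c in out.columns if pd.api.types.is_numeric_dtype(out[c])]
    cat_cols = [c for c in out.columns if c not in num_cols]
    # Normalize text categories so "f", "F" and "F " map to one value.
    for c in cat_cols:
        out[c] = out[c].str.strip().str.upper()
    # Standardize numeric columns to zero mean and unit variance.
    out = out.astype({c: float for c in num_cols})
    out[num_cols] = StandardScaler().fit_transform(out[num_cols])
    # One-hot encode the categorical columns as 0/1 integers.
    return pd.get_dummies(out, columns=cat_cols, dtype=int)


# Toy rows (made up) that mimic the column layout of the dataset.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "No_Pation": [11, 22, 33, 44],
    "Gender": ["F", "M", "f", "M "],
    "AGE": [50, 26, 33, 71],
    "BMI": [24.0, 23.0, 21.0, 30.0],
})
processed = preprocess_sketch(toy)
print(processed.columns.tolist())
```

Fitting the scaler on the full dataset before splitting, as done here and in the notebook, lets test-set statistics influence the training features. Fitting it on the training split only is the safer choice.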


In [57]:
ynew = df["CLASS"]

In [58]:
from sklearn.model_selection import train_test_split
xtrain1, xtest1, ytrain1, ytest1 = train_test_split(Xnew,ynew, test_size=0.2, random_state=42)

In [59]:
xtrain1.shape

(800, 15)

In [60]:
ytrain1.shape

(800,)

In [61]:
xtest1.shape

(200, 15)

In [62]:
ytest1.shape

(200,)

In [63]:
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
Gcls= GaussianNB()
Gcls.fit(xtrain1,ytrain1)

GaussianNB()

In [64]:
print("The test score of the GaussianNB is : ", Gcls.score(xtest1, ytest1))

The test score of the GaussianNB is :  0.88


In [65]:
Bcls = BernoulliNB()
Bcls.fit(xtrain1,ytrain1)
print("The test score of the BernoulliNB is : ", Bcls.score(xtest1, ytest1))

The test score of the BernoulliNB is :  0.93
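
A single 80/20 split gives one number that can shift noticeably with a different `random_state`. A more robust comparison, sketched below under stated assumptions, wraps scaling and the classifier in a pipeline and scores it with stratified cross-validation. The data comes from `make_classification` as a synthetic, imbalanced stand-in, so the scores say nothing about the diabetes dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data with an imbalance loosely similar to Y/N/P.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    weights=[0.8, 0.12, 0.08], random_state=0,
)

# The scaler is refit inside each fold, so no test-fold statistics leak in.
pipeline = make_pipeline(StandardScaler(), GaussianNB())
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="balanced_accuracy")

print("Balanced accuracy per fold:", scores.round(3))
print("Mean:", scores.mean().round(3))
```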
