# Diabetes Prediction
### Predicting whether a person has diabetes using an SVM model

### Importing needed Packages

In [1]:
import numpy as np #<-- Used to make NumPy arrays for data processing and analysis.
import pandas as pd #<-- Package used for making data frames.
from sklearn.preprocessing import StandardScaler #<-- Used to standardize the data. 
from sklearn.model_selection import train_test_split #<-- Used to split the data into training and test data
from sklearn import svm #<-- Support Vector Machine - The Machine Learning Model used to make the prediction.
from sklearn.metrics import accuracy_score #<-- Check the accuracy of our model. 

### Data Collection & Analysis

In [2]:
#Loading Data Set
df = pd.read_csv(r'C:\Users\u122398\OneDrive - Straumann Group\Desktop\Python Projects\diabetes.csv') 
                    # (File path to where the document is stored on the local computer)
df.head() #<-- Shows the first five rows.
# pd.read_csv? <-- Opens the function's help if you ever have questions about it.

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
df.shape #<-- Tells how many rows and columns there are

(768, 9)

In [4]:
# Getting the statistical measures of the data
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [5]:
df['Outcome'].value_counts() #<-- Counts the values of a given column.

Outcome
0    500
1    268
Name: count, dtype: int64

### 0 --> Non-Diabetic
### 1 --> Diabetic

In [6]:
df.groupby('Outcome').mean() 
# This gives the mean of every other column for people without diabetes (0) and people with diabetes (1).

Outcome,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,3.298,109.98,68.184,19.664,68.792,30.3042,0.429734,31.19
1,4.865672,141.257463,70.824627,22.164179,100.335821,35.142537,0.5505,37.067164


In [7]:
# Separating the data and labels.
X = df.drop(columns='Outcome')
# This drops the Outcome column. columns='Outcome' is equivalent to drop('Outcome', axis=1): axis=1 means you want to drop a column, axis=0 means you want to drop a row.
Y = df['Outcome']
# Creating a separate Series for the outcomes as the dependent variable (what we're trying to predict).
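
To make the `axis` idea concrete, here is a minimal sketch on a made-up three-row frame. The name `toy` and its values are hypothetical and are not taken from diabetes.csv.

```python
import pandas as pd

# Hypothetical toy frame; not the real diabetes data.
toy = pd.DataFrame({"Glucose": [148, 85, 183], "Outcome": [1, 0, 1]})

without_column = toy.drop(columns="Outcome")  # same as toy.drop("Outcome", axis=1)
without_row = toy.drop(index=0)               # same as toy.drop(0, axis=0)

print(without_column.columns.tolist())  # ['Glucose']
print(without_row.index.tolist())       # [1, 2]
```

Note that `drop` returns a new frame and leaves `toy` itself unchanged.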

### Standardizing the Data

In [8]:
# We standardize the data so that all the variables are in a similar range, which helps the ML model perform better.
scaler = StandardScaler()

In [9]:
scaler.fit(X)

In [10]:
standardized_data = scaler.transform(X)

In [11]:
print(standardized_data)

[[ 0.63994726  0.84832379  0.14964075 ...  0.20401277  0.46849198
   1.4259954 ]
 [-0.84488505 -1.12339636 -0.16054575 ... -0.68442195 -0.36506078
  -0.19067191]
 [ 1.23388019  1.94372388 -0.26394125 ... -1.10325546  0.60439732
  -0.10558415]
 ...
 [ 0.3429808   0.00330087  0.14964075 ... -0.73518964 -0.68519336
  -0.27575966]
 [-0.84488505  0.1597866  -0.47073225 ... -0.24020459 -0.37110101
   1.17073215]
 [-0.84488505 -0.8730192   0.04624525 ... -0.20212881 -0.47378505
  -0.87137393]]
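
As a rough illustration of what the scaler is doing, the sketch below standardizes a made-up single column and compares the result with the formula (value - mean) / standard deviation. The array `toy` is a hypothetical example, not part of the diabetes data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column; not the real diabetes data.
toy = np.array([[1.0], [2.0], [3.0], [4.0]])

scaled = StandardScaler().fit_transform(toy)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of np.std.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))  # True
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has a mean of about 0 and a standard deviation of about 1, which is why the printed values above sit in a narrow range around zero.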


In [12]:
X = standardized_data
Y = df['Outcome']

### Train Test Split

In [15]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, stratify=Y, random_state =2)

### Splitting into Training and Test Data
Training is like studying for a test with practice questions: you train your mind to solve problems on the practice questions, and then you take the real test. The training data plays the role of the practice questions, and the test data is the real test, made up of examples the model has not seen before.

### test_size = 0.2
This means that 20% of the data will be used as the test data and the remaining 80% as the training data.

### stratify = Y
Imagine you have a big jar full of colorful candies and you want to share them with your friends, making sure each friend gets a fair mix of all the different colors, not just one type.

Here, "Y" is the list of outcomes (0 for non-diabetic, 1 for diabetic). Setting stratify=Y splits the data so that the training set and the test set each keep roughly the same mix of the two outcomes as the full data set (about 65% non-diabetic and 35% diabetic).

In simple terms, it's like making sure everyone gets a fair share of each candy color, so no one ends up with all the red candies while someone else gets all the green ones.
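
The effect of `test_size` and `stratify` can be checked with a small sketch. It uses synthetic labels with the same class counts as the Outcome column (500 zeros and 268 ones); the `features` array is only a placeholder and an assumption of this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the Outcome column.
labels = np.array([0] * 500 + [1] * 268)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

_, _, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# Sizes of the two parts, then the share of 1s overall, in train, and in test.
print(len(y_train), len(y_test))
print(labels.mean(), y_train.mean(), y_test.mean())
```

The three shares printed on the last line should be close to each other, which is the "fair mix" that stratifying is meant to provide.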

### random_state = 2
Setting random_state = 2 is like saying, "I want to shuffle these cards in a way that, if I shuffle them again with the same random_state number, I’ll get the same order."

So, if you set random_state = 2, it helps you get the same random results every time you run your code. It’s useful when you want to make sure your experiments or results can be repeated exactly the same way each time.
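
A small sketch of that repeatability, using a made-up array called `data` as a stand-in for the real features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # hypothetical stand-in for the real data

_, test_a = train_test_split(data, test_size=0.2, random_state=2)
_, test_b = train_test_split(data, test_size=0.2, random_state=2)
_, test_c = train_test_split(data, test_size=0.2, random_state=3)

print(np.array_equal(test_a, test_b))  # True: same seed, same split
print(test_a, test_c)                  # a different seed may give a different split
```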

In [16]:
print(X.shape, X_train.shape, X_test.shape)

(768, 8) (614, 8) (154, 8)


### Training Model

In [17]:
classifier = svm.SVC(kernel='linear')

### Training Support Vector Machine Classifier

In [18]:
classifier.fit(X_train, Y_train)

### Model Evaluation

In [22]:
# Accuracy score on the training data
X_train_prediction = classifier.predict(X_train) #<-- We use our classifier (SVM) to predict labels for the X_train data,
# calling the result 'X_train_prediction'.
training_data_accuracy = accuracy_score(Y_train, X_train_prediction) #<-- We then compare the predictions with the true labels
# and name the result 'training_data_accuracy'.

In [24]:
print('Accuracy score of the training data: ', training_data_accuracy)

Accuracy score of the training data:  0.7866449511400652


### On Test Data

In [26]:
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [28]:
print('Accuracy score of the test data: ', test_data_accuracy)

Accuracy score of the test data:  0.7727272727272727
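
To make the accuracy numbers easier to interpret, the sketch below shows what `accuracy_score` computes on six made-up labels, and works out a simple baseline from the class counts shown earlier (500 non-diabetic out of 768). The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions for six people.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# Accuracy is the fraction of predictions that match the true labels (4 of 6 here).
manual = (y_true == y_pred).mean()
print(np.isclose(manual, accuracy_score(y_true, y_pred)))  # True

# Baseline from the class counts: always predicting 0 is right for 500 of 768 rows.
baseline = 500 / 768
print(round(baseline, 3))  # 0.651
```

Seen this way, a test accuracy of about 0.77 is an improvement over always guessing "non-diabetic", which would be right about 65% of the time.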


### Making a Predictive System
#### Predicting whether a person has diabetes or not

In [34]:
input_data = (7,181,84,21,192,35.9,0.586,51)

# We need to convert the input_data to a NumPy array

input_data_as_numpy_array = np.asarray(input_data)

# Now we need to reshape the array, as we are predicting for one instance.
# Why are we reshaping? The data set has 768 examples with 8 columns each, so the scaler and the model
# expect a 2D array of shape (rows, 8), but in this case we have only one data point (i.e., one row).
# Reshaping tells the model that we only need the prediction for that one row.

input_data_reshaped = input_data_as_numpy_array.reshape(1,-1) # (1,-1) means one row and as many columns as needed (8 here),
# so we are predicting the outcome of one instance (row) rather than 768 examples

# We also need to standardize the input
std_data = scaler.transform(input_data_reshaped)
print(std_data)

prediction = classifier.predict(std_data) #<-- Now we feed std_data (the standardized data) to the classifier (SVM)
# and name the result prediction

print(prediction)

if (prediction[0] == 0):
    print('Person IS NOT Diabetic')
else:
    print('Person IS Diabetic')

[[0.93691372 1.88112959 0.77001375 0.02907707 0.97422544 0.49592704
  0.34466711 1.51108316]]
[1]
Person IS Diabetic
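
One possible way to package these steps is sketched below. It is not the notebook's method: it swaps the separate scaler and classifier for scikit-learn's `make_pipeline`, and it trains on synthetic stand-in data so the block runs on its own. The names `X_demo`, `y_demo`, `model`, and `predict_one` are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 200 rows, 8 features, label driven by the second column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = (X_demo[:, 1] > 0).astype(int)

# The pipeline standardizes and then classifies, as a single object.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_demo, y_demo)

def predict_one(model, values):
    """Return the predicted label (0 or 1) for one row of feature values."""
    row = np.asarray(values, dtype=float).reshape(1, -1)  # shape (1, n_features)
    return int(model.predict(row)[0])

label = predict_one(model, X_demo[0])
print("Person IS Diabetic" if label == 1 else "Person IS NOT Diabetic")
```

With a pipeline, the scaler is fitted only on whatever data is passed to `fit`, so calling `model.fit(X_train, Y_train)` on unscaled training data would keep the test rows out of the scaling step.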




### CONGRATS!! We just set up and created a predictive system to predict whether a person has diabetes