# **Task 11 of Buildables Data Science Fellowship**

## In this task, I have been assigned to train two models:
1. Linear Regression / Logistic Regression
2. Decision Tree Classifier

## I have to train the models on the diabetes dataset so that they predict whether a person has diabetes or not.

### **Importing Libraries**

In [76]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    r2_score, 
    mean_squared_error,
    precision_score,
    f1_score,
    recall_score,
    roc_auc_score, 
    accuracy_score, 
    confusion_matrix, 
    classification_report
)

### **Importing the CSV data into a DataFrame**

In [68]:
df = pd.read_csv('diabetes.csv')
df.head()

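|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |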


### **Checking the data for inconsistencies**

In [69]:
# Count missing (NaN) values per column, then report the dataset shape
cols = df.columns
for col in cols:
    print(col, ': ', df[col].isna().sum())
print('\n\nShape of the Dataset: ', df.shape)

Pregnancies :  0
Glucose :  0
BloodPressure :  0
SkinThickness :  0
Insulin :  0
BMI :  0
DiabetesPedigreeFunction :  0
Age :  0
Outcome :  0


Shape of the Dataset:  (768, 9)
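The `isna()` check above only finds explicit NaN values. In the first five rows shown earlier, `SkinThickness` and `Insulin` contain zeros, which are unlikely to be real measurements and may be standing in for missing values. The sketch below counts zeros per column; it rebuilds only those five rows by hand (the `sample` frame and the choice of columns are my assumptions), so the counts apply to this sample, not to the full dataset.

```python
import pandas as pd

# Hand-copied from the five rows printed by df.head(); not the full dataset.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# A zero in any of these columns is physiologically implausible,
# so count how often it occurs in each one.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

Running the same `(df[cols] == 0).sum()` on the full `df` would show how many rows are affected before deciding whether to impute or drop them.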


In [70]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [71]:
x = df[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']]
y = df['Outcome']

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=42)
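The split above is random, so the share of diabetic cases in the test set can drift from the share in the full data. One option is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. This is a sketch on made-up labels (`x_demo` and `y_demo` are hypothetical), not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 65 negatives and 35 positives.
y_demo = np.array([0] * 65 + [1] * 35)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the 35% positive share in train and test.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Positive share overall:", y_demo.mean())
print("Positive share in test:", y_te.mean())
```

For the notebook's data, the equivalent call would pass `stratify=y`; doing so would change the test set and therefore the scores reported below.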

In [72]:
reg_model = LinearRegression()
reg_model.fit(xTrain, yTrain)

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


In [73]:
y_pred = reg_model.predict(xTest)

In [74]:
r2 = r2_score(yTest, y_pred)
mse = mean_squared_error(yTest, y_pred)
print(f'R2 Score: {r2}\nMean Squared Error: {mse}')

R2 Score: 0.25500281176741757
Mean Squared Error: 0.17104527280850104
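The task also lists Logistic Regression as an option. Unlike Linear Regression, it outputs probabilities between 0 and 1 and class labels directly, so no manual conversion is needed. Below is a minimal sketch on synthetic data from `make_classification` (the data, `max_iter=1000`, and all variable names are my assumptions); it has not been run on the diabetes dataset, so no scores are claimed for it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the diabetes features (768 x 8).
x_syn, y_syn = make_classification(n_samples=768, n_features=8, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_syn, y_syn, test_size=0.2, random_state=42
)

# max_iter is raised so the solver has room to converge.
log_model = LogisticRegression(max_iter=1000)
log_model.fit(x_tr, y_tr)

proba = log_model.predict_proba(x_te)[:, 1]  # probability of class 1
labels = log_model.predict(x_te)             # hard 0/1 labels

print(proba[:5])
print(labels[:5])
```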


#### **Because I trained a Linear Regression model on a classification dataset, I am converting its continuous predictions to 0 and 1 (using a 0.5 threshold) so that I can calculate the accuracy, precision, recall, and ROC AUC scores for the model.**

In [77]:
y_pred_classes = np.where(y_pred >= 0.5, 1, 0)
y_pred_classes

array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])

#### Now I calculate the accuracy, precision, recall, and ROC AUC scores for the Linear Regression model using the class labels derived from its predictions.

In [83]:
accuracy = accuracy_score(yTest, y_pred_classes)
precision = precision_score(yTest, y_pred_classes)
recall = recall_score(yTest, y_pred_classes)
roc_auc = roc_auc_score(yTest, y_pred_classes)

print(f'Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nROC_AUC: {roc_auc}')

Accuracy: 0.7597402597402597
Precision: 0.6607142857142857
Recall: 0.6727272727272727
ROC_AUC: 0.7404040404040404
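Note that the ROC AUC above is computed from the 0/1 labels. `roc_auc_score` also accepts continuous scores, and passing them usually gives a more informative value because the ranking of predictions is not collapsed by the threshold. The toy example below (made-up `y_true` and `scores`) shows how the two can differ; it is an illustration, not a recomputation of the notebook's result.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and continuous model outputs.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.4, 0.8])

# Same thresholding rule as in the notebook.
hard = np.where(scores >= 0.5, 1, 0)

auc_scores = roc_auc_score(y_true, scores)  # uses the full ranking
auc_hard = roc_auc_score(y_true, hard)      # uses only the 0/1 labels

print("AUC from scores:", auc_scores)  # 1.0
print("AUC from labels:", auc_hard)    # 0.75
```

In this notebook, the corresponding call would be `roc_auc_score(yTest, y_pred)` with the raw Linear Regression outputs.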


# **Let's train a Decision Tree model on the same dataset.**

In [80]:
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(xTrain, yTrain)

| Parameter | Value |
|---|---|
| criterion | 'gini' |
| splitter | 'best' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | None |
| random_state | 42 |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |


In [81]:
y_pred_dtc = dtc.predict(xTest)

In [84]:
accuracy = accuracy_score(yTest, y_pred_dtc)
precision = precision_score(yTest, y_pred_dtc)
recall = recall_score(yTest, y_pred_dtc)
roc_auc = roc_auc_score(yTest, y_pred_dtc)

print(f'Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nROC_AUC: {roc_auc}')

Accuracy: 0.7467532467532467
Precision: 0.625
Recall: 0.7272727272727273
ROC_AUC: 0.7424242424242424
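The two models trade precision against recall: the Linear Regression model is more precise, while the Decision Tree has higher recall. The F1 score, the harmonic mean of precision and recall, summarizes that trade-off in one number. The sketch below derives it from the precision and recall values printed above instead of rerunning the models; `f1_from` is a helper name I introduce for illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the outputs reported above.
f1_linear = f1_from(0.6607142857142857, 0.6727272727272727)
f1_tree = f1_from(0.625, 0.7272727272727273)

print(f"F1 (Linear Regression, thresholded): {f1_linear:.4f}")  # 0.6667
print(f"F1 (Decision Tree): {f1_tree:.4f}")                     # 0.6723
```

Calling `f1_score(yTest, y_pred_classes)` and `f1_score(yTest, y_pred_dtc)` in the notebook should give the same values, since `f1_score` is already imported.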
