
**Problem Statement**


As a data professional working for a pharmaceutical company, you need to develop a
model that predicts whether a patient will be diagnosed with diabetes. The model needs
to have an accuracy score greater than 0.85.


Prerequisites

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Exploring the data

Importing the dataset

In [None]:
df = pd.read_csv('https://bit.ly/DiabetesDS')

Determining the size of the dataset

In [None]:
df.shape  # 768 records, 9 columns (8 predictors plus the Outcome column)

(768, 9)

Exploring the first records of the dataset



In [None]:
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Exploring the last records of the dataset

In [None]:
df.tail()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


Exploring the datatypes of the features of the dataset.

In [None]:
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

Checking the column names for inconsistencies

In [None]:
df.columns #There were no inconsistencies

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

Checking for missing values in the dataset

In [None]:
df.isna().sum() #There were no missing values found in the dataset.

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Checking for duplicates in the dataset.

In [None]:
print(df.duplicated().sum()) #There were no duplicates found in the dataset.

0
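
The `isna()` check only detects values stored as NaN. The `df.head()` output shows zeros for `Insulin` and `SkinThickness`, which may be placeholders for unrecorded measurements rather than real readings. The sketch below assumes that interpretation (it is not confirmed by the dataset source) and counts zeros in a hypothetical list of suspect columns, using the five records shown earlier.

```python
import pandas as pd

# First five records, copied from the df.head() output above.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns is a placeholder, not a measurement.
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# isna() cannot flag these rows because 0 is a valid number, so count zeros directly.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

Running the same count on the full `df` would show how many records are affected before deciding whether to impute or drop them.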


# Creating the first model: Using Decision Trees

We split the dataset into a training set and a validation set.

In [None]:
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345)

Checking that the training set holds 75% of the records and the validation set the remaining 25%.

In [None]:
print(round(df_train.shape[0]/df.shape[0],2))
print(round(df_valid.shape[0]/df.shape[0],2))

0.75
0.25
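
A plain random split keeps the 75/25 row proportions but does not guarantee that both parts contain the same share of positive cases. One possible refinement, sketched here on a synthetic stand-in frame (the column names `x` and `label` are hypothetical), is to pass the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 30% of them in the positive class.
frame = pd.DataFrame({"x": range(100), "label": [1] * 30 + [0] * 70})

# stratify keeps the class ratio roughly equal in both parts of the split.
train, valid = train_test_split(
    frame, test_size=0.25, random_state=12345, stratify=frame["label"]
)

print(len(train), len(valid))
print(round(train["label"].mean(), 2), round(valid["label"].mean(), 2))
```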


Defining the classification target from the 'Glucose' feature. A patient is labelled diabetic when the glucose level is equal to or higher than 126 mg/dL. This derived label, stored as 'Outcome2', is the target I will use for the models.

In [None]:
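# Label each split from its own Glucose column (1 = diabetic, 0 = not diabetic)
df_train.loc[df_train['Glucose'] >= 126, 'Outcome2'] = 1
df_train.loc[df_train['Glucose'] < 126, 'Outcome2'] = 0

df_valid.loc[df_valid['Glucose'] >= 126, 'Outcome2'] = 1
df_valid.loc[df_valid['Glucose'] < 126, 'Outcome2'] = 0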
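
The threshold rule can also be written as a single comparison cast to integers, which avoids the two separate `.loc` assignments. This is a minimal sketch on the glucose values from the first five records; the constant name `GLUCOSE_THRESHOLD` is an illustrative choice, not part of the original notebook.

```python
import pandas as pd

# Glucose values copied from the first five records shown earlier.
frame = pd.DataFrame({"Glucose": [148, 85, 183, 89, 137]})

GLUCOSE_THRESHOLD = 126  # mg/dL, the cut-off stated in the text

# Compare once, then cast the boolean result to 0/1 integers.
frame["Outcome2"] = (frame["Glucose"] >= GLUCOSE_THRESHOLD).astype(int)
print(frame)
```

Written this way the label column has an integer dtype instead of the float dtype produced by assigning into a new column through `.loc`.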

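Declaring the features and target for the df_train dataset. Note that only 'Glucose' and 'Outcome' are dropped here, so 'Outcome2' remains among the features as well as being the target.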

In [None]:
features_train = df_train.drop(columns=["Glucose", 'Outcome'])
target_train = df_train["Outcome2"]

Checking the shape of the features_train and target_train datasets

In [None]:
print(features_train.shape)
print(target_train.shape)

(576, 8)
(576,)


Declaring the features and target for the df_valid dataset.

In [None]:
features_valid = df_valid.drop(columns=['Glucose', 'Outcome'])
target_valid = df_valid["Outcome2"]

Checking the shape of the features_valid and target_valid datasets

In [None]:
print(features_valid.shape)
print(target_valid.shape)

(192, 8)
(192,)
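
Because the `drop` calls above remove only 'Glucose' and 'Outcome', the label column travels with the features, and a model can reproduce the target simply by reading it. The sketch below illustrates this on a small frame built from the first five records and shows a hypothetical alternative that also drops 'Outcome2'; it is an illustration, not the notebook's actual pipeline.

```python
import pandas as pd

# A few columns from the first five records shown earlier.
frame = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})
frame["Outcome2"] = (frame["Glucose"] >= 126).astype(int)

# Same drop as in the notebook: the label stays among the features.
leaky_features = frame.drop(columns=["Glucose", "Outcome"])
print("Outcome2" in leaky_features.columns)

# Hypothetical alternative: drop the label too, so the model cannot read it.
clean_features = frame.drop(columns=["Glucose", "Outcome", "Outcome2"])
target = frame["Outcome2"]
print(list(clean_features.columns))
```

With the label removed from the features, the reported accuracy would reflect how well the remaining measurements predict the glucose-based label.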


Creating the Decision Tree model

In [None]:
model = DecisionTreeClassifier(random_state=12345)

Training the Decision Tree model

In [None]:
model.fit(features_train, target_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=12345, splitter='best')

Finding the predictions of the model using the validation dataset.

In [None]:
predictions_valid = model.predict(features_valid) 

Evaluating the accuracy of the Decision Tree model

In [None]:
accuracy_score(target_valid, predictions_valid) # score = 1.0

1.0

In [None]:
model.score(features_valid, target_valid)

1.0

# Creating the second model: Using Random Forest

In [None]:
df2 = pd.read_csv('https://bit.ly/DiabetesDS')

# Defining the criteria for classification

df2.loc[df2['Glucose'] >= 126, 'Outcome2'] = 1
df2.loc[df2['Glucose'] < 126, 'Outcome2'] = 0

# Creating the RandomForest model
model_forest = RandomForestClassifier(random_state=12345, n_estimators=3)

# Declaring the features and target for the dataset
features_forest = df2.drop(['Glucose', 'Outcome'], axis=1)
target_forest = df2['Outcome2']

#Training the model
model_forest.fit(features_forest, target_forest)

# Evaluating the model score on the same data used for training
model_forest.score(features_forest, target_forest) # training accuracy = 0.9987

0.9986979166666666
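
The score above is computed on the rows the forest was trained on, so it says little about unseen patients. A minimal sketch of a held-out evaluation follows; it uses synthetic data from `make_classification` as a stand-in (an assumption made to keep the example self-contained), with the same `n_estimators` and `random_state` as the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; this is not the diabetes dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=12345)

# Hold out 25% of the rows, as in the Decision Tree section.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

forest = RandomForestClassifier(n_estimators=3, random_state=12345)
forest.fit(X_train, y_train)

train_score = forest.score(X_train, y_train)  # accuracy on rows seen in training
valid_score = forest.score(X_valid, y_valid)  # accuracy on held-out rows
print(train_score, valid_score)
```

Comparing the two numbers gives a rough indication of overfitting: a large gap suggests the forest has memorised the training rows.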

# Creating the third model: Using Logistic Regression

In [None]:
df3 = pd.read_csv('https://bit.ly/DiabetesDS')

# Defining the criteria for classification

df3.loc[df3['Glucose'] >= 126, 'Outcome2'] = 1
df3.loc[df3['Glucose'] < 126, 'Outcome2'] = 0


model_log = LogisticRegression(random_state=12345, solver='liblinear')

# Declaring the features and target from df3, the frame loaded for this model
features_log = df3.drop(['Glucose', 'Outcome'], axis=1)
target_log = df3['Outcome2']

#Training the model
model_log.fit(features_log, target_log)

# Evaluating the model score on the same data used for training
model_log.score(features_log, target_log) # training accuracy = 1.0

1.0
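
To compare the three models on equal terms, each can be fitted on the same training split and scored on the same held-out split, then checked against the 0.85 accuracy target from the problem statement. The sketch below does this on synthetic stand-in data; the dictionary `candidates` and the constant `TARGET_ACCURACY` are illustrative names, and the scores it prints say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; this is not the diabetes dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# The three model types used in this notebook, with the same settings.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=12345),
    "random_forest": RandomForestClassifier(n_estimators=3, random_state=12345),
    "logistic_regression": LogisticRegression(random_state=12345, solver="liblinear"),
}

TARGET_ACCURACY = 0.85  # threshold from the problem statement

# Fit each model on the training split and score it on the held-out split.
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    results[name] = estimator.score(X_valid, y_valid)

for name, score in results.items():
    print(name, round(score, 3), score > TARGET_ACCURACY)
```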