# Model Quality and Improvements

## 1. Defining the Question

### b) Defining the Metric for Success

The model needs to have an accuracy score greater than 0.85.
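
For reference, accuracy is the fraction of predictions that match the true labels. The following is a minimal sketch of that definition and of the success check. The label lists are illustrative only, and `meets_target` is a hypothetical helper that is not part of the original notebook.

```python
# Accuracy = correct predictions / total predictions.
TARGET_ACCURACY = 0.85  # success threshold stated above


def accuracy(y_true, y_pred):
    """Return the fraction of positions where y_pred equals y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty inputs")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def meets_target(score, target=TARGET_ACCURACY):
    """Hypothetical helper: True when the score is strictly above the target."""
    return score > target


# Illustrative labels only (not taken from the dataset).
example_true = [1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
example_pred = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
example_score = accuracy(example_true, example_pred)
print(example_score, meets_target(example_score))
```

The comparison is strict because the requirement is "greater than 0.85", so a score of exactly 0.85 would not pass.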

### c) Understanding the context 

As a data professional working for a pharmaceutical company, you need to develop a
model that predicts whether a patient will be diagnosed with diabetes. The model needs
to have an accuracy score greater than 0.85.


### d) Recording the Experimental Design

Describe the steps/approach that you will use to answer the given question.



 

1.   Data Importation
2.   Data Exploration
3.   Data Cleaning
4.   Data Preparation
5.   Data Modeling (using Decision Trees, Random Forest and Logistic Regression)
6.   Model Evaluation
7.   Hyperparameter Tuning
8.   Findings and Recommendations

### e) Data Relevance

How relevant was the provided data?

Very relevant.

## 2. Reading the Data

In [5]:
# Importing our libraries 
# ---
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [6]:
# Load the data below
# ---
# Dataset URL: https://bit.ly/DiabetesDS
# ---
df = pd.read_csv('https://bit.ly/DiabetesDS')

In [7]:
# Checking the first 5 rows of data
# ---
#
df.head()

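| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |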


In [8]:
# Checking the last 5 rows of data
# ---
#
df.tail()

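| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.34 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |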


In [9]:
# Sample 10 rows of data
# ---
#
df.sample(10)

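| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 138 | 0 | 129 | 80 | 0 | 0 | 31.2 | 0.703 | 29 | 0 |
| 251 | 2 | 129 | 84 | 0 | 0 | 28.0 | 0.284 | 27 | 0 |
| 101 | 1 | 151 | 60 | 0 | 0 | 26.1 | 0.179 | 22 | 0 |
| 201 | 1 | 138 | 82 | 0 | 0 | 40.1 | 0.236 | 28 | 0 |
| 648 | 11 | 136 | 84 | 35 | 130 | 28.3 | 0.26 | 42 | 1 |
| 567 | 6 | 92 | 62 | 32 | 126 | 32.0 | 0.085 | 46 | 0 |
| 351 | 4 | 137 | 84 | 0 | 0 | 31.2 | 0.252 | 30 | 0 |
| 124 | 0 | 113 | 76 | 0 | 0 | 33.3 | 0.278 | 23 | 1 |
| 76 | 7 | 62 | 78 | 0 | 0 | 32.6 | 0.391 | 41 | 0 |
| 184 | 4 | 141 | 74 | 0 | 0 | 27.6 | 0.244 | 40 | 0 |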


In [10]:
# Checking number of rows and columns
# ---
#  
df.shape

(768, 9)

In [11]:
# Checking datatypes
# ---
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

Record your observations below:

*   Seven of the nine variables are of type int64; BMI and DiabetesPedigreeFunction are float64.
*   The data has 768 rows and 9 columns (8 features plus the target, Outcome).



## 3. External Data Source Validation

The data was collected and made available by the "National Institute of Diabetes and Digestive and Kidney Diseases" as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are of Pima Indian heritage (a subgroup of Native Americans) and are females aged 21 and above.

## 4. Data Preparation

### Performing Data Cleaning

In [12]:
# Checking missing entries of all the variables. 
# ---
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

We observe the following from our dataset:

*   No column contains null (NaN) values.



In [13]:
# Standardizing the dataset, i.e. variable renaming
# Convert all column headings to lower case, then check the first five rows to confirm the change
df.columns = df.columns.str.lower()
df.head()

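| | pregnancies | glucose | bloodpressure | skinthickness | insulin | bmi | diabetespedigreefunction | age | outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |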


We observe the following from our dataset:

*   All column names are now lower case, as the first five rows confirm.



In [14]:
# Checking how many duplicate rows are there in the data
# ---
df.duplicated().sum()

0

We observe the following from our dataset:

*   There are no duplicates in our data



In [15]:
# Checking if any of the columns are all null
# ---
df.isnull().all(axis = 0)

pregnancies                 False
glucose                     False
bloodpressure               False
skinthickness               False
insulin                     False
bmi                         False
diabetespedigreefunction    False
age                         False
outcome                     False
dtype: bool

We observe the following from our dataset:

*   None of the columns contains all null values



In [16]:
# Checking if any of the rows are all null
# ---
sum(df.isnull().all(axis = 1))

0

We observe the following from our dataset:

*   No row contains completely null values



In [17]:
#creating a copy of our dataframe 
#
# ---
#
df_clean = df.copy()
df_clean.head()

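| | pregnancies | glucose | bloodpressure | skinthickness | insulin | bmi | diabetespedigreefunction | age | outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |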


In [18]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   pregnancies               768 non-null    int64  
 1   glucose                   768 non-null    int64  
 2   bloodpressure             768 non-null    int64  
 3   skinthickness             768 non-null    int64  
 4   insulin                   768 non-null    int64  
 5   bmi                       768 non-null    float64
 6   diabetespedigreefunction  768 non-null    float64
 7   age                       768 non-null    int64  
 8   outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [19]:
df_clean.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| bloodpressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| skinthickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| bmi | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| diabetespedigreefunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| outcome | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
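
The summary reports a minimum of 0 for `glucose`, `bloodpressure`, `skinthickness`, `insulin` and `bmi`. Such values are unlikely to be real measurements and may be placeholders for missing data, which `isnull()` does not flag. A minimal sketch for counting them follows. It assumes that reading of the zeros and uses only the five rows printed by `head()` above as sample data; the name `suspect_columns` is introduced here for illustration.

```python
import pandas as pd

# The five rows printed by df.head() above, used here as a small sample.
sample = pd.DataFrame({
    "pregnancies": [6, 1, 8, 1, 0],
    "glucose": [148, 85, 183, 89, 137],
    "bloodpressure": [72, 66, 64, 66, 40],
    "skinthickness": [35, 29, 0, 23, 35],
    "insulin": [0, 0, 0, 94, 168],
    "bmi": [33.6, 26.6, 23.3, 28.1, 43.1],
    "diabetespedigreefunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "age": [50, 31, 32, 21, 33],
    "outcome": [1, 0, 1, 0, 1],
})

# Assumption: a zero in these columns marks a missing measurement.
suspect_columns = ["glucose", "bloodpressure", "skinthickness", "insulin", "bmi"]

# Count the zeros in each suspect column.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

On the full data, the same expression can be applied to `df_clean[suspect_columns]`.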


In [20]:
df_clean.columns

Index(['pregnancies', 'glucose', 'bloodpressure', 'skinthickness', 'insulin',
       'bmi', 'diabetespedigreefunction', 'age', 'outcome'],
      dtype='object')

In [23]:
#features = ['pregnancies','glucose','bloodpressure','skinthickness','insulin','bmi','diabetespedigreefunction','age']
#target = df_clean['outcome']



x = df_clean.drop(['outcome'], axis = 1)
y = df_clean.loc[:,"outcome"].values



### Logistic Regression

In [24]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# Fit on the full dataset and score on the same rows,
# so the value printed below is training accuracy.
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(x, y)
print(model.score(x, y))


# x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.45, random_state = 123)
# logreg = linear_model.LogisticRegression(max_iter=150)
# logreg.fit(x_train,y_train)
# predicted = logreg.predict(x_test)
# print("Test accuracy: {} ".format(logreg.score(x_test, y_test)))

0.7747395833333334
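
The score above is computed on the same rows the model was fitted on. A minimal sketch of a held-out evaluation follows, in the spirit of the commented-out lines in the cell. It uses a synthetic stand-in from `make_classification` so that it runs on its own; the `x_demo` and `y_demo` names and the `stratify` argument are additions, while the split values come from the commented-out code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's feature matrix
# (768 rows, 8 features); replace with x and y from the cell above.
x_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=12345
)

# Hold out part of the data. test_size and random_state follow the
# commented-out cell; stratify is an addition that keeps class proportions.
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.45, random_state=123, stratify=y_demo
)

logreg = LogisticRegression(random_state=12345, solver="liblinear")
logreg.fit(x_train, y_train)

# Compare accuracy on seen and unseen rows.
train_accuracy = logreg.score(x_train, y_train)
test_accuracy = logreg.score(x_test, y_test)
print("Training accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

The test accuracy is the figure to compare against the 0.85 target, since it estimates performance on patients the model has not seen.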




### Decision Tree

In [26]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Features and target
x = df_clean.drop(['outcome'], axis = 1)
y = df_clean.loc[:,"outcome"].values

model = DecisionTreeClassifier(random_state=12345, max_depth=5)
model.fit(x, y)

# Predictions are made on the same rows the tree was fitted on,
# so this is training accuracy.
train_predictions = model.predict(x)

print('Accuracy')
print('Training set:', accuracy_score(y, train_predictions))



Accuracy
Training set: 0.8372395833333334
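
Because a decision tree can fit its training rows closely, cross-validation may give a more reliable estimate than the training score. A minimal sketch follows, assuming 5 folds (a choice made here, not taken from the notebook) and a synthetic stand-in dataset so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (768 rows, 8 features); replace with the notebook's x and y.
x_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=12345
)

# Same tree settings as the cell above.
tree = DecisionTreeClassifier(random_state=12345, max_depth=5)

# Each fold is held out once while the tree is fitted on the other four.
cv_scores = cross_val_score(tree, x_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", cv_scores)
print("Mean accuracy:", cv_scores.mean())
```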


### Random Forest


In [27]:
from sklearn.ensemble import RandomForestClassifier

# x holds the features and y the target, as defined above
model = RandomForestClassifier(random_state=12345, n_estimators=3)
model.fit(x, y)

# Accuracy on the training data
z = model.score(x, y)
print(z)


0.9466145833333334
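
The experimental design lists hyperparameter tuning as a step. One possible way to carry it out is a grid search with cross-validation, sketched below. The parameter grid, the 3-fold setting, the 25% hold-out and the synthetic stand-in data are all assumptions made for illustration, not values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in (768 rows, 8 features); replace with the notebook's x and y.
x_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, random_state=12345
)

# Keep a test set aside so the tuned model is scored on unseen rows.
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=12345, stratify=y_demo
)

# Hypothetical grid: candidate values chosen only for illustration.
param_grid = {
    "n_estimators": [3, 25, 50],
    "max_depth": [3, 5, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(x_train, y_train)

best_params = search.best_params_
tuned_test_accuracy = search.score(x_test, y_test)
print("Best parameters:", best_params)
print("Mean cross-validated accuracy:", search.best_score_)
print("Test accuracy:", tuned_test_accuracy)
```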




## Results

1.   Logistic regression accuracy: 0.7747395833333334
2.   RandomForestClassifier accuracy: 0.9466145833333334
3.   DecisionTreeClassifier accuracy: 0.8372395833333334







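The best model to use is the random forest: it is the only one of the three that meets our criterion of an accuracy score greater than 0.85. All three scores were measured on the data the models were trained on, so accuracy on unseen patients is likely to be lower.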
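
As a reference point for the scores above, it can help to know the accuracy of a model that always predicts the majority class. The sketch below derives it from the row count and the mean of `outcome` reported by `describe()`; the variable names are introduced here for illustration.

```python
# Figures taken from the describe() output above.
n_rows = 768
positive_rate = 0.348958  # mean of the outcome column

# Recover the class counts from the mean of a 0/1 column.
n_positive = round(n_rows * positive_rate)
n_negative = n_rows - n_positive

# Always predicting the larger class is correct this often.
baseline_accuracy = max(n_positive, n_negative) / n_rows
print(n_positive, n_negative, round(baseline_accuracy, 3))
```

Any model worth keeping should beat this baseline by a clear margin.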