# Prerequisites

In [2]:
# Import numpy and pandas libraries to begin with
import pandas as pd
import numpy as np


# Data Importation

In [3]:
# Load the diabetes dataset and preview first few records
diabetes_df = pd.read_csv("https://bit.ly/DiabetesDS")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


# Data Exploration 

In [4]:
# Check dataframe dimensions (rows, columns)
diabetes_df.shape

(768, 9)

In [6]:
# Check the column datatypes
diabetes_df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [7]:
# Check whether any column contains null values
diabetes_df.isnull().any()

Pregnancies                 False
Glucose                     False
BloodPressure               False
SkinThickness               False
Insulin                     False
BMI                         False
DiabetesPedigreeFunction    False
Age                         False
Outcome                     False
dtype: bool

In [11]:
# Select and preview unique Pregnancies
diabetes_df.Pregnancies.unique().tolist()

[6, 1, 8, 0, 5, 3, 10, 2, 4, 7, 9, 11, 13, 15, 17, 12, 14]

In [13]:
# Select and preview unique Outcomes
diabetes_df.Outcome.unique().tolist()

[1, 0]

In [12]:
# Check for duplicate rows based on all columns
diabetes_df[diabetes_df.duplicated()]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome


# Data Exploration Observations
- The dataset has 9 columns and 768 rows
- All columns are of integer or float datatype
- There are no duplicate rows
- There are no null values in any of the columns
- The first 8 columns will form the features for our analysis, while the Outcome column will be our target
- So far the dataset looks OK.
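
The feature/target split described in the observations could be sketched as follows. This is a minimal illustration, not a step from the original notebook: `sample_df` is a hypothetical stand-in built from the five preview rows shown earlier, and the names `X` and `y` are assumptions. In the notebook itself, `diabetes_df` would take the place of `sample_df`.

```python
import pandas as pd

# Stand-in frame built from the five preview rows shown earlier;
# in the notebook, diabetes_df would be used directly.
sample_df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Features: every column except the target
X = sample_df.drop(columns="Outcome")

# Target: the Outcome column
y = sample_df["Outcome"]

print(X.shape, y.shape)
```

Dropping the target by name rather than slicing by position keeps the split correct even if the column order changes.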



# Data Cleanup

We will undertake two cleanup exercises. The motivation is as follows:
- When modeling, it is important to clean the data sample so that the observations best represent the problem.
- Sometimes a dataset contains extreme values that fall outside the expected range and are unlike the rest of the data, i.e. outliers.
- Outliers can cause a model such as linear regression to learn a biased or skewed understanding of the problem, so removing them from the training set can allow a more effective model to be learned.
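
To illustrate the last point, here is a small sketch of how a single extreme value can pull a straight-line fit away from the underlying trend. The data is hypothetical (ten points generated from `y = 2x + 1`) and is not taken from the diabetes dataset; `np.polyfit` with degree 1 is used here as a simple stand-in for a linear regression model.

```python
import numpy as np

# Hypothetical data: ten points lying exactly on y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Fit a straight line (degree-1 polynomial) to the clean data
slope_clean, intercept_clean = np.polyfit(x, y, 1)

# Corrupt a single observation with an extreme value
y_outlier = y.copy()
y_outlier[-1] = 100.0  # the value on the true line would be 19.0

# Refit on the data containing the outlier
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f"clean fit:        slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with one outlier: slope={slope_outlier:.2f}, intercept={intercept_outlier:.2f}")
```

The clean fit recovers the slope of 2, while the fit with one corrupted point reports a much steeper slope, even though nine of the ten observations are unchanged.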

Our cleanup exercises will be:
- Round the Diabetes Pedigree Function to 2 decimal places
- Remove outliers from the dataset

In [17]:
# Round the diabetes pedigree function to 2 decimal places
diabetes_df['DiabetesPedigreeFunction'] = diabetes_df['DiabetesPedigreeFunction'].round(decimals=2)
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.63,50,1
1,1,85,66,29,0,26.6,0.35,31,0
2,8,183,64,0,0,23.3,0.67,32,1
3,1,89,66,23,94,28.1,0.17,21,0
4,0,137,40,35,168,43.1,2.29,33,1


In [18]:
# Removing outliers in the dataframe
# First compute the first and third quartiles with the quantile() function,
# then the interquartile range (IQR)
Q1 = diabetes_df.quantile(0.25)
Q3 = diabetes_df.quantile(0.75)
IQR = Q3 - Q1
IQR

# Then select the outlier rows: those with at least one value below
# Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
diabetes_df_iqr = diabetes_df[((diabetes_df < (Q1 - 1.5 * IQR)) | (diabetes_df > (Q3 + 1.5 * IQR))).any(axis=1)]

# One way of dealing with outliers is removing them
# Check how many rows contain outliers before cleaning
diabetes_df_iqr.shape

(128, 9)

We will omit the 128 rows flagged as outliers so that the remaining dataset helps us create a more effective model.
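
The removal step itself could look like the sketch below. It reuses the same IQR rule but keeps the rows that are not flagged. The frame `toy_df` and the names `is_outlier` and `toy_df_clean` are hypothetical and chosen for illustration. Applied to `diabetes_df`, the same pattern would leave 768 - 128 = 640 rows.

```python
import pandas as pd

# Hypothetical toy frame; the value 300 in "Glucose" is a deliberate outlier
toy_df = pd.DataFrame({
    "Glucose": [85, 89, 90, 92, 95, 300],
    "Age": [21, 25, 31, 32, 33, 35],
})

# Per-column quartiles and interquartile range
Q1 = toy_df.quantile(0.25)
Q3 = toy_df.quantile(0.75)
IQR = Q3 - Q1

# True for rows where at least one column falls outside the 1.5 * IQR fences
is_outlier = ((toy_df < (Q1 - 1.5 * IQR)) | (toy_df > (Q3 + 1.5 * IQR))).any(axis=1)

# Keep only the rows that are not flagged as outliers
toy_df_clean = toy_df[~is_outlier]

print(is_outlier.sum(), toy_df_clean.shape)
```

Negating the boolean mask with `~` avoids recomputing the condition and makes it easy to inspect the flagged rows before dropping them.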