# Final Project: Analysis of Diabetes

# Introduction


Our main objective is to analyse a dataset of diabetes-related attributes and to predict whether a person is diabetic. We will apply machine learning algorithms to try to achieve this goal.


## Libraries:

In [10]:
import requests
import pandas as pd
import json
import numpy as np

## Data Used
The data collection stage of the data life cycle focuses on gathering data from websites and databases.

We have found data from Kaggle at: 
https://www.kaggle.com/datasets/mathchi/diabetes-data-set

This data has all the attributes needed to predict diabetes, which will help us reach our final goal. All individuals in the dataset are women at least 21 years old of Pima Indian heritage. The data is organized with the following attributes, as described by the website:


Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction: Diabetes pedigree function

Age: Age (years)

Outcome: Class variable (0 or 1)
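
As a quick sanity check on the attribute list above, a loaded table can be compared against the expected column names. This is only a minimal sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration, and the one-row frame is a stand-in for the real file.

```python
import pandas as pd

# Column names as listed in the dataset description above.
EXPECTED_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


# Stand-in for the real file: a single illustrative row.
sample = pd.DataFrame(
    [[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1]], columns=EXPECTED_COLUMNS
)

print(missing_columns(sample))                        # nothing is missing
print(missing_columns(sample.drop(columns=["BMI"])))  # BMI is reported
```

Running a check like this right after loading the file would catch a renamed or dropped column before any analysis depends on it.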




In [9]:
attributes = pd.read_csv("diabetes.csv")
attributes.head(10)

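| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |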


As the output above shows, each individual is identified by a numeric row index and has a value recorded for every attribute.

Outcome, the last column, is the most important piece of data because it is the class we want to predict.
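
Because Outcome is the prediction target, it is worth checking how the two classes are balanced. The sketch below does this for the ten preview rows only; the variable names are illustrative, and on the full table the same calls would be applied to `attributes["Outcome"]`.

```python
import pandas as pd

# Outcome values of the ten rows shown in the preview above.
outcome = pd.Series([1, 0, 1, 0, 1, 0, 1, 0, 1, 1], name="Outcome")

# Number of rows in each class, ordered by class label (0, then 1).
counts = outcome.value_counts().sort_index()

# Because the classes are coded 0/1, the mean is the share of positive rows.
share_positive = outcome.mean()

print(counts)
print(f"share of positive outcomes: {share_positive:.2f}")
```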

In [11]:
attributes.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


As we can see, there are 768 individual data points. Some attributes vary widely: for example, skin thickness ranges from 23 at the 50th percentile to 99 at the maximum.

Also note that the upper bounds show nothing beyond what we would expect to see in diabetics. For example, the maximum diastolic blood pressure is 122 mm Hg, which is high but still a plausible reading rather than a data error.

The lower bounds are a problem, however: many of the minimum values are 0, which is not possible for measurements such as glucose, blood pressure, or BMI. We assume these zeros stand for missing values, so we replace them with NaN.
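
Before replacing anything, it helps to see how many zeros each measurement column contains. The sketch below counts them for the ten preview rows shown earlier. The choice of these five columns as the ones where a zero cannot be a real measurement is an assumption made here; a 0 in Pregnancies or Outcome is a legitimate value, so those columns are left out.

```python
import pandas as pd

# The ten preview rows shown earlier, restricted to the measurement columns
# in which a value of 0 is assumed to mean "missing".
preview = pd.DataFrame({
    "Glucose":       [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "BloodPressure": [72, 66, 64, 66, 40, 74, 50, 0, 70, 96],
    "SkinThickness": [35, 29, 0, 23, 35, 0, 32, 0, 45, 0],
    "Insulin":       [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI":           [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})

# Comparing with 0 gives a boolean table; summing counts the True cells
# column by column.
zero_counts = (preview == 0).sum()
print(zero_counts)
```

Even in this small sample, Insulin and SkinThickness account for most of the zeros, which matches the 0.0 quartiles those columns show in the summary table.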

In [12]:
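zero_is_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]  # 0 is a valid value for Pregnancies and Outcome
attributes[zero_is_missing] = attributes[zero_is_missing].replace(0, np.nan)

In [13]:
attributes.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 763.0 | 733.0 | 541.0 | 394.0 | 757.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 121.686763 | 72.405184 | 29.15342 | 155.548223 | 32.457464 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 30.535641 | 12.382158 | 10.476982 | 118.775855 | 6.924988 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 44.0 | 24.0 | 7.0 | 14.0 | 18.2 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 64.0 | 22.0 | 76.25 | 27.5 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 29.0 | 125.0 | 32.3 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 141.0 | 80.0 | 36.0 | 190.0 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |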


Now we have a more accurate representation of the data. 
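
The `count` row of the summary also tells us how much data is now missing in each measurement column. As a rough sketch, the arithmetic can be reproduced from the reported counts. The variable names are illustrative, and on the real table `attributes.isna().sum()` would give the same missing counts directly.

```python
import pandas as pd

TOTAL_ROWS = 768  # number of data points reported by describe()

# Non-missing counts per column, copied from the summary table above.
non_missing = pd.Series({
    "Glucose": 763,
    "BloodPressure": 733,
    "SkinThickness": 541,
    "Insulin": 394,
    "BMI": 757,
})

# Missing values per column, as a count and as a percentage of all rows.
missing = TOTAL_ROWS - non_missing
missing_pct = (missing / TOTAL_ROWS * 100).round(1)

print(pd.DataFrame({"missing": missing, "missing_pct": missing_pct}))
```

Insulin is missing for almost half of the rows and SkinThickness for roughly 30 percent, which is worth keeping in mind when deciding how to treat these columns during processing.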

# Data Processing

## Processing the Data:
