# Dataset

### About
The Diabetes Health Indicators Dataset contains **healthcare** statistics and **lifestyle** survey responses from individuals, along with their **diabetes diagnosis**.

The survey behind this dataset was funded by the **Centers for Disease Control and Prevention** (CDC) of the **United States of America**.

Each recorded diagnosis is one of the following $\rightarrow$ the patient is **healthy**, **pre-diabetic** or **diabetic**.


### Source
[UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators)

### Available Features
1. **`HighBP`**: Whether patient suffers from high blood pressure
    - 0: No
    - 1: Yes
2. **`HighChol`**: Whether patient suffers from high blood cholesterol
    - 0: No
    - 1: Yes
3. **`CholCheck`**: Whether patient had cholesterol checked in past 5 years
    - 0: No
    - 1: Yes
4. **`BMI`**: Body Mass Index of patient
5. **`Smoker`**: Whether patient smoked $\geq 100 \;Cigarettes$ in entire life
    - 0: No
    - 1: Yes
6. **`Stroke`**: Whether patient ever had a stroke
    - 0: No
    - 1: Yes
7. **`HeartDiseaseorAttack`**: Whether patient suffers from coronary heart disease or myocardial infarction
    - 0: No
    - 1: Yes
8. **`PhysActivity`**: Whether patient had physical activity in past 30 days
    - 0: No
    - 1: Yes
9. **`Fruits`**: Whether patient consumes fruits $\geq 1 \;time/day$
    - 0: No
    - 1: Yes
10. **`Veggies`**: Whether patient consumes vegetables $\geq 1 \;time/day$
    - 0: No
    - 1: Yes
11. **`HvyAlcoholConsump`**: Whether patient is a heavy drinker
    - 0: No
    - 1: Yes
    - _Heavy drinker adult man_ $\rightarrow > 14 \;drinks/week$
    - _Heavy drinker adult woman_ $\rightarrow > 7 \;drinks/week$
12. **`AnyHealthcare`**: Whether patient has any health insurance
    - 0: No
    - 1: Yes
13. **`NoDocbcCost`**: Was there a time in past 12 months when patient needed to see a doctor but could not because of cost?
    - 0: No
    - 1: Yes
14. **`GenHlth`**: General health condition of patient (self-rated by patient)
    - 1: Excellent
    - 2: Very good
    - 3: Good
    - 4: Fair
    - 5: Poor
15. **`MentHlth`**: Number of days in past 30 days on which patient felt their mental health was not good (0-30)
16. **`PhysHlth`**: Number of days in past 30 days on which patient felt their physical health was not good (0-30)
17. **`DiffWalk`**: Whether patient has any serious difficulty in walking/climbing stairs
    - 0: No
    - 1: Yes
18. **`Sex`**: Sex of patient
    - 0: Female
    - 1: Male
19. **`Age`**: Range of age of patient (years)
    - 1: $18-24 \;years$
    - 9: $60-64 \;years$
    - 13: $\geq 80 \;years$
20. **`Education`**: Level of education of patient
    - 1: Never attended school/only attended kindergarten
    - 2: Grade 1-8 (Elementary)
    - 3: Grade 9-11 (Some High school)
    - 4: Grade 12 (High school graduate)
    - 5: College Years 1-3
    - 6: College $\geq 4 \;years$
21. **`Income`**: Annual income scale of patient
    - 1: $<\$10,000$
    - 5: $<\$35,000$
    - 8: $\geq \$75,000$
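
The integer codes listed above can be mapped back to readable labels before plotting or tabulating. The following is a minimal sketch, assuming a DataFrame that uses the column names above. The toy rows and the names `GENHLTH_LABELS`, `SEX_LABELS` and `decode_column` are illustrative only and are not part of the dataset or the notebook.

```python
import pandas as pd

# Code-to-label mappings copied from the feature list above
GENHLTH_LABELS = {1: "Excellent", 2: "Very good", 3: "Good", 4: "Fair", 5: "Poor"}
SEX_LABELS = {0: "Female", 1: "Male"}


def decode_column(series, labels, ordered=False):
    """Replace numeric codes with labels (hypothetical helper).

    Returns a categorical whose category order follows `labels`,
    so ordinal features such as GenHlth keep their ranking.
    """
    decoded = series.astype(int).map(labels)
    return pd.Categorical(decoded, categories=list(labels.values()), ordered=ordered)


# Toy rows for illustration only (not taken from the real dataset)
toy = pd.DataFrame({"GenHlth": [5.0, 3.0, 2.0], "Sex": [0.0, 0.0, 1.0]})
toy["GenHlth_label"] = decode_column(toy["GenHlth"], GENHLTH_LABELS, ordered=True)
toy["Sex_label"] = decode_column(toy["Sex"], SEX_LABELS)
print(toy)
```

Keeping `GenHlth` as an ordered categorical preserves the ranking from Excellent to Poor, whereas `Sex` is nominal and is left unordered.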


### Target Label

1. **`Diabetes_012`**: Whether patient has diabetes
    - `0` = No
    - `1` = Prediabetes
    - `2` = Diabetes
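
Since the target has three classes, it is usually worth checking how the examples are spread across them before modelling. The sketch below shows one way to do this, assuming the target is a pandas Series named `Diabetes_012`. The toy values, the label wording in `TARGET_LABELS` and the helper `class_distribution` are illustrative assumptions, not results from the real dataset.

```python
import pandas as pd

# Labels for the three target codes described above
TARGET_LABELS = {0: "No diabetes", 1: "Prediabetes", 2: "Diabetes"}


def class_distribution(target):
    """Count and share of each target class (hypothetical helper)."""
    counts = (
        target.astype(int)
        .value_counts()
        .reindex(list(TARGET_LABELS), fill_value=0)  # keep classes with zero rows
    )
    summary = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    summary.index = summary.index.map(TARGET_LABELS)
    return summary


# Toy target for illustration only (not taken from the real dataset)
toy_target = pd.Series([0.0, 0.0, 0.0, 2.0, 1.0, 0.0], name="Diabetes_012")
summary = class_distribution(toy_target)
print(summary)
```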

---
---

# Loading the Dataset

In [1]:
# libraries for manipulating & visualizing data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# filepath of dataset
filepath = r"D:\ML\Deep Learning_Practical\Diabetes Prediction Project\data\diabetes_prediction.csv"

In [3]:
# load into dataframe

df = pd.read_csv(filepath)

In [4]:
# check if loaded properly

df.head()

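| | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |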


In [5]:
print(f"Dataset has {df.shape[0]} examples and {df.shape[1] - 1} input features.")

Dataset has 253680 examples and 21 input features.
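
The preview shows every column stored as a float, even though most features are binary flags or small integer codes. A quick sanity check after loading can confirm that the values stay within the documented codes. The sketch below assumes the value ranges given in the feature list, with day counts taken as 0 to 30 because the preview contains `0.0`. The column subsets, the toy frame and the helper `find_invalid_columns` are illustrative assumptions.

```python
import pandas as pd

# Subset of the binary flags, for brevity (assumed to be coded 0/1)
BINARY_COLUMNS = ["HighBP", "Smoker", "Sex"]

# Inclusive (low, high) bounds taken from the feature descriptions
RANGES = {
    "GenHlth": (1, 5),
    "MentHlth": (0, 30),
    "PhysHlth": (0, 30),
    "Age": (1, 13),
    "Education": (1, 6),
    "Income": (1, 8),
}


def find_invalid_columns(frame):
    """Return names of columns holding values outside their documented codes."""
    invalid = []
    for col in BINARY_COLUMNS:
        if col in frame and not frame[col].isin([0, 1]).all():
            invalid.append(col)
    for col, (low, high) in RANGES.items():
        if col in frame and not frame[col].between(low, high).all():
            invalid.append(col)
    return invalid


# Toy frame for illustration only (not the real dataset)
toy = pd.DataFrame(
    {"HighBP": [1.0, 0.0], "GenHlth": [5.0, 3.0], "MentHlth": [18.0, 0.0], "Age": [9.0, 7.0]}
)
print(find_invalid_columns(toy))

# Whole-number floats can be cast to integers once the check passes
toy_int = toy.astype("int64")
print(toy_int.dtypes)
```

On the real data the same function could be called as `find_invalid_columns(df)`; an empty list would mean the checked columns match the documented codes.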


---
>_Now that the dataset has been loaded, we will examine how the available features relate to one another to uncover potential patterns._
---

# Exploratory Data Analysis