<a href="https://www.kaggle.com/code/absndus/data-science-portfolio-eda-diagnosing-diabetes?scriptVersionId=132437129" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Data Science Portfolio - EDA Diagnosing Diabetes Notebook ##

### Created by: Albert Schultz ###

### Date Created: 06/04/2023 ###

### Version: 1.00 ###

### Executive Summary ###
This notebook explores the diabetes dataset and practices Exploratory Data Analysis (EDA) techniques for making sense of messy data.

## Table of Contents ##

1. [Introduction](#1.-Introduction)
2. [Purpose, Vision, and Goals](#2.-Purpose,-Vision,-and-Goals)
3. [Import Diabetes Dataset](#3.-Import-Diabetes-Dataset)
4. [Review the Diabetes Dataset](#4.-Review-the-Diabetes-Dataset)
5. [Summary](#Summary)

## 1. Introduction ##

In this notebook, I use my data science and programming skills to import the dataset, investigate it for abnormalities, and clean it so that the diabetes data tells a coherent story.

**Initialize the Notebook for data access, import library modules, and set the working directory for this project.**

In [18]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/diabetes/diabetes.csv


## 2. Purpose, Vision, and Goals ##

The vision of this notebook is to show how the diabetes dataset can be imported and cleaned so that the data makes more sense before the Exploratory Data Analysis phase.

**Vision:** To understand how to import data, investigate and clean the dataset, and explore the cleaned dataset for answers about diabetes.

**Goals:**
1. Import the Python libraries required for my analysis of the diabetes dataset.
2. Import the dataset into the Python environment for staging, investigation, extraction, manipulation, and presentation.
3. Adjust the column names, manipulate the data, and clean the dataset.
4. Perform Exploratory Data Analysis to understand aspects of diabetes.
5. Present the cleaned diabetes dataset.

### Information About the Dataset ###

**Note:** This dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

**Pregnancies:** Number of times pregnant
**Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test
**BloodPressure:** Diastolic blood pressure
**SkinThickness:** Triceps skinfold thickness
**Insulin:** 2-Hour serum insulin
**BMI:** Body mass index
**DiabetesPedigreeFunction:** Diabetes pedigree function
**Age:** Age (years)
**Outcome:** Class variable (0 or 1)

## 3. Import Diabetes Dataset ##

**Introduction:** In this section, I import the raw diabetes dataset (.csv file) into this environment for staging, investigation, data manipulation, and cleaning.

1. Import the library modules such as **numpy** and **pandas** before the cleaning begins.

In [19]:
import pandas as pd
import numpy as np

2. Import the **diabetes.csv** dataset raw file into this notebook.

In [20]:
diabetes = pd.read_csv('/kaggle/input/diabetes/diabetes.csv')

3. Print the first five rows of the dataset called **diabetes**.

In [21]:
diabetes.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## 4. Review the Diabetes Dataset ##

**Introduction:** In this section, I review the new dataframe **diabetes** for any abnormal or missing data.

1. Print out the datatypes of the dataframe **diabetes**.

In [22]:
diabetes.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                      object
dtype: object

2. Review the information of the dataset **diabetes**.

In [23]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB


3. Print out the number of columns in the dataset.

In [24]:
len(diabetes.columns)

9

4. Print out the number of observations and the number of columns.

In [25]:
diabetes.shape

(768, 9)

I can see that the dataset **diabetes** has **768 observations** and **9 columns** in total.

5. Review the dataset for missing data.

In [26]:
diabetes.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

**Summary:** I can see that **no** column in the dataset has missing data. However, to understand the dataframe further, I need to run the **describe()** method and review the summary statistics of the dataset.

6. Run the describe() command against the dataframe **diabetes**.

In [27]:
diabetes.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0


**Summary:** Based on the min row, I can see that the minimum of the **Glucose**, **BloodPressure**, **SkinThickness**, **Insulin**, and **BMI** columns is **0.00**, which is implausible because a living person cannot have a value of zero for these measurements. I assume that 0.00 stands for **missing data** in these five columns. I also noticed that the max value in the **Insulin** column is **846**, far above the 75th percentile of 127.25, which makes it a likely outlier. The max value in the **Pregnancies** column is **17**, which is also unusually high.
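One way to check this assumption before replacing anything is to count how many zeros each column holds. The sketch below rebuilds the five rows printed by `diabetes.head(5)` so that it runs on its own; `sample` and `zero_counts` are illustrative names, and the counts apply to those five rows only, not to the full dataset.

```python
import pandas as pd

# The five rows printed by diabetes.head(5), rebuilt by hand so the example
# is self-contained (this is not the full 768-row dataset).
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Comparing with 0 gives a boolean frame; summing counts the True values
# in each column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

A zero in **Pregnancies** is a valid value, so that column is left out of the replacement even though it also contains zeros.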

7. Use the code below to replace each **0** with **NaN**, using the NumPy library, in the five columns **Glucose**, **BloodPressure**, **SkinThickness**, **Insulin**, and **BMI**.

In [28]:
diabetes[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = diabetes[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0,np.nan)

8. Review the diabetes dataframe again using the **describe()** method.

In [29]:
diabetes.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,768.0,763.0,733.0,541.0,394.0,757.0,768.0,768.0
mean,3.845052,121.686763,72.405184,29.15342,155.548223,32.457464,0.471876,33.240885
std,3.369578,30.535641,12.382158,10.476982,118.775855,6.924988,0.331329,11.760232
min,0.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0
25%,1.0,99.0,64.0,22.0,76.25,27.5,0.24375,24.0
50%,3.0,117.0,72.0,29.0,125.0,32.3,0.3725,29.0
75%,6.0,141.0,80.0,36.0,190.0,36.6,0.62625,41.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0


9. Check for missing data again using the **isnull()** and **sum()** methods.

In [30]:
diabetes.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

10. Let's take a look at the rows to get a better idea of why some data might be missing. Print out all rows that contain at least one missing value.

In [31]:
diabetes[diabetes.isnull().any(axis=1)]

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148.0,72.0,35.0,,33.6,0.627,50,1
1,1,85.0,66.0,29.0,,26.6,0.351,31,0
2,8,183.0,64.0,,,23.3,0.672,32,1
5,5,116.0,74.0,,,25.6,0.201,30,0
7,10,115.0,,,,35.3,0.134,29,0
...,...,...,...,...,...,...,...,...,...
761,9,170.0,74.0,31.0,,44.0,0.403,43,1
762,9,89.0,62.0,,,22.5,0.142,33,0
764,2,122.0,70.0,27.0,,36.8,0.340,27,0
766,1,126.0,60.0,,,30.1,0.349,47,1


**Summary:** I can see that the majority of the missing data is in the **Insulin** and **SkinThickness** columns.
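To put numbers on that observation, the missing counts can be turned into percentages of the 768 observations. This is a minimal sketch that copies the counts from the `isnull().sum()` output instead of reading the CSV; `missing_counts`, `n_rows`, and `missing_pct` are illustrative names.

```python
import pandas as pd

# Missing-value counts copied from the isnull().sum() output.
missing_counts = pd.Series({
    "Glucose": 5,
    "BloodPressure": 35,
    "SkinThickness": 227,
    "Insulin": 374,
    "BMI": 11,
})
n_rows = 768  # number of observations reported by diabetes.shape

# Share of missing values per column, in percent, largest first.
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct.sort_values(ascending=False))

# On the real dataframe, diabetes.isnull().mean() * 100 would compute
# the same percentages directly.
```

Nearly half of the **Insulin** values and about 30% of the **SkinThickness** values are missing, while the other three columns are each missing less than 5%.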

11. Review the datatypes of the dataframe **diabetes**.

In [32]:
diabetes.dtypes

Pregnancies                   int64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                      object
dtype: object

**Summary:** The **Outcome** column is of type **object**, even though I expected **int64** because the column should only contain the values 0 and 1.

12. Print out unique values of the **Outcome** column.

In [33]:
diabetes.Outcome.unique()

array(['1', '0', 'O'], dtype=object)
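The unique values show that the column stores strings and includes the letter `'O'` alongside `'1'` and `'0'`, which explains the **object** dtype. A possible next step, assuming `'O'` is a typo for zero (the dataset itself does not confirm this), is to replace it and convert the column to integers. The small series below is a hypothetical stand-in for `diabetes.Outcome`.

```python
import pandas as pd

# Hypothetical stand-in for diabetes.Outcome, containing the three unique
# values found above: '1', '0', and the letter 'O'.
outcome = pd.Series(["1", "0", "O", "1"], name="Outcome")

# Assumption: the letter 'O' was meant to be the digit 0.
cleaned = outcome.replace("O", "0").astype("int64")

print(cleaned.unique())
print(cleaned.dtype)
```

On the real dataframe the same idea would be `diabetes['Outcome'] = diabetes['Outcome'].replace('O', '0').astype('int64')`, applied only after confirming what `'O'` stands for.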

## Summary ##

This portfolio notebook went over the process of importing the data, loading it into a dataframe, and performing initial EDA.