# EDA Diagnosis Diabetes

You will use your EDA skills to help inspect, clean, and validate the data.

Note: This dataset is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

- Pregnancies: Number of times pregnant
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skinfold thickness (mm)
- Insulin: 2-Hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1)

### Initial Inspection

1. Load the data into a variable called diabetes_data and print the first few rows.

In [1]:
import pandas as pd
import numpy as np

In [2]:
diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age Outcome  
0                     0.627   50       1  
1                     0.351   31       0  
2                     0.672   32       1  
3                     0.167   21       0  
4                     2.288   33       1  


In [3]:
print(diabetes_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    object 
dtypes: float64(2), int64(6), object(1)
memory usage: 54.1+ KB
None


2. How many columns (features) does the data contain? 9 columns
How many rows (observations) does the data contain? 768 rows

### Further Inspection
3. Let’s inspect diabetes_data further. Do any of the columns in the data contain null (missing) values? No, .info() reports 768 non-null values in every column.


If you answered no to the question above, not so fast! While it’s technically true that none of the columns contain null values, that doesn’t necessarily mean that the data isn’t missing any values. When exploring data, you should always question your assumptions and try to dig deeper.
To investigate further, calculate summary statistics on diabetes_data using the .describe() method.

In [4]:
diabetes_data.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 |


4. Looking at the summary statistics, do you notice anything odd about the following columns?

- Glucose
- BloodPressure
- SkinThickness
- Insulin
- BMI

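Yes, the minimum value in each of these columns is 0, which is not physiologically plausible for a living patient. The zeros are most likely placeholders for measurements that were never recorded.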
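To see how many placeholder zeros each column holds, one option is to compare the columns to 0 and sum the resulting booleans. The sketch below is not part of the original notebook; it uses only the first five rows printed by head() earlier, and the names sample and suspect_columns are made up for illustration.

```python
import pandas as pd

# First five rows of the dataset, copied from the head() output above.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a value of 0 cannot be a real measurement.
suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Comparing to 0 gives a boolean frame; summing counts the True values per column.
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

On the full data, the same expression with diabetes_data in place of sample should give the per-column zero counts before any replacement is done.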

5. Do you spot any other outliers in the data?

Yes. The maximum Insulin value (846) is far above the 75th percentile (127.25), and the maximum of 17 in Pregnancies is unusual, though not impossible.
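One common rule of thumb for flagging high outliers is Tukey's fence, Q3 + 1.5 × IQR. The notebook itself does not use this rule; the sketch below simply applies it to the quartiles reported by describe() above. The helper upper_fence is a hypothetical name, and the Insulin quartiles were computed with the placeholder zeros still in place, so its fence should be read loosely.

```python
# Quartiles and maxima copied from the describe() output above.
quartiles = {
    "Pregnancies": {"q1": 1.0, "q3": 6.0, "max": 17.0},
    "Insulin": {"q1": 0.0, "q3": 127.25, "max": 846.0},
}


def upper_fence(q1, q3, k=1.5):
    """Return q3 + k * IQR, the upper cutoff in Tukey's rule of thumb."""
    return q3 + k * (q3 - q1)


for name, q in quartiles.items():
    fence = upper_fence(q["q1"], q["q3"])
    # True means the column's maximum lies above the fence.
    print(name, fence, q["max"] > fence)
```

Both maxima fall above their fences (13.5 for Pregnancies, 318.125 for Insulin), which matches the visual impression from the summary table.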

6. Let’s see if we can get a more accurate view of the missing values in the data.

Replace the instances of 0 with NaN in the five columns mentioned:

In [7]:
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']] = diabetes_data[['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']].replace(0, np.nan)

7. Next, check for missing (null) values in all of the columns just like you did in Step 3.

Now how many missing values are there?

In [10]:
diabetes_data.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
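Raw counts are easier to judge as a share of the 768 rows. The sketch below is an addition, not part of the original notebook; it converts the counts printed above into percentages, and the variable names are illustrative.

```python
import pandas as pd

# Missing-value counts copied from the isnull().sum() output above.
missing = pd.Series({
    "Glucose": 5,
    "BloodPressure": 35,
    "SkinThickness": 227,
    "Insulin": 374,
    "BMI": 11,
})
n_rows = 768  # row count reported by info()

# Share of rows with a missing value, as a percentage rounded to one decimal.
percent_missing = (missing / n_rows * 100).round(1)
print(percent_missing.sort_values(ascending=False))
```

Nearly half of the Insulin values and almost 30% of the SkinThickness values are missing, while the other three columns lose under 5% each. On the real frame, diabetes_data.isnull().mean() * 100 should give the same percentages directly.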

8. Let’s take a closer look at these rows to get a better idea of why some data might be missing.

Print out all of the rows that contain missing (null) values.

In [15]:
diabetes_data[diabetes_data.isnull().any(axis=1)]

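| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 5 | 5 | 116.0 | 74.0 | NaN | NaN | 25.6 | 0.201 | 30 | 0 |
| 7 | 10 | 115.0 | NaN | NaN | NaN | 35.3 | 0.134 | 29 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 761 | 9 | 170.0 | 74.0 | 31.0 | NaN | 44.0 | 0.403 | 43 | 1 |
| 762 | 9 | 89.0 | 62.0 | NaN | NaN | 22.5 | 0.142 | 33 | 0 |
| 764 | 2 | 122.0 | 70.0 | 27.0 | NaN | 36.8 | 0.340 | 27 | 0 |
| 766 | 1 | 126.0 | 60.0 | NaN | NaN | 30.1 | 0.349 | 47 | 1 |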


9. Next, take a closer look at the data types of each column in diabetes_data.

In [18]:
diabetes_data.dtypes

Pregnancies                   int64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                      object
dtype: object
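Glucose, BloodPressure, SkinThickness, and Insulin were int64 in the earlier .info() output and are float64 now. NaN is a floating-point value, so pandas upcasts an integer column as soon as it has to hold one. A minimal sketch of that behavior, using three Glucose-like values chosen for illustration:

```python
import numpy as np
import pandas as pd

# A small integer column with one placeholder zero.
glucose = pd.Series([148, 85, 0], dtype="int64")
print(glucose.dtype)

# Replacing the zero with NaN forces the column to a float dtype.
replaced = glucose.replace(0, np.nan)
print(replaced.dtype)
```

This is expected and harmless here; pandas also offers a nullable "Int64" dtype for cases where integer columns with missing values must stay integer.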

10. To figure out why the Outcome column is of type object (string) instead of type int64, print out the unique values in the Outcome column.

In [20]:
diabetes_data.Outcome.unique()

array(['1', '0', 'O'], dtype=object)

How might you resolve this issue? Replace the letter 'O' with the digit 0, then convert the column to a numeric type.

In [22]:
diabetes_data.Outcome = diabetes_data.Outcome.replace('O', '0')  # keep every value a string until to_numeric runs
diabetes_data.Outcome = pd.to_numeric(diabetes_data.Outcome)
diabetes_data.Outcome.unique()

array([1, 0])
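With only three unique values, the stray 'O' is easy to spot by eye. For a column with many distinct values, one alternative is to let pd.to_numeric coerce unparseable entries to NaN and then look at which original values failed. This is a sketch of that idea rather than what the notebook does, and the series below is a made-up example containing the same three values.

```python
import pandas as pd

# Illustrative values matching the unique() output above.
outcome = pd.Series(["1", "0", "O", "1"])

# errors="coerce" turns anything that is not a number into NaN.
as_numbers = pd.to_numeric(outcome, errors="coerce")

# Keep the original strings at the positions that failed to parse.
bad_values = outcome[as_numbers.isna()]
print(bad_values.tolist())
```

Inspecting the failed values first, instead of silently coercing them, makes it possible to decide whether each one is a typo to fix or a value to treat as missing.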

### Next Steps
11. Congratulations! In this project, you saw how EDA can help with the initial data inspection and cleaning process. This is an important step as it helps to keep your datasets clean and reliable.

Here are some ways you might extend this project if you’d like:

- Use .value_counts() to more fully explore the values in each column.
- Instead of changing the 0 values in the five columns to NaN, try replacing the values with the median or mean of each column.

In [24]:
diabetes_data.value_counts()

Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI   DiabetesPedigreeFunction  Age  Outcome
0            74.0     52.0           10.0           36.0     27.8  0.269                     22   0          1
4            117.0    64.0           27.0           120.0    33.2  0.230                     24   0          1
             111.0    72.0           47.0           207.0    37.1  1.390                     56   1          1
             110.0    76.0           20.0           100.0    28.4  0.118                     27   0          1
             109.0    64.0           44.0           99.0     34.8  0.905                     26   1          1
                                                                                                            ..
1            131.0    64.0           14.0           415.0    23.7  0.389                     21   0          1
             130.0    70.0           13.0           105.0    25.9  0.472                     22   0          1
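Called on the whole DataFrame, .value_counts() counts unique rows, and by default it leaves out rows that contain NaN, so the output above covers only the complete rows. To explore the values in each column, as the suggestion describes, the method can be called one column at a time. A minimal sketch, using two columns from the first five rows shown by head():

```python
import pandas as pd

# Two columns from the first five rows of the dataset.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Outcome": [1, 0, 1, 0, 1],
})

# Count how often each value occurs, column by column.
# dropna=False keeps NaN as its own category if any are present.
for column in sample.columns:
    print(sample[column].value_counts(dropna=False))
    print()
```

For a binary column such as Outcome, this shows the class balance at a glance.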
      

In [27]:
diabetes_data = diabetes_data.fillna(diabetes_data.mean())
diabetes_data.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | 155.548223 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | 155.548223 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | 29.15342 | 155.548223 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |
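The cell above fills gaps with each column's mean. The extension list also mentions the median, which is less sensitive to extreme values such as the Insulin maximum of 846. A minimal sketch of median imputation, using the SkinThickness and Insulin values from the first five rows after the zeros were replaced with NaN:

```python
import numpy as np
import pandas as pd

# SkinThickness and Insulin for the first five rows, with zeros already set to NaN.
sample = pd.DataFrame({
    "SkinThickness": [35.0, 29.0, np.nan, 23.0, 35.0],
    "Insulin": [np.nan, np.nan, np.nan, 94.0, 168.0],
})

# median() skips NaN by default; fillna matches the medians to columns by name.
filled = sample.fillna(sample.median())
print(filled)
```

Either way, nearly half of the Insulin column ends up holding a single imputed value, which shrinks its variance, so it is worth comparing results with and without imputation.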
