# EDA: Diagnosing Diabetes

In this project, we explore data on how certain diagnostic factors relate to diabetes outcomes in female patients.

The objective is to inspect, clean, and validate the data.

**Note**: This [dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) is from the National Institute of Diabetes and Digestive and Kidney Diseases. It contains the following columns:

- `Pregnancies`: Number of times pregnant
- `Glucose`: Plasma glucose concentration 2 hours into an oral glucose tolerance test
- `BloodPressure`: Diastolic blood pressure
- `SkinThickness`: Triceps skinfold thickness
- `Insulin`: 2-Hour serum insulin
- `BMI`: Body mass index
- `DiabetesPedigreeFunction`: Diabetes pedigree function
- `Age`: Age (years)
- `Outcome`: Class variable (0 or 1)



## Initial Inspection


First we get acquainted with the dataset by looking at it. Then we go through all nine columns and write down their expected data types so that we know what to check when working with the data.

Expected data type for each column:

- `Pregnancies`: `int64`
- `Glucose`: `int64`
- `BloodPressure`: `int64`
- `SkinThickness`: `int64`
- `Insulin`: `int64`
- `BMI`: `float64`
- `DiabetesPedigreeFunction`: `float64`
- `Age`: `int64`
- `Outcome`: `int64`
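
One way to make these expectations checkable is to record them in a dictionary and compare them against a DataFrame's actual dtypes. The sketch below is only an illustration: `dtype_mismatches` and `sample` are hypothetical names, and `sample` is a tiny made-up frame rather than the real data.

```python
import pandas as pd

# Expected dtypes, copied from the list above.
expected_dtypes = {
    "Pregnancies": "int64",
    "Glucose": "int64",
    "BloodPressure": "int64",
    "SkinThickness": "int64",
    "Insulin": "int64",
    "BMI": "float64",
    "DiabetesPedigreeFunction": "float64",
    "Age": "int64",
    "Outcome": "int64",
}


def dtype_mismatches(df, expected):
    """Return {column: (expected, actual)} for columns whose dtype differs."""
    return {
        col: (want, str(df[col].dtype))
        for col, want in expected.items()
        if col in df.columns and str(df[col].dtype) != want
    }


# Made-up stand-in for the real data; Outcome is deliberately stored as text.
sample = pd.DataFrame(
    {"Age": [50, 31], "BMI": [33.6, 26.6], "Outcome": ["1", "O"]}
)
mismatches = dtype_mismatches(sample, expected_dtypes)
print(mismatches)  # only Outcome should be reported
```

Columns that are absent from the frame are skipped, so the same helper could be run on a subset of the data.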


Now we load the diabetes data to start exploring.

We load the data into a variable called `diabetes_data` and print the first few rows.

**Note**: The data is stored in a file called `diabetes.csv`.

In [38]:
import pandas as pd
import numpy as np

# load in data
diabetes_data = pd.read_csv('diabetes.csv')
print(diabetes_data.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age Outcome  
0                     0.627   50       1  
1                     0.351   31       0  
2                     0.672   32       1  
3                     0.167   21       0  
4                     2.288   33       1  


Let us find the size of the dataset by printing the number of rows and columns.

In [39]:
# print number of rows
print(len(diabetes_data))
# print number of columns
print(len(diabetes_data.columns))

768
9
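
As an alternative sketch, both numbers are also available at once through the DataFrame's `.shape` attribute. The `sample` frame here is a made-up stand-in, not the real data.

```python
import pandas as pd

# Made-up stand-in for diabetes_data.
sample = pd.DataFrame({"Glucose": [148, 85, 183], "Age": [50, 31, 32]})

# .shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```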


## Further Inspection


Find out whether any of the columns in the data contain missing or null values.

In [40]:
# find whether columns contain null values
print(diabetes_data.isnull().sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


Looking at the output above, no column contains null values. While that is technically true, it doesn't necessarily mean that the data isn't missing any values: a missing measurement may have been recorded with a placeholder instead of a null.

Let's question our assumptions and dig deeper.

To investigate further, we calculate summary statistics on `diabetes_data` using the `.describe()` method.

In [41]:
# perform summary statistics
print(diabetes_data.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  768.000000   
mean      3.845052  120.894531      69.105469      20.536458   79.799479   
std       3.369578   31.972618      19.355807      15.952218  115.244002   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   30.500000   
75%       6.000000  140.250000      80.000000      32.000000  127.250000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age  
count  768.000000                768.000000  768.000000  
mean    31.992578                  0.471876   33.240885  
std      7.884160                  0.331329   11.760232  
min      0.000000                  0.078000   21.000000  
25%     27.300000        

On looking at the summary statistics, we notice something odd about the following columns:

   - `Glucose`
   - `BloodPressure`
   - `SkinThickness`
   - `Insulin`
   - `BMI`
   
If we take a look at the minimum values for these five columns, we notice that they are all `0`. 

How can Blood Pressure or BMI be `0`? That makes no sense! These values also seem to be way off from their respective medians and means, which is possibly another indicator that something is off.

One way to interpret this is that there are missing values in the data.
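
To gauge how widespread the placeholder zeros are, one could count them per column before changing anything. This is a minimal sketch on made-up values; `sample` and `suspect_cols` are hypothetical names.

```python
import pandas as pd

# Made-up values; a zero in Pregnancies is legitimate, the others are not.
sample = pd.DataFrame(
    {
        "Glucose": [148, 0, 183, 89],
        "BloodPressure": [72, 66, 0, 0],
        "BMI": [33.6, 26.6, 0.0, 28.1],
        "Pregnancies": [6, 0, 8, 1],
    }
)

# Columns where zero is not a physically plausible measurement.
suspect_cols = ["Glucose", "BloodPressure", "BMI"]

# Comparing to 0 gives a boolean frame; summing counts the True values per column.
zero_counts = (sample[suspect_cols] == 0).sum()
print(zero_counts)
```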

Let's try to spot other outliers in the data, if any.

In addition to the `0` values that show up for the columns above, there appear to be additional outliers, such as:

- The maximum value of the `Insulin` column is `846`, which is abnormally high.
- The maximum value of the `Pregnancies` column is `17`. While having 17 pregnancies is not impossible, this case might be something to look further into to determine its accuracy.
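
The outliers above were spotted by eye. A common heuristic for flagging candidates automatically, which is not used in this notebook, is the 1.5 × IQR rule. The sketch below applies it to illustrative values; `iqr_outliers` is a hypothetical helper, and the cutoff `k=1.5` is a convention rather than a clinical threshold.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Illustrative insulin readings, not taken from the real dataset.
insulin = pd.Series(
    [94, 168, 88, 543, 846, 175, 230, 83, 96, 235], name="Insulin"
)
flagged = iqr_outliers(insulin)
print(flagged)
```

A flagged value is only a candidate for review; whether it is an error or a genuine extreme reading still needs domain judgment.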


To get a more accurate view of the missing values in the data, let's replace the instances of `0` with `NaN` in the five columns mentioned:

In [42]:
# replace instances of 0 with NaN in the columns where 0 is not a plausible value
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
diabetes_data[zero_as_missing] = diabetes_data[zero_as_missing].replace(0, np.nan)

Now, let's check again for missing (null) values in all of the columns, as we did earlier.

In [43]:
# find whether columns contain null values after replacements are made
print(diabetes_data.isnull().sum())

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64


One thing to notice here is that most rows with missing data have missing values in more than one column. In fact, nearly every row with at least one missing value is also missing `Insulin`, which has far more missing entries (374) than any other column. If patients did not have their insulin measured, why might they also not have had these other measurements taken?
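
The per-column counts alone cannot show how missing values overlap within rows. A minimal sketch of how that overlap could be checked, using a made-up frame (`sample` and the other variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for diabetes_data after zeros were replaced with NaN.
sample = pd.DataFrame(
    {
        "Glucose": [148.0, np.nan, 183.0, 89.0],
        "SkinThickness": [35.0, 29.0, np.nan, 23.0],
        "Insulin": [np.nan, np.nan, np.nan, 94.0],
    }
)

missing = sample.isnull()

# Number of missing values in each row.
missing_per_row = missing.sum(axis=1)

# Among rows with at least one missing value, the share also missing Insulin.
rows_with_missing = missing.any(axis=1)
share_missing_insulin = missing.loc[rows_with_missing, "Insulin"].mean()

print(missing_per_row.tolist())
print(share_missing_insulin)
```

A share of exactly `1.0` on the real data would mean that every incomplete row lacks an insulin value; anything lower means some rows are missing other measurements only.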

Now let's take a look at the data types of each column in `diabetes_data`.

In [44]:
# print data types using the .info() method (it prints directly and returns None)
diabetes_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     763 non-null float64
BloodPressure               733 non-null float64
SkinThickness               541 non-null float64
Insulin                     394 non-null float64
BMI                         757 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null object
dtypes: float64(6), int64(2), object(1)
memory usage: 54.1+ KB


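Here we see that the `Outcome` column is of type `object` (string) instead of type `int64`. (The columns where we inserted `NaN` are now `float64`, which is expected because `NaN` is a floating-point value.) To find out why `Outcome` is stored as strings, let's print the unique values in the `Outcome` column.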

In [45]:
# print unique values of Outcome column
print(diabetes_data.Outcome.unique())


['1' '0' 'O']


One of the values is the letter `'O'` rather than the digit `'0'`. To solve this issue, let's replace instances of `'O'` with `'0'` and convert the `Outcome` column to type `int64`.
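
With only a few distinct values, `.unique()` is enough to spot the bad entry. For columns with many distinct values, one possible approach is to let `pd.to_numeric` with `errors='coerce'` mark whatever fails to parse. This is a sketch on a made-up series, with hypothetical variable names.

```python
import pandas as pd

# Made-up stand-in for the Outcome column, including the letter 'O'.
outcome = pd.Series(["1", "0", "O", "1"], name="Outcome")

# errors='coerce' turns unparseable strings into NaN instead of raising.
as_numbers = pd.to_numeric(outcome, errors="coerce")

# The original strings at the positions that became NaN are the offenders.
bad_values = outcome[as_numbers.isnull()].unique()
print(list(bad_values))
```

Because coerced values become `NaN`, any entry that is not corrected beforehand would make a later `.astype('int64')` raise an error, so checking first avoids a surprise at conversion time.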

In [46]:
# replace the letter 'O' with the digit '0'
diabetes_data['Outcome'] = diabetes_data['Outcome'].replace('O', '0')
print(diabetes_data.Outcome.unique())
# convert the column from strings to integers
diabetes_data['Outcome'] = pd.to_numeric(diabetes_data['Outcome'], errors='coerce').astype('int64')
diabetes_data.info()

['1' '0']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     763 non-null float64
BloodPressure               733 non-null float64
SkinThickness               541 non-null float64
Insulin                     394 non-null float64
BMI                         757 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(6), int64(3)
memory usage: 54.1 KB


Finally, we print the value counts of each column to see how its values are distributed.

In [57]:
diabetes_data['Pregnancies'].value_counts()

1     135
0     111
2     103
3      75
4      68
5      57
6      50
7      45
8      38
9      28
10     24
11     11
13     10
12      9
14      2
15      1
17      1
Name: Pregnancies, dtype: int64

In [55]:
diabetes_data['Glucose'].value_counts()

99.0     17
100.0    17
129.0    14
106.0    14
111.0    14
         ..
182.0     1
169.0     1
160.0     1
62.0      1
149.0     1
Name: Glucose, Length: 135, dtype: int64

In [49]:
diabetes_data['BloodPressure'].value_counts()

70.0     57
74.0     52
78.0     45
68.0     45
72.0     44
64.0     43
80.0     40
76.0     39
60.0     37
62.0     34
82.0     30
66.0     30
88.0     25
84.0     23
90.0     22
58.0     21
86.0     21
50.0     13
56.0     12
54.0     11
52.0     11
92.0      8
75.0      8
65.0      7
94.0      6
85.0      6
48.0      5
96.0      4
44.0      4
110.0     3
106.0     3
98.0      3
100.0     3
108.0     2
55.0      2
30.0      2
104.0     2
46.0      2
122.0     1
95.0      1
102.0     1
61.0      1
40.0      1
24.0      1
38.0      1
114.0     1
Name: BloodPressure, dtype: int64

In [50]:
diabetes_data['SkinThickness'].value_counts()

32.0    31
30.0    27
27.0    23
23.0    22
33.0    20
28.0    20
18.0    20
31.0    19
19.0    18
39.0    18
29.0    17
25.0    16
26.0    16
22.0    16
37.0    16
40.0    16
35.0    15
41.0    15
36.0    14
15.0    14
17.0    14
20.0    13
24.0    12
42.0    11
13.0    11
21.0    10
34.0     8
46.0     8
38.0     7
12.0     7
11.0     6
16.0     6
45.0     6
43.0     6
14.0     6
10.0     5
44.0     5
48.0     4
47.0     4
50.0     3
49.0     3
8.0      2
54.0     2
7.0      2
52.0     2
63.0     1
56.0     1
51.0     1
60.0     1
99.0     1
Name: SkinThickness, dtype: int64

In [51]:
diabetes_data['Insulin'].value_counts()

105.0    11
130.0     9
140.0     9
120.0     8
94.0      7
         ..
272.0     1
41.0      1
25.0      1
600.0     1
59.0      1
Name: Insulin, Length: 185, dtype: int64

In [52]:
diabetes_data['BMI'].value_counts()

32.0    13
31.6    12
31.2    12
33.3    10
32.4    10
        ..
32.1     1
52.9     1
31.3     1
45.7     1
41.8     1
Name: BMI, Length: 247, dtype: int64

In [53]:
diabetes_data['DiabetesPedigreeFunction'].value_counts()

0.254    6
0.258    6
0.259    5
0.238    5
0.207    5
        ..
0.886    1
0.804    1
1.251    1
0.382    1
0.375    1
Name: DiabetesPedigreeFunction, Length: 517, dtype: int64

In [54]:
diabetes_data['Age'].value_counts()

22    72
21    63
25    48
24    46
23    38
28    35
26    33
27    32
29    29
31    24
41    22
30    21
37    19
42    18
33    17
32    16
36    16
38    16
45    15
34    14
40    13
43    13
46    13
39    12
35    10
50     8
44     8
51     8
52     8
58     7
47     6
54     6
57     5
60     5
48     5
49     5
53     5
55     4
62     4
63     4
66     4
56     3
59     3
65     3
67     3
61     2
69     2
72     1
64     1
68     1
70     1
81     1
Name: Age, dtype: int64

In [56]:
diabetes_data['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64
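
The per-column cells above repeat the same call. As a sketch, they could be collapsed into a single loop; the `sample` frame below is a made-up stand-in for `diabetes_data`, and `counts` is a hypothetical name.

```python
import pandas as pd

# Made-up stand-in for diabetes_data.
sample = pd.DataFrame(
    {"Pregnancies": [1, 0, 1, 2], "Outcome": [1, 0, 0, 0]}
)

# One value_counts Series per column, keyed by column name.
counts = {col: sample[col].value_counts() for col in sample.columns}

for col, vc in counts.items():
    print(f"--- {col} ---")
    print(vc)
```

Note that `value_counts()` leaves out `NaN` by default; passing `dropna=False` would list the missing values alongside the others.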