<a href="https://colab.research.google.com/github/ankramirez/Data_Science/blob/main/Pima%2BIndians%2BDiabetes%2BAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Foundations of Data Science Project - Diabetes Analysis

---------------
## Context
---------------

Diabetes is one of the most common diseases worldwide, and the number of diabetic patients is growing over the years. The main cause of diabetes remains unknown, yet scientists believe that both genetic factors and environmental and lifestyle factors play a major role in diabetes.

A few years ago, research was carried out on a tribe in America called the Pima tribe (also known as the Pima Indians). It found that the women of this tribe are prone to developing diabetes at a very early age. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients were females at least 21 years old of Pima Indian heritage.

-----------------
## Objective
-----------------

Here, we analyze different aspects of diabetes in the Pima Indian tribe through Exploratory Data Analysis.

-------------------------
## Data Dictionary
-------------------------

The dataset has the following information:

* Pregnancies: Number of times pregnant
* Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test
* BloodPressure: Diastolic blood pressure (mm Hg)
* SkinThickness: Triceps skin fold thickness (mm)
* Insulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg/(height in m)^2)
* DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history.
* Age: Age in years
* Outcome: Class variable (0: a person is not diabetic or 1: a person is diabetic)

## Q 1: Import the necessary libraries and briefly explain the use of each library (3 Marks)

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#ANS Q1:

**NumPy**: A fundamental package for working with arrays. These arrays make advanced mathematical operations and numerical analysis on large amounts of data easier.

**Pandas**: A software library for analysing and manipulating data easily using DataFrames and Series. It is the main tool used to read and write data in various formats.

**Seaborn**: A data visualization library built on Matplotlib for exploratory analysis. Its API is considerably friendlier than Matplotlib's, and it provides attractive and informative statistical graphics.

**Matplotlib**: A data visualization library that initially tried to emulate MATLAB. It uses NumPy to manage large arrays.

## Q 2: Read the given dataset (1 Mark)

#Q2 OUTCOME

In [2]:
pima = pd.read_csv("diabetes.csv")

FileNotFoundError: ignored
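
The `FileNotFoundError` reported for the cell above suggests that `diabetes.csv` was not in the working directory when the cell ran. As a minimal sketch, a small wrapper can check for the file first and fail with a clearer message. The helper name `load_pima` is hypothetical and not part of the original notebook, and the default file name simply mirrors the one used above.

```python
from pathlib import Path

import pandas as pd


def load_pima(path="diabetes.csv"):
    """Hypothetical helper: read the CSV, failing early with a clearer message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found; upload it or adjust the path before running."
        )
    return pd.read_csv(csv_path)
```

With the file in place, `pima = load_pima()` would behave like the `pd.read_csv` call in the original cell.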

## Q3. Show the last 10 records of the dataset. How many columns are there? (1 Mark)

In [None]:
pima.tail(10)

#ANS Q3:

The data has a total of 9 columns (the index column is not counted, since it is assigned when the data is loaded into a DataFrame).

## Q4. Show the first 10 records of the dataset (1 Mark)

#Q4 OUTCOME

In [None]:
pima.head(10)

## Q5. What do you understand by the dimension of the dataset? Find the dimension of the `pima` dataframe. (1 Mark)

In [None]:
pima.shape

#ANS Q5:

The dimension of the dataset is obtained through the DataFrame's `shape` attribute (it is an attribute rather than a method, so it is accessed without parentheses). The result is a tuple in which the first element is the number of rows in the dataset and the second element is the number of columns.


## Q6. What do you understand by the size of the dataset? Find the size of the `pima` dataframe. (1 Mark)

In [None]:
pima.size

#ANS Q6: 

The size of the dataset is obtained through the DataFrame's `size` attribute. The result is a single integer equal to the number of rows multiplied by the number of columns, that is, the total number of elements.
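
The relationship between dimension and size can be checked on a small example. The sketch below uses a made-up three-row frame (`toy` is a hypothetical stand-in for `pima`, and its values are invented).

```python
import pandas as pd

# Toy frame (made-up values), standing in for `pima`
toy = pd.DataFrame({"Glucose": [148, 85, 183], "BMI": [33.6, 26.6, 23.3]})

n_rows, n_cols = toy.shape   # shape is an attribute returning (rows, columns)
total_cells = toy.size       # size is the total number of elements

print(n_rows, n_cols, total_cells)  # prints: 3 2 6
```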


## Q7. What are the data types of all the variables in the data set? (2 Marks)
**Hint: Use the info() function to get all the information about the dataset.**

In [None]:
pima.info()

#ANS Q7

The variable types are:

int64 for Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, Age and Outcome.

float64 for BMI and DiabetesPedigreeFunction.

int64 is a 64-bit integer, while float64 is a 64-bit floating-point number that can store fractional values with roughly 15 to 17 significant decimal digits. Although Outcome is stored as an integer, its actual value is either yes or no, so it would be better represented by a boolean variable.
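
One way to store Outcome as a boolean is `astype(bool)`. This is only a sketch on a made-up frame (`toy` and `typed` are hypothetical names), not a step taken in the original analysis.

```python
import pandas as pd

# Toy frame (made-up values), standing in for `pima`
toy = pd.DataFrame({"Age": [50, 31, 32], "Outcome": [1, 0, 1]})

# assign() returns a new frame, so the original integer column is left untouched
typed = toy.assign(Outcome=toy["Outcome"].astype(bool))

print(typed.dtypes)
```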

## Q8. What do we mean by missing values? Are there any missing values in the `pima` dataframe? (2 Marks)

In [None]:
pima.isnull().values.any()

#ANS Q8: 

A missing value occurs when a variable has no value recorded at a specific position. In pandas it is shown as NaN (Not a Number).
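
To illustrate what the check detects, the sketch below builds a made-up frame with one deliberately missing entry and counts the NaN cells per column (`toy` is hypothetical, and the values are invented).

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one deliberately missing entry
toy = pd.DataFrame({"Glucose": [148.0, np.nan, 183.0], "BMI": [33.6, 26.6, 23.3]})

has_missing = toy.isnull().values.any()   # True if any cell is NaN
missing_per_column = toy.isnull().sum()   # NaN count for each column

print(has_missing)
print(missing_per_column)
```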

## Q9. What do the summary statistics of the data represent? Find the summary statistics for all variables except 'Outcome' in the `pima` data. Take one column/variable from the output table and explain all its statistical measures. (3 Marks)

In [None]:
pima.iloc[:,0:8].describe()

In [None]:
pima.BMI.describe()

## Q 10. Plot the distribution plot for the variable 'BloodPressure'. Write detailed observations from the plot. (2 Marks)

In [None]:
sns.displot(pima['BloodPressure'], kind='kde')
plt.show()

#ANS Q10:

- The distribution appears approximately normal because it is roughly symmetric around the mean (the mean is around 72, with the minimum around 40 and the maximum around 100).
- The narrow width of the curve implies a small standard deviation.
- There seem to be some extreme values, especially to the right.
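
Visual impressions of symmetry and spread can be backed up numerically with the standard deviation and the sample skewness. The sketch below uses made-up readings rather than the dataset, purely to show how one extreme value on the right changes both numbers.

```python
import pandas as pd

# Made-up readings, symmetric around 72 (not taken from the dataset)
symmetric = pd.Series([60, 66, 72, 78, 84])
# Same readings plus one extreme value on the right
with_outlier = pd.Series([60, 66, 72, 78, 84, 120])

print(symmetric.std(), symmetric.skew())        # skewness of a symmetric sample is 0
print(with_outlier.std(), with_outlier.skew())  # positive skewness: longer right tail
```

On the real data, the same two calls on `pima['BloodPressure']` would give a quick check of the observations above.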

## Q 11. What is the 'BMI' of the person having the highest 'Glucose'? (1 Mark)

In [None]:
pima[pima['Glucose']==pima['Glucose'].max()]['BMI']


## Q12.
### 12.1 What is the mean of the variable 'BMI'? 
### 12.2 What is the median of the variable 'BMI'? 
### 12.3 What is the mode of the variable 'BMI'?
### 12.4 Are the three measures of central tendency equal?

### (3 Marks)

In [None]:
m1 = pima['BMI'].mean()  # mean
print(m1)
m2 = pima['BMI'].median()  # median
print(m2)
m3 = pima['BMI'].mode()[0]  # mode
print(m3)

# Compare the three measures and report whether they coincide
if m1 == m2 == m3:
    print('All three measures of central tendency are equal')
else:
    print('Not all three measures of central tendency are equal')

#ANS Q12:

The median and the mode are equal, while the mean is different, although not by much.

This is consistent with the BMI distribution being approximately symmetric and close to normal.
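
Since exact equality between floating-point values is strict, "different, but not by much" can be expressed with a tolerance instead. The sketch below uses made-up BMI values, and the tolerance of 1.0 is an arbitrary choice for illustration only.

```python
import math

import pandas as pd

# Made-up BMI values (not taken from the dataset)
bmi = pd.Series([30.0, 32.0, 32.0, 33.5, 36.0])

mean, median, mode = bmi.mean(), bmi.median(), bmi.mode()[0]

# abs_tol=1.0 is an arbitrary threshold chosen for illustration only
mean_close_to_median = math.isclose(mean, median, abs_tol=1.0)
median_equals_mode = math.isclose(median, mode)

print(mean, median, mode, mean_close_to_median, median_equals_mode)
```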

## Q13. How many women's 'Glucose' levels are above the mean level of 'Glucose'? (1 Mark)

In [None]:
pima[pima['Glucose']>pima['Glucose'].mean()].shape[0]

#ANS Q13:

There are 343 women with 'Glucose' level (at the time of examination) higher than the mean (at the time of examination).
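
The count comes from a boolean mask. As a minimal sketch with made-up readings (not the dataset), summing the mask gives the same number as filtering and reading `shape[0]`.

```python
import pandas as pd

# Made-up glucose readings (not taken from the dataset)
glucose = pd.Series([90, 100, 110, 150, 200])

above_mean = glucose > glucose.mean()   # boolean mask, True where value exceeds the mean
count_above = int(above_mean.sum())     # True counts as 1, so the sum is the row count

print(glucose.mean(), count_above)  # prints: 130.0 2
```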


## Q14. How many women have their 'BloodPressure' equal to the median of 'BloodPressure' and their 'BMI' less than the median of 'BMI'? (2 Marks)

In [None]:
pima[(pima['BloodPressure']==pima['BloodPressure'].median()) & (pima['BMI']<pima['BMI'].median())].shape[0]

#ANS Q14:

There are 22 women in total who, at the time of the examination, had a 'BloodPressure' equal to the median of 'BloodPressure' and at the same time had a 'BMI' less than the median value of 'BMI'. (Median values calculated for all 768 women who went through the test.)
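
The two conditions can also be built as named masks before combining them, which keeps the filter readable. The frame below is made up (`toy` is a hypothetical stand-in for `pima`).

```python
import pandas as pd

# Toy frame (made-up values), standing in for `pima`
toy = pd.DataFrame(
    {
        "BloodPressure": [72, 72, 64, 80, 72],
        "BMI": [25.0, 35.0, 28.0, 31.0, 29.0],
    }
)

bp_at_median = toy["BloodPressure"] == toy["BloodPressure"].median()
bmi_below_median = toy["BMI"] < toy["BMI"].median()

# Named masks (or parentheses) are needed because & binds tighter than == and <
matches = toy[bp_at_median & bmi_below_median]

print(len(matches))  # prints: 1
```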


## Q15. Create a pairplot for the variables 'Glucose', 'SkinThickness', and 'DiabetesPedigreeFunction'. Write your observations from the plot. (4 Marks)

In [None]:
sns.pairplot(data=pima,vars=['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction'], hue='Outcome')
plt.show()

In [None]:
#Creating a new DF with different outcomes, Yes = 1, No = 0
pimayes = pima[pima['Outcome']==1]
pimano = pima[pima['Outcome']==0]

#Creating a correlation matrix for each DF, using the selected variables
corryes = pimayes[['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction']].corr()
corrno = pimano[['Glucose', 'SkinThickness', 'DiabetesPedigreeFunction']].corr()

#Displaying the heatmap for each correlation matrix
plt.figure(figsize=(10,8))
plt.title('Correlation heatmap for women with diabetes')
sns.heatmap(corryes, annot = True)
plt.show()
plt.figure(figsize=(10,8))
plt.title('Correlation heatmap for women without diabetes')
sns.heatmap(corrno, annot = True)
plt.show()


## Q16. Plot the scatterplot between 'Glucose' and 'Insulin'. Write your observations from the plot. (2 Marks)

In [None]:
sns.scatterplot(x='Glucose',y='Insulin',data=pima)
plt.show()
sns.scatterplot(x='Glucose',y='Insulin',data=pima, hue='Outcome')
plt.show()

#ANS Q16:

- The graph suggests that there is some correlation between the 'Glucose' value and the 'Insulin' value.
- There seem to be many women who share the same constant value of Insulin. \
<br>**To better analyse the data, a new graph using a hue according to the diabetes outcome was made**:
- The correlation between the 'Glucose' value and the 'Insulin' value seems to hold both for women with diabetes and for women without diabetes.
- The constant value of Insulin is present for both outcomes.
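
Both observations can be quantified. The sketch below uses a made-up frame (`toy` is a hypothetical stand-in for `pima`, with invented values) to compute the Pearson correlation within each outcome group and the share of rows that sit at the most common Insulin value.

```python
import pandas as pd

# Toy frame (made-up values), standing in for `pima`
toy = pd.DataFrame(
    {
        "Glucose": [85, 95, 110, 120, 140, 160, 180, 190],
        "Insulin": [40, 0, 70, 90, 0, 150, 200, 230],
        "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
    }
)

# Pearson correlation between Glucose and Insulin within each outcome group
r_by_outcome = {
    outcome: group["Glucose"].corr(group["Insulin"])
    for outcome, group in toy.groupby("Outcome")
}
print(r_by_outcome)

# Share of rows sitting at the most common Insulin value (the horizontal band)
most_common = toy["Insulin"].mode()[0]
share = (toy["Insulin"] == most_common).mean()
print(most_common, share)
```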


## Q 17. Plot the boxplot for the 'Age' variable. Are there outliers? (2 Marks)

In [None]:
plt.boxplot(pima['Age'])

plt.title('Boxplot of Age')
plt.ylabel('Age')
plt.show()

#ANS Q17:

There are outliers, mainly at older ages, around 67 years and older. They appear as hollow points beyond the upper whisker. The whiskers, drawn as straight lines, extend at most 1.5 times the interquartile range beyond the box.
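
The points a boxplot draws as outliers can also be listed directly. The sketch below applies the 1.5 × IQR rule, which matplotlib's `boxplot` uses by default, to made-up ages rather than the dataset.

```python
import pandas as pd

# Made-up ages (not taken from the dataset)
age = pd.Series([21, 22, 24, 25, 29, 33, 41, 45, 70, 81])

q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1

# matplotlib's default whiskers reach at most 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = age[(age < lower_fence) | (age > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())
```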


## Q18. Plot histograms for the 'Age' variable to understand the number of women in different age groups given whether they have diabetes or not. Explain both histograms and compare them. (3 Marks)

In [None]:
plt.hist(pima[pima['Outcome']==1]['Age'], bins = 5)
plt.title('Distribution of Age for Women who have Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.hist(pima[pima['Outcome']==0]['Age'], bins = 5)
plt.title('Distribution of Age for Women who do not have Diabetes')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
sns.histplot(data=pima, x='Age', hue='Outcome', bins = 5, multiple="stack")
plt.title('Distribution of Age by Diabetes Outcome')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
print(pima[pima['Outcome']==0]['Age'].median())
print(pima[pima['Outcome']==1]['Age'].median())
print(pima[pima['Outcome']==0]['Age'].mean())
print(pima[pima['Outcome']==1]['Age'].mean())

#ANS Q18:

- The age distribution for women without diabetes is clearly right-skewed: most observations fall in the first bin, with a long tail toward older ages. The mean is most likely between 21 and 32 years, and the other measures of central tendency, the median and the mode, most likely also fall within the interval of the first bin.
- The age distribution for women with diabetes is right-skewed as well. Its central values are probably around the first bin, and they should be closer to each other than those of the women without diabetes, since the distribution is more even. It is less clear where the mode lies, but it could be around the first bin.
<br>
- When comparing both histograms, there are many more women without diabetes than with diabetes in the age range covered by the first bin, and therefore that distribution is clearly more skewed than that of the women who have diabetes.
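
To read bin boundaries and counts off exactly rather than by eye, `np.histogram` can be called with the same `bins=5` setting. The ages below are made up, so the edges shown are only illustrative.

```python
import numpy as np

# Made-up ages (not taken from the dataset)
ages = np.array([21, 22, 23, 25, 28, 30, 35, 41, 50, 62, 81])

# Same binning rule as plt.hist(..., bins=5): five equal-width bins from min to max
counts, edges = np.histogram(ages, bins=5)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {n}")
```

Because each call derives its edges from the data passed to it, two histograms drawn separately need not share the same bins.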

## Q 19. What is the Interquartile Range of all the variables? Why is this used? Which plot visualizes the same? (2 Marks)

In [None]:
Q1 = pima.quantile(0.25)
Q3 = pima.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

#ANS Q19

The interquartile range (IQR) is the width of the interval between the first and third quartiles, which contains the middle 50% of the data. It can also be used to define a range outside of which values are treated as outliers. A boxplot visualizes it: the box spans from Q1 to Q3.\
Comparing the distance between Q1 and Q2 (the median) with the distance between Q2 and Q3 gives an indication of the skewness of the distribution.
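
The comparison of the two half-ranges can be sketched as follows. The ages are made up (not the dataset), and the variable names are hypothetical.

```python
import pandas as pd

# Made-up ages (not taken from the dataset)
age = pd.Series([21, 22, 24, 25, 29, 33, 41, 45, 70, 81])

q1, q2, q3 = age.quantile([0.25, 0.5, 0.75])

lower_half_spread = q2 - q1   # distance from Q1 to the median
upper_half_spread = q3 - q2   # distance from the median to Q3

# A wider upper half points to a longer right tail (positive skew)
print(lower_half_spread, upper_half_spread)  # prints: 6.75 13.0
```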

## Q 20. Find and visualize the correlation matrix. Write your observations from the plot. (3 Marks)

In [None]:
corr_matrix = pima.iloc[:,0:8].corr()

corr_matrix

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(corr_matrix, annot = True)

# display the plot
plt.show()

#ANS Q20

- Overall, there doesn't seem to be a high correlation between any two variables.
- The highest correlations seem to be between BMI and skin thickness, and between age and pregnancies.
- A positive correlation between age and pregnancies seems reasonable, since the older a woman is, the more children she could have borne.
- Pregnancies and Insulin have a negative correlation, but since its value is so small the relationship can be disregarded, as is also the case for Pregnancies and DiabetesPedigreeFunction.
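
Instead of reading the strongest pairs off the heatmap, they can be ranked programmatically. The sketch below is one possible approach on a made-up frame (`toy` is a hypothetical stand-in for the numeric columns of `pima`).

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values), standing in for the numeric columns of `pima`
toy = pd.DataFrame(
    {
        "Age": [21, 25, 30, 40, 50],
        "Pregnancies": [0, 1, 2, 4, 6],
        "Insulin": [120, 80, 100, 90, 110],
    }
)

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# stack() reshapes to one entry per cell; dropna() removes the masked cells
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well.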
