# Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the phase of the data science cycle in which we build a deeper understanding of the data: how variables interact, how they are distributed, which observations are atypical, and what the data looks like when visualized.

In this session, we will do some EDA of the `diabetes` dataset. For now, as usual, let us import some libraries.

In [None]:
import pandas                  as pd
import numpy                   as np
import matplotlib.pyplot       as plt
import seaborn                 as sns

Remember that this data was downloaded from this webpage: https://www.kaggle.com/vikasukani/diabetes-data-set. 

In [None]:
diabetes = pd.read_csv('diabetes-dataset.csv')
diabetes.head()

## Some statistics and skewness of a distribution

One of the first things we can do is compute some summary statistics to understand the data a bit better. A good option is the pandas `describe` method:

In [None]:
diabetes.describe()

Statistics such as the mean, the standard deviation, and the quartiles are provided by this method. These quantities can already tell us something about how a variable is distributed. For instance, one measure of **skewness** is the following:

$$\frac{\mu-\nu}{\sigma},$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, and $\nu$ is the median, also known as the second quartile. If this quantity is positive, this commonly indicates that the tail of the data is on the right side of the distribution; if it is negative, the tail is on the left side. If $\mu$ and $\nu$ are equal, the measure is zero, which is what we expect from a symmetric distribution.
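The formula above can be computed for every column at once. The following is a minimal sketch on a small made-up DataFrame; the helper name `median_skewness` and the `toy` data are assumptions for illustration and are not part of the `diabetes` dataset.

```python
import pandas as pd


def median_skewness(df: pd.DataFrame) -> pd.Series:
    """Return (mean - median) / std for every numeric column."""
    numeric = df.select_dtypes(include="number")
    return (numeric.mean() - numeric.median()) / numeric.std()


# Hypothetical data: one column with a long right tail, one symmetric column
toy = pd.DataFrame({
    "right_tailed": [1, 1, 2, 2, 3, 10],
    "symmetric": [1, 2, 3, 4, 5, 6],
})

skew = median_skewness(toy)
print(skew)
```

In the notebook, calling `median_skewness(diabetes)` should give one value per variable, with a positive value suggesting a right tail.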

<img src="skewness.png" alt="Drawing" style="width: 700px;"/>

Notice that the variable `Pregnancies` has positive skewness. Let us verify this by plotting a histogram of `Pregnancies` with the pandas `hist` method.

In [None]:
diabetes['Insulin'].hist(bins=18)

In [None]:
diabetes['Pregnancies'].hist(bins=50)

What about `SkinThickness` and `BMI`?

In [None]:
diabetes[['SkinThickness', 'BMI']].hist(bins=50)

Notice that the variable `BMI` possesses a close-to-symmetrical distribution. However, this measure of skewness fails to assess how the variable `SkinThickness` is skewed. This is due to the large number of records with a value of zero. Let's correct that.

In [None]:
diabetes['SkinThickness'] > 0

In [None]:
diabetes[diabetes['SkinThickness'] > 0]

In [None]:
diabetes.loc[diabetes['SkinThickness'] > 0, 'SkinThickness'].hist(bins=50)

What does the `describe` method have to say about this?

In [None]:
diabetes[(diabetes['SkinThickness'] > 0) & (diabetes['Insulin'] > 0)]

In [None]:
diabetes.loc[(diabetes['SkinThickness'] > 0) & (diabetes['Insulin'] > 0), 'SkinThickness'].hist()

In [None]:
diabetes.loc[(diabetes['SkinThickness'] > 0) & (diabetes['Insulin'] > 0), ['SkinThickness', 'Insulin']].hist()

Do you notice the difference? It seems that those zero values are "making some noise." This could suggest that we might need a method to handle these zeros. 
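One possible way to handle these zeros, shown here only as a sketch, is to treat them as missing values so that pandas statistics skip them. This assumes that a zero in these columns means "not measured", which the dataset description would need to confirm. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical records where 0 stands for a missing measurement
toy = pd.DataFrame({
    "SkinThickness": [0, 0, 20, 30, 40],
    "Insulin": [0, 90, 0, 120, 150],
})

cleaned = toy.copy()
for col in ["SkinThickness", "Insulin"]:
    # mask() replaces the values where the condition holds with NaN
    cleaned[col] = cleaned[col].mask(cleaned[col] == 0)

# describe() ignores NaN, so count, mean, and quartiles use real measurements only
summary = cleaned.describe()
print(summary)
```

Unlike filtering rows with a boolean condition, this keeps every record and only hides the suspicious values column by column.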

Given the latter, it is important to mention that the measure of skewness we discussed is only one of many, and it is not infallible.

## Scatter plots

Scatter plots are a good way to visualize the relation between two variables. Let's assume we're interested in exploring how `BMI` and `BloodPressure` are related. Well, we will "scatter them both."

In [None]:
sns.scatterplot(x='BMI', y='BloodPressure', data=diabetes)

There is definitely a cluster. Beyond that, the plot suggests that these two variables are positively correlated: when one increases, the other tends to increase as well, and when one decreases, so does the other. I am not a doctor, but this seems plausible: the higher the `BMI`, the higher the `BloodPressure` tends to be.
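A visual impression like this one can be checked with a number. The sketch below computes the Pearson correlation coefficient between two columns with `Series.corr`; the `toy` values are invented for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical measurements in which both variables tend to grow together
toy = pd.DataFrame({
    "BMI": [20.0, 25.0, 30.0, 35.0, 40.0],
    "BloodPressure": [60, 70, 68, 80, 85],
})

# Series.corr uses the Pearson coefficient by default; it lies in [-1, 1]
r = toy["BMI"].corr(toy["BloodPressure"])
print(f"Pearson r = {r:.3f}")
```

In the notebook, the same call on `diabetes['BMI']` and `diabetes['BloodPressure']` would quantify the trend seen in the scatter plot.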

## Correlation

According to Wikipedia, "in the broadest sense, correlation is any statistical association, though it actually refers to the degree to which a pair of variables are linearly related." So, if two variables have any type of statistical association with each other, they can be correlated, either positively or negatively. In fact, let's take a look at the following table:

<img src="illuminati.jpg" alt="Drawing" style="width: 500px;"/>

What do you make of this? Can we conclude that epidemics and pandemics are caused by advances in technologies that use electromagnetic waves?

Of course not. These two variables might be correlated, but **correlation does not imply causation**. The moral of the story is that it is not wise to establish a cause-and-effect relationship based solely on correlation. Please don't do it: it can lead you to absurd conclusions.

However, establishing correlations between a variable of interest and some predictors can be useful, and so can measuring the correlation among the predictors themselves. A way to do this visually is with a **correlation matrix**. The following code does this for the variables of the `diabetes` dataset.

In [None]:
plt.figure(figsize=(7,7))
sns.heatmap(diabetes.corr(), cmap="RdYlBu", 
    annot=True, square=True,
    vmin=-1, vmax=1, fmt="+.3f")
plt.title("Correlation matrix for the diabetes dataset")
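The same correlation matrix can also be read numerically. As a sketch, the snippet below extracts the column of correlations with a target variable and sorts the predictors by absolute strength. The `toy` frame and its values are assumptions for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical data with a binary target and two predictors
toy = pd.DataFrame({
    "Glucose": [85, 90, 120, 140, 160, 180],
    "BMI": [22.0, 30.0, 25.0, 33.0, 28.0, 35.0],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

# Correlation of every predictor with the target, strongest first
target_corr = (
    toy.corr()["Outcome"]
    .drop("Outcome")                        # the target correlates perfectly with itself
    .sort_values(key=abs, ascending=False)  # rank by absolute value
)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations near the top, since they are as informative as strong positive ones.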

## Detecting Outliers

There are different techniques for detecting outliers, one of them being the **z-score**. This quantity is defined as follows:

$$z=\frac{x-\mu}{\sigma},$$

where $x$ is some observation, $\mu$ is the mean of the variable, and $\sigma$ is its standard deviation.

If a variable follows a normal distribution, or close to normal, then this score is worth using. The following image shows why:

<img src="normal.png" alt="Drawing" style="width: 500px;"/>

The criterion for detecting outliers using z-scores goes like this: if the absolute value of the z-score of a given observation $x$ is greater than 3, then $x$ is an outlier.

The `BMI` variable seems to follow a close-to-normal distribution. Does it have any outliers?

In [None]:
sigma = diabetes['BMI'].std()
mu = diabetes['BMI'].mean()

In [None]:
diabetes[diabetes['BMI'] > mu + 3 * sigma]

In [None]:
diabetes[diabetes['BMI'] < mu - 3 * sigma]

In [None]:
diabetes['z-score'] = (diabetes['BMI'] - mu) / sigma
diabetes

In [None]:
diabetes['z-score'].hist(bins=50)
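The two filters above can be combined into a single condition on the absolute z-score. The following is a minimal sketch; the helper name `zscore_outliers` and the sample values are hypothetical.

```python
import pandas as pd


def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the observations whose absolute z-score exceeds the threshold."""
    z = (series - series.mean()) / series.std()
    return series[z.abs() > threshold]


# Hypothetical values: twenty observations close to 30 and one extreme value
values = pd.Series([28, 29, 30, 31, 32] * 4 + [90])

outliers = zscore_outliers(values)
print(outliers)
```

Note that an extreme value also inflates the mean and the standard deviation used to compute the scores, so in very small samples no observation can reach a z-score of 3 at all.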

### Boxplots

**Boxplots** are another technique that helps us understand the distribution of a variable, and they are useful for detecting outliers as well. The following image shows the "anatomy" of a boxplot (in Spanish!):

<img src="boxplot.png" alt="Drawing" style="width: 500px;"/>

Everything looks pretty clear, except for the maximum and minimum non-atypical values. These are calculated as follows:

$$\text{Maximum non-atypical value}=Q_3+\frac{3}{2}IQR,$$
$$\text{Minimum non-atypical value}=Q_1-\frac{3}{2}IQR,$$

where $IQR=Q_3-Q_1$.
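These two limits are easy to compute by hand. The sketch below does so with `Series.quantile` and flags everything outside them; the helper name `iqr_fences` and the sample values are assumptions for illustration.

```python
import pandas as pd


def iqr_fences(series: pd.Series) -> tuple[float, float]:
    """Return (minimum, maximum) non-atypical values based on 1.5 * IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical values with one clearly atypical observation
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

lower, upper = iqr_fences(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

In a plotted boxplot the whiskers usually end at the most extreme data points that still fall inside these limits, so they may be shorter than the limits themselves.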

This story would not be complete without a real boxplot, right?

In [None]:
sns.boxplot(y='BMI', data=diabetes)

In [None]:
sns.boxplot(y='Pregnancies', data=diabetes)

## Barplots

Barplots are another tool that we can use to understand data better. For instance, say we want to see whether the variable `BMI` differs between people with diabetes and people who don't have this condition. We could do something like the following:

In [None]:
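# Set the palette before plotting so that it applies to this figure
sns.set_palette('Set2')
# errorbar='sd' draws the standard deviation (requires seaborn >= 0.12)
sns.barplot(data=diabetes, x='Outcome', y='BMI', errorbar='sd')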

plt.title('Difference in BMI')
plt.xlabel('Diabetes')
plt.ylabel('BMI')

plt.show()

We can also use another variant of `barplot` known as `countplot`.

In [None]:
# Set the palette before plotting so that it applies to this figure
sns.set_palette('Set2')
sns.countplot(data=diabetes, x='Outcome')

plt.title('Diabetes count')
plt.xlabel('Diabetes')

plt.show()

## Groupby

Another useful tool is the `groupby` method: it groups the data by category and then aggregates each group with some function, which can help us understand differences among the categories in our data.

In [None]:
diabetes.groupby(by='Outcome').mean()
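A group can also be summarized with several functions at once by using `agg`. The following sketch does this for a single column of a small made-up frame; the `toy` values are hypothetical and only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical records: three without diabetes (0) and two with diabetes (1)
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1],
    "BMI": [20.0, 22.0, 24.0, 30.0, 34.0],
})

# One row per category, one column per aggregation function
bmi_by_outcome = toy.groupby("Outcome")["BMI"].agg(["mean", "median", "count"])
print(bmi_by_outcome)
```

Including `count` next to the averages is a cheap way to see how many records support each group's statistics.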