# Graphs with Python. 2
## Distribution

We have several graphs to show how our data are distributed, i.e., which values are more or less frequent.

![imaxe-graficas-distribucion](img/distribucions.png)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Dataset: height, weight, and gender
# https://www.kaggle.com/mustafaali96/weight-height/version/1

In [None]:
df = pd.read_csv('../datasets/weight-height.csv')
df.head()

### Displot function
https://seaborn.pydata.org/generated/seaborn.displot.html

`displot` supports three kinds of plot, selected with the `kind` parameter:
- `hist` (histplot), the default
- `kde` (kdeplot)
- `ecdf` (ecdfplot)

In [None]:
# The default representation is a histogram (kind='hist').
sns.displot(data=df.Height)

In [None]:
# Another way to call the function: pass the DataFrame and name the column
sns.displot(data=df, x='Height')
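
To make explicit what a histogram counts, here is a minimal sketch using NumPy. The heights are synthetic and the choice of 10 bins is an assumption for illustration; neither comes from the Kaggle dataset nor from seaborn's default binning.

```python
import numpy as np

# Synthetic heights in inches (assumed values, not the Kaggle dataset).
rng = np.random.default_rng(0)
heights = rng.normal(loc=66.0, scale=4.0, size=1000)

# Split the observed range into 10 equal-width bins and count values per bin.
counts, edges = np.histogram(heights, bins=10)

# Each bar of a histogram corresponds to one (left edge, right edge, count) triple.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```

Taller bars correspond to bins that hold more observations, which is how the plot shows the most frequent values.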

In [None]:
# KDE (kernel density estimate) representation
sns.displot(data=df.Height, kind='kde')
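
The idea behind a kernel density estimate can be sketched by hand: place a small Gaussian bump on every observation and average them. The function name `gaussian_kde_1d`, the synthetic data, and the fixed bandwidth of 1.5 are assumptions for illustration; seaborn chooses its own bandwidth.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distances between grid points and samples,
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Synthetic heights in inches (assumed values, not the Kaggle dataset).
rng = np.random.default_rng(0)
heights = rng.normal(66.0, 4.0, size=500)

grid = np.linspace(heights.min() - 10, heights.max() + 10, 400)
density = gaussian_kde_1d(heights, grid, bandwidth=1.5)

# A density curve should enclose an area of approximately 1.
area = np.sum(density) * (grid[1] - grid[0])
print(f"area under the curve: {area:.3f}")
```

A larger bandwidth gives a smoother curve; a smaller one follows the individual observations more closely.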

In [None]:
# A third kind of distribution plot is the ECDF (empirical cumulative distribution function),
# which shows, for each value, the proportion of observations at or below it.
# The curve is steeper where more data are concentrated.
sns.displot(data=df.Height, kind='ecdf')
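
The ECDF is simple enough to compute by hand, which helps when reading the plot. A minimal sketch, assuming a hypothetical helper `ecdf` and five made-up heights:

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the cumulative proportion at each one."""
    x = np.sort(np.asarray(values, dtype=float))
    # The i-th smallest value has i / n observations at or below it.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Five made-up heights in inches (illustrative only).
x, y = ecdf([70, 64, 68, 64, 72])

for value, proportion in zip(x, y):
    print(f"{value:5.1f} -> {proportion:.1f}")
```

Each observation raises the curve by 1/n, so a region with many observations close together produces a steep rise.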

In [None]:
# Normal distribution
# The distribution plots above follow a pattern similar to the normal distribution.
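
One rough numerical check for resemblance to a normal distribution is the 68-95 rule: about 68% of values lie within one standard deviation of the mean and about 95% within two. The sketch below applies it to synthetic data (an assumption; replace `sample` with a real column such as `df.Height` to test the dataset).

```python
import numpy as np

# Synthetic sample drawn from a normal distribution (assumed values).
rng = np.random.default_rng(0)
sample = rng.normal(loc=66.0, scale=4.0, size=10_000)

mean = sample.mean()
std = sample.std(ddof=1)

# Distance of every value from the mean, measured in standard deviations.
z = np.abs(sample - mean) / std

within_1 = np.mean(z <= 1.0)
within_2 = np.mean(z <= 2.0)

print(f"within 1 std: {within_1:.1%}")
print(f"within 2 std: {within_2:.1%}")
```

Proportions far from 68% and 95% suggest the data do not follow a normal distribution; proportions close to them are consistent with it but do not prove it.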

In [None]:
# After plotting the heights, let's look at the weights.
sns.displot(data=df, x='Weight')


In [None]:
# Convert inches to centimeters (1 cm ≈ 0.39370 in)
df['altura'] = df.Height / 0.39370

# Convert pounds to kilograms (1 kg ≈ 2.2046 lb)
df['peso'] = df.Weight / 2.2046
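
The divisors above are rounded reciprocals of the exact definitions (1 inch = 2.54 cm, 1 pound = 0.45359237 kg). A small sketch comparing both approaches; the helper names and the sample values 70 in and 180 lb are assumptions for illustration:

```python
# Exact conversion factors by definition.
INCH_TO_CM = 2.54
LB_TO_KG = 0.45359237

def inches_to_cm(inches):
    """Convert a length in inches to centimeters."""
    return inches * INCH_TO_CM

def pounds_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds * LB_TO_KG

# Made-up sample values (illustrative only).
height_in, weight_lb = 70.0, 180.0

print(inches_to_cm(height_in), height_in / 0.39370)
print(pounds_to_kg(weight_lb), weight_lb / 2.2046)
```

Both approaches agree to well under a millimeter and a gram for these values, so the choice does not affect the plots.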

In [None]:
df_pesoaltura = df[['Gender','peso','altura']]
df_pesoaltura.head()

In [None]:
# The dataframe now holds heights in cm and weights in kg.

sns.displot(data=df_pesoaltura.altura)
plt.title('Height distribution')
plt.xlabel('Height (cm)')

In [None]:
sns.displot(data=df_pesoaltura.peso)
plt.title('Weight distribution')
# We detect two peaks!
# The distribution is bimodal, so it does not follow the normal distribution.
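
Two peaks typically appear when the data mix two groups whose means are far apart relative to their spread. The sketch below counts the peaks of an equal-weight mixture of two normal densities. All means and standard deviations are assumed values chosen for illustration, not statistics computed from the dataset, and `count_peaks` is a hypothetical helper.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def count_peaks(mean_a, mean_b, sd, points=2001):
    """Count local maxima of a 50/50 mixture of two normals with a common sd."""
    lo, hi = min(mean_a, mean_b), max(mean_a, mean_b)
    grid = np.linspace(lo - 5 * sd, hi + 5 * sd, points)
    density = 0.5 * normal_pdf(grid, mean_a, sd) + 0.5 * normal_pdf(grid, mean_b, sd)
    # A grid point is a peak if it is higher than both of its neighbours.
    inner = density[1:-1]
    is_peak = (inner > density[:-2]) & (inner > density[2:])
    return int(is_peak.sum())

# Assumed group means and spreads (kg and cm), for illustration only.
weight_peaks = count_peaks(60.0, 85.0, sd=8.0)    # means about 3 sd apart
height_peaks = count_peaks(162.0, 175.0, sd=7.0)  # means under 2 sd apart

print("weight peaks:", weight_peaks)
print("height peaks:", height_peaks)
```

Under these assumptions, well-separated groups give two peaks while closer groups merge into a single one, which is one possible reason a mixed sample can look roughly normal for one variable and bimodal for another.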

In [None]:
# We can split the weight distribution by a categorical variable using hue.
sns.displot(data=df_pesoaltura, x='peso', hue='Gender')
plt.title('Weight distribution')

In [None]:
sns.displot(data=df_pesoaltura, x='peso', hue='Gender', kind='kde')
plt.title('Weight distribution')

In [None]:
# displot can also show the joint distribution of two variables.
# Color intensity indicates where observations are more or less concentrated.
sns.displot(data=df_pesoaltura, x='altura', y='peso')
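
A two-variable histogram counts observations in a grid of rectangular cells, and the color of each cell encodes its count. A minimal sketch with NumPy; the synthetic heights and weights, their relationship, and the 8 by 8 grid are assumptions for illustration:

```python
import numpy as np

# Synthetic heights (cm) and weights (kg) with a positive relationship
# (assumed values, not the Kaggle dataset).
rng = np.random.default_rng(0)
altura = rng.normal(168.0, 10.0, size=2000)
peso = 0.9 * (altura - 168.0) + rng.normal(73.0, 8.0, size=2000)

# Count observations in an 8 x 8 grid of (height, weight) cells.
counts, x_edges, y_edges = np.histogram2d(altura, peso, bins=(8, 8))

# Locate the cell with the most observations (the darkest cell in the plot).
i, j = np.unravel_index(np.argmax(counts), counts.shape)
print(f"densest cell: height {x_edges[i]:.1f} to {x_edges[i + 1]:.1f} cm, "
      f"weight {y_edges[j]:.1f} to {y_edges[j + 1]:.1f} kg, "
      f"{int(counts[i, j])} observations")
```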

In [None]:
sns.displot(data=df_pesoaltura, x='altura', y='peso', kind='kde')

In [None]:
# Again we can see the two "peaks".
sns.displot(data=df_pesoaltura, x='altura', y='peso', kind='kde', hue='Gender')