# Measures of Shape
---
- Author: Diego Inácio
- GitHub: [github.com/diegoinacio](https://github.com/diegoinacio)
- Notebook: [measures_shape.ipynb](https://github.com/diegoinacio/data-science-notebooks/blob/master/Probability-and-Statistics/measures_shape.ipynb)
---
Overview of measures of shape in distributions.

Regarding its shape, a data distribution can be classified into two categories: **symmetrical** or **asymmetrical**. In the *symmetrical* case, the two sides of the distribution are mirror images of each other, as in the `normal distribution`. In the *asymmetrical* case, the two sides around the central location (such as the mean or median) differ, as in a `skewed distribution`.

![measures of shape](sourceimages/measures_shape_01.jpg)

In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
plt.rcParams['figure.figsize'] = (16, 8)

In [None]:
import warnings
warnings.filterwarnings('ignore')

## Skewness
---
Considering a probability distribution of a real-valued random variable, [skewness](https://en.wikipedia.org/wiki/Skewness) is a measure of the **asymmetry** of the distribution about its mean. Two simple estimates are Pearson's skewness coefficients:

$$ \large
S_1 = \frac{\overline{x} - M_o}{\sigma} \quad ; \quad S_2 = \frac{3 \cdot (\overline{x} - M_d)}{\sigma}
$$

where:
- $S_1$ is the Pearson *mode* skewness (or first skewness coefficient) and $S_2$ is the Pearson *median* skewness (or second skewness coefficient);
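- $M_o$ and $M_d$ are the *mode* (best suited to categorical and discrete variables) and the *median* (numeric variables), respectively;
- $\overline{x}$ is the *mean* and $\sigma$ is the *standard deviation*.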

For a [unimodal](https://en.wikipedia.org/wiki/Unimodality) distribution, the skewness value can be:

- **positive**: the longer tail is on the *right* side of the distribution and the *mean* is greater than the *mode* and the *median* ($\overline{x} > M_o, M_d$);
- **negative**: the longer tail is on the *left* side of the distribution and the *mean* is less than the *mode* and the *median* ($\overline{x} < M_o, M_d$);
- **zero**: the *mean*, *mode* and *median* coincide, as in a *symmetric* distribution ($\overline{x} = M_o = M_d$).
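
As a quick numeric check of the two Pearson coefficients, the sketch below evaluates them on a small made-up sample. The values in `x` and the helper name `pearson_skewness` are illustrative assumptions, not part of the original notebook, and the population standard deviation is used (the default of `np.std`).

```python
import numpy as np

def pearson_skewness(x):
    """Return (S1, S2): Pearson mode and median skewness of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    mean, median, std = x.mean(), np.median(x), x.std()
    # mode: the most frequent value (the smallest one if there are ties)
    values, counts = np.unique(x, return_counts=True)
    mode = values[np.argmax(counts)]
    S1 = (mean - mode) / std
    S2 = 3 * (mean - median) / std
    return S1, S2

x = [1, 2, 2, 2, 3, 4, 5, 9]  # hypothetical right-skewed sample
S1, S2 = pearson_skewness(x)
print(f"S1={S1:.3f}, S2={S2:.3f}")
```

Here the mean is $3.5$, the median $2.5$ and the mode $2$, so both coefficients come out positive ($S_1 \approx 0.626$, $S_2 \approx 1.251$), matching the long right tail.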

In [None]:
np.random.seed(0)
df_ = np.random.normal(0, 1, 100000)
df_ = df_ - df_.min() + 1 # shift so the minimum is 1, keeping log2 defined
DF = [
    2**(df_/2),
    df_,
    np.log2(df_)
]

fig, AX = plt.subplots(1, 3)
COLOR = list("rgb")

for df, axis, color in zip(DF, AX, COLOR):
    mean = np.mean(df)
    median = np.median(df)
    std = np.std(df)
    # skewness
    S2 = 3*(mean - median)/std

    sns.distplot(df, ax=axis, color=color)
    axis.set_title(f'$S_2={S2:.03f}$')
    axis.axvline(x=mean, ymin=0.0, ymax=1.0, color='cyan')
    axis.axvline(x=median, ymin=0.0, ymax=1.0, color='black', linestyle='--')
    axis.legend(["distribution", 'mean', 'median'])

plt.show()

Another way to measure the skewness of a distribution is based on the third [standardized moment](https://en.wikipedia.org/wiki/Standardized_moment), calculated as:

$$ \large
g_1 = \frac{m_3}{m_2^{\frac{3}{2}}} = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^3}{\left[ \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2 \right]^{\frac{3}{2}}}
$$

where $m_2$ and $m_3$ are the *second* and *third* central moments, respectively. Finally, we can find the adjusted Fisher–Pearson standardized moment coefficient $G_1$, defined as:

$$ \large
G_1 = \frac{\sqrt{n(n-1)}}{n-2} \cdot g_1
$$

The interpretation of this value follows the same rules as the Pearson coefficients with respect to the numeric sign. In addition, its magnitude is commonly read with the following rule of thumb:

- If the skewness is less than $−1$ or greater than $+1$, the distribution can be called **highly skewed**;
- If the skewness is between $−1$ and $−\frac{1}{2}$ or between $+\frac{1}{2}$ and $+1$, the distribution can be called **moderately skewed**;
- If the skewness is between $−\frac{1}{2}$ and $+\frac{1}{2}$, the distribution can be called **approximately symmetric**.
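
The moment-based coefficient and the rule of thumb above can be sketched as two small helpers. The function names `adjusted_skewness` and `skewness_label` and the two toy samples are assumptions made for illustration, and the treatment of values lying exactly on a boundary ($\pm\frac{1}{2}$, $\pm 1$) is an arbitrary choice.

```python
import numpy as np

def adjusted_skewness(x):
    """Adjusted Fisher-Pearson coefficient G1 (needs at least 3 values)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d**2)  # second central moment
    m3 = np.mean(d**3)  # third central moment
    g1 = m3 / m2**1.5
    return np.sqrt(n * (n - 1)) / (n - 2) * g1

def skewness_label(G1):
    """Map G1 to the rule-of-thumb categories (boundary handling is a choice)."""
    size = abs(G1)
    if size > 1:
        return "highly skewed"
    if size > 0.5:
        return "moderately skewed"
    return "approximately symmetric"

for sample in ([1, 2, 3, 4, 10], [1, 2, 3, 4, 5]):
    G1 = adjusted_skewness(sample)
    print(sample, f"G1={G1:.3f}", skewness_label(G1))
```

For the first sample, $m_2 = 10$ and $m_3 = 36$, so $g_1 = 36 / 10^{3/2} \approx 1.138$ and $G_1 = \frac{\sqrt{20}}{3} \cdot g_1 \approx 1.697$, which falls in the highly skewed range. The second sample is symmetric about its mean, so $G_1 = 0$.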

In [None]:
np.random.seed(0)
df_ = np.random.normal(0, 1, 100000)
df_ = df_ - df_.min() + 1 # shift so the minimum is 1, keeping log2 defined
DF = [
    2**(df_/2),
    df_,
    np.log2(df_)
]

fig, AX = plt.subplots(1, 3)
COLOR = list("rgb")

for df, axis, color in zip(DF, AX, COLOR):
    n = df.size
    mean = np.mean(df)
    median = np.median(df)
    # central moments
    m2 = np.sum((df - mean)**2)/n
    m3 = np.sum((df - mean)**3)/n
    # moment coefficient
    g1 = m3/(m2**(3/2))
    # measure
    G1 = (n*(n - 1))**0.5/(n - 2)*g1

    sns.distplot(df, ax=axis, color=color)
    axis.set_title(f'$G_1={G1:.03f}$')
    axis.axvline(x=mean, ymin=0.0, ymax=1.0, color='cyan')
    axis.axvline(x=median, ymin=0.0, ymax=1.0, color='black', linestyle='--')
    axis.legend(["distribution", 'mean', 'median'])

plt.show()

## Kurtosis
---
Considering a probability distribution of a real-valued random variable, [kurtosis](https://en.wikipedia.org/wiki/Kurtosis) is a measure of the **tailedness** of the distribution. A simple quantile-based estimate is the percentile coefficient of kurtosis:

$$ \large
K = \frac{Q_3 - Q_1}{2(D_9 - D_1)}
$$

where:
- $Q_1$ and $Q_3$ are the *first* and *third* quartiles, respectively;
- $D_1$ and $D_9$ are the *first* and *ninth* deciles, respectively.

The value of $K$ falls into 3 distinct regimes, described below in terms of the excess kurtosis relative to the normal distribution:
- **Mesokurtic**: there is no excess kurtosis. For a normal distribution, $K \approx 0.263$;
- **Leptokurtic**: the excess kurtosis is *positive* (heavier tails than the normal). In terms of $K$, this corresponds to $K < 0.263$;
- **Platykurtic**: the excess kurtosis is *negative* (lighter tails than the normal). In terms of $K$, this corresponds to $K > 0.263$.
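
The reference value $0.263$ can be checked from theoretical quantiles instead of samples. The sketch below is an illustration under assumptions: the helper `percentile_kurtosis` and the choice of the Laplace and uniform distributions as comparison cases are not part of the original notebook. It relies only on the standard library (`statistics.NormalDist.inv_cdf`, available since Python 3.8).

```python
import math
from statistics import NormalDist

def percentile_kurtosis(quantile):
    """K = (Q3 - Q1) / (2 (D9 - D1)) from a quantile function p -> x."""
    Q1, Q3 = quantile(0.25), quantile(0.75)
    D1, D9 = quantile(0.10), quantile(0.90)
    return (Q3 - Q1) / (2 * (D9 - D1))

def laplace_quantile(p):
    # inverse CDF of the standard Laplace distribution (heavy tails)
    return math.log(2 * p) if p < 0.5 else -math.log(2 * (1 - p))

def uniform_quantile(p):
    # inverse CDF of the uniform distribution on [0, 1] (no tails)
    return p

K_normal = percentile_kurtosis(NormalDist().inv_cdf)
K_laplace = percentile_kurtosis(laplace_quantile)
K_uniform = percentile_kurtosis(uniform_quantile)
print(f"Laplace: {K_laplace:.4f}  normal: {K_normal:.4f}  uniform: {K_uniform:.4f}")
```

The normal distribution gives $K \approx 0.263$, the heavy-tailed Laplace distribution a smaller value (about $0.215$) and the uniform distribution a larger one (about $0.3125$), in line with the thresholds listed above.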

In [None]:
np.random.seed(0)
df_ = pd.Series(np.random.normal(0, 1, 100000))
DF = [
    df_[np.absolute(df_) < 4],
    df_[np.absolute(df_) < 2],
    df_[np.absolute(df_) < 1]
]

fig, AX = plt.subplots(1, 3)
COLOR = list("rgb")

for df, axis, color in zip(DF, AX, COLOR):
    # quartiles
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    # deciles
    D1 = df.quantile(0.10)
    D9 = df.quantile(0.90)
    # kurtosis
    K = (Q3 - Q1)/(2*(D9 - D1))

    sns.distplot(df, ax=axis, color=color)
    axis.set_title(f'$K={K:.03f}$')
    

plt.show()

Similarly to skewness, there is another way to measure the kurtosis of a distribution, now based on the fourth standardized moment:

$$ \large
g_2 = \frac{m_4}{m_2^{2}} - 3 = \frac{\frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^4}{\left[ \frac{1}{n} \sum_{i=1}^n (x_i - \overline{x})^2 \right]^{2}} - 3
$$

where $m_2$ and $m_4$ are the *second* and *fourth* central moments, respectively, and subtracting $3$ makes $g_2$ the *excess* kurtosis (zero for a normal distribution). Finally, we can find the sample-size-adjusted coefficient $G_2$, defined as:

$$ \large
G_2 = \frac{n-1}{(n-2)(n-3)} \cdot [(n+1) \cdot g_2 + 6]
$$
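
A minimal sketch of these two formulas as a reusable function follows. The name `excess_kurtosis`, the toy sample and the three random samples (with an arbitrarily chosen seed) are assumptions for illustration only.

```python
import numpy as np

def excess_kurtosis(x):
    """Return (g2, G2) for a 1-D sample with at least 4 values."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d**2)  # second central moment
    m4 = np.mean(d**4)  # fourth central moment
    g2 = m4 / m2**2 - 3
    G2 = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)
    return g2, G2

g2, G2 = excess_kurtosis([1, 2, 3, 4, 5])
print(f"g2={g2:.3f}, G2={G2:.3f}")

rng = np.random.default_rng(0)  # seed chosen arbitrarily
samples = [
    ("uniform", rng.uniform(-1, 1, 100_000)),
    ("normal", rng.normal(0, 1, 100_000)),
    ("laplace", rng.laplace(0, 1, 100_000)),
]
for name, sample in samples:
    print(name, f"G2={excess_kurtosis(sample)[1]:.3f}")
```

For the toy sample, $m_2 = 2$ and $m_4 = 6.8$, so $g_2 = 6.8/4 - 3 = -1.3$ and $G_2 = \frac{4}{6} \cdot (6 \cdot (-1.3) + 6) = -1.2$. On the random samples, the flat uniform distribution should give a clearly negative $G_2$, the normal a value near zero, and the heavy-tailed Laplace a clearly positive one.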

In [None]:
np.random.seed(0)
df_ = np.random.normal(0, 1, 100000)
DF = [
    df_[np.absolute(df_) < 3],
    df_[np.absolute(df_) < 2],
    df_[np.absolute(df_) < 1]
]

fig, AX = plt.subplots(1, 3)
COLOR = list("rgb")

for df, axis, color in zip(DF, AX, COLOR):
    n = df.size
    mean = np.mean(df)
    # central moments
    m2 = np.sum((df - mean)**2)/n
    m4 = np.sum((df - mean)**4)/n
    # moment coefficient
    g2 = m4/(m2**2) - 3
    # measure
    G2 = (n - 1)/((n - 2)*(n - 3))*((n + 1)*g2 + 6)

    sns.distplot(df, ax=axis, color=color)
    axis.set_title(f'$G_2={G2:.03f}$')

plt.show()