# Synopsis


In this session we will get an overview of descriptive statistics.  

We also discuss how to calculate frequency plots, box plots, and violin plots.

# Read libraries

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from copy import copy, deepcopy
from pathlib import Path
from sys import path

path.append( str(Path.cwd().parent) )


In [None]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np   # used below as np (np.median, np.ones, np.percentile, ...)
import scipy.stats as stats


In [None]:
my_fontsize = 15

# Descriptive statistics

The first step in the analysis of data is to obtain a description that summarizes its statistical properties. There are a number of **statistics** (that is, measures that can be calculated for that data) that are particularly useful.

> **Number of observations** is the number of data points in the data set. In this case, the number of applicants for whom we have GPA values.
>
> **Minimum** is the smallest value in the data set.
>
> **Maximum** is the largest value in the data set. For the GPA data, this is presumably 4.0.
>
> **Support** (also called **range**) is the interval over which the values of the data set spread. Since GPAs are positive and must be no larger than 4, we know that the range must be a subset of the interval [0, 4]. Presumably, students with GPAs lower than 2 will not apply to a graduate program, so the support of our GPA data will likely be a subset of the interval [2, 4].
>
> **Mode** is the most common value in the data set.  
>
> **Median** is the value that is larger than half of all values and smaller than half of all values in the data set. The median is an example of a **percentile**.  Two other common percentiles are the **first quartile** and the **third quartile**.
>
> **Interquartile range** is the difference between the third and first quartile. It provides an estimation of the dispersion of the data.
>
> **Sample Mean** (also called sample average) is the sum of all values divided by the number of observations.  The sample mean is the single value that minimizes the sum of squared distances to all values in the sample.
>
> **Standard deviation** is a measure of the spread around the sample mean for the values in the data set.
>
> **Skewness** is a measure of the asymmetry of the values in the data set. If you divide the support of the data at the sample mean and one of the two intervals is longer than the other, then the data is skewed.
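
The distance property mentioned for the sample mean can be checked numerically. The sketch below assumes that "distance" means the sum of squared differences; the array `sample` and the helper `sum_squared_distance` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Hypothetical sample, chosen only for illustration
sample = np.array([2.0, 3.0, 5.0, 7.0, 13.0])


def sum_squared_distance(center, values):
    """Sum of squared differences between `center` and every value."""
    return float(np.sum((values - center) ** 2))


# Try 1101 candidate centers between the minimum and the maximum (step 0.01)
candidates = np.linspace(sample.min(), sample.max(), 1101)
costs = np.array([sum_squared_distance(c, sample) for c in candidates])
best_center = candidates[np.argmin(costs)]

print(f"Sample mean:           {sample.mean():.2f}")
print(f"Best candidate center: {best_center:.2f}")
```

Scanning a grid is deliberately naive; it is only meant to show that no other candidate center does better than the mean under this definition of distance.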

These quantities can all be easily obtained using methods already coded in `SciPy` and `NumPy`.
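
As a rough sketch of what those calls can look like, the block below computes most of the statistics listed above for a small hypothetical sample (not the GPA data, and not the data generated later in this notebook). Using `ddof=1` for the standard deviation is an assumption about which estimator is wanted.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sample, used only for illustration
sample = np.array([1, 2, 2, 3, 4, 7, 9])

n_obs = len(sample)
minimum, maximum = sample.min(), sample.max()

# First quartile, median, and third quartile are the 25th, 50th, 75th percentiles
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

mean = sample.mean()
std = sample.std(ddof=1)        # ddof=1: divide by n - 1 (sample standard deviation)
skewness = stats.skew(sample)   # positive when the right tail is longer

print(f"n = {n_obs}, min = {minimum}, max = {maximum}")
print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"mean = {mean:.2f}, std = {std:.2f}, skewness = {skewness:.2f}")
```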

Let's work through an example:

In [None]:
seed = 42  # So hopefully we are all getting the same numbers

In [None]:
a = 1
b = 100

data_20 = stats.randint.rvs(a, b, size = 20, random_state = seed)
data_20.sort()

data_100 = stats.randint.rvs(a, b, size = 100, random_state = seed)
data_100.sort()

data_1000 = stats.randint.rvs(a, b, size = 1000, random_state = seed)
data_1000.sort()

print(data_20)
data_20

In [None]:
print(f"There are {len(data_20)} observations in the sample.\n")

print(f"The minimum value in the sample is {min(data_20)}.\n")

print(f"The maximum value in the sample is {max(data_20)}.\n")

# randint excludes the upper bound b, so the largest possible value is b - 1
print(f"The theoretical support for this process is integers between {a} and {b - 1}.\n")

print(f"The median of the sample is {np.median(data_20)}.\n")

# You write the rest


In [None]:
print(f"The mode of the sample is {stats.mode(data_20)}.\n")

Not exactly what we wanted...  `stats.mode` returns a result object with two attributes: `mode` and `count`.

In [None]:
print(f"The mode of the sample is {stats.mode(data_20).mode}.\n")


Better... But there is still an issue: `stats.mode` also accepts multi-dimensional arrays, so, depending on the SciPy version, it may return an array holding one value rather than a plain number.

In [None]:
mode_result = stats.mode(data_20)  # np.atleast_1d handles scalar or array results
print(f"The mode of the sample is {np.atleast_1d(mode_result.mode)[0]}; it "
      f"occurs {np.atleast_1d(mode_result.count)[0]} times.\n")

In [None]:
my_array = np.ones(shape=(20,2))
my_array[:,0] = data_20
print(my_array)


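# Compute the column-wise modes once; np.atleast_2d keeps the [0, i] indexing
# valid whether SciPy returns shape (2,) or shape (1, 2)
column_modes = stats.mode(my_array)
for i in range(2):
    print(f"\nThe mode of column {i} of the sample is "
          f"{np.atleast_2d(column_modes.mode)[0, i]}; it occurs "
          f"{np.atleast_2d(column_modes.count)[0, i]} times.\n")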
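

One limitation worth keeping in mind is that `stats.mode` reports a single value even when several values tie for most common. A minimal alternative sketch, using `np.unique` on a hypothetical sample in which two values tie, is:

```python
import numpy as np

# Hypothetical sample in which 3 and 7 both occur twice
sample = np.array([3, 7, 7, 2, 3, 9])

# np.unique returns the sorted distinct values and how often each one occurs
values, counts = np.unique(sample, return_counts=True)

max_count = counts.max()
modes = values[counts == max_count]   # every value that reaches the top count

print(f"Most common value(s): {modes.tolist()}, each occurring {max_count} times.")
```

This is not a replacement for `stats.mode`, only a way to see all tied values at once.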


# Frequency plots

While descriptive statistics are very useful, their calculation involves the loss of a lot of information about the data.  Creating a frequency plot provides a much more accurate picture of the statistical properties of the data *as long as it is calculated properly*. 


Consider our data from above. How many times $-$ i.e., how frequently $-$ does each value in the sample space occur in our small sample?
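
Before plotting, the counts themselves can be tabulated. The sketch below assumes NumPy's random generator as a stand-in for the `stats.randint` call used earlier, so its 20 values are not the same as `data_20`; the names `sample`, `observed`, and `never_seen` are introduced here for illustration.

```python
import numpy as np

# Assumption: NumPy's generator stands in for the notebook's stats.randint call
rng = np.random.default_rng(42)
sample = rng.integers(1, 100, size=20)   # 20 integers from 1 to 99

# counts[v] is the number of times the value v appears in the sample
counts = np.bincount(sample, minlength=100)

observed = np.flatnonzero(counts)        # values that occur at least once
never_seen = 99 - len(observed)          # values in 1..99 that never occur

print(f"Distinct values observed: {len(observed)}")
print(f"Values never observed:    {never_seen}")
print(f"Largest count for a single value: {counts.max()}")
```

With only 20 draws, at least 79 of the 99 possible values cannot appear at all, which is why a value-by-value frequency plot of a small sample looks so sparse.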



In [None]:
def hist_plot(y_max, step):
    fig = plt.figure( figsize = (12, 4) )
    ax = fig.add_subplot(111)

    ax.set_xlim(0, 100)
    ax.set_xlabel('Value', fontsize = 1.4* my_fontsize)

    ax.set_xticks(range(0, 101, 10))
    ax.set_xticklabels(range(0, 101, 10), fontsize = my_fontsize)
    
    ax.set_ylim(0, y_max)
    ax.set_ylabel('Frequency', fontsize = 1.4* my_fontsize)

    ax.set_yticks(range(0, y_max, step))
    ax.set_yticklabels(range(0, y_max, step), fontsize = my_fontsize)
    
    return ax


def box_plot(sizes):
    n = len(sizes)
    xticks = np.arange(0, 101, 10)
    labels = [f"n = {size}" for size in sizes]   # 'size' avoids reusing the name n
    
    fig = plt.figure( figsize = (12, 2*n+1) )
    ax = fig.add_subplot(111)
    
    for axis in ['top','right', 'left']:
        ax.spines[axis].set_visible(False)

    ax.set_xlim(0, 100)
    ax.set_xlabel('Value', fontsize = 1.4* my_fontsize)
    
    ax.set_xticks(xticks)
    ax.set_xticklabels(xticks, fontsize = my_fontsize)
    ax.vlines( xticks, 0, 1+len(sizes), colors = '0.7', 
               zorder = -10 )
    
    ax.set_ylim(0.5, 0.5+n)
    ax.set_yticks(np.arange(1, 1+n))
    ax.set_yticklabels(labels, fontsize = my_fontsize)
        
    return ax



In [None]:
ax = hist_plot(3, 1)
ax.hist(data_20, bins = range(101), width = 0.8, align = 'mid'); 

ax = hist_plot(20, 5)
ax.hist(data_1000, bins = range(101), width = 0.8, align = 'mid'); 


What do we learn from these plots about **the process that generated the data**?

If we read the **top panel** in a naive manner, we would think that values between 40 and 50 never occur, and that 2 is one of the three most likely values to occur. 

**But that is not how we truly read them!**

We read them as suggesting that any value between 1 and 100 has pretty much the same chance of occurring.

This means that the representation we chose for the frequency plot is not ideal.  We have too few data points in our samples to support such a fine-grained frequency plot.

## Binning

One way to address lack of data is to **bin** our results.  Instead of considering every value between 0 and 100 separately, we can create boxes $-$ bins $-$ in which we place our values. With the bin edges used below, one bin covers 0-19, another 20-39, and so on.

We would then count how many of the data points in our sample fall within each bin.
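
The counting step can be sketched with `np.histogram`, here on a small hypothetical sample chosen so that several values sit exactly on bin edges. Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.

```python
import numpy as np

# Hypothetical sample with several values sitting on bin edges
sample = np.array([0, 5, 19, 20, 21, 39, 40, 99, 100])

edges = np.arange(0, 101, 20)            # 0, 20, 40, 60, 80, 100
counts, _ = np.histogram(sample, bins=edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"bin starting at {left:>3} and ending at {right:>3}: {count} value(s)")
```

Note that 20 lands in the second bin and 40 in the third, while 100 is kept in the last bin.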


In [None]:
ax = hist_plot(30, 5)
ax.hist(data_100, bins = range(0, 101, 20), width = 19.5, align = 'mid'); 


Now it is much easier to see that the data is pretty much uniformly distributed over the entire sample space, which is the kind of information you would try to glean from a frequency plot. As the sample size increases, the pattern $-$ uniformity of probability $-$ becomes more and more apparent.


In [None]:
ax = hist_plot(25, 5)
ax.hist(data_100, bins = range(0, 101, 20), width = 19.5, align = 'mid'); 

ax = hist_plot(220, 50)
ax.hist(data_1000, bins = range(0, 101, 20), width = 19.5, align = 'mid'); 


# Box plots

While frequency plots are wonderful because they enable the human eye and brain to make a quick evaluation of patterns in the data, they are not particularly useful when sample sizes are too small.

**Note that when the sample size is *really small*, the most transparent representation is showing all the data!**

A box plot provides a way to graphically summarize properties of data when one has samples with sizes in the range 10 to 20.

In [None]:
sizes = [20]
data = np.array(data_20)

ax = box_plot(sizes)
bplot = ax.boxplot( data_20, vert = False, patch_artist = True )

ax.text( 74, 1.1, 'Third quartile\nQ3', ha = 'left', rotation = 45,
         fontsize = my_fontsize )
ax.text( 19, 1.1, 'First quartile\nQ1', ha = 'left', rotation = 45,
         fontsize = my_fontsize )
ax.text( 51.5, 1.1, 'Median', ha = 'left', color = 'orange', rotation = 45,
         fontsize = my_fontsize )

ax.hlines(0.84, 22, 35)
ax.hlines(0.84, 65, 77)
ax.text( 50, 0.8, 'Inter-quartile range (IQR)', ha = 'center', 
         color = 'blue', fontsize = my_fontsize )

ax.text( 93, 0.55, 'Right whisker:\nmaximum value\nsmaller than\nQ3 + 1.5 IQR', 
         ha = 'left', color = 'darkred', fontsize = my_fontsize )


ax.text( 2, 0.55, 'Left whisker:\nminimum value\ngreater than\nQ1 - 1.5 IQR', 
         ha = 'left', color = 'darkred', fontsize = my_fontsize );

Note that if the data does not extend beyond the first and/or third quartiles by more than `1.5 IQR`, then the whiskers simply show the minimum and/or maximum of the data.

If there are data points with values beyond these limits, they are shown individually in the plot with a `marker` of your choice and are, usually, interpreted as **outliers**.
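
The whisker and outlier rule described above can be worked through by hand. This is a minimal sketch on a hypothetical sample with one unusually small and one unusually large value; the names `lower_fence` and `upper_fence` are introduced here for the limits `Q1 - 1.5 IQR` and `Q3 + 1.5 IQR`.

```python
import numpy as np

# Hypothetical sample with one very small and one very large value
sample = np.array([1, 10, 11, 12, 13, 14, 15, 16, 40])

q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
inside = sample[(sample >= lower_fence) & (sample <= upper_fence)]
left_whisker, right_whisker = inside.min(), inside.max()

# Everything outside the fences would be drawn as an individual marker
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: {lower_fence} and {upper_fence}")
print(f"Whiskers: {left_whisker} and {right_whisker}")
print(f"Outliers: {outliers.tolist()}")
```

The sketch assumes the factor 1.5, which is also the default `whis` value of `boxplot` in matplotlib.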

In [None]:
ax = box_plot([20, 100, 1000])
bplot = ax.boxplot( [data_20, data_100, data_1000], vert = False, 
                    patch_artist = True )

colors = ['0.5', '0.6', '0.7']
for patch, color in zip(bplot['boxes'], colors):
    patch.set_facecolor(color);

    
for i, txt in enumerate( [20, 100, 1000] ):
    ax.text(82, i+1.2, f"N = {txt}", fontsize = 1.3* my_fontsize)

Note that as sample sizes increase, the box plot starts to provide a more and more accurate picture of the stochastic process. Specifically, 

> the median approaches 50, 
>
> the first quartile approaches 25, 
>
> the third quartile approaches 75, 
>
> the whiskers approach 1 and 99, the smallest and largest values the generator can produce. 
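
The same trend can be checked numerically. The sketch below assumes NumPy's random generator in place of the notebook's `stats.randint` samples and adds a much larger sample size for contrast, so the printed values will not match the plots exactly.

```python
import numpy as np

# Assumption: NumPy generator, not the notebook's data_20 / data_100 / data_1000
rng = np.random.default_rng(42)

summary = {}
for n in [20, 100, 1000, 100_000]:
    sample = rng.integers(1, 100, size=n)   # uniform integers from 1 to 99
    q1, median, q3 = np.percentile(sample, [25, 50, 75])
    smallest, largest = int(sample.min()), int(sample.max())
    summary[n] = (smallest, q1, median, q3, largest)
    print(f"n = {n:>6}: min = {smallest:>2}, Q1 = {q1:5.1f}, "
          f"median = {median:5.1f}, Q3 = {q3:5.1f}, max = {largest:>2}")
```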

# Violin plots

Violin plots are becoming a popular alternative to box plots. The algorithm for creating violin plots uses kernel density estimation to obtain a smooth profile of the variation in local density of the data.

For the sample with 20 data points, we can see that KDE reproduces the low, high, low, high, low variation of the frequency plot.  If we had fewer points, the violin plot would not really have access to enough information to generate more than a guess of what the true frequency is locally.
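
To see what kernel density estimation produces, here is a minimal sketch using `scipy.stats.gaussian_kde` on a small hypothetical sample. It is meant as a stand-in for illustration and is not necessarily the exact routine that `violinplot` uses internally; the default bandwidth is also an assumption.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical sample with a cluster of values between 60 and 71
sample = np.array([5, 8, 12, 15, 60, 62, 65, 70, 71, 95], dtype=float)

kde = gaussian_kde(sample)          # Gaussian kernels, default bandwidth
grid = np.linspace(0, 100, 201)     # evaluation points, spaced 0.5 apart
density = kde(grid)                 # smooth estimate of the local density

peak_location = grid[np.argmax(density)]
print(f"Estimated density is highest near {peak_location:.1f}")
print(f"Density at 35: {density[70]:.4f}, density at 65: {density[130]:.4f}")
```

The width of a violin at a given value is proportional to this kind of density estimate, which is why the outline bulges where data points cluster.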

In [None]:
sizes = [20, 20]
data = np.array([data_20, data_20])

ax = box_plot(sizes)
# Plot the 20-point sample, matching the "n = 20" labels and the quartiles below
bplot = ax.boxplot( data_20, vert = False, positions = [2], 
                    patch_artist = True )

parts = ax.violinplot( data_20, vert = False, showmeans = False, 
                       showmedians = False, showextrema=False )


for pc in parts['bodies']:
    pc.set_facecolor('#D43F3A')
    pc.set_edgecolor('black')
    pc.set_alpha(1)

quartile1, median, quartile3 = np.percentile( data_20, [25, 50, 75])
iqr = quartile3 - quartile1
# np.clip(value, lower, upper): keep each whisker end between the data extreme and its quartile
whiskers_min = np.clip(quartile1 - 1.5*iqr, data_20[0], quartile1)
whiskers_max = np.clip(quartile3 + 1.5*iqr, quartile3, data_20[-1])

ax.scatter(median, 1, marker='o', color='white', s = 300, zorder=3)
ax.hlines(1, quartile1, quartile3, color = 'k', lw = 10)
ax.hlines(1, whiskers_min, whiskers_max, color = 'k', lw = 2);
