# Introduction to Descriptive Statistics

-----

One of the most basic tasks in data analytics, particularly for large data sets, is describing a data set by a small number of statistical quantities. For example, we may wish to know the average income in a geographic region, or the general spread about this average income. In this notebook, we introduce statistical quantities that can be used to describe a (one-dimensional) data set and to enable comparisons between data sets.

## Table of Contents

[Measures of Location](#Measures-of-Location)  

[Measures of Variability](#Measures-of-Variability)  

[Measures of Distribution](#Measures-of-Distribution)  

[Pandas Descriptive Statistics](#Pandas-Descriptive-Statitics)

[Exploratory Data Analysis](#Exploratory-Data-Analysis)

First, we include our standard imports and provide a brief exploration of a demo dataset before generating a NumPy array for our `total_bill` column to simplify computing descriptive statistics.

-----

In [1]:
# Standard imports
import numpy as np
import scipy as sp
import pandas as pd

import seaborn as sns

pd.options.display.max_rows = 10

# Load the Tips Data
tips = pd.read_csv('tips.csv')

# Display first few rows
tips.head(5)

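|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |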


In [2]:
# Convert the total_bill column to a NumPy array
data = tips.total_bill.values
data[10:20]

array([10.27, 35.26, 15.42, 18.43, 14.83, 21.58, 10.33, 16.29, 16.97,
       20.65])

-----
[[Back to TOC]](#Table-of-Contents)

## Measures of Location

### [Mean][1]

Given a data set, for example, the `total_bill` column in the _tips_ data set, the first step is often to compute a typical value, in this case the typical bill. The standard approach to this task is to compute the average or (arithmetic) mean ($\mu$ or $\overline{x}$):

\begin{equation}
\textrm{Mean} = \mu = \overline{x} = \frac{\sum_{i=1}^N x_i}{N}
\end{equation}

Computing this in Python is generally quite simple, and most data structures (including both Pandas and NumPy) include helper functions for computing this and other descriptive statistics. For example, we can compute the mean value for the `total_bill` column in the tips data:

`tips.total_bill.mean()`

-----
[1]: https://en.wikipedia.org/wiki/Mean

In [3]:
# We compute the mean of data
print(f'Total Bill Mean Value = {tips.total_bill.mean():6.2f}')

Total Bill Mean Value =  19.79


-----

The mean (or average) is a commonly used descriptive statistic. However, some data are distributed in ways that make the mean a poor measure for the typical value. For example, if data are bi-modal (with lots of low values and lots of high values), the mean will typically lie between these two clumps of data. Likewise, this can occur if a data set has a number of outliers at one end of a range (such as sales contracts that are much higher than the rest of the sales contracts in a list). In this case, other measures can provide more robust estimates. Some of the more common measures include the median and the mode.
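As a minimal sketch of the point above, the following cell computes the mean directly from the formula and then shows how a single outlier shifts the mean while leaving the median almost untouched. The `sales` values are hypothetical and chosen only for illustration; they are not taken from the tips data.

```python
import numpy as np

# Hypothetical sales figures (illustrative values, not from the tips data)
sales = np.array([10.0, 12.0, 11.0, 13.0, 12.0])

# Mean computed directly from the formula: sum of the values divided by N
mean_by_formula = sales.sum() / len(sales)
print(f'Mean (formula) = {mean_by_formula:6.2f}')
print(f'Mean (np.mean) = {np.mean(sales):6.2f}')
print(f'Median         = {np.median(sales):6.2f}')

# Append one large outlier and recompute both measures
sales_with_outlier = np.append(sales, 100.0)
mean_with_outlier = np.mean(sales_with_outlier)
median_with_outlier = np.median(sales_with_outlier)
print(f'Mean with outlier   = {mean_with_outlier:6.2f}')
print(f'Median with outlier = {median_with_outlier:6.2f}')
```

With these values the mean more than doubles once the outlier is added, while the median stays at 12, which is why the median is often preferred for skewed data.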


### [Median][2]

The median is conceptually an easy measure of centrality: half the data lie above the median and half lie below it. Thus, to compute the median you first sort the data and select the middle element. While conceptually simple, in practice computing the median is complicated by several issues:

1. We need to sort the data, which is challenging for big data (and also implies the data can be sorted).
2. For a data set with an even number of values there is no middle value.

In the second case, for example, given the list of points `[0, 1, 2, 3, 4, 5]`, we can use three different techniques to compute a median:

- average the two middle values (`median=2.5`),
- take the lower of the two middle values (`median=2`), or
- take the higher of the two middle values (`median=3`).

Often, a Python module by default employs the first technique (average the two middle values), as demonstrated below.
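The three conventions listed above can be sketched with the Python standard library `statistics` module, which provides one function for each. This is only an illustration on the six-point list from the text; the variable names are our own.

```python
import statistics

# The even-length list used in the text above
points = [0, 1, 2, 3, 4, 5]

# Average of the two middle values
median_avg = statistics.median(points)
# Lower of the two middle values
median_low = statistics.median_low(points)
# Higher of the two middle values
median_high = statistics.median_high(points)

print(f'median      = {median_avg}')   # 2.5
print(f'median_low  = {median_low}')   # 2
print(f'median_high = {median_high}')  # 3
```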

-----

[2]: https://en.wikipedia.org/wiki/Median


In [4]:
import numpy as np
np.median([1,2,3,4,5,6])

3.5

In [5]:
print(f'Total Bill Median Value = {tips.total_bill.median():6.2f}')

Total Bill Median Value =  17.80



### Mode

The mode is simply the most common value in a data set. Often, a mode makes the most sense when the data have been binned, in which case the data have been aggregated and it becomes simple to find the bin with the most values (we will see this in a later lesson on distributions). A mode can also make sense when data are categorical, or explicitly limited to a discrete set of values. In the following cell, we find the most common party size for a meal. Notice that because `size` is already an attribute of a pandas DataFrame (it returns the number of elements), we can't refer to the `size` column with `tips.size` as we do for `total_bill` above; we have to use `tips['size']` instead. Also, since a data set can have more than one mode, `mode()` returns a pandas Series, which we convert to a Python list.

-----

In [6]:
print(f'Party Size Mode = {tips["size"].mode().tolist()}')

Party Size Mode = [2]
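To illustrate the two points made about the mode, here is a small sketch on a hypothetical DataFrame (the `meals` data are invented for this example). It shows that attribute access with `.size` returns the number of elements rather than the column, and that `mode()` can return more than one value when there is a tie.

```python
import pandas as pd

# Hypothetical data with a tie for the most common party size
meals = pd.DataFrame({'size': [2, 2, 3, 3, 4],
                      'tip': [1.0, 1.5, 2.0, 2.5, 3.0]})

# Attribute access returns the DataFrame's element count (rows * columns)
element_count = meals.size
print(f'meals.size = {element_count}')  # 10

# Bracket access returns the column; mode() returns every tied value
modes = meals['size'].mode().tolist()
print(f'Party Size Modes = {modes}')  # [2, 3]
```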


-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **Code** cell below, write and execute code to compute the mean and median for the `total_bill` column, separately for those rows corresponding to _lunch_ and _dinner_. What do these values suggest about the typical lunch or dinner party?


-----

-----
[[Back to TOC]](#Table-of-Contents)


## Measures of Variability

Once we have a measure for the central location of a data set, the next step is to quantify the spread or dispersion of the data around this location. Inherent in this process is the computation of a distance between the data and the central location. Variations in the distance measurement can lead to differences in variability measures, which include the following dispersion measurements:

- Range
- Variance
- Standard deviation


### [Range][1]

Of these, the simplest is a range measurement, which is simply the difference between two characteristic values. Often these are the minimum and maximum values in a data set. While simple, this provides little insight into the true spread or dispersion of the data, in particular, about the centrality measure (since the mean might lie anywhere between the maximum and minimum values).

### [Variance][3]

Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value.
The formula for the variance is simple:

\begin{equation}
\textrm{Variance} = \frac{\sum_{i=1}^N (x_i - \overline{x})^2}{N}
\end{equation}

### [Standard Deviation][4]

One concern with the variance, however, is that the units of variance are the square of the original units (for example, length versus length * length). To enable the measure of the variability around the mean to be compared to the mean, we generally will use the standard deviation, which is simply the square root of the variance:

\begin{equation}
\textrm{Standard Deviation} = \sigma = \sqrt{\left(\frac{\sum_{i=1}^N (x_i - \overline{x})^2}{N}\right)}
\end{equation}

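Pandas Series and DataFrame objects provide methods to calculate these statistics, as shown in the code cell below. Note that the pandas `var()` and `std()` methods divide by $N - 1$ by default (giving the sample variance and sample standard deviation) rather than by $N$ as in the formulas above; pass `ddof=0` to divide by $N$.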

-----


[1]: https://en.wikipedia.org/wiki/Range_(statistics)
[3]: https://en.wikipedia.org/wiki/Variance#Sample_variance
[4]: https://en.wikipedia.org/wiki/Standard_deviation

In [7]:
# Compute Range quantities
maximum = tips.total_bill.max()
minimum = tips.total_bill.min()
print(f'Maximum = {maximum:6.4f}')
print(f'Minimum = {minimum:6.4f}')
print(f'Range = {maximum-minimum:6.4f}')

# Variance and standard deviation
print(f'Variance = {tips.total_bill.var():6.4f}')
print(f'Standard Deviation = {tips.total_bill.std():6.4f}\n')

Maximum = 50.8100
Minimum = 3.0700
Range = 47.7400
Variance = 79.2529
Standard Deviation = 8.9024
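The formulas above can be checked by hand on a small array. The sketch below uses hypothetical values (not the tips data) and compares the direct calculation with NumPy, whose `var` and `std` functions divide by $N$ by default, and with pandas, whose methods divide by $N - 1$ by default.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to follow (mean = 5)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)

# Range: difference between the maximum and minimum
value_range = values.max() - values.min()

# Sum of squared deviations from the mean
squared_dev_sum = ((values - values.mean()) ** 2).sum()

# Population variance divides by N; sample variance divides by N - 1
pop_var = squared_dev_sum / n
sample_var = squared_dev_sum / (n - 1)

print(f'Range               = {value_range:6.4f}')
print(f'Population variance = {pop_var:6.4f}')
print(f'NumPy var (ddof=0)  = {np.var(values):6.4f}')
print(f'Sample variance     = {sample_var:6.4f}')
print(f'pandas var (ddof=1) = {pd.Series(values).var():6.4f}')
print(f'Population std      = {np.sqrt(pop_var):6.4f}')
```

For large data sets such as the tips data the two conventions give nearly the same result, but for small samples the difference can be noticeable.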



-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **Code** cell below, write and execute code to compute the standard deviation for the `total_bill` column, separately for those rows corresponding to _lunch_ and _dinner_. What do these values suggest about the typical lunch or dinner party? Also, compare the coefficient of variation (the standard deviation divided by the mean) for the same data. Do these values also make sense?


-----

-----
[[Back to TOC]](#Table-of-Contents)

## Measures of Distribution

While useful, measures of location and dispersion are only two numbers. Thus, other techniques exist for analytically quantifying a distribution of data. Fundamentally, these techniques all rely on sorting the data and finding the value below which a certain percentage of data lie. In this manner, the first _percentile_ is the value below which one percent of the data lie, while the median is the value below which fifty percent of the data lie. Formally, the following pre-defined measures are commonly used:

1. Percentiles: Divide the data into one percent chunks; the value indicates which percentile is of interest
2. Deciles: Divide the data into ten chunks, each containing 10% of the data
3. Quintiles: Divide the data into five chunks, each containing 20% of the data
4. Quartiles: Divide the data into four chunks, each containing 25% of the data

With NumPy we simply use the `percentile` function for all of these and specify the appropriate value for the percentile. Thus, for the second decile, we specify `np.percentile(data, 20)`. The following Code cell demonstrates some of the more common values.

-----

In [8]:
# Demonstrate percentile with the Median
data = tips.total_bill.values
print(f'Median (via percentile) = {np.percentile(data, 50):6.4f}')

# Compute quartiles
print(f'First Quartile = {np.percentile(data, 25):6.4f}')
print(f'Second Quartile = {np.percentile(data, 50):6.4f}')
print(f'Third Quartile = {np.percentile(data, 75):6.4f}')
print(f'Fourth Quartile = {np.percentile(data, 100):6.4f}')

# Interquartile range is sometimes used as a dispersion measure
print(f'Interquartile Range = {np.percentile(data, 75) - np.percentile(data, 25):6.4f}')

Median (via percentile) = 17.7950
First Quartile = 13.3475
Second Quartile = 17.7950
Third Quartile = 24.1275
Fourth Quartile = 50.8100
Interquartile Range = 10.7800
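As a small sketch of how these measures relate, `np.percentile` also accepts a sequence of percentiles, so all quartiles or deciles can be computed in one call. The `scores` array below is hypothetical: the integers 0 through 100, chosen so that each percentile equals its own percentage under NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical data: the integers 0..100, so the p-th percentile equals p
scores = np.arange(101)

# Quartile boundaries in a single call
quartiles = np.percentile(scores, [25, 50, 75])
print(f'Quartiles = {quartiles}')

# Decile boundaries: the 10th, 20th, ..., 90th percentiles
deciles = np.percentile(scores, np.arange(10, 100, 10))
print(f'Deciles = {deciles}')

# Interquartile range from the first and third quartiles
iqr = quartiles[2] - quartiles[0]
print(f'Interquartile Range = {iqr:6.4f}')
```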


In [9]:
# Compute quintiles (f-strings are required for the values to be interpolated)
print(f'First Quintile = {np.percentile(data, 20):6.4f}')
print(f'Second Quintile = {np.percentile(data, 40):6.4f}')
print(f'Third Quintile = {np.percentile(data, 60):6.4f}')
print(f'Fourth Quintile = {np.percentile(data, 80):6.4f}')


-----
[[Back to TOC]](#Table-of-Contents)

<a id="Pandas-Descriptive-Statitics"></a>

## Pandas Descriptive Statistics

The `describe()` method, available on both pandas Series and DataFrame objects, generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values. For numeric columns, it lists the count of non-NaN values, the mean, the standard deviation, the minimum and maximum, and the quartiles.

In [10]:
tips.describe()

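|   | total_bill | tip | size |
|---|---|---|---|
| count | 244.000000 | 244.000000 | 244.000000 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.951100 |
| min | 3.070000 | 1.000000 | 1.000000 |
| 25% | 13.347500 | 2.000000 | 2.000000 |
| 50% | 17.795000 | 2.900000 | 2.000000 |
| 75% | 24.127500 | 3.562500 | 3.000000 |
| max | 50.810000 | 10.000000 | 6.000000 |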
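The `describe()` method also accepts a `percentiles` argument for requesting cut points other than the quartiles. The following sketch uses a small hypothetical DataFrame (the `bills` values are invented for illustration) and confirms that the summary rows agree with the individual methods introduced earlier.

```python
import pandas as pd

# Hypothetical data, not the tips data set
bills = pd.DataFrame({'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0],
                      'tip': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Request the 20th and 80th percentiles instead of the default quartiles
summary = bills.describe(percentiles=[0.2, 0.8])
print(summary)

# Individual rows can be looked up by label and match the separate methods
mean_from_summary = summary.loc['mean', 'total_bill']
std_from_summary = summary.loc['std', 'total_bill']
print(f'Mean = {mean_from_summary:6.2f}, Std = {std_from_summary:6.4f}')
```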


-----
[[Back to TOC]](#Table-of-Contents)

## Exploratory Data Analysis

Exploratory Data Analysis, or EDA, refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics and graphical representations.

The summary statistics introduced above can be used to see what the data can tell us. In practice, however, EDA relies mostly on a graphical approach. We will introduce some basic data visualizations for EDA in the following lesson.

-----

<font color='red' size = '5'> Student Exercise </font>

In the empty **Code** cell below, print descriptive statistics of the `total_bill`, `tip`, and `size` columns, separately for those rows corresponding to _lunch_ and _dinner_. What do these values suggest about the typical lunch or dinner party?

-----

-----

## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. [NumPy statistics][1] documentation
2. The [Python statistics module][2]
3. Blog article by Randal Olson on [statistical analysis][4] with Pandas
4. The [exploratory data analysis chapter][ts2c] from the book _Think Stats 2e_ by Allen Downey

-----

[1]: https://docs.scipy.org/doc/numpy/reference/routines.statistics.html
[2]: https://docs.python.org/3/library/statistics.html
[4]: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
[ts2c]: http://greenteapress.com/thinkstats2/html/index.html

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode