# Estimating quartiles

# Document

<table align="left">
    <tr>
        <th class="text-align:left">Title</th>
        <td class="text-align:left">Estimating quartiles</td>
    </tr>
    <tr>
        <th class="text-align:left">Last modified</th>
        <td class="text-align:left">2018-09-04</td>
    </tr>
    <tr>
        <th class="text-align:left">Author</th>
        <td class="text-align:left">Gilles Pilon &lt;gillespilon13@gmail.com&gt;</td>
    </tr>
    <tr>
        <th class="text-align:left">Status</th>
        <td class="text-align:left">Active</td>
    </tr>
    <tr>
        <th class="text-align:left">Type</th>
        <td class="text-align:left">Jupyter notebook</td>
    </tr>
    <tr>
        <th class="text-align:left">Created</th>
        <td class="text-align:left">2018-08-18</td>
    </tr>
    <tr>
        <th class="text-align:left">File name</th>
        <td class="text-align:left">estimating_quartiles.ipynb</td>
    </tr>
    <tr>
        <th class="text-align:left">Other files required</th>
        <td class="text-align:left">estimating_quartiles.csv</td>
    </tr>
</table>

# In brief

This notebook explores how Python calculates quartiles. While developing the anova_one_factor notebook, I found that [Python](https://www.python.org), [LibreOffice](https://www.libreoffice.org), and [Excel](https://office.microsoft.com/excel/) calculate quartiles in the same way, but that Minitab and an online source calculate them differently. There are at least eleven ways to calculate quartiles.
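To illustrate how two packages can disagree, the sketch below compares two position rules for a quartile at probability p in a sorted sample of size n. The rule 1 + p(n - 1) reproduces the pandas default results shown later in this notebook. The rule p(n + 1) is assumed here to be the one Minitab uses. The function names and the sample values are hypothetical: the sample was chosen to be consistent with the summary statistics printed in this notebook, and the actual contents of the data file may differ.

```python
import math


def value_at_position(sorted_values, position):
    """Linearly interpolate at a 1-based, possibly fractional, position."""
    n = len(sorted_values)
    # Clamp so that positions outside 1..n return the extremes.
    position = min(max(position, 1.0), float(n))
    lower = math.floor(position)
    if lower == n:
        return float(sorted_values[-1])
    fraction = position - lower
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


def quartiles_n_minus_1(values):
    """Position rule 1 + p(n - 1), the pandas default ('linear')."""
    x = sorted(values)
    n = len(x)
    return [value_at_position(x, 1 + p * (n - 1)) for p in (0.25, 0.50, 0.75)]


def quartiles_n_plus_1(values):
    """Position rule p(n + 1), assumed here to be the Minitab rule."""
    x = sorted(values)
    n = len(x)
    return [value_at_position(x, p * (n + 1)) for p in (0.25, 0.50, 0.75)]


# Hypothetical sample, consistent with the statistics shown in this notebook.
sample = [0, 0, 1, 2, 13, 27, 61, 63]
print(quartiles_n_minus_1(sample))  # [0.75, 7.5, 35.5]
print(quartiles_n_plus_1(sample))   # [0.25, 7.5, 52.5]
```

Both rules agree on the median for this sample but place the first and third quartiles at different points between the same pair of neighbouring observations.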

# Data

Download the data file.

[estimating_quartiles](https://drive.google.com/open?id=1Nc_VFXo2SrsSdprfCmQYhLbJawAzKpH6)

# Methodology

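The data are read into a pandas DataFrame. Quartiles are then calculated with `Series.quantile` for each of its five `interpolation` options: linear, lower, higher, nearest, and midpoint.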

# Explanation of the eleven methods

Quantiles are cut points that divide the range of a probability distribution into continuous intervals with equal probabilities, or that divide the observations in a sample in the same way ([Wikipedia](https://en.wikipedia.org/wiki/Quantile)). Quartiles are the three quantiles that divide the data into four groups.

## Method 1
TBD

## Method 2
TBD

## Method 3
TBD

## Method 4
TBD

## Method 5
TBD

## Method 6
TBD

## Method 7
TBD

## Method 8
TBD

## Method 9
TBD

## Method 10
TBD

## Method 11
TBD

In [1]:
import datetime as dt
start_time = dt.datetime.now()

In [2]:
# Import the required libraries.
import pandas as pd


In [3]:
# Read the data file.
# y is the column of response values.
df = pd.read_csv('estimating_quartiles.csv')

In [4]:
# Calculate basic statistics.
df.describe()

|       | y         |
|-------|-----------|
| count | 8.0       |
| mean  | 20.875    |
| std   | 27.010249 |
| min   | 0.0       |
| 25%   | 0.75      |
| 50%   | 7.5       |
| 75%   | 35.5      |
| max   | 63.0      |


In [5]:
def five_number_summary(data: pd.Series) -> pd.DataFrame:
    """
    Return the five-number summary for each interpolation option.

    Returns
    -------
    A DataFrame indexed by interpolation option, with columns:
    min = minimum value
    q1  = first quartile, quantile(0.25)
    q2  = median, quantile(0.50)
    q3  = third quartile, quantile(0.75)
    max = maximum value
    """
    interpolations = ('linear', 'lower', 'higher', 'nearest', 'midpoint')
    return pd.DataFrame(
        [(interpolation,
          data.min(),
          data.quantile(0.25, interpolation=interpolation),
          data.quantile(0.50, interpolation=interpolation),
          data.quantile(0.75, interpolation=interpolation),
          data.max())
         for interpolation in interpolations],
        columns=['interpolation', 'min', 'q1', 'q2', 'q3', 'max']
    ).set_index(['interpolation'])

In [6]:
results = five_number_summary(df['y'])
results

| interpolation | min | q1   | q2   | q3   | max |
|---------------|-----|------|------|------|-----|
| linear        | 0   | 0.75 | 7.5  | 35.5 | 63  |
| lower         | 0   | 0.0  | 2.0  | 27.0 | 63  |
| higher        | 0   | 1.0  | 13.0 | 61.0 | 63  |
| nearest       | 0   | 1.0  | 13.0 | 27.0 | 63  |
| midpoint      | 0   | 0.5  | 7.5  | 44.0 | 63  |
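The differences between the five rows can be traced to how each option treats the fractional position p(n - 1) in the sorted data. The following is a minimal sketch of that logic in plain Python, not the pandas implementation. The function name and the sample values are hypothetical; the sample was chosen to be consistent with the statistics shown in this notebook. Ties for 'nearest' are resolved here with Python's `round()`, which is an assumption of this sketch, and the tie rule in pandas may differ.

```python
import math


def quantile_by_option(values, p, option):
    """Quantile at probability p using one of five interpolation options."""
    x = sorted(values)
    position = p * (len(x) - 1)  # 0-based, possibly fractional
    i = math.floor(position)     # index of the observation below
    j = math.ceil(position)      # index of the observation above
    fraction = position - i
    if option == 'linear':
        return x[i] + fraction * (x[j] - x[i])
    if option == 'lower':
        return x[i]
    if option == 'higher':
        return x[j]
    if option == 'midpoint':
        return (x[i] + x[j]) / 2
    if option == 'nearest':
        # Assumption: round() sends exact halves to the even index.
        return x[round(position)]
    raise ValueError(f'unknown option: {option}')


# Hypothetical sample, consistent with the statistics shown in this notebook.
sample = [0, 0, 1, 2, 13, 27, 61, 63]
for option in ('linear', 'lower', 'higher', 'nearest', 'midpoint'):
    quartiles = [quantile_by_option(sample, p, option)
                 for p in (0.25, 0.50, 0.75)]
    print(option, quartiles)
```

For the third quartile, the position is 0.75 × 7 = 5.25, which lies between the sixth and seventh sorted values, 27 and 61. The 'lower' option returns 27, 'higher' returns 61, 'midpoint' returns 44, 'nearest' returns 27, and 'linear' returns 27 + 0.25 × 34 = 35.5.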


In [7]:
# Third quartile (q3) for the 'linear' interpolation option.
results.iloc[0, 3]

35.5

In [8]:
def six_number_summary(data: pd.Series) -> pd.DataFrame:
    """
    Return six statistics for each interpolation option.

    Returns
    -------
    A DataFrame indexed by interpolation option, with columns:
    min = minimum value
    q1  = first quartile, quantile(0.25)
    q2  = median, quantile(0.50)
    q3  = third quartile, quantile(0.75)
    max = maximum value
    iqr = interquartile range, q3 - q1
    """
    interpolations = ('linear', 'lower', 'higher', 'nearest', 'midpoint')
    return pd.DataFrame(
        [(interpolation,
          data.min(),
          data.quantile(0.25, interpolation=interpolation),
          data.quantile(0.50, interpolation=interpolation),
          data.quantile(0.75, interpolation=interpolation),
          data.max(),
          (data.quantile(0.75, interpolation=interpolation) -
           data.quantile(0.25, interpolation=interpolation)))
         for interpolation in interpolations],
        columns=['interpolation', 'min', 'q1', 'q2', 'q3', 'max', 'iqr']
    ).set_index(['interpolation'])

In [9]:
results = six_number_summary(df['y'])
results

| interpolation | min | q1   | q2   | q3   | max | iqr   |
|---------------|-----|------|------|------|-----|-------|
| linear        | 0   | 0.75 | 7.5  | 35.5 | 63  | 34.75 |
| lower         | 0   | 0.0  | 2.0  | 27.0 | 63  | 27.0  |
| higher        | 0   | 1.0  | 13.0 | 61.0 | 63  | 60.0  |
| nearest       | 0   | 1.0  | 13.0 | 27.0 | 63  | 26.0  |
| midpoint      | 0   | 0.5  | 7.5  | 44.0 | 63  | 43.5  |


In [10]:
end_time = dt.datetime.now()
(end_time - start_time).total_seconds()

0.516034

# Future work

- Add detail to explain each of the eleven methods for estimating quartiles.
- Determine how to calculate four additional methods for estimating quartiles. See the journal article by Hyndman and Fan (1996) in the references.
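
One possible starting point for this work, offered as a sketch rather than a settled approach, is the `method` parameter of `numpy.quantile`, which in NumPy 1.22 and later accepts names for the nine estimators described by Hyndman and Fan. The example assumes such a NumPy version is installed. The sample values are hypothetical, chosen to be consistent with the statistics shown in this notebook.

```python
import numpy as np

# Hypothetical sample, consistent with the statistics shown in this notebook.
sample = np.array([0, 0, 1, 2, 13, 27, 61, 63])

# NumPy's names for the nine Hyndman and Fan estimators, in order.
# Requires NumPy 1.22 or later.
hyndman_fan_methods = [
    'inverted_cdf',
    'averaged_inverted_cdf',
    'closest_observation',
    'interpolated_inverted_cdf',
    'hazen',
    'weibull',
    'linear',
    'median_unbiased',
    'normal_unbiased',
]

# Quartiles of the sample under each estimator.
quartiles_by_method = {
    method: np.quantile(sample, [0.25, 0.50, 0.75], method=method)
    for method in hyndman_fan_methods
}

for method, quartiles in quartiles_by_method.items():
    print(f'{method:26s}', quartiles)
```

The 'linear' entry corresponds to the pandas default used in this notebook, so its quartiles can serve as a check against the tables above.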

# References

[Five-number summary](https://en.wikipedia.org/wiki/Five-number_summary)

Hyndman, Rob J. and Yanan Fan. "Sample Quantiles in Statistical Packages." *The American Statistician* Vol. 50, No. 4 (Nov. 1996): 361-365. [JSTOR 2684934](http://www.jstor.org/stable/2684934).

[pandas](https://pandas.pydata.org/pandas-docs/stable/api.html)