## Measures of Position
Measures of position help us understand the relative standing of a data point within a dataset. One of the most useful tools in this category is the **quantile**. Quantiles are cut points that divide a sorted dataset into groups containing equal numbers of data points. Some common types of quantiles include:
- **Median** divides the data into two equal parts
- **Quartiles** divide the data into four equal parts
- **Deciles** divide the data into ten equal parts
- **Percentiles** divide the data into 100 equal parts

These measures help us understand how individual values relate to the rest of the data. Instead of summarising with a single central value, they break the data into parts to show relative standing.

Let's represent our data on a number line. If we divide this line into 4 equal parts, we get quartiles.
<br>

$$
\underbrace{
|{-}{-}{-}{-}{-}|{-}{-}{-}{-}{-}|{-}{-}{-}{-}{-}|{-}{-}{-}{-}{-}|
}_{\text{Quartiles}}
$$

If we instead divide it into 10 equal parts, we get deciles.
<br>

$$
\underbrace{
|{-}{-}|{-}{-}|{-}{-}|{-}{-}|{-}{-}|{-}{-}|{-}{-}|{-}{-}|{-}{-}|{-}{-}|
}_{\text{Deciles}}
$$
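
As a rough illustration of the idea that $q$ equal parts need $q - 1$ cut points, the sketch below uses `statistics.quantiles` from the Python standard library (Python 3.8+). The `sample` list is made up for this example, and `method="inclusive"` is chosen here so that the minimum and maximum are treated as the 0th and 100th percentiles.

```python
import statistics

# Made-up sample for illustration: the integers 1 through 21.
sample = list(range(1, 22))

# n is the number of equal parts; the function returns the n - 1 cut points.
quartiles = statistics.quantiles(sample, n=4, method="inclusive")
deciles = statistics.quantiles(sample, n=10, method="inclusive")

print(len(quartiles), quartiles)  # 3 cut points for 4 parts
print(len(deciles), deciles)      # 9 cut points for 10 parts
```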

## The Median as a Quantile
The **median** is a specific case of a **quantile**. Specifically, it is the **0.5 quantile**. It divides the data into two equal parts: 50% of the values lie below it and 50% lie above it.

Consider the following number line to understand this intuition:

$$|{-}{-}{-}{-}{-}{-}{-}{-}{-}{-}|{-}{-}{-}{-}{-}{-}{-}{-}{-}{-}|$$
$$\text{Median}$$

If the dataset has an odd number of values, for example, $[12, 5, 7, 9, 10] \rightarrow \text{(sort)} \rightarrow [5, 7, 9, 10, 12]$, then we can pick the middle value, that is, $9$.

If the dataset has an even number of values, for example, $[12, 13, 5, 7, 9, 10] \rightarrow \text{(sort)} \rightarrow [5, 7, 9, 10, 12, 13]$, then we need to find a value that divides the data into two equal halves. Any number between $9$ and $10$ would meet this requirement, but the conventional choice is the arithmetic mean of the two middle values, that is, $\frac{9 + 10}{2} = 9.5$.
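
The two cases can be sketched in a few lines of Python. The function name `median` is chosen for this illustration only; it is a minimal sketch that assumes a non-empty list of numbers.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd length: the single middle value.
        return ordered[mid]
    # Even length: arithmetic mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([12, 5, 7, 9, 10]))      # odd-length example
print(median([12, 13, 5, 7, 9, 10]))  # even-length example
```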

## Tertiles
We can divide data into **three equal parts** using **tertiles**:
- 1st tertile ($T_1$): the $\frac{1}{3}$ quantile (≈ 0.33) → ~33% of the data lies below this point
- 2nd tertile ($T_2$): the $\frac{2}{3}$ quantile (≈ 0.67) → ~67% of the data lies below this point

Let's compute these tertiles manually using the example $[15, 18, 67, 100, 21, 50]$. Can this be divided into three equal parts?

$$
15 \quad\quad 18 \quad\quad 21 \quad\quad 50 \quad\quad 67 \quad\quad 100
$$
$$
|{-}{-}{-}{-}{-}{-}{-}|{-}{-}{-}{-}{-}{-}{-}|{-}{-}{-}{-}{-}{-}{-}|
$$
$$
\text{Min} \quad\quad\quad\quad\quad T_1 \quad\quad\quad\quad\quad T_2 \; \quad\quad\quad\quad \; \text{Max}
$$

Note that the diagram is not drawn to scale. Still, it is evident that there exist numbers $T_1$ and $T_2$ that divide the data into three equal parts. Our task is to find their values using an approach similar to the one used for the median.

We know that $T_1$ lies in $\left[18, 21\right]$ and $T_2$ lies in $\left[50, 67\right]$, but the gaps between consecutive data points are unequal, so we cannot simply cut the range of values into thirds. Because the data is sorted, we can work with positions instead. The indices of the data (mathematically $1, 2, 3, \dots$ and in Python $0, 1, 2, \dots$) are evenly spaced, so we first find the (possibly fractional) positions of $T_1$ and $T_2$, and then take a weighted arithmetic mean of the values at the two ends of the interval each position falls in. To calculate the positions, we use the number of gaps between the data points, which is one less than the number of data points.

So, $\text{Position}(T_1) = \frac{1}{3} \times \left[\text{number of data points} - 1\right] + 1 = \frac{1}{3} \times (6 - 1) + 1= \frac{5}{3} + 1\approx 2.67$.

Adjusting for Python indexing, $\text{Position}(T_1) = \frac{1}{3} \times \left[\text{length}(\text{list}) - 1\right] = \frac{1}{3} \times (6 - 1) = \frac{5}{3} \approx 1.67$.

Similarly, $\text{Position}(T_2) = \frac{2}{3} \times \left[\text{number of data points} - 1\right] + 1 = \frac{2}{3} \times (6 - 1) + 1 = \frac{10}{3} + 1 \approx 4.33$.

Adjusting for Python indexing, $\text{Position}(T_2) = \frac{2}{3} \times \left[\text{length}(\text{list}) - 1\right] = \frac{2}{3} \times (6 - 1) = \frac{10}{3} \approx 3.33$.
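
The position arithmetic can be checked with exact fractions. This is a small sketch; the helper name `positions` is hypothetical and introduced only for this check.

```python
from fractions import Fraction

def positions(k, q, n):
    """Return (one_based, zero_based) positions of the k-th of q quantiles for n sorted points."""
    zero_based = Fraction(k, q) * (n - 1)  # Python-style index
    return zero_based + 1, zero_based      # mathematical index is one higher

for k in (1, 2):
    one_based, zero_based = positions(k, 3, 6)
    print(f"T{k}: 1-based = {float(one_based):.2f}, 0-based = {float(zero_based):.2f}")
```

Using `Fraction` avoids rounding, so the printed two-decimal values come from the exact thirds rather than from an already rounded decimal.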

Essentially, what we need now is the value in the data space (after sorting) that corresponds to the indices that have been computed. A straightforward way to compute these values would be to take a weighted arithmetic mean of the values at the ends of the intervals of these quantiles.

For $T_1$, the computed index is $1.67$, so it lies between index $1$ (value $18$) and index $2$ (value $21$). Its value is the weighted average of $18$ and $21$ with weights $33\%$ and $67\%$ respectively (a $1:2$ ratio), which places $T_1$ about $67\%$ of the way along the line segment from $18$ to $21$.

So, $T_1 \approx 18 + 0.67 \times (21 - 18) \approx 18 + 0.67 \times 3 \approx 20$.

Similarly, $T_2$ lies about $33\%$ of the way (the fractional part of its index) along the line segment from $50$ to $67$.

So, $T_2 = 50 + \frac{1}{3} \times (67 - 50) = 50 + \frac{17}{3} \approx 55.67$.
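
To see that moving a fraction of the way along a segment is the same as taking a weighted mean of its ends, here is a minimal sketch using exact fractions ($\frac{2}{3}$ and $\frac{1}{3}$) instead of rounded decimals. The helper names `interpolate` and `weighted_mean` are hypothetical.

```python
from fractions import Fraction

def interpolate(low, high, fraction):
    """Value lying `fraction` of the way from `low` to `high`."""
    return low + fraction * (high - low)

def weighted_mean(low, high, fraction):
    """Same value as a weighted mean: weight (1 - fraction) on `low`, `fraction` on `high`."""
    return (1 - fraction) * low + fraction * high

t1 = interpolate(18, 21, Fraction(2, 3))
t2 = interpolate(50, 67, Fraction(1, 3))
print(float(t1), float(t2))
```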

In [1]:
data = [15, 18, 67, 100, 21, 50]

In [2]:
def compute_tertiles(data):
    sorted_data = sorted(data)
    n = len(sorted_data)

    # First tertile (1/3 quantile)
    p1 = (1/3) * (n - 1)
    lower1 = int(p1)
    upper1 = lower1 + 1
    fractional_part1 = p1 - lower1

    # Calculating the quantile value using weighted mean
    t1 = sorted_data[lower1] + fractional_part1 * (sorted_data[upper1] - sorted_data[lower1])

    # Second tertile (2/3 quantile)
    p2 = (2/3) * (n - 1)
    lower2 = int(p2)
    upper2 = lower2 + 1
    fractional_part2 = p2 - lower2
    t2 = sorted_data[lower2] + fractional_part2 * (sorted_data[upper2] - sorted_data[lower2])

    return t1, t2

In [3]:
t1, t2 = compute_tertiles(data)
print(f'For the data {data},')
print('T1:', t1)
print('T2:', t2)

For the data [15, 18, 67, 100, 21, 50],
T1: 20.0
T2: 55.666666666666664


## General Formula
For a dataset sorted in ascending order, $\{x_1, x_2, x_3, \dots, x_n\}$, divided into $q$ equal parts (where $q \geq 2$ is an integer), the position of the $k^{\text{th}}$ quantile, where $k \in \{1, 2, 3, \dots, q-1\}$, is $p = \frac{k}{q} \times (n - 1) + 1$, and its value is $Q_k = x_{\lfloor p \rfloor} + \{p\} \left(x_{\lceil p \rceil} - x_{\lfloor p \rfloor}\right)$, where $\{p\} = p - \lfloor p \rfloor$ denotes the fractional part of $p$. To adjust for Python indexing, we only need to reduce $p$ by $1$.
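
As a quick check of the formula, take the even-length example from the median section, sorted as $[5, 7, 9, 10, 12, 13]$, with $k = 1$, $q = 2$, and $n = 6$ (here $Q_1$ is the single cut point for $q = 2$, that is, the median):

$$
p = \frac{1}{2} \times (6 - 1) + 1 = 3.5, \qquad Q_1 = x_3 + 0.5 \times (x_4 - x_3) = 9 + 0.5 \times (10 - 9) = 9.5
$$

This agrees with the value $9.5$ obtained earlier by averaging the two middle values.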

In [4]:
def quantile(data, k, q):
    sorted_data = sorted(data)  # sorted data
    n = len(sorted_data)  # number of data points
    
    pos = (k / q) * (n - 1)  # position of kth quantile

    lower = int(pos)  # lower limit of interval
    upper = min(lower + 1, n - 1)  # upper limit of interval, clamped to the last index
    fraction = pos - lower  # fractional part
    
    return sorted_data[lower] + fraction * (sorted_data[upper] - sorted_data[lower]) # Linear weighted mean

Let's compute our tertiles again using this function:

In [5]:
print(f'For the data {data},')
print('T1:', quantile(data, 1, 3))
print('T2:', quantile(data, 2, 3))

For the data [15, 18, 67, 100, 21, 50],
T1: 20.0
T2: 55.666666666666664


Notice how $k$ and $q$ always appear as the ratio $k/q$ in our computations. This is why many libraries, such as NumPy and pandas, take this ratio directly as the main parameter of their quantile functions.

In [6]:
import numpy as np

In [8]:
np.quantile(data, 1/3)

20.0

In [9]:
np.quantile(data, 2/3)

55.666666666666664
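
As a small extension of the calls above, `np.quantile` also accepts a list of ratios and returns one value per ratio, and `np.percentile` takes the same ratio on a 0 to 100 scale. The sketch below re-declares the small example list so that it runs on its own.

```python
import numpy as np

data = [15, 18, 67, 100, 21, 50]

# Passing several ratios at once returns an array with one quantile per ratio.
tertiles = np.quantile(data, [1/3, 2/3])
print(tertiles)

# The 50th percentile is the 0.5 quantile, which is the median.
print(np.percentile(data, 50), np.median(data))
```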

## Working with a Dataset
Now that we have seen how to calculate tertiles for a small list, let's apply the same concept to a larger dataset. We will work with a CSV file named `'student_data.csv'`, which contains students' scores in math, reading, and writing. Our goal is to calculate the quartiles, the $9^{\text{th}}$ decile, and the $50^{\text{th}}$ and $95^{\text{th}}$ percentiles of the scores in each subject, starting with math.

In [10]:
# Open the CSV file, read it, and store it as a dictionary
file_path = 'student_data.csv'

with open(file_path, mode = 'r') as file:  # opening the file in reading mode
    lines = file.readlines()  # reading all the lines in the file as a list of strings

data = [line.strip().split(',') for line in lines]  # use the commas in the lines to split the values for each line

df = {key: [int(row[i]) for row in data[1:]] for i, key in enumerate(data[0])}  # building the dictionary
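
An alternative way to build the same column dictionary is the standard library's `csv` module, which handles the splitting for us. The sketch below is only an illustration: it reads from an in-memory string with three made-up rows instead of the real file, and it assumes every column holds integers.

```python
import csv
import io

# Made-up rows standing in for the file contents; not taken from the real dataset.
sample_csv = io.StringIO(
    "math score,reading score,writing score\n"
    "72,72,74\n"
    "69,90,88\n"
    "90,95,93\n"
)

reader = csv.DictReader(sample_csv)  # uses the first line as column names
columns = {name: [] for name in reader.fieldnames}
for row in reader:
    for name, value in row.items():
        columns[name].append(int(value))  # assumes integer scores

print(columns)
```

With a real file, `sample_csv` would be replaced by a handle from `open(file_path, newline='')`.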

In [11]:
print(df)

{'math score': [72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58, 40, 65, 78, 50, 69, 88, 18, 46, 54, 66, 65, 44, 69, 74, 73, 69, 67, 70, 62, 69, 63, 56, 40, 97, 81, 74, 50, 75, 57, 55, 58, 53, 59, 50, 65, 55, 66, 57, 82, 53, 77, 53, 88, 71, 33, 82, 52, 58, 0, 79, 39, 62, 69, 59, 67, 45, 60, 61, 39, 58, 63, 41, 61, 49, 44, 30, 80, 61, 62, 47, 49, 50, 72, 42, 73, 76, 71, 58, 73, 65, 27, 71, 43, 79, 78, 65, 63, 58, 65, 79, 68, 85, 60, 98, 58, 87, 66, 52, 70, 77, 62, 54, 51, 99, 84, 75, 78, 51, 55, 79, 91, 88, 63, 83, 87, 72, 65, 82, 51, 89, 53, 87, 75, 74, 58, 51, 70, 59, 71, 76, 59, 42, 57, 88, 22, 88, 73, 68, 100, 62, 77, 59, 54, 62, 70, 66, 60, 61, 66, 82, 75, 49, 52, 81, 96, 53, 58, 68, 67, 72, 94, 79, 63, 43, 81, 46, 71, 52, 97, 62, 46, 50, 65, 45, 65, 80, 62, 48, 77, 66, 76, 62, 77, 69, 61, 59, 55, 45, 78, 67, 65, 69, 57, 59, 74, 82, 81, 74, 58, 80, 35, 42, 60, 87, 84, 83, 34, 66, 61, 56, 87, 55, 86, 52, 45, 72, 57, 68, 88, 76, 46, 67, 92, 83, 80, 63, 64, 54, 84, 73, 80, 56, 59, 75, 85, 

In [12]:
# Calculate the quartiles
print('Q1 (math):', quantile(df['math score'], 1, 4))
print('Q2 (math):', quantile(df['math score'], 2, 4))
print('Q3 (math):', quantile(df['math score'], 3, 4))

# Calculate the 9th decile
print('D9 (math):', quantile(df['math score'], 9, 10))

# Calculate the 50th and 95th percentiles
print('P50 (math):', quantile(df['math score'], 50, 100))
print('P95 (math):', quantile(df['math score'], 95, 100))

Q1 (math): 57.0
Q2 (math): 66.0
Q3 (math): 77.0
D9 (math): 86.0
P50 (math): 66.0
P95 (math): 90.04999999999995


In [13]:
# Calculate the quartiles
print('Q1 (reading):', quantile(df['reading score'], 1, 4))
print('Q2 (reading):', quantile(df['reading score'], 2, 4))
print('Q3 (reading):', quantile(df['reading score'], 3, 4))

# Calculate the 9th decile
print('D9 (reading):', quantile(df['reading score'], 9, 10))

# Calculate the 50th and 95th percentiles
print('P50 (reading):', quantile(df['reading score'], 50, 100))
print('P95 (reading):', quantile(df['reading score'], 95, 100))

Q1 (reading): 59.0
Q2 (reading): 70.0
Q3 (reading): 79.0
D9 (reading): 87.10000000000002
P50 (reading): 70.0
P95 (reading): 92.0


In [14]:
# Calculate the quartiles
print('Q1 (writing):', quantile(df['writing score'], 1, 4))
print('Q2 (writing):', quantile(df['writing score'], 2, 4))
print('Q3 (writing):', quantile(df['writing score'], 3, 4))

# Calculate the 9th decile
print('D9 (writing):', quantile(df['writing score'], 9, 10))

# Calculate the 50th and 95th percentiles
print('P50 (writing):', quantile(df['writing score'], 50, 100))
print('P95 (writing):', quantile(df['writing score'], 95, 100))

Q1 (writing): 57.75
Q2 (writing): 69.0
Q3 (writing): 79.0
D9 (writing): 87.0
P50 (writing): 69.0
P95 (writing): 92.0


Notice that $Q_2$ and $P_{50}$ are the same (both equal the median), because $\frac{2}{4} = \frac{50}{100} = \frac{1}{2}$. It is the ratio $k/q$ that determines the quantile value.
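
This can be checked directly on the small six-value list from the tertile example. The sketch below repeats a version of the `quantile` function so that it runs on its own (with the upper index clamped to the last position), and compares several $(k, q)$ pairs that share the ratio $\frac{1}{2}$ against `statistics.median`.

```python
import statistics

def quantile(data, k, q):
    """k-th of q quantiles, by linear interpolation between sorted neighbours."""
    sorted_data = sorted(data)
    n = len(sorted_data)
    pos = (k / q) * (n - 1)        # zero-based fractional position
    lower = int(pos)
    upper = min(lower + 1, n - 1)  # clamp to the last index
    fraction = pos - lower
    return sorted_data[lower] + fraction * (sorted_data[upper] - sorted_data[lower])

data = [15, 18, 67, 100, 21, 50]

# Different (k, q) pairs, all with k / q == 0.5.
results = [quantile(data, k, q) for k, q in [(1, 2), (2, 4), (5, 10), (50, 100)]]
print(results, statistics.median(data))
```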