# Probability & Statistics Series (Chapter 2)

## Learning and understanding statistics with Python

----

### Summary Statistics: Measures of Central Tendency

In this section we will discuss three statistical functions that are used to estimate the centre of a dataset and hence summarise it. These are the `mean`, `median` and the `mode`. We will implement them in the Python programming language and use them to solve problems in statistical data analysis.

**The mean**

The mean value of a dataset is defined as the sum of all the data values divided by the size (number of data values) of the dataset.

Let $\bar x$ be the mean of a dataset $X$ of size $N$. Then $\bar x = \frac{\sum{X}}{N}$.

*Example 1*

Let $X = [2, 4, 6, 9, 3, 2, 9, 1]$. Then the mean $\bar x = \frac{2 + 4 + 6 + 9 + 3 + 2 + 9 + 1}{8} = \frac{36}{8} = 4.5$.

In Python we can define a function `mean_func()` to calculate the mean of a dataset as follows:

In [1]:
def mean_func(dataset):
    # initialise a variable to store the sum of the values in the dataset
    sum_value = 0
    
    # loop through all the values in the dataset, adding each value to the variable sum_value.
    for value in dataset:
        sum_value += value
        
    # define a variable mean_value to store the computed mean of the numbers: 
    # sum_of_the_values / number_of_values_in_the_dataset
    mean_value = sum_value / len(dataset)
    
    # return the mean value
    return mean_value

Let us test our function on the dataset from `Example 1` above.

In [2]:
# Define X as a list of numbers as shown in Example 1 above.
X = [2, 4, 6, 9, 3, 2, 9, 1]

# Create a variable mean_of_X to store the value returned by our mean_func() function.
mean_of_X = mean_func(X)

# print out the value of mean_of_X.
print(mean_of_X)

4.5
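As a cross-check, the same mean can be obtained with the built-in `sum()` and `len()`, or with `statistics.mean` from the standard library. The sketch below assumes a non-empty list of numbers; the name `mean_builtin` is introduced here only for illustration and is not part of the original notebook.

```python
import statistics

def mean_builtin(dataset):
    """Illustrative helper: mean via the built-ins sum() and len()."""
    if len(dataset) == 0:
        # the mean is undefined for an empty dataset (division by zero)
        raise ValueError("mean of an empty dataset is undefined")
    return sum(dataset) / len(dataset)

# Dataset from Example 1
X = [2, 4, 6, 9, 3, 2, 9, 1]

print(mean_builtin(X))      # 4.5
print(statistics.mean(X))   # 4.5
```

Writing the loop by hand, as above, is useful for learning; in practice the library function is usually the safer choice.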


Let us try out a few more examples here before we move on.

*Example 2*

Find the mean of the dataset $X_1 = [5, 6, 2, 0, 5, 8, 9, 10, 4]$

_Answer_

$\bar x_1 = \frac{\sum{X_1}}{9} = \frac{49}{9} \approx 5.4$

*Example 3*

Find the mean of the dataset $X_2 = [23, 14, 6, 19, 50, 42, 16, 7, 15, 22, 8]$

_Answer_

$\bar x_2 = \frac{\sum{X_2}}{11} = \frac{222}{11} \approx 20.2$

Using our Python mean function, we have the following:

In [3]:
# Dataset for Example 2
X1 = [5, 6, 2, 0, 5, 8, 9, 10, 4]

# Dataset for Example 3
X2 = [23, 14, 6, 19, 50, 42, 16, 7, 15, 22, 8]

# Compute the mean for dataset X1
Ex2Answer = mean_func(X1)

# Compute the mean for dataset X2
Ex3Answer = mean_func(X2)

print(Ex2Answer)
print(Ex3Answer)

5.444444444444445
20.181818181818183


**The median**

The `median` of a dataset is the middle (positional) value of an ordered (sorted) dataset.

*Example 4*

Consider the datasets 

$X = [5, 6, 2, 0, 5, 8, 9, 10, 4]$

and 

$Y = [23, 14, 6, 19, 50, 42, 16, 7, 22, 8]$

The median of X is the middle value of X sorted in ascending order:

Sorted $X = [0, 2, 4, 5, 5, 6, 8, 9, 10]$

median = 5 (5 divides the entire dataset into two halves of equal lengths)

The median of Y is the middle value of Y sorted in ascending order:

Sorted $Y = [6, 7, 8, 14, 16, 19, 22, 23, 42, 50]$

median = 17.5 (the value right in between 16 and 19)

In Python we can write our own `median_func()` that accepts a dataset and returns the median value. There are two cases to consider: (1) a dataset with an odd number of values, where the median is the single middle value, and (2) a dataset with an even number of values, where the median is the average of the two middle values. We will use an `if-else` statement to decide on the action to take in each case.

In [4]:
def median_func(dataset):
    # compute the size of the dataset
    dataset_size = len(dataset)
    
    # sort dataset
    dataset = sorted(dataset)
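    # 0-based indices of the two middle positions: for an even-sized dataset the
    # median is the average of both; for an odd-sized dataset `right` is the middle
    left = dataset_size // 2 - 1
    right = left + 1
    
    # check case: (1) or (2)
    if dataset_size % 2 == 1: # case (1): odd size, return the single middle value
        median_value = dataset[right]
    else: # case (2): even size, average the two middle values
        median_value = (dataset[left] + dataset[right]) / 2
    return median_value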

Let us now test our `median` function on the datasets $X$ and $Y$ (stored below as `X3` and `X4`).

In [5]:
X3 = [5,6,2,0,5,8,9,10,4]
X4 = [23,14,6,19,50,42,16,7,22,8]

median_x3 = median_func(X3)
median_x4 = median_func(X4)

print(median_x3)
print(median_x4)

5
17.5
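The two cases can also be indexed from a single midpoint, `mid = n // 2`. The sketch below is an alternative formulation under the assumption of a non-empty dataset; the name `median_sketch` is hypothetical. It is checked against `statistics.median` from the standard library, including on a small odd-sized list with distinct values, which is a stricter test of the indexing than a list whose middle values repeat.

```python
import statistics

def median_sketch(dataset):
    """Illustrative median: sort, then pick or average the middle value(s)."""
    ordered = sorted(dataset)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd size: exactly one middle value
        return ordered[mid]
    # even size: average the two values either side of the midpoint
    return (ordered[mid - 1] + ordered[mid]) / 2

datasets = [
    [5, 6, 2, 0, 5, 8, 9, 10, 4],            # X from Example 4
    [23, 14, 6, 19, 50, 42, 16, 7, 22, 8],   # Y from Example 4
    [1, 2, 3],                               # odd size, distinct values
]

for data in datasets:
    print(median_sketch(data), statistics.median(data))
# 5 5
# 17.5 17.5
# 2 2
```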


**The mode**

The `mode` of a dataset is the data value that occurs most frequently in the dataset. For example, consider the dataset $X = [5, 6, 2, 0, 5, 8, 9, 10, 4]$. In $X$, the value 5 occurs twice while every other value occurs only once, so the `mode` of $X$ is 5. Let us write a Python function to compute the mode of any dataset. The strategy is to build a frequency distribution of the values in the dataset and then return the value with the highest frequency. A Python dictionary is a convenient data structure for this. Note that if several values share the highest frequency, the function below returns only the first one it encounters.

In [6]:
def mode_func(dataset):
    
    # declare an empty dictionary
    freq_dist = {}
    
    # build the frequency distribution
    for data in dataset:
        if data in freq_dist:
            freq_dist[data] += 1
        else:
            freq_dist[data] = 1
    
    # initialise variables to store the highest frequency found and the data value it corresponds to.
    largest_value = 0
    mode_value = None
    
    # loop through the frequency distribution to find the highest frequency data value.
    for data, value in freq_dist.items():
        if value > largest_value:
            largest_value = value
            mode_value = data
            
    return mode_value

Let us now test our mode function on the dataset $X3$.

In [7]:
mode_x3 = mode_func(X3)
print(mode_x3)

5
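The frequency-distribution idea can also be sketched with `collections.Counter`, and ties can be made visible with `statistics.multimode` (available from Python 3.8). This is an illustrative cross-check using standard-library tools, not a replacement for the function above.

```python
from collections import Counter
import statistics

X3 = [5, 6, 2, 0, 5, 8, 9, 10, 4]
X4 = [23, 14, 6, 19, 50, 42, 16, 7, 22, 8]

# Counter builds the frequency distribution in one step
counts = Counter(X3)
print(counts.most_common(1))       # [(5, 2)] -> value 5 occurs twice

# multimode returns every value that shares the highest frequency
print(statistics.multimode(X3))    # [5]
print(statistics.multimode(X4))    # all ten values: each occurs exactly once
```

When every value occurs once, as in `X4`, there is no single most frequent value, so a function that returns one value is reporting an arbitrary choice among the ties.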


To summarise, the `mean`, `median` and `mode` are summary statistics used to summarise data. They are commonly referred to as `measures of central tendency` because they give a measure or estimate of the central value of the dataset. In some cases they are equal, but mostly they will differ. Depending on the nature of the dataset, one measure of central tendency may be more suitable than another. We will explore this further later.
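To see why the choice of measure can matter, here is a minimal sketch that appends one extreme value to `X3`. The value 100 is hypothetical and is chosen only for illustration.

```python
import statistics

X3 = [5, 6, 2, 0, 5, 8, 9, 10, 4]
X3_with_outlier = X3 + [100]   # hypothetical extreme value, for illustration only

# without the extreme value: mean and median are close together
print(statistics.mean(X3), statistics.median(X3))

# with the extreme value: the mean jumps, the median barely moves
print(statistics.mean(X3_with_outlier))     # 14.9
print(statistics.median(X3_with_outlier))   # 5.5
```

A single extreme value shifts the mean considerably, while the median changes only from 5 to 5.5. This is one reason the median is often preferred for data with outliers.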

**Comparing measures of central tendency**

Let us compare the mean, median and mode of the datasets $X3$ and $X4$.

In [8]:
# Dataset X3
#X3 = [5,6,2,0,5,8,9,10,4]
#X4 = [23,14,6,19,50,42,16,7,22,8]
meanX3 = mean_func(X3) # 5.4
medianX3 = median_func(X3) # 5
modeX3 = mode_func(X3) # 5

# Dataset X4
meanX4 = mean_func(X4) # 20.7
medianX4 = median_func(X4) # 17.5
modeX4 = mode_func(X4) # 23 (every value in X4 occurs once; mode_func returns the first one)

Now let us print out statements that communicate the comparisons for each dataset:

In [9]:
print(f"The mean of the dataset X3 is {meanX3}, the median is {medianX3} and the mode is {modeX3}")
print(f"The mean of the dataset X4 is {meanX4}, the median is {medianX4} and the mode is {modeX4}")

The mean of the dataset X3 is 5.444444444444445, the median is 5 and the mode is 5
The mean of the dataset X4 is 20.7, the median is 17.5 and the mode is 23


Notice that for $X3$ the measures are approximately the same, but for $X4$ the values are quite different. (Every value in $X4$ occurs exactly once, so the dataset has no meaningful mode; the 23 reported above is simply the first value `mode_func` encountered.) The dataset $X3$ is closer to being `normally distributed` (see more about this later), meaning that the data values are spread approximately evenly around the mean. For $X4$, the mean lies to the right of the median because a few large values (42 and 50) pull it upwards, so we say the data is `skewed` (more about this later) to the right.
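One rough way to put a number on this comparison is Pearson's second skewness coefficient, $3(\text{mean} - \text{median}) / \text{standard deviation}$. This coefficient is not introduced in the text above, and with datasets this small it is only a crude indicator, so treat the sketch below as an illustration. The helper name `median_skewness` is hypothetical.

```python
import statistics

def median_skewness(dataset):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    mean = statistics.mean(dataset)
    median = statistics.median(dataset)
    spread = statistics.stdev(dataset)   # sample standard deviation
    return 3 * (mean - median) / spread

X3 = [5, 6, 2, 0, 5, 8, 9, 10, 4]
X4 = [23, 14, 6, 19, 50, 42, 16, 7, 22, 8]

# a positive value means the mean lies above the median (right skew)
print(median_skewness(X3))
print(median_skewness(X4))
```

Both values come out positive, and the one for `X4` is the larger of the two, which agrees with the mean of `X4` sitting further above its median.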

### Measures of Dispersion and Spread

Once we have a dataset, we can organise it into `frequency tables` and calculate averages such as the `mean`, `median` and `mode`. These averages let us conveniently communicate a summary of the dataset, but the summary is fairly general: it gives us an idea of the centre of the dataset but says nothing about how the other data points relate to this central value. To address this, we now introduce the `range`, which measures how widely the data is spread from the smallest to the largest value, and the `variance` and `standard deviation`, which measure the variability (average separation) of the data points from the mean.

**Range**

The `range` of a dataset is defined as the difference between the largest data point and the smallest data point in the dataset. It measures the interval over which the dataset spans.

$$\mathrm{Range}(X) = |\max(X) - \min(X)|$$

**Python Implementation**

In [10]:
# Dataset X
X = [3.29, 3.59, 3.79, 3.75, 3.99]

def data_range(X):
    """
    This function takes a list of values as argument and returns the difference between
    the largest and smallest value.
    """
    return abs(max(X) - min(X))

data_range(X)

0.7000000000000002
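The trailing digits in the result above come from binary floating-point arithmetic, not from the data. A minimal sketch of two common ways to handle this, rounding for display and `math.isclose` for comparison:

```python
import math

X = [3.29, 3.59, 3.79, 3.75, 3.99]

# max(X) is never smaller than min(X), so the difference is already non-negative
spread = max(X) - min(X)

print(round(spread, 2))             # 0.7
print(math.isclose(spread, 0.7))    # True
```

Comparing floats with `==` can fail on results like this, so a tolerance-based comparison is usually the safer choice.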

**Variance**

We often want to know how `close` the data points in the dataset are to the calculated `mean`. The variance is the average squared distance of the data points from the calculated `mean`. If this average is small, we can conclude that the calculated `mean` is quite representative of the dataset and that the variability of the data points is low. If it is large, the data points are quite spread out, with large variability.

The sample variance, $S^2$, of $n$ data points $x_1, \dots, x_n$ is calculated by:

$$S^2 = \frac{\sum_{k=1}^{n}{(x_k - \bar{x})^2}}{n - 1}$$

**Variance Algorithm**

INPUT: 
   - $X$            -- dataset
   - $\bar{x}$      -- calculated mean

OUTPUT:
   - $varX$         -- calculated variance
   
ALGORITHM:
      
   1. For each data point $x_i$, calculate the squared difference $\delta{x_i}^2 = (x_i - \bar{x})^2$
   
   2. Set $varX = \frac{\sum_{i = 1}^{n}{\delta{x_i}^2}}{n - 1}$
   
   3. Return $varX$

**Python Implementation**

In [11]:
def data_variance(dataset):
    """
    This function takes a dataset and calculates and returns the variance.
    """
    
    # find the sum of the values in the dataset
    sum_x = 0
    for x in dataset:
        sum_x += x
    
    # compute the mean of the dataset using the sum of the dataset
    mean_x = sum_x / (len(dataset))
    
    # compute the sum of square differences
    delta_x2 = 0
    for x in dataset:
        delta_x2 += (x - mean_x)**2
    
    # compute the variance
    varX = delta_x2 / (len(dataset)-1)
    
    return varX

**Testing the `data_variance` function**

In [12]:
# compute the variance of the dataset X (the notebook displays the value of the last expression)
data_variance(X)

0.06832000000000003
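The result can be cross-checked against the standard library. The sketch below also contrasts the sample variance (divide by $n - 1$, as in the formula above) with the population variance (divide by $n$), which `statistics` exposes as `pvariance`. Exact trailing digits may differ slightly from the hand-written function because of floating-point rounding.

```python
import math
import statistics

X = [3.29, 3.59, 3.79, 3.75, 3.99]
n = len(X)

sample_var = statistics.variance(X)        # divides by n - 1
population_var = statistics.pvariance(X)   # divides by n

print(sample_var)        # approximately 0.06832
print(population_var)    # approximately 0.054656

# the two differ only by the factor n / (n - 1)
print(math.isclose(sample_var, population_var * n / (n - 1)))   # True
```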

**Standard Deviation**

The `standard deviation` is defined as the square root of the `variance`. Both measure the variability of the data points about the mean. However, because the variance is an average of squared distances, it is expressed in squared units of the data. Taking the square root returns the measure to the same units as the data, which is why the standard deviation is usually the more meaningful value for the problem being solved.

$$std(X) = \sqrt{varX} = \sqrt{\frac{\sum_{k=1}^{n}{(x_k - \bar{x})^2}}{n - 1}}$$

**Python Implementation**

In [13]:
# We will implement the standard deviation with a class-based decorator that wraps a variance function.
class root_mean:
    """
    This class decorator performs a square root operation on the result of a function and returns this value. 
    """
    def __init__(self, func):
        self.func = func
        
    def __call__(self, x):
        from math import sqrt
        ret = sqrt(self.func(x))
        return ret

@root_mean
def data_std(dataset):
    """
    The body of this function computes the sample variance of a dataset; the
    @root_mean decorator then takes the square root, so calling data_std()
    returns the standard deviation.
    """
    
    # find the sum of the values in the dataset
    sum_x = 0
    for x in dataset:
        sum_x += x
    
    # compute the mean of the dataset using the sum of the dataset
    mean_x = sum_x / (len(dataset))
    
    # compute the sum of square differences
    delta_x2 = 0
    for x in dataset:
        delta_x2 += (x - mean_x)**2
    
    # compute the variance
    varX = delta_x2 / (len(dataset)-1)
    
    return varX

data_std(X)

0.2613809480432727
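A decorator is one way to obtain the standard deviation. A plain function that takes the square root directly is a simpler alternative, sketched below under the assumption of at least two data points. The name `sample_std` is hypothetical, and `statistics.stdev` from the standard library serves as a cross-check.

```python
import math
import statistics

def sample_std(dataset):
    """Illustrative sample standard deviation (n - 1 in the denominator)."""
    n = len(dataset)
    mean_x = sum(dataset) / n
    # sum of squared differences from the mean
    squared_diffs = sum((x - mean_x) ** 2 for x in dataset)
    return math.sqrt(squared_diffs / (n - 1))

X = [3.29, 3.59, 3.79, 3.75, 3.99]

print(sample_std(X))          # approximately 0.2614
print(statistics.stdev(X))    # approximately 0.2614
```

Because the standard deviation is in the same units as the data, a value of about 0.26 here can be compared directly with the range of about 0.7 computed earlier.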