![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=Mathematics/CentralTendency/central-tendency.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Data Analysis: Mean, Median, and Mode

Mean, median, and mode are known as measures of central tendency because they tell us where data is centered or clustered together. We will also talk about range, which tells us how spread out the data is.

If you want a good summary of this topic, check out [this website for Australian junior high students](https://www.mathsteacher.com.au/year8/ch17_stat/02_mean/mean.htm) or watch the video at the end!

## Definitions and Examples

### 1. Range

Given a set of values, the **range** is the difference between the highest value and the lowest value. This tells us if the values are spread apart or close together. 

##### Example
The time it took Avery to walk to school each day this week was 10 minutes, 15 minutes, 12 minutes, 20 minutes, and 8 minutes. What is the range of the times it takes to walk to school?

range = highest value - lowest value <br>
range = 20 minutes - 8 minutes <br>
range = 12 minutes <br>

Since a range of 12 minutes is large compared to the times themselves, we know that the time it takes Avery to walk to school can vary quite a bit from day to day.

Let's try that calculation in Python.

In [None]:
walking_times = [10, 15, 12, 20, 8]
minimum = min(walking_times)
maximum = max(walking_times)
range_of_times = maximum - minimum
print('The range is', range_of_times)

### 2. Mean

The **mean** (sometimes called an average) of a set of numbers is the *sum of all the values* divided by the *number of values*. 

##### Example

At Rose Middle School, there are 3 classes of Grade 8 students. The first one has 25 students, the second has 18 students, and the third has 20 students. What is the average number of students in a Grade 8 class?

Our set of data is $(25, 18, 20)$ and there are $3$ elements in this set.

$$\text{mean} = \frac{25 + 18 + 20}{3} = \frac{63}{3} = 21 $$

That means that Rose Middle School has an average of 21 students per class in Grade 8.

Using Python:

In [None]:
class_sizes = [25, 18, 20]
total = sum(class_sizes)
number_of_classes = len(class_sizes)
class_size_mean = total / number_of_classes
print('The mean is', class_size_mean)

We can also use the function `mean` from the `statistics` module:

In [None]:
class_sizes = [25, 18, 20]
from statistics import mean
mean(class_sizes)

### 3. Median

The **median** is the middle number in a sequence of numbers. To find the median, the set needs to be ordered from smallest to largest.

##### Example 1
Data set: $2, 6, 3, 7, 5, 3, 9$
<br> Sorted data set: $2, 3, 3, 5, 6, 7, 9$
<br> So the median will be $5$ because it is the 4th number counting in from either end.

In [None]:
dataset = [2, 6, 3, 7, 5, 3, 9]
number_of_values = len(dataset)
middle_index = (number_of_values // 2)    # the // means integer division
dataset.sort()                            # sort the list
median_value = dataset[middle_index]      # get the middle value from the sorted list
print(dataset)
print('The median value is', median_value)

We can also use `median` from `statistics`:

In [None]:
dataset = [2, 6, 3, 7, 5, 3, 9]
from statistics import median
median(dataset)

Did you notice that this data set has an odd number of values? If there is an even number of values, the median will be halfway between the two middle numbers.

##### Example 2
Data set: $2, 6, 3, 7, 5, 3, 9, 4$ <br>
**Notice how there are 8 values in this set**
<br> Sorted data set: $2, 3, 3, 4, 5, 6, 7, 9$
<br> The two middle numbers are $4$ and $5$
<br> Therefore, the median will be $\frac{(4+5)}{2} = 4.5$

In [None]:
dataset = [2, 6, 3, 7, 5, 3, 9, 4]
median(dataset)
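
The same rule can be written out by hand. The following is a minimal sketch, not how the `statistics` module is actually implemented; the function name `median_of` is made up for this illustration. It sorts a copy of the data, then either picks the single middle value (odd count) or averages the two middle values (even count).

```python
def median_of(values):
    """Return the median of a non-empty list of numbers (illustrative helper)."""
    ordered = sorted(values)      # sorted() returns a new list and leaves the original unchanged
    count = len(ordered)
    middle = count // 2           # index of the middle value (or the upper of the two middle values)
    if count % 2 == 1:
        # odd number of values: there is exactly one middle value
        return ordered[middle]
    # even number of values: take the value halfway between the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_median = median_of([2, 6, 3, 7, 5, 3, 9])
even_median = median_of([2, 6, 3, 7, 5, 3, 9, 4])
print('Median with an odd number of values:', odd_median)
print('Median with an even number of values:', even_median)
```

The results should match the two worked examples above.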

### 4. Mode

The **mode** of a data set is the element that occurs the most often. It's possible for a data set to have no modes, one mode, two modes (bimodal), or even three modes (trimodal).

To find the mode it can be helpful to sort the values so that it's easier to see when repeated values are next to each other.

##### Example 1
Data set: $2, 6, 3, 7, 5, 3, 9$
<br> Sorted data set: $2, 3, 3, 5, 6, 7, 9$
<br> Here, only the value $3$ repeats and all other values only appear once. 
<br>So, the mode is $3$.

In [None]:
dataset = [2, 6, 3, 7, 5, 3, 9]
dataset.sort()
print(dataset)

from statistics import mode
mode(dataset)

##### Example 2
Data set: $2, 6, 3, 7, 5, 3, 9, 5, 3, 5, 6$
<br> Sorted data set: $2, 3, 3, 3,  5, 5, 5,  6, 6, 7, 9$
<br> Values $3$ and $5$ have the same number of repetitions.
<br> Notice how $6$ is also repeated, but not as many times as $3$ or $5$, so it is not a mode.
<br>So, the modes are $3$ and $5$.

In [None]:
dataset = [2, 6, 3, 7, 5, 3, 9, 5, 3, 5, 6]
from statistics import multimode
multimode(dataset)
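
To see the counting behind this result, here is a small sketch that uses `Counter` from Python's `collections` module to tally how often each value appears. This is just one way to do the counting, not necessarily what `multimode` does internally, and the variable names are chosen for this illustration.

```python
from collections import Counter

dataset = [2, 6, 3, 7, 5, 3, 9, 5, 3, 5, 6]
counts = Counter(dataset)                  # maps each value to the number of times it appears
highest_count = max(counts.values())       # the largest number of repetitions

# keep every value that is repeated the largest number of times
modes = sorted(value for value, count in counts.items() if count == highest_count)

for value in sorted(counts):
    print(value, 'appears', counts[value], 'time(s)')
print('The modes are', modes)
```

The value $6$ shows up with a count of 2, which is why it is not one of the modes.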

##### Example 3
Data set: $2, 6, 3, 7, 5, 8, 9, 4$
<br> Sorted data set: $2, 3, 4, 5, 6, 7, 8, 9$
<br>There are no values that repeat, so there is no mode.

In [None]:
dataset = [2, 6, 3, 7, 5, 8, 9, 4]
dataset.sort()
print(dataset)
multimode(dataset)
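
Notice that `multimode` returns every value here, because each value is tied for "most frequent" with one appearance. If we would rather get an empty list when nothing repeats, one possible approach is sketched below. The function name `modes_or_none` is hypothetical, and the sketch assumes that "no mode" only means that no value appears more than once.

```python
from collections import Counter

def modes_or_none(values):
    """Return a sorted list of modes, or an empty list if no value repeats (illustrative helper)."""
    counts = Counter(values)
    highest_count = max(counts.values())
    if highest_count == 1:
        # every value appears exactly once, so we say there is no mode
        return []
    return sorted(value for value, count in counts.items() if count == highest_count)

no_repeats = modes_or_none([2, 6, 3, 7, 5, 8, 9, 4])
one_mode = modes_or_none([2, 6, 3, 7, 5, 3, 9])
print('No repeated values:', no_repeats)
print('One repeated value:', one_mode)
```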

## Example Using A Large Dataset

We can calculate the mean, median, and mode for both the *x* and *y* values in a large dataset, named "Datasaurus", that was created by [Alberto Cairo](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html) and made free for anyone to use.

We're going to use the [pandas](https://pandas.pydata.org) library to import the data and perform calculations.

In [None]:
import pandas as pd
import plotly.express as px
dataset = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/Mathematics/CentralTendency/data/datasaurus.csv')
px.scatter(dataset, x='x', y='y', title='Datasaurus', width=800, height=600)

### Range

In [None]:
max_x = max(dataset['x'])
min_x = min(dataset['x'])
max_y = max(dataset['y'])
min_y = min(dataset['y'])
range_x = max_x - min_x
range_y = max_y - min_y
print(f'The range of x is {range_x} (from {min_x} to {max_x})')
print(f'The range of y is {range_y} (from {min_y} to {max_y})')

### Mean

In [None]:
mean_x = dataset['x'].mean()
mean_y = dataset['y'].mean()
print('The mean x is', mean_x)
print('The mean y is', mean_y)

We can see that the mean of the *x* values is larger than 50, which suggests that the dots are weighted toward the **right** side of the graph. The mean of the *y* values is less than 50, which suggests that the dots are weighted toward the **bottom** of the graph.

We can see this as the T-Rex is on the right of the graph and it took lots of dots to make the details of the claws and bottom jaw compared to the top of the head.

### Median

In [None]:
median_x = dataset['x'].median()
median_y = dataset['y'].median()
print('The median x is', median_x)
print('The median y is', median_y)

We now know that the middle of the T-Rex, as measured by the medians, is at approximately $(53, 46)$.

### Mode

In [None]:
mode_x = dataset['x'].mode()
mode_y = dataset['y'].mode()
print('The mode x is')
print(mode_x)
print('')
print('The mode y is')
print(mode_y)

This shows us that there are three modes in the *x* values, and one mode in the *y* values, but those are not really useful values in this dataset.

To show that measures of central tendency don't always tell the full story, [Autodesk Research](https://www.autodeskresearch.com/publications/samestats) created the "Datasaurus Dozen", 12 graphs that have nearly the same *x* and *y* mean values as "Datasaurus" but look completely different.

![Datasaurus Dozen](images/DinoSequential.gif)
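
We can try the same idea on a much smaller scale. The two short lists below are invented for this illustration (they are not part of the Datasaurus data), and the helper name `summary` is also made up. The values are arranged very differently, yet the range, mean, and median come out the same.

```python
from statistics import mean, median

# two made-up data sets, chosen only to illustrate the idea
dataset_a = [1, 4, 5, 6, 9]   # values clustered near the middle
dataset_b = [1, 1, 5, 9, 9]   # values pushed out to the ends

def summary(values):
    """Collect the range, mean, and median of a list of numbers (illustrative helper)."""
    return {
        'range': max(values) - min(values),
        'mean': mean(values),
        'median': median(values),
    }

summary_a = summary(dataset_a)
summary_b = summary(dataset_b)
print('Data set A:', summary_a)
print('Data set B:', summary_b)
```

Looking at the values themselves, or plotting them, shows a difference that these summaries hide.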

## Conclusion

* Range is the highest value minus the lowest value
* Mean is the average and is found by adding all the values then dividing by the number of values
* Median is the middle value and is found by sorting the set and finding the middle
* Mode is the most frequent value and is found by counting how many times values are repeated
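
As a quick recap, the sketch below computes all four measures for the data set from the first median example. The variable names are chosen for this illustration.

```python
from statistics import mean, median, multimode

values = [2, 6, 3, 7, 5, 3, 9]

value_range = max(values) - min(values)   # highest value minus lowest value
value_mean = mean(values)                 # sum of the values divided by how many there are
value_median = median(values)             # middle value of the sorted data
value_modes = multimode(values)           # list of the most frequent value(s)

print('Range:', value_range)
print('Mean:', value_mean)
print('Median:', value_median)
print('Mode(s):', value_modes)
```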

While measures of central tendency are useful for many things, we should remember that they don't always tell the whole story of a dataset.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)