<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/DataScience_06_DescriptiveStatistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Measures of Central Tendency: How Can We Find the Center of Our Data?

In data science, understanding the "typical" value of a dataset is crucial for making informed decisions and drawing meaningful insights. One of the most common measures of central tendency is the **mean**, also known as the **average**. The mean provides a single value that represents the center of a dataset, making it useful for summarizing large amounts of information.

### The Mean

The **mean** is calculated by summing all values in a dataset and dividing by the number of values. It's particularly useful when dealing with continuous data and is widely used across various fields, including medicine, finance, and social sciences.

Formula:

$$
\text{mean} = \frac{\text{sum of all values}}{\text{number of values}}
$$

In mathematical notation:

$$
\bar{x} = \frac{\sum x}{n}
$$

Where $\bar{x}$ (read as "x bar") is the mean, $\sum x$ is the sum of all values, and $n$ is the number of values.

Let's consider an example using patient temperatures in a hospital ward:

In [3]:
import pandas as pd
import numpy as np

# Create a sample dataset of patient temperatures
data = {
    'Patient ID': range(1, 11),
    'Temperature (°C)': [36.8, 37.2, 36.9, 37.4, 36.7, 37.1, 37.3, 36.5, 37.0, 37.2]
}

df = pd.DataFrame(data)
df

   Patient ID  Temperature (°C)
0           1              36.8
1           2              37.2
2           3              36.9
3           4              37.4
4           5              36.7
5           6              37.1
6           7              37.3
7           8              36.5
8           9              37.0
9          10              37.2


We can calculate the mean as follows:

In [4]:
# Calculate the mean temperature
mean_temp = df['Temperature (°C)'].mean()
print(f"The mean patient temperature is: {mean_temp:.1f}°C")

The mean patient temperature is: 37.0°C


In this example, the mean temperature of 37.0°C provides a quick summary of the "typical" patient temperature in the ward. This demonstrates how the mean can efficiently summarize a dataset with a single value.
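
As a quick check on the formula, the same mean can be computed by hand as a sum divided by a count. This is a minimal sketch in plain Python; the variable names are illustrative and not part of the original notebook.

```python
# Patient temperatures from the example above (°C)
temperatures = [36.8, 37.2, 36.9, 37.4, 36.7, 37.1, 37.3, 36.5, 37.0, 37.2]

total = sum(temperatures)   # sum of all values
n = len(temperatures)       # number of values
manual_mean = total / n     # x-bar = (sum of x) / n

print(f"Sum = {total:.1f}, n = {n}, mean = {manual_mean:.2f}°C")
```

The unrounded result is about 37.01°C, which rounds to the 37.0°C reported by `df['Temperature (°C)'].mean()`.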

### Properties and Considerations of the Mean

1. *Influence of all data points*--The mean takes into account every value in the dataset, which can be both an advantage and a disadvantage.
2. *Sensitivity to outliers*--The mean can be significantly affected by extreme values (**outliers**). For instance, if one patient had a very high fever of 40°C, it would pull the mean higher, potentially giving a misleading representation of the typical temperature in the ward.
3. *Suitability for further statistical analysis*--Many statistical methods and models use the mean in their calculations, making it a fundamental concept in data science.
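
To make the outlier point concrete, the sketch below adds one hypothetical patient with a 40°C fever to the ten temperatures from the earlier example and recomputes the mean. The extra reading is an assumption made purely for illustration, and the standard-library `statistics` module stands in for pandas.

```python
import statistics

# Patient temperatures from the earlier example (°C)
temperatures = [36.8, 37.2, 36.9, 37.4, 36.7, 37.1, 37.3, 36.5, 37.0, 37.2]

# Hypothetical: one additional patient with a high fever
with_fever = temperatures + [40.0]

mean_before = statistics.mean(temperatures)
mean_after = statistics.mean(with_fever)

print(f"Mean without the fever patient: {mean_before:.2f}°C")
print(f"Mean with the fever patient:    {mean_after:.2f}°C")
```

A single extreme reading moves the mean from about 37.01°C to about 37.28°C, which is higher than eight of the ten original readings.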

### When to Use the Mean

The mean is most appropriate when:

1. The data is continuous (like temperature, income, or test scores)
2. The distribution is roughly symmetric (not heavily skewed)
3. Outliers are not a significant concern or have been addressed

In various data science applications, the mean is commonly used for:

- Summarizing central tendencies in large datasets
- Comparing different groups or populations
- Tracking changes over time (e.g., mean daily stock prices)
- Input for more complex statistical analyses

### The Median
While the mean is a common measure of central tendency, it can be sensitive to extreme values. This is where the **median** comes in. The median is another measure of central tendency that is particularly useful when dealing with skewed data or datasets with outliers. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

To find the median:

1. Sort the data in ascending order
2. If $n$ is odd, the median is the $\frac{n+1}{2}$th term
3. If $n$ is even, the median is the average of the $\frac{n}{2}$th and $\frac{n}{2}+1$th terms

In mathematical notation:

For odd $n$:
$$
\text{Median} = x_{\frac{n+1}{2}}
$$

For even $n$:
$$
\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}
$$

Where $x_i$ represents the $i$th value in the sorted dataset.
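
The procedure above can be sketched as a small function. This is an illustrative implementation rather than what pandas does internally. The formulas count terms from 1, while Python lists are indexed from 0, so the positions shift by one. The input numbers are hypothetical.

```python
def median(values):
    """Return the median by sorting and picking the middle term(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th term, which is index n // 2 when counting from 0
        return ordered[mid]
    # Even n: average of the (n/2)-th and (n/2 + 1)-th terms
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = median([7, 1, 5])      # sorted: [1, 5, 7]
even_example = median([7, 1, 5, 3])  # sorted: [1, 3, 5, 7]
print(odd_example, even_example)
```

With three values the function returns the middle one (5); with four it returns the average of the two middle values (4.0).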

Let's consider an example using patient waiting times in an emergency room:


In [10]:
import pandas as pd
import numpy as np

# Create a sample dataset of patient waiting times
data = {
    'Patient ID': range(1, 11),
    'Wait Time (minutes)': [10, 15, 20, 25, 30, 35, 40, 45, 50, 200]
}

df = pd.DataFrame(data)

# Calculate the median wait time
median_wait = df['Wait Time (minutes)'].median()

# Calculate the mean for comparison
mean_wait = df['Wait Time (minutes)'].mean()

print(f"Patient wait time data:\n{df}\n")
print(f"The median wait time is: {median_wait:.1f} minutes")
print(f"The mean wait time is: {mean_wait:.1f} minutes")

Patient wait time data:
   Patient ID  Wait Time (minutes)
0           1                   10
1           2                   15
2           3                   20
3           4                   25
4           5                   30
5           6                   35
6           7                   40
7           8                   45
8           9                   50
9          10                  200

The median wait time is: 32.5 minutes
The mean wait time is: 47.0 minutes


In this example, the median wait time (32.5 minutes) provides a more representative measure of the typical wait time than the mean (47.0 minutes), which is pulled higher by the outlier of 200 minutes.

### Properties and Considerations of the Median

1.  *Robustness to outliers*--The median is largely unaffected by extreme values, because it depends only on the middle of the sorted data. This makes it a robust measure of central tendency.
2.  *Suitability for ordinal data*--The median can be used with ordinal data (where order matters but the interval between values may not be consistent), unlike the mean.
3.  *Less affected by skewed distributions*--For skewed data, the median often provides a better representation of the "typical" value than the mean.
4.  *Limited use in further statistical analyses*--Unlike the mean, the median is not commonly used in more advanced statistical calculations.
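
A short experiment illustrates the robustness property. The sketch below takes the wait times from the example and makes the outlier ten times larger (a hypothetical change), then compares how the median and mean respond. It uses the standard-library `statistics` module instead of pandas.

```python
import statistics

# Wait times from the example above (minutes)
wait_times = [10, 15, 20, 25, 30, 35, 40, 45, 50, 200]

# Hypothetical: the 200-minute wait becomes 2000 minutes
more_extreme = wait_times[:-1] + [2000]

for label, values in [("original", wait_times), ("more extreme", more_extreme)]:
    med = statistics.median(values)
    avg = statistics.mean(values)
    print(f"{label:>12}: median = {med:.1f}, mean = {avg:.1f}")
```

Both versions report a median of 32.5 minutes, while the mean jumps from 47.0 to 227.0 minutes.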

### When to Use the Median

The median is most appropriate when:

1.  The data is skewed or contains outliers
2.  Dealing with ordinal data
3.  The "middle" value is more informative than the arithmetic average
4.  Working with datasets related to income, housing prices, or other fields where extreme values are common

In various data science applications, the median is commonly used for:

-   Reporting central tendencies in skewed datasets
-   Describing income distributions (e.g., median household income)
-   Analyzing response times or durations
-   Real estate valuations (e.g., median home prices)

### The Mode: What's the Most Common Value in Our Dataset?

While the mean and median provide valuable insights into the center of a dataset, sometimes we're interested in knowing the most frequently occurring value. This is where the mode comes in. The mode is particularly useful for categorical data and discrete numerical data.

The **mode** is the value that appears most frequently in a dataset. Unlike the mean and median, the mode need not be unique: a dataset has more than one mode if several values share the same highest frequency.

Mathematically, the mode is defined as:

$$
\text{Mode} = \text{argmax}_x f(x)
$$

Where $f(x)$ is the frequency function, and $\text{argmax}$ returns the value of $x$ that maximizes $f(x)$.
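
The argmax definition translates almost directly into code. In this minimal sketch, `collections.Counter` plays the role of the frequency function $f(x)$ and `max` performs the argmax. The data is hypothetical.

```python
from collections import Counter

# Hypothetical categorical observations
values = ["B", "A", "C", "A", "B", "A"]

freq = Counter(values)                # f(x): how often each value occurs
mode_value = max(freq, key=freq.get)  # argmax over x of f(x)

print(dict(freq))
print(f"Mode: {mode_value}")
```

Note that `max` returns only one value. If several values were tied for the highest count, this sketch would report just the first one it encountered.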

Let's consider an example using blood types in a hospital:

In [11]:
# Create a sample dataset of blood types
data = {
    'Patient ID': range(1, 21),
    'Blood Type': ['A', 'B', 'O', 'A', 'AB', 'O', 'A', 'O', 'B', 'A',
                   'O', 'A', 'B', 'O', 'A', 'O', 'A', 'AB', 'O', 'A']
}

df = pd.DataFrame(data)

# Calculate the mode
mode_blood_type = df['Blood Type'].mode()

# Calculate frequencies
blood_type_freq = df['Blood Type'].value_counts()

print(f"Patient blood type data:\n{df}\n")
print(f"The mode of blood types is: {mode_blood_type[0]}")
print(f"\nBlood type frequencies:\n{blood_type_freq}")

Patient blood type data:
    Patient ID Blood Type
0            1          A
1            2          B
2            3          O
3            4          A
4            5         AB
5            6          O
6            7          A
7            8          O
8            9          B
9           10          A
10          11          O
11          12          A
12          13          B
13          14          O
14          15          A
15          16          O
16          17          A
17          18         AB
18          19          O
19          20          A

The mode of blood types is: A

Blood type frequencies:
Blood Type
A     8
O     7
B     3
AB    2
Name: count, dtype: int64


In this example, blood type A is the mode, occurring most frequently in the dataset.

### Properties and Considerations of the Mode

1.  *Applicability to categorical data*--The mode can be used with nominal (categorical) data, unlike the mean and median.
2.  *Multiple modes possible*--A dataset can have more than one mode (bimodal, trimodal, etc.) if multiple values occur with the same highest frequency.
3.  *Insensitivity to extreme values*--Like the median, the mode is not affected by outliers.
4.  *Limited use in further statistical analyses*--The mode is not commonly used in more advanced statistical calculations.
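
Because several modes are possible, pandas' `Series.mode()` returns a Series containing every value tied for the highest frequency, in sorted order. Indexing with `[0]`, as the blood type example does, shows only the first of them. The sketch below uses hypothetical shoe sizes with a deliberate tie.

```python
import pandas as pd

# Hypothetical shoe sizes: 8 and 9 each occur three times
sizes = pd.Series([8, 9, 8, 9, 8, 9, 10])

modes = sizes.mode()  # all values tied for the highest frequency

print(f"All modes: {modes.tolist()}")
print(f"First mode only: {modes[0]}")
```

When ties matter, it is safer to inspect the whole result, or the output of `value_counts()`, than to rely on the first element.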

### When to Use the Mode

The mode is most appropriate when:

1.  Working with categorical data (e.g., blood types, product categories)
2.  Dealing with discrete numerical data (e.g., shoe sizes, number of children in families)
3.  Identifying the most common item or category in a dataset
4.  Describing data that doesn't have a clear numerical ordering

In various data science applications, the mode is commonly used for:

-   Analyzing customer preferences in market research
-   Identifying the most common diagnoses in medical studies
-   Determining popular choices in survey responses
-   Describing the typical characteristics of a dataset when mean and median aren't applicable

### Multimodal Data

When a dataset has multiple modes, it's often referred to as **bimodal** (two modes), **trimodal** (three modes), or more generally, **multimodal**. This can reveal important characteristics of the data, such as:

-   Presence of distinct subgroups within the population
-   Potential for further segmentation or clustering analysis
-   Need for more detailed investigation into the factors causing multiple peaks in the data distribution

Understanding the mode and recognizing multimodal distributions can provide valuable insights, especially when dealing with categorical data or when the most frequent value is of particular interest. While it may not be as commonly used in advanced statistical analyses as the mean or median, the mode remains an important tool in the data scientist's toolkit for describing and understanding data distributions.

## Measures of Dispersion: How Spread Out is Our Data?

While measures of central tendency give us a sense of the typical value in a dataset, they don't tell us anything about how the data is spread out. This is where measures of dispersion come in. These measures help us understand the variability and spread of our data points.

Let's start with the simplest measure of dispersion: the range.

### Range

The **range** is the difference between the maximum and minimum values in a dataset. It gives a quick and simple measure of the spread of the data.

Mathematically, the range is defined as:

$$
\text{Range} = \text{Maximum value} - \text{Minimum value}
$$

or

$$
\text{Range} = \max(x) - \min(x)
$$

Where $x$ represents the dataset.

### Minimum and Maximum

The **minimum** (min) is the smallest value in the dataset, while the **maximum** (max) is the largest value.

Let's consider an example using gene expression levels in a cancer research study:

In [12]:
# Create a sample dataset of gene expression levels
np.random.seed(52179)  # for reproducibility
data = {
    'Sample ID': range(1, 21),
    'Gene Expression (FPKM)': np.random.normal(loc=20, scale=3, size=20).round(1)
}

df = pd.DataFrame(data)

# Calculate range, min, and max
expr_range = df['Gene Expression (FPKM)'].max() - df['Gene Expression (FPKM)'].min()
expr_min = df['Gene Expression (FPKM)'].min()
expr_max = df['Gene Expression (FPKM)'].max()

print(f"Gene expression data:\n{df}\n")
print(f"Range of gene expression: {expr_range:.1f} FPKM")
print(f"Minimum gene expression: {expr_min:.1f} FPKM")
print(f"Maximum gene expression: {expr_max:.1f} FPKM")

Gene expression data:
    Sample ID  Gene Expression (FPKM)
0           1                    23.1
1           2                    23.9
2           3                    21.4
3           4                    17.0
4           5                    19.0
5           6                    21.0
6           7                    18.0
7           8                    22.0
8           9                    20.9
9          10                    18.4
10         11                    26.7
11         12                    20.2
12         13                    15.8
13         14                    17.3
14         15                    17.6
15         16                    19.6
16         17                    18.3
17         18                    21.9
18         19                    19.4
19         20                    20.3

Range of gene expression: 10.9 FPKM
Minimum gene expression: 15.8 FPKM
Maximum gene expression: 26.7 FPKM


### Interpretation of Results

-   The gene expression levels in our tumor samples range from 15.8 to 26.7 FPKM.
-   The range of 10.9 FPKM gives us a quick sense of how spread out our data is.

### Properties and Considerations of Range, Min, and Max

1.  *Sensitivity to outliers*--The range is highly sensitive to extreme values, as it only considers the two most extreme points in the dataset. In our gene expression example, if we had one sample with an unusually high expression level of 40 FPKM, it would dramatically increase our range without necessarily being representative of the overall data spread.
2.  *Simplicity*--Range, min, and max are easy to calculate and understand, making them useful for quick assessments of data spread. In our example, we can quickly communicate that "gene expression varied between 15.8 and 26.7 FPKM across tumor samples."
3.  *Limited information*--While they provide the boundaries of the data, they don't give information about the distribution of values between these extremes. We don't know from these measures alone whether most of our tumor samples had expression levels closer to the minimum, closer to the maximum, or evenly spread out.
4.  *Useful for data cleaning*--Min and max values are often used to identify potential outliers or data entry errors. In our gene expression study, values far outside our observed range might warrant further investigation -- they could represent interesting biological phenomena or potential errors in measurement or data recording.
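
The sensitivity of the range to a single extreme value can be checked directly. This sketch copies the 20 expression values from the output above into a plain list and appends one hypothetical sample at 40 FPKM; the helper name `value_range` is illustrative.

```python
# Gene expression values from the example output (FPKM)
expression = [23.1, 23.9, 21.4, 17.0, 19.0, 21.0, 18.0, 22.0, 20.9, 18.4,
              26.7, 20.2, 15.8, 17.3, 17.6, 19.6, 18.3, 21.9, 19.4, 20.3]


def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


range_before = value_range(expression)
range_after = value_range(expression + [40.0])  # hypothetical outlier sample

print(f"Range without outlier: {range_before:.1f} FPKM")
print(f"Range with outlier:    {range_after:.1f} FPKM")
```

One added sample more than doubles the range, from 10.9 to 24.2 FPKM, even though the other 20 values are unchanged.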

### When to Use Range, Min, and Max

These measures are most appropriate when:

1.  You need a quick, simple measure of data spread
2.  Identifying the boundaries of your dataset is important
3.  Detecting potential outliers or data errors
4.  Comparing the spread of different datasets at a high level

In our gene expression example, these measures could be used to:

-   Quickly compare the variability of expression across different genes
-   Set the y-axis limits when plotting expression levels
-   Identify tumor samples with unusually high or low expression for further study
-   Normalize expression data across different genes for comparative analyses
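
One common way to normalize with the minimum and maximum is min-max scaling, which maps each value to $(x - \min) / (\max - \min)$ so that the result lies between 0 and 1. Whether this is the normalization intended here is an assumption; other schemes are also widely used. The sketch below applies it to the expression values from the example.

```python
# Gene expression values from the example output (FPKM)
expression = [23.1, 23.9, 21.4, 17.0, 19.0, 21.0, 18.0, 22.0, 20.9, 18.4,
              26.7, 20.2, 15.8, 17.3, 17.6, 19.6, 18.3, 21.9, 19.4, 20.3]

lo, hi = min(expression), max(expression)

# Min-max scaling: the minimum maps to 0 and the maximum maps to 1
scaled = [(x - lo) / (hi - lo) for x in expression]

print([round(s, 2) for s in scaled[:5]])
```

Because the scaling depends only on the two extremes, it inherits the range's sensitivity to outliers.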

While the range, minimum, and maximum provide a basic understanding of data spread, they are often used in conjunction with more sophisticated measures of dispersion, such as variance and standard deviation, to get a more complete picture of how data is distributed. These measures are particularly useful in the early stages of data analysis for getting a quick sense of the data's boundaries and potential outliers.

## How Much Do Our Data Points Deviate from the Mean?

While the range gives us a quick idea of the spread of our data, it doesn't tell us much about how the data is distributed within that range. This is where variance and standard deviation come in. These measures give us a sense of how much our data points typically deviate from the mean.

### Variance
The **variance** measures the average squared deviation from the mean. It gives us a sense of how spread out our data is, with larger values indicating greater spread.

For a population, the variance is defined as:

$$
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
$$

For a sample, we use:

$$
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
$$

Where:

-   $\sigma^2$ is the population variance
-   $s^2$ is the sample variance
-   $x_i$ are the individual values
-   $\mu$ is the population mean
-   $\bar{x}$ is the sample mean
-   $N$ is the population size
-   $n$ is the sample size
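
To see how the two formulas differ in practice, the following sketch computes both by hand for a small hypothetical dataset. The numerator is the same in each case; only the divisor changes ($N$ for a population, $n - 1$ for a sample).

```python
# Hypothetical small dataset
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n

# Squared deviation of each value from the mean
squared_devs = [(x - mean) ** 2 for x in values]

population_var = sum(squared_devs) / n        # divide by N
sample_var = sum(squared_devs) / (n - 1)      # divide by n - 1

print(f"Mean: {mean}")
print(f"Sum of squared deviations: {sum(squared_devs)}")
print(f"Population variance: {population_var:.3f}")
print(f"Sample variance:     {sample_var:.3f}")
```

Here the mean is 5 and the squared deviations sum to 32, giving a population variance of 4.0 and a slightly larger sample variance of about 4.571.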

### Standard Deviation

The **standard deviation** is the square root of the variance. It's often preferred over variance because it's in the same units as the original data.

For a population:

$$
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
$$

For a sample:

$$
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
$$

Let's consider an example using life expectancy data for North American countries:

In [13]:
# Create a dataset of life expectancy across North American countries
# Data source: World Bank, 2021 (or latest available) life expectancy data
data = {
    'Country': ['Canada', 'United States', 'Mexico', 'Costa Rica', 'Panama',
                'El Salvador', 'Guatemala', 'Honduras', 'Nicaragua', 'Belize'],
    'Life Expectancy': [82.3, 76.1, 75.1, 80.3, 78.5, 73.3, 74.3, 75.3, 74.5, 74.6]
}

df = pd.DataFrame(data)

# Calculate mean, variance, and standard deviation
mean_life_exp = df['Life Expectancy'].mean()
var_life_exp = df['Life Expectancy'].var()
std_life_exp = df['Life Expectancy'].std()

print(f"Life expectancy data:\n{df}\n")
print(f"Mean life expectancy: {mean_life_exp:.1f} years")
print(f"Variance of life expectancy: {var_life_exp:.1f} years²")
print(f"Standard deviation of life expectancy: {std_life_exp:.1f} years")

Life expectancy data:
         Country  Life Expectancy
0         Canada             82.3
1  United States             76.1
2         Mexico             75.1
3     Costa Rica             80.3
4         Panama             78.5
5    El Salvador             73.3
6      Guatemala             74.3
7       Honduras             75.3
8      Nicaragua             74.5
9         Belize             74.6

Mean life expectancy: 76.4 years
Variance of life expectancy: 8.7 years²
Standard deviation of life expectancy: 2.9 years


### Interpretation of Results

-   The mean life expectancy across North American countries is 76.4 years.
-   The variance of 8.7 years² gives us a sense of the spread, but it's in squared units, which can be hard to interpret.
-   The standard deviation of 2.9 years is more interpretable: it tells us that life expectancy in these countries typically differs from the mean by roughly 2.9 years.
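
One detail worth knowing: pandas' `var()` and `std()` default to `ddof=1`, so the figures above are the sample versions (dividing by $n - 1$). NumPy's `np.var` and `np.std` default to `ddof=0`, the population versions. The sketch below compares the two on the same life expectancy values.

```python
import numpy as np
import pandas as pd

# Life expectancy values from the example above (years)
life_exp = pd.Series([82.3, 76.1, 75.1, 80.3, 78.5, 73.3, 74.3, 75.3, 74.5, 74.6])

sample_var = life_exp.var()                  # pandas default: ddof=1, divides by n - 1
population_var = life_exp.var(ddof=0)        # divides by n
numpy_default = np.var(life_exp.to_numpy())  # NumPy default: ddof=0

print(f"Sample variance (ddof=1):     {sample_var:.2f}")
print(f"Population variance (ddof=0): {population_var:.2f}")
print(f"np.var default:               {numpy_default:.2f}")
```

With only ten values the difference is noticeable (about 8.70 versus 7.83), so it helps to state which version is being reported.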

### Properties and Considerations of Variance and Standard Deviation

1.  *Units*--Variance is in squared units (years² in this case), which can be difficult to interpret. Standard deviation is in the same units as the original data (years), making it more interpretable.
2.  *Sensitivity to outliers*--Both measures are sensitive to outliers. In our data, Canada's high life expectancy (82.3 years) and El Salvador's lower life expectancy (73.3 years) increase the variance and standard deviation.
3.  *Usefulness in further analysis*--Many statistical methods and models use variance and standard deviation in their calculations.
4.  *Normal distribution*--In a normal distribution, about 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. While our data isn't perfectly normal, we can still use this as a rough guide.
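
As a rough check of the 68-95-99.7 guideline, the sketch below counts what share of the ten life expectancy values falls within one and two sample standard deviations of the mean. The helper `share_within` is a hypothetical name introduced for this illustration.

```python
import statistics

# Life expectancy values from the example above (years)
life_exp = [82.3, 76.1, 75.1, 80.3, 78.5, 73.3, 74.3, 75.3, 74.5, 74.6]

mean = statistics.mean(life_exp)
std = statistics.stdev(life_exp)  # sample standard deviation (n - 1)


def share_within(values, center, spread, k):
    """Fraction of values lying within k * spread of center."""
    inside = [x for x in values if abs(x - center) <= k * spread]
    return len(inside) / len(values)


within_1 = share_within(life_exp, mean, std, 1)
within_2 = share_within(life_exp, mean, std, 2)

print(f"Within 1 standard deviation:  {within_1:.0%}")
print(f"Within 2 standard deviations: {within_2:.0%}")
```

Seven of the ten countries (70%) fall within one standard deviation and all ten fall within two, which is close to the guideline despite the small sample.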

### When to Use Variance and Standard Deviation

These measures are most appropriate when:

1.  You need a more nuanced measure of spread than the range
2.  The data is roughly normally distributed
3.  You're preparing data for further statistical analysis
4.  You want to compare the spread of different datasets that might have different means

In our life expectancy example, these measures could be used to:

-   Compare the variability in life expectancy across different regions (e.g., North America vs. Europe)
-   Identify countries with unusually high or low life expectancies (e.g., Canada is about 2 standard deviations above the mean)
-   Provide a confidence interval for life expectancy estimates in the region
-   Standardize life expectancy data for comparison with other health metrics
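
Standardizing usually means converting each value to a z-score, $z = (x - \bar{x}) / s$, which expresses how many standard deviations a value lies from the mean. The sketch below computes z-scores for the ten countries using the sample standard deviation. The dictionary layout is just one convenient way to pair countries with scores.

```python
import statistics

countries = ['Canada', 'United States', 'Mexico', 'Costa Rica', 'Panama',
             'El Salvador', 'Guatemala', 'Honduras', 'Nicaragua', 'Belize']
life_exp = [82.3, 76.1, 75.1, 80.3, 78.5, 73.3, 74.3, 75.3, 74.5, 74.6]

mean = statistics.mean(life_exp)
std = statistics.stdev(life_exp)  # sample standard deviation (n - 1)

# z-score: distance from the mean in units of standard deviations
z_scores = {c: (x - mean) / std for c, x in zip(countries, life_exp)}

for country, z in sorted(z_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{country:<14} {z:+.2f}")
```

Canada comes out at roughly +1.99 and El Salvador at roughly -1.06, matching the observation that Canada sits about two standard deviations above the mean.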

Variance and standard deviation provide a more comprehensive view of data spread than simpler measures like range. They are fundamental in many statistical analyses and provide valuable insights into the distribution of our data, as demonstrated with this real-world public health example.