<a href="https://colab.research.google.com/github/brendanpshea/data-science/blob/main/DataScience_06_DescriptiveStatistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data's Vital Signs: Mastering Descriptive Statistics
### The Philosophy and Practice of Data Science | Brendan Shea, PhD

In data science, descriptive statistics serve as our compass, guiding us through the intricate terrain of information across all fields of study. This chapter delves into the essential tools and techniques that allow us to summarize, visualize, and interpret data, using examples from public health to illustrate broader applications.

From calculating simple averages to constructing confidence intervals, we'll explore how these fundamental concepts bring clarity to complex datasets in any discipline. As we navigate through measures of central tendency, dispersion, and distribution, we'll use real-world examples that, while drawn from epidemiology and health surveys, demonstrate principles applicable to finance, marketing, social sciences, and beyond.

We'll explore the power of data visualization, learning how to represent statistical findings in clear, compelling graphs that speak volumes at a glance, regardless of the field. Moreover, we'll discuss the strengths and limitations of various statistical measures, emphasizing the importance of choosing the right tool for each unique dataset, whether you're analyzing stock prices, customer preferences, or climate data.

By the end of this chapter, you'll not only understand the mechanics of descriptive statistics but also appreciate their pivotal role in uncovering meaningful patterns and insights in any type of data. Whether you're a budding data scientist, a professional in any field, or simply curious about the stories hidden in numbers, this chapter will equip you with the knowledge to describe, summarize, and interpret data with confidence and clarity.

### Learning Outcomes

By the end of this chapter, you will be able to:

1.  Calculate and interpret measures of central tendency (mean, median, mode) for various types of datasets.
2.  Compute and analyze measures of dispersion, including range, variance, and standard deviation.
3.  Apply percent change and percent difference calculations to evaluate trends over time.
4.  Construct and interpret confidence intervals for population parameters.
5.  Create and interpret frequency distributions and percentages for categorical data.
6.  Visualize statistical data using appropriate graphs, including histograms, bar charts, and line plots.
7.  Select appropriate descriptive statistics based on data type and distribution.
8.  Critically evaluate the strengths and limitations of different statistical measures.
9.  Interpret descriptive statistics to draw meaningful conclusions about trends and patterns.
10. Utilize Python libraries such as Pandas, NumPy, and Matplotlib to perform descriptive statistical analyses on diverse datasets.

Keywords: mean, median, mode, range, variance, standard deviation, percent change, percent difference, confidence intervals, frequency distribution, data visualization, descriptive statistics, Python, Pandas, NumPy, Matplotlib

## Brendan's Lecture

In [None]:
##Click here to launch my lecture
from IPython.display import YouTubeVideo
YouTubeVideo('ihFgyI2KUD0', width=800, height=500)

### BrendyBot is Here to Answer Your Questions
![image.png](https://github.com/brendanpshea/colab-utilities/raw/main/brendy_bot_pic.png)

If you have questions about the content of this chapter, you can try out "BrendyBot", an AI chatbot I've trained on the lecture notes for this class (note that BrendyBot is still experimental and can definitely make mistakes!).

https://poe.com/BrendyBot

## Measures of Central Tendency: How Can We Find the Center of Our Data?

In data science, understanding the "typical" value of a dataset is crucial for making informed decisions and drawing meaningful insights. One of the most common measures of central tendency is the **mean**, also known as the **average**. The mean provides a single value that represents the center of a dataset, making it useful for summarizing large amounts of information.

### The Mean

The **mean** is calculated by summing all values in a dataset and dividing by the number of values. It's particularly useful when dealing with continuous data and is widely used across various fields, including medicine, finance, and social sciences.

Formula:

$$
\text{mean} = \frac{\text{sum of all values}}{\text{number of values}}
$$

In mathematical notation:

$$
\bar{x} = \frac{\sum x}{n}
$$

Where $\bar{x}$ (read as "x bar") is the mean, $\sum x$ is the sum of all values, and $n$ is the number of values.

### Using Pandas to Calculate the Mean
Before we dive into calculating the mean, let's set up our data using two popular Python libraries for data analysis: pandas and numpy. We'll use pandas to create a structured dataset and numpy for numerical operations.
First, we import these libraries:

In [None]:
import pandas as pd
import numpy as np


Now, let's create a sample dataset of patient temperatures. We'll use a Python dictionary to structure our data, with two keys: 'Patient ID' and 'Temperature (°C)'. The `range()` function generates patient IDs from 1 to 10, and we provide a list of temperature readings:

In [None]:
# Create a sample dataset of patient temperatures
data = {
    'Patient ID': range(1, 11),
    'Temperature (°C)': [36.8, 37.2, 36.9, 37.4, 36.7, 37.1, 37.3, 36.5, 37.0, 37.2]
}

Next, we'll use pandas to create a DataFrame. A DataFrame is a two-dimensional labeled data structure in pandas, similar to a spreadsheet or a SQL table. It's a fundamental data structure for data analysis in Python. Then, we'll use `df.head()` to display the first few rows:

In [None]:
# Create data frame
df = pd.DataFrame(data)
# display first five rows
df.head()

| | Patient ID | Temperature (°C) |
|---|---|---|
| 0 | 1 | 36.8 |
| 1 | 2 | 37.2 |
| 2 | 3 | 36.9 |
| 3 | 4 | 37.4 |
| 4 | 5 | 36.7 |


Now that we have our data in a DataFrame, let's calculate the mean temperature. In pandas, we can easily compute the mean of a specific column using the `mean()` function. Here's how we do it:

In [None]:
# Calculate the mean temperature
mean_temp = df['Temperature (°C)'].mean()
print(f"The mean patient temperature is: {mean_temp:.1f}°C")

The mean patient temperature is: 37.0°C


Let's break this down:

1.  `df['Temperature (°C)']` selects the 'Temperature (°C)' column from our DataFrame.
2.  `.mean()` is a pandas method that calculates the average of all values in the selected column.
3.  We store this result in the variable `mean_temp`.
4.  Finally, we print the result using an f-string, which allows us to format our output. The `.1f` formatter specifies that we want to display the temperature to one decimal place.

This code will calculate the mean temperature across all patients in our dataset and display it in a readable format. By using pandas functions, we can efficiently compute statistical measures on our structured data.

In this example, the mean temperature of 37.0°C provides a quick summary of the "typical" patient temperature in the ward. This demonstrates how the mean can efficiently summarize a dataset with a single value.

### Properties and Considerations of the Mean

1. *Influence of all data points*--The mean takes into account every value in the dataset, which can be both an advantage and a disadvantage.
2. *Sensitivity to outliers*--The mean can be significantly affected by extreme values (**outliers**). For instance, if one patient had a very high fever of 40°C, it would pull the mean higher, potentially giving a misleading representation of the typical temperature in the ward.
3. *Suitability for further statistical analysis*--Many statistical methods and models use the mean in their calculations, making it a fundamental concept in data science.
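
To make the outlier point concrete, here is a minimal sketch that recomputes the mean after adding one extra patient with a 40°C fever to the ten readings used earlier. The extra reading is a hypothetical value added for illustration only, and the helper `mean()` is simply the formula from above written in plain Python.

```python
# The ten temperature readings from the example above
temps = [36.8, 37.2, 36.9, 37.4, 36.7, 37.1, 37.3, 36.5, 37.0, 37.2]


def mean(values):
    """Sum of all values divided by the number of values."""
    return sum(values) / len(values)


# Hypothetical extra patient with a high fever (not in the original data)
temps_with_fever = temps + [40.0]

mean_before = mean(temps)
mean_after = mean(temps_with_fever)

print(f"Mean without the fever patient: {mean_before:.2f}°C")
print(f"Mean with the fever patient:    {mean_after:.2f}°C")
print(f"Shift caused by one reading:    {mean_after - mean_before:.2f}°C")
```

A single extreme reading moves the mean by roughly 0.27°C here, even though the other ten patients are unchanged.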

### When to Use the Mean

The mean is most appropriate when:

1. The data is continuous (like temperature, income, or test scores)
2. The distribution is roughly symmetric (not heavily skewed)
3. Outliers are not a significant concern or have been addressed

In various data science applications, the mean is commonly used for:

- Summarizing central tendencies in large datasets
- Comparing different groups or populations
- Tracking changes over time (e.g., mean daily stock prices)
- Input for more complex statistical analyses

### The Median
While the mean is a common measure of central tendency, it can be sensitive to extreme values. This is where the **median** comes in. The median is another measure of central tendency that is particularly useful when dealing with skewed data or datasets with outliers. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

To find the median:

1. Sort the data in ascending order
2. If $n$ is odd, the median is the $\frac{n+1}{2}$th term
3. If $n$ is even, the median is the average of the $\frac{n}{2}$th and $\frac{n}{2}+1$th terms

In mathematical notation:

For odd $n$:
$$
\text{Median} = x_{\frac{n+1}{2}}
$$

For even $n$:
$$
\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}
$$

Where $x_i$ represents the $i$th value in the sorted dataset.
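
The procedure above can be sketched in plain Python. This is only an illustration of the definition, not how pandas computes the median internally; the function name `median` and the two small example lists are made up for this sketch.

```python
def median(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2-th term, which is index `mid` with zero-based indexing
        return ordered[mid]
    # Even n: average of the n/2-th and (n/2 + 1)-th terms (indexes mid-1 and mid)
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_example = [7, 3, 9, 1, 5]        # illustrative data
even_example = [7, 3, 9, 1, 5, 11]   # illustrative data

print(median(odd_example))    # 5
print(median(even_example))   # 6.0
```

Note that the formulas count positions starting from 1, while Python lists are indexed from 0, which is why the indexes in the code are shifted by one.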

Let's consider an example using patient waiting times in an emergency room. First, let's create a dataframe of sample data:


In [None]:
# Create a sample dataset of patient waiting times
data = {
    'Patient ID': range(1, 11),
    'Wait Time (minutes)': [10, 15, 20, 25, 30, 35, 40, 45, 50, 200]
}

df = pd.DataFrame(data)
df

| | Patient ID | Wait Time (minutes) |
|---|---|---|
| 0 | 1 | 10 |
| 1 | 2 | 15 |
| 2 | 3 | 20 |
| 3 | 4 | 25 |
| 4 | 5 | 30 |
| 5 | 6 | 35 |
| 6 | 7 | 40 |
| 7 | 8 | 45 |
| 8 | 9 | 50 |
| 9 | 10 | 200 |


In [None]:
# Calculate the median wait time
median_wait = df['Wait Time (minutes)'].median()
# Calculate the mean for comparison
mean_wait = df['Wait Time (minutes)'].mean()
print(f"The median wait time is: {median_wait:.1f} minutes")
print(f"The mean wait time is: {mean_wait:.1f} minutes")

The median wait time is: 32.5 minutes
The mean wait time is: 47.0 minutes


Here's what happens:

1.  `df['Wait Time (minutes)'].median()` calculates the median of the 'Wait Time (minutes)' column. The median is the middle value when the data is ordered from least to greatest.
2.  `df['Wait Time (minutes)'].mean()` calculates the mean of the same column, which we've seen before.
3.  We store these results in `median_wait` and `mean_wait` respectively.
4.  Finally, we print both results using f-strings, formatting the output to one decimal place.

By calculating both the median and the mean, we can compare these two measures of central tendency. If there's a significant difference between the two, it might indicate that our data is skewed or contains outliers. This comparison can provide valuable insights into the distribution of wait times in our emergency room scenario.

In this example, the median wait time (32.5 minutes) provides a more representative measure of the typical wait time than the mean (47.0 minutes), which is pulled higher by the outlier of 200 minutes.

### Properties and Considerations of the Median

1.  *Robustness to outliers*--The median is largely unaffected by a few extreme values, making it a robust measure of central tendency.
2.  *Suitability for ordinal data*--The median can be used with ordinal data (where order matters but the interval between values may not be consistent), unlike the mean.
3.  *Less affected by skewed distributions*--For skewed data, the median often provides a better representation of the "typical" value than the mean.
4.  *Limited use in further statistical analyses*--Unlike the mean, the median is not commonly used in more advanced statistical calculations.
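
As a rough check of the robustness claim, the sketch below reuses the ten wait times from the example and compares them with a hypothetical variant in which the longest wait is 2000 minutes instead of 200. The variant is invented for illustration; the calculations use Python's built-in `statistics` module rather than pandas.

```python
import statistics

# Wait times from the emergency room example
wait_times = [10, 15, 20, 25, 30, 35, 40, 45, 50, 200]

# Hypothetical variant: the longest wait is ten times worse
extreme_wait_times = wait_times[:-1] + [2000]

for label, values in [("original", wait_times), ("extreme", extreme_wait_times)]:
    mean_value = statistics.mean(values)
    median_value = statistics.median(values)
    print(f"{label:>8}: mean = {mean_value:.1f}, median = {median_value:.1f}")
```

The mean jumps from 47.0 to 227.0 minutes, while the median stays at 32.5 minutes in both cases, because only the position of the outlier in the sorted order matters, not its size.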

### When to Use the Median

The median is most appropriate when:

1.  The data is skewed or contains outliers
2.  Dealing with ordinal data
3.  The "middle" value is more informative than the arithmetic average
4.  Working with datasets related to income, housing prices, or other fields where extreme values are common

In various data science applications, the median is commonly used for:

-   Reporting central tendencies in skewed datasets
-   Describing income distributions (e.g., median household income)
-   Analyzing response times or durations
-   Real estate valuations (e.g., median home prices)

### The Mode: What's the Most Common Value in Our Dataset?
While the mean and median provide valuable insights into the center of a dataset, sometimes we're interested in knowing the most frequently occurring value. This is where the mode comes in. The mode is particularly useful for categorical data and discrete numerical data.

The **mode** is the value that appears most frequently in a dataset. A dataset has exactly one mean and one median, but it can have more than one mode if multiple values share the same highest frequency.

Mathematically, the mode is defined as:

$$
\text{Mode} = \text{argmax}_x f(x)
$$

Where $f(x)$ is the frequency function, and $\text{argmax}$ returns the value of $x$ that maximizes $f(x)$.
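
One way to read the argmax definition is "count every value, then keep whichever value(s) have the largest count." The sketch below does this with `collections.Counter` from the standard library. The function name `modes` and the sample data are assumptions made for illustration.

```python
from collections import Counter


def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)       # f(x): frequency of each distinct value
    top = max(counts.values())     # the largest frequency
    return [value for value, count in counts.items() if count == top]


shoe_sizes = [8, 9, 9, 10, 10, 10, 11]   # illustrative data, not from the chapter

print(modes(shoe_sizes))                   # [10]
print(modes(["a", "b", "a", "b", "c"]))    # ['a', 'b']
```

Returning a list rather than a single value keeps ties visible, which matters when two or more values are equally common.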

Let's consider an example using blood types in a hospital. Again, we'll start by creating a dataframe.

In [None]:
# Create a sample dataset of blood types
data = {
    'Patient ID': range(1, 21),
    'Blood Type': ['A', 'B', 'O', 'A', 'AB', 'O', 'A', 'O', 'B', 'A',
                   'O', 'A', 'B', 'O', 'A', 'O', 'A', 'AB', 'O', 'A']
}

blood_type_df = pd.DataFrame(data)
blood_type_df.head()

| | Patient ID | Blood Type |
|---|---|---|
| 0 | 1 | A |
| 1 | 2 | B |
| 2 | 3 | O |
| 3 | 4 | A |
| 4 | 5 | AB |


In this example, we're working with a dataset of patient blood types. Here's how we calculate the mode and frequencies using pandas:

In [None]:
# Calculate the mode
mode_blood_type = blood_type_df['Blood Type'].mode()

# Calculate frequencies
blood_type_freq = blood_type_df['Blood Type'].value_counts()

print(f"The mode of blood types is: {mode_blood_type[0]}")
print(f"\nBlood type frequencies:\n{blood_type_freq}")

The mode of blood types is: A

Blood type frequencies:
Blood Type
A     8
O     7
B     3
AB    2
Name: count, dtype: int64


The code works as follows:

1.  `blood_type_df['Blood Type'].mode()` calculates the mode of the 'Blood Type' column. The mode is the most frequently occurring value in a dataset. Note that for categorical data like blood types, the mode is often more meaningful than the mean or median.
2.  `blood_type_df['Blood Type'].value_counts()` calculates the frequency of each unique value in the 'Blood Type' column. This gives us a count of how many times each blood type appears in our dataset.
3.  We print the mode, which is the most common blood type. Note that we use `mode_blood_type[0]` because the `mode()` method returns a Series that can hold multiple values if there's more than one mode (see the "Multimodal Data" section below).
4.  Finally, we print the frequencies of all blood types.

This analysis not only tells us the most common blood type (the mode) but also provides a complete breakdown of the distribution of blood types in our patient population. This kind of information can be crucial for blood banks, hospital supply management, and other healthcare planning scenarios.

In this example, blood type A is the mode, occurring most frequently in the dataset.

### Properties and Considerations of the Mode

1.  *Applicability to categorical data*--The mode can be used with nominal (categorical) data, unlike the mean and median.
2.  *Multiple modes possible*--A dataset can have more than one mode (bimodal, trimodal, etc.) if multiple values occur with the same highest frequency.
3.  *Insensitivity to extreme values*--Like the median, the mode is not affected by outliers.
4.  *Limited use in further statistical analyses*--The mode is not commonly used in more advanced statistical calculations.

### When to Use the Mode

The mode is most appropriate when:

1.  Working with categorical data (e.g., blood types, product categories)
2.  Dealing with discrete numerical data (e.g., shoe sizes, number of children in families)
3.  Identifying the most common item or category in a dataset
4.  Describing data that doesn't have a clear numerical ordering

In various data science applications, the mode is commonly used for:

-   Analyzing customer preferences in market research
-   Identifying the most common diagnoses in medical studies
-   Determining popular choices in survey responses
-   Describing the typical characteristics of a dataset when mean and median aren't applicable

### Multimodal Data

When a dataset has multiple modes, it's often referred to as **bimodal** (two modes), **trimodal** (three modes), or more generally, **multimodal**. This can reveal important characteristics of the data, such as:

-   Presence of distinct subgroups within the population
-   Potential for further segmentation or clustering analysis
-   Need for more detailed investigation into the factors causing multiple peaks in the data distribution

Understanding the mode and recognizing multimodal distributions can provide valuable insights, especially when dealing with categorical data or when the most frequent value is of particular interest. While it may not be as commonly used in advanced statistical analyses as the mean or median, the mode remains an important tool in the data scientist's toolkit for describing and understanding data distributions.
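
A short sketch of how a tie shows up in pandas: `Series.mode()` returns a Series containing every mode, so checking its length is a simple way to detect multimodal data. The visit counts below are invented for illustration.

```python
import pandas as pd

# Illustrative data with a tie: 2 and 5 each appear three times
visits = pd.Series([2, 2, 2, 5, 5, 5, 3, 4])

all_modes = visits.mode()      # a Series holding every mode, in sorted order

print(all_modes.tolist())      # [2, 5]
print(len(all_modes) > 1)      # True, so the data has more than one mode
```

Taking only `all_modes[0]` here would silently report 2 and hide the second mode, so it is worth inspecting the full result before indexing into it.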

## Measures of Dispersion: How Spread Out is Our Data?

While measures of central tendency give us a sense of the typical value in a dataset, they don't tell us anything about how the data is spread out. This is where measures of dispersion come in. These measures help us understand the variability and spread of our data points.

Let's start with the simplest measure of dispersion: the range.

### Range

The **range** is the difference between the maximum and minimum values in a dataset. It gives a quick and simple measure of the spread of the data.

Mathematically, the range is defined as:

$$
\text{Range} = \text{Maximum value} - \text{Minimum value}
$$

or

$$
\text{Range} = \max(x) - \min(x)
$$

Where $x$ represents the dataset.

### Minimum and Maximum

The **minimum** (min) is the smallest value in the dataset, while the **maximum** (max) is the largest value.

Let's consider an example using gene expression levels in a cancer research study. To start, we'll create a sample dataset of gene expression levels using numpy's random number generation and pandas DataFrame:

In [None]:
# Create a sample dataset of gene expression levels
np.random.seed(52179)  # for reproducibility
data = {
    'Sample ID': range(1, 21),
    'Gene Expression (FPKM)': np.random.normal(loc=20, scale=3, size=20).round(1)
}

gene_df = pd.DataFrame(data)
gene_df.head()

| | Sample ID | Gene Expression (FPKM) |
|---|---|---|
| 0 | 1 | 23.1 |
| 1 | 2 | 23.9 |
| 2 | 3 | 21.4 |
| 3 | 4 | 17.0 |
| 4 | 5 | 19.0 |


There are a few new things here:
- `np.random.seed(52179)` sets a seed for the random number generator. By using a seed, we ensure that the "random" numbers generated are the same each time we run the code, which is crucial for reproducibility in data science. (Brendan's Note: I often use my birthday as my random seed...).
-   We create a dictionary `data` with two keys:
    -   'Sample ID': This uses `range(1, 21)` to generate sample IDs from 1 to 20.
    -   'Gene Expression (FPKM)': This uses `np.random.normal()` to generate 20 random numbers from a normal distribution with a mean (loc) of 20 and a standard deviation (scale) of 3. The `.round(1)` rounds these numbers to one decimal place.
-   `gene_df = pd.DataFrame(data)`: This creates a pandas DataFrame from our dictionary, which we've seen before.

Now we can easily find the min, max, and range using pandas' `max()` and `min()` methods.

In [None]:

# Calculate range, min, and max
expr_range = gene_df['Gene Expression (FPKM)'].max() - gene_df['Gene Expression (FPKM)'].min()
expr_min = gene_df['Gene Expression (FPKM)'].min()
expr_max = gene_df['Gene Expression (FPKM)'].max()

print(f"Range of gene expression: {expr_range:.1f} FPKM")
print(f"Minimum gene expression: {expr_min:.1f} FPKM")
print(f"Maximum gene expression: {expr_max:.1f} FPKM")

Range of gene expression: 10.9 FPKM
Minimum gene expression: 15.8 FPKM
Maximum gene expression: 26.7 FPKM


### Interpretation of Results

-   The gene expression levels in our tumor samples range from 15.8 to 26.7 FPKM.
-   The range of 10.9 FPKM gives us a quick sense of how spread out our data is.

### Properties and Considerations of Range, Min, and Max

1.  *Sensitivity to outliers*--The range is highly sensitive to extreme values, as it only considers the two most extreme points in the dataset. In our gene expression example, if we had one sample with an unusually high expression level of 40 FPKM, it would dramatically increase our range without necessarily being representative of the overall data spread.
2.  *Simplicity*--Range, min, and max are easy to calculate and understand, making them useful for quick assessments of data spread. In our example, we can quickly communicate that "gene expression varied between 15.8 and 26.7 FPKM across tumor samples."
3.  *Limited information*--While they provide the boundaries of the data, they don't give information about the distribution of values between these extremes. We don't know from these measures alone whether most of our tumor samples had expression levels closer to the minimum, closer to the maximum, or evenly spread out.
4.  *Useful for data cleaning*--Min and max values are often used to identify potential outliers or data entry errors. In our gene expression study, values far outside our observed range might warrant further investigation -- they could represent interesting biological phenomena or potential errors in measurement or data recording.
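
The data-cleaning use can be sketched as a simple bounds check. Everything specific here is an assumption for illustration: the six values (including a 400.0 that mimics a misplaced decimal point) and the plausibility limits `LOWER` and `UPPER`, which in practice would come from domain knowledge about the assay.

```python
import pandas as pd

# Illustrative expression values; 400.0 mimics a data entry error
expr = pd.Series([23.1, 23.9, 21.4, 17.0, 19.0, 400.0],
                 name="Gene Expression (FPKM)")

# Assumed plausibility bounds for this hypothetical assay
LOWER, UPPER = 0.0, 100.0

expr_range = expr.max() - expr.min()
print(f"min = {expr.min()}, max = {expr.max()}, range = {expr_range:.1f}")

# Keep only the readings that fall outside the plausible interval
suspicious = expr[(expr < LOWER) | (expr > UPPER)]
print(suspicious)
```

A range of 383 FPKM is driven entirely by one value, which is exactly the kind of signal that should prompt a closer look before any further analysis.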

### When to Use Range, Min, and Max

These measures are most appropriate when:

1.  You need a quick, simple measure of data spread
2.  Identifying the boundaries of your dataset is important
3.  Detecting potential outliers or data errors
4.  Comparing the spread of different datasets at a high level

In our gene expression example, these measures could be used to:

-   Quickly compare the variability of expression across different genes
-   Set the y-axis limits when plotting expression levels
-   Identify tumor samples with unusually high or low expression for further study
-   Normalize expression data across different genes for comparative analyses

While the range, minimum, and maximum provide a basic understanding of data spread, they are often used in conjunction with more sophisticated measures of dispersion, such as variance and standard deviation, to get a more complete picture of how data is distributed. These measures are particularly useful in the early stages of data analysis for getting a quick sense of the data's boundaries and potential outliers.

## How Much Do Our Data Points Deviate from the Mean?

While the range gives us a quick idea of the spread of our data, it doesn't tell us much about how the data is distributed within that range. This is where variance and standard deviation come in. These measures give us a sense of how much our data points typically deviate from the mean.

### Variance
The **variance** measures the average squared deviation from the mean. It gives us a sense of how spread out our data is, with larger values indicating greater spread.

For a population, the variance is defined as:

$$
\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
$$

For a sample, we use:

$$
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
$$

Where:

-   $\sigma^2$ is the population variance
-   $s^2$ is the sample variance
-   $x_i$ are the individual values
-   $\mu$ is the population mean
-   $\bar{x}$ is the sample mean
-   $N$ is the population size
-   $n$ is the sample size

### Standard Deviation

The **standard deviation** is the square root of the variance. It's often preferred over variance because it's in the same units as the original data.

For a population:

$$
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}
$$

For a sample:

$$
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
$$
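
To see how the population and sample formulas differ in practice, the sketch below computes both by hand and cross-checks them against Python's `statistics` module. It reuses the ten patient temperatures from earlier in the chapter purely as convenient numbers; the variable names are illustrative.

```python
import math
import statistics

# Patient temperatures from the earlier example, reused as sample data
temps = [36.8, 37.2, 36.9, 37.4, 36.7, 37.1, 37.3, 36.5, 37.0, 37.2]

n = len(temps)
x_bar = sum(temps) / n
sum_sq_dev = sum((x - x_bar) ** 2 for x in temps)

pop_var = sum_sq_dev / n             # population formula: divide by N
sample_var = sum_sq_dev / (n - 1)    # sample formula: divide by n - 1

pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

print(f"Population variance: {pop_var:.4f}  (sd = {pop_sd:.4f})")
print(f"Sample variance:     {sample_var:.4f}  (sd = {sample_sd:.4f})")

# Cross-check against the standard library implementations
print(math.isclose(pop_var, statistics.pvariance(temps)))
print(math.isclose(sample_var, statistics.variance(temps)))
```

One practical detail worth knowing: pandas' `var()` and `std()` use the sample formula by default (`ddof=1`), while NumPy's `np.var()` and `np.std()` use the population formula by default (`ddof=0`), so the two libraries can give slightly different answers on the same data unless `ddof` is set explicitly.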

Let's consider an example using life expectancy data for North American countries. Again, we'll start by creating a data frame:

In [None]:
# Create a dataset of life expectancy across North American countries
# Data source: World Bank, 2021 (or latest available) life expectancy data
data = {
    'Country': ['Canada', 'United States', 'Mexico', 'Costa Rica', 'Panama',
                'El Salvador', 'Guatemala', 'Honduras', 'Nicaragua', 'Belize'],
    'Life Expectancy': [82.3, 76.1, 75.1, 80.3, 78.5, 73.3, 74.3, 75.3, 74.5, 74.6]
}

life_df = pd.DataFrame(data)
life_df

| | Country | Life Expectancy |
|---|---|---|
| 0 | Canada | 82.3 |
| 1 | United States | 76.1 |
| 2 | Mexico | 75.1 |
| 3 | Costa Rica | 80.3 |
| 4 | Panama | 78.5 |
| 5 | El Salvador | 73.3 |
| 6 | Guatemala | 74.3 |
| 7 | Honduras | 75.3 |
| 8 | Nicaragua | 74.5 |
| 9 | Belize | 74.6 |


Now, we can use the following functions:

1. `df['column'].mean()` to calculate the mean.
2. `df['column'].var()` to calculate the variance.
3. `df['column'].std()` to calculate the standard deviation.

In [None]:
# Calculate mean, variance, and standard deviation
mean_life_exp = life_df['Life Expectancy'].mean()
var_life_exp = life_df['Life Expectancy'].var()
std_life_exp = life_df['Life Expectancy'].std()

print(f"Mean life expectancy: {mean_life_exp:.1f} years")
print(f"Variance of life expectancy: {var_life_exp:.1f} years²")
print(f"Standard deviation of life expectancy: {std_life_exp:.1f} years")

Mean life expectancy: 76.4 years
Variance of life expectancy: 8.7 years²
Standard deviation of life expectancy: 2.9 years


### Interpretation of Results

-   The mean life expectancy across North American countries is 76.4 years.
-   The variance of 8.7 years² gives us a sense of the spread, but it's in squared units, which can be hard to interpret.
-   The standard deviation of 2.9 years is more interpretable: it tells us that life expectancy in these countries typically deviates from the mean by roughly 2.9 years.

### Properties and Considerations of Variance and Standard Deviation

1.  *Units*--Variance is in squared units (years² in this case), which can be difficult to interpret. Standard deviation is in the same units as the original data (years), making it more interpretable.
2.  *Sensitivity to outliers*--Both measures are sensitive to outliers. In our data, Canada's high life expectancy (82.3 years) and El Salvador's lower life expectancy (73.3 years) increase the variance and standard deviation.
3.  *Usefulness in further analysis*--Many statistical methods and models use variance and standard deviation in their calculations.
4.  *Normal distribution*--In a normal distribution, about 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. While our data isn't perfectly normal, we can still use this as a rough guide.
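
As a rough illustration of the rule of thumb in the last item, the sketch below converts each country's life expectancy into a z-score (the number of standard deviations from the mean) and counts how many fall within one and two standard deviations. It uses the `statistics` module and the same ten values as the example; the dictionary and variable names are illustrative.

```python
import statistics

# Life expectancy values from the example above
life_expectancy = {
    "Canada": 82.3, "United States": 76.1, "Mexico": 75.1, "Costa Rica": 80.3,
    "Panama": 78.5, "El Salvador": 73.3, "Guatemala": 74.3, "Honduras": 75.3,
    "Nicaragua": 74.5, "Belize": 74.6,
}

values = list(life_expectancy.values())
mean = statistics.mean(values)
sd = statistics.stdev(values)   # sample standard deviation (divides by n - 1)

# z-score: how many standard deviations a value lies from the mean
z_scores = {country: (value - mean) / sd
            for country, value in life_expectancy.items()}

within_one_sd = [c for c, z in z_scores.items() if abs(z) <= 1]
within_two_sd = [c for c, z in z_scores.items() if abs(z) <= 2]

for country, z in sorted(z_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{country:<14} z = {z:+.2f}")

print(f"Within 1 SD: {len(within_one_sd)} of {len(values)}")
print(f"Within 2 SD: {len(within_two_sd)} of {len(values)}")
```

With these ten values, 7 countries fall within one standard deviation and all 10 within two, which is in the neighborhood of the 68% and 95% guideline, although ten observations are far too few to judge whether the data is actually normal.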

### When to Use Variance and Standard Deviation

These measures are most appropriate when:

1.  You need a more nuanced measure of spread than the range
2.  The data is roughly normally distributed
3.  You're preparing data for further statistical analysis
4.  You want to compare the spread of different datasets that might have different means

In our life expectancy example, these measures could be used to:

-   Compare the variability in life expectancy across different regions (e.g., North America vs. Europe)
-   Identify countries with unusually high or low life expectancies (e.g., Canada is about 2 standard deviations above the mean)
-   Provide a confidence interval for life expectancy estimates in the region
-   Standardize life expectancy data for comparison with other health metrics

Variance and standard deviation provide a more comprehensive view of data spread than simpler measures like range. They are fundamental in many statistical analyses and provide valuable insights into the distribution of our data, as demonstrated with this real-world public health example.

## How Often Does Each Category Occur in Our Data? Frequencies and Percentages

When working with categorical data, we often want to know how many times each category appears in our dataset. This is where frequencies and percentages come in handy. They help us summarize and understand the distribution of categorical variables.

### Frequency

The **frequency** of a category is simply the count of how many times that category appears in the dataset.

### Percentage

The **percentage** is the proportion of a particular category expressed as a part of 100. It's calculated by dividing the frequency of a category by the total number of observations and multiplying by 100.

$$
\text{Percentage} = \frac{\text{Frequency}}{\text{Total number of observations}} \times 100
$$

Let's consider an example using vaccination status data from a public health survey:

In [None]:
# Create a dataset of vaccination status
data = {
    'Vaccination Status': ['Fully Vaccinated', 'Partially Vaccinated', 'Unvaccinated',
                           'Fully Vaccinated', 'Unvaccinated', 'Fully Vaccinated',
                           'Partially Vaccinated', 'Fully Vaccinated', 'Unvaccinated',
                           'Fully Vaccinated', 'Fully Vaccinated', 'Partially Vaccinated',
                           'Unvaccinated', 'Fully Vaccinated', 'Partially Vaccinated',
                           'Fully Vaccinated', 'Unvaccinated', 'Fully Vaccinated',
                           'Partially Vaccinated', 'Fully Vaccinated']
}

vac_df = pd.DataFrame(data)
vac_df.head()

| | Vaccination Status |
|---|---|
| 0 | Fully Vaccinated |
| 1 | Partially Vaccinated |
| 2 | Unvaccinated |
| 3 | Fully Vaccinated |
| 4 | Unvaccinated |


Now, we'll analyze vaccination status data by calculating both the frequencies and percentages for each category. This two-step approach gives us a comprehensive view of our data distribution.

Here's our strategy:

1.  Calculate frequencies--We'll use pandas' `value_counts()` function to count how many times each vaccination status appears in our dataset.
2.  Calculate percentages--We'll again use `value_counts()`, but with the `normalize=True` argument to get proportions, which we'll then convert to percentages.
3.  Combine the results--We'll use pandas' `concat()` function to merge our frequency and percentage data into a single, easy-to-read summary table.

Key pandas functions we'll use:

-   `value_counts()`: This function counts the occurrences of each unique value in a Series. When used with `normalize=True`, it returns proportions instead of counts.
-   `concat()`: This function allows us to combine multiple pandas objects along a particular axis.

We'll also use some basic pandas operations like rounding and column renaming to polish our final output.

This approach not only gives us the raw counts of each vaccination status but also their relative proportions, providing a clear picture of vaccination coverage in our sample. Such a summary is invaluable in public health for understanding the current state of vaccination and informing future strategies.

Now, let's see this strategy in action with our code:

In [None]:
# Calculate frequencies
freq = vac_df['Vaccination Status'].value_counts()

# Calculate percentages
perc = vac_df['Vaccination Status'].value_counts(normalize=True) * 100

# Combine frequencies and percentages
summary = pd.concat([freq, perc], axis=1, keys=['Frequency', 'Percentage'])
summary['Percentage'] = summary['Percentage'].round(1)

summary

| Vaccination Status | Frequency | Percentage |
|---|---|---|
| Fully Vaccinated | 10 | 50.0 |
| Partially Vaccinated | 5 | 25.0 |
| Unvaccinated | 5 | 25.0 |


### Bar Graphs for Displaying Groups by Percentage
After calculating our frequencies and percentages, it's often helpful to visualize this data. Visualization can quickly convey the distribution of our data and highlight important patterns or differences that might not be immediately apparent from numbers alone.

For categorical data like vaccination status, a **bar graph** (also called a bar chart) is an excellent choice for visualization. Bar graphs use rectangular bars with heights proportional to the values they represent. They are particularly useful when you want to:

1.  Compare quantities across different categories
2.  Show the distribution of a categorical variable
3.  Highlight the differences between groups

In our case, a bar graph will allow us to visually compare the frequencies of different vaccination statuses.

To create our bar graph, we'll use **Plotly**, a popular plotting library in Python.

-   We import Plotly Express as `px`.
-   We use `px.bar()` to create a bar chart. This function handles most of the setup automatically.
-   We pass our `summary` DataFrame to the function, specifying:
    -   `x=summary.index`: Use the index (vaccination status) for the x-axis.
    -   `y='Frequency'`: Use the 'Frequency' column for the y-axis.
    -   `title`: Set the chart title.
    -   `labels`: Rename the axes labels.
    -   `height`: Set the height of the figure (optional).
-   `fig.show()` displays the interactive plot.

Now, let's see how we implement this in code:

In [None]:
import plotly.express as px

fig = px.bar(summary,  # data source
             x=summary.index, # the x-axis variable
             y='Frequency',
             title='Distribution of Vaccination Status',
             labels={'index': 'Vaccination Status', 'Frequency': 'Frequency'},
             height=500)

fig.show()

### Interpretation of Results

-   The most common category is "Fully Vaccinated" with a frequency of 10, representing 50% of the sample.
-   The "Unvaccinated" and "Partially Vaccinated" categories each have a frequency of 5, representing 25% of the sample apiece.

### Properties and Considerations of Frequencies and Percentages

1.  Frequencies and percentages are easy to calculate and interpret, making them accessible to a wide audience.
2.  The sum of all frequencies should equal the total number of observations, and the sum of all percentages should equal 100%.
3.  Frequencies provide absolute counts, while percentages provide relative proportions, allowing for easier comparison across different-sized datasets.
4.  Percentages can be misleading when sample sizes are small. Always consider the total sample size when interpreting percentages.
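
The second property above doubles as a quick sanity check on any frequency table. Here is a minimal sketch, assuming a small hypothetical Series (`status`) that mirrors the 10/5/5 split in our summary table rather than the original `vac_df`:

```python
import pandas as pd

# Hypothetical stand-in for vac_df['Vaccination Status'], matching the 10/5/5 split above
status = pd.Series(
    ["Fully Vaccinated"] * 10 + ["Partially Vaccinated"] * 5 + ["Unvaccinated"] * 5,
    name="Vaccination Status",
)

freq = status.value_counts()
perc = status.value_counts(normalize=True) * 100

# Frequencies should add up to the number of observations
total_count = int(freq.sum())
# Percentages should add up to 100 (allowing for floating-point error)
total_percent = float(perc.sum())

print(f"Total count: {total_count} (expected {len(status)})")
print(f"Total percentage: {total_percent:.1f}%")
```

If you sum percentages that have already been rounded, the total can drift slightly away from 100, so it is usually safer to run this check before rounding.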

### When to Use Frequencies and Percentages

These measures are most appropriate when:

1.  Working with categorical or discrete data
2.  Summarizing the distribution of a variable
3.  Comparing proportions across different groups or categories
4.  Communicating results to a general audience

In our vaccination status example, frequencies and percentages could be used to:

-   Track progress towards vaccination goals
-   Compare vaccination rates across different demographics or regions
-   Inform public health messaging and resource allocation
-   Provide clear, understandable statistics for public communication

### Additional Considerations

1.  **Grouped Data**. For continuous variables, you might need to group data into categories before calculating frequencies and percentages (e.g., age groups, income brackets).
2.  **Cumulative Frequencies/Percentages**. These can be useful for ordinal data or when you want to know the proportion at or below a certain level.
3.  **Relative Frequencies**. These are similar to percentages but expressed as proportions (between 0 and 1) instead of percentages.
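
The three ideas above can be sketched together in a few lines. This example assumes a made-up list of ages and arbitrary bracket boundaries (neither comes from our vaccination data), and uses pandas' `cut()` to do the grouping:

```python
import pandas as pd

# Made-up ages, for illustration only
ages = pd.Series([23, 35, 47, 52, 61, 68, 74, 29, 41, 58])

# Grouped data: bin the continuous variable into [18, 40), [40, 65), [65, 100)
age_group = pd.cut(ages, bins=[18, 40, 65, 100],
                   labels=["18-39", "40-64", "65+"], right=False)

# Frequencies, kept in bracket order rather than sorted by count
freq = age_group.value_counts().sort_index()

# Relative frequencies: proportions between 0 and 1
rel_freq = freq / freq.sum()

# Cumulative percentage: share of observations in this bracket or any lower one
cum_pct = (rel_freq.cumsum() * 100).round(1)

table = pd.DataFrame({"Frequency": freq,
                      "Relative Frequency": rel_freq,
                      "Cumulative %": cum_pct})
print(table)
```

Cumulative percentages only make sense when the categories have a natural order, which is why the table is sorted by bracket before calling `cumsum()`.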

Frequencies and percentages are fundamental tools in descriptive statistics, especially for categorical data. They provide a clear, intuitive way to summarize and communicate the distribution of data across different categories, making them invaluable in fields like public health, market research, and social sciences.

## Percent Change: How Much Has Our Data Changed Over Time?

When analyzing data over time, we often want to quantify how much a value has increased or decreased. Percent change is a useful measure for this purpose, allowing us to express the change relative to the initial value.

### Percent Change

**Percent change** is the difference between a final and initial value, expressed as a percentage of the initial value.

The formula for percent change is:

$$
\text{Percent Change} = \frac{\text{Final Value} - \text{Initial Value}}{\text{Initial Value}} \times 100
$$
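
Before turning to real data, it may help to see the formula as a plain Python function. This is a minimal sketch; the function name `percent_change` and the example numbers are our own choices for illustration:

```python
def percent_change(initial, final):
    """Return the change from initial to final as a percentage of initial."""
    if initial == 0:
        # Dividing by zero: percent change has no meaning from a zero baseline
        raise ValueError("Percent change is undefined when the initial value is 0")
    return (final - initial) / initial * 100

increase = percent_change(50, 75)    # 25 more than a starting value of 50
decrease = percent_change(200, 150)  # 50 fewer than a starting value of 200

print(f"50 -> 75:   {increase:+.1f}%")
print(f"200 -> 150: {decrease:+.1f}%")
```

A positive result indicates growth and a negative result indicates decline, always measured against the initial value.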

Let's consider an example using COVID data from 2020:

In [None]:
# Create a dataset of COVID-19 cases in the US for 2020
# Data source: CDC COVID Data Tracker (https://covid.cdc.gov/covid-data-tracker/)
data = {
    'Date': ['2020-03-01', '2020-04-01', '2020-05-01', '2020-06-01', '2020-07-01',
             '2020-08-01', '2020-09-01', '2020-10-01', '2020-11-01', '2020-12-01'],
    'Cumulative Cases': [30, 186101, 1069424, 1781376, 2624873,
                         4496408, 6004443, 7262734, 9105230, 13447627]
}
covid_df = pd.DataFrame(data)
covid_df['Date'] = pd.to_datetime(covid_df['Date'])
covid_df.head()

Unnamed: 0,Date,Cumulative Cases
0,2020-03-01,30
1,2020-04-01,186101
2,2020-05-01,1069424
3,2020-06-01,1781376
4,2020-07-01,2624873


To calculate percent change, we'll use the pandas `pct_change()` function. This function computes the percentage change between adjacent elements in a Series or DataFrame. Here's our strategy:

1.  We'll apply `pct_change()` to our 'Cumulative Cases' column.
2.  We'll round the resulting proportion to two decimal places.
3.  We'll multiply by 100 to convert from a proportion to a percentage, which leaves each change rounded to the nearest whole percent.

The `pct_change()` function is particularly useful because it automatically handles the calculation for each row based on the previous row's value, saving us from having to write a loop or complex formula.
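
To see what `pct_change()` is doing under the hood, we can compare it with a manual calculation that uses `shift()` to line each value up with the one before it. This is a small sketch using made-up case counts, not our COVID-19 data:

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
cases = pd.Series([100, 150, 120, 180])

# Built-in: change relative to the previous row, as a percentage
auto = cases.pct_change() * 100

# Manual: shift(1) moves every value down one row, giving the "previous" value
previous = cases.shift(1)
manual = (cases - previous) / previous * 100

comparison = pd.DataFrame({"pct_change()": auto, "manual": manual}).round(1)
print(comparison)
```

The first row is `NaN` in both columns because there is no earlier value to compare against, which is also why the first row of our COVID-19 table will be empty.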

In [None]:
# Calculate percent change
covid_df['Percent Change'] = covid_df['Cumulative Cases'].pct_change().round(2) * 100
covid_df

Unnamed: 0,Date,Cumulative Cases,Percent Change
0,2020-03-01,30,
1,2020-04-01,186101,620237.0
2,2020-05-01,1069424,475.0
3,2020-06-01,1781376,67.0
4,2020-07-01,2624873,47.0
5,2020-08-01,4496408,71.0
6,2020-09-01,6004443,34.0
7,2020-10-01,7262734,21.0
8,2020-11-01,9105230,25.0
9,2020-12-01,13447627,48.0


In addition to calculating the percent change between consecutive time points, it's often valuable to compute the overall percent change from the beginning to the end of our dataset. This gives us a big-picture view of how much our measure (in this case, cumulative COVID-19 cases) has changed over the entire period we're studying.

The formula for overall percent change is the same as the one we used earlier, but now we're using the first and last values in our dataset:

$$
\text{Overall Percent Change} = \frac{\text{Final Value} - \text{Initial Value}}{\text{Initial Value}} \times 100
$$

This calculation will show us the total relative increase in cases from the start to the end of our data period. It's useful for:

1.  Summarizing the overall growth in cases over the entire time span
2.  Providing a single, easily communicable figure for the total change
3.  Comparing long-term trends across different regions or diseases

To calculate this, we'll use pandas indexing to access the first and last values in our 'Cumulative Cases' column. Here's our strategy:

1.  We'll use `covid_df['Cumulative Cases'].iloc[0]` to get the first value (initial cases).
2.  We'll use `covid_df['Cumulative Cases'].iloc[-1]` to get the last value (final cases).
3.  We'll apply our percent change formula using these values.
4.  We'll format the result as a percentage and round it for readability.

We'll also extract the start and end dates from our 'Date' column to provide context for our percent change value.

Now, let's see how we implement this calculation in code:

In [None]:
# Calculate overall percent change from first to last date
overall_change = ((covid_df['Cumulative Cases'].iloc[-1] - covid_df['Cumulative Cases'].iloc[0]) /
                   covid_df['Cumulative Cases'].iloc[0]) * 100

print(f"\nOverall percent change from {covid_df['Date'].iloc[0].date()} to {covid_df['Date'].iloc[-1].date()}: {overall_change:.1f}%")


Overall percent change from 2020-03-01 to 2020-12-01: 44825323.3%


### Interpretation of Results

-   The percent changes between consecutive dates show how rapidly the cumulative case count was increasing.
-   The extremely high percent change from March to April 2020 (620237.0%) reflects the initial rapid spread of the virus from a very low base.
-   Subsequent percent changes, while lower, still indicate significant growth, especially in the earlier periods.
-   The overall percent change (from March 2020 to December 2020) was an astounding 44825323.3%.

### Properties and Considerations of Percent Change

1.   Percent change expresses change relative to the initial value, allowing for comparison between different scales or time periods.
2.  When the initial value is small, even a small absolute change can result in a large percent change, as seen in the early stages of the pandemic.
3.  This example uses cumulative cases, which can never decrease, so every percent change here is zero or positive. Percent change of new cases might show both increases and decreases.
4.  With exponential growth, as often seen in pandemic data, a log scale can be useful for visualization.
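
The third point can be checked directly. As a rough sketch, we can use `diff()` to turn the first five cumulative totals from our table into the number of new cases between monthly snapshots, and then take the percent change of those. The variable names here are our own:

```python
import pandas as pd

# First five cumulative totals from the table above
cumulative = pd.Series(
    [30, 186101, 1069424, 1781376, 2624873],
    index=pd.to_datetime(["2020-03-01", "2020-04-01", "2020-05-01",
                          "2020-06-01", "2020-07-01"]),
    name="Cumulative Cases",
)

# New cases between snapshots: each total minus the previous total
new_cases = cumulative.diff().dropna()

# Percent change of new cases, which can be negative
new_cases_pct = (new_cases.pct_change() * 100).round(1)

print(pd.DataFrame({"New Cases": new_cases, "Percent Change": new_cases_pct}))
```

Fewer new cases were added between May and June than between April and May, so the percent change of new cases is negative for that period even though the cumulative total kept climbing. For the fourth point, Plotly Express accepts a `log_y=True` argument in `px.line()` to draw the y-axis on a log scale.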

### When to Use Percent Change

Percent change is most appropriate when:

1.  Analyzing trends over time
2.  Comparing changes across different regions or time periods
3.  Communicating relative changes to a general audience
4.  Evaluating the rate of spread or containment of a disease

In our COVID-19 example, percent change could be used to:

-   Track the rate of increase in cases over time
-   Compare the growth rate of cases across different regions or countries
-   Evaluate the effectiveness of public health interventions
-   Identify periods of acceleration or deceleration in case growth

Percent change is a valuable tool in epidemiology and public health, particularly for analyzing trends and communicating changes over time. However, it should be used in conjunction with other measures and always interpreted within the appropriate context.

### Line Plots to Visualize Percent Change
While bar graphs are excellent for categorical data, line graphs are particularly useful for visualizing data that changes over time, making them ideal for our COVID-19 case data.

Line graphs connect individual data points with lines, which allows us to easily see trends, patterns, and rates of change. They are especially good at showing:

1.  Continuous data over time
2.  Trends and patterns in the data
3.  Rate of change (steep lines indicate rapid change, while flatter lines show slower change)

In our COVID-19 example, a line graph will help us visualize how cumulative cases have increased over time, potentially revealing periods of rapid spread or slowdown.

Now, let's see how we create this line graph, using `px.line()`.

In [None]:
import plotly.express as px

fig = px.line(covid_df,
              x='Date',
              y='Cumulative Cases',
              title='Cumulative COVID-19 Cases in the US Over Time',
              markers=True, height=400)
fig.show()

### Percent Difference: How Do We Compare Two Different Values?

While percent change is useful for comparing consecutive time periods, percent difference is valuable when comparing two different values that may not be directly sequential. This is particularly useful in public health for comparing rates or prevalences across different years or populations.

**Percent difference** is the absolute difference between two values, expressed as a percentage of their average. It's particularly useful when there's no clear "before" and "after" value, or when comparing two values of equal standing.

The formula for percent difference is:

$$
\text{Percent Difference} = \frac{|\text{Value}_1 - \text{Value}_2|}{\frac{\text{Value}_1 + \text{Value}_2}{2}} \times 100
$$

Let's consider an example using cigarette smoking prevalence data in the United States:

In [None]:
# Create a dataset of cigarette smoking prevalence in the US
# Data source: CDC National Health Interview Survey (NHIS)
data = {
    'Year': [1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020],
    'Smoking Prevalence (%)': [42.4, 37.4, 36.4, 33.2, 30.1, 25.5, 24.7, 23.2, 20.9, 19.3, 15.1, 12.5]
}

smoking_df = pd.DataFrame(data)
smoking_df.head()

Unnamed: 0,Year,Smoking Prevalence (%)
0,1965,42.4
1,1970,37.4
2,1975,36.4
3,1980,33.2
4,1985,30.1


Here's our strategy for calculating percent differences:

1.  We'll create a reusable **function** to calculate percent difference. This allows us to easily compute percent differences for multiple pairs of years.
2.  We'll use pandas' `loc` indexer to select smoking prevalence values for specific years from our DataFrame.
3.  We'll apply our function to compare smoking prevalence between 1965 and 2020, and between 1990 and 2010.

This approach will allow us to:

-   Quantify the change in smoking prevalence over long (1965-2020) and shorter (1990-2010) time periods
-   Compare the rate of change in different eras
-   Provide a standardized measure of change that can be compared across different studies or populations

The percent difference gives us a symmetrical measure of relative change, which can be particularly useful when we're not thinking of one value as a "starting point" and the other as an "ending point," but rather just want to express how different two values are from each other.

Now, let's see how we implement these calculations in code:

In [None]:
# Function to calculate percent difference
def percent_difference(value1, value2):
    return abs(value1 - value2) / ((value1 + value2) / 2) * 100

# Calculate percent difference between 1965 and 2020
diff_1965_2020 = percent_difference(smoking_df.loc[smoking_df['Year'] == 1965, 'Smoking Prevalence (%)'].values[0],
                                    smoking_df.loc[smoking_df['Year'] == 2020, 'Smoking Prevalence (%)'].values[0])

# Calculate percent difference between 1990 and 2010
diff_1990_2010 = percent_difference(smoking_df.loc[smoking_df['Year'] == 1990, 'Smoking Prevalence (%)'].values[0],
                                    smoking_df.loc[smoking_df['Year'] == 2010, 'Smoking Prevalence (%)'].values[0])

print(f"Percent difference in smoking prevalence between 1965 and 2020: {diff_1965_2020:.1f}%")
print(f"Percent difference in smoking prevalence between 1990 and 2010: {diff_1990_2010:.1f}%")

Percent difference in smoking prevalence between 1965 and 2020: 108.9%
Percent difference in smoking prevalence between 1990 and 2010: 27.7%
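
The symmetry of percent difference is easy to confirm by swapping the order of the two values and comparing the result with percent change, which does depend on the order. This is a self-contained sketch; the helper `percent_change` is our own and is included only for contrast:

```python
def percent_difference(value1, value2):
    """Absolute difference as a percentage of the average of the two values."""
    return abs(value1 - value2) / ((value1 + value2) / 2) * 100

def percent_change(initial, final):
    """Change from initial to final as a percentage of initial."""
    return (final - initial) / initial * 100

prev_1965, prev_2020 = 42.4, 12.5  # smoking prevalence (%) from the table above

# Percent difference gives the same answer in either order
diff_forward = percent_difference(prev_1965, prev_2020)
diff_backward = percent_difference(prev_2020, prev_1965)

# Percent change depends on which value is treated as the baseline
change_forward = percent_change(prev_1965, prev_2020)
change_backward = percent_change(prev_2020, prev_1965)

print(f"Percent difference: {diff_forward:.1f}% and {diff_backward:.1f}%")
print(f"Percent change 1965 -> 2020: {change_forward:.1f}%")
print(f"Percent change 2020 -> 1965: {change_backward:.1f}%")
```

When there is a clear "before" and "after," as there is here, percent change (a drop of about 70.5%) is usually the more natural figure to report. Percent difference is better suited to comparing two values of equal standing.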


### Line Graph: Prevalence of Smoking

We can also create a line graph to help visualize this data.

In [None]:
# Visualize the data
fig = px.line(smoking_df,
              x='Year',
              y='Smoking Prevalence (%)',
              title='Cigarette Smoking Prevalence in the US (1965-2020)',
              markers=True,
              height=400)

fig.show()

## Confidence Intervals: How Certain Are We About Our Estimates?

When working with sample data to make inferences about a population, we need a way to express the uncertainty in our estimates. Confidence intervals provide a range of plausible values for a population parameter, along with a level of confidence in that range.

A **confidence interval** is a range of values that is likely to contain the true population parameter with a certain level of confidence. It is typically expressed as a point estimate (usually the sample mean) plus or minus a margin of error.

The general formula for a confidence interval is:

$$
\text{Confidence Interval} = \text{Point Estimate} \pm (\text{Critical Value} \times \text{Standard Error})
$$

For a population mean with known population standard deviation:

$$
\bar{x} \pm (z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}})
$$

Where:

-   $\bar{x}$ is the sample mean
-   $z_{\alpha/2}$ is the critical value from the standard normal distribution
-   $\sigma$ is the population standard deviation
-   $n$ is the sample size
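
As a quick worked example of this formula, suppose (hypothetically) that a sample of 100 readings has a mean of 120 and that the population standard deviation is known to be 15. These numbers are assumptions chosen for illustration. SciPy's `stats.norm.ppf()` supplies the critical value:

```python
import numpy as np
from scipy import stats

# Assumed values, for illustration only
x_bar = 120.0   # sample mean
sigma = 15.0    # known population standard deviation
n = 100         # sample size
alpha = 0.05    # 1 - confidence level (95% confidence)

# Critical value z_{alpha/2}: the point with alpha/2 of the area above it
z_crit = stats.norm.ppf(1 - alpha / 2)

# Margin of error = critical value * standard error
margin = z_crit * sigma / np.sqrt(n)

ci_low, ci_high = x_bar - margin, x_bar + margin
print(f"z critical value: {z_crit:.2f}")
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

In practice the population standard deviation is rarely known, which is why the calculation later in this section swaps in the sample standard deviation and the t-distribution.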

Let's consider an example using blood pressure data from a (fake) health survey, which we will create using Python's scientific computing libraries:

1.  SciPy: We'll import the `stats` module from SciPy, which provides many statistical functions.
2.  NumPy: We'll use NumPy's random number generation capabilities to create our simulated data.

Here's our strategy for generating the sample data:

1.  Set a random seed to ensure that our "random" numbers are reproducible. Anyone using the same seed will get the same sequence of random numbers.
2.  Define our sample parameters:
    -   Sample size: We'll generate 100 blood pressure readings. After you've seen how this impacts confidence intervals, you might want to go back and see what happens if you make this smaller (say, 30), or much larger (say, 3000).
    -   Mean (location): We'll center our distribution at 120 mmHg, a typical average for systolic blood pressure.
    -   Standard deviation (scale): We'll use 15 mmHg to introduce variability in our data.
3.  Generate the sample: We'll use NumPy's `random.normal()` function to generate our data. This function creates random numbers from a normal (Gaussian) distribution, which is often a good model for biological measurements like blood pressure.

By using a normal distribution, we're making an assumption about the underlying population distribution of blood pressure. This is often a reasonable assumption for many biological variables, but it's important to remember that it's a model, not necessarily reality.

In [None]:
import numpy as np
from scipy import stats

# Set random seed for reproducibility
np.random.seed(52179)

# Generate sample data: systolic blood pressure readings
sample_size = 100
blood_pressure = np.random.normal(loc=120, scale=15, size=sample_size)

print(f"Sample Blood Pressure: {blood_pressure[0:10].round(2)}")

Sample Blood Pressure: [135.43 139.42 126.98 104.79 114.81 124.92 110.18 130.18 124.58 112.21]


After generating our sample of blood pressure readings, we want to estimate the true population mean and quantify our uncertainty in this estimate. As mentioned above, this is where confidence intervals come in. Here's our strategy for calculating the confidence interval:

1.  Calculate sample statistics:
    -   The **sample mean** is our best point estimate of the population mean.
    -   The **sample standard deviation** measures the variability in our sample.
2.  Set the confidence level:
    -   We'll use a **95% confidence level**, which is common in many fields.
3.  Calculate the margin of error:
    -   We'll use the **t-distribution**, which is appropriate for small sample sizes or when the population standard deviation is unknown.
    -   The margin of error accounts for the sample size and variability in our estimate.
4.  Compute the confidence interval:
    -   We'll subtract and add the margin of error to our sample mean to get the lower and upper bounds of the interval.

The interpretation of a 95% confidence interval is: if we repeated this sampling process many times and calculated the interval each time, about 95% of these intervals would contain the true population mean.

Key functions we'll use:

-   `np.mean()` and `np.std()` for calculating sample statistics
-   `stats.t.ppf()` to find the critical value from the t-distribution
-   Basic arithmetic operations to calculate the margin of error and confidence interval bounds

Now, let's see how we implement these calculations in code:

In [None]:
# Calculate sample statistics
sample_mean = np.mean(blood_pressure)
sample_std = np.std(blood_pressure, ddof=1)  # ddof=1 for sample standard deviation

# Set confidence level
confidence_level = 0.95

# Calculate confidence interval
margin_of_error = stats.t.ppf((1 + confidence_level) / 2, df=sample_size-1) * (sample_std / np.sqrt(sample_size))
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error


print(f"Sample Mean Blood Pressure: {sample_mean:.2f} mmHg")
print(f"95% Confidence Interval: ({ci_lower:.2f}, {ci_upper:.2f}) mmHg")


Sample Mean Blood Pressure: 121.46 mmHg
95% Confidence Interval: (118.49, 124.42) mmHg
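
SciPy can also compute this interval in a single call, which makes a handy cross-check on the manual arithmetic. The sketch below generates its own sample with an arbitrary seed, so its numbers will not match the output above. It uses `stats.sem()` for the standard error and `stats.t.interval()` for the interval:

```python
import numpy as np
from scipy import stats

# A fresh simulated sample (arbitrary seed; values differ from the sample above)
rng = np.random.default_rng(0)
data = rng.normal(loc=120, scale=15, size=100)

mean = data.mean()
sem = stats.sem(data)   # standard error of the mean (uses ddof=1 by default)
dof = len(data) - 1     # degrees of freedom

# Manual calculation, as in the cell above
t_crit = stats.t.ppf(0.975, df=dof)
manual = (mean - t_crit * sem, mean + t_crit * sem)

# Built-in calculation: confidence level first, then degrees of freedom
built_in = stats.t.interval(0.95, dof, loc=mean, scale=sem)

print(f"Manual:   ({manual[0]:.2f}, {manual[1]:.2f})")
print(f"Built-in: ({built_in[0]:.2f}, {built_in[1]:.2f})")
```

Both approaches should agree to within floating-point error, since `stats.t.interval()` applies the same critical value to the same standard error.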


### Histogram: Systolic Blood Pressure
After calculating our confidence interval, it's helpful to visualize our data and results. For this, we'll use a histogram, which is a great tool for displaying the distribution of continuous data like our blood pressure readings.

A **histogram** divides the range of values into intervals (**bins**) and shows the frequency of data points falling into each bin. This gives us a visual representation of:

1.  The central tendency of our data (where most values cluster)
2.  The spread or variability of our data
3.  The shape of our distribution (symmetric, skewed, etc.)
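
To make the idea of bins concrete, NumPy's `np.histogram()` returns the counts and bin edges that a histogram plot draws. This is a minimal sketch on freshly simulated readings (arbitrary seed, not the `blood_pressure` array from earlier):

```python
import numpy as np

# Simulated readings, for illustration only
rng = np.random.default_rng(0)
readings = rng.normal(loc=120, scale=15, size=100)

# Split the range of the data into 20 equal-width bins and count values in each
counts, bin_edges = np.histogram(readings, bins=20)

# Show the first few bins: each bin is defined by a pair of consecutive edges
for left, right, count in list(zip(bin_edges[:-1], bin_edges[1:], counts))[:5]:
    print(f"{left:6.1f} to {right:6.1f}: {count}")

print(f"Total counted: {counts.sum()}")
```

Twenty bins require twenty-one edges, and the counts always add up to the number of observations. Changing the number of bins changes how smooth or jagged the histogram looks, but not the underlying data.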

In addition to the histogram, we'll add vertical lines to show our calculated sample mean and confidence interval bounds. This will help us visualize how our point estimate and interval estimate relate to the overall distribution of our data.

Now, let's see how we create this visualization. We'll use `px.histogram()`.

In [None]:
import plotly.express as px
import plotly.graph_objects as go
# Create the histogram
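fig = px.histogram(x=blood_pressure, nbins=20,
                   title='Distribution of Systolic Blood Pressure with 95% Confidence Interval',
                   labels={'x': 'Systolic Blood Pressure (mmHg)'},
                   height=400)
# px.histogram titles the y-axis "count" by default, so set the title explicitly
fig.update_layout(yaxis_title='Frequency')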


# Add vertical lines for sample mean and confidence interval
fig.add_vline(x=sample_mean, line_dash="dash", line_color="red", annotation_text="Sample Mean")
fig.add_vline(x=ci_lower, line_dash="dash", line_color="green", annotation_text="95% CI")
fig.add_vline(x=ci_upper, line_dash="dash", line_color="green")

fig.show()

### Interpretation of Results

-   The sample mean systolic blood pressure is 121.46 mmHg.
-   We are 95% confident that the true population mean systolic blood pressure falls between 118.49 and 124.42 mmHg.
-   This means if we repeated this sampling process many times, about 95% of the calculated intervals would contain the true population mean.
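
The "repeated sampling" interpretation can be tested by simulation, because in a simulation we know the true population mean. The sketch below draws 1,000 samples from a population with the same mean and standard deviation we used earlier, then counts how many of the resulting intervals capture the true mean. The seed and the number of repetitions are arbitrary choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, true_sd = 120, 15
n, n_repeats = 100, 1000

# Each row is one simulated sample of n readings
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))

# Sample mean and standard error for every sample
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

# 95% t-interval for every sample
t_crit = stats.t.ppf(0.975, df=n - 1)
lower = means - t_crit * sems
upper = means + t_crit * sems

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Intervals containing the true mean: {coverage:.1%}")
```

The fraction should land close to 95%, though not exactly on it, because the simulation itself is subject to random variation.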

### Properties and Considerations of Confidence Intervals

1.   The width of the **confidence interval** is affected by the sample size, variability in the data, and the chosen confidence level.
2.  **Confidence Level**. Commonly used **confidence levels** are 90%, 95%, and 99%. A higher confidence level results in a wider interval.
3.  **Sample Size**. Larger **sample sizes** generally lead to narrower confidence intervals, indicating more precise estimates.
4.  The basic formula assumes the sampling distribution is approximately **normal**, which is usually valid for large sample sizes or when the population is normally distributed.
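
To see how sample size and confidence level affect the interval, we can hold the sample standard deviation fixed and compute the margin of error under different settings. This is a rough sketch; the helper `margin_of_error` and the fixed value of 15 mmHg are assumptions made for illustration:

```python
import numpy as np
from scipy import stats

def margin_of_error(sample_std, n, confidence=0.95):
    """Half-width of a t-based confidence interval for the mean."""
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return t_crit * sample_std / np.sqrt(n)

s = 15  # assumed sample standard deviation (mmHg)

# Effect of sample size at a fixed 95% confidence level
by_n = {n: margin_of_error(s, n) for n in (30, 100, 3000)}

# Effect of confidence level at a fixed sample size of 100
by_conf = {c: margin_of_error(s, 100, c) for c in (0.90, 0.95, 0.99)}

for n, m in by_n.items():
    print(f"n = {n:>4}: margin of error = {m:.2f} mmHg")
for c, m in by_conf.items():
    print(f"{c:.0%} confidence: margin of error = {m:.2f} mmHg")
```

Because the margin of error shrinks with the square root of the sample size, cutting it in half requires roughly four times as many observations.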

### When to Use Confidence Intervals

Confidence intervals are most appropriate when:

1.  Estimating population parameters from sample data
2.  Expressing the precision of an estimate
3.  Comparing groups or assessing changes over time
4.  Communicating results of statistical analyses to both technical and non-technical audiences

In our blood pressure example, confidence intervals could be used to:

-   Estimate the average blood pressure in a population
-   Compare blood pressure levels between different demographic groups
-   Assess the effectiveness of interventions aimed at reducing blood pressure
-   Guide clinical decisions about blood pressure management

### Additional Considerations

1. A 95% confidence interval doesn't mean there's a 95% chance the true parameter is in that interval. Rather, it means 95% of similarly constructed intervals would contain the true parameter. In real life, we also have to account for the fact that sampling procedures can be biased (for example, in political polling, a consistent problem is that not all people are equally likely to respond to polls).
2.  In some cases, you might be interested in only an upper or lower bound, leading to **one-sided confidence intervals**.
3.  For non-normal distributions or other types of estimates, different methods (like bootstrap) might be more appropriate for constructing confidence intervals.
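
As an illustration of the third point, here is a minimal sketch of one common variant, the percentile bootstrap, applied to made-up right-skewed data. The data, seed, and number of resamples are all arbitrary choices, and other bootstrap variants exist:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up right-skewed data (for example, waiting times in minutes)
data = rng.exponential(scale=10, size=200)

# Resample the data with replacement many times, recording each resample's mean
n_resamples = 2000
idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
boot_means = data[idx].mean(axis=1)

# The middle 95% of the bootstrap means forms the interval
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"Sample mean: {data.mean():.2f}")
print(f"95% bootstrap CI: ({lower:.2f}, {upper:.2f})")
```

The appeal of this approach is that it makes no assumption about the shape of the population. It relies only on the sample being representative.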

Confidence intervals are a crucial tool in inferential statistics, providing a measure of the reliability of our estimates. They bridge the gap between point estimates and the inherent variability in sampling, offering a more nuanced view of population parameters. In public health and epidemiology, they're invaluable for making informed decisions based on sample data.

## What is the Normal Distribution?

The normal distribution, sometimes called the "bell curve" because of its shape, is a special pattern that many things in nature and society tend to follow. It's incredibly important in statistics because it helps us understand and predict how data behaves in many real-world situations.

### Key Features of the Normal Distribution

1. *Bell-shaped*. If you draw it, it looks like a bell or a mountain peak.
2. *Symmetrical*. Both sides are mirror images of each other.
3. *Most values cluster around the middle*. In a normal distribution, most of the data falls near the average (mean).
4. *Defined by two numbers*. The average (mean) tells us where the peak is, and the standard deviation tells us how spread out the data is.

### Why is it Important?

The normal distribution is a big deal in statistics for a few reasons:

1.  Many things we measure, from heights of people to errors in scientific measurements, often follow this pattern.
2. Because we know how normal distributions behave, we can make educated guesses about data.
3. Lots of statistical methods assume data is normally distributed.

### Normal Distribution and Confidence Intervals

Remember our discussion about confidence intervals? The idea of normal distribution plays a big role there. Here's why:

1. When we take many samples from a population, the averages of these samples tend to follow a normal distribution, even if the original population isn't perfectly normal. This is thanks to something called the Central Limit Theorem.

2. In a normal distribution, we know that about 68% of the data falls within one standard deviation of the mean, 95% within two, and 99.7% within three (the **Empirical Rule**). This predictable spread is what allows us to calculate confidence intervals.

3. When we say we have a "95% confidence interval," we're using properties of the normal distribution. We're saying that if we repeated our sampling many times, 95% of our intervals would contain the true population value.

4. Many statistical tests, not just confidence intervals, assume that data is normally distributed. This assumption allows us to use the properties of the normal distribution to make inferences about populations based on samples.
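
The percentages in the Empirical Rule are rounded. We can recover the more precise values from the standard normal distribution using SciPy's `stats.norm.cdf()`, which gives the proportion of the distribution at or below a given point:

```python
from scipy import stats

# Proportion of a normal distribution within k standard deviations of the mean
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"Within {k} standard deviation(s): {p:.2%}")

# Number of standard deviations that captures exactly the middle 95%
z_95 = stats.norm.ppf(0.975)
print(f"Exact cutoff for 95%: {z_95:.2f} standard deviations")
```

Two standard deviations actually cover slightly more than 95% of the distribution, which is why confidence interval formulas based on the normal distribution use a critical value of about 1.96 rather than 2.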

## A Visualization of the Normal Distribution

Let's look at a simple graph of the normal distribution to help visualize these concepts:


In [None]:
import plotly.graph_objects as go
import numpy as np

# Generate data for a standard normal distribution
x = np.linspace(-3, 3, 100)
y = 1 / np.sqrt(2 * np.pi) * np.exp(-0.5 * x**2)

# Create the plot
fig = go.Figure()

# Add the normal distribution curve
fig.add_trace(go.Scatter(x=x, y=y, mode='lines', name='Normal Distribution', line=dict(color='blue', width=2)))

# Define shaded regions
shades = [
    {'xrange': (-1, 1), 'color': 'rgba(255,0,0,0.3)', 'name': '68%'},
    {'xrange': (-2, 2), 'color': 'rgba(0,255,0,0.2)', 'name': '95%'}
]

# Add shaded areas
for shade in shades:
    x_shade = np.linspace(shade['xrange'][0], shade['xrange'][1], 100)
    y_shade = 1 / np.sqrt(2 * np.pi) * np.exp(-0.5 * x_shade**2)
    fig.add_trace(go.Scatter(x=np.concatenate([x_shade, x_shade[::-1]]),
                             y=np.concatenate([y_shade, [0]*len(y_shade)]),
                             fill='tozeroy', fillcolor=shade['color'],
                             line=dict(color='rgba(255,255,255,0)'),
                             name=shade['name']))

# Update layout
fig.update_layout(title='Normal Distribution',
                  xaxis_title='Standard Deviations from Mean',
                  yaxis_title='Probability Density',
                  showlegend=True)

fig.show()


## Key Points Summary
-   Measures of central tendency (mean, median, mode) provide a single value to represent the typical or central point in a dataset.
-   Measures of dispersion (range, variance, standard deviation) quantify the spread or variability of data points around the central tendency.
-   Percent change and percent difference calculations are crucial for comparing values over time or between different groups.
-   Confidence intervals offer a range of plausible values for population parameters, reflecting the uncertainty in our estimates.
-   Frequency distributions and percentages summarize categorical data, widely used in market research, social sciences, and many other fields.
-   Data visualization techniques, such as histograms, bar charts, and line plots, help in understanding and communicating statistical findings effectively across disciplines.
-   The choice of descriptive statistic depends on the data type, distribution, and specific research question.
-   Python libraries like Pandas, NumPy, and Matplotlib provide powerful tools for performing descriptive statistical analyses on diverse datasets.
-   Interpreting descriptive statistics requires consideration of the context, limitations, and potential biases in the data collection process.
-   While examples in this chapter draw from health data, the descriptive statistics methods presented are foundational and applicable to all areas of data science and research.

## Game: Legend of the Recursive Dragon (Stats)

In [None]:
json_url = "https://github.com/brendanpshea/computing_concepts_python/raw/main/lotr/lotrd_stats.json"
!wget "https://github.com/brendanpshea/computing_concepts_python/raw/main/lotr/lotr.py" -q -nc

from lotr import *
start_game(json_url)


## Glossary
| Term | Definition |
|------|------------|
| bar graph | A chart that uses rectangular bars with heights or lengths proportional to the values they represent. Useful for comparing categories or groups.|
| Bimodal | Describes a distribution with two distinct peaks or modes, indicating two frequently occurring values or clusters in the data.|
| confidence interval | A range of values that likely contains an unknown population parameter, based on a sample statistic and associated with a specific confidence level.|
| confidence level | The probability that the method used to calculate an interval produces an interval containing the true population parameter. Often expressed as a percentage, e.g., 95%.|
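| critical value | A cutoff from a reference distribution (such as the standard normal or t-distribution) determined by the chosen confidence or significance level. Multiplied by the standard error, it gives the margin of error of a confidence interval; in hypothesis testing, it is the point on the scale of the test statistic beyond which we reject the null hypothesis.|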
| DataFrame | A two-dimensional labeled data structure in pandas, similar to a spreadsheet or SQL table, with columns of potentially different types.|
| df.head() | A pandas method that returns the first few rows of a DataFrame, useful for quickly inspecting the data structure and content.|
| df.iloc[0] | Accesses the first row of a pandas DataFrame using integer-location based indexing.|
| df.iloc[-1] | Retrieves the last row of a pandas DataFrame using negative indexing with the iloc accessor.|
| df['column'].mean() | Calculates the arithmetic average of all values in the specified column of a pandas DataFrame.|
| df['column'].median() | Computes the middle value of the specified column when the data is arranged in ascending or descending order.|
| df['column'].mode() | Identifies the most frequently occurring value(s) in the specified column of a DataFrame.|
| df['column'].std() | Calculates the standard deviation, a measure of variability or dispersion, for the specified column.|
| df['column'].var() | Computes the variance, which quantifies the average squared deviation from the mean, for the specified column.|
| empirical rule | States that for a normal distribution, approximately 68% of data falls within one standard deviation of the mean, 95% within two, and 99.7% within three.|
| frequency | The number of times a particular value or category occurs in a dataset.|
| histogram | A graphical representation of the distribution of numerical data, where rectangular bars represent the frequency of data points falling into specified ranges or bins.|
| line plot | A type of graph that displays information as a series of data points connected by straight line segments, often used to show trends over time.|
| Mean | The arithmetic average of a set of numbers, calculated by summing all values and dividing by the count of values.|
| Median | The middle value in a sorted list of numbers. For an even number of values, it's the average of the two middle numbers.|
| Mode | The value that appears most frequently in a dataset. A dataset can have one mode, multiple modes, or no mode.|
| Multimodal | Describes a distribution with more than one distinct peak or mode. A bimodal distribution is the special case with exactly two.|
| normal distribution | A symmetric, bell-shaped probability distribution where data tends to cluster around the mean, with decreasing frequency as values move away from the center.|
| np.random.seed(number) | A NumPy function that sets the seed of NumPy's global random number generator, so that the same sequence of random numbers is produced each time the code runs.|
| numpy | A fundamental Python library for numerical computing, providing support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions.|
| percent change | The difference between an ending value and a starting value, expressed as a percentage of the starting value: (end - start) / start × 100. Measures relative change over time.|
| percent difference | The absolute difference between two values, expressed as a percentage of their average. Used to compare two quantities without designating either as a baseline.|
| percentage | A way of expressing a number as a fraction of 100, often used to describe proportions or relative frequencies.|
| plotly | An interactive graphing library for Python that creates web-based visualizations, suitable for both static and dynamic data representation.|
| px.bar() | A Plotly Express function for creating bar charts, useful for comparing categorical data or showing the distribution of a discrete variable.|
| px.histogram() | A Plotly Express function that generates histograms, visualizing the distribution of a continuous variable by dividing it into bins.|
| px.line() | A Plotly Express function for creating line plots, typically used to show trends or changes over time in continuous data.|
| Range | The difference between the maximum and minimum values in a dataset, providing a simple measure of spread or variability.|
| relative frequency | The ratio of the frequency of a particular outcome to the total number of outcomes, often expressed as a decimal or percentage.|
| Standard Deviation | A measure of variability in a dataset, calculated as the square root of the variance. It describes the typical distance between data points and the mean, in the same units as the data.|
| Variance | A measure of variability in a dataset, calculated as the average squared deviation from the mean. It quantifies how far a set of numbers is spread out from its average.|
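
Several glossary entries can be tried together in a few lines. The sketch below is only an illustration: the `steps` column and its values are invented toy data (not taken from the chapter), and `percent_change` and `percent_difference` are hypothetical helper names written out from the definitions above rather than library functions.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration): daily steps, in thousands.
df = pd.DataFrame({"steps": [4, 8, 6, 5, 3, 8, 8]})

first_row = df.iloc[0]     # first row, by integer position
last_row = df.iloc[-1]     # last row, via negative indexing

# Measures of central tendency
mean_steps = df["steps"].mean()
median_steps = df["steps"].median()
modes = df["steps"].mode().tolist()   # mode() returns a Series because ties are possible

# Measures of spread
value_range = df["steps"].max() - df["steps"].min()
sample_var = df["steps"].var()            # pandas default: ddof=1 (divide by n - 1)
sample_std = df["steps"].std()            # square root of the sample variance
population_var = df["steps"].var(ddof=0)  # divide by n instead


def percent_change(start, end):
    """Hypothetical helper: change relative to the starting value."""
    return (end - start) / start * 100


def percent_difference(a, b):
    """Hypothetical helper: absolute difference relative to the average of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


print("mean:", mean_steps, "median:", median_steps, "mode(s):", modes)
print("range:", value_range)
print("sample variance:", round(sample_var, 2), "sample std:", round(sample_std, 2))
print("population variance:", round(population_var, 2))
print("percent change 50 -> 60:", round(percent_change(50, 60), 1))
print("percent difference 50 vs 60:", round(percent_difference(50, 60), 1))
```

For these seven values the mean and median are both 6 and the only mode is 8. The sample variance (26/6) is slightly larger than the population variance (26/7), which is why it matters which formula a library uses by default.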
