# Parameters and statistics

In the last section we learned about populations, samples, and their respective distributions. But throughout your career, people will frequently be interested in metrics -- single numbers which summarize something important, like [example of important real-world metric on a population.]

## Population parameter

The {dterm}`population parameter` refers to a desired metric of a population and, just like the population distribution, a parameter is usually considered *fixed*.

Remember that the *p*opulation has a *p*arameter since they both start with *p*.

Using our population of fish weights, we could ask about parameters such as the mean or maximum weight, or about average variability of fish weights. For all of these potential parameters, the population produces a single value.

In [None]:
import babypandas as bpd
import numpy as np

population = bpd.read_csv('../../data/fish_kg_cm.csv')

In [None]:
print('Population Mean:    ', population.get('WEIGHT').mean())
print('Population Max:     ', population.get('WEIGHT').max())
print('Population Variance:', np.var(population.get('WEIGHT')))

But we've already realized that we can't expect to measure entire populations in the real world, so we'll need to work with samples instead.

## Sample statistic

When the metric used to calculate a population parameter is used on a sample, we call it the {dterm}`sample statistic`. Just like sample distributions, a sample statistic is subject to random chance depending on what group of individuals we sample!

Remember that the *s*ample produces a *s*tatistic since they both start with *s*.

Just like parameters, we could calculate statistics such as the mean, max, or variance, and we'll receive a single value. But we should expect these values to differ each time we draw a new sample -- even when the sample size stays the same.

In [None]:
sample = population.sample(100)

print('Sample Mean:    ', sample.get('WEIGHT').mean())
print('Sample Max:     ', sample.get('WEIGHT').max())
print('Sample Variance:', np.var(sample.get('WEIGHT')))

Ideally, we'd like to be able to use a sample statistic to provide us with an educated guess for the true population parameter. Unfortunately, it seems like the sample statistic doesn't always match the parameter...

In [None]:
pop_mean = population.get('WEIGHT').mean().round(2)
pop_max  = population.get('WEIGHT').max().round(2)
pop_var  = np.var(population.get('WEIGHT')).round(2)

# Collect a handful of samples and keep track of various sample statistics for each

sample_means = []
sample_maxes = []
sample_vars  = []

for i in range(5):
    
    sample = population.sample(100)
    
    sample_means.append(sample.get('WEIGHT').mean().round(2))
    sample_maxes.append(sample.get('WEIGHT').max().round(2))
    sample_vars.append(np.var(sample.get('WEIGHT')).round(2))
    
print('Pop Mean:        ', pop_mean)
print('Sample Means:    ', sample_means)
print('Pop Max:         ', pop_max)
print('Sample Maxes:    ', sample_maxes)
print('Pop Variance:    ', pop_var)
print('Sample Variances:', sample_vars)

Based on what we've seen so far, it seems unlikely that the sample max will be the same as the population max, but the sample mean appears consistently close to the population mean.

How close is our statistic to the parameter? How consistent is it? With what probability will the statistic equal the population parameter (within some margin-of-error, like +/- 1 gram)?

With an understanding of formal mathematics and probability theory we can answer these questions! In the meantime, we can use the same approach from our introduction to probabilities: just run a simulation!

## Sampling distribution

Using the same general steps for simulation as we learned before, we can run an experiment to select a random sample of n=100 from the population and see what the resulting sample statistic is.

In [None]:
# Write the code for a single trial
def sample_mean(n):
    return population.sample(n).get('WEIGHT').mean()

sample_mean(100)

In [None]:
# Call the trial function a lot of times and keep track of the results
sample_means = []

for i in range(10_000):
    sample_means.append(sample_mean(100))

We can now calculate a specific experimental probability, like the probability that the sample mean is within ±0.01 of the population mean. Or, better yet, we can enable ourselves to answer lots of questions about the sample statistic by plotting the *sampling distribution*.
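As a minimal sketch of that experimental probability: the code below counts the fraction of simulated sample means that land within a margin of the population mean. It doesn't read the fish data; `toy_population` is a made-up stand-in generated with NumPy, and every `toy_` name, as well as `margin`, is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# ASSUMPTION: a made-up population of 5,000 weights (kg), standing in for the fish data
toy_population = rng.gamma(shape=2.0, scale=1.5, size=5_000)
toy_pop_mean = toy_population.mean()

# Simulate 10,000 sample means, each from a random sample of 100 (without replacement)
toy_sample_means = np.array([
    rng.choice(toy_population, size=100, replace=False).mean()
    for _ in range(10_000)
])

# Experimental probability = fraction of trials that land within the margin
margin = 0.01
is_within_margin = np.abs(toy_sample_means - toy_pop_mean) <= margin
prob_within_margin = is_within_margin.mean()

print('Estimated P(sample mean within', margin, 'of population mean):', prob_within_margin)
```

To try the same idea on the fish simulation, you could convert the `sample_means` list with `np.array(sample_means)` and compare it against `population.get('WEIGHT').mean()` instead of the toy values.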

The {dterm}`sampling distribution` is the distribution of all possible values of a sample statistic for a fixed population, metric, and sample size. For example, as a result of the experiment above we can plot the sampling distribution of mean fish weights at the London Zoo when the sample size is 100.

Because we're running an experiment instead of calculating the theoretical probabilities, this distribution is considered *empirical*.

In [None]:
sample_means_series = bpd.Series(data=sample_means)
sample_means_series.plot(kind='hist', density=True,
                         title="Empirical sampling distribution of mean fish weight, n=100")

On top of this plot we can overlay the true population mean.

In [None]:
ax = sample_means_series.plot(kind='hist', density=True)
# Add a vertical line at the population mean, with a red color
ax.axvline(x=population.get('WEIGHT').mean(), c='r')

We've experimentally discovered that the sampling distribution of the sample mean seems centered on the population mean -- so on average we should get a sample mean close to the population mean.
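We can also check that centering numerically rather than by eye. The sketch below averages many simulated sample means and compares the result to the population mean. It uses a made-up stand-in population (`toy_population`) rather than the fish data, so all of the names and numbers here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# ASSUMPTION: synthetic weights (kg) standing in for the fish population
toy_population = rng.gamma(shape=2.0, scale=1.5, size=5_000)
toy_pop_mean = toy_population.mean()

# Simulate 5,000 sample means, each from a random sample of 100
toy_sample_means = np.array([
    rng.choice(toy_population, size=100, replace=False).mean()
    for _ in range(5_000)
])

# The center of the empirical sampling distribution
center_of_sample_means = toy_sample_means.mean()

print('Population mean:            ', round(toy_pop_mean, 3))
print('Average of the sample means:', round(center_of_sample_means, 3))
print('Difference:                 ', round(center_of_sample_means - toy_pop_mean, 3))
```

If the sampling distribution really is centered on the parameter, the difference printed on the last line should be small compared to the spread of the individual sample means.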

For any metric we choose, a sampling distribution exists for every possible sample size! Let's put our experiment into a function and use it to find some sampling distributions for the max weight.

In [None]:
def sample_max(n):
    return population.sample(n).get('WEIGHT').max()

def sample_max_distribution(sample_size, ax=None):
    
    sample_maxes = []
    
    for i in range(1000):
        sample_maxes.append(sample_max(sample_size))
        
    sample_maxes_series = bpd.Series(data=sample_maxes)
    
    ax = sample_maxes_series.plot(kind='hist', density=True, ax=ax)
    ax.axvline(x=population.get('WEIGHT').max(), c='r')
    
    return ax

Why are we only going to change the sample size?

It's worth mentioning that the number of trials of our experiment won't affect the overall shape of the resulting sampling distribution -- only the granularity of it. Additionally, the population is fixed when we're looking at the sampling distribution, so the only thing left to change is the sample size.

In [None]:
sample_max_distribution(500)

**Question**:
 Want to convince yourself that the number of trials really doesn't change our sampling distribution? See if you can modify the function above to add a `number_of_trials` argument and check it over a handful of values!

<details><summary><b>Answer</b>:</summary>None</details>
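One possible way to explore this question without plotting is sketched below: simulate the sample max for several values of `number_of_trials` and compare the center and spread of the results. This is only a sketch on a made-up stand-in population (`toy_population`), and the helper `simulate_sample_maxes` is a hypothetical name, not part of the code above.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# ASSUMPTION: synthetic weights (kg) standing in for the fish population
toy_population = rng.gamma(shape=2.0, scale=1.5, size=5_000)

def simulate_sample_maxes(sample_size, number_of_trials):
    """Run the experiment number_of_trials times and return every sample max."""
    return np.array([
        rng.choice(toy_population, size=sample_size, replace=False).max()
        for _ in range(number_of_trials)
    ])

# Keep the sample size fixed and vary only the number of trials
summaries = {}
for number_of_trials in [100, 1_000, 10_000]:
    toy_maxes = simulate_sample_maxes(500, number_of_trials)
    summaries[number_of_trials] = (toy_maxes.mean(), toy_maxes.std())
    print(number_of_trials, 'trials -> center:', round(toy_maxes.mean(), 2),
          ' spread:', round(toy_maxes.std(), 2))
```

The center and spread should come out roughly the same on every row. Extra trials mainly make the estimates (and the histogram) less noisy.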

It so happens that the sample size *does* have a pretty profound effect on most sampling distributions. Let's see what happens when we increase the sample size.

In [None]:
# You don't need to understand this plotting code, but congrats if you do :)
import matplotlib.pyplot as plt

def plot_three_sampling_distributions(sample_sizes):
    """
    Simulates the sampling distribution of sample max for three different sample
    sizes, then plots them side-by-side.
    """

    # Create a figure to hold three charts on the same row
    fig, axes = plt.subplots(1, 3, sharey=True,
                             constrained_layout=True, figsize=(10, 3))
    fig.suptitle('Sampling Distributions of Max Weight')

    # Simulate the sampling distribution for each sample size
    for i in range(len(sample_sizes)):
        
        sample_max_distribution(sample_sizes[i], ax=axes[i])
        axes[i].set_title('n='+str(sample_sizes[i]))

In [None]:
plot_three_sampling_distributions([50, 150, 300])

As the sample size increases, the sampling distribution changes noticeably -- this is the case for practically any metric we choose to study. Once again, a larger sample size gives us a better result (in this case, a more consistent one).
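To put a number on that change, the sketch below measures how far the sample max falls below the population max, on average, for the same three sample sizes. It runs on a made-up stand-in population (`toy_population`), not the fish data, so treat the names and the printed values as illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# ASSUMPTION: synthetic weights (kg) standing in for the fish population
toy_population = rng.gamma(shape=2.0, scale=1.5, size=5_000)
toy_pop_max = toy_population.max()

average_gap = {}
for n in [50, 150, 300]:
    # Simulate 2,000 sample maxes for this sample size
    toy_sample_maxes = np.array([
        rng.choice(toy_population, size=n, replace=False).max()
        for _ in range(2_000)
    ])
    # How far below the population max does the sample max fall, on average?
    average_gap[n] = toy_pop_max - toy_sample_maxes.mean()
    print('n =', n, '-> average gap to population max:', round(average_gap[n], 2))
```

The gap should shrink as `n` grows, since a bigger sample has a better chance of including the heaviest individuals.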

## All distributions and metrics in context

And so, our illustration of our different distributions from the previous section grows to include a third panel.

We start with a true population distribution for a given feature, whose shape we don't know, and this population has a true parameter, like the population mean -- it might look like this:

![population distribution, parameter]

From the population distribution many possible sample distributions are drawn, but they'll usually resemble the population:

![some sample distributions]

Each sample distribution produces a single statistic, like the sample mean. The distribution of these sample statistics forms our sampl*ing* distribution:

![sample statistics feeding into the sampling distribution]

Every metric has a different-looking sampling distribution, which will typically become more consistent and reliable as the sample size increases.
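The whole pipeline can be condensed into one function that accepts the metric as an argument. The sketch below is one hedged way to write it, shown here for the mean and the variance. The helper `simulate_sampling_distribution` and the stand-in `toy_population` are hypothetical names introduced for illustration, and the data is synthetic rather than the fish weights.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# ASSUMPTION: synthetic weights (kg) standing in for the fish population
toy_population = rng.gamma(shape=2.0, scale=1.5, size=5_000)

def simulate_sampling_distribution(metric, sample_size, number_of_trials=2_000):
    """Apply `metric` to many random samples and return the resulting statistics."""
    return np.array([
        metric(rng.choice(toy_population, size=sample_size, replace=False))
        for _ in range(number_of_trials)
    ])

# Same population, two metrics, two sample sizes
spreads = {}
for name, metric in [('mean', np.mean), ('variance', np.var)]:
    for n in [50, 500]:
        toy_statistics = simulate_sampling_distribution(metric, n)
        spreads[(name, n)] = toy_statistics.std()
        print(name, '| n =', n,
              '| parameter:', round(metric(toy_population), 2),
              '| spread of statistics:', round(spreads[(name, n)], 3))
```

For both metrics the spread of the simulated statistics should be smaller at `n = 500` than at `n = 50`, which is the "more consistent" behavior described above.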