## Practical basics

It is easy to calculate the quantities we encounter most often in statistics!

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import helpers.plotting as pt
pt.enable_slide_theme()
pt.import_lato_font_in_notebook()

In [2]:
from helpers.svg_wrapper import SVGImg

In [3]:
# test: simple example to calculate mentally

coins_in_pocket = [5, 10, 10, 20, 20]

# other easy to calculate examples
#coins_in_pocket = [ 2, 10, 10, 10, 50, 50]
#coins_in_pocket = [1, 1, 2, 2, 10, 20]

# note: np.var and np.std divide by n by default (ddof=0)
print("Sum of coins:        ", np.sum(coins_in_pocket))
print("Mean:                ", np.mean(coins_in_pocket))
print("Median:              ", np.median(coins_in_pocket))
print("Sum of squared devs.:",
    np.sum(
        (np.array(coins_in_pocket) - np.mean(coins_in_pocket))**2
    )
)
print("Var:                 ", np.var(coins_in_pocket))
print("Std:                 ", np.std(coins_in_pocket))

Sum of coins:         65
Mean:                 13.0
Median:               10.0
Sum of squared devs.: 180.0
Var:                  36.0
Std:                  6.0


### How to characterise a sample of quantitative observations?

- **Central tendency**: where are the observed values?
- **Dispersion**: how far apart are the observed values?

### For example, imagine drawing coins from a pocket

- How much is each new coin worth?
- How certain are we about that value?

Let's start with an example of 5 coins...

### Measures of central tendency

In [4]:
SVGImg('images/coins_central_tendency.svg', width='70%', output_dir='slides')

### *The median of an even number of observations

If there is an even number of observations in our sample, <br>the median is the mean of the two central values (after sorting):
- median([5,5,**10**,20,20]) = **10**
- median([5,5,**10**,**20**,20,20]) = (**10**+**20**) / 2 = 15

See e.g. https://en.wikipedia.org/wiki/Central_tendency for details <br>on these and other measures of central tendency.
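
The rule for odd and even sample sizes can be sketched in a few lines. The helper `median_by_hand` is a name made up for this illustration; `np.median` should give the same results.

```python
import numpy as np

odd_sample = [5, 5, 10, 20, 20]
even_sample = [5, 5, 10, 20, 20, 20]

def median_by_hand(values):
    """Sort the values and average the one or two central ones."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd number of observations: a single central value
        return float(ordered[mid])
    # even number of observations: mean of the two central values
    return (ordered[mid - 1] + ordered[mid]) / 2

median_odd = median_by_hand(odd_sample)
median_even = median_by_hand(even_sample)

print("Odd sample: ", median_odd, np.median(odd_sample))
print("Even sample:", median_even, np.median(even_sample))
```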

### Measures of dispersion

In [5]:
SVGImg('images/coins_dispersion.svg', width='70%', output_dir='slides')

See e.g. https://en.wikipedia.org/wiki/Statistical_dispersion for details <br>on these and other measures of dispersion.

### We usually have incomplete information about the world

- We can know the values of all the coins in our pocket.
- What about all the coins in a vault? We might not have the time to look at them all!
- We can still estimate the true statistics of a **population** <br> (e.g. all coins in a vault, all cards in a stack, ...) <br>from a representative **sample** that we draw in an experiment.
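
As a rough sketch of this idea, we can simulate a vault and look at only a small random sample of it. The vault contents, the denominations and the sample size below are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 coins with made-up denominations (in cents)
denominations = [1, 2, 5, 10, 20, 50]
vault = rng.choice(denominations, size=100_000)

# Representative sample: 1,000 coins drawn at random, without replacement
sample = rng.choice(vault, size=1_000, replace=False)

population_mean = vault.mean()
sample_mean = sample.mean()

print("Population mean (usually unknown):", population_mean)
print("Sample mean (our estimate):       ", sample_mean)
```

The two numbers should be close, even though the sample covers only 1% of the vault.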

### The sample estimate of the population mean<br> is just the sample mean!

- The sample mean is the sum of the observed values divided by the number of observations.
- This is exactly the mean that we calculated above for the values of the five coins!

### The sample variance is a biased estimate of the population variance!

In [6]:
SVGImg('images/population_vs_sample.svg', width='100%', output_dir='slides')
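
A minimal sketch of the usual correction (Bessel's correction), assuming this is what the figure illustrates: dividing the sum of squared deviations by n − 1 instead of n. In NumPy this corresponds to the `ddof=1` argument.

```python
import numpy as np

coins = np.array([5, 10, 10, 20, 20])
n = len(coins)

# sum of squared deviations from the sample mean
sum_sq_dev = np.sum((coins - coins.mean()) ** 2)

# dividing by n gives the (biased) variance of the sample itself
sample_var = sum_sq_dev / n

# dividing by n - 1 gives the corrected estimate of the population variance
estimated_pop_var = sum_sq_dev / (n - 1)

print("Variance of the sample (ddof=0): ", sample_var, np.var(coins))
print("Estimated population var (ddof=1):", estimated_pop_var, np.var(coins, ddof=1))
```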

In [7]:
%%html
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
    "HTML-CSS" : {
        mtextFontInherit: true,
    }
});
</script>

### All estimates have errors

We often want to estimate the population mean from a sample and to know how precise that estimate is. This precision is quantified by the "standard error", which is calculated as follows $^{*▾}$:

$\text{Standard Error} = \sqrt{\frac{\text{Population Variance}}{\text{Sample Size}}}$.

With the estimated population variance of our 5 coins (the sum of squared deviations, 180, divided by 5 − 1, which gives 45), we obtain:

$\text{Standard Error (coin example)} = \sqrt{\frac{45}{5}} = \sqrt{9} = 3$.
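
The same calculation as a short NumPy sketch. The variable names are chosen for this illustration only.

```python
import numpy as np

coins = np.array([5, 10, 10, 20, 20])
sample_size = len(coins)

# estimated population variance (ddof=1 divides by n - 1)
estimated_pop_var = np.var(coins, ddof=1)

# standard error of the mean
standard_error = np.sqrt(estimated_pop_var / sample_size)

print("Estimated population variance:", estimated_pop_var)
print("Standard error of the mean:   ", standard_error)
```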

### *The standard error is still only an approximation 

- It can also be written as the population standard deviation divided by the square root of the number of observations.
- In contrast to the corrected variance estimate, the [standard deviation estimate still has a bias](https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation) because the square root is a nonlinear function.
- There are specific corrections for a number of scenarios.
- Knowing approximately how precise your estimate of the mean is remains much better than knowing nothing about it!
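
The remaining bias of the standard deviation can be made visible with a small simulation. This is only a sketch under assumed settings (normally distributed values, samples of size 5, a true standard deviation of 1), none of which come from the coin example.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the sketch is reproducible

true_std = 1.0       # assumed population standard deviation
sample_size = 5      # small samples make the bias easy to see
n_repeats = 100_000  # number of simulated experiments

# one row per simulated experiment
samples = rng.normal(loc=0.0, scale=true_std, size=(n_repeats, sample_size))

# average of the corrected estimates over all experiments
mean_var_estimate = samples.var(axis=1, ddof=1).mean()
mean_std_estimate = samples.std(axis=1, ddof=1).mean()

print("True variance:", true_std**2, " average estimate:", mean_var_estimate)
print("True std:     ", true_std, " average estimate:", mean_std_estimate)
```

The averaged variance estimate should land close to the true value, while the averaged standard deviation estimate should come out noticeably below 1.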

### We now can answer our initial question:

**Mean ± Standard Error** is a common way to state an estimate and its uncertainty. 

- In the example with 5 coins, we can say that the expected value of a coin taken out of the pocket is 13 ± 3 cents.

- Some applications require higher confidence levels. In these cases, 2 or more standard errors are [reported](https://en.wikipedia.org/wiki/68–95–99.7_rule).
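
As an illustration, the ranges for 1, 2 and 3 standard errors in the coin example can be computed as follows. The dictionary `intervals` is just a convenient container for this sketch.

```python
import numpy as np

coins = np.array([5, 10, 10, 20, 20])
mean = coins.mean()
se = coins.std(ddof=1) / np.sqrt(len(coins))

# lower and upper bound for 1, 2 and 3 standard errors around the mean
intervals = {k: (mean - k * se, mean + k * se) for k in (1, 2, 3)}

for k, (low, high) in intervals.items():
    print(f"mean ± {k} SE: {low:.1f} to {high:.1f} cents")
```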

### We often show one standard error (SE) as error bars

- **1SE error bars help to judge whether the means from two samples are significantly different**.
- If the bars don't overlap, the observed means are more than 2SE apart (for samples with similar standard errors).
- Because of this, 1SE is a very common size for error bars.
- This is what we looked at in the card experiment in [section one](http://localhost:8000/1_cards.slides.html).

In [8]:
SVGImg('images/error_bars.svg', width='50%', output_dir='slides')
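
The overlap check can be sketched numerically. The second pocket below is a hypothetical sample invented for this comparison, and `mean_and_se` is a helper defined only for this example.

```python
import numpy as np

def mean_and_se(values):
    """Return the sample mean and its standard error (using ddof=1)."""
    values = np.asarray(values, dtype=float)
    se = np.std(values, ddof=1) / np.sqrt(len(values))
    return values.mean(), se

pocket_a = [5, 10, 10, 20, 20]   # the coin example from above
pocket_b = [20, 25, 25, 35, 35]  # hypothetical second sample

mean_a, se_a = mean_and_se(pocket_a)
mean_b, se_b = mean_and_se(pocket_b)

# 1SE bars overlap if the means are at most SE_a + SE_b apart
difference = abs(mean_a - mean_b)
bars_overlap = difference <= se_a + se_b

print(f"Pocket A: {mean_a:.1f} ± {se_a:.1f}")
print(f"Pocket B: {mean_b:.1f} ± {se_b:.1f}")
print("1SE error bars overlap:", bars_overlap)
```

Here both standard errors are about 3 cents, so bars that do not overlap mean the two means are more than about 2SE apart.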

### The mean ± two standard errors approximates the 95% confidence interval

- If we repeated the experiment many times, about 95% of the 95% confidence intervals (CIs) calculated this way would contain the population mean.
- Sometimes used as the size for error bars - always check the description of a figure!
- Allows us to test whether a **mean is different from an exact number**.

In [9]:
SVGImg('images/error_bars_ci.svg', width='50%', output_dir='slides')
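
A quick simulation can show how often the range of mean ± 2 standard errors contains the population mean. The population parameters, the sample size and the number of repetitions are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # fixed seed so the sketch is reproducible

pop_mean, pop_std = 0.0, 1.5  # assumed normal population
sample_size = 50
n_repeats = 10_000            # number of simulated experiments

# one row per simulated experiment
samples = rng.normal(pop_mean, pop_std, size=(n_repeats, sample_size))

sample_means = samples.mean(axis=1)
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(sample_size)

# does mean ± 2 SE contain the true population mean?
covered = np.abs(sample_means - pop_mean) <= 2 * standard_errors
coverage = covered.mean()

print(f"Fraction of intervals containing the population mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95.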

### Try it!

In [19]:
# infinite population & random sample
# ===================================

# inputs
plot_width  = 500
max_samples = 200
input_mean = alt.binding(
    input='range',
    min=-5,
    max=5,
    step=.1,
    name='Pop. mean'
)
input_std = alt.binding(
    input='range',
    min=0.1,
    max=5,
    step=.1,
    name='Pop. Std.'
)
input_samples = alt.binding(
    input='range',
    min=10,
    max=max_samples,
    step=1,
    name='Sample size'
)
input_uncertainty = alt.binding(
    input='range',
    min=1,
    max=3,
    step=1,
    name='Show std. errors'
)
mean_selection = alt.selection_single(
    bind=input_mean,
    init={'mean': 0}
)
std_selection = alt.selection_single(
    bind=input_std,
    init={'std': 1.5}
)
samples_selection = alt.selection_single(
    bind=input_samples,
    init={'samples': 50}
)
uncertainty_selection = alt.selection_single(
    bind=input_uncertainty,
    init={'ses': 1}
)


scale = alt.Scale(domain=[-10,10])

# simple plot with population mean + population error
pop_mean_chart = alt.Chart(
    data=pd.DataFrame({'x': [1]})
)
title_only_ax_kwargs = dict(
    ticks=False,
    labels=False,
    #grid=False,
    domain=False,
    orient='top'
)
pop_mean = (
    pop_mean_chart.mark_errorbar(color='red', rule=alt.MarkConfig(size=1.5), ticks=True).encode(
        x=alt.X(
            "xmin:Q", 
            axis=alt.Axis(
                title="Population mean ± expected uncertainty", **title_only_ax_kwargs
            )
        ),
        x2="xmax:Q",
    ) + pop_mean_chart.mark_point(color='red', size=20, filled=True, stroke=None).encode(
        x=alt.X("mean:Q", scale=scale, axis=alt.Axis(title=""))
    )
).add_selection(
    samples_selection, std_selection, mean_selection, 
).transform_calculate(
  mean = mean_selection['mean'],
  std  = std_selection['std'],
  samples = samples_selection['samples'],
  ses = uncertainty_selection['ses']
).transform_calculate(
    xmin='+datum.mean - datum.ses * datum.std / sqrt(+datum.samples)',
    xmax='+datum.mean + datum.ses * datum.std / sqrt(+datum.samples)',
).properties(
    view=alt.ViewConfig(strokeWidth=0), width=plot_width
)

# combined chart with different ways to represent sample 
data = alt.sequence(0, max_samples, as_='t')
sample_chart = alt.Chart(data)

samp_mean = (
    sample_chart.mark_errorbar(rule=alt.MarkConfig(size=1.5), ticks=True).encode(
        x=alt.X(
            "bar_min:Q", 
            scale=scale, 
            axis=alt.Axis(
                title="Sample mean ± sample uncertainty", **title_only_ax_kwargs
            )
        ),
        x2='bar_max:Q'
    ) + sample_chart.mark_point(size=20, filled=True, stroke=None).encode(
        x='sample_mean:Q'
    )
).transform_aggregate(
        sample_mean = 'mean(x)',
        sample_err = 'stderr(x)',
).transform_calculate(
    ses = uncertainty_selection['ses'],
    bar_min = '+datum.sample_mean - datum.ses * datum.sample_err',
    bar_max = '+datum.sample_mean + datum.ses * datum.sample_err'
).properties(
    view=alt.ViewConfig(strokeWidth=0), width=plot_width
)

hist = sample_chart.mark_bar(clip=True).encode(
    alt.X("x:Q", bin=alt.Bin(), scale=scale, axis=alt.Axis(title='Sample Histogram', **title_only_ax_kwargs)),
    y=alt.Y('count()', title='Count'),
).properties(
    width=plot_width,
    height=100
)

ticks = sample_chart.mark_tick(clip=True).encode(
    x=alt.X('x:Q', scale=scale, axis=alt.Axis(title='Sample Values'))
).properties(
    view=alt.ViewConfig(strokeWidth=0), width=plot_width
)

combined_sample = (
    samp_mean & hist & ticks
).add_selection(
    uncertainty_selection, samples_selection, std_selection, mean_selection
).transform_filter(
    alt.datum.t < alt.expr.toNumber(samples_selection.samples),  # t starts at 0, so "<" keeps exactly `samples` rows
).transform_calculate(
    mean = mean_selection['mean'],
    std  = std_selection['std'],
    x='sampleNormal(+datum.mean,+datum.std)', # the plus casts to number
)

(pop_mean & combined_sample).display(renderer='svg')

Each time you change the parameters, new samples are drawn from a normal distribution.<br>
The normal distribution is also called the Gaussian distribution or bell curve, and we will hear more about it later.

## In the next session,

we will take a deeper dive into…

- Comparing two different statistical populations 
- Statistical significance and p-values
- Statistical power and effect sizes

---

## Exercises

In [20]:
# Analyse any of the previous examples using software you are familiar with.
# Create a figure with error bars