## Practical basics

It is easy to calculate the quantities we encounter most often in statistics!

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import helpers.plotting as pt
from helpers.svg_wrapper import SVGImg
from helpers.pracical_basics import plot_population_vs_sample_mean, print_example_statistics
pt.enable_slide_theme()
pt.import_lato_font_in_notebook()

### How to characterise a sample of quantitative observations?

- **Central tendency**: where are the observed values?
- **Dispersion**: how far apart are the observed values?

### For example, imagine drawing coins from a pocket

- How much is each new coin worth?
- How certain are we about that value?

Let's start with an example of 5 coins...

### Measures of central tendency

In [2]:
SVGImg('images/coins_central_tendency.svg', width='70%', output_dir='slides')

## *

With an odd number of observations, the median is the central value. With an even number, <br>it lies halfway between the two central ones:
- median([5,5,**10**,20,20]) = **10**
- median([5,5,**10**,**20**,20,20]) = (**10**+**20**) / 2 = 15

See e.g. https://en.wikipedia.org/wiki/Central_tendency for details <br>on these and other measures of central tendency.
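
The two median examples above can be checked with Python's built-in `statistics` module. This is only a minimal sketch: the variable names are illustrative and not part of the course helpers.

```python
from statistics import mean, median

odd_sample = [5, 5, 10, 20, 20]       # five values: one central value
even_sample = [5, 5, 10, 20, 20, 20]  # six values: two central values

odd_median = median(odd_sample)    # the middle value of the sorted list
even_median = median(even_sample)  # halfway between the two middle values
odd_mean = mean(odd_sample)        # sum of values divided by their number

print("median (odd): ", odd_median)
print("median (even):", even_median)
print("mean (odd):   ", odd_mean)
```

Note that the mean and the median of the same sample can differ, which is why it helps to state which measure is being reported.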

### Measures of dispersion

In [3]:
SVGImg('images/coins_dispersion.svg', width='70%', output_dir='slides')

See e.g. https://en.wikipedia.org/wiki/Statistical_dispersion for details <br>on these and other measures of dispersion.

### We usually have incomplete information about the world

- We can know the values of all the coins in our pocket.
- What about all the coins in a vault? We might not have the time to look at them all!
- We can still estimate the true statistics of a **population** <br> (e.g. all coins in a vault, all cards in a stack, ...) <br>from a representative **sample** that we draw in an experiment.

### The sample estimate of the population mean<br> is just the sample mean!

- The sample mean is the sum of the observed values divided by the number of observations.
- This is exactly the mean that we calculated above for the values of the five coins!

### The sample variance is a biased estimate of the population variance!

In [4]:
SVGImg('images/population_vs_sample.svg', width='100%', output_dir='slides')
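
One way to see the bias is to look at a tiny population that we know completely and enumerate every possible sample. The sketch below assumes a hypothetical "vault" of five coins and samples of two coins drawn with replacement. The helper `variance` and all variable names are made up for this illustration. Following the naming used in this session, "sample variance" divides by the sample size n, and the corrected estimate divides by n - 1.

```python
from itertools import product
from statistics import mean

population = [5, 10, 10, 20, 20]  # hypothetical vault that we know completely
sample_size = 2

def variance(values, ddof):
    """Sum of squared deviations from the mean, divided by (n - ddof)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - ddof)

# true variance of the whole population (divide by the population size)
population_variance = variance(population, ddof=0)

# every possible ordered sample of two coins, drawn with replacement
samples = list(product(population, repeat=sample_size))

# average over all possible samples: dividing by n vs. dividing by n - 1
avg_sample_variance = mean(variance(s, ddof=0) for s in samples)
avg_corrected_variance = mean(variance(s, ddof=1) for s in samples)

print("population variance:          ", population_variance)
print("average sample variance (n):  ", avg_sample_variance)
print("average corrected var. (n-1): ", avg_corrected_variance)
```

Under these assumptions, dividing by n underestimates the population variance on average, while dividing by n - 1 recovers it.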

In [5]:
%%html
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
    "HTML-CSS" : {
        mtextFontInherit: true,
    }
});
</script>

### All estimates have errors

We often want to estimate the population mean from a sample and to know how precise that estimate is. This precision is quantified by the "standard error", which is calculated as follows $^{*▾}$:

$\text{Standard Error} = \sqrt{\frac{\text{Population Variance}}{\text{Sample Size}}}$.

With the estimated population variance of our 5 coins, we obtain:

$\text{Standard Error (coin example)} = \sqrt{\frac{\text{45}}{\text{5}}} = \sqrt{\text{9}} = \text{3}$.
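
The same arithmetic can be written out step by step in Python. This is a sketch that assumes the five coins have the values 5, 10, 10, 20 and 20 cents, as in the commented-out sample of the exercises. The variable names are illustrative.

```python
from math import sqrt

coins = [5, 10, 10, 20, 20]  # assumed values of the five coins, in cents
n = len(coins)

sample_mean = sum(coins) / n
sum_of_squared_deviations = sum((c - sample_mean) ** 2 for c in coins)

# estimate of the population variance: divide by n - 1 instead of n
estimated_population_variance = sum_of_squared_deviations / (n - 1)

# standard error of the mean
standard_error = sqrt(estimated_population_variance / n)

print(f"mean = {sample_mean}, variance = {estimated_population_variance}, "
      f"standard error = {standard_error}")
```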

### *The standard error is still only an approximation 

- It can also be written as population standard deviation divided by the square root of the number of observations.
- In contrast to the variance, the [standard deviation still has a bias](https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation) because the square root is a curved function.
- There are specific corrections for a number of scenarios.
- Knowing approximately how precise your estimate of the mean is, is much better than knowing nothing about it!

### We now can answer our initial question:

**Mean ± Standard Error** is a common way to state an estimate and its uncertainty. 

- In the example with 5 coins, we estimate the mean value of a coin drawn from the pocket to be 13 ± 3 cents.

- Some applications require higher confidence levels. In these cases 2 or more standard errors are [reported](https://en.wikipedia.org/wiki/68–95–99.7_rule).

### We often show one standard error (SE) as error bars

- **1SE error bars show if the means from two samples are significantly different**.
- If the bars don't overlap, the observed means are > 2SE apart.
- Because of this, 1SE is a very common size for error bars.
- This is what we looked at in the card experiment in [section one](http://localhost:8000/1_cards.slides.html).
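
The overlap check described above takes only a few lines of code. The function `one_se_bars_overlap` and the numbers passed to it are hypothetical and only serve as an illustration.

```python
def one_se_bars_overlap(mean_a, se_a, mean_b, se_b):
    """Return True if the intervals mean ± 1 SE of two estimates overlap."""
    low_a, high_a = mean_a - se_a, mean_a + se_a
    low_b, high_b = mean_b - se_b, mean_b + se_b
    return low_a <= high_b and low_b <= high_a

# hypothetical estimates, not taken from the card experiment
close_estimates = one_se_bars_overlap(0.50, 0.10, 0.60, 0.10)
distant_estimates = one_se_bars_overlap(0.30, 0.05, 0.60, 0.05)

print("bars overlap (close estimates):  ", close_estimates)
print("bars overlap (distant estimates):", distant_estimates)
```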

In [6]:
SVGImg('images/error_bars.svg', width='50%', output_dir='slides')

### Two standard errors are approximately equal to the 95% confidence interval

- A 95% confidence interval (CI) is constructed so that it contains the population mean for 95% of all samples.
- It is sometimes used as the size for error bars, so always check the description of a figure!
- It allows us to test whether a **mean is different from an exact number**.
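
As a rough sketch, an approximate 95% confidence interval can be computed from the mean and the standard error. The function name, the reference value of 10 cents and the factor `z=1.96` (the normal approximation behind "about two standard errors") are assumptions made for this example. For a sample as small as five coins, a more careful analysis would use a wider interval.

```python
def confidence_interval_95(mean, standard_error, z=1.96):
    """Approximate 95% CI, assuming normally distributed estimates."""
    half_width = z * standard_error
    return mean - half_width, mean + half_width

# coin example from above: 13 ± 3 cents
low, high = confidence_interval_95(13, 3)

# hypothetical question: is the mean coin value different from 10 cents?
reference_value = 10
is_different = not (low <= reference_value <= high)

print(f"95% CI: {low:.2f} to {high:.2f} cents")
print("mean differs from the reference value:", is_different)
```

If the reference value lies inside the interval, as it does here, the sample gives no clear evidence that the mean differs from it.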

In [7]:
SVGImg('images/error_bars_ci.svg', width='50%', output_dir='slides')

### Try it!

In [8]:
# infinite population & random sample
# ===================================    
plot_population_vs_sample_mean()

Each time you change the parameters, new samples are drawn from a normal distribution.<br>
The latter is also called Gaussian or bell curve and we will hear more about it later.

### Notice that the...

- **uncertainty goes up** proportionally to the **population standard deviation** (std.)
- **uncertainty goes down** in inverse proportion to the square root of the **sample size**: fast for small sample sizes and then slower and slower.
- **error bars of population mean and sample mean** overlap for **95%** of the samples when they show **1 standard error**
- **true population mean is inside the sample error bars** for **95%** of the samples when they show **2 standard errors**.
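
These observations can also be checked without the interactive plot. The sketch below is not the implementation behind `plot_population_vs_sample_mean`; it is a small stand-alone simulation using only the standard library. It assumes a normally distributed population with known standard deviation, and the function names and parameter values are made up for this example.

```python
import random
from math import sqrt

def standard_error_of_mean(pop_std, sample_size):
    """Standard error of the mean for a known population standard deviation."""
    return pop_std / sqrt(sample_size)

def coverage(n_samples, sample_size, pop_mean, pop_std, n_se, seed=0):
    """Fraction of samples whose mean lies within n_se standard errors
    of the true population mean."""
    rng = random.Random(seed)  # fixed seed so that the result is reproducible
    se = standard_error_of_mean(pop_std, sample_size)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(pop_mean, pop_std) for _ in range(sample_size)]
        sample_mean = sum(sample) / sample_size
        if abs(sample_mean - pop_mean) <= n_se * se:
            hits += 1
    return hits / n_samples

# quadrupling the sample size halves the standard error
print(standard_error_of_mean(1.0, 25), standard_error_of_mean(1.0, 100))

# share of samples with the true mean inside 1 and 2 standard errors
print(coverage(2000, 25, pop_mean=0.0, pop_std=1.0, n_se=1))
print(coverage(2000, 25, pop_mean=0.0, pop_std=1.0, n_se=2))
```

With these settings, roughly two thirds of the sample means should fall within one standard error of the population mean and about 95% within two.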

## In the next session,

we will take a deeper dive into…

- Comparing two different statistical populations 
- Statistical significance and p-values
- Statistical power and effect sizes

---

## Exercises

Complete the following exercises to become more familiar with the contents of this session.

In [9]:
# simple examples to calculate mostly mentally
coin_example_samples = [
    #[5, 10, 10, 20, 20], # we already did this one
    [ 2, 10, 10, 10, 50, 50],
    [ 1,  1, 2,  2, 10, 20,],
]

### 1. Drawing coins

Calculate mean, median, standard deviation, and standard error for the following samples. You can do it just like it was explained above.

Solutions are on the next slide. Feel free to use your calculator or favourite software!

In [105]:
print_example_statistics(coin_example_samples, print_only_sample=True)

Values in sample:           [2, 10, 10, 10, 50, 50]
Values in sample:           [1, 1, 2, 2, 10, 20]


### Solutions to Exercise 1

In [106]:
print_example_statistics(coin_example_samples)

Values in sample:           [2, 10, 10, 10, 50, 50]
Sum of values:              132
Mean:                       22.0
Median:                     10.0
Sum of deviations:          2400.0
Sample Variance:            400.0
Sample Standard deviation:  20.0
Population Variance:        480.0
Pop. Standard deviation:    21.909
Standard error:             8.944

Values in sample:           [1, 1, 2, 2, 10, 20]
Sum of values:              36
Mean:                       6.0
Median:                     2.0
Sum of deviations:          294.0
Sample Variance:            49.0
Sample Standard deviation:  7.0
Population Variance:        58.8
Pop. Standard deviation:    7.668
Standard error:             3.13



### 2. Drawing cards

In section 1, we had an example where we drew cards. 

- When we drew 25 cards, we got 12 wins for the first stack and 15 for the second one.
- When we drew 100 cards, we got 56 wins for the first stack and 40 for the second one.

Calculate the winning probabilities and standard errors for both stacks for both 25 and 100 cards. 

- Write the results down in the notation mean ± standard error.
- Can you remember when the difference was significant and why?
- Try to round to [significant digits](https://en.wikipedia.org/wiki/Significant_figures). If your value is e.g. 0.12345 ± 0.05432, you only report 0.12 ± 0.05!
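
The rounding rule from the last point can be sketched as a small helper. The function `round_to_error` is hypothetical and assumes a positive error. It keeps one significant digit of the error and rounds the value to the same decimal place.

```python
from math import floor, log10

def round_to_error(value, error):
    """Round value and error to the decimal place of the error's first
    significant digit. Assumes error > 0."""
    decimals = -int(floor(log10(abs(error))))
    return round(value, decimals), round(error, decimals)

rounded_value, rounded_error = round_to_error(0.12345, 0.05432)
print(f"{rounded_value} ± {rounded_error}")
```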

The solution is on the next slide.

### Solution to Exercise 2

In [111]:
## you can use the same code as above if you like, e.g. like this
card_samples = [
    12 * [1] + (25 - 12) * [0],
    15 * [1] + (25 - 15) * [0],
]
print_example_statistics(card_samples)

Values in sample:           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Sum of values:              12
Mean:                       0.48
Median:                     0.0
Sum of deviations:          6.240000000000001
Sample Variance:            0.24960000000000004
Sample Standard deviation:  0.5
Population Variance:        0.26
Pop. Standard deviation:    0.51
Standard error:             0.102

Values in sample:           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Sum of values:              15
Mean:                       0.6
Median:                     1.0
Sum of deviations:          6.000000000000001
Sample Variance:            0.24000000000000005
Sample Standard deviation:  0.49
Population Variance:        0.25
Pop. Standard deviation:    0.5
Standard error:             0.1



In [112]:
# but this way might be more educational
from math import sqrt

def print_winning_probabilities_with_errors(
        n_cards, stack_1_wins, stack_2_wins
    ):
    # do the calculations just like on the slide where we explained them
    # of course it would be faster to use 
    stack_1_p_win = stack_1_wins / n_cards
    stack_2_p_win = stack_2_wins / n_cards
    stack_1_pop_var = (stack_1_wins * (1 - stack_1_p_win)**2 + (n_cards - stack_1_wins) * (0 - stack_1_p_win)**2) / (n_cards - 1)
    stack_2_pop_var = (stack_2_wins * (1 - stack_2_p_win)**2 + (n_cards - stack_2_wins) * (0 - stack_2_p_win)**2) / (n_cards - 1)
    stack_1_pop_std = sqrt(stack_1_pop_var)
    stack_2_pop_std = sqrt(stack_2_pop_var)
    stack_1_se = stack_1_pop_std / sqrt(n_cards)
    stack_2_se = stack_2_pop_std / sqrt(n_cards)
    n_significant_digits = -int(np.floor(np.log10(stack_1_se)))

    print(f"Stack 1 has {stack_1_wins} wins out of {n_cards} cards")
    print(f"Stack 2 has {stack_2_wins} wins out of {n_cards} cards")
    print()
    print("Stack 1 winning probability:", round(stack_1_p_win, n_significant_digits), "±", round(stack_1_se, n_significant_digits))
    print("Stack 2 winning probability:", round(stack_2_p_win, n_significant_digits), "±", round(stack_2_se, n_significant_digits))
    print()
    # compare the error bars (mean ± 1 SE) of both stacks
    if stack_1_p_win + stack_1_se < stack_2_p_win - stack_2_se:
        # the bar of stack 1 lies entirely below the bar of stack 2
        print("Stack 2 has a significantly higher winning probability than stack 1 (p < 5%)")
    elif stack_1_p_win - stack_1_se > stack_2_p_win + stack_2_se:
        # the bar of stack 1 lies entirely above the bar of stack 2
        print("Stack 1 has a significantly higher winning probability than stack 2 (p < 5%)")
    else:
        print("We can't decide which stack has a higher winning probability")

n_cards = 25
stack_1_wins = 12
stack_2_wins = 15

print_winning_probabilities_with_errors(n_cards, stack_1_wins, stack_2_wins)

print()
print_winning_probabilities_with_errors(100, 56, 40)

Stack 1 has 12 wins out of 25 cards
Stack 2 has 15 wins out of 25 cards

Stack 1 winning probability: 0.5 ± 0.1
Stack 2 winning probability: 0.6 ± 0.1

We can't decide which stack has a higher winning probability

Stack 1 has 56 wins out of 100 cards
Stack 2 has 40 wins out of 100 cards

Stack 1 winning probability: 0.56 ± 0.05
Stack 2 winning probability: 0.4 ± 0.05

Stack 1 has a significantly higher winning probability than stack 2 (p < 5%)


### 3. Explore your favourite software

- Find out if the software you use for spreadsheets, Business Intelligence, or programming has these functions built in:
    - Mean
    - Median
    - Standard Deviation
    - Standard Error


- Find out how to use your favourite plotting software to draw a bar chart with error bars. Try searching for the solution online.

In [15]:
# Analyse any of the previous examples using software you are familiar with.
# Create a figure with error bars