## Practical basics

It is easy to calculate the quantities we encounter most often in statistics!

In [22]:
import pandas as pd
import numpy as np
import altair as alt
import helpers.plotting as pt
from helpers.svg_wrapper import SVGImg
from helpers.pracical_basics import plot_population_vs_sample_mean, print_example_statistics
pt.enable_slide_theme()
pt.import_lato_font_in_notebook()

In [20]:
%%html
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
    "HTML-CSS" : {
        mtextFontInherit: true,
    }
});
</script>

In [3]:
# for the exercises only
from helpers.cards import draw_cards
from IPython.display import Markdown as md

### How to characterise a sample of quantitative observations?

- **Central tendency**: where are the observed values?
- **Dispersion**: how far apart are the observed values?

### For example, imagine drawing coins from a pocket

- How much is each new coin worth?
- How certain are we about that value?

Let's start with an example of 5 coins...

### Measures of central tendency

In [4]:
SVGImg('images/coins_central_tendency.svg', width='70%', output_dir='slides')

## *

If there is an odd number of observations in our sample, the median is the central one. <br>If there is an even number, the median is the mean of the two central ones:
- median([5,5,**10**,20,20]) = **10**
- median([5,5,**10**,**20**,20,20]) = (**10**+**20**) / 2 = 15

See e.g. https://en.wikipedia.org/wiki/Central_tendency for details <br>on these and other measures of central tendency.
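As a minimal sketch, the mean and median above can be checked with Python's built-in `statistics` module. The list `coins` holds the five coin values used in this session, and `even_sample` is the six-value example from this slide; the variable names are chosen here for illustration only.

```python
from statistics import mean, median

coins = [5, 10, 10, 20, 20]           # the five coins, in cents
even_sample = [5, 5, 10, 20, 20, 20]  # six observations: no single central value

coin_mean = mean(coins)            # (5 + 10 + 10 + 20 + 20) / 5 = 13 cents
coin_median = median(coins)        # central value of the sorted list: 10 cents
even_median = median(even_sample)  # mean of the two central values: (10 + 20) / 2 = 15

print(coin_mean, coin_median, even_median)
```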

### Measures of dispersion

In [5]:
SVGImg('images/coins_dispersion.svg', width='70%', output_dir='slides')

See e.g. https://en.wikipedia.org/wiki/Statistical_dispersion for details <br>on these and other measures of dispersion.

### We usually have incomplete information about the world

- We can know the values of all the coins in our pocket.
- What about all the coins in a vault? We might not have the time to look at them all!
- We can still estimate the true statistics of a **population** <br> (e.g. all coins in a vault, all cards in a stack, ...) <br>from a representative **sample** that we draw in an experiment.

### The sample estimate of the population mean<br> is just the sample mean!

- The sample mean is the sum of the observed values divided by the number of observations.
- This is exactly the mean that we calculated above for the values of the five coins!

### The sample variance is a biased estimate of the population variance!

In [6]:
SVGImg('images/population_vs_sample.svg', width='100%', output_dir='slides')
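The difference between the two variances can be sketched for the five coins as follows. Note one naming pitfall: in Python's `statistics` module, `pvariance` divides by the number of observations (what this session calls the sample variance), while `variance` divides by one less (what this session calls the estimated population variance). The variable names below are illustrative.

```python
from statistics import pvariance, variance

coins = [5, 10, 10, 20, 20]  # the five coins, in cents (mean = 13)

# Squared deviations from the mean: 64 + 9 + 9 + 49 + 49 = 180
sample_variance = pvariance(coins)              # 180 / 5 = 36, biased
estimated_population_variance = variance(coins)  # 180 / 4 = 45, bias-corrected

print(sample_variance, estimated_population_variance)
```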

### All estimates have errors

We often estimate the population mean from a sample. The uncertainty of this estimate is quantified by the "standard error" and calculated as follows $^{*▾}$:

$\text{Standard Error} = \sqrt{\frac{\text{Population Variance}}{\text{Sample Size}}}$.

With the estimated population variance of our 5 coins, we obtain:

$\text{Standard Error (coin example)} = \sqrt{\frac{\text{45}}{\text{5}}} = \sqrt{\text{9}} = \text{3}$.
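The same arithmetic can be reproduced in a few lines of Python. This is only a sketch using the standard library; `coins` holds the five coin values from this session and the other names are chosen for illustration.

```python
from math import sqrt
from statistics import variance

coins = [5, 10, 10, 20, 20]  # the five coins, in cents
sample_size = len(coins)

# statistics.variance divides by (n - 1), i.e. it returns the
# estimated population variance used in the formula above: 45
estimated_population_variance = variance(coins)

standard_error = sqrt(estimated_population_variance / sample_size)  # sqrt(45 / 5)
print(standard_error)
```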

### *The standard error is still only an approximation 

- It can also be written as population standard deviation divided by the square root of the number of observations.
- In contrast to the variance, the [standard deviation still has a bias](https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation) because the square root is a curved function.
- There are specific corrections for a number of scenarios.
- Knowing approximately how precise your estimate of the mean is beats knowing nothing about its precision!

### We now can answer our initial question:

**Mean ± 2 Standard Errors** is a common way to state an estimate and its uncertainty. 

- In the example with 5 coins, we can say that we expect a coin taken out of the pocket to be worth 13 ± 6 cents.

- Lower or higher confidence levels are used for some applications. In these cases 1, 3, or more standard errors are [reported](https://en.wikipedia.org/wiki/68–95–99.7_rule).

### Two standard errors are approximately equal to the 95% confidence interval

- The 95% confidence interval (CI) contains the population mean in 95% of repeated experiments.
- Often used as the size for error bars - but always check the description of a figure!
- It allows us to test whether a **mean is different from an exact number** with a **5% error rate**.
- This can be useful to test if an effect is significantly different from 0.
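A minimal sketch of such a test, assuming the "mean ± 2 standard errors" rule as an approximate 95% confidence interval. The helper names `two_se_interval` and `differs_from` are hypothetical and not part of the notebook's helpers. With only five observations the rule is a rough approximation.

```python
from math import sqrt
from statistics import mean, variance

def two_se_interval(values, n_standard_errors=2):
    """Return (lower, upper) bounds of mean ± n_standard_errors * standard error."""
    center = mean(values)
    standard_error = sqrt(variance(values) / len(values))
    half_width = n_standard_errors * standard_error
    return center - half_width, center + half_width

def differs_from(values, reference, n_standard_errors=2):
    """True if the reference number lies outside the interval around the mean."""
    lower, upper = two_se_interval(values, n_standard_errors)
    return reference < lower or reference > upper

coins = [5, 10, 10, 20, 20]            # the five coins, in cents
lower, upper = two_se_interval(coins)  # 13 - 6 = 7 and 13 + 6 = 19

print(lower, upper)
print(differs_from(coins, 0))   # 0 is outside [7, 19]
print(differs_from(coins, 10))  # 10 is inside [7, 19]
```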

In [7]:
SVGImg('images/error_bars_ci.svg', width='50%', output_dir='slides')

### We often use confidence intervals as error bars

- The 95% confidence interval is approximately 2 standard errors (or 1.96).
- If the bars don't overlap, we assume that the observed means are different with <br>a 1% error rate.
- This is what we looked at in the card experiment in [section one](1_introductory_card_experiment.slides.html).
- Smaller or bigger error bars can be useful [depending on the use case](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2064100/).
- There are other tests for significance that we will discuss in a follow-up.

In [8]:
SVGImg('images/error_bars.svg', width='50%', output_dir='slides')

### Try it!

In [9]:
# infinite population & random sample
# ===================================    
plot_population_vs_sample_mean()

Each time you change the parameters, new samples are drawn from a normal distribution.<br>
The latter is also called Gaussian or bell curve and we will hear more about it later.

### Notice that the...

- **uncertainty goes up** proportionally to the **population standard deviation** (std.)
- **uncertainty goes down** in inverse proportion to the square root of the **sample size**: fast for small sample sizes and then slower and slower.
- **error bars of sample mean and population mean** overlap for **99%** of the samples when they show **2 standard errors**
- **true population mean is inside the sample error bars** for **95%** of the samples when they show **2 standard errors**.
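The last observation can also be checked without the interactive plot. The following is a rough simulation sketch using only the standard library; the function name `coverage` and all parameter values are assumptions made for illustration. It repeatedly draws a sample from a normal distribution and counts how often the true population mean lies within 2 standard errors of the sample mean.

```python
import random
from math import sqrt
from statistics import mean, variance

def coverage(population_mean, population_std, sample_size,
             n_experiments=2000, n_standard_errors=2, seed=0):
    """Fraction of experiments where the population mean is inside the error bars."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(population_mean, population_std)
                  for _ in range(sample_size)]
        standard_error = sqrt(variance(sample) / sample_size)
        if abs(mean(sample) - population_mean) <= n_standard_errors * standard_error:
            hits += 1
    return hits / n_experiments

fraction_inside = coverage(population_mean=10, population_std=3, sample_size=50)
print(fraction_inside)  # expected to be close to 0.95
```

Changing `sample_size` or `population_std` changes the width of the error bars, but the fraction should stay near 95% as long as the sample is not very small.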

## In the next session, we will discover...

- distributions, which describe the probabilities of all possible results in a sample or population. 
- why the normal (or Gaussian) distribution is so important.

Go to [next session](3_distributions.slides.html) or continue for some exercises on the following slides.

---

## Exercises

Complete the following exercises to become more familiar with the contents of this session.

In [10]:
# simple examples to calculate mostly mentally
coin_example_samples = [
    #[5, 10, 10, 20, 20], # we already did this one
    [ 2, 10, 10, 10, 50, 50],
    [ 1,  1, 2,  2, 10, 20,],
]

### 1. Drawing coins

Calculate mean, median, standard deviation, and standard error for the following samples. You can do it just like it was explained above.

In [11]:
print_example_statistics(coin_example_samples, print_only_sample=True)

Values in sample:           [2, 10, 10, 10, 50, 50]
Values in sample:           [1, 1, 2, 2, 10, 20]


Solutions are on the next slide. Feel free to use your calculator or favourite software!

### Solutions to exercise 1

In [12]:
print_example_statistics(coin_example_samples)

Values in sample:           [2, 10, 10, 10, 50, 50]
Sum of values:              132
Mean:                       22.0
Median:                     10.0
Sum of deviations:          2400.0
Sample Variance:            400.0
Sample Standard deviation:  20.0
Population Variance:        480.0
Pop. Standard deviation:    21.909
Standard error:             8.944

Values in sample:           [1, 1, 2, 2, 10, 20]
Sum of values:              36
Mean:                       6.0
Median:                     2.0
Sum of deviations:          294.0
Sample Variance:            49.0
Sample Standard deviation:  7.0
Population Variance:        58.8
Pop. Standard deviation:    7.668
Standard error:             3.13



### 2. Drawing cards

In section 1, we had an example where we drew cards. 

In [13]:
stack_1_cards, stack_2_cards = draw_cards(
    p_win_1 = 0.5,
    p_win_2 = 0.38,
    n_cards = 100,
    seed    = 0
)
n_cards_small_experiment = 15
n_cards_bigger_experiment = 100

In [14]:
md(
    f"- We drew {len(stack_1_cards[:n_cards_small_experiment])} cards "
    f"and we got {sum(stack_1_cards[:n_cards_small_experiment])} wins for the first stack "
    f"and {sum(stack_2_cards[:n_cards_small_experiment])} for the second one.\n"
    f"- We drew {len(stack_1_cards)} cards and we got {sum(stack_1_cards)} wins "
    f"for the first stack and {sum(stack_2_cards)} for the second one."
)

- We drew 15 cards and we got 8 wins for the first stack and 10 for the second one.
- We drew 100 cards and we got 56 wins for the first stack and 34 for the second one.

Calculate the winning probabilities and standard errors for both stacks for both experiments. 

- Write the results down in the notation mean ±2 standard errors.
- Can you remember when the difference was significant and why?
- Try to round to [significant digits](https://en.wikipedia.org/wiki/Significant_figures). If your value is e.g. 0.12345 ± 0.05432, you only report 0.12 ± 0.05!
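One possible way to do this rounding in code is sketched below. The helper `round_to_uncertainty` is hypothetical; it rounds both numbers to the decimal place of the first significant digit of the uncertainty and assumes the uncertainty is not zero.

```python
from math import floor, log10

def round_to_uncertainty(value, uncertainty):
    """Round value and uncertainty to the first significant digit of the uncertainty."""
    # Position of the first significant digit, e.g. 0.05432 -> 2 decimals.
    # Assumes uncertainty != 0 (log10 is undefined for 0).
    n_decimals = -floor(log10(abs(uncertainty)))
    return round(value, n_decimals), round(uncertainty, n_decimals)

rounded = round_to_uncertainty(0.12345, 0.05432)
print(rounded)  # (0.12, 0.05)
```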

The solution is on the next slide

### Solution to exercise 2

In [15]:
## you can use the same code as above if you like, e.g. like this
print_example_statistics(
    [stack_1_cards[:n_cards_small_experiment], 
     stack_2_cards[:n_cards_small_experiment]]
)

Values in sample:           [1 1 0 0 1 1 1 0 0 1 1 0 0 0 1]
Sum of values:              8
Mean:                       0.5333333333333333
Median:                     1.0
Sum of deviations:          3.733333333333333
Sample Variance:            0.249
Sample Standard deviation:  0.499
Population Variance:        0.267
Pop. Standard deviation:    0.516
Standard error:             0.133

Values in sample:           [1 0 1 1 0 0 1 0 1 1 1 1 1 1 0]
Sum of values:              10
Mean:                       0.6666666666666666
Median:                     1.0
Sum of deviations:          3.333333333333334
Sample Variance:            0.222
Sample Standard deviation:  0.471
Population Variance:        0.238
Pop. Standard deviation:    0.488
Standard error:             0.126



In [16]:
# but this way might be more educational
from math import sqrt

def report_winning_probabilities_with_errors(
        n_cards, stack_1_wins, stack_2_wins
    ):
    """Return results for a card experiment including significant digits of mean ± standard error as markdown"""
    # check this code - we do the calculations just like on the slide where we explained them
    # of course it would be faster and more concise to use e.g. the mean and std functions from numpy
    
    # winning probabilities - same as taking a mean
    stack_1_p_win = stack_1_wins / n_cards
    stack_2_p_win = stack_2_wins / n_cards
    
    # the educational way to calculate the variance
    stack_1_pop_var = (stack_1_wins * (1 - stack_1_p_win)**2 + (n_cards - stack_1_wins) * (0 - stack_1_p_win)**2) / (n_cards - 1)
    stack_2_pop_var = (stack_2_wins * (1 - stack_2_p_win)**2 + (n_cards - stack_2_wins) * (0 - stack_2_p_win)**2) / (n_cards - 1)
    
    stack_1_pop_std = sqrt(stack_1_pop_var)
    stack_2_pop_std = sqrt(stack_2_pop_var)
    
    stack_1_se = stack_1_pop_std / sqrt(n_cards)
    stack_2_se = stack_2_pop_std / sqrt(n_cards)
    
    n_significant_digits = -int(np.floor(np.log10(stack_1_se)))

    # error bars (mean ± 2 standard errors) must not overlap for a significant difference
    if stack_1_p_win + 2 * stack_1_se <  stack_2_p_win - 2 * stack_2_se:
        conclusion = "Stack 2 has a significantly higher winning probability than stack 1 (p < 5%)"
    elif stack_1_p_win - 2 * stack_1_se >  stack_2_p_win + 2 * stack_2_se:
        conclusion = "Stack 1 has a significantly higher winning probability than stack 2 (p < 5%)"
    else:
        conclusion = "We can't decide which stack has a higher winning probability"
    
    return md(
        f"Drawing {n_cards} cards:\n"
        f"- Stack 1 has {stack_1_wins} wins\n"
        f"- Stack 2 has {stack_2_wins} wins\n"
        f"- Stack 1 winning probability: {round(stack_1_p_win, n_significant_digits)} ± {round(2 * stack_1_se, n_significant_digits)}\n"
        f"- Stack 2 winning probability: {round(stack_2_p_win, n_significant_digits)} ± {round(2 * stack_2_se, n_significant_digits)}\n"
        "- "
        + conclusion
     )

In [17]:
n_cards = 15
stack_1_wins = sum(stack_1_cards[:n_cards])
stack_2_wins = sum(stack_2_cards[:n_cards])

report_winning_probabilities_with_errors(n_cards, stack_1_wins, stack_2_wins)

Drawing 15 cards:
- Stack 1 has 8 wins
- Stack 2 has 10 wins
- Stack 1 winning probability: 0.5 ± 0.3
- Stack 2 winning probability: 0.7 ± 0.3
- We can't decide which stack has a higher winning probability

In [18]:
n_cards = 100
stack_1_wins = sum(stack_1_cards[:n_cards])
stack_2_wins = sum(stack_2_cards[:n_cards])

report_winning_probabilities_with_errors(n_cards, stack_1_wins, stack_2_wins)

Drawing 100 cards:
- Stack 1 has 56 wins
- Stack 2 has 34 wins
- Stack 1 winning probability: 0.56 ± 0.1
- Stack 2 winning probability: 0.34 ± 0.1
- Stack 1 has a significantly higher winning probability than stack 2 (p < 5%)

### 3. Explore your favourite software

- Find out if the software you use for spreadsheets, Business Intelligence, or programming has these functions built in:
    - Mean
    - Median
    - Standard Deviation
    - Standard Error


- Find out how to use your favourite plotting software to draw a bar chart with error bars

### 4. Use mathematical notation for the same calculations.

We often use the symbol $\mu$ for the mean. Consider a sample $s$ with $n$ values $x_1, x_2, \dots, x_n$. We can call its mean $\mu_s$. We can write the calculation using the [symbol](https://www.youtube.com/watch?v=5jwXThH6fg4) $\sum$ for [summation](https://en.wikipedia.org/wiki/Summation): 

$$\mu_{s} = \frac{1}{n} \sum_{i = 1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}$$

A common symbol for the standard deviation is $\sigma$. The variance is $\sigma^2$. Its calculation was described above, but as an equation it is much more concise:

$$\sigma_s^2 = \frac{1}{n-1} \sum_{i = 1}^{n} (x_i - \mu_s)^2$$
    
If we know the mean $\mu_p$ of a population $p$, the variance becomes 

$$\sigma_{s,p}^2 = \frac{1}{n} \sum_{i = 1}^{n} (x_i - \mu_p)^2$$

Finally the standard error of the mean of the sample is

$$\sigma_{\mu_{s}} = \frac{\sigma_s}{\sqrt{n}}$$
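As a sketch, each equation above maps to one short Python function. The function names are chosen here for illustration, and the value `mu_p = 20` is a made-up population mean used only to show the known-mean formula. The sample `x` is the first one from exercise 1, so the results can be compared with the printed solution (mean 22.0, population variance 480.0, standard error 8.944).

```python
from math import sqrt

def sample_mean(x):
    """mu_s: sum of the values divided by n."""
    return sum(x) / len(x)

def variance_estimate(x):
    """sigma_s^2: squared deviations from the sample mean, divided by n - 1."""
    mu_s = sample_mean(x)
    return sum((x_i - mu_s) ** 2 for x_i in x) / (len(x) - 1)

def variance_known_mean(x, mu_p):
    """sigma_{s,p}^2: squared deviations from a known population mean, divided by n."""
    return sum((x_i - mu_p) ** 2 for x_i in x) / len(x)

def standard_error(x):
    """sigma_{mu_s}: standard deviation divided by the square root of n."""
    return sqrt(variance_estimate(x)) / sqrt(len(x))

x = [2, 10, 10, 10, 50, 50]  # first sample from exercise 1

print(sample_mean(x))                  # 22.0
print(variance_estimate(x))            # 480.0
print(round(standard_error(x), 3))     # 8.944
print(variance_known_mean(x, mu_p=20)) # hypothetical known mean of 20
```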

$ $
