## Practical basics

It is easy to calculate the quantities we encounter most often in statistics!

# Very WIP!

How to calculate these quantities
- mean, median
- standard deviation
- standard error

The latter requires moving from a population to a sample; see https://en.wikipedia.org/wiki/Variance#Population_variance_and_sample_variance
We need to introduce the fact that we are typically estimating. Otherwise there would be no error, because we would know the truth! But in practice we almost never do.
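
As a rough sketch of the standard error listed above: one common estimate of the standard error of the mean divides the sample standard deviation (with the n - 1 correction, `ddof=1` in NumPy) by the square root of the number of observations. The variable names are illustrative, and the five coin values are the example used further down.

```python
import numpy as np

# Illustrative sample: the five coin values used later in this notebook
coins = np.array([5, 10, 10, 20, 20])

n = len(coins)
sample_std = np.std(coins, ddof=1)        # sample standard deviation (divides by n - 1)
standard_error = sample_std / np.sqrt(n)  # estimated standard error of the mean

print("Sample std:    ", sample_std)
print("Standard error:", standard_error)
```

For these five coins the sum of squared deviations is 180, so the corrected variance is 180 / 4 = 45 and the standard error comes out as sqrt(45 / 5) = 3.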

- How to calculate them with pen & paper for small examples
- Some more easy-to-do examples, including the cards example, with solutions on the following slide
- Example with quantiles as extension of median, box plot
- Appendix: how to calculate the standard quantities in Excel, Tableau, Python
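
For the quantile item in the list above, a minimal sketch with NumPy might look like this. It assumes NumPy's default (linear interpolation) quantile method; other tools may use slightly different conventions, so quartiles can differ for small samples.

```python
import numpy as np

coins = [5, 10, 10, 20, 20]

# The 0.5 quantile is the median; 0.25 and 0.75 are the lower and upper quartiles
q1, q2, q3 = np.quantile(coins, [0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the length of the box in a box plot

print("Quartiles:", q1, q2, q3)
print("IQR:      ", iqr)
```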

[no, not this mean](https://en.wikipedia.org/wiki/File:Taylor_Swift_-_Mean.png#/media/File:Taylor_Swift_-_Mean.png)

If you found this funny, you may like [this list](https://www.isg.rhul.ac.uk/~sdg/mathsongs.html). You may also have issues.

In [17]:
import pandas as pd
import numpy as np
import altair as alt
import helpers.plotting as pt
pt.enable_slide_theme()
pt.import_lato_font_in_notebook()

In [18]:
from helpers.svg_wrapper import SVGImg

In [19]:
# test: simple example to calculate mentally

coins_in_pocket = [5, 10, 10, 20, 20]

# other easy to calculate examples
#coins_in_pocket = [ 2, 10, 10, 10, 50, 50]
#coins_in_pocket = [ 1,  1,  2,  2, 10, 20]

print("Sum of coins:             ", np.sum(coins_in_pocket))
print("Mean:                     ", np.mean(coins_in_pocket))
print("Median:                   ", np.median(coins_in_pocket))
print("Sum of squared deviations:",
    np.sum(
        (np.array(coins_in_pocket) - np.mean(coins_in_pocket))**2
    )
)
# np.var and np.std divide by n by default (ddof=0), i.e. population formulas
print("Var:                      ", np.var(coins_in_pocket))
print("Std:                      ", np.std(coins_in_pocket))

Sum of coins:              65
Mean:                      13.0
Median:                    10.0
Sum of squared deviations: 180.0
Var:                       36.0
Std:                       6.0


### How to characterise a sample of quantitative observations?

- **Central tendency**: where are the observed values?
- **Dispersion**: how far apart are the observed values?

### For example, imagine drawing coins from a pocket

- How much is each new coin worth?
- How certain are we about that value?

Let's start with an example of 5 coins...

### Measures of central tendency

In [20]:
SVGImg('images/coins_central_tendency.svg', width='70%', output_dir='slides')

## *

If there is an odd number of observations in our sample, the median is the central one. <br>If there is an even number, it is halfway between the two central ones:
- median([5,5,**10**,20,20]) = **10**
- median([5,5,**10**,**20**,20,20]) = (**10**+**20**) / 2 = **15**

See e.g. https://en.wikipedia.org/wiki/Central_tendency for details <br>on these and other measures of central tendency.
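
The pen-and-paper rules for the mean and median can be sketched in plain Python as follows. The function names are illustrative; in practice `np.mean` and `np.median` do the same job.

```python
def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Central value of the sorted data (mean of the two central ones if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


print(mean([5, 10, 10, 20, 20]))        # 13.0
print(median([5, 5, 10, 20, 20]))       # 10
print(median([5, 5, 10, 20, 20, 20]))   # 15.0
```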

### Measures of dispersion

In [21]:
SVGImg('images/coins_dispersion.svg', width='70%', output_dir='slides')

See e.g. https://en.wikipedia.org/wiki/Statistical_dispersion for details <br>on these and other measures of dispersion.
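
The variance and standard deviation printed earlier for the five coins can be reproduced step by step. This sketch uses the population formulas (dividing by n), which is also what `np.var` and `np.std` do by default.

```python
import math

coins = [5, 10, 10, 20, 20]

mean = sum(coins) / len(coins)                         # 13.0
squared_deviations = [(x - mean) ** 2 for x in coins]  # [64, 9, 9, 49, 49]
variance = sum(squared_deviations) / len(coins)        # 180 / 5 = 36.0
std = math.sqrt(variance)                              # 6.0

print("Squared deviations:", squared_deviations)
print("Variance:          ", variance)
print("Std:               ", std)
```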

### We usually have incomplete information about the world

- We can know the values of all the coins in our pocket.
- What about all the coins in a vault? We might not have the time to look at them all!
- We can still estimate the true statistics of a **population** <br> (e.g. all coins in a vault, all cards in a stack, ...) <br>from a representative **sample** that we draw in an experiment.
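
To make the population/sample distinction concrete, here is a small simulation sketch. The vault contents, its size, and the sample size are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population: a vault of 10,000 coins with made-up denominations
vault = rng.choice([1, 2, 5, 10, 20, 50], size=10_000)

# Experiment: draw a random sample of 100 coins without putting any back
sample = rng.choice(vault, size=100, replace=False)

print("Population mean (usually unknown):", vault.mean())
print("Sample mean (our estimate):       ", sample.mean())
```

The two numbers will generally be close but not identical; that gap is the estimation error the rest of this section is about.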

### The sample estimate of the population mean<br> is just the sample mean!

- The sample mean is the sum of the observed values divided by the number of observations.
- This is exactly the mean that we calculated above for the values of the five coins!
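
Written as a formula, with $x_1, \dots, x_n$ denoting the $n$ observed values (notation introduced here, not taken from the slides), and applied to the five coins:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad\qquad
\bar{x}_{\text{coins}} = \frac{5 + 10 + 10 + 20 + 20}{5} = \frac{65}{5} = 13
$$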

### The uncorrected sample variance (dividing by *n*)<br> is a biased estimate of the population variance!

In [41]:
SVGImg('images/population_vs_sample.svg', width='100%', output_dir='slides')
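
The bias can be checked with a small simulation. This is a sketch under stated assumptions: the five coins are treated as the whole population, samples of size 3 are drawn with replacement, and the number of trials is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Treat the five coins as the full population
population = np.array([5, 10, 10, 20, 20])
population_var = population.var()  # 36.0, dividing by the population size

# Draw many small samples (with replacement) and average the variance estimates
n, trials = 3, 20_000
samples = rng.choice(population, size=(trials, n))

biased = samples.var(axis=1, ddof=0).mean()     # divides by n
corrected = samples.var(axis=1, ddof=1).mean()  # divides by n - 1

print("Population variance:        ", population_var)
print("Mean estimate, divide by n: ", biased)
print("Mean estimate, divide by n-1:", corrected)
```

On average, dividing by n underestimates the population variance by a factor of (n - 1) / n, here 2/3, while dividing by n - 1 removes that bias.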

### Try it!

In [33]:
# test: interactive distribution parameters
plot_width  = 500
max_samples = 200
input_mean = alt.binding(
    input='range',
    min=-5,
    max=5,
    step=.1,
    name='mean'
)
input_std = alt.binding(
    input='range',
    min=0.1,
    max=5,
    step=.1,
    name='std'
)
input_samples = alt.binding(
    input='range',
    min=10,
    max=max_samples,
    step=1,
    name='samples'
)
mean_selection = alt.selection_single(
    bind=input_mean,
    init={'mean': 0}
)
std_selection = alt.selection_single(
    bind=input_std,
    init={'std': 1}
)
samples_selection = alt.selection_single(
    bind=input_samples,
    init={'samples': 50}
)
scales = alt.selection_interval(bind='scales')
                                
data = alt.sequence(0, max_samples, as_='t')
chart = alt.Chart(data).add_selection(
    scales, samples_selection, std_selection, mean_selection
).transform_filter(
    alt.datum.t < alt.expr.toNumber(samples_selection.samples)  # t starts at 0, so "<" keeps exactly `samples` rows
).transform_calculate(
  mean = mean_selection['mean'],
  std  = std_selection['std'],
).transform_calculate(
    x='sampleNormal(+datum.mean,+datum.std)', # the plus casts to number
)
scale = alt.Scale(domain=[-10,10])
hist = chart.mark_bar().encode(
    alt.X("x:Q", bin=alt.Bin(maxbins=7), scale=scale, axis=None),
    y='count()',
).properties(
    width=plot_width
)
ticks = chart.mark_tick().encode(
    x=alt.X('x:Q', scale=scale)
).properties(
    view=alt.ViewConfig(strokeWidth=0), width=plot_width)

#normal = chart.

(hist & ticks)

Each time you change the parameters, new samples are drawn from a normal distribution.<br>
The normal distribution is also called the Gaussian distribution or bell curve, and we will discuss it in depth later.
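
The same experiment can be sketched outside the interactive chart with NumPy's random generator. The parameter values mirror the sliders' starting values; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Same starting values as the sliders: mean 0, std 1, 50 samples
mean, std, n_samples = 0.0, 1.0, 50
x = rng.normal(loc=mean, scale=std, size=n_samples)

print("Sample mean:", x.mean())
print("Sample std: ", x.std(ddof=1))
```

The sample mean and standard deviation should land near, but not exactly on, the chosen parameters, and they change with every new draw.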

In [7]:
import numpy as np
#from helpers.cards import draw_cards