## Distributions and the central limit theorem

- Introduce statistical distributions
- Central limit theorem

- In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of the different possible outcomes of an experiment.
- In the discrete case, it is sufficient to specify a probability mass function that assigns a probability to each possible outcome. For example, when throwing a fair die, each of the six values 1 to 6 has probability 1/6.
- In contrast, when a random variable takes values from a continuum, any individual outcome typically has probability zero, and only events that include infinitely many outcomes, such as intervals, can have positive probability.

In [116]:
import pandas as pd
import numpy as np
import altair as alt
import helpers.plotting as pt
from helpers.svg_wrapper import SVGImg
pt.enable_slide_theme()
pt.import_lato_font_in_notebook()

In [100]:
def distribution_parameter(
        min:float, max:float, step:float, init_value:float, name:str, type:str='range'
    ) -> alt.selection:
    b = alt.binding(
        input=type, min=min, max=max, step=step, name=name
    )
    return alt.selection_single(
        bind = b,
        init = dict(value=init_value)
    )

def distribution_chart(
        xmin:float,
        xmax:float,
        xstep:float,
        pdf_transform_expression: str,
        mark_type='line',
        xscale=alt.Scale(),
        yscale=alt.Scale(),
        **distribution_parameters
    ) -> alt.Chart:
    # x values come from a generated sequence; the mark method is looked up by name
    base = alt.Chart(
        alt.sequence(xmin, xmax, xstep, as_='x')
    )
    return getattr(base, f'mark_{mark_type}')().encode(
        x=alt.X('x:Q', scale=xscale),
        y=alt.Y('px:Q', scale=yscale),
    ).add_selection(
        *list(distribution_parameters.values())[::-1], 
    ).transform_calculate(
        **{k: v.value for k, v in distribution_parameters.items()}
    ).transform_calculate(
        px = pdf_transform_expression
    )

### How to describe the possible outcomes of an experiment? Distributions!

Think about **throwing a** six-sided **[die](https://en.wikipedia.org/wiki/Dice)**: 
- The possible outcomes are the integers (whole numbers) 1, 2, 3, 4, 5, 6
- Each outcome should have the same **probability 1/6**
- We say that the outcomes are described by the **discrete uniform distribution** on the integers 1 to 6.
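
As a small illustration of the die example, the sketch below writes down the probability mass function of a fair die and compares it with the relative frequencies from simulated throws. It uses only the Python standard library; the names `die_pmf`, `n_throws`, and `frequencies` and the fixed seed are choices made for this example, not part of the notebook's helpers.

```python
import random
from collections import Counter
from fractions import Fraction

# Probability mass function of a fair six-sided die: each face has probability 1/6
faces = range(1, 7)
die_pmf = {face: Fraction(1, 6) for face in faces}

# Simulate many throws; the seed is only there to make the run reproducible
rng = random.Random(0)
n_throws = 60_000
counts = Counter(rng.randint(1, 6) for _ in range(n_throws))

# Relative frequencies should be close to 1/6 for every face
frequencies = {face: counts[face] / n_throws for face in faces}
for face in faces:
    print(face, float(die_pmf[face]), round(frequencies[face], 3))
```

The simulated frequencies fluctuate around 1/6 and get closer to it as `n_throws` grows.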

In [122]:
SVGImg('images/die.svg', width='15%', output_dir='slides')

### There are discrete and continuous probability distributions

In the discrete case, 
- we can specify the probability of each possible outcome
- mathematically this is done via a "probability mass function"
- all the probabilities add up to one (one out of all possible outcomes will happen)
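
The statement that all probabilities add up to one can be checked directly for the discrete uniform distribution on 1 to `maximum`. The following is a minimal sketch in plain Python; the function name `discrete_uniform_pmf` is hypothetical, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

def discrete_uniform_pmf(x: int, maximum: int) -> Fraction:
    """Probability of outcome x under the discrete uniform distribution on 1..maximum."""
    return Fraction(1, maximum) if 1 <= x <= maximum else Fraction(0)

# Sum over a grid that is wider than the support: values outside 1..maximum contribute zero
for maximum in (2, 6, 20):
    total = sum(discrete_uniform_pmf(x, maximum) for x in range(-1, 21))
    print(maximum, total)
```

Whatever value `maximum` takes, the individual probabilities shrink to 1/`maximum` while their sum stays at one.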

In [114]:
# uniform probability mass function
distribution_chart(
    pdf_transform_expression = '1/datum.max * (1 <= datum.x) * (datum.x <= datum.max)',
    xmin = -1,
    xmax = 21,  # the sequence excludes its stop value, so 21 keeps x = 20 on the grid
    xstep = 1,
    mark_type='bar',
    xscale=alt.Scale(domain=(0,20)),
    yscale=alt.Scale(domain=(0,0.51)),
    max=distribution_parameter(
        min=2,
        max=20,
        step=1,
        init_value=2,
        name='Max'
    )
)

### There are discrete and continuous probability distributions

In the continuous case, 
- there are infinitely many possible outcomes
- we typically give the probability of intervals of outcomes, via a "probability density function"
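
In the continuous case, probabilities belong to intervals rather than to single outcomes. The sketch below illustrates this for a continuous uniform distribution on an interval from `low` to `high`. Both function names are hypothetical, and the interval probability is computed from the overlap length, which is valid only for the uniform case.

```python
def uniform_density(x: float, low: float, high: float) -> float:
    """Density of the continuous uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def uniform_interval_probability(a: float, b: float, low: float, high: float) -> float:
    """P(a <= X <= b): overlap of [a, b] with [low, high], divided by the total length."""
    overlap = max(0.0, min(b, high) - max(a, low))
    return overlap / (high - low)

low, high = 0.0, 2.0
print(uniform_density(1.0, low, high))                    # a density value, not a probability
print(uniform_interval_probability(0.5, 1.0, low, high))  # probability of an interval
print(uniform_interval_probability(1.0, 1.0, low, high))  # a single point has probability zero
```

Note that a density is not itself a probability: on a short interval such as 0 to 0.5, the uniform density equals 2, while the probability of the whole interval is still one.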

In [6]:
# uniform density
distribution_chart(
    pdf_transform_expression = 'densityUniform(datum.x, datum.min, datum.max)',
    xmin = -5,
    xmax = 5,
    xstep = .1,
    min = distribution_parameter(
        min=-5,
        max=0,
        step=.1,
        init_value=0,
        name='Min'
    ),
    max = distribution_parameter(
        min=0,
        max=5,
        step=.1,
        init_value=2,
        name='Max'
    )
)

### ---

In [4]:
# gaussian density
distribution_chart(
    pdf_transform_expression = 'densityNormal(datum.x, datum.mean, datum.std)',
    xmin = -5,
    xmax = 5,
    xstep = .1,
    mean = distribution_parameter(
        min=-5,
        max=5,
        step=.1,
        init_value=0,
        name='Mean'
    ),
    std = distribution_parameter(
        min=.1,
        max=5,
        step=.1,
        init_value=1,
        name='Std'
    )
)

In [5]:
# log-normal shape: normal density evaluated at log(x), with mean and std on the log scale
distribution_chart(
    pdf_transform_expression = 'densityNormal(log(datum.x), log(datum.mean), log(datum.std))',
    xmin = 0,
    xmax = 100,
    xstep = .1,
    mean = distribution_parameter(
        min=0.1,
        max=100,
        step=.1,
        init_value=1,
        name='Mean'
    ),
    std = distribution_parameter(
        min=1.1,
        max=5,
        step=.1,
        init_value=2,
        name='Std'
    )
)

## In the next session,

we will take a deeper dive into…

- Comparing two different statistical distributions 
- Statistical significance and p-values
- Statistical power and effect sizes