# Discrete Probability Distributions

### Learning Objectives
- [Introduction to Probability Distributions](#Introduction-to-Probability-Distributions)
- [Discrete Probability Distributions](#Discrete-Probability-Distributions)
- [Bernoulli Distribution](#Bernouilli-Distribution)
- [Binomial Distribution](#Binomial-Distribution)

# Introduction to Probability Distributions
As we have seen, a __probability distribution__ is just a function that assigns a probability to every possible outcome of a random variable. Because a probability distribution accounts for all possible outcomes of the random variable, the probabilities it assigns always sum to 1. Consider the probability distribution of the throw of a fair die below:

| x   | 1             | 2             | 3             | 4             | 5             | 6             |
|------|---------------|---------------|---------------|---------------|---------------|---------------|
| p(x) | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ |

We see that the probability distribution assigns a probability to every possible outcome of the throw. In this case, we have assigned theoretical probabilities, but the same can be done with experimental probabilities. We also made a distinction between two types of random variables, namely __discrete__ and __continuous__. __Discrete random variables__ can only take certain numbers as values. This is the case in our dice throwing experiment, as the outcome can only take on the values within the set $\{1, 2, 3, 4, 5, 6\}$. __Continuous random variables__, on the other hand, can take any value within a range. The height of a person is an example of a continuous random variable, as it can be any value within a certain range of heights, for example between 0.50m and 2.50m, and could be measured as 1.65457m.
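
As a quick check of the table above, the sketch below stores the theoretical distribution of a fair die, confirms that its probabilities sum to 1, and compares it with experimental probabilities estimated from simulated throws. The names `die_pmf` and `estimate_probabilities`, the number of throws, and the fixed seed are illustrative assumptions, not part of the original notebook.

```python
from fractions import Fraction

import numpy as np

# Theoretical distribution of a fair six-sided die: outcome -> probability
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}
total_probability = sum(die_pmf.values())  # exactly 1 for a valid distribution


def estimate_probabilities(n_throws, seed=0):
    """Estimate p(x) for each face as (count of that face) / (number of throws)."""
    rng = np.random.default_rng(seed)  # seeded only so the sketch is reproducible
    throws = rng.integers(1, 7, size=n_throws)  # upper bound is exclusive
    return {face: np.count_nonzero(throws == face) / n_throws for face in range(1, 7)}


experimental_pmf = estimate_probabilities(n_throws=60_000)
print(total_probability)
print(experimental_pmf)
```

With this many throws, each experimental probability should land close to $\frac{1}{6} \approx 0.167$, though the exact values depend on the seed.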

As we will see soon, while the differences may not seem very large, they result in different tools for statistical analysis. In this notebook, we will focus on discrete random variables.

# Discrete Probability Distributions
__Discrete probability distributions__ are, simply put, probability distributions used to describe discrete random variables, that is, variables with a countable (in our examples, ___finite___) number of possible outcomes. Throwing a die is a classic example of a simple discrete probability distribution. If we know the underlying distribution of a random variable, we can derive all the relevant information about it, such as its population mean and variance, and see how likely certain events are to occur. In practice, we will rarely know the _exact_ probability distribution of a random variable, but we can still model it using well-studied distributions depending on the given context, or try to estimate its distribution based on observed data.
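
To make "deriving the relevant information" concrete, here is a minimal sketch that computes the population mean and variance of the fair die directly from its distribution, using $E(X) = \sum x\,p(x)$ and $var(X) = \sum (x - \mu)^{2}p(x)$. The helper name `mean_and_variance` is a hypothetical choice for this illustration.

```python
from fractions import Fraction

# Theoretical distribution of a fair six-sided die: outcome -> probability
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}


def mean_and_variance(pmf):
    """Return (mean, variance) of a discrete distribution given as {outcome: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, variance


die_mean, die_variance = mean_and_variance(die_pmf)
print(die_mean, die_variance)  # 7/2 35/12
```

Using `Fraction` keeps the arithmetic exact, so the result is $\frac{7}{2} = 3.5$ for the mean and $\frac{35}{12} \approx 2.92$ for the variance.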

# Bernouilli Distribution

Consider a game of heads or tails with a fair coin. We can consider the outcome to be a random variable, $X$, landing on heads with a probability $p=0.5$, and landing on tails with a probability $q=1-p=0.5$. We only know these probabilities because both outcomes are equally likely under the 'fair coin' assumption.

Below, let's code up a function that randomly generates $n$ coin toss outcomes based on these probabilities and returns the number of heads and the number of tails. (Hint: Consider heads to be 1 and tails to be 0.)

In [26]:
import numpy as np

## Generate n observations of coin tosses:
def coin_tosses(n):
    possible_outcomes = np.array([1, 0]) # where 1 represents heads and 0 represents tails
    outcomes_observed = np.random.choice(possible_outcomes, size=n, p=[0.5, 0.5]).tolist()
    heads_count, tails_count = outcomes_observed.count(1), outcomes_observed.count(0)
    return [heads_count, tails_count]

print(coin_tosses(n=1000))

[520, 480]


Now let's step away from our function and treat it as a black box, assuming that the output was generated in the real world. Let us also forget the probabilities of each possible outcome and not assume an initial theoretical probability distribution. Can we estimate the probability distribution from the observed data? Yes, we can!

All we need to do is use the formula from the previous notebooks: we can estimate the probability of every possible outcome by dividing its frequency by the total number of outcomes observed. The most common way to display frequency and probability distributions is by using what is known as a histogram. Let us now plot the frequency distribution and probability distribution for the coin toss example given the 'observed' data generated from our function above.

In [30]:
import plotly.graph_objects as go

y = coin_tosses(n=10000)
x = ['Heads', 'Tails']

# Plotting frequency distribution
fig = go.Figure()
fig.add_trace(go.Bar(x=x, y=y))
fig.update_layout(
    title_text='Coin Toss Frequency Distribution', # title of plot
    xaxis_title_text='Outcome', # xaxis label
    yaxis_title_text='Frequency (Count)', # yaxis label
)
fig.show()

# Plotting probability distribution
norm_y = np.array(y)/np.sum(y)

fig = go.Figure()
fig.add_trace(go.Bar(x=x, y=norm_y))
fig.update_layout(
    title_text='Coin Toss Probability Distribution', # title of plot
    xaxis_title_text='Outcome', # xaxis label
    yaxis_title_text='Probability', # yaxis label
)
fig.show()

But why look into such a simple example? Well, it's because a coin toss is an example of one of the simplest well-defined, theoretical discrete probability distributions: the __Bernoulli distribution__. A Bernoulli distribution is very straightforward: it's used to define any random variable whose possible outcomes can only be either $X=0$ or $X=1$, such that $p(1) = p$ and $p(0) = q = 1-p$, where $p$ and $q$ are positive constants s.t. $0 < p,q < 1$. In the example of the coin toss, we considered 'Heads' to be 1 and 'Tails' to be 0.

Therefore, we can use the Bernoulli distribution for any binary problem, such as yes/no questions and answers. Hence, as simple as it may be, it can be highly applicable in hypothesis testing in the real world. Furthermore, we can use this information to determine the general formulas for the expected value and population variance of a __Bernoulli random variable__:

$$E(X) = \sum_{x \in X}xp(x) = (1)\times p + (0) \times q = p$$

$$var(X) = \sum_{x \in X}(x - \mu)^{2}p(x) = (1 - p)^{2}\times p + (0 - p)^{2} \times q = q^{2}p + p^{2}q = pq(p+q) = pq = p(1-p) $$

Given these properties, what is the expected value of the coin toss case?
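
One way to check your answer is to evaluate the two sums above directly and compare them with simulated tosses. The sketch below does this for $p = 0.5$; the function name `bernoulli_mean_and_variance`, the sample size, and the fixed seed are assumptions made for this illustration.

```python
import numpy as np


def bernoulli_mean_and_variance(p):
    """Evaluate E(X) and var(X) from the Bernoulli distribution {1: p, 0: 1 - p}."""
    pmf = {1: p, 0: 1 - p}
    mean = sum(x * prob for x, prob in pmf.items())
    variance = sum((x - mean) ** 2 * prob for x, prob in pmf.items())
    return mean, variance


mean, variance = bernoulli_mean_and_variance(p=0.5)
print(mean, variance)  # 0.5 0.25

# Experimental check: simulate fair coin tosses (1 = heads, 0 = tails)
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
samples = rng.choice([1, 0], size=100_000, p=[0.5, 0.5])
print(samples.mean(), samples.var())  # should be close to the values above
```

The theoretical values match $E(X) = p$ and $var(X) = p(1-p)$, and the experimental estimates should approach them as the number of tosses grows.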

# Binomial Distribution

The Bernoulli distribution is very simple, yet it can be useful depending on the context in which it is applied, which makes it great value for so little complexity. However, we are not always concerned with a single binary outcome (one with two options); we may also be interested in the accumulation of multiple outcomes. For example, could we expand the idea behind the Bernoulli distribution to describe the probability of observing a single 'Heads' after 2 coin tosses? Consider the tree diagram below, which expands this idea to two coin tosses:

<img src='https://www.mathsteacher.com.au/year10/ch05_probability/06_further_representation/Image3890.gif' style='display: block; margin-left:auto; margin-right:auto; width:50%'>

From this diagram, we see that there are two possible __combinations__ that give exactly one 'Heads', yet only one combination for obtaining either $0$ or $2$ 'Heads'. Each individual path has probability $\frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$, so the outcomes that contain exactly one 'Heads' have probabilities adding up to $\frac{1}{4} + \frac{1}{4} = \frac{1}{2}$. Likewise, the probability of observing 0 'Heads' is $\frac{1}{4}$ and the probability of observing 2 'Heads' is $\frac{1}{4}$. These are all the possible outcomes and their probabilities sum to 1, making this a probability distribution! If we consider a random variable that is the number of 'Heads' observed after 2 coin tosses, we get the following table:

| X    | 0             | 1             | 2             |
|------|---------------|---------------|---------------|
| P(X) | $\frac{1}{4}$ | $\frac{1}{2}$ | $\frac{1}{4}$ |

This is a very simple example of a __binomial distribution__. For this type of distribution, the random variable at hand is the number of successes ($k$) in a sequence of $n$ independent Bernoulli events, known as __trials__. Another important parameter is the __success probability ($p$)__, which in the case of the coin tosses is simply $\frac{1}{2}$. In our example, we ran two trials, one per coin toss, and as we saw at the beginning of the notebook, each toss can be modelled as a Bernoulli random variable. We can extend this to any number of trials we would like.
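
The table above can also be rebuilt by brute force: list every sequence of tosses, count the 'Heads' in each, and divide by the number of sequences. This is only a sketch and it assumes a fair coin, so that every sequence is equally likely; `heads_distribution` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def heads_distribution(n_tosses):
    """Distribution of the number of heads, assuming all toss sequences are equally likely."""
    sequences = list(product("HT", repeat=n_tosses))  # e.g. ('H', 'T') for two tosses
    heads_counts = Counter(seq.count("H") for seq in sequences)
    return {k: Fraction(heads_counts[k], len(sequences)) for k in range(n_tosses + 1)}


two_toss_pmf = heads_distribution(2)
print(two_toss_pmf)  # probabilities 1/4, 1/2, 1/4 for 0, 1, 2 heads
```

Calling `heads_distribution(10)` enumerates all $2^{10} = 1024$ sequences, which is still cheap, but the enumeration grows exponentially with the number of tosses.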

Let's write a function similar to `coin_tosses()` above. This time, we want the function to take in the number of trials (coin tosses in a single binomial calculation), the success probability, and the number of samples of the binomial distribution, and return the frequency distribution of each possible outcome. (Hint: start by thinking of the simplest case shown above for two trials, then generalize.)


In [39]:
# Code up binomial function!
def many_coin_tosses(n, p, n_samples):
    outcomes = []
    bernoulli_outcomes = np.array([1, 0])
    for i in range(n_samples):
        trial_outcomes = np.random.choice(bernoulli_outcomes, size=n, p=[p, 1-p]).tolist()
        success_count = trial_outcomes.count(1)
        outcomes.append(success_count)
    # Determine the number of times we observed each success count
    possible_success_counts = list(range(0, n+1))
    success_count_distribution = [outcomes.count(k) for k in possible_success_counts]
    return success_count_distribution

print(many_coin_tosses(n=10, p=0.5, n_samples=1000))

[0, 5, 47, 116, 183, 246, 219, 131, 41, 12, 0]


Great! Now that our binomial function is working, we can try to visualize our frequency distribution and estimated probability distribution as we did for the Bernoulli distribution.

In [40]:
n, p, n_samples = 10, 0.5, 1000
y = many_coin_tosses(n=n, p=p, n_samples=n_samples)

# Plotting frequency distribution
fig = go.Figure()
fig.add_trace(go.Bar(y=y))
fig.update_layout(
    title_text='Frequency Distribution', # title of plot
    xaxis_title_text='Outcome (Number of heads seen in {} trials)'.format(n), # xaxis label
    yaxis_title_text='Frequency (Count)', # yaxis label
)
fig.show()

# Plotting probability distribution
norm_y = np.array(y)/np.sum(y)

fig = go.Figure()
fig.add_trace(go.Bar(y=norm_y))
fig.update_layout(
    title_text='Probability Distribution', # title of plot
    xaxis_title_text='Outcome', # xaxis label
    yaxis_title_text='Probability', # yaxis label
)
fig.show()

We see from the plots that the outcomes with the highest probability of occurring are those around the mean of the distribution. Consider the case of 10 coin tosses. There are far more possible combinations in which you can get 5 'Heads' than 0 or 10 'Heads'. How can we formulate this idea theoretically? Sadly, the derivation is beyond the scope of this course, but it is based on what is known as the __binomial theorem__: it accounts for all possible combinations, then computes the respective probabilities. The probability distribution of a binomially distributed random variable with $n$ trials and success probability $p$ is given by:

$$
p(k, n, p) = \begin{pmatrix} n \\ k \end{pmatrix} p^{k}(1-p)^{n-k}
$$

Where:

$$
\begin{pmatrix} n \\ k \end{pmatrix} = \frac{n!}{k!(n-k)!}
$$
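
As a small worked example, the formula can be evaluated directly with the standard library. The sketch below assumes Python 3.8 or later (for `math.comb`), and `binomial_pmf` is a hypothetical name chosen for this illustration.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Two coin tosses: reproduces the table for 0, 1 and 2 heads
two_toss = [binomial_pmf(k, n=2, p=0.5) for k in range(3)]
print(two_toss)  # [0.25, 0.5, 0.25]

# Ten coin tosses: 5 heads is the most likely outcome
ten_toss = [binomial_pmf(k, n=10, p=0.5) for k in range(11)]
print(ten_toss[5])  # 0.24609375, i.e. 252/1024
```

Here `comb(n, k)` counts the combinations and $p^{k}(1-p)^{n-k}$ is the probability of any single one of them.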

The formula can look quite tricky if you've never seen it before, so don't worry too much about it. What is especially important is understanding the context in which we can apply this probability distribution. Let's now plot the probability histogram we obtained from experimental data on the same plot as the theoretical distribution to see how good our estimates are!

In [46]:
from scipy.stats import binom

n, p, n_samples = 10, 0.5, 10000
y = many_coin_tosses(n=n, p=p, n_samples=n_samples)
theoretical_y = [binom.pmf(k, n, p) for k in range(n+1)]
# Plotting probability distribution
experimental_y = np.array(y)/np.sum(y)

fig = go.Figure()
fig.add_trace(go.Bar(y=experimental_y, marker_color='indianred', name='Experimental'))
fig.add_trace(go.Bar(y=theoretical_y, marker_color='lightsalmon', name='Theoretical'))
fig.update_layout(
    title_text='Binomial Probability Distribution (n={}, p={})'.format(n, p), # title of plot
    xaxis_title_text='Outcome', # xaxis label
    yaxis_title_text='Probability', # yaxis label
)
fig.show()

As a final note, it may be useful to know that using the theoretical probability distribution, we can come up with a general formula for the expected value and population variance. In fact, we can compute them even more easily than with the usual formulas by using a little trick. First, we have to state two properties:
- If you sum $n$ independent random variables $X_{1} + X_{2} + ... + X_{n}$, the expected value of the sum is given by $\mu_{1} + \mu_{2} + ... + \mu_{n}$ and the variance of the sum is given by $\sigma_{1}^{2} + \sigma_{2}^{2} + ... + \sigma_{n}^{2}$
- A binomially distributed random variable, $Y$, with $n$ trials is just equal to the sum of the $n$ Bernoulli distributed random variables representing the individual trials

Therefore, since all trials carry out the same experiment, the variables $X_{1}, X_{2}, ..., X_{n}$ are identically distributed, each with expected value $p$ and variance $p(1-p)$, leading to the result:

$$ E(Y) = E(X_{1}) + E(X_{2}) + ... + E(X_{n}) = p + p + ... + p = np $$
$$ var(Y) = var(X_{1}) + var(X_{2}) + ... + var(X_{n}) = p(1-p) + p(1-p) + ... + p(1-p) = np(1-p)$$
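
The sketch below checks these two results for $n = 10$ and $p = 0.5$ in two ways: by summing over the theoretical distribution, and by estimating from simulated samples. The variable names, the sample size, and the fixed seed are assumptions made for this illustration.

```python
from math import comb

import numpy as np

n, p = 10, 0.5

# Closed-form results derived above
formula_mean, formula_var = n * p, n * p * (1 - p)

# Same quantities computed from the theoretical binomial distribution
ks = np.arange(n + 1)
pmf = np.array([comb(n, k) * p**k * (1 - p) ** (n - k) for k in ks])
pmf_mean = np.sum(ks * pmf)
pmf_var = np.sum((ks - pmf_mean) ** 2 * pmf)

# Experimental estimates from simulated binomial samples
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
samples = rng.binomial(n, p, size=100_000)

print(formula_mean, formula_var)  # 5.0 2.5
print(pmf_mean, pmf_var)
print(samples.mean(), samples.var())
```

All three pairs should agree closely, with the experimental estimates differing slightly because of sampling noise.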

Awesome! I encourage you to play around with the parameters set in the above script to try to understand how the shape of the binomial distribution varies with its parameters and what that actually means in the real world. Overall, while these distributions are not the most widely used, they can be highly relevant in any context that can be modelled with a binomial or even a Bernoulli distribution. Generally, in data science, whenever we deal with inferential statistics (deducing properties of our data with probability distributions), it's very useful to understand the underlying distribution of our data. You will see how these can be useful once you go over statistical hypothesis testing and A/B testing.

# Challenges
__Question 1:__ Write a function called binomial_stats that:
- Takes in the number of trials and the success probability
- For 10000 samples, computes the experimental mean and variance from data randomly generated as done in this notebook
- Returns a tuple of the form (mean, variance)

Check your answers for some examples using this [website](https://stattrek.com/online-calculator/binomial.aspx) before submitting.