## 1. Event and Experimental Probability

When asked: *What's the probability of landing Tails when tossing a (fair) coin?* You'd (probably) answer: $\frac{1}{2}$. Or 50%. But what does that mean?

And when asked: *What's the probability of getting 6 when throwing a six-sided (fair) die?* The expected answer: $\frac{1}{6}$. *What about NOT getting 6?* It's $\frac{5}{6}$. And *getting a number less than five?* $\frac{2}{3}$? But what do those numbers mean?

And - *The probability of drawing an Ace from a (fair) deck of cards?*

---

Let's get back to the coin. $\frac{1}{2}$, you say. But what does it mean? Maybe you meant that out of two tosses you'll land exactly one Tails ($TH$ or $HT$)? Okay, let's toss a coin two times. I got $TT$. Maybe my assumption was wrong? Or did I just get unlucky? Let's toss it again. $HT$ this time. Maybe I'm actually right? Or I got lucky. Another two tosses! $HH$. Hmm... Unlucky again? But can I talk about luck in mathematics and in hypothesis testing?

What about the other answer: 50%. Does that mean that I'll land Tails exactly half the times when tossing a coin $N$ times? Let's toss it, say, 10 times. But I don't actually have a coin. Luckily I have Python which can simulate tossing a coin. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(seed=1023)

In [None]:
coin = ['H', 'T']

tossing = rng.choice(coin, size=10)
tossing

6 of 10 Tails. Close, but not quite 50%. But, wait. 50% also means 50/100. Let's toss a coin 100 times then. 

In [None]:
tossing = rng.choice(coin, size=100)
tossing

How many Tails, then?

In [None]:
np.unique(tossing, return_counts=True)

47/100. Not bad. But not quite 50%. Lucky, but not lucky enough? (But can I talk about luck in mathematics and in hypothesis testing?) What if the result is only an *approximation* of that 50%? Let's toss a coin 1000 times.

In [None]:
tossing = rng.choice(coin, size=1000)
np.unique(tossing, return_counts=True)

520/1000. That's 52%. And that's not a bad *approximation* at all. But still... Now, let's toss a coin ONE MILLION TIMES.

In [None]:
tossing = rng.choice(coin, size=10**6)
np.unique(tossing, return_counts=True)

In [None]:
499463/10**6

Well - that's 49.95%, almost 50%. Quite a good approximation.

What if we tossed a coin more than a million times?

In [None]:
tossing = rng.choice(coin, size=10**7)
np.unique(tossing, return_counts=True)

In [None]:
5000092/10**7

This is even better, veeeery close to 50%; 50.001%, to be more exact.

And if we tossed a coin **infinitely many times**? Then there would be no approximation. We'd get exactly 50%. Mathematically, we can write this as

$$ P(T) = \lim_{N_{\rm TRIES}\rightarrow\infty}\frac{n_{\rm HITS}}{N_{\rm TRIES}} = 0.5 = \frac{1}{2}.$$

What are all these letters? Let's interpret the symbols in this formula one by one.

- $T$ is the *event*: the event of landing Tails when tossing ONE coin.


- $P$ is the *probability*: we may consider it as a function which *measures* how probable an event is. So, $P(T)$ is the probability of landing Tails when tossing one coin.


- $N_{\rm TRIES}$ is the number of trials of the SAME experiment. In our case, it's the number of coin tosses.


- $n_{\rm HITS}$ is the number of 'hits', i.e. the number of desired outcomes in our set of trials. In our case, it's the number of Tails landed. 


- $\lim_{N_{\rm TRIES}\rightarrow\infty}$ is the *limit*: the value of the fraction $\frac{n_{\rm HITS}}{N_{\rm TRIES}}$ which we would obtain if we were *theoretically* able to toss a coin infinitely many times in our finite lives. But we can't do that. We can only obtain an *experimental* (or *statistical*) approximation of the *theoretical* value in a finite number of trials. And as we demonstrated - the larger the number of trials, the better the approximation tends to be.
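The limit can be watched numerically by tracking the fraction $\frac{n_{\rm HITS}}{N_{\rm TRIES}}$ after every single toss. What follows is only a minimal sketch: it assumes a fresh generator with an arbitrary seed, encodes Tails as 1 and Heads as 0, and the names `tosses` and `running_fraction` are illustrative rather than taken from the cells above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, used only for reproducibility

# 1 stands for Tails, 0 for Heads; both are equally likely
tosses = rng.integers(0, 2, size=10**5)

# n_HITS / N_TRIES evaluated after each toss
n_tries = np.arange(1, tosses.size + 1)
running_fraction = np.cumsum(tosses) / n_tries

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"N = {n:>6}: fraction of Tails = {running_fraction[n - 1]:.4f}")
```

The printed fractions should drift towards 0.5 as $N_{\rm TRIES}$ grows, although not necessarily monotonically.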

***

Let's now illustrate this *limit* with another example: rolling a six-sided die. Intuitively we say that the probability of getting a 6 is $\frac{1}{6}$, or 16.666...%. As with the coin, we'll simulate rolling a die for various numbers of trials and list the results, i.e. the approximations of the theoretical probability.

In [None]:
# Sample sizes spaced evenly on a log scale, from about 56 up to 10 million rolls
number_of_rolls = np.logspace(1.75, 7, 100, dtype=int)
die = np.arange(1, 7)

# Count the 6s directly; taking the last entry of np.unique's counts
# would report the wrong face if no 6 happened to be rolled
number_of_6s = np.array([
    np.count_nonzero(rng.choice(die, size=n_rolls) == 6)
    for n_rolls in number_of_rolls
])

exp_prob_df = pd.DataFrame({
    'No. of Rolls': number_of_rolls,
    'No. of 6s': number_of_6s,
    'Experimental Probability': np.round(number_of_6s / number_of_rolls, 5),
})

print(exp_prob_df.to_markdown())  # to_markdown needs the `tabulate` package

And here's what our accumulation point looks like:

In [None]:
fig, ax = plt.subplots(figsize=(16, 10))

sns.scatterplot(ax=ax, data=exp_prob_df, x='No. of Rolls', y='Experimental Probability')
# Label the line itself so the legend entry cannot be attached to the scatter points
ax.axhline(1/6, c='r', label='Theoretical Probability')

ax.set_xscale('log')
ax.set_yticks(np.linspace(.08, .24, 25))
ax.legend();

From the scatterplot above we see how the values of the ratio $\frac{n_{\rm HITS}}{N_{\rm TRIES}}$ converge towards the predicted theoretical probability of 16.666...% as $N_{\rm TRIES}\rightarrow\infty$.

But, is there a way to calculate theoretical probability **exactly**?

## 2. $\sigma$-Algebra of Events and Theoretical Probability

In order to be able to speak properly about theoretical probability (and be able to compute it), we need to introduce the *$\sigma$-algebra of events* and *probability-as-a-measure*. We can consider a $\sigma$-algebra of events as a family of sets where every set is an event, and every element of this event-set is an *outcome* for which we consider the event realized, i.e. a *favorable outcome*.

For example, if we toss a coin two times, then one event-set might be

$$ A - {\rm Landed\ 1\ Heads\ and\ 1\ Tails}, $$

and its elements are

$$ A = \{HT, TH\}.$$

The set of all the possible outcomes for a given experiment/observation is called the *universal set*. It is denoted by $\Omega$, and it is the set upon which the $\sigma$-algebra of events is built, out of the subsets $A\subset\Omega$ of the universal set.

For tossing a coin two times we have

$$ \Omega = \{HH, HT, TH, TT\},$$

and obviously

$$ A \subset \Omega. $$
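These sets translate directly into Python sets. The snippet below is a small illustrative sketch (the names `omega` and `A` are chosen here to mirror the notation above); it builds $\Omega$ for two coin tosses, picks out the event $A$, and checks that $A\subset\Omega$.

```python
from itertools import product

coin = ['H', 'T']

# Universal set: every possible result of two tosses
omega = {''.join(outcome) for outcome in product(coin, repeat=2)}

# Event A: landed 1 Heads and 1 Tails
A = {outcome for outcome in omega if outcome.count('H') == 1}

print(sorted(omega))  # ['HH', 'HT', 'TH', 'TT']
print(sorted(A))      # ['HT', 'TH']
print(A < omega)      # True: A is a proper subset of omega
```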

Let $\Sigma$ be a $\sigma$-algebra of events on $\Omega$, and $A$ and $B$ two events (which we write $A, B\in \Sigma$). Then we have:

$$1^\circ\quad \emptyset,\ \Omega \in \Sigma$$

$$2^\circ\quad A^C \in \Sigma$$

$$3^\circ\quad A\cup B \in \Sigma$$

$$4^\circ\quad A\cap B \in \Sigma.$$

What do these cryptic messages even mean? Let's explain them one by one. I promise they make sense.

$$1^\circ\quad \emptyset,\ \Omega \in \Sigma$$

This means that the empty set $\emptyset$ and the universal set $\Omega$ are also considered as events.

$\emptyset$ is called an *impossible event*, and it does not contain any (possible) outcome. 

$\Omega$, viewed as an event, is called the *certain event* - getting any outcome from all the possible outcomes is definitely a certain event.

---

$$2^\circ\quad A^C \in \Sigma$$

The complement of $A$, $A^C = \Omega\setminus A$, is also considered as an event. Every unfavourable outcome of $A$ is favourable for $A^C$ (and vice versa). So, for the event $A$ as defined above, we have:

$$ A^C = \{HH, TT\}. $$

---

$$3^\circ\quad A\cup B \in \Sigma$$

A union of two events (denoted also as $A + B$) is also considered as an event. But what is a union of two events actually? We say that the event $A\cup B$ is realized when at least one of the events $A$ **OR** $B$ is realized. $A\cup B$ contains all the outcomes which are favourable for either the event $A$ **OR** event $B$.

For example, if we define event $B$ as

$$B - {\rm Landed\ 2\ Tails,}$$

we have 

$$ A\cup B = \{HT, TH, TT\}.$$

---

$$4^\circ\quad A\cap B \in \Sigma.$$

An intersection of two events (denoted also as $AB$) is also considered as an event. We say that the event $AB$ is realized when both events $A$ **AND** $B$ are simultaneously realized. $AB$ contains all the outcomes which are favourable for both events $A$ **AND** $B$.

For example, if we define event $C$ as

$$ C - {\rm Tails\ in\ the\ first\ coin\ toss}, $$

we have 

$$ AC = \{HT, TH\} \cap \{TH, TT\} = \{TH\}.$$

Two events $A$ and $B$ are *mutually exclusive* or *disjoint* if $A\cap B = \emptyset$, i.e. if realization of both events simultaneously is an impossible event.
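The set operations above map one-to-one onto Python's built-in set operators. A short sketch using the events $A$, $B$ and $C$ exactly as defined in the text (only the variable names are an assumption):

```python
omega = {'HH', 'HT', 'TH', 'TT'}
A = {'HT', 'TH'}  # landed 1 Heads and 1 Tails
B = {'TT'}        # landed 2 Tails
C = {'TH', 'TT'}  # Tails in the first coin toss

print(sorted(omega - A))  # complement of A: ['HH', 'TT']
print(sorted(A | B))      # union A or B: ['HT', 'TH', 'TT']
print(sorted(A & C))      # intersection A and C: ['TH']
print(A.isdisjoint(B))    # True: A and B are mutually exclusive
```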

---

One more important notion is the *elementary event* - an event containing a single possible outcome. So, for tossing our coin two times, the elementary events are: $\{HH\},\ \{HT\},\ \{TH\}$ and $\{TT\}$.

$\sigma$-algebras of events serve as a bridge between the natural language in which we describe events and their outcomes and the formal mathematical language of sets, their elements and set operations. The upside of mathematical objects is that it is natural to impose a measure on them, so we can now measure events, i.e. measure the probability that an event is realized. We can therefore define probability-as-a-measure: a mapping $P$ from the $\sigma$-algebra of events to the set of real numbers that satisfies the following *axioms* (known as the *Kolmogorov Axioms*):

**Axiom 1**: For any event $A$ we have

$$ P(A)\geqslant 0.$$

**Axiom 2**: For the certain event $\Omega$ we have

$$ P(\Omega) = 1.$$

**Axiom 3**: For pairwise mutually exclusive events $A_1, A_2, \ldots, A_n, \ldots$ we have

$$P\Big(\bigcup_{i=1}^{\infty}A_i\Big) = \sum_{i=1}^{\infty}P(A_i).$$

What do those axioms tell us?

**Axiom 1** means that the probability of any event has to be some non-negative real number; in other words - we cannot have negative probability (the same way we cannot have negative length, area or volume - which are also measures of some kind).

**Axiom 2** says that the probability of a certain event is 1 (or 100%). As we know that $\Omega$, viewed as a set, is a universal set, i.e. set that contains all the possible outcomes - this axiom also tells us that the probability of observing any outcome of all possible defined outcomes is equal to 1. 

If we want to compute a probability of some union of mutually exclusive events, **Axiom 3** tells us that we can do that just by summing the probabilities of every single event. 

***

And here's a nice reminder for $\sigma$-algebras and Kolmogorov Axioms:

<img src="prob.png" style="width: 800px;"/>

These axioms have very useful consequences; if $A$ and $B$ are two events, we have:

$$ 1^\circ\ P(\emptyset) = 0, $$

$$ 2^\circ\ 0 \leqslant P(A) \leqslant 1, $$

$$ 3^\circ\ P(A^C) = 1 - P(A), $$

$$ 4^\circ\ A\subseteq B \Rightarrow P(A) \leqslant P(B).$$

Let's now go over these consequences. 

$$ 1^\circ\ P(\emptyset) = 0 $$

We saw that the impossible event is represented by $\emptyset$. So, this tells us that the probability of the impossible event is 0.

***

$$ 2^\circ\ 0 \leqslant P(A) \leqslant 1 $$

Not only is the probability of an event a non-negative real number - it's a real number belonging to the interval $[0, 1]$, and the endpoints of this interval are attained by the impossible event ($P(\emptyset) = 0$) and the certain event ($P(\Omega) = 1$). The closer the probability is to 1, the more probable the realization of the event. This also allows us to speak about probabilities in terms of percentages.

***

$$ 3^\circ\ P(A^C) = 1 - P(A) $$

This tells us how to calculate the probability of a complementary event. If the probability of an event happening is 22%, then the probability of it NOT happening is 78%.

***

$$4^\circ\ A\subseteq B \Rightarrow P(A) \leqslant P(B) $$

If set $A$ is contained in set $B$, i.e. if all the outcomes favourable for event $A$ are also favourable for event $B$, then event $A$ has a smaller (or equal) chance of being realized than event $B$. In other words - an event whose favourable outcomes all belong to another event cannot be more probable than that event.
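Consequences $3^\circ$ and $4^\circ$ follow from the axioms in a couple of lines. The derivation below is a sketch that assumes Axiom 3 may be applied to just two mutually exclusive events (taking all the remaining sets in the sequence to be $\emptyset$, which contributes nothing because $P(\emptyset) = 0$):

$$ A\cup A^C = \Omega,\quad A\cap A^C = \emptyset \ \Rightarrow\ P(A) + P(A^C) = P(\Omega) = 1 \ \Rightarrow\ P(A^C) = 1 - P(A), $$

$$ A\subseteq B \ \Rightarrow\ B = A\cup(B\cap A^C),\quad A\cap(B\cap A^C) = \emptyset \ \Rightarrow\ P(B) = P(A) + P(B\cap A^C) \geqslant P(A). $$

The last inequality uses Axiom 1, since $P(B\cap A^C)\geqslant 0$.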

***

All this talk about the Probability Theory, but we still haven't figured out how to calculate theoretical probability. Don't worry, we are almost there - and we have all the ingredients to write a formula that stems quite naturally from the theoretical foundations above. 

As we saw, the probability of an event should be a number between 0 and 1, with the probabilities of the impossible and the certain event as extreme values. And the bigger the event/set is, the bigger its probability should be. When $\Omega$ is finite and all its outcomes are equally likely (as with a fair coin or a fair die), this leads us to define the theoretical probability of an event $A$ via the following simple formula:

$$P(A) = \frac{|A|}{|\Omega|} = \frac{{\rm No.\ of\ all\ the\ favourable\ outcomes\ for}\ A}{{\rm No.\ of\ all\ the\ possible\ outcomes}}.$$

($|A|$ is the *cardinality* of the set $A$, i.e. the number of elements the set has.)

One can easily check that formula for the probability, as given above, satisfies all the Kolmogorov Axioms and, of course, all the listed consequences. 
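For a small universal set this check can even be done by brute force. The sketch below is an illustration, not a proof: it assumes the two-toss $\Omega$ with equally likely outcomes, takes every subset of $\Omega$ as an event, and tests only the finite (pairwise) form of Axiom 3. The helper `prob` and the list `events` are names introduced here.

```python
from fractions import Fraction
from itertools import chain, combinations

omega = frozenset({'HH', 'HT', 'TH', 'TT'})

def prob(event):
    """Classical probability |A| / |Omega| (assumes equally likely outcomes)."""
    return Fraction(len(event), len(omega))

# All 2**4 = 16 events: every subset of omega
events = [frozenset(subset) for subset in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

# Axiom 1: non-negativity
assert all(prob(E) >= 0 for E in events)
# Axiom 2: the certain event has probability 1
assert prob(omega) == 1
# Axiom 3 (finite case): additivity for every pair of disjoint events
assert all(prob(E | F) == prob(E) + prob(F)
           for E in events for F in events if E.isdisjoint(F))

# Consequences 1 to 4
assert prob(frozenset()) == 0
assert all(0 <= prob(E) <= 1 for E in events)
assert all(prob(omega - E) == 1 - prob(E) for E in events)
assert all(prob(E) <= prob(F) for E in events for F in events if E <= F)

print(f"All checks passed for {len(events)} events.")
```

Using `Fraction` instead of floats keeps every comparison exact, so the equalities are not affected by rounding.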

***

Now that we have the 'formula for probability' we can easily calculate probabilities of the events listed at the beginning of this notebook. 

- Probability of landing Tails on a coin toss:

$$P(T) = \frac{|\{T\}|}{|\{H, T\}|} = \frac{1}{2}.$$


- Probability of landing at least one Tails on two coin tosses (event $A\cup B$):

$$P(A\cup B) = \frac{|\{TH, HT, TT\}|}{|\{HH, TH, HT, TT\}|} = \frac{3}{4}.$$


- Probability of getting 6 when rolling a six-sided die:

$$P(X = 6) = \frac{|\{6\}|}{|\{1, 2, 3, 4, 5, 6\}|} = \frac{1}{6}.$$



- Probability of getting less than 5 when rolling a six-sided die:

$$P(X < 5) = \frac{|\{1, 2, 3, 4\}|}{|\{1, 2, 3, 4, 5, 6\}|} = \frac{4}{6} = \frac{2}{3}.$$



- Probability of getting any number from 1 to 6 when rolling a six-sided die:

$$P(1\leqslant X\leqslant 6) = \frac{|\{1, 2, 3, 4, 5, 6\}|}{|\{1, 2, 3, 4, 5, 6\}|} = \frac{6}{6} = 1.$$



- Probability of getting a 7 when rolling a six-sided die:

$$P(X = 7) = \frac{|\emptyset|}{|\{1, 2, 3, 4, 5, 6\}|} = \frac{0}{6} = 0.$$


- Probability of getting 6 or 7 when rolling a 20-sided die:

$$P(\{X=6\}\cup\{X=7\}) = P(X=6) + P(X=7) = \frac{1}{20} + \frac{1}{20} = \frac{2}{20} = \frac{1}{10}.$$



- Probability of getting less than 19 when rolling a 20-sided die:

$$P(X < 19) = 1 - P(\{X < 19\}^C) = 1 - P(X\geqslant 19) =  1 -\frac{|\{19, 20\}|}{|\{1, 2, \ldots, 20\}|} = 1 - \frac{2}{20} = \frac{18}{20} = \frac{9}{10}.$$


- Probability of drawing an Ace from a deck of cards:

In [None]:
print('P(X = A) = |{\U0001F0A1, \U0001F0B1, \U0001F0C1, \U0001F0D1}|/|Whole Deck of Cards| = 4/52 = 1/13.')
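The same counts can be reproduced in a few lines of Python. This is a hedged sketch: the helper `prob`, the sets `d6`, `d20` and `deck`, and the spelled-out suit names are all introduced here for illustration, and every outcome is assumed to be equally likely.

```python
from fractions import Fraction

def prob(event, omega):
    """Classical probability |event| / |omega|, assuming equally likely outcomes."""
    return Fraction(len(event), len(omega))

d6 = set(range(1, 7))    # six-sided die
d20 = set(range(1, 21))  # 20-sided die

p_six = prob({x for x in d6 if x == 6}, d6)
p_less_than_5 = prob({x for x in d6 if x < 5}, d6)
p_seven = prob({x for x in d6 if x == 7}, d6)  # empty event
p_6_or_7 = prob({x for x in d20 if x in (6, 7)}, d20)
p_less_than_19 = 1 - prob({x for x in d20 if x >= 19}, d20)  # via the complement

# A standard 52-card deck as (rank, suit) pairs
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'hearts', 'diamonds', 'clubs']
deck = {(rank, suit) for rank in ranks for suit in suits}
p_ace = prob({card for card in deck if card[0] == 'A'}, deck)

print(p_six, p_less_than_5, p_seven, p_6_or_7, p_less_than_19, p_ace)
# prints: 1/6 2/3 0 1/10 9/10 1/13
```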

Even though we have a tool to calculate the probability of an event exactly, we shouldn't forget about experimental probability. First, it lets us check our theoretical calculation experimentally. Second, and more importantly: sometimes calculating the theoretical probability is difficult, or even impossible; then performing the experiments and noting down the results is a way to estimate the probability of an event.
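As an example of such a check, the sketch below simulates the two-toss experiment many times and compares the experimental probability of landing at least one Tails with the theoretical value $\frac{3}{4}$. The seed, the number of trials and the 1 = Tails encoding are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, used only for reproducibility

n_trials = 10**5
# Each row is one experiment of two tosses; 1 = Tails, 0 = Heads
tosses = rng.integers(0, 2, size=(n_trials, 2))

# An experiment is a 'hit' if at least one of its two tosses is Tails
n_hits = np.count_nonzero(tosses.any(axis=1))
experimental = n_hits / n_trials
theoretical = 3 / 4

print(f"experimental: {experimental:.4f}, theoretical: {theoretical:.4f}")
```

With this many trials the two numbers should typically agree to about two decimal places.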

## 3. Random variable: a Discrete Type