# 1. basic concepts


In this activity we are going to explore:
* **Events** and _discrete_ **random variables** [r.v.]
* **Probability distributions** associated with r.v.
* **Conditional probability distribution** between r.v.
* **Independence** between r.v.
* **Joint probability distribution**
* **Expected value**
* **Variance**

## 1.1 events and r.v.

Consider the event of **throwing a 6-faced dice**, either as a single throw or as a sequence of throws.

Considering this event, there are some terms that are important to differentiate, and we'll stress them here so that we are all on the same page!

1. the ***event*** of throwing the dice;
1. the **set of outcomes of the event** that are relevant to us: which face ends up on the dice (we could be considering other outcomes! - say, in the case of a physical dice, how many times it hit the table);
1. the ***random variable*** associated with the set of outcomes, call it $X$: the number on the face that ends up on top after a throw;
1. the ***sample space***: the possible values that $X$ may have.

Now:

* what is the sample space of $X$?

## 1.2 probability distribution

### 1.2.1 theoretical considerations

Consider the same event - the dice throw.
* what is the probability distribution of the r.v. $X$, considering that we have no evidence to favor the occurrence of any outcome?

Since we do not have a table to play on (!), let's simulate our hypothetical dice.

### 1.2.2 simulation fun

* use the `numpy.random` module to generate a single throw of a dice, i.e. a natural number between 1 and 6, each with equal probability (and abuse it!)

* could you tell the probabilities of each outcome by looking at a single outcome?
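A single throw could be sketched as below. This is only one way to do it: the helper name `throw_dice` and the seed are our own choices, and we assume NumPy's `Generator` API (`numpy.random.default_rng`) is available.

```python
import numpy as np

# Seeded generator so the sketch is reproducible; the seed value is arbitrary.
rng = np.random.default_rng(seed=42)

def throw_dice():
    """Return one outcome of X: an integer from 1 to 6, each equally likely."""
    # `high` is exclusive in Generator.integers, hence 7.
    return int(rng.integers(low=1, high=7))

x = throw_dice()
print(x)
```

Running the last two lines again and again gives a new outcome each time, but a single outcome on its own says nothing about the probabilities.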

Let's go deeper!

We can consider a sequence of multiple throws of our dice, and have each throw's outcome be represented by an r.v. named $X_i$.

* create a function `throw_dice_multiple_times` that, given a number of throws `n`, returns a sequence with the outcomes of `n` throws of the dice
* or, in other words, outputs the outcome of $(X_1, ..., X_n)$, usually represented as $(x_1, ..., x_n)$

* now, use that function to make count plots for 10, 100, and 1000 throws, in order to visualize the frequency of each outcome

* what do you observe as the number of throws gets higher? - can we now _observe_ the probabilities of each outcome?
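A minimal sketch of the multiple-throws function, assuming a fair dice and NumPy's `Generator` API. The helper `relative_frequencies` is a hypothetical name of ours; it only prepares the numbers that a count plot would display.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility

def throw_dice_multiple_times(n):
    """Return an array (x_1, ..., x_n) with the outcomes of n independent throws."""
    return rng.integers(low=1, high=7, size=n)

def relative_frequencies(outcomes):
    """Fraction of throws that landed on each face 1..6 (hypothetical helper)."""
    # bincount counts occurrences of 0..6; drop the unused slot for 0.
    counts = np.bincount(outcomes, minlength=7)[1:]
    return counts / len(outcomes)

for n in (10, 100, 1000):
    freqs = relative_frequencies(throw_dice_multiple_times(n))
    print(n, np.round(freqs, 3))
```

The frequencies can be handed to any bar plot (for instance matplotlib's `plt.bar`). As `n` grows they should drift towards 1/6 for every face.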

## 1.3 conditional probability

* what's the probability that a throw's outcome _was_ an odd number, _knowing that it was_ a prime? 
* or, in other words, what's $P(X \text{ is odd} | X \text{ is prime})$?
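One way to check the answer is to enumerate the sample space and apply the definition $P(A|B) = P(A \cap B) / P(B)$. The sketch below assumes a fair dice; the names `p`, `odd`, `prime` and `prob` are ours.

```python
from fractions import Fraction

faces = range(1, 7)
p = {x: Fraction(1, 6) for x in faces}   # uniform distribution of a fair dice

odd = {x for x in faces if x % 2 == 1}   # {1, 3, 5}
prime = {2, 3, 5}                        # the primes among 1..6

def prob(event):
    """Probability of an event, given as a set of faces."""
    return sum(p[x] for x in event)

# P(odd | prime) = P(odd and prime) / P(prime)
p_odd_given_prime = prob(odd & prime) / prob(prime)
print(p_odd_given_prime)  # 2/3
```

Using `Fraction` keeps the arithmetic exact, so the result reads as a fraction rather than as a rounded float.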

TODO

## 1.4 independence

## 1.5 joint and marginal probability

### 1.5.1 theoretical fun

Let's change things a bit: consider we now have 2 dice - one blue, one red - also, very perfect.

Call $X$ the r.v. associated with the outcome of the blue dice, and $Y$ the one associated with the red dice.

* how many outcomes are possible for a throw of the two dice?
* or, in other words, how many elements are in the sample space of the _joint_ r.v.?

* on a single throw (of the 2 dice), what's the probability that both dice end up with 3?
* or, in _other words_, what's $P(X=3, Y=3)$?

Consider a new random variable $Z$ that represents the sum of the outcome of both dice (i.e., $Z=X+Y$).

* what's the sample space of $Z$?

* what's the probability that both dice ended up with 3, knowing that their sum was 6?
* in _other words_, what's $P(X=3,Y=3|Z=6)$?
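These questions can be checked by brute force, enumerating every pair $(x, y)$. The sketch assumes both dice are fair and independent, so that all pairs are equally likely; `prob` is a hypothetical helper that takes an event as a function of $(x, y)$.

```python
from fractions import Fraction
from itertools import product

# Joint sample space: every (blue, red) pair.
outcomes = list(product(range(1, 7), repeat=2))
p_each = Fraction(1, len(outcomes))      # assumption: all pairs equally likely

def prob(event):
    """Probability of an event given as a predicate on (x, y)."""
    return sum(p_each for xy in outcomes if event(*xy))

p_33 = prob(lambda x, y: x == 3 and y == 3)
z_space = sorted({x + y for x, y in outcomes})
p_z6 = prob(lambda x, y: x + y == 6)
p_33_given_z6 = prob(lambda x, y: x == 3 and y == 3 and x + y == 6) / p_z6

print(len(outcomes))   # 36
print(p_33)            # 1/36
print(z_space)         # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(p_33_given_z6)   # 1/5
```

Conditioning on $Z=6$ shrinks the sample space to the 5 pairs that sum to 6, which is why the conditional probability is so much larger than $P(X=3, Y=3)$.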

### 1.5.2 simulation work


Getting back to our simulation efforts.

* make a function `throw_sum` that simulates the throwing of the 2 dice - outputting the outcome of $Z$: its small realized self, $z$

* make a function `throw_sum_multiple_times` that simulates `n` throws of the 2 dice - outputting the sequence of outcomes of each $Z_i$

* make **count plots** of the outcomes of $Z$ for 10, 100, and 1000 throws - to visualize the frequency of each outcome
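A possible sketch of both functions, again assuming fair, independent dice and NumPy's `Generator` API. The counts printed at the end are the numbers a count plot would show.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility

def throw_sum():
    """Return one realization z of Z = X + Y."""
    x, y = rng.integers(low=1, high=7, size=2)
    return int(x + y)

def throw_sum_multiple_times(n):
    """Return an array (z_1, ..., z_n) for n throws of the two dice."""
    throws = rng.integers(low=1, high=7, size=(n, 2))  # one row per throw
    return throws.sum(axis=1)

for n in (10, 100, 1000):
    z = throw_sum_multiple_times(n)
    # Count how often each value 2..12 showed up.
    counts = np.bincount(z, minlength=13)[2:]
    print(n, dict(zip(range(2, 13), counts.tolist())))
```

Unlike the single dice, the counts should not flatten out as `n` grows: values near 7 have more pairs $(x, y)$ behind them than values near 2 or 12.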

## 1.6 expected value

* what's the expected value of $X$ and $Z$? 

* compare those values with the _sample average_ of the outcomes of 1000 throws of the simulated dice associated with each r.v.
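As a sketch, the expected values can be computed exactly from the definition $E[X] = \sum_x x \, P(X=x)$ and then set side by side with sample averages. Fair, independent dice are assumed, and the variable names are ours.

```python
from fractions import Fraction
import numpy as np

# Exact expected values from the definition E[X] = sum of x * P(X = x).
e_x = sum(Fraction(x, 6) for x in range(1, 7))
e_z = sum(Fraction(x + y, 36) for x in range(1, 7) for y in range(1, 7))
print(e_x, e_z)  # 7/2 7

# Sample averages over 1000 simulated throws of the two dice.
rng = np.random.default_rng(seed=2)  # arbitrary seed
throws = rng.integers(low=1, high=7, size=(1000, 2))
mean_x = throws[:, 0].mean()         # first column plays the role of X
mean_z = throws.sum(axis=1).mean()   # row sums play the role of Z
print(mean_x, mean_z)
```

The sample averages will not match 3.5 and 7 exactly, but they should land close to them.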

## 1.7 variance

* what's the variance of $X$ and $Z$? 

* compare those values with the _sample variance_ of the outcomes of 1000 throws of the simulated dice associated with each r.v.
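A sketch of the same comparison for the variance, using $Var[X] = E[(X - E[X])^2]$. The helper `variance` is hypothetical and only works for an r.v. that is uniform over a list of (possibly repeated) values, which is the case here if the dice are fair and independent.

```python
from fractions import Fraction
import numpy as np

def variance(values):
    """Exact variance of an r.v. that is uniform over `values` (repeats allowed)."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / n

var_x = variance(list(range(1, 7)))
var_z = variance([x + y for x in range(1, 7) for y in range(1, 7)])
print(var_x, var_z)  # 35/12 35/6

# Sample variances over 1000 simulated throws (ddof=1 gives the unbiased estimator).
rng = np.random.default_rng(seed=3)  # arbitrary seed
throws = rng.integers(low=1, high=7, size=(1000, 2))
sample_var_x = throws[:, 0].var(ddof=1)
sample_var_z = throws.sum(axis=1).var(ddof=1)
print(sample_var_x, sample_var_z)
```

Note that the variance of $Z$ comes out as twice the variance of $X$, which is what we expect when adding two independent r.v.'s with the same distribution.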

## 1.8 continuous random variables

The r.v. associated with the outcome of our dice is what's called a _discrete random variable_ - in one of those rare occasions that mathematics births a telling name.

We are now going to consider another _type_ of r.v., a _continuous_ one.

So, consider a dystopian dream: the only way to travel from Lisbon to Porto is a bus that passes once a day and may depart from Lisbon at any instant of the day. Furthermore, there is no evidence to favour a specific time period: the departure time is _uniformly random_.

Now, let's consider an r.v. $X$ that represents the fraction of the day at which the bus departs: its sample space is $[0, 1[$.

* what is the expected value and variance of $X$?

* use `numpy.random` again to simulate the outcomes of 1 year of departures

* make a histogram with 12 bins (say, for the months!) of the simulated outcomes
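A sketch of the simulation, assuming one departure per day over 365 days and a uniform distribution on $[0, 1[$, for which the expected value is $1/2$ and the variance is $1/12$. `np.histogram` only computes the bin counts; any plotting library can draw them.

```python
import numpy as np

rng = np.random.default_rng(seed=4)  # arbitrary seed
departures = rng.random(365)         # 365 draws, uniform on [0, 1)

# Sample statistics next to the theoretical values.
print(departures.mean(), 1 / 2)
print(departures.var(ddof=1), 1 / 12)

# 12 equal-width bins over [0, 1).
counts, edges = np.histogram(departures, bins=12, range=(0.0, 1.0))
print(counts)
```

With only 365 draws the bins will look uneven, even though every bin has the same probability of 1/12.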

# 2. Pervasive distributions

We've now seen some r.v. and their probability distributions - but we haven't given them any _special names_.

Turns out there are some special distributions - special because they _pop up_ multiple times when we model world phenomena - and some, even, _very_ special.

* a preemptive question: what's the special name of the distribution we used to model our dice and bus?

There are many such distributions, and we are going to look at some of them - namely:
1. **Bernoulli**
1. **Binomial**
1. **Geometric**
1. **Poisson**
1. **Exponential**
1. **Normal** (a.k.a. Gaussian)

For each distribution, it is important that you know what it is mainly _used for_ (i.e., what general phenomena it _models_).

We are going to explore these distributions using the `scipy.stats` package, since we're past `numpy`'s expertise.

from scipy import stats

## 2.1 bernoulli

Consider a biased coin that we somehow know has a 60% chance to land heads.

* simulate the outcome of 1000 throws of that coin

* calculate the sample average and the sample variance of your simulated outcomes, and compare them with the expected value and variance of the distribution

* try to describe which events are modeled with a bernoulli distribution
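The coin could be simulated with `scipy.stats.bernoulli`, as in the sketch below. Encoding heads as 1 and tails as 0 is our own convention, and the seed is arbitrary.

```python
from scipy import stats

coin = stats.bernoulli(p=0.6)                 # 1 = heads, 0 = tails (our convention)
throws = coin.rvs(size=1000, random_state=5)  # 1000 simulated throws

# Sample statistics next to the distribution's own mean p and variance p * (1 - p).
print(throws.mean(), coin.mean())
print(throws.var(ddof=1), coin.var())
```

For a Bernoulli r.v. the sample average is simply the fraction of heads, so it should sit near 0.6, and the sample variance near 0.6 * 0.4 = 0.24.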

Now, say that you have reason to believe (god, or Google, told you) that the chance that you receive a spam email is 1 in 10. Furthermore, say you want to investigate a collection of 20 emails.

* what's (then) the probability that, out of those 20, only one of them is spam?

... enter the Binomial

## 2.2 binomial

Let's keep with the emails example (which really belongs here...), same probability of spam and number of emails considered.

* use `scipy.stats.binom` to respond to the last question (- if you haven't already done so!)

* what's the probability that 5 of the emails are spam?

* what's the most probable number of spam emails that one may get?

* what's the _expected number_ of spam emails you receive, out of the 20?

* what if you considered 1000? - what is the expected number then? (with the same probability of an email being spam)

* back to the sample of 20 emails - say you've taken a peek at 2 of them, and realized they are both spam - what's the probability that, out of the 20, 3 of them are spam?

Say we take the r.v. $X$ to represent the number of spam emails we get out of the 20.

* what's the sample space of $X$?

* plot the probability mass function of $X$

* plot the cumulative distribution function of $X$
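Most of these questions map directly onto methods of a frozen `scipy.stats.binom` object. The sketch assumes the emails are independent of each other, which is what the Binomial model requires; the variable names are ours.

```python
from scipy import stats

n, p = 20, 0.1
spam = stats.binom(n, p)        # X = number of spam emails out of 20

p_one = spam.pmf(1)             # P(X = 1)
p_five = spam.pmf(5)            # P(X = 5)

ks = range(n + 1)               # sample space of X: 0, 1, ..., 20
most_probable = max(ks, key=spam.pmf)

expected_20 = spam.mean()                    # n * p
expected_1000 = stats.binom(1000, p).mean()  # same p, 1000 emails

pmf_values = spam.pmf(list(ks))  # heights for a pmf bar plot
cdf_values = spam.cdf(list(ks))  # heights for a cdf step plot

print(p_one, p_five, most_probable, expected_20, expected_1000)
```

`pmf_values` and `cdf_values` are plain arrays indexed by `ks`, so they can go straight into a bar plot and a step plot respectively.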

## 2.3 geometric

Still keeping with the emails... now say you are waiting for them to arrive, one by one. The probability of one being spam remains 1 in 10.

* what's the probability that you will only see a spam email on the 8th email you receive?

* what's the most probable number of emails you will receive until you catch one that is spam?

* what's the probability that only the 4th one is NOT spam?

Say we take the r.v. $Y$ to represent the number of emails you receive until you get the first spam one.

* what's the sample space of $Y$?

* plot the probability mass function of $Y$ (until a reasonable value)

* plot the cumulative distribution function of $Y$ (again, be reasonable - please!)
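A sketch with `scipy.stats.geom`, whose support starts at 1: `pmf(k)` is the probability that the first spam email is exactly the `k`-th one. Independent emails are assumed, and cutting the plots off at 50 emails is an arbitrary choice of ours.

```python
from scipy import stats

p = 0.1
wait = stats.geom(p)            # Y = index of the first spam email (1, 2, 3, ...)

p_eighth = wait.pmf(8)          # 7 non-spam emails, then a spam one: 0.9**7 * 0.1

ks = range(1, 51)               # arbitrary cut-off for plotting
most_probable = max(ks, key=wait.pmf)

pmf_values = wait.pmf(list(ks))  # heights for a pmf bar plot
cdf_values = wait.cdf(list(ks))  # heights for a cdf step plot

print(p_eighth, most_probable, cdf_values[-1])
```

The pmf decreases with every extra email, so the single most probable value of $Y$ is the very first email, even though the expected wait is longer.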

## 2.4 poisson

Let's stay in the email universe, but consider a different set of questions.

Say you want to model the number of emails you get on a single day.
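A minimal sketch of such a model with `scipy.stats.poisson`. The text gives no rate, so the average of 12 emails per day below is purely hypothetical, picked only to make the code run.

```python
from scipy import stats

rate = 12                        # hypothetical average number of emails per day
emails = stats.poisson(mu=rate)  # number of emails received on a single day

print(emails.pmf(10))            # probability of exactly 10 emails in a day
print(emails.mean(), emails.var())  # for a Poisson, both equal the rate

# One simulated year of daily email counts.
sample = emails.rvs(size=365, random_state=6)
print(sample.mean(), sample.var(ddof=1))
```

A quick sanity check on real data follows from this: if the sample mean and sample variance of daily counts are far apart, a Poisson model is probably a poor fit.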

## 2.5 exponential

## 2.6 normal

## 2.7 all-in-one!

# 3. testing hypotheticals

So this is paradoxical: we are about to leave hypothetical lands - to look at hypothesis testing.

Now, we won't accept google-given probability distributions. Instead, we will test whether we have reason to believe what we're told and what we want to know: we will test statements about our r.v.'s based on the outcomes we have from them.

# 4. testing A, and testing B