# Exercises: Probability

## Intro to Probability

### Questions

Suppose we observe a person choosing which city to travel from, $y$, so that the domain of $y$ is $D_y$:

```python
Dy = {'Leeds', 'London', 'Manchester'}
```

Q. Use `product()` to obtain the total outcome space of *three* possible observations.

HINT: itertools, `product(..., repeat=...)`
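To illustrate the hint, here is a minimal sketch of `product(..., repeat=...)` on a toy coin domain. The name `D_coin` and the choice of two observations are assumptions for illustration only, not the exercise's own setup.

```python
from itertools import product

# Toy domain, an assumption for illustration (not the exercise's Dy).
D_coin = {'H', 'T'}

# Every ordered sequence of two observations drawn from the domain.
outcomes = set(product(D_coin, repeat=2))

print(len(outcomes))      # 2 ** 2 ordered pairs
print(sorted(outcomes))
```

Each outcome is a tuple, so order matters: `('H', 'T')` and `('T', 'H')` are distinct outcomes.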

Q. Define an event E, by filtering the outcome space for the case where a majority of people chose the same city.

Q. What is the probability that E occurs?

Q. 

* Choose a problem domain of your own interest (retail, finance, health, crime..). 

* Choose a *binary* target variable to model, ie., $y \in \{0, 1\}$. 

* Define the outcome space for $N$ measurements of this variable (consider $N=10$, and also, EXTRA: can you represent $N=1000$ efficiently?).

* Define three events of interest over this outcome space and determine their probabilities. 
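For the $N=1000$ case, enumerating all $2^{1000}$ sequences is not feasible. One possible approach, sketched below, is to store only how many sequences contain each number of ones. This assumes every sequence is equally likely and that the events of interest depend only on that count; the names `multiplicity` and `prob_at_least` are hypothetical.

```python
from fractions import Fraction
from math import comb

N = 1000

# Number of length-N binary sequences containing exactly k ones.
multiplicity = {k: comb(N, k) for k in range(N + 1)}

total = sum(multiplicity.values())  # equals 2 ** N

def prob_at_least(threshold):
    """P(number of ones >= threshold), assuming equally likely sequences."""
    favourable = sum(multiplicity[k] for k in range(threshold, N + 1))
    return Fraction(favourable, total)

print(float(prob_at_least(500)))
```

The dictionary has only $N + 1$ entries, yet it accounts for every sequence in the full outcome space.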


### Extra


Bayes' theorem is,

$$P(A|B, \Omega) = \frac{P(B|A, \Omega)P(A|\Omega)}{P(B|\Omega)}$$

Show that Bayes' theorem holds for

* $\Omega$ : The outcome space for an experiment where 
    * a die is rolled; if the outcome is even, a coin is flipped; if the outcome is odd, a die is rolled and its evenness is recorded
* $A$ : that two die rolls are odd
* $B$ : that you initially roll a number greater than 3
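As a warm-up, both sides of Bayes' theorem can be checked by counting over a simpler outcome space. The sketch below assumes two independent fair die rolls with no branching, which is a deliberate simplification of the experiment described above; the helper `P` and the event functions are hypothetical names.

```python
from fractions import Fraction
from itertools import product

# Simplified outcome space (an assumption): two independent fair die rolls.
omega = set(product(range(1, 7), repeat=2))

def P(event, given=None):
    """Probability of `event` by counting, optionally conditioned on `given`."""
    space = omega if given is None else {o for o in omega if given(o)}
    return Fraction(sum(1 for o in space if event(o)), len(space))

def A(o):
    """Both rolls are odd."""
    return o[0] % 2 == 1 and o[1] % 2 == 1

def B(o):
    """The first roll is greater than 3."""
    return o[0] > 3

lhs = P(A, given=B)
rhs = P(B, given=A) * P(A) / P(B)
print(lhs, rhs)
```

Counting works here because every outcome is equally likely. In the branching experiment the outcomes are not all of the same kind, so the probabilities need to be assigned with more care.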

---

## Intro to Probability Density

Q. Obtain several real-world datasets (eg., via `sns.load_dataset`) and prepare for modelling (ie., removing missing data).

Show that, in each case, the means of several random samples from a column (ie., variable) are distributed approximately normally.
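The sampling step might look like the sketch below. It uses a synthetic, skewed list of values as a stand-in for a real column (an assumption, so that the block runs without any dataset), and the function name `sample_means` and its default sizes are hypothetical.

```python
import random
from statistics import mean, stdev

random.seed(0)

# Synthetic, skewed "column" (an assumption standing in for a real dataset).
population = [random.expovariate(1.0) for _ in range(10_000)]

def sample_means(values, n_samples=1_000, sample_size=50):
    """Means of repeated random samples drawn without replacement from `values`."""
    return [mean(random.sample(values, sample_size)) for _ in range(n_samples)]

means = sample_means(population)

# The mean of the sample means should sit close to the population mean,
# and their spread should be much narrower than the population's.
print(mean(population), mean(means))
print(stdev(population), stdev(means))
```

A histogram of `means` should look roughly bell-shaped even though `population` itself is strongly skewed.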

Q. The formula for a normal distribution is,

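```python
from math import exp, pi

def normal(outcome, m=100, s=15):
    """Normal density at `outcome` with mean m and standard deviation s."""
    v = 2 * s ** 2          # twice the variance
    c = (pi * v) ** -0.5    # normalising constant 1 / sqrt(2 * pi * s^2)

    return c * exp(-((outcome - m) ** 2) / v)
```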

Using the mean of the sample means from one of your datasets above, and their standard deviation, *integrate* the normal distribution from the mean (which, for a normal distribution, is also the median) to a point of interest.

HINT: `from scipy.integrate import quad as areaof`
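With the import from the hint, `areaof(normal, a, b)` returns a pair: the area and an error estimate. To show what that area means without depending on SciPy, the sketch below swaps in a hand-written composite Simpson's rule and checks it against the closed form based on `math.erf`. The limits 100 and 115 are just the default mean and one standard deviation above it, chosen for illustration.

```python
from math import erf, exp, pi, sqrt

def normal(outcome, m=100, s=15):
    v = 2 * s ** 2
    c = (pi * v) ** -0.5
    return c * exp(-((outcome - m) ** 2) / v)

def simpson(f, a, b, n=1_000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Area from the centre (100) to one standard deviation above it (115).
area = simpson(normal, 100, 115)

# Closed-form check via the error function.
exact = 0.5 * erf((115 - 100) / (15 * sqrt(2)))
print(area, exact)
```

The area is the probability that an outcome falls between the two limits, so it can never exceed 0.5 when one limit is the mean.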


### Extra
Q. From one of your datasets perform a factoring (ie., groupby) on the column you have been sampling to create two subsets. (Eg., ages for men, ages for women). 

Obtain the mean of the sample means and the standard deviation for each subset, then perform the same integration for both distributions.

Is there a difference in value? If so, what is the meaning of this difference?

## Intro to Random Variables

Q. Starting with the outcome spaces you used in "Intro to Probability", define several Random Variables (ie., python functions) which compute a real value for events of interest.

Eg., consider the event $E$: a majority chose the same city. A random variable $C$ could *count the number of unique choices*. With three observations, this event then corresponds to $C \le 2$.

Q. Plot the distributions of your random variables (ie., generate values and plot).
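Before plotting, it can help to tabulate a distribution directly. The sketch below does this for a toy outcome space of three coin flips, assuming all outcomes are equally likely; the random variable `heads` is a hypothetical example and not one of the exercise's own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Toy outcome space (an assumption): three fair coin flips.
outcomes = list(product('HT', repeat=3))

def heads(outcome):
    """Random variable: number of heads in one outcome."""
    return outcome.count('H')

# Distribution of the random variable, assuming equally likely outcomes.
counts = Counter(heads(o) for o in outcomes)
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p, '#' * counts[x])   # crude text histogram
```

The keys and values of `distribution` can be passed to a bar-chart function to get a proper plot.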

### Extra

> In probability theory and statistics, the Poisson distribution (/ˈpwɑːsɒn/; French pronunciation: ​[pwasɔ̃]), named after French mathematician Siméon Denis Poisson, is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event. -- Wikipedia

Assuming events occur independently of prior ones, with a mean of $\lambda$ events per interval, the probability of $k$ events in an interval $n$ is $P(k|n) = \frac{\lambda^k}{k!} e^{-\lambda}$.


Suppose we measure the (independent) arrival of messages over $N$ seconds, with an observation each second: $E = (y_1, \dots, y_N)$ where $y_i \in \{0, 1\}$.

**Q. Define the outcome space for $N=3$.**

**Q. Compute a probability for each event in the outcome space given a mean arrival rate $\lambda$ (eg., 1.2 messages per interval).**

**Q. Introduce a random variable which obtains the event count from the outcome space.**

**Q. By considering such a random variable show the sum of probabilities for these counts is close to 1.**

**Q. Why isn't it 1?**

```python
from itertools import product
from math import factorial, exp

Dyi = {0, 1}

l = 1.2
N = 3
O = set(product(Dyi, repeat=N))

def P(k, N, l):
    return exp(-1 * l) * (l ** k)/factorial(k)

print([P(sum(e), N, l) for e in O ])

X = set([sum(e) for e in O])

print(sum([P(k, N, l) for k in X ]))
```

```
[0.21685983257678554, 0.21685983257678554, 0.36143305429464256, 0.30119421191220214, 0.36143305429464256, 0.36143305429464256, 0.0867439330307142, 0.21685983257678554]
0.9662310318143443
```
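The total printed above falls short of 1. One way to explore why is to let the count run past $N = 3$: the Poisson formula assigns some probability to every non-negative count, not only to 0 through 3. The sketch below shows the cumulative sums for a few hypothetical cut-offs; the function name `poisson` and the cut-off values are choices made for illustration.

```python
from math import exp, factorial

def poisson(k, lam):
    """Poisson probability of exactly k events when the mean count is lam."""
    return exp(-lam) * lam ** k / factorial(k)

lam = 1.2

# Cumulative probability of observing at most k_max events.
for k_max in (3, 5, 10, 20):
    partial = sum(poisson(k, lam) for k in range(k_max + 1))
    print(k_max, partial)
```

The first line reproduces the total from the cell above, and later lines move towards 1 as more counts are included.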
