# Important Brain Dump

Important notes that I need to remember but have not yet been able to formalize.

**Always refer to my example setup.**

## Temp Writings

```{prf:definition} Bernoulli Distribution
:label: def:bernoulli

Let $X$ be a **Bernoulli random variable** with parameter $p$. Then the 
probability mass function (PMF) of $X$ is given by 

$$
\begin{align}
\P(X=x) = \begin{cases}
p   &\quad \text{ if } x=1 \\
1-p &\quad \text{ if } x=0 \\
0   &\quad \text{ otherwise }
\end{cases}
\end{align}
$$

where $0 \leq p \leq 1$ is called the Bernoulli parameter. 

Some conventions:

1. We denote $X \sim \bern(p)$ if $X$ follows a Bernoulli distribution with parameter $p$.
2. The states of $X$ are $x \in \{0,1\}$. This means $X$ only has two (binary) states, 0 and 1.
3. We denote $1$ as **success** and $0$ as **failure** and consequently $p$ as the probability of success
and $1-p$ as the probability of failure.
4. Bear in mind that $X$ is defined over $\pspace$, so when we write $\P \lsq X=x \rsq$, we are really writing
$\P \lsq E \rsq$ for some event $E \in \E$. In a coin toss, for example, if $E$ is the event that the coin lands on heads,
then $E = \{X=1\}$.
```


```{prf:property} Expectation of Bernoulli Distribution
:label: prop:bernoulli

Let $X \sim \bern(p)$ be a Bernoulli random variable with parameter $p$. Then the expectation of $X$ is given by

$$
\begin{align}
\exp \lsq X \rsq = \sum_{x \in X(\S)} x \cdot \P(X=x) = 1 \cdot p + 0 \cdot (1-p) = p
\end{align}
$$
```

```{prf:property} Variance of Bernoulli Distribution
:label: prop:bernoulli_var

Let $X \sim \bern(p)$ be a Bernoulli random variable with parameter $p$. Then the variance of $X$ is given by

$$
\begin{align}
\var \lsq X \rsq = \sum_{x \in X(\S)} (x - \exp \lsq X \rsq)^2 \cdot \P(X=x) = (1 - p)^2 \cdot p + (0 - p)^2 \cdot (1-p) = p(1-p)
\end{align}
$$

It can also be shown using the second moment of $X$:

$$
\begin{align}
\var \lsq X \rsq = \exp \lsq X^2 \rsq - \exp \lsq X \rsq^2 = \lsq 1^2 \cdot p + 0^2 \cdot (1-p) \rsq - p^2 = p - p^2 = p(1-p)
\end{align}
$$
```
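
The two properties above can be checked numerically. The following is a minimal sketch, assuming a simple simulation with Python's standard library; the helper name `bernoulli_samples`, the choice $p = 0.3$, the sample size and the seed are all illustrative assumptions, not part of the original setup.

```python
import random
from statistics import fmean, pvariance


def bernoulli_samples(p: float, n: int, seed: int = 0) -> list[int]:
    """Draw n Bernoulli(p) samples: 1 with probability p, otherwise 0."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]


p = 0.3  # illustrative parameter (assumption)
samples = bernoulli_samples(p, n=100_000)

sample_mean = fmean(samples)     # should be close to E[X] = p
sample_var = pvariance(samples)  # should be close to Var[X] = p(1-p)

print(f"theoretical mean = {p:.4f}, sample mean = {sample_mean:.4f}")
print(f"theoretical var  = {p * (1 - p):.4f}, sample var  = {sample_var:.4f}")
```

The sample statistics only approximate $p$ and $p(1-p)$; the gap shrinks as the number of draws grows.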


## Maximum Variance 

### Minimum and Maximum Variance of Coin Toss

This example is taken from {cite}`chan_2021`, page 140.

Consider a coin toss, following a Bernoulli distribution. Define $X \sim \bern(p)$.

If we toss the coin $n$ times, each toss has variance $p(1-p)$, so we ask which values of $p$ give the minimum and the maximum variance of the coin toss.

Recall from {prf:ref}`def_variance` that the variance measures how much the random variable deviates from its mean on average.

If the coin is biased at $p=1$, then the variance is $0$ because the coin always lands on heads. The
intuition is that the coin is "deterministic", and hence there is no variance at all. If the coin
is biased at $p=0.9$, then there is a little variance, $0.9 \cdot 0.1 = 0.09$, because the coin lands on heads
$90\%$ of the time. If the coin is biased at $p=0.5$, then the variance is $0.5 \cdot 0.5 = 0.25$, because the coin
is fair and has a 50-50 chance of landing on heads or tails. Though the coin is fair, the variance is at its maximum here.
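
The claim about the minimum and maximum can be sanity-checked by evaluating $p(1-p)$ on a grid. This is a minimal sketch; the function name `bernoulli_variance` and the grid resolution of $0.01$ are assumptions made for illustration.

```python
def bernoulli_variance(p: float) -> float:
    """Variance of a Bernoulli(p) random variable, p(1-p)."""
    return p * (1 - p)


# Evaluate the variance on an evenly spaced grid over [0, 1].
grid = [i / 100 for i in range(101)]
variances = {p: bernoulli_variance(p) for p in grid}

p_max = max(variances, key=variances.get)                # p with the largest variance
p_zero = [p for p, v in variances.items() if v == 0.0]   # p values with zero variance

print(f"variance is largest at p = {p_max}, value = {variances[p_max]}")
print(f"variance is zero at p in {p_zero}")
```

The grid search agrees with calculus: the derivative of $p(1-p)$ is $1 - 2p$, which vanishes at $p = 0.5$.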

## Important Topics

See {doc}`./03_discrete_random_variables/0307_discrete_uniform_distribution` for my whole setup.

## The Empirical vs Theoretical Distribution Setup

- https://inferentialthinking.com/chapters/intro.html
- See {doc}`./03_discrete_random_variables/0307_discrete_uniform_distribution` for my whole setup.

## The Problem

### The ticket model and what is a r.v

- https://stats.stackexchange.com/questions/50/what-is-meant-by-a-random-variable/54894#54894
- https://stats.stackexchange.com/questions/68599/distribution-of-correlation-coefficient-between-two-discrete-random-variables-an/68782#68782

- Take a true population, say the heights of all people in the world. To simplify the problem, imagine living in a world of people whose
height is a whole number ranging from 1-100 cm. I know it is absurd, but it is just for the sake of the example, since discrete numbers are easier to visualize.
Also secretly imagine the population is 1000 people with 10 people of each height (important here as we will see later).
- Imagine a ticket box called the "population box" which has a ticket for each person in the world.
    - Every person in the world writes their height on a ticket and puts it in the box.
- So note that we can think of the population box as our sample space, where each ticket is an outcome.
- Note that more than 1 ticket can carry the same height, so to be very precise, our population box is not a "set" of heights, and so not strictly a sample space.
- Now recall that a random variable is a function that maps outcomes to real numbers.
- In our case, if we treat $X: \S \to \R$ as a r.v., then the sample space $\S$ is all the heights from 1 to 100,
and our mapping to $\R$ is just the identity function, since the height of a person is already a real number (a random variable is a way to assign a numerical code to each possible outcome).
This aside, what is more important is that we say $X$ is a r.v. that represents the height of a person (when we pick 1 ticket).
The randomness comes from not knowing which ticket we will pick, as it could belong to any person with any height.
But once **it is picked**, the **realization** of the r.v. is the height of the person, say hongnan with height 175 cm.
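
The ticket model can be sketched in code. This is only an illustration under stated assumptions: the box holds 1000 tickets with 10 tickets per height from 1 to 100 cm, and outcomes are taken to be ticket ids rather than heights (so $X$ is a lookup instead of the identity map). The names `heights`, `sample_space` and `X`, and the seed, are hypothetical.

```python
import random

# Toy population box (assumption): heights 1..100 cm, 10 tickets per height.
heights = [h for h in range(1, 101) for _ in range(10)]

# Outcomes are ticket ids, so every outcome is distinct even though
# many tickets carry the same height.
sample_space = range(len(heights))  # ticket ids 0..999


def X(ticket_id: int) -> int:
    """Random variable: maps an outcome (ticket id) to the height written on it."""
    return heights[ticket_id]


rng = random.Random(42)
omega = rng.choice(sample_space)  # the random part: which ticket is drawn
x = X(omega)                      # the realization: an ordinary number

print(f"drew ticket {omega}, realized height {x} cm")
```

Note that `X` itself contains no randomness: it is an ordinary function, and all the randomness lives in the choice of `omega`.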

### Why data points are considered random variables

- See https://mathstat.slu.edu/~speegle/_book/SimulationRV.html
- In the ML context, the random variables $X_1, X_2, \ldots X_{???}$ are the data points (the heights of people), and the true population is the set of all possible data points (in our case it is actually 1000 people).
- I was confused because $X: \S \to \R$ is already a r.v. that represents the height of a person: if $100 \in \S$, then $X(100) = 100$ cm is the realized outcome.
  The mapping $X$ is well defined for every outcome in $\S$, so for each person in the world
  the single random variable $X$ can already represent the height of that person. So why do we need to index the data points?
- For example, the most cited definition is the iid assumption: random variables $X_1, X_2, \ldots X_{n}$ are called independent and identically distributed, or iid, if the variables are mutually independent and each
$X_i$ has the same probability distribution. Say $n=10$ people. It turns out we should think of it this way: in the true population box, all the tickets (heights) of the people
are **numbered**, and we treat each draw of a ticket as its own random variable, so the index $i$ is just a way to tell the draws apart. Remember that each $X_i$ becomes a deterministic value once it is realized.
- Furthermore, we usually take a random sample of size $n$ from the population box, and treat each draw as a r.v.
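
The indexing can be made concrete with a small sketch. It assumes the toy box of 1000 tickets (10 per height from 1 to 100 cm) and draws **with replacement**, which is what makes the draws independent and identically distributed; the helper name `draw_sample` and the seed are hypothetical.

```python
import random

# Toy population box (assumption): heights 1..100 cm, 10 tickets per height.
heights = [h for h in range(1, 101) for _ in range(10)]


def draw_sample(n: int, rng: random.Random) -> list[int]:
    """Draw n tickets with replacement.

    Position i of the result is the realization of the random variable
    for draw number i + 1; every draw sees the same box.
    """
    return [rng.choice(heights) for _ in range(n)]


rng = random.Random(0)
sample_a = draw_sample(10, rng)  # one realization of (X_1, ..., X_10)
sample_b = draw_sample(10, rng)  # another realization of the same random variables

print("first sample: ", sample_a)
print("second sample:", sample_b)
```

Each printed list is a vector of plain numbers. The random variables $X_1, \ldots, X_{10}$ are the draws themselves, before we look at the tickets.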

### Super important

The above rests on a hazy concept. On one hand, a sample space is a set, so its elements should be unique. On the other hand, if there are 1000 people
in the true population and we treat the population box as the sample space, then the outcomes are no longer "unique", since there are
only 100 distinct heights. However, if we number the tickets, then the outcomes are unique again: with a slight abuse of notation,
two different people with the same height of 175 cm are two distinct outcomes in the sample space. This is very important to realize!

### Empirical distribution/histogram

To make this more concrete, consider the same experiment as above and define the r.v. $X$ to be the height of a person.
We then want the probability of getting a person with height 175 cm, that is, $\P(X=175) = ?$.

To find this answer, we need the PMF. Note that the PMF is an idealized object: it is deterministic and depends only on the true population.

So we secretly know that the true PMF of the above distribution is simply

$$
\begin{align}
\P(X=x) = \frac{10}{1000} = \frac{1}{100} \quad \forall x \in \{1, \ldots, 100\}
\end{align}
$$

or, written out case by case,

$$
\begin{align}
\P(X=x) = \begin{cases}
\frac{1}{100} &\quad \text{ if } x=1 \\
\frac{1}{100} &\quad \text{ if } x=2 \\
\frac{1}{100} &\quad \text{ if } x=3 \\
\vdots \\
\frac{1}{100} &\quad \text{ if } x=100
\end{cases}
\end{align}
$$

since each of the 1000 people is equally likely to be picked and every height from 1-100 cm is equally represented. Note
that for each height, we have exactly 10 people.
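
The true PMF can be computed by counting tickets. This is a minimal sketch, assuming the toy box of 1000 tickets with 10 tickets per height; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

# Toy population box (assumption): heights 1..100 cm, 10 tickets per height.
heights = [h for h in range(1, 101) for _ in range(10)]

counts = Counter(heights)  # number of tickets per height
n_tickets = len(heights)   # size of the box, 1000

# True PMF: (tickets with height x) / (all tickets), kept as exact fractions.
pmf = {h: Fraction(c, n_tickets) for h, c in counts.items()}

print(pmf[50])            # 1/100
print(sum(pmf.values()))  # 1
```

Because the box is fixed, this computation gives the same answer every time. That is the sense in which the PMF is deterministic.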

Recall the example

$$
\P(A) = \dfrac{n(A)}{n(\S)}
$$

where our $\S$ is the 1000 people, so $n(\S) = 1000$.

Now we randomly pick 10 people from the true population and plot a histogram of their heights. Suppose the histogram
says 3 people have height 175 cm, 2 people have height 180 cm, and so on. This is the empirical distribution, and it is not the true PMF.
In our case it assigns a relative frequency of $\dfrac{3}{10}$ to a height of 175 cm, which is an empirical estimate of $\P(X=175)$. But note carefully
that this is an **empirical** distribution and is non-deterministic: a different sample of 10 people would generally give a different histogram.
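
The contrast between the empirical distribution and the true PMF can be sketched as follows. The sketch assumes the toy box of 1000 tickets (10 per height from 1 to 100 cm, so the true PMF is $\frac{1}{100}$ for every height) and draws with replacement; the helper name `empirical_pmf`, the seed and the sample sizes are hypothetical.

```python
import random
from collections import Counter

# Toy population box (assumption): heights 1..100 cm, 10 tickets per height.
heights = [h for h in range(1, 101) for _ in range(10)]


def empirical_pmf(sample: list[int]) -> dict[int, float]:
    """Relative frequency of each observed height in the sample."""
    counts = Counter(sample)
    n = len(sample)
    return {h: c / n for h, c in sorted(counts.items())}


rng = random.Random(1)

# Two small samples of size 10: the empirical distribution changes each time.
for trial in range(2):
    sample = rng.choices(heights, k=10)
    print(f"trial {trial}: {empirical_pmf(sample)}")

# A much larger sample: relative frequencies settle near the true value 1/100.
large_estimate = empirical_pmf(rng.choices(heights, k=100_000))
max_gap = max(abs(large_estimate.get(h, 0.0) - 0.01) for h in range(1, 101))
print(f"largest gap from the true PMF with 100000 draws: {max_gap:.4f}")
```

With only 10 draws most heights never appear at all, so the empirical distribution is a rough and sample-dependent stand-in for the true PMF.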