A breakfast cereal comes in boxes that contain coupons for a prize. Imagine that to win the prize you need to collect all $n$ different coupons.

* Each box contains one coupon
* When you buy the box, you don't know which coupon is inside
* Our model: each box independently contains one of the $n$ coupons, chosen uniformly at random

What is the expected number of boxes you need to buy to win?

**Important**: expectation is *linear*, regardless of independence.

Let $X_i$ be the number of boxes you need to buy to get the $i$th different coupon, counted from the moment you hold $i-1$ different coupons.

The total number of boxes is then

$$X_1+X_2+X_3+\cdots+X_n$$

The expectation of the sum is the sum of the expectations.
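Before doing any math, the model above can be simulated directly to get a feel for the answer. This is a minimal sketch using Python's standard `random` module; the function names `boxes_until_complete` and `average_boxes`, the seed, and the choice $n=10$ are assumptions made for this illustration.

```python
import random

def boxes_until_complete(n, rng):
    """Buy boxes until all n coupon kinds have been seen; return how many were bought."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # each box holds a uniformly random coupon
        boxes += 1
    return boxes

def average_boxes(n, trials, seed=0):
    """Average number of boxes over independent simulated collectors."""
    rng = random.Random(seed)
    return sum(boxes_until_complete(n, rng) for _ in range(trials)) / trials

# Example: estimate the expected number of boxes for n = 10 coupons.
print(average_boxes(10, 10_000))
```

The estimate is noisy, so it only suggests what the exact expectation should be.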

# Expectation

When we have $i-1$ different kinds of coupon already, then the probability the next box we buy has a **new** kind of coupon is $\frac{n-(i-1)}{n}$.

So $X_i$, the number of boxes we need to buy to get the $i$th coupon, is a geometric random variable with success probability $p=\frac{n-(i-1)}{n}$, and its expectation is $\frac{1}{p}=\frac{n}{n-(i-1)}$.

Let's look at the expectation of the **total** number of boxes bought

$$\frac{n}{n}+\frac{n}{n-1}+\frac{n}{n-2}+\frac{n}{n-3}+\cdots+\frac{n}{1}$$

The numerator is always $n$, which we can extract as a **common factor**. The other factor is $H_n=1+\frac{1}{2}+\frac{1}{3}+\frac{1}{4}+\cdots+\frac{1}{n}$, which is called the $n$th **harmonic number** (a partial sum of the harmonic series). So the expected total is $nH_n$.
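As a small worked example, take $n=4$ (an arbitrary choice for illustration):

$$\frac{4}{4}+\frac{4}{3}+\frac{4}{2}+\frac{4}{1}=4\left(1+\frac{1}{2}+\frac{1}{3}+\frac{1}{4}\right)=4\cdot\frac{25}{12}=\frac{25}{3}\approx 8.33$$

So collecting four different coupons takes a little over eight boxes on average.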

In fact, $H_n$ is very close to $\ln{n}$.

* Comparing $H_n$ with the integral of $\frac{1}{x}$ from $1$ to $n$ (and to $n+1$), we can show that $H_n$ lies between $\ln{n}$ and $(\ln{n})+1$
* So the expected number of boxes lies between $n\ln{n}$ and $n(\ln{n})+n$
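The bounds above are easy to check numerically. The following sketch computes $H_n$ by direct summation and prints it next to $\ln n$ and $\ln n+1$; the helper names `harmonic` and `expected_boxes` and the sample values of $n$ are assumptions for this example.

```python
import math

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, computed by direct summation."""
    return sum(1.0 / k for k in range(1, n + 1))

def expected_boxes(n):
    """Expected number of boxes to collect all n coupons, n * H_n."""
    return n * harmonic(n)

for n in (10, 100, 1000):
    h = harmonic(n)
    # Columns: n, lower bound ln n, H_n, upper bound ln n + 1, expected boxes
    print(n, round(math.log(n), 3), round(h, 3), round(math.log(n) + 1, 3),
          round(expected_boxes(n), 1))
```

In each row the middle value should sit between its two neighbours.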

# Probability bounds

The average number of boxes purchased is $nH_n$.

Straight away, we can say that the probability we buy at least $2nH_n$ boxes is at most $\frac{1}{2}$.

Why? A very useful result called **Markov's Inequality**.

If the mean of a non-negative random variable $X$ is $\mu$, then for every $\alpha>0$,

$$Pr[X\ge\alpha\mu]\le\frac{1}{\alpha}$$

Here is the **proof**:

Fix $a>0$ and consider the step function $f(x)=\begin{cases}0\quad &x<a\\1\quad &x\ge a\end{cases}$. If we multiply this function by $a$, we get $af(x)=\begin{cases}0\quad &x<a\\a\quad &x\ge a\end{cases}$

Since $X$ is non-negative, we know that $af(x)\le x$ for every possible value $x\ge0$ of $X$. Taking the expectation of both sides of this inequality gives $aE[f(X)]\le E[X]=\mu$.

For the step function, $E[f(X)]=Pr[X\ge a]$, therefore we have $Pr[X\ge a]\le\frac{\mu}{a}$.

If we substitute $a=\alpha\mu$, we get $Pr[X\ge\alpha\mu]\le\frac{1}{\alpha}$.

End of Proof!
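To see how loose Markov's Inequality is for this problem, we can compute the tail probability exactly for a small $n$ and compare it with $\frac{1}{\alpha}$. The sketch below tracks the distribution of the number of distinct coupons held after each box. The function names and the choice $n=20$ are assumptions made for illustration.

```python
import math

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def tail_probability(n, m):
    """Exact Pr[T >= m], where T is the number of boxes needed to see all n coupons.

    T >= m holds exactly when the first m - 1 boxes do not yet contain every
    coupon. dist[k] tracks the probability of holding k distinct coupons.
    """
    dist = [0.0] * (n + 1)
    dist[0] = 1.0
    for _ in range(m - 1):
        new = [0.0] * (n + 1)
        for k, pr in enumerate(dist):
            new[k] += pr * k / n                # the box repeats a coupon we hold
            if k < n:
                new[k + 1] += pr * (n - k) / n  # the box holds a new coupon
        dist = new
    return 1.0 - dist[n]

n = 20  # arbitrary small example
mean = n * harmonic(n)
for alpha in (1.5, 2, 3):
    m = math.ceil(alpha * mean)  # T is an integer, so T >= alpha*mean means T >= m
    print(alpha, tail_probability(n, m), 1 / alpha)
```

For these values the exact tail comes out far below $\frac{1}{\alpha}$: Markov's Inequality uses only the mean, so it cannot see how concentrated the distribution really is.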

# Variance

Let's calculate the variance of the sum $X_1+X_2+X_3+\cdots+X_n$

* Since the $X_i$ are independent, we can sum the variances
* If the probability of success is $p$, the variance of a geometric random variable is $\frac{1-p}{p^2}$, which is at most $\frac{1}{p^2}$

**Proof**:

By definition:

$\begin{align}Var[X]&=E[(X-\mu)^2]\\
&=E[(X-E[X])^2]\\
&=E[X^2-2XE[X]+(E[X])^2]\\
&=E[X^2]-2E[X]E[X]+(E[X])^2\\
&=E[X^2]-(E[X])^2
\end{align}
$

<table width="60%">
    <tr><th width="16%">$X$</th><td width="16%">$1$</td><td width="16%">$2$</td><td width="16%">$3$</td><td width="16%">$\cdots$</td><td width="16%">$n$</td></tr>
    <tr><th>$X^2$</th><td>$1$</td><td>$4$</td><td>$9$</td><td>$\cdots$</td><td>$n^2$</td></tr>
    <tr><th>Probability</th><td>$p$</td><td>$p(1-p)$</td><td>$p(1-p)^2$</td><td>$\cdots$</td><td>$p(1-p)^{n-1}$</td></tr>
</table>

Let $q=1-p$. Summing over all possible values of $X$ (there is no largest one), we get

$\begin{align}E[X^2]&=p+4pq+9pq^2+16pq^3+\cdots\\
&=p(1+4q+9q^2+16q^3+\cdots)\\
&=p((1+q+q^2+\cdots)+3q+8q^2+15q^3+\cdots)\\
&=p((1+q+q^2+\cdots)+3q(1+q+q^2+\cdots)+5q^2+12q^3+\cdots)\\
&=\cdots\\
&=p(1+q+q^2+\cdots)(1+3q+5q^2+7q^3+\cdots)\tag{1}
\end{align}$

The second factor of formula $(1)$ is a geometric series, $1+q+q^2+\cdots=\frac{1}{1-q}$ since $0\le q<1$. The third factor splits the same way, $1+3q+5q^2+\cdots=(1+q+q^2+\cdots)(1+2q+2q^2+\cdots)=\frac{1}{1-q}\cdot\frac{1+q}{1-q}$, so it equals $\frac{1+q}{(1-q)^2}$.

Then, $$E[X^2]=p\frac{1}{1-q}\frac{1+q}{(1-q)^2}=p\frac{2-p}{p^3}=\frac{2-p}{p^2}$$

Therefore, $$Var[X]=E[X^2]-(E[X])^2=\frac{2-p}{p^2}-(\frac{1}{p})^2=\frac{1-p}{p^2}$$

End of Proof!
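The closed forms in this proof can be sanity-checked by summing the series numerically. This is a rough sketch: the function name `geometric_moments`, the truncation at 10,000 terms, and the sample values of $p$ are assumptions, and the truncation is only harmless because the terms shrink geometrically.

```python
def geometric_moments(p, terms=10_000):
    """Approximate E[X] and E[X^2] when Pr[X = k] = p * (1 - p)**(k - 1), k = 1, 2, ..."""
    q = 1.0 - p
    mean = 0.0
    second = 0.0
    weight = p  # Pr[X = 1]
    for k in range(1, terms + 1):
        mean += k * weight
        second += k * k * weight
        weight *= q  # Pr[X = k + 1] = Pr[X = k] * q
    return mean, second

for p in (0.5, 0.2, 0.05):
    mean, second = geometric_moments(p)
    variance = second - mean ** 2
    # Columns: p, numerical variance, closed form (1 - p) / p^2
    print(p, variance, (1 - p) / p ** 2)
```

The two variance columns should agree up to floating-point error.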

Since $p=\frac{n-(i-1)}{n}$, the variance in the number of boxes to find the $i$th coupon is at most $(\frac{n}{n-(i-1)})^2$.

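Summing over all $i$, and writing $X=X_1+X_2+\cdots+X_n$ for the total number of boxes, this time we have $$Var[X]\le n^2(1+\frac{1}{2^2}+\frac{1}{3^2}+\frac{1}{4^2}+\cdots+\frac{1}{n^2})<\frac{\pi^2}{6}n^2$$ because the infinite sum $1+\frac{1}{2^2}+\frac{1}{3^2}+\cdots$ equals $\frac{\pi^2}{6}$.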
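To see how much is lost by bounding each variance by $\frac{1}{p^2}$, the sketch below compares the exact sum of the geometric variances $\frac{1-p}{p^2}$ with the two simpler expressions. The function names `exact_variance` and `variance_bound` and the sample values of $n$ are assumptions made for this example.

```python
import math

def exact_variance(n):
    """Sum of (1 - p_i) / p_i**2 with p_i = (n - (i - 1)) / n, for i = 1..n."""
    total = 0.0
    for i in range(1, n + 1):
        p = (n - (i - 1)) / n
        total += (1 - p) / p ** 2
    return total

def variance_bound(n):
    """The looser expression n**2 * (1 + 1/2**2 + ... + 1/n**2)."""
    return n * n * sum(1.0 / k ** 2 for k in range(1, n + 1))

for n in (10, 100, 1000):
    # Columns: n, exact variance, finite-sum bound, (pi^2 / 6) * n^2
    print(n, exact_variance(n), variance_bound(n), math.pi ** 2 / 6 * n * n)
```

The values in each row should increase from left to right, and all three grow like $n^2$.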

Now we try something called **Chebyshev's Inequality**, which isn't too hard to derive from Markov's Inequality. It says for all $a>0$:

$$Pr[\big|X-E[X]\big|\ge a]\le\frac{Var[X]}{a^2}$$

**Proof**:

Markov's Inequality says $Pr[Y\ge b]\le\frac{E[Y]}{b}$ for every non-negative random variable $Y$ and every $b>0$. We apply it to $Y=|X-\mu|^2$ with $b=a^2$:

$Pr[|X-\mu|^2\ge a^2]\le\frac{E[|X-\mu|^2]}{a^2}=\frac{Var[X]}{a^2}$, and since $a>0$, $Pr[|X-\mu|^2\ge a^2]=Pr[|X-\mu|\ge a]$.

Therefore, we have $$Pr[|X-\mu|\ge a]\le\frac{Var[X]}{a^2}$$

End of Proof.

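Therefore, taking $a=nH_n$, $Pr[|X-nH_n|\ge nH_n]\le\frac{n^2\pi^2/6}{(nH_n)^2}$, which is $O(1/\log^2n)$. Since $X$ is always positive, the event $|X-nH_n|\ge nH_n$ can only happen when $X\ge 2nH_n$, so this bounds the probability that we bought at least $2nH_n$ boxes, the same probability that Markov's Inequality bounded by $\frac{1}{2}$.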
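The $n^2$ cancels in this bound, leaving $\frac{\pi^2/6}{H_n^2}$, which is easy to tabulate. The sketch below prints it for a few values of $n$; the function name `chebyshev_bound` and the chosen values of $n$ are assumptions made for illustration.

```python
import math

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def chebyshev_bound(n):
    """(pi^2 / 6) * n^2 / (n * H_n)^2, the bound on Pr[|X - n H_n| >= n H_n]."""
    return (math.pi ** 2 / 6) / harmonic(n) ** 2

for n in (10, 100, 10_000, 1_000_000):
    print(n, chebyshev_bound(n))
```

The bound is already below $\frac{1}{2}$ at $n=10$ and keeps shrinking, but only slowly, because $H_n$ grows logarithmically.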

# Another approach

Let's focus on a particular coupon. The probability that we didn't find it after we bought $n(\ln n+c)$ boxes is $(1-\frac{1}{n})^{n(\ln n+c)}$

Remember the $1-x\le e^{-x}$ inequality?

Applying it with $x=\frac{1}{n}$, we get a bound of $(1-\frac{1}{n})^{n(\ln n+c)}\le e^{-(\ln n+c)}=\frac{1}{e^cn}$

Now, there are $n$ coupons, so by the union bound the probability that after we bought $n(\ln n+c)$ boxes, at least one wasn't found is at most $\frac{1}{e^cn}\cdot n=\frac{1}{e^c}$

If $c=\ln n$, the probability that (to find all $n$ kinds of coupon) we bought more than $n(\ln n+\ln n)=2n\ln n$ boxes is at most $\frac{1}{n}$.
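This bound can be compared with a simulation. The sketch below buys a fixed number of boxes and checks whether any coupon is still missing. It rounds $n(\ln n+c)$ up to a whole number of boxes, which is an assumption of this example, as are the function name `miss_fraction`, the seed, and the choices $n=50$ and 2000 trials.

```python
import math
import random

def miss_fraction(n, c, trials, seed=0):
    """Fraction of simulated runs in which at least one of the n coupons is
    still missing after ceil(n * (ln n + c)) boxes."""
    rng = random.Random(seed)
    m = math.ceil(n * (math.log(n) + c))  # rounded up to a whole number of boxes
    misses = 0
    for _ in range(trials):
        seen = {rng.randrange(n) for _ in range(m)}  # coupons found in m boxes
        if len(seen) < n:
            misses += 1
    return misses / trials

n = 50  # arbitrary example size
for c in (0.0, 1.0, 2.0):
    # Columns: c, simulated miss fraction, bound 1 / e^c
    print(c, miss_fraction(n, c, trials=2000), math.exp(-c))
```

The simulated fraction is a noisy estimate, so it can sit very close to the bound when the bound is nearly tight.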