# Exercise 1
## Bias of estimators: analytical derivation, Monte Carlo simulation and bootstrap simulation
### G. Bontempi


## Question

Let us consider a normal r.v. ${\mathbf z}$ such that $E[{\mathbf z}]=\mu$ and $\text{Var}[{\mathbf z}]=\sigma^2$.
Suppose we want to estimate from an i.i.d. dataset $D_N=\{z_1,\dots,z_N\}$ the parameter $\theta=\mu^2=(E[\mathbf{z}])^2$.
Let us consider three estimators:

$$\hat{\theta}_1=\left(\frac{\sum_{i=1}^N{z_i}}{N} \right)^2$$

$$\hat{\theta}_2=\frac{\sum_{i=1}^N{z_i^2}}{N} $$

$$\hat{\theta}_3=\frac{ (\sum_{i=1}^N{z_i})^2}{N} $$


* Are they unbiased? 
* Compute analytically the bias.
* Verify the result by Monte Carlo simulation for different values of $N$.
* Estimate the bias by bootstrap.
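
As a minimal sketch, the three estimators can be written as small NumPy functions. The names `theta_hat_1`, `theta_hat_2`, `theta_hat_3` and the toy dataset `z_toy` are illustrative assumptions, not part of the exercise.

```python
import numpy as np


def theta_hat_1(z):
    """Square of the sample mean."""
    z = np.asarray(z, dtype=float)
    return np.mean(z) ** 2


def theta_hat_2(z):
    """Mean of the squared observations."""
    z = np.asarray(z, dtype=float)
    return np.mean(z ** 2)


def theta_hat_3(z):
    """Square of the sum of the observations, divided by N."""
    z = np.asarray(z, dtype=float)
    return np.sum(z) ** 2 / z.size


# Toy dataset (hypothetical, only to show the three formulas side by side)
z_toy = [1.0, 2.0, 3.0]
print(theta_hat_1(z_toy), theta_hat_2(z_toy), theta_hat_3(z_toy))
```

On any dataset the third estimator equals $N$ times the first, which is the relation used in the analytical derivation.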

## Analytical derivation of bias

### 1st estimator
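Since the samples are i.i.d., $\text{Cov}[{\mathbf z}_i, {\mathbf z}_j]=0$ for $i \neq j$, hence $E[{\mathbf z}_i {\mathbf z}_j]=\mu^2$. Moreover $E[{\mathbf z}^2]=\mu^2+\sigma^2$ and there are $N(N-1)/2$ pairs with $i<j$, so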

$$\begin{aligned}
E[\hat{\mathbf{\theta}}_1]&=\frac{1}{N^2} E\left[\left(\sum_{i=1}^N{\mathbf{z}_i}\right)^2\right]=
\frac{1}{N^2} E\left[\sum_{i=1}^N{\mathbf{z}_i^2} + 2 \sum_{i < j}{\mathbf{z}_i \mathbf{z}_j}\right]\\
&=\frac{1}{N^2} (N\mu^2 +N \sigma^2 + N(N-1) \mu^2)= \mu^2+\frac{\sigma^2}{N}
\end{aligned}$$

then the bias of the first estimator is $B_1=E[\hat{\mathbf{\theta}}_1]-\mu^2=\frac{\sigma^2}{N}$.

### 2nd estimator

$$E[\hat{\mathbf{\theta}}_2]=\frac{1}{N} E\left[\left(\sum_{i=1}^N{\mathbf{z}_i^2}\right)\right]=
\frac{N\mu^2 +N \sigma^2}{N}=\mu^2 +\sigma^2$$
then the bias of the second estimator is $B_2=E[\hat{\mathbf{\theta}}_2]-\mu^2=\sigma^2$.

### 3rd estimator

$$E[\hat{\theta}_3]=N E[\hat{\theta}_1]=N \mu^2+\sigma^2$$
then the bias of the third estimator is $B_3=E[\hat{\theta}_3]-\mu^2=(N-1) \mu^2+\sigma^2$.

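All three estimators are therefore biased. Only the bias of the first one vanishes as $N \to \infty$: $B_2$ does not depend on $N$, and $B_3$ grows with $N$ whenever $\mu \neq 0$.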

## Random variable distribution

In [6]:
import numpy as np

muz = 2
sdz = 1

N = 100  # number of samples

# Analytical results
anB1 = sdz**2 / N
anB2 = sdz**2
anB3 = sdz**2 + (N - 1) * muz**2
print('anB1=', anB1, 'anB2=', anB2, 'anB3=', anB3)

anB1= 0.01 anB2= 1 anB3= 397


## Monte Carlo simulation

To simulate the generation of samples we need to fix the distribution of ${\mathbf z}$.
As stated in the question, we assume here that ${\mathbf z} \sim N(\mu,\sigma^2)$.

In [7]:
import numpy as np

S = 10000  # number of Monte Carlo trials

muhat2_1 = []
muhat2_2 = []
muhat2_3 = []

for s in range(S):
    DN = np.random.normal(muz, scale=sdz, size=N)
    muhat2_1.append(np.mean(DN)**2)
    muhat2_2.append(np.sum(DN**2) / N)
    muhat2_3.append(np.sum(DN)**2 / N)

mcB1 = np.mean(muhat2_1) - muz**2
mcB2 = np.mean(muhat2_2) - muz**2
mcB3 = np.mean(muhat2_3) - muz**2


print('mCB1=', mcB1, 'mCB2=', mcB2, 'mCB3=', mcB3)

mCB1= 0.009675303816949032 mCB2= 1.00082019102169 mCB3= 396.9675303816948
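
The question asks to verify the result for different values of $N$. A minimal vectorized sketch of that check follows. The helper names `mc_bias` and `analytical_bias`, the seed, the number of trials and the grid of $N$ values are assumptions chosen for illustration.

```python
import numpy as np


def analytical_bias(mu, sigma, N):
    """Biases B1, B2, B3 from the analytical derivation."""
    return sigma**2 / N, sigma**2, (N - 1) * mu**2 + sigma**2


def mc_bias(mu, sigma, N, S=20000, seed=0):
    """Monte Carlo estimate of the bias of the three estimators of mu^2."""
    rng = np.random.default_rng(seed)
    D = rng.normal(mu, sigma, size=(S, N))  # one simulated dataset per row
    th1 = D.mean(axis=1) ** 2               # squared sample mean
    th2 = (D ** 2).mean(axis=1)             # mean of squares
    th3 = D.sum(axis=1) ** 2 / N            # squared sum divided by N
    theta = mu**2
    return th1.mean() - theta, th2.mean() - theta, th3.mean() - theta


for N in (10, 50, 100):  # assumed grid of sample sizes
    print(N, analytical_bias(2, 1, N), mc_bias(2, 1, N))
```

The Monte Carlo values should approach the analytical ones as `S` grows. The third estimator has a much larger variance than the other two, so its estimate is the noisiest in absolute terms.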


## Bootstrap estimation

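Let us first note that only the first estimator is a plug-in estimator of $(E[\mathbf{z}])^2$.
Its value $\hat{\theta}_1$ on the original dataset is therefore the reference to subtract when estimating the bias
$$
\text{Bias}_{bs}=\frac{\sum_{b=1}^B \theta_{(b)}}{B}-\hat{\theta}_1
$$
of each of the three estimators, where $\theta_{(b)}$ denotes the estimator under study computed on the $b$-th bootstrap sample.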

In [9]:
B = 10000
N = len(DN)  # DN is the last dataset drawn in the Monte Carlo loop above

muhat2_1 = np.mean(DN)**2  # plug-in estimator
muhat2_2 = np.sum(DN**2) / N
muhat2_3 = np.sum(DN)**2 / N

muhatb = []
muhatb2 = []
muhatb3 = []

for b in range(B):
    Ib = np.random.choice(N, size=N, replace=True)
    Db = DN[Ib]
    muhatb.append(np.mean(Db)**2)
    muhatb2.append(np.sum(Db**2) / N)
    muhatb3.append(np.sum(Db)**2 / N)

bsB1 = np.mean(muhatb) - muhat2_1
bsB2 = np.mean(muhatb2) - muhat2_1
bsB3 = np.mean(muhatb3) - muhat2_1
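
The same procedure can be wrapped in a reusable function, sketched below under some assumptions: the function name `bootstrap_bias`, its signature, the seeds and the simulated `sample` are hypothetical and only illustrate the formula $\text{Bias}_{bs}$ with the plug-in estimator as reference.

```python
import numpy as np


def bootstrap_bias(data, estimator, plug_in, B=10000, seed=0):
    """Bootstrap bias estimate: mean of bootstrap replicates minus plug-in value."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    N = data.size
    # Each row holds the indices of one bootstrap sample (drawn with replacement)
    idx = rng.integers(0, N, size=(B, N))
    replicates = np.array([estimator(row) for row in data[idx]])
    return replicates.mean() - plug_in(data)


# Hypothetical dataset, generated only for this example
sample = np.random.default_rng(1).normal(2, 1, size=100)


def plug_in(z):
    """Plug-in estimator of mu^2 (first estimator)."""
    return np.mean(z) ** 2


b1 = bootstrap_bias(sample, plug_in, plug_in)
b2 = bootstrap_bias(sample, lambda z: np.mean(z ** 2), plug_in)
b3 = bootstrap_bias(sample, lambda z: np.sum(z) ** 2 / z.size, plug_in)
print(b1, b2, b3)
```

Passing the estimator as a function keeps the resampling logic in one place, at the cost of a Python-level loop over the `B` replicates.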


## Final check

In [10]:

print(f"anB1= {anB1} mcB1= {mcB1} bsB1= {bsB1}")
print(f"anB2= {anB2} mcB2= {mcB2} bsB2= {bsB2}")
print(f"anB3= {anB3} mcB3= {mcB3} bsB3= {bsB3}")

anB1= 0.01 mcB1= 0.009675303816949032 bsB1= 0.013557959485252624
anB2= 1 mcB2= 1.00082019102169 bsB2= 0.9999597707644239
anB3= 397 mcB3= 396.9675303816948 bsB3= 368.94131554490855
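
The bootstrap values need not coincide with the analytical ones. The bootstrap treats the empirical distribution of $D_N$ as the true one, so for $B \to \infty$ the bootstrap biases follow from the analytical formulas with $\mu$ replaced by the sample mean $\bar{z}$ and $\sigma^2$ replaced by $\hat{\sigma}^2=\frac{1}{N}\sum_{i=1}^N (z_i-\bar{z})^2$. A plausible reading of the gap between `bsB3` and `anB3` is that the term $(N-1)\bar{z}^2$ amplifies the deviation of $\bar{z}$ from $\mu$. The sketch below computes these limiting values; the function name `ideal_bootstrap_bias` and the toy data are assumptions.

```python
import numpy as np


def ideal_bootstrap_bias(data):
    """Limiting (B -> infinity) bootstrap biases of the three estimators."""
    data = np.asarray(data, dtype=float)
    N = data.size
    zbar = data.mean()
    s2 = data.var()  # ddof=0: variance of the empirical distribution
    return s2 / N, s2, (N - 1) * zbar**2 + s2


# Toy dataset (hypothetical); in the notebook one would pass DN instead
print(ideal_bootstrap_bias([1.0, 2.0, 3.0]))
```

Comparing these values with `bsB1`, `bsB2` and `bsB3` separates the error due to the finite number of bootstrap replicates from the error due to the particular dataset.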


Repeat the experiment for different values of $\mu$, $\sigma^2$, $N$, $B$ and $S$.