<center><img src='https://drive.google.com/uc?export=view&id=12CrUdXDAiltLBT26sG7HZ_HciIhvGyT8'></center>

# Introduction to Machine Learning - Notebook 2, version for students
**Author: Michał Ciach**  
Exercises denoted with a star \* are optional. They may be more difficult or time-consuming.  


In today's class, we will learn some aspects of parameter and interval estimation.   
In the first section, we will focus on estimating the mean value of a distribution using a statistical sample.   
In the next section, we'll calculate confidence intervals for the mean.    
First, however, we'll talk a bit about how a statistician understands the data.  


***Theory time.*** Suppose you throw a fair dice. When the dice is still in your hand, the number of dots is a *random variable* - we don't know its value, and we can only say that it can take integer values from 1 to 6 with equal probability. The probability assigned to each outcome of a discrete random variable is called its *distribution*. Let's denote this random variable as $X$.    

When you throw the dice, the number of dots is no longer random. By throwing the dice, you've obtained a *realization* of the random variable, which is a number $X(\omega)$. In general, when we perform a "random experiment", such as throwing a set of $n$ dice, we fix $\omega$ and our random variables $X_1, \dots, X_n$ become numbers $X_1(\omega), \dots, X_n(\omega)$. Note that we use a single $\omega$ that realizes multiple random variables, rather than multiple $\omega$'s that realize a single random variable - that way it's easier to analyze. The $\omega$ is the experiment, and $X_i$ are the measured values. You don't need to care too much about $\omega$ in this course - this is the last time we'll see it. What matters is that, in statistics, all the data is interpreted as a realization of a sequence of random variables.   

In practice, nobody writes $X_1(\omega)$ or talks about *realizations* - we simply write $X_1$ and talk about random variables, because it's quite obvious from context whether we talk about variables or about numbers.  

Probability theory, which you've learned in the previous semester, deals with analyzing the properties of random variables before they're realized. For this, we need to assume a fixed distribution of the random variable - for example, a uniform distribution over natural numbers from 1 to 6, or, in other applications, a normal (Gaussian) distribution. Then, we can ask "what values are likely to be observed?", or "what is the average value of infinitely many observations?"   

In statistics, we're interested in exactly the reverse question: given a finite set of observations, i.e. a realization of a sequence of random variables, what can we say about their distribution? We usually assume that all the random variables are independent and identically distributed ("i.i.d." for short). Furthermore, we usually don't assume that they can come from any probability distribution - instead, we assume that they come from some *parametric family* of distributions, i.e. ones that can be described with a set of parameters. The distribution of the number of dots on a dice is an example of such a distribution - it can be described with five parameters (why not six?).

In the next sections, we'll see how it works in practice.



## Data & library imports

In [1]:
#!pip install gdown

In [2]:
!wget -O 2.protein_lengths.tsv https://drive.google.com/uc?id=1cWYjRIGdLZG39j5FmZKgbYaOeWFTLHrD

--2024-05-04 15:01:03--  https://drive.google.com/uc?id=1cWYjRIGdLZG39j5FmZKgbYaOeWFTLHrD
Resolving drive.google.com (drive.google.com)... 142.250.101.138, 142.250.101.102, 142.250.101.101, ...
Connecting to drive.google.com (drive.google.com)|142.250.101.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1cWYjRIGdLZG39j5FmZKgbYaOeWFTLHrD [following]
--2024-05-04 15:01:03--  https://drive.usercontent.google.com/download?id=1cWYjRIGdLZG39j5FmZKgbYaOeWFTLHrD
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 74.125.137.132, 2607:f8b0:4023:c03::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|74.125.137.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29258442 (28M) [application/octet-stream]
Saving to: ‘2.protein_lengths.tsv’


2024-05-04 15:01:07 (36.8 MB/s) - ‘2.protein_lengths.tsv’ saved [29258442/29258442]



In [3]:
import pandas as pd
import numpy as np
import numpy.random as rd
import plotly.express as px
from scipy.stats import norm, gmean
from scipy.stats import t as tstud
import math
import plotly.graph_objects as go
import scipy

In [4]:
protein_lengths = pd.read_csv('2.protein_lengths.tsv', sep='\t')
protein_lengths

Unnamed: 0,Scientific name,Common name,Protein ID,Protein length
0,Homo sapiens,Human,NP_000005.3,1474
1,Homo sapiens,Human,NP_000006.2,290
2,Homo sapiens,Human,NP_000007.1,421
3,Homo sapiens,Human,NP_000008.1,412
4,Homo sapiens,Human,NP_000009.1,655
...,...,...,...,...
648731,Imleria badia,Bay bolete (mushroom),KAF8560453.1,494
648732,Imleria badia,Bay bolete (mushroom),KAF8560454.1,737
648733,Imleria badia,Bay bolete (mushroom),KAF8560455.1,554
648734,Imleria badia,Bay bolete (mushroom),KAF8560456.1,813


## Exploratory analysis
The first step to any statistical analysis is to explore the data - check the basic statistics like the mean and variance and visualize the data to see what kind of distribution we're dealing with.

**Exercise 1.** In this exercise, we'll extract the data about human proteins, perform a simple data transformation, and do a basic exploratory analysis.

1. Select the data about human protein lengths from the `protein_lengths` data frame, and put it into a data frame `human_protein_lengths`. Here, you may need to use the `.copy()` method for the subsequent steps to work (ask your tutor if you need a further explanation).
2. Calculate the base-10 logarithm of the protein length and append it to the `human_protein_lengths` data frame as a column called `LogLength`.
3. Use the `human_protein_lengths.describe()` method to check the basic statistics of the numerical columns of the data frame.   
  3.1. What is the average length of a human protein?  
  3.2. What is the maximum length?  
4. Draw histograms that show the distributions of the protein lengths and log-lengths. I recommend using the `px.histogram()` function for this.  
  4.1. Which distribution is more spread around its average?   
  4.2. Does any distribution resemble the Normal (Gaussian) distribution?   
  4.3. Are there many proteins with lengths similar to the maximum length, or just a few?  
5. Calculate the average length and log-length and their standard deviations, and store them in variables `true_mean`, `true_mean_log`, `true_std` and `true_std_log`. We'll use them in subsequent exercises as our *ground truth* against which we'll evaluate our estimators.     

*Quick question*. Is $\text{true_mean}$ equal to $10^\text{true_mean_log}$? Why/why not? (note that we've used the base-10 logarithm)

*Why the base-10 logarithm?* In statistical data analysis, the base-10 logarithms are sometimes a convenient choice, because their values can be interpreted as orders of magnitude (or simply the numbers of digits) of our values. Although logarithms with different bases are mostly equivalent mathematically, an easier interpretation is important to get more meaningful conclusions from the data.

*Why the standard deviation?* Some students may wonder why we prefer the standard deviation rather than the variance - after all, the difference is just a square root, so mathematically it's almost the same thing. Here, again, the reason is the interpretability of the results. When we compute the variance, we square the observations. As a consequence, their units also get squared. This means that, if we estimate the number of mushrooms in a forest, the variance is expressed in terms of *mushrooms squared*, which doesn't make any sense. Taking a square root brings the unit back to mushrooms.   

In [5]:
human_protein_lengths = protein_lengths[protein_lengths["Common name"] == "Human"].copy()
human_protein_lengths["LogLength"] = np.log10(human_protein_lengths["Protein length"])
human_protein_lengths.describe()

Unnamed: 0,Protein length,LogLength
count,136193.0,136193.0
mean,692.655775,2.71154
std,746.993628,0.329892
min,12.0,1.079181
25%,316.0,2.499687
50%,514.0,2.710963
75%,842.0,2.925312
max,35991.0,4.556194


In [6]:
# Names chosen so that the built-in len() is not shadowed
len_hist = px.histogram(human_protein_lengths,
             x="Protein length",
             title="Distribution of the human protein lengths")

log_len_hist = px.histogram(human_protein_lengths,
             x="LogLength",
             title="Distribution of the human protein log-lengths")

len_hist.show()
log_len_hist.show()

Output hidden; open in https://colab.research.google.com to view.

In [7]:
true_mean, true_mean_log = human_protein_lengths["Protein length"].mean(), human_protein_lengths["LogLength"].mean()
true_std, true_std_log = human_protein_lengths["Protein length"].std(), human_protein_lengths["LogLength"].std()

In [8]:
true_mean, 10**(true_mean_log), gmean(human_protein_lengths["Protein length"]) # Arithmetic and geometric mean

(692.6557752601088, 514.6827764147753, 514.6827764147753)
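
*A short derivation.* The last two numbers in the output above agree, and this is not a coincidence. As a sketch, for a generic sample $x_1, \dots, x_n$ of positive numbers (notation introduced only for this note):

$$10^{\frac{1}{n}\sum_{i=1}^n \log_{10} x_i} = \left(\prod_{i=1}^n 10^{\log_{10} x_i}\right)^{1/n} = \left(\prod_{i=1}^n x_i\right)^{1/n}$$

The right-hand side is the geometric mean, which is what `gmean` computes. By the inequality between the arithmetic and geometric means, it never exceeds $\frac{1}{n}\sum_{i=1}^n x_i$, which is consistent with the first number in the output being the largest.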

## Point estimation

One of the main strengths of statistical theory is that it allows us to estimate many quantities (like the mean protein length, mean income of cities in some country, or voting preferences) using only a sample of randomly selected observations, and, most importantly, to estimate the uncertainty of such estimation. This is the main reason why we derive properties of estimators, such as their expected value and variance. Good statisticians can derive estimators which need fewer observations and give better results.   

First, let's talk about what it means to estimate from a statistician's point of view.  

*Theory time.* When we throw a single dice many times, we get a sequence of i.i.d. observations $X_1, \dots, X_n$ (I've omitted $\omega$ here, because I hope you already understand that $X_i$'s become numbers when we actually throw the dice). We can now calculate the proportion of throws which resulted in six dots - this is a simple estimation of the quantity $\mathbb{P}(X_i=6)$. Mathematically, we would write it as follows. Let $[X_i = 6]$ denote a function that is equal to $1$ if $X_i=6$ and $0$ otherwise (it's a random variable - we don't know its value before we throw the dice, but we know it once the dice is thrown). Now, we can define an *estimator*: $\frac{1}{n}\sum_{i=1}^n [X_i = 6]$. This is simply the proportion of throws that resulted in six dots. Finally, we can write that $\hat{\mathbb{P}}(X_i = 6) = \frac{1}{n}\sum_{i=1}^n [X_i = 6]$; a hat over $\mathbb{P}(X_i=6)$ means that this is an estimator of the probability $\mathbb{P}(X_i=6)$.  

Just to make it even more clear: when we throw the dice, we get a sequence of numbers $X_i(\omega)$ for $i=1,2, \dots, n$, and when we actually calculate the proportion, what we really calculate is $\hat{\mathbb{P}}(X_i = 6)(\omega) = \frac{1}{n}\sum_{i=1}^n [X_i(\omega) = 6]$. As I've said, in practice nobody writes $\omega$ in such formulas, but I'd like you to pay attention where I've written it on both sides. On the right hand side, it's next to $X_i$ - because this $X_i$ becomes a number when we throw the dice. On the left hand side, it's not next to $X_i$ - that's because $\mathbb{P}(X_i=6)$ is a quantity from probability theory, and writing $\mathbb{P}(X_i(\omega)=6)$ doesn't make sense.     
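
The estimator $\hat{\mathbb{P}}(X_i = 6)$ can be illustrated with a few lines of code. This is only a minimal sketch: the throws are simulated with `numpy`, and the function name `estimate_prob_six`, the seed and the number of throws are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility


def estimate_prob_six(throws):
    """Proportion of throws equal to six, i.e. (1/n) * sum of [X_i = 6]."""
    throws = np.asarray(throws)
    indicators = (throws == 6)  # the indicator [X_i = 6] for every throw
    return indicators.mean()    # the mean of 0/1 values is the proportion


# One realization of X_1, ..., X_n for a fair dice (the upper bound is exclusive)
n = 600
throws = rng.integers(1, 7, size=n)
p_hat = estimate_prob_six(throws)
print("Estimated probability of six dots:", p_hat)
print("True probability of six dots:", 1 / 6)
```

Running the cell again with a different seed gives a different realization and therefore a different estimate, while the true probability stays the same.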

Throwing a dice is a very simple example of a random experiment. In the next exercise, we'll deal with a more complicated random variable: the length of a randomly selected human protein (here, the length is measured in terms of the number of amino acids). The parameter that we want to estimate is the average length of a human protein.  





**Exercise 2.** In this exercise, we'll do an empirical analysis of the properties of the estimator of the mean. We'll use a sample of $N=1000$ randomly selected human proteins. Denote $X_i$ as the length of a randomly selected human protein, and $\log(X_i)$ as its base-10 logarithm. Define the following two estimators:

$$\hat{\mu}_X = \sum_{i=1}^N X_i/N, \text{an estimator of the mean length}$$

$$\hat{\mu}_{\log(X)} = \sum_{i=1}^N \log(X_i)/N, \text{an estimator of the mean log-length}$$  

First, we'll draw $R=2000$ independent samples and calculate the estimators. Here's an example way to do this:   

1. Create empty lists called e.g. `means` and `means_log`.  
2. Repeat the following $R$ times (e.g., using a `for` loop):  
    2.1. Get a random sample of size $N$ of the observations (i.e., rows) from `human_protein_lengths`; you can use the `.sample()` method.   
    2.2. Calculate the mean length and append to `means`  
    2.3. Calculate the mean log-length and append to `means_log`.     
3. Convert both lists to `numpy` arrays (e.g. `means = np.array(means)`)

Now, we can inspect how well the estimators approximate the *true* mean length $\mu_X$ and the *true* mean log-length $\mu_{\log{X}}$ (notice the lack of hats above $\mu$'s - this means that these are the true parameters, not estimators).

4. Estimate the mean value of the estimator of the mean (by running `np.mean(means)`). Is it close to the true value $\mu_X$? In other words, does the estimator seem *unbiased*?
5. Estimate the bias of the estimator of the mean log-length (using the values in `means_log`). Does it seem biased? Does the result agree with the theoretical one about the estimator of the mean?       
6. Estimate the Root Mean Square Error of the estimator of the mean, defined as $\text{RMSE}(\hat{\mu}_X) = \sqrt{\mathbb{E}(\hat{\mu}_X - \mu_X)^2}$. This will tell you, approximately, the average error of $\hat{\mu}_X$ in terms of the number of amino acids (the building blocks of proteins). Why did I write *approximately*? (Hint: it's not just because we estimate it rather than calculate it theoretically)
7. Estimate $\text{RMSE}(\hat{\mu}_{\log(X)})$. How can you interpret the result?
8. Estimate the standard deviations of the estimators. Which one is less variable? Does it mean that one quantity is easier to estimate than the other?
9. Is $\text{sd}(\hat{\mu}_X) = \text{RMSE}(\hat{\mu}_X)$? Why/why not? Is the equation always true, sometimes true, or never true?    


In [None]:
R = 2000
N = 1000
means, means_log = np.zeros(R), np.zeros(R)
for i in range(R):
  sample = human_protein_lengths.sample(N)
  means[i], means_log[i] = sample["Protein length"].mean(), sample["LogLength"].mean()

In [None]:
print("Symmetry of the distributions of the estimators:")
print("Fraction of estimates over the true mean of lengths:", np.mean(means > true_mean))
print("Fraction of estimates over the true mean of log-lengths:", np.mean(means_log > true_mean_log))
print()
print("Means of the estimators:")
print("Mean for mean length:", np.mean(means))
print("Mean for mean log-length:", np.mean(means_log))
print()
print("Bias of the estimators:")
print("Bias for mean length:", np.mean(means) - true_mean)
print("Bias for mean log-length:", np.mean(means_log) - true_mean_log)
print()
print("RMSE of the estimators:")
print("RMSE for mean length:", np.sqrt(np.mean((means - true_mean) ** 2)))
print("RMSE for mean log-length:", np.sqrt(np.mean((means_log - true_mean_log) ** 2)))
print()
print("Variability of the estimators:")
print("SD for mean length:", means.std())
print("SD for mean log-length:", means_log.std())

Symmetry of the distributions of the estimators:
Fraction of estimates over the true mean of lengths: 0.4645
Fraction of estimates over the true mean of log-lengths: 0.5055

Means of the estimators:
Mean for mean length: 693.2565299999999
Mean for mean log-length: 2.7117541366311633

Bias of the estimators:
Bias for mean length: 0.6007547398910447
Bias for mean log-length: 0.00021450157154978555

RMSE of the estimators:
RMSE for mean length: 24.14696205011309
RMSE for mean log-length: 0.010659457102892195

Variability of the estimators:
SD for mean length: 24.139487774020804
SD for mean log-length: 0.010657298663554542


*Theory time.* After the last exercise, you may have been confused by all the different random variables, especially in phrases like *the mean of the estimator of the mean*. Let's summarize what was random and what was not:      
- The *true* mean length $\mu$ is not random in this case - it's an observed quantity, i.e. a constant value; the same goes for the standard deviation of the length $\sigma$. This is what we call the true mean and standard deviation.
- The randomly selected length is a random variable, let's call it $X$.
- The expected value of $X$, denoted $\mathbb{E}(X)$, is a constant value calculated theoretically and equal to the true mean $\mu$.
- The mean protein length estimated from the sample of size $N$ is a random variable $\hat{\mu}_X$ - an estimator of the true expected value $\mu$ of a random variable $X$.
- The estimator has its own expected value and standard deviation - which are not random quantities, but parameters calculated analytically, i.e., they are constant values again. We have $\mathbb{E}\hat{\mu} = \mu$, i.e. the empirical mean is an unbiased estimator of the true mean. This is (almost) never true for the estimator of the standard deviation (unless you use some custom estimators for specific distributions or have very simple data).
- The mean of the means calculated on 2000 samples is again a random variable - it's another estimator, that we could (but probably shouldn't...) denote as $\hat{\mu}_{\hat{\mu}}$, an estimator of the true expected value of a random variable $\hat{\mu}$.
- In summary: if we calculate, e.g., a standard deviation of 2000 empirical means, we get an estimate (a random value) of the standard deviation (a constant value) of the estimator (a random value) of the true mean (a constant value) of the length of a randomly selected protein (a random value).  
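
The relation between the bias, the standard deviation and the RMSE of an estimator can be checked numerically. The sketch below does not use the protein data: it assumes a synthetic log-normal population (its parameters are arbitrary) and, for simplicity, samples with replacement. It verifies that the estimated quantities satisfy $\text{RMSE}^2 = \text{bias}^2 + \text{sd}^2$.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed

# Synthetic stand-in for a population (an assumption, not the protein data)
population = rng.lognormal(mean=6.0, sigma=0.7, size=50_000)
true_mean = population.mean()

R, N = 2000, 1000
# R samples of size N, drawn with replacement for simplicity
samples = rng.choice(population, size=(R, N), replace=True)
means = samples.mean(axis=1)  # R realizations of the estimator of the mean

bias = means.mean() - true_mean
sd = means.std()  # ddof=0, so the decomposition below holds exactly
rmse = np.sqrt(np.mean((means - true_mean) ** 2))

print("Bias:", bias)
print("SD:", sd)
print("RMSE:", rmse)
print("RMSE^2 equals bias^2 + SD^2:", np.isclose(rmse**2, bias**2 + sd**2))
```

When the bias is close to zero, the RMSE and the standard deviation are almost equal, which is what the output of Exercise 2 shows for both estimators.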


**Exercise 3.** The standard deviation of an unbiased estimator tells us how much it fluctuates around the true value. An estimator with a lower standard deviation will more often give us values that are close to the true one. This way, we can compare two estimators of the same thing.  

However, the standard deviation is often not that useful when comparing the measurements of two different things. This is because it depends on the units of the measurement. Suppose we measure the length $L$ of some objects in meters, and the standard deviation of the measurement is $\text{sd}(L)$. Measuring the same object in centimeters will give us a measurement equal to $100L$, and the corresponding standard deviation $\text{sd}(100L) = 100\text{sd}(L)$ will appear to be much larger, but it doesn't mean that measurements in centimeters are more difficult. To make matters worse, in real-life applications, the variability of the measurement often depends on its average value, regardless of the units. The standard deviation of the height of a mouse (a few millimeters) is much lower than the one of an elephant (several centimeters), but it doesn't mean that mice are easier to measure. Similarly, in the case of protein length and log-length, the latter is much smaller, so it can be expected that its standard deviation will be smaller as well.

To evaluate the variability of an estimator regardless of its units and the average value, we can calculate a so-called [*coefficient of variation*](https://en.wikipedia.org/wiki/Coefficient_of_variation) (variation, not variance!). For a random value $Y$, this is defined as $\text{cv}(Y) = \text{sd}(Y)/\mathbb{E}(Y)$.   

1. Calculate the coefficients of variation for the estimators of mean protein length, i.e. $\text{cv}(\hat{\mu}_X)$, and log-length, i.e. $\text{cv}(\hat{\mu}_{\log(X)})$. Which estimator is better in this case? In general, does a lower coefficient of variation always mean that an estimator is better?
2. Is $\text{cv}(Y)$ always equal to  $\text{sd}(Y/\mathbb{E}(Y))$? Is there a condition for $Y$ that makes it equal? Give an analytical argument and verify that empirically on the protein length data.
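
The dependence of the standard deviation on the units, and the lack of such dependence for the coefficient of variation, can be seen on a toy example. This is a sketch on simulated measurements: the helper `coefficient_of_variation` and the numbers used are made up for illustration.

```python
import numpy as np


def coefficient_of_variation(values):
    """Empirical cv: standard deviation divided by the mean."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()


rng = np.random.default_rng(seed=2)  # arbitrary seed

# Hypothetical lengths measured in meters, and the same lengths in centimeters
lengths_m = rng.normal(loc=2.0, scale=0.1, size=10_000)
lengths_cm = 100 * lengths_m

print("SD in meters:", lengths_m.std())
print("SD in centimeters:", lengths_cm.std())
print("CV in meters:", coefficient_of_variation(lengths_m))
print("CV in centimeters:", coefficient_of_variation(lengths_cm))
```

The standard deviation grows a hundredfold after the change of units, while the coefficient of variation stays the same up to floating-point error.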





In [None]:
print("Coefficient of variation of the estimators:")
print("CV for mean length:", means.std()/np.mean(means))
print("CV for mean log-length:", means_log.std()/np.mean(means_log))
print()
print("Coefficient of variation of the estimators, method 2:")
print("CV for mean length:", np.std(means/np.mean(means)))
print("CV for mean log-length:", np.std(means_log/np.mean(means_log)))

Coefficient of variation of the estimators:
CV for mean length: 0.0348204261040582
CV for mean log-length: 0.003930038685879613

Coefficient of variation of the estimators, method 2:
CV for mean length: 0.03482042610405821
CV for mean log-length: 0.003930038685879613


**Exercise 4.\*** Remember that the estimators are random variables with their own distributions! We can explore their distributions visually.  

1. Draw histograms of the estimators of the mean $\hat{\mu}_X$ and the log-mean $\hat{\mu}_{\log(X)}$.
2. Annotate the histograms with the true values $\mu_X$ and $\mu_{\log{X}}$ that you have computed in Exercise 1. You can do the annotation any way you want; I typically use a red dot at the bottom of the histogram or a vertical line. Are the estimators centered around the true values? Which estimator is more focused (i.e. less spread) around the true value?
3. Annotate the histograms with the average values of the estimators that you have computed in Exercise 2. Use different colors than in the previous point.
4. Is the distribution of the estimator $\hat{\mu}_{\log(X)}$ similar to the distribution of the protein log-lengths that you visualized in Exercise 1?
5. Is the distribution of the estimator $\hat{\mu}_X$ similar to the distribution of the protein lengths that you visualized in Exercise 1? Or maybe it's more similar to the normal distribution now? Why?
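
One way to compare the shape of the distribution of the data with the shape of the distribution of the estimator, without looking at histograms, is to compare their skewness (zero for any symmetric distribution, including the normal one). The sketch below assumes a synthetic, strongly right-skewed log-normal population instead of the protein data, and samples with replacement for simplicity.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=6)  # arbitrary seed

# Synthetic right-skewed population (an assumption, not the protein data)
population = rng.lognormal(mean=6.0, sigma=0.7, size=100_000)

R, N = 2000, 1000
# R realizations of the sample mean, each computed from N observations
means = rng.choice(population, size=(R, N), replace=True).mean(axis=1)

print("Skewness of the population:", skew(population))
print("Skewness of the sample means:", skew(means))
```

The sample means are much more symmetric than the individual observations, in line with what the Central Limit Theorem suggests for large $N$.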


In [None]:
len_mean = px.histogram(means,
                        title="Distribution of the length means")
len_mean.add_vline(true_mean, line_color="red")
len_mean.add_vline(np.mean(means), line_color="green")

len_mean_log = px.histogram(means_log,
                        title="Distribution of the log-length means")
len_mean_log.add_vline(true_mean_log, line_color="red")
len_mean_log.add_vline(np.mean(means_log), line_color="green")

len_mean.show()
len_mean_log.show()

**Exercise 5.\*** We've learned how to analyze the properties of an estimator for a fixed sample size $N$, and we can use this knowledge to do something even more useful: determining how many observations we need for an estimation with a given precision.

1. Using the equations for the expected value and the standard deviation of the estimator of the mean, $\hat{\mu}_X = \sum_{i=1}^N X_i/N$ where $X_i$ is the length of a randomly selected protein, calculate the sample size $N^*$ that we need to take in order for the standard deviation of the estimator to be equal to a fraction $p$ of the true mean (i.e. $\sigma_{\hat{\mu}} = p\mu$; note the lack of a hat above sigma - it's not a random variable, but a parameter of the estimator). Express it in terms of coefficients of variation. You will need to assume that the standard deviation of the protein length is known; use the value in the `true_std` variable from Exercise 1.
2. Calculate $N^*$ for the estimator of the average log-length (use the true standard deviation of protein log-lengths in the `true_std_log` variable). Is there a noticeable difference compared to the estimator of the average length? Which quantity is easier to estimate?   
3. Analyze one of the estimators for a sample of size of the corresponding $N^*$: visualize its distribution, estimate its bias, standard deviation, coefficient of variation and RMSE. You can simply modify the code from the previous exercises.   
  3.1 Use the results to verify if your calculation of $N^*$ was correct.  
  3.2 How did the distribution of the estimator change compared to the previous sample size ($N = 1000$)?  
4. *Quick question 1*. Does $N^*$ depend on the number of proteins that humans have? In order to get the same precision of the estimation (measured in terms of the RMSE), would you need a larger sample size if humans had a million proteins?  
5. *Quick question 2.* Does $N^*$ depend on the distribution of the data?


*Note.* In practice, the estimation of $N^*$ is often more difficult, because we usually don't know the true standard deviation; instead, we need to estimate it and take into account the error of this estimation when deriving $N^*$. Because of this, the required sample size will typically be larger than in the case of a known standard deviation. This topic is too complex to cover in this course, so we'll focus on the simpler case with a known standard deviation.
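
For reference, the calculation for the known-$\sigma$ case can be sketched as follows. It assumes i.i.d. observations, so that the standard deviation of the estimator of the mean is $\sigma/\sqrt{N}$, where $\sigma$ and $\mu$ denote the true standard deviation and mean of a single observation $X$:

$$\sigma_{\hat{\mu}} = \frac{\sigma}{\sqrt{N}} = p\mu \quad\Longrightarrow\quad N^* = \left(\frac{\sigma}{p\mu}\right)^2 = \left(\frac{\text{cv}(X)}{p}\right)^2$$

Since the sample size has to be an integer, the value is rounded up. Equivalently, $\text{cv}(\hat{\mu}) = \text{cv}(X)/\sqrt{N}$, so the required sample size grows with the square of the coefficient of variation of a single observation.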


In [None]:
p = 0.01
N = math.ceil((true_std / (p * true_mean))**2)
N_log = math.ceil((true_std_log / (p * true_mean_log))**2)

print("Required number of observations for estimator's CV equal:", p)
print("For protein length:", N)
print("For protein log-length:", N_log)

Required number of observations for estimator's CV equal: 0.01
For protein length: 11631
For protein log-length: 149


In [None]:
R = 1000
means, means_log = np.zeros(R), np.zeros(R)
for i in range(R):
  sample = human_protein_lengths['Protein length'].sample(N)
  sample_log = human_protein_lengths['LogLength'].sample(N_log)
  means[i], means_log[i] = sample.mean(), sample_log.mean()

print("The estimated mean and sd of the estimator of the mean for the calculated sample size:")
print(means_log.mean(), means_log.std())
print("The estimated CV of the estimator of the mean for the calculated sample size:")
print( means_log.std()/means_log.mean())

sample_hist = px.histogram(means_log, title="Distribution of the estimator of the mean log-length")
sample_hist.show()

The estimated mean and sd of the estimator of the mean for the calculated sample size:
2.711877385855875 0.026705113965234298
The estimated CV of the estimator of the mean for the calculated sample size:
0.009847463644380846


**Exercise 6.\*\*** Can we use the estimated average log-length to estimate the average length of human proteins?  

1. Consider a statistic given by the formula $\hat{\zeta} = 10^{\hat{\mu}_{\log(X)}}$. Is it an estimator of the average protein length?  
  1.1 Does $\hat{\zeta}$ correspond to any well-known mathematical object, e.g., some kind of mean?
2. Regardless of the answer, let's try to use $\hat{\zeta}$ to estimate the average protein length. Use the randomly sampled values of $\hat{\mu}_{\log(X)}$ from the previous exercises to calculate the corresponding values of $\hat{\zeta}$ and to estimate this estimator's expected value and standard deviation. You can use a sample size of your choice; try to compare the results for different sample sizes, like $N = 10, 100, 1000$.  
3. Based on the results, do you think that $\hat{\zeta}$ is an unbiased estimator of the average length? Try to confirm your expectations by deriving formulas for the expected value of $\hat{\zeta}$.  
  3.1.\* If $\hat{\zeta}$ is biased, then how does the bias scale with the number of observations? Check this either theoretically by analyzing equations or empirically by estimating the bias for different sample sizes (e.g., create a plot showing the estimated bias depending on the sample size).   
4. Does $\hat{\zeta}$ have a lower or a higher variance than $\hat{\mu}_{X}$? What about the coefficient of variation?   
5. Plot the values of the estimators $\hat{\zeta}$ and $\hat{\mu}_X$ on two boxplots side-by-side and annotate it with the true average length (for example, using a horizontal line). Which estimator seems better?  
6. Which estimator has a lower RMSE: $\hat{\zeta}$ or $\hat{\mu}_X$? Why?  
7.\* Do we have $\text{RMSE}(\hat{\zeta}) \geq \text{RMSE}(\hat{\mu}_X)$ for all sample sizes? If not, then try to characterize the sample sizes for which  $\hat{\zeta}$ works better than $\hat{\mu}_X$.  
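
As a starting point for this exercise, the sketch below compares $\hat{\zeta}$ and $\hat{\mu}_X$ on synthetic data. It assumes a population whose base-10 logarithms are normally distributed (the parameters are arbitrary and only loosely inspired by the log-lengths above) and samples with replacement; it is not a substitute for the analysis on the protein data.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed

# Synthetic population: the base-10 logarithms are normal (an assumption)
log_values = rng.normal(loc=2.7, scale=0.33, size=100_000)
population = 10 ** log_values

arithmetic_mean = population.mean()
geometric_mean = 10 ** log_values.mean()
print("Arithmetic mean of the population:", arithmetic_mean)
print("Geometric mean of the population:", geometric_mean)

R, N = 2000, 100
# Indices of R samples of size N, drawn with replacement
idx = rng.integers(0, population.size, size=(R, N))
zeta = 10 ** log_values[idx].mean(axis=1)  # R realizations of zeta-hat
mu_hat = population[idx].mean(axis=1)      # R realizations of mu-hat

print("Estimated bias of zeta w.r.t. the arithmetic mean:", zeta.mean() - arithmetic_mean)
print("Estimated bias of mu_hat w.r.t. the arithmetic mean:", mu_hat.mean() - arithmetic_mean)
print("SD of zeta:", zeta.std(), "SD of mu_hat:", mu_hat.std())
```

Changing `N` in this sketch is a quick way to see which of the quantities shrink with the sample size and which do not.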



In [None]:
zeta = 10**(true_mean_log)
geometric_mean = gmean(human_protein_lengths["Protein length"])
print(f"Zeta = {zeta} correspond to geometric mean = {geometric_mean}")

N = 1000
sample_mean_log = np.random.choice(means_log, N)
sample_zeta = 10**(sample_mean_log)
print("Mean, standard deviation and CV of the geometric mean (zeta):")
print(sample_zeta.mean(), sample_zeta.std(), sample_zeta.std()/sample_zeta.mean())


Zeta = 514.6827764147753 correspond to geometric mean = 514.6827764147753
Mean, standard deviation and CV of the geometric mean (zeta):
514.5538877923622 31.22911321938809 0.06069162814680737


## Interval estimation

In the previous section, we've learned how to quantify and analyze the uncertainty of an estimation by analyzing the standard deviation of the estimator. Estimating the value of a parameter is fine, but what does it tell us about the true value of the parameter? If we throw a dice 1000 times and get $\hat{\mathbb{P}}(X_i = 6) = 0.159$, it obviously doesn't mean that $\mathbb{P}(X_i = 6) = 0.159$ - it's only an approximation.

In this section, we will learn a different technique - the estimation of *confidence intervals*, i.e., intervals which are likely to contain the true value of the parameter of interest. This tells us concretely what values of the true parameter are reasonable, rather than give us only an approximation.  

In general, we say that a confidence interval $[A, B]$ for a parameter $\theta$ (like the mean, or the standard deviation, or some probability, etc.) has a confidence level $1-\alpha$ if it contains the true value of the parameter $\theta$ with probability $1-\alpha$:

$$\mathbb{P}(A \leq \theta \leq B) = 1-\alpha$$

Above, $A$ and $B$ are random variables calculated from the data (i.e. $A$ and $B$ are *statistics*). The most commonly used value for $\alpha$ is 0.05. Note: some authors use a more general definition with $\mathbb{P}(A \leq \theta \leq B) \geq 1-\alpha$ instead of $\mathbb{P}(A \leq \theta \leq B) = 1-\alpha$, which is sometimes useful, especially when we can't determine the exact probability and can only give its lower bound (you've seen this in the lecture with the Chebyshev confidence intervals).

In principle, we can construct confidence intervals for any parameter of any distribution (e.g., the expected value, the variance, the proportion in the Bernoulli distribution, the shape parameter of the Gamma distribution, etc.), but this is often difficult in practice. We'll focus on confidence intervals for the true mean (i.e., the expected value) of a normally distributed population -- this is one of the most commonly used and one of the most useful confidence intervals.

For the expected value of a normally distributed random variable, there are two commonly used confidence intervals: the confidence interval when the standard deviation $\sigma$ is known, given by the equation

$$\left (\hat{\mu} - q_{1-\alpha/2}\frac{\sigma}{\sqrt{N}},\quad \hat{\mu} + q_{1-\alpha/2}\frac{\sigma}{\sqrt{N}} \right ), $$
where $q_{1-\alpha/2}$ is the quantile of the standard normal distribution at the level of $1-\alpha/2$; and the confidence interval when the standard deviation $\sigma$ is unknown, given by the equation

$$\left (\hat{\mu} - t_{1-\alpha/2, N-1}\frac{\hat{\sigma}}{\sqrt{N}},\quad \hat{\mu} + t_{1-\alpha/2, N-1}\frac{\hat{\sigma}}{\sqrt{N}} \right ), $$
where $t_{1-\alpha/2, N-1}$ is the quantile of the Student's $t$ distribution with $N-1$ degrees of freedom at the level of $1-\alpha/2$, and $\hat{\sigma}$ is the square root of the **unbiased** estimator of the variance, i.e. $\hat{\sigma} = \sqrt{\sum_{i=1}^N (X_i - \bar{X})^2/(N-1)}$, where $\bar{X} = \hat{\mu} = \sum_{i=1}^N X_i/N$.

If we simply plug $\hat{\sigma}$ instead of $\sigma$ in the first kind of the confidence interval (the one for a known $\sigma$), we get a third type of a confidence interval, a so-called *asymptotic confidence interval* for the mean; the name *asymptotic* comes from the fact that $\hat{\sigma} \to \sigma$ as $N \to \infty$. As a consequence of this convergence, the asymptotic confidence interval gives quite accurate results for large sample sizes.  
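
The meaning of the confidence level can be illustrated by a simulation in which the true mean is known. The sketch below assumes a normal population with arbitrarily chosen parameters and a small sample size. It repeats the experiment many times and counts how often each of the three intervals described above contains the true mean.

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(seed=4)  # arbitrary seed

# Hypothetical normal population and experiment settings
mu, sigma = 2.7, 0.33
n, alpha, repeats = 10, 0.05, 10_000

# Each row is one sample of size n
samples = rng.normal(mu, sigma, size=(repeats, n))
mu_hat = samples.mean(axis=1)
sigma_hat = samples.std(axis=1, ddof=1)  # square root of the unbiased variance estimator

z = norm.ppf(1 - alpha / 2)           # standard normal quantile
t_q = t.ppf(1 - alpha / 2, df=n - 1)  # Student's t quantile with n-1 degrees of freedom

# Half-widths of the three intervals, one value per sample
half_widths = {
    "known sigma": np.full(repeats, z * sigma / np.sqrt(n)),
    "studentized": t_q * sigma_hat / np.sqrt(n),
    "asymptotic": z * sigma_hat / np.sqrt(n),
}

# Fraction of samples whose interval contains the true mean
coverage = {name: np.mean(np.abs(mu_hat - mu) <= hw) for name, hw in half_widths.items()}
for name, value in coverage.items():
    print(f"Coverage of the {name} interval: {value:.3f}")
```

For such a small sample, the asymptotic interval is always narrower than the studentized one, so it cannot contain the true mean more often; the gap between the two shrinks as `n` grows.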

**Exercise 7.** In this exercise, we'll see how to use the confidence intervals to analyze the true average value of the human protein log-lengths.  

For the quantiles, you can use the `norm.ppf` and `t.ppf` functions from the `scipy.stats` package.

1. Select a sample of size $N=20$ of human protein log-lengths. Calculate the mean and standard deviation from the sample.  
2. Calculate the studentized confidence interval, i.e. the one when $\sigma$ is unknown (after all, we estimated it in Point 1, so we pretend that we don't know the true value). Use $\alpha=0.05$, i.e. confidence level 95\%. Pay attention to the type of the estimator of standard deviation that you use! Some packages use the unbiased estimator of the variance, some don't!      
  2.1. Does the confidence interval contain the true average log-length?   
  2.2. How long is the confidence interval?  
  2.3. In your opinion, does it give you a precise estimation of the average log-length?  
  2.4. In your opinion, is interval estimation more useful than parameter estimation?    
  2.5. Which of the answers depend on the value of $N$?   
3. Now, use the true standard deviation (in the variable `true_std_log`) to calculate the Gaussian confidence interval, i.e. the one when $\sigma$ is known. Use a confidence level 95\%, i.e. $\alpha = 0.05$.   
  3.1. Does it contain the true average log-length?  
  3.2. Is it shorter or longer than the studentized one (i.e. the one when $\sigma$ is unknown)?  
  3.3. What are the advantages and disadvantages of this type of confidence interval compared to the one when $\sigma$ is unknown?  
4. Can you use any of these confidence intervals for protein lengths, rather than log-lengths?   
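
The warning about the estimator of the standard deviation can be checked directly. The sketch below uses a small made-up list of numbers to compare the defaults of `numpy` and `pandas`: `np.std` divides by $N$ unless `ddof=1` is passed, while `pandas.Series.std` divides by $N-1$ by default.

```python
import numpy as np
import pandas as pd

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up numbers

biased = np.std(values)                   # NumPy default: ddof=0, divides by N
unbiased = np.std(values, ddof=1)         # divides by N - 1
pandas_default = pd.Series(values).std()  # pandas default: ddof=1

print("np.std, ddof=0:   ", biased)
print("np.std, ddof=1:   ", unbiased)
print("pd.Series.std():  ", pandas_default)
```

The difference is small for large samples, but for $N=20$ it is already visible in the length of the interval.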


In [None]:
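N = 20
alfa = 0.05
s = human_protein_lengths["LogLength"].sample(N)
# pandas' Series.std() uses ddof=1 by default, i.e. the unbiased variance estimator
s_mean, s_std = s.mean(), s.std()

# Studentized interval (sigma unknown): t quantile with N - 1 degrees of freedom
x = tstud.ppf(1 - alfa/2, df = N - 1) * s_std / np.sqrt(N)
s_cofidence_interval_tstud = [s_mean - x, s_mean + x]

# Gaussian interval (sigma known): standard normal quantile and the true std
y = norm.ppf(1 - alfa/2) * true_std_log / np.sqrt(N)
s_cofidence_interval_gauss = [s_mean - y, s_mean + y]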

print("True mean log-length:")
print(true_mean_log)
print("Studentized confidence interval:")
print(s_cofidence_interval_tstud)
print("Gaussian confidence interval:")
print(s_cofidence_interval_gauss)


**Exercise 8.** In the previous exercise, we've seen how to use the confidence intervals. However, we've only used them on a single sample. A single calculation doesn't show us how variable their lengths are, how often they really contain the true parameter value, and so on. Perhaps most importantly, we don't know what happens when the assumptions are violated - for example, when we use the formulas to construct confidence intervals for the lengths rather than log-lengths, or when we use the asymptotic confidence intervals.

In this exercise, we'll do a more comprehensive empirical comparison of the properties of the three types of confidence intervals. Because the confidence intervals are random variables, it's not enough to calculate them just once - we need to calculate them many times, using independent samples, to see the full picture. We'll draw $R$ samples of some size $N$, calculate the corresponding confidence intervals for the mean, check whether they have the desired confidence level, and compare their lengths.   

First, we'll prepare our data.    
1. Create empty lists (or `numpy arrays`) that will contain the information whether the true mean is within a confidence interval (e.g., `within_normal` for the confidence interval with a known $\sigma$, `within_student` for the confidence interval with an unknown $\sigma$, `within_asymptotic` for the asymptotic confidence interval).   
2. Create empty lists (or `numpy arrays`) that will contain the lengths of the intervals.    
3. Repeat the following $R=1000$ times (or more):  
  3.1. Select a random sample of size $N$ of protein log-lengths; select $N$ of your choice.   
  3.2. Calculate the three confidence intervals at the 95% confidence level. For the normal confidence interval, use the known standard deviation in `true_std_log`.  
  3.3. Calculate the lengths of the confidence intervals and append them to the corresponding lists.  
  3.4. Check whether the confidence intervals contain the true average log-length $\mu_{\log(X)}$, append the information to the corresponding lists.  

Now, we'll use the generated data to analyze the properties of the confidence intervals.

4. For each type of the confidence interval, estimate the probability that it contains $\mu_{\log(X)}$. Is the estimated probability close to the desired confidence level for each type? Why/why not? Does the answer depend on $N$?  
5. Calculate the average length of each type of the confidence interval. Which type tends to give the shortest intervals? Which type tends to give the longest? Why?    
6. Plot histograms depicting the distribution of the lengths of the confidence intervals.
7. What are the advantages and disadvantages of each type of the confidence interval? Does the asymptotic confidence interval have any advantages over the other two?   
8. *Quick question.* For a single sample, do all of the three types of confidence intervals always contain $\hat{\mu}_{\log(X)}$?  

Repeat this exercise using the protein lengths instead of log-lengths. What went wrong and why? Does the answer depend on $N$?      


In [None]:
R = 2000
N = 1000
alfa = 0.05

within_normal, within_student, within_asymptotic = np.zeros(R), np.zeros(R), np.zeros(R)
# Columns of intervals_lengths: 0 = Student, 1 = Gaussian (known sigma), 2 = asymptotic
intervals_lengths = np.zeros((R, 3))

# The quantiles do not depend on the sample, so we compute them once
t_quantile = tstud.ppf(1 - alfa/2, df = N - 1)
z_quantile = norm.ppf(1 - alfa/2)

for i in range(R):
  s = human_protein_lengths["LogLength"].sample(N)
  s_mean, s_std = s.mean(), s.std()  # Series.std() uses ddof=1 (unbiased variance)

  # Half-widths of the Student, Gaussian and asymptotic intervals
  x = t_quantile * s_std / np.sqrt(N)
  y = z_quantile * true_std_log / np.sqrt(N)
  z = z_quantile * s_std / np.sqrt(N)
  intervals_lengths[i, :] = [2 * x, 2 * y, 2 * z]

  # Each interval is symmetric around s_mean, so it contains the true mean
  # exactly when |s_mean - true_mean_log| is at most its half-width
  deviation = abs(s_mean - true_mean_log)
  within_student[i] = deviation <= x
  within_normal[i] = deviation <= y
  within_asymptotic[i] = deviation <= z

print("Estimated confidence levels:")
print("Gaussian confidence interval:", np.mean(within_normal))
print("Student confidence interval:", np.mean(within_student))
print("Asymptotic confidence interval:", np.mean(within_asymptotic))
print()
print("Average lengths:")
print("Gaussian confidence interval:", np.mean(intervals_lengths, axis=0)[1])
print("Student confidence interval:", np.mean(intervals_lengths, axis=0)[0])
print("Asymptotic confidence interval:", np.mean(intervals_lengths, axis=0)[2])


student_hist = px.histogram(intervals_lengths[:, 0],
             title="Student confidence interval length")

asymptotic_hist = px.histogram(intervals_lengths[:, 2],
             title="Asymptotic confidence interval length")

student_hist.show()
asymptotic_hist.show()

Estimated confidence levels:
Gaussian confidence interval: 0.944
Student confidence interval: 0.945
Asymptotic confidence interval: 0.945

Average lengths:
Gaussian confidence interval: 0.0408931259087184
Student confidence interval: 0.0409500905658425
Asymptotic confidence interval: 0.04090047744613531
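
For the last part of the exercise (violated assumptions), the following sketch may help to see the effect in isolation. It estimates the coverage of the Student interval on synthetic normal data and on synthetic, strongly right-skewed log-normal data. The log-normal distribution is only a stand-in chosen for this example; the real protein lengths may behave differently. The helper `student_coverage` is hypothetical.

```python
import numpy as np
from scipy.stats import t


def student_coverage(draw_sample, true_mean, n, repeats=2000, alpha=0.05):
    """Estimate how often the Student interval contains true_mean."""
    samples = draw_sample((repeats, n))  # one row per repetition
    means = samples.mean(axis=1)
    stds = samples.std(axis=1, ddof=1)
    half = t.ppf(1 - alpha / 2, df=n - 1) * stds / np.sqrt(n)
    return np.mean(np.abs(means - true_mean) <= half)


rng = np.random.default_rng(1)
n = 10
normal_cov = student_coverage(lambda size: rng.normal(0.0, 1.0, size), 0.0, n)
# A log-normal variable with parameters (0, 1) has mean exp(1/2)
lognormal_cov = student_coverage(
    lambda size: rng.lognormal(0.0, 1.0, size), np.exp(0.5), n
)
print(f"normal data:     {normal_cov:.3f}")
print(f"log-normal data: {lognormal_cov:.3f}")
```

Try increasing `n` to see how the coverage for the skewed data changes with the sample size.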


**Exercise 9.** A researcher has measured the blood glucose concentration (BGC) on a well-selected sample of 1000 individuals. The researcher has determined that BGC is normally distributed and that the 95% confidence interval for the average BGC is [192 mg/dl, 203 mg/dl]. Next, the researcher has run the following further studies. Evaluate whether they are correct, and if not, try to suggest corrections. Support your claims with analytical proofs or numerical simulations.

- The researcher has tested a new, experimental methodology for measuring BGC on a well-selected sample of another 1000 individuals. The average BGC measured this way turned out to be 205 mg/dl. The researcher has concluded that the new methodology is biased.  
- The researcher has tested a randomly selected group of 100 people who visited fast-food restaurants more than 5 times in the previous month. The confidence interval for BGC turned out to be [205 mg/dl, 210 mg/dl]. The researcher has concluded that frequently eating fast food increases the blood glucose concentration.  
- The researcher has measured BGC on another set of 1000 individuals. The 95% confidence interval for this group was [198 mg/dl, 208 mg/dl]. The researcher has concluded that the average BGC in the whole population is most likely within [198 mg/dl, 203 mg/dl].   
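
One possible starting point for a numerical simulation is sketched below. It checks how often the mean of a second, independent sample from the *same* population falls outside the 95% confidence interval built from the first sample. The population parameters `mu` and `sigma` are hypothetical, chosen only so that the interval length roughly resembles the one in the exercise.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
mu, sigma = 197.5, 90.0  # hypothetical population parameters, in mg/dl
n, repeats, alpha = 1000, 2000, 0.05

first = rng.normal(mu, sigma, size=(repeats, n))
second = rng.normal(mu, sigma, size=(repeats, n))

# Half-width of the Student interval built from each first sample
half = t.ppf(1 - alpha / 2, df=n - 1) * first.std(axis=1, ddof=1) / np.sqrt(n)
# Does the mean of the second sample land outside the interval built from the first?
outside = np.abs(second.mean(axis=1) - first.mean(axis=1)) > half
print(f"Second mean outside the first interval: {outside.mean():.3f}")
```

The confidence level describes how often the interval contains the true mean, not how often it contains the mean of another sample. The difference of two independent sample means has variance $2\sigma^2/N$, so the estimated fraction should come out well above 5%.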

**Exercise 10.\*** The length of the confidence interval is another (and more common) way to determine the required sample size. In this exercise, we'll see how to use it. As in Exercise 6, we'll focus on the case of a known standard deviation.  

1. Using the formula for the confidence interval with a known standard deviation, derive a formula for a necessary sample size $N^*$ such that the length of the confidence interval is at most some value $l$.  
2. Calculate the required sample size for the average log-length for $l = 0.3$ (approximately 10% of the true mean).   
3. Select a sample of the size calculated in the previous point and calculate an example confidence interval. Check if its length really is at most $l$.  
  3.1. *Quick question 1.* Is the length of this confidence interval a random variable? Does it depend on the random sample?     
4. *Quick question 2.* Do your calculations work for protein lengths as well?  
5. Take a look at the formula for the confidence interval for an unknown standard deviation. Can you see why it may be difficult to derive a formula for $N^*$ in this case?  
  5.1.\* How would you approach calculating $N^*$ in this case?     

In [None]:
import math
import scipy.optimize

alfa = 0.05

l = 0.1 * true_mean_log  # target length: 10% of the true mean log-length
# Solve 2 * q * sigma / sqrt(N) <= l for N and round up to an integer
N = math.ceil((2 * norm.ppf(1 - alfa/2) * true_std_log / l) ** 2)
print(f"We need {N} samples for a confidence interval of length at most {l:.3f}")

s = human_protein_lengths["LogLength"].sample(N) # works for Protein length as well
s_mean, s_std = s.mean(), s.std()

confidence_interval_norm = norm.ppf([alfa/2, 1-alfa/2], loc=s_mean, scale=true_std_log/np.sqrt(N))
print(f"Example confidence interval from {N} samples:", confidence_interval_norm)

real_length = confidence_interval_norm[1] - confidence_interval_norm[0]
print(f"Actual length: {real_length:.3f}, this length stays the same, it does not depend on the sample.")

confidence_interval_asymptotic = norm.ppf([alfa/2, 1-alfa/2], loc=s_mean, scale=s_std/np.sqrt(N))
print(f"Example asymptotic confidence interval from {N} samples:", confidence_interval_asymptotic)

asymptotic_length = confidence_interval_asymptotic[1] - confidence_interval_asymptotic[0]
print(f"Asymptotic confidence interval length: {asymptotic_length:.3f}, this length depends on a sample.")
print()

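def N_unknown_std(n, alfa=0.05, l=0.1 * true_mean_log):
  # Length of the Student interval for a fresh sample of size n, minus the target length l.
  # The sample is redrawn on every call, so this function is random
  # and the root found below varies between runs.
  print(n)  # trace the sample sizes tried by the solver
  n = int(n[0])
  s = human_protein_lengths["LogLength"].sample(n)
  return 2 * tstud.ppf(1 - alfa/2, df=n-1) * s.std() / np.sqrt(n) - l

scipy.optimize.root(N_unknown_std, [18], method='linearmixing', tol=0.01, options={'xtol':1})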

We need 23 samples for a confidence interval of length at most 0.271
Example confidence interval from 23 samples: [2.60111234 2.87075363]
Actual length: 0.270, this length stays the same, it does not depend on the sample.
Example asymptotic confidence interval from 23 samples: [2.62194504 2.84992093]
Asymptotic confidence interval length: 0.228, this length depends on a sample.

5. A closed-form formula is hard to obtain, because the t quantile itself depends on the sample size N.
One way around this is the asymptotic confidence interval, in which the population std is replaced by the std estimated from the sample.
[18.]
[27.]
[20.44584528]


 message: A solution was found at the specified tolerance.
 success: True
  status: 1
     fun: [-8.340e-03]
       x: [ 2.045e+01]
     nit: 2
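
An alternative to a stochastic root search is a deterministic one. The sketch below assumes that a rough guess of the standard deviation is available (for example, from a pilot sample) and looks for the smallest $N$ for which the *planned* length of the Student interval is at most $l$. The function name `required_n_student` and the numbers `sigma_guess` and `max_length` are hypothetical.

```python
import numpy as np
from scipy.stats import norm, t


def student_length(n, sigma_guess, alpha=0.05):
    """Planned length of the Student interval for sample size n."""
    return 2 * t.ppf(1 - alpha / 2, df=n - 1) * sigma_guess / np.sqrt(n)


def required_n_student(sigma_guess, max_length, alpha=0.05, n_max=100_000):
    """Smallest n whose planned Student interval length is at most max_length."""
    # The planned length decreases with n, so the first n that fits is the smallest one
    for n in range(2, n_max + 1):
        if student_length(n, sigma_guess, alpha) <= max_length:
            return n
    raise ValueError("no sample size up to n_max is large enough")


sigma_guess, max_length = 0.3, 0.27  # hypothetical pilot estimate and target length
n_student = required_n_student(sigma_guess, max_length)
n_known = int(np.ceil((2 * norm.ppf(0.975) * sigma_guess / max_length) ** 2))
print("Student interval:", n_student)
print("Known sigma:     ", n_known)
```

Note that this is only a planning value: the length of the interval actually computed from the data still depends on the sample through $\hat{\sigma}$, so it may end up above or below $l$.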

**Exercise 11.\*\*** In this exercise, we'll see how the three types of the confidence intervals for the mean change with the sample size $N$. We'll also see some more details about their variability. We'll work on the protein log-length data.     

1. Create variables to store the lower and upper bounds of each of the three types of the confidence intervals for the mean on a selected confidence level.
2. For each value of $N$ from 3 to 100:  
  2.1. Draw a random sample of size $N$.  
  2.2. Calculate the three confidence intervals and store their bounds.  
3. Create line plots with $N$ on the x axis and with the lower and upper bounds of the confidence intervals on the y axis.  
  3.1.\* For best results, try to create a single plot that shows all three types of confidence intervals. Make sure that each type of a confidence interval gets assigned a single, distinct color. To do this, you may need to import additional modules from `plotly`.    
4. Annotate the plots with a horizontal line corresponding to the true mean.  
5. Based on the plots, can you describe how the length of the confidence intervals changes with an increasing sample size? (this question is about the *trend*, i.e. the average change)    
6. Which type of confidence intervals has the most variability, and which has the least? Why? (this question is about the variability *around the trend*, i.e. on top of the average change)   
7. Where can you observe the largest differences between the three types of the confidence intervals?  
8. Can you see how all the three types become more and more equivalent as the sample size grows? Why does that happen?  
9. Does increasing the sample size always result in the same increase in the precision of the estimation?  

In [None]:
import plotly.graph_objects as go

lower_normal, upper_normal = [], []
lower_student, upper_student = [], []
lower_asymptotic, upper_asymptotic = [], []

N_values = np.arange(3,101)
for N in N_values:
  s = human_protein_lengths["LogLength"].sample(N)
  s_mean, s_std = s.mean(), s.std()

  x = norm.ppf(1 - alfa/2) * true_std_log / np.sqrt(N)
  lower_normal.append(s_mean - x)
  upper_normal.append(s_mean + x)

  y = tstud.ppf(1 - alfa/2, df = N - 1) * s_std / np.sqrt(N)
  lower_student.append(s_mean - y)
  upper_student.append(s_mean + y)

  z = norm.ppf(1 - alfa/2) * s_std / np.sqrt(N)
  lower_asymptotic.append(s_mean - z)
  upper_asymptotic.append(s_mean + z)

fig = go.Figure()

fig.add_trace(go.Scatter(x=N_values, y=lower_normal, mode='lines', name='normal, lower', line=dict(color='blue')))
fig.add_trace(go.Scatter(x=N_values, y=upper_normal, mode='lines', name='normal, upper', line=dict(color='blue')))

fig.add_trace(go.Scatter(x=N_values, y=lower_student, mode='lines', name='student, lower', line=dict(color='red')))
fig.add_trace(go.Scatter(x=N_values, y=upper_student, mode='lines', name='student, upper', line=dict(color='red')))

fig.add_trace(go.Scatter(x=N_values, y=lower_asymptotic, mode='lines', name='asymptotic, lower', line=dict(color='green')))
fig.add_trace(go.Scatter(x=N_values, y=upper_asymptotic, mode='lines', name='asymptotic, upper', line=dict(color='green')))

fig.update_layout(title='Confidence Intervals for Protein Log-Length',
                  xaxis_title='Sample Size (N)',
                  yaxis_title='Confidence Interval Bounds',
                  legend_title='Interval Type')

fig.add_hline(y=true_mean_log)

fig.show()

In [None]:
print("5. The length of the confidence intervals decreases as the sample size increases.")
print("6. Student's confidence interval has the most variability, because it relies on the sample std to estimate the population std. It is especially noticeable at smaller sample sizes. The Gaussian interval has the least: its length is fixed for a given N and only its center varies.")
print("7. The largest differences are observed at smaller sample sizes.")
print("8. Because, as the sample size grows, the estimated std tends toward the std of the population, and the t quantile tends toward the normal quantile.")
print("9. No. The length shrinks like 1/sqrt(N), so each additional observation improves the precision less than the previous one. For the intervals that use the sample std, random fluctuations can even make the interval for a larger N longer.")
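
The $1/\sqrt{N}$ scaling behind the last answer can be checked with a few lines of code. This is a minimal sketch for the interval with a known standard deviation; the value of `sigma` is made up.

```python
import numpy as np
from scipy.stats import norm

sigma, alpha = 0.3, 0.05  # hypothetical known standard deviation
lengths = {}
for n in (25, 100, 400, 1600):
    # Length of the known-sigma interval: 2 * q * sigma / sqrt(n)
    lengths[n] = 2 * norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
    print(f"N = {n:5d}  length = {lengths[n]:.4f}")
```

Each fourfold increase of the sample size only halves the length of the interval.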
