# Synopsis

With this notebook, we are getting to the practical side of statistics: inference.

The key realization at this point is that samples are themselves stochastic objects. This fact takes away our ability to achieve certainty about the processes under study.

## Words to remember

**statistic**

**sampling distribution**

**point estimator**

**point estimate**


# Read libraries

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

from colorama import Back, Fore, Style
from copy import copy, deepcopy
from pathlib import Path

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats

from matplotlib import gridspec
from random import sample, choice, choices

from module_libraries.my_stats import half_frame

my_fontsize = 15

# Statistical Inference

Consider a sample comprising $N$ i.i.d. random variables $x_1, x_2, \ldots, x_N$, each distributed as a random variable $X$.  (What does i.i.d. mean?)  

What is the expected value of $X$? That is, what is the mean of $X$?

What we are trying to do here is to **estimate** the value of a **function of** $X$.  When dealing with a stochastic process, we are mostly interested in such estimates.  

**THIS ESTIMATE IS ITSELF A RANDOM VARIABLE**.  

As such, it also has an expected value and a distribution. The distribution of a function calculated on the random variables comprising a sample is called a **sampling distribution**.

<br><br><br><br>

More generally, any function calculated from the random variables in a sample is called **a statistic**. The mean of a sample of random variables is a statistic.  The difference between the means of two samples of random variables is also a statistic.

Sampling distributions are important in statistics because they provide a major simplification *en route* to **statistical inference**. More specifically, they allow analytical considerations to be based on the probability distribution of a statistic, rather than on the joint probability distribution of all the individual sample values. 

<br><br><br><br>


For example, consider a normal population with mean $\mu$ and variance $\sigma^2$. Assume we repeatedly take samples of a given size from this population and calculate the arithmetic mean $\bar x$  for each sample $-$ **this statistic is called the sample mean**. 

The distribution of these means is called the **sampling distribution of the sample mean**. 

Since the underlying population is normal, this distribution is also normal, with mean $\mu$ and standard deviation $\sqrt{\sigma^2 ~/~ N}$, where $N$ is the sample size. Sampling distributions are often close to normal even when the population distribution is not, because of the **central limit theorem**.
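The claim about the spread of the sample mean can be checked numerically. The following is a minimal sketch, assuming a standard normal population; the seed, the number of repetitions, and the names `n_repeats` and `results` are illustrative choices, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, used only for reproducibility
mu, sigma = 0.0, 1.0              # assumed population parameters
n_repeats = 20000                 # number of independent samples per sample size

results = {}
for n in (5, 25, 125):
    # Each row is one sample of size n; take the mean of every row
    means = rng.normal(mu, sigma, size=(n_repeats, n)).mean(axis=1)
    # Store (observed spread of the sample means, theoretical sigma / sqrt(n))
    results[n] = (means.std(ddof=1), sigma / np.sqrt(n))
    print(f"n = {n:3d}: observed std = {results[n][0]:.4f}, "
          f"sigma/sqrt(n) = {results[n][1]:.4f}")
```

The two columns should agree closely, and the agreement should improve as `n_repeats` grows.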

The mean of a sample from a population having a normal distribution is a case where one can get analytical results.  For other statistics and other populations, the situation is more complicated, and we are often unable to obtain closed-form expressions. **In such cases the sampling distributions may be approximated through Monte-Carlo simulations, bootstrap methods, or asymptotic distribution theory**. 

It is critical that you realize that the sampling distribution depends on:

> the underlying distribution of the population, 
>
> the statistic being considered, 
>
> the sampling procedure employed, and 
>
> the sample size used. 

**Let's try to check these statements computationally!**



# Sampling distribution

## Monte Carlo simulation


In [None]:
n_samples = 2000
mu = 0.0
sigma = 1.0
my_sizes = [5, 25, 125]

statistic = np.mean
name_statistic = 'mean'
x_max = 3
x = np.linspace(-x_max, x_max, 800 )

sampling_dist = {}

for n in my_sizes:
    sampling_dist[n] = []
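    for _ in range(n_samples):
        # Swap in the next line to sample an exponential population instead
        # draws = stats.expon.rvs(0, 1, size = n)
        # Named `draws` so that the imported random.sample is not shadowed
        draws = stats.norm.rvs(mu, sigma, size = n)
        sampling_dist[n].append(statistic(draws))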



<br>

The cell above has code for generating samples of normally (or exponentially) distributed random variables with sizes 5, 25, and 125, and for calculating the sampling distribution of a statistic.  

The current version is set for the **sample mean**. Change the code by **setting the statistic** as the sample's (a) variance, (b) median, and (c) maximum.

The code in the cell below enables you to plot the sampling distribution. 

In [None]:
fig = plt.figure( figsize = (8, 3*3) )

ax = []
for i, n in enumerate(my_sizes):
    ax.append( fig.add_subplot(3,1,i+1) )
    if n == my_sizes[1]:
        half_frame(ax[1], '', 'Probability density of sampling distribution', 
                   font_size = my_fontsize)
    elif n == my_sizes[2]:
        half_frame(ax[-1], f'sample {name_statistic}', '', 
                   font_size = my_fontsize)
    else:
        half_frame(ax[-1], '', '', font_size = my_fontsize)
    
    ax[-1].plot(x, stats.norm.pdf(x, mu, sigma), 'r', lw = 2)
    ax[-1].hist(sampling_dist[n], bins = x, density= True, alpha = 0.8)
    ax[-1].set_xticks(np.linspace(-x_max, x_max, 5))
    ax[-1].set_ylim(0, 6)
    ax[-1].set_xlim(-x_max, x_max)

    mean_sample_dist = np.mean(sampling_dist[n])
    st_dev_sample_dist = np.std(sampling_dist[n])
    
    # Print useful info
    ax[-1].text(x_max+0.05, 4, f'n = {n}', 
                fontsize = 1.5* my_fontsize, ha = 'right')
    ax[-1].vlines([mean_sample_dist-st_dev_sample_dist, 
                   mean_sample_dist+st_dev_sample_dist], 0, 6, 
                  color = 'orange', linewidth = 3)

plt.tight_layout()

### Exercise

Modify the code above so that it can be used to more systematically explore the dependence on the sample size of the sampling distribution.

## Bootstrap method

The data folder for this module has two files with GPA values for some fictional students. Load the data using pandas.

Print the first ten values of the data and find out which column contains the GPAs.

Print the last 10 GPA values.

Set as your statistic the sample's (a) mean, (b) variance, (c) median, and (d) maximum.

From your sample, pick at random with replacement sub-samples with $M = 5, 10, 20$ values. 

Plot your bootstrapped estimates of the sampling distribution for your chosen statistic.

**Hint:** The `random` methods `sample` or `choices` may be helpful.
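The resampling step could be sketched as follows. This is only one possible approach: the data here are synthetic, GPA-like placeholder values rather than the course files, and the helper `bootstrap_distribution` and its parameters are hypothetical names introduced for illustration. It uses NumPy's `Generator.choice` instead of the `random` module mentioned in the hint.

```python
import numpy as np

rng = np.random.default_rng(0)    # arbitrary seed

# Placeholder data: synthetic GPA-like values, NOT the files in the data folder
observed = np.clip(rng.normal(3.0, 0.5, size=200), 0.0, 4.0)

def bootstrap_distribution(data, statistic, m, n_boot=2000, rng=rng):
    """Return n_boot values of `statistic`, each computed on a resample
    of size m drawn with replacement from `data`."""
    data = np.asarray(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        resample = rng.choice(data, size=m, replace=True)
        estimates[b] = statistic(resample)
    return estimates

# Bootstrapped sampling distributions of the median for three sub-sample sizes
boot = {m: bootstrap_distribution(observed, np.median, m) for m in (5, 10, 20)}
for m, values in boot.items():
    print(f"M = {m:2d}: mean = {values.mean():.3f}, "
          f"std = {values.std(ddof=1):.3f}")
```

Passing `np.mean`, `np.var`, or `np.max` as `statistic` covers the other cases, and each array in `boot` can be handed to `plt.hist` for plotting.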

# Point estimation

Point estimation involves the use of sample data to calculate a single value (known as a point estimate or statistic) which is to serve as a **best guess or best estimate** of an unknown **population parameter** (for example, the population mean). More formally, it is **the application of a point estimator to the data to obtain a point estimate**.

Point estimation can be used as an alternative to, or in conjunction with, **interval estimation**. Interval estimates are typically either **confidence intervals** in the case of frequentist inference, or **credible intervals** in the case of Bayesian inference. 

Typically, one uses $\theta$ to denote an **arbitrary population parameter**, $\hat\Theta$ to denote a **point estimator** for that parameter, and $\hat\theta$ to denote a **point estimate** of that parameter.

<br><br><br><br>

<br><br><br><br>


Ideally, a point estimator would be unbiased and have low variance (that is, rapid convergence to its asymptotic value). These requirements can be quantified using a **loss function**, which depends on the difference between the estimated and true values for an instance of data.  Two examples of loss functions are squared errors and absolute errors. **The risk is the expected value of the loss function**.
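The risk of an estimator can be approximated by simulation. The sketch below, which assumes a normal population with known true mean, compares the sample mean and the sample median as estimators of $\mu$ under both loss functions; the sample size, seed, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)    # arbitrary seed
true_mu, sigma = 0.0, 1.0         # assumed population parameters
n, n_repeats = 25, 20000          # sample size and number of simulated samples

samples = rng.normal(true_mu, sigma, size=(n_repeats, n))
estimators = {
    "mean": samples.mean(axis=1),
    "median": np.median(samples, axis=1),
}

risk = {}
for name, estimates in estimators.items():
    error = estimates - true_mu
    # Risk = expected loss, approximated by averaging over simulated samples
    risk[name] = {
        "squared": np.mean(error**2),
        "absolute": np.mean(np.abs(error)),
    }
    print(f"{name:>6}: squared-error risk = {risk[name]['squared']:.4f}, "
          f"absolute-error risk = {risk[name]['absolute']:.4f}")
```

For this normal population the sample mean should show the lower risk under both losses; a heavier-tailed population can change that ranking.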

Because no estimator is optimal for all applications, a number of different 
estimators have been developed:

* method of moments and generalized method of moments
* maximum likelihood estimator (MLE)
* minimum-variance mean-unbiased estimator (MVUE) $-$ minimizes risk of the squared-error loss-function.
* best linear unbiased estimator (BLUE)
* minimum mean squared error (MMSE)
* median-unbiased estimator $-$ minimizes the risk of the absolute-error loss function



<br><br><br><br>

<br><br><br><br>



## Method of moments

Let's consider a simple example: a normal distribution. The normal is characterized by two parameters, $\mu$ and $\sigma^2$.  If our data was generated by an i.i.d. normal process, then we just need to obtain estimates of these two parameters.  

We know from the properties of the normal distribution that $E[X] = \mu$ and $E[(X-\mu)^2] = \sigma^2$.

In the method of moments, we calculate the sample moments, $\bar x$ and $s^2$, and equate them with the parameters:

> $~~~~~~~~~ \bar x = \hat\mu ~~~~~~~$ and $~~~~~~~~~ s^2 = \hat {\sigma^2}$


Things are even easier for the exponential distribution because it has a single parameter, written here as the scale $\tau$ (the inverse of the rate $\lambda$).  In this case, we need to calculate a single moment from the sample.  For the exponential distribution, we have $E[X] = \tau$. It then follows that:

> $~~~~~~~~~ \bar x = \hat\tau$
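The two moment-matching recipes above can be sketched in a few lines. The true parameter values, sample size, and seed below are assumptions chosen for illustration; `var()` is called with its default `ddof=0`, so it is the plain second central sample moment.

```python
import numpy as np

rng = np.random.default_rng(2)    # arbitrary seed

# Normal population with assumed parameters mu = 10 and sigma^2 = 9
normal_data = rng.normal(loc=10.0, scale=3.0, size=5000)
mu_hat = normal_data.mean()       # first sample moment
sigma2_hat = normal_data.var()    # second central sample moment (ddof=0)

# Exponential population with assumed scale tau = 2
expon_data = rng.exponential(scale=2.0, size=5000)
tau_hat = expon_data.mean()       # first sample moment

print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}, "
      f"tau_hat = {tau_hat:.3f}")
```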

The method of moments yields **consistent estimators**, that is, as the size of the sample increases, the resulting sequence of estimates converges in probability to the true value. This means that **the distributions of the estimates become more and more concentrated near the true value of the parameter being estimated, so that the probability of the estimator being arbitrarily close to the true value converges to one**. 

Sadly, method-of-moments estimators are also often biased.

**Think of what type of situation would result in a bias and code a demonstration of your hypothesis.**


In [None]:
n_samples = 100
mu = []
sigma = []



<br><br>

## Maximum likelihood estimator (MLE) 

MLE obtains parameter estimates by finding the parameter values that maximize the likelihood function. The estimates are called maximum likelihood estimates.  From the point of view of Bayesian inference, MLE is a special case of maximum *a posteriori* estimation (MAP) that assumes a uniform prior distribution of the parameters.

The likelihood function for a sample of a discrete random variable $X$ drawn from a probability distribution with parameters **$\theta$** is defined as 

> ${\cal L}(\theta; x_1, x_2, ..., x_N) = \prod_{i=1}^N Pr(X = x_i; \theta)$

The MLE of $\theta$ is the one that maximizes the likelihood.  Frequently, one maximizes the **log-likelihood instead since it yields the same values but makes calculations easier**.
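As a concrete illustration with a discrete variable, the sketch below evaluates the log-likelihood of a Bernoulli sample on a grid of candidate values and picks the maximizer. The Bernoulli model, the true value `p_true`, and the grid resolution are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(3)    # arbitrary seed
p_true = 0.3                      # assumed true success probability
x = rng.binomial(1, p_true, size=500)   # 500 Bernoulli observations (0 or 1)

# Candidate parameter values; the endpoints avoid log(0)
p_grid = np.linspace(0.001, 0.999, 999)

# log L(p) = sum_i log Pr(X = x_i; p) = k log(p) + (n - k) log(1 - p)
k, n = x.sum(), x.size
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

p_mle_grid = p_grid[np.argmax(log_lik)]   # maximizer found on the grid
p_mle_exact = x.mean()                    # closed-form MLE for this model

print(f"grid MLE = {p_mle_grid:.3f}, closed-form MLE = {p_mle_exact:.3f}")
```

Working with the sum of logarithms rather than the product of probabilities also avoids very small numbers when the sample is large.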

For some models, a maximum likelihood estimator can be found as an explicit function of the observed data. For others, however, no closed-form solution to the maximization problem is known or available, and an MLE can only be found via **numerical optimization**. 
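A numerical search can be sketched with `scipy.optimize.minimize_scalar`. The exponential model is used here only because its closed-form answer (the sample mean) makes the result easy to check; the search bounds, seed, and function name `neg_log_likelihood` are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)    # arbitrary seed
x = rng.exponential(scale=2.0, size=2000)   # assumed true scale tau = 2

def neg_log_likelihood(tau):
    """Negative log-likelihood of the sample for the exponential density
    f(x; tau) = exp(-x / tau) / tau, with x >= 0."""
    return x.size * np.log(tau) + x.sum() / tau

# Maximizing the likelihood is the same as minimizing its negative logarithm
result = minimize_scalar(neg_log_likelihood, bounds=(1e-3, 50.0),
                         method="bounded")
tau_numeric = result.x
tau_closed_form = x.mean()

print(f"numerical MLE = {tau_numeric:.4f}, "
      f"closed-form MLE = {tau_closed_form:.4f}")
```

For models with several parameters the same idea applies with `scipy.optimize.minimize` and a vector of parameters.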

Moreover, for some problems, there may be multiple values that maximize the likelihood. For other problems, no maximum likelihood estimate exists: either the log-likelihood function increases without ever reaching a supremum value, or the supremum does exist but is outside the set of acceptable parameter values. 

In [None]:
x =  stats.expon.rvs(0, 2, 10000)

# print(help(stats.expon.fit))

out = stats.norm.fit(x)

print(f"The fit method returns a tuple of length {len(out)}\n")
# norm.fit returns (loc, scale), so the estimate of mu is out[0]
print(f"The fit to a normal distribution yields a point estimate for "
      f"mu of {out[0]:.3f}\n")

loc, tau = stats.expon.fit(x)

my_text = f"The fit to an exponential distribution yields a point " + \
          f"estimate for tau of {tau:.3f}\n"
print(my_text)


Under very general and non-restrictive conditions, the MLE of a parameter $\theta$ for a large sample has the following properties:

1. $\hat\Theta_{MLE}$ is an approximately unbiased estimator.

2. The variance of $\hat\Theta_{MLE}$ is **nearly as small** as the variance that could be obtained with **any other estimator**.

3. $\hat\Theta_{MLE}$ has an approximate normal distribution.

The implications of these properties are actually quite helpful. Promising *lack of bias* or *smallest variance* is really hard, but the MLE promises the next best thing: that it is close to those goals.

The approximate normal distribution is also nice because it tells us we can calculate many things by learning just two parameter values.
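These large-sample properties can be probed by simulation. The sketch below assumes an exponential population, for which the MLE of the scale is the sample mean and the asymptotic standard deviation is $\tau / \sqrt{n}$; the sample size, seed, and variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)    # arbitrary seed
tau_true = 2.0                    # assumed true scale
n, n_repeats = 200, 10000         # sample size and number of simulated samples

# For the exponential model the MLE of the scale is the sample mean,
# so one MLE is computed per simulated sample (one per row)
tau_mle = rng.exponential(scale=tau_true, size=(n_repeats, n)).mean(axis=1)

bias = tau_mle.mean() - tau_true           # property 1: close to zero
observed_std = tau_mle.std(ddof=1)         # property 2: close to the bound
asymptotic_std = tau_true / np.sqrt(n)     # large-sample standard deviation
skewness = stats.skew(tau_mle)             # property 3: near zero if normal

print(f"bias = {bias:.4f}")
print(f"observed std = {observed_std:.4f}, asymptotic std = {asymptotic_std:.4f}")
print(f"skewness = {skewness:.3f}")
```

Repeating the run with a smaller `n` should show a more visibly skewed distribution, which is a reminder that the normal approximation is a large-sample statement.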

However, not all is perfect in MLE land.  The implementation of the method may involve using approximations 
in cases where determining the maximum of the likelihood function is difficult. 

# What does this all mean?

Given a random sample, and even if we know what statistical model/process generated that data, we are still facing the challenge of estimating the parameters of the statistical process.

As you have seen, this is not easy.  The estimates of the parameters are themselves random variables with their own distributions.  

What do we **truly know** about our estimates?

How can we decide whether two samples were generated by the same statistical model with the same parameter values?

How much confidence can we have about anything based on our data?  Imagine the random variable is a  pressure measurement and that we are concerned about a container that will only withstand a given maximum pressure. How confident are we that the container will not fail?



# Exercises

Generate a random sample with N observations (data points) drawn from an exponential distribution.

Obtain the maximum likelihood estimate of the parameter using the `.fit()` method.

Obtain the method of moments estimate of the parameter.


Find how the difference in the two estimates changes with N.

# Let's keep going...

[Next lesson](nb_13_Statistical_intervals.ipynb)