# In-class notebook: 2024-01-17

In this notebook, we will look at some common usages of classical statistical inference. We first look at the maximum likelihood estimator (MLE), then look at various applications of MLE. 

This notebook is intended to support Chapter 4.1-4.5 of the textbook, and material is taken from the following scripts (from astroML):
* https://github.com/astroML/astroML-notebooks/blob/main/chapter4/astroml_chapter4_Maximum_Likelihood_Estimation_and_Goodness_of_fit.ipynb
* https://github.com/astroML/astroML-notebooks/blob/main/chapter4/astroml_chapter4_Gaussian_mixtures.ipynb
* https://github.com/astroML/astroML-notebooks/blob/main/chapter4/astroml_chapter4_Confidence_estimates.ipynb

In [None]:
import numpy as np
from matplotlib import pyplot as plt

## Maximum Likelihood Estimator (MLE) and Goodness-of-fit

Let's walk through an example. 

We have 120 measurements for the apparent magnitude of our star drawn from a Gaussian distribution with $\mu = 8$ (mean apparent magnitude) and $\sigma = 2$ (the error in our measurements). 

In [None]:
from scipy.stats import norm

np.random.seed(seed=42)
Nsamples= 120
mu_true = 8
x = np.linspace(0, 16, 1000)

# draw 120 points from a Gaussian with mu = 8 and sigma = 2
measurements = np.random.normal(mu_true, 2, Nsamples)

In [None]:
plt.hist(measurements, bins='fd', density=True)
plt.plot(x, norm(8, 2).pdf(x))
plt.xlabel('mag')

In [None]:
# calculate the likelihood for 1000 possible values of mu from 0 to 16
products = []
for i in x:
    j = np.prod(norm(measurements, 2).pdf(i))
    # this is a neat way for coding this up!
    products.append(j)

plt.plot(x,products)
plt.xlim(0, 16)
plt.xlabel(r'$\mu$', fontsize = 14)
plt.ylabel(r'$L(\mu)$',fontsize = 14);


print(f"""Mean of our dataset: {np.mean(measurements):.3f}
The value of mu that maximizes L: {x[np.argmax(products)]:.3f}""")

As we expect, the value of $\mu$ that maximizes $L$ is the average of our measurements (to within the spacing of our grid of trial values)!
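
The product of many small PDF values can underflow for larger samples, so in practice one usually maximizes the log-likelihood instead. The following is a minimal sketch of that variant, not part of the original notebook; the names `data`, `mu_grid`, `log_like`, and `mu_hat` are illustrative, and the sample is drawn with a fresh `numpy` generator rather than the seed used above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
sigma = 2.0
data = rng.normal(8.0, sigma, 120)        # 120 simulated magnitudes
mu_grid = np.linspace(0, 16, 1000)        # trial values of mu

# ln L(mu) = sum_i ln N(x_i | mu, sigma), evaluated for every trial mu at once:
# data[:, None] has shape (120, 1) and mu_grid[None, :] has shape (1, 1000)
log_like = norm.logpdf(data[:, None], loc=mu_grid[None, :], scale=sigma).sum(axis=0)

mu_hat = mu_grid[np.argmax(log_like)]
print(f"sample mean = {data.mean():.3f}, grid MLE = {mu_hat:.3f}")
```

Because the logarithm is monotonic, the maximum of $\ln L$ sits at the same $\mu$ as the maximum of $L$, so the two approaches should agree up to the grid spacing.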

## The MLE in the case of truncated and censored data

Here's another example.

On the planet of Caladan, meteorites frequently hit the surface. Suppose some collectors are creating a museum of large meteorites, which they've defined as any meteorite larger than 45 meters in diameter. Before they get entered into the museum, they are held in a back room so a new exhibit can be set up. We also know that on this planet, meteorites are normally distributed, centered at 25 meters, with a standard deviation of 13. Suppose you were to enter the back room and find a meteorite of diameter $S_d$. What is your best estimate for the mean diameter of the meteorites that land on Caladan if you knew that the standard deviation is 13 but didn't know the mean was 25?

In [None]:
from scipy.optimize import minimize
from scipy.stats import truncnorm
from scipy.stats import norm

fig, ax = plt.subplots(figsize=(12, 7.5))

x = np.linspace(44, 75, 1000)
mean = 25
std = 13
x_min = 45
x_max = np.inf
a, b = (x_min - mean) / std, (x_max - mean) / std
r = truncnorm.rvs(a, b, loc = mean, scale = std, size=1000)
plt.plot(x, truncnorm.pdf(x,a,b,loc = mean, scale = std));
plt.hist(r, bins = 'fd', density = True);

# this is the mean of the truncated sample (meteorites larger than 45 m),
# not the true population mean of 25
print(f'Mean = {np.mean(r):.3}')


In [None]:
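# define the negative log-likelihood for a Gaussian truncated to [x_min, x_max]
# nominally this is just for one meteorite (x_1),
# but we allow a second one (x_2) for illustration

const = (-1/2)*(np.log(2*np.pi*(13**2)))

def neg_log_likelihood(mu, x_1, x_2=None, n=1):
    sum_term_x1 = ((x_1 - mu)**2)/(2*(13**2))
    sum_term_x2 = 0 if x_2 is None else ((x_2 - mu)**2)/(2*(13**2))

    # log of 1 / (probability mass inside [x_min, x_max]); enters once per meteorite
    norm_constant = -np.log(norm(mu,13).cdf(x_max) - norm(mu,13).cdf(x_min))
    # const does not depend on mu, so counting it once (rather than n times)
    # does not move the location of the minimum
    logL = const - (sum_term_x1 + sum_term_x2) + (n*norm_constant)
    return -logL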

In [None]:
# minimize?

In [None]:
# the value in args is x_1; the value right after neg_log_likelihood
# is the initial guess for mu

print(f"""Value of μ which maximizes lnL for:

S_d = 49: {minimize(neg_log_likelihood,30,args = (49,)).get("x")[0]}
S_d = 52: {minimize(neg_log_likelihood,30,args = (52,)).get("x")[0]}""")

In [None]:
# what if we see two meteorites?

print(f"""Values of μ which maximizes lnL for:

S_d1 = 49, S_d2 = 50 (both meteorites with diameters below the mean): {minimize(neg_log_likelihood,30,args = (49,50,2)).get("x")[0]}
S_d1 = 50, S_d2 = 51 (one above the mean, one below the mean): {minimize(neg_log_likelihood,30,args = (50,51,2)).get("x")[0]}
S_d1 = 52, S_d2 = 53 (both meteorites with diameters above the mean): {minimize(neg_log_likelihood,30,args = (52,53,2)).get("x")[0]}""")

If you run into a large meteorite, do not automatically assume that all meteorites have large diameters on average because it could be due to selection effects.
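
The same likelihood can be written for any number of observed meteorites. The sketch below is one possible generalization under the assumptions used above ($\sigma = 13$, lower cutoff at 45 m, no upper cutoff); `truncated_nll` and `fit_mu` are hypothetical helper names, and it swaps in `scipy.optimize.minimize_scalar` with the bounded method in place of `minimize`, with search bounds chosen arbitrarily wide.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def truncated_nll(mu, data, sigma=13.0, x_min=45.0):
    """Negative log-likelihood of mu for Gaussian data observed only above x_min."""
    data = np.asarray(data, dtype=float)
    # log-density of the untruncated Gaussian at each observed diameter
    log_pdf = norm.logpdf(data, loc=mu, scale=sigma)
    # log of the probability mass above the cutoff (the truncation normalization)
    log_mass = norm.logsf(x_min, loc=mu, scale=sigma)
    return -(log_pdf.sum() - data.size * log_mass)

def fit_mu(data, bounds=(-200.0, 100.0)):
    """Return the mu that minimizes the truncated negative log-likelihood."""
    result = minimize_scalar(truncated_nll, bounds=bounds,
                             args=(data,), method='bounded')
    return result.x

mu_one = fit_mu([49.0])          # a single meteorite
mu_two = fit_mu([52.0, 53.0])    # two meteorites
print(f"one meteorite: {mu_one:.2f}, two meteorites: {mu_two:.2f}")
```

Using `norm.logsf` avoids taking the logarithm of a difference of CDFs, which helps when the trial $\mu$ is far below the cutoff and the probability mass above it becomes very small.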


## The Goodness-of-Fit and model selection

We draw 10 samples each from a Gaussian and a Poisson distribution with the same $\mu$, and we assume $\sigma$ is known. From each of the two datasets we estimate $\mu$; together with $\sigma$, this fully specifies a Gaussian model. We then ask which dataset the Gaussian model fits better.

In [None]:
from scipy.stats import chi2

np.random.seed(seed=42)
N_samples = 10
mu = 25
sigma = 13

## Gaussian data on Gaussian model
x_gauss = np.random.normal(mu, sigma, N_samples)
mu_gauss = np.average(x_gauss)
chi_2 = np.sum(np.square((x_gauss - mu_gauss) / sigma))
print(chi_2)

## Poisson data on Gaussian model
x_poisson = np.random.poisson(mu, N_samples)
mu_poisson = np.average(x_poisson)
chi_2_poisson = np.sum(np.square((x_poisson - mu_poisson) / sigma))
print(chi_2_poisson)

In [None]:
# Plot the results
fig, ax = plt.subplots(figsize=(12, 7.5))
z = np.linspace(0,32,1000)

plt.plot(z, chi2.pdf(z, N_samples - 1))
plt.axvline(chi_2, color='green',label = f"Correct model = {chi_2:.3}");
plt.axvline(chi_2_poisson, color='red',label = f"Incorrect model ={chi_2_poisson:.3}");

plt.ylim(0,0.12)
plt.legend(loc = "upper right", fontsize = 12);

The $\chi^2$ value computed from the Poisson data lies in a less probable region of the expected $\chi^2$ distribution (with $N - 1$ degrees of freedom), whereas the value computed from the Gaussian data lies in a more probable region.
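
One way to put a number on "less likely" is to compute the tail probabilities of the $\chi^2$ distribution at the observed value. This is a sketch under the same assumptions as above (known $\sigma$, mean estimated from the data, so $N - 1$ degrees of freedom); `chi2_tail_probs` is a hypothetical helper, and the demo data are drawn with a fresh generator, so the numbers will differ from the cell above.

```python
import numpy as np
from scipy.stats import chi2

def chi2_tail_probs(x, sigma):
    """Return chi^2, P(chi^2 <= observed) and P(chi^2 >= observed) for dof = N - 1."""
    x = np.asarray(x, dtype=float)
    dof = x.size - 1
    stat = np.sum(((x - x.mean()) / sigma) ** 2)
    return stat, chi2.cdf(stat, dof), chi2.sf(stat, dof)

rng = np.random.default_rng(0)
gauss_data = rng.normal(25, 13, 10)
poisson_data = rng.poisson(25, 10)

for name, sample in [("Gaussian", gauss_data), ("Poisson", poisson_data)]:
    stat, p_below, p_above = chi2_tail_probs(sample, sigma=13)
    print(f"{name}: chi2 = {stat:.2f}, P(<=) = {p_below:.3f}, P(>=) = {p_above:.3f}")
```

A very small value of either tail probability suggests that the assumed model or the assumed errors do not describe the data well.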

## The Goodness-of-Fit and model selection: a more practical example 

Consider the simple case of the luminosity of a single star being measured multiple times. Our model is that of a star with **no intrinsic luminosity variation**.

We will examine four different scenarios:
- Correct model with correct errors
- Correct model with overestimated errors
- Correct model with underestimated errors
- Incorrect model with correct errors

First, we will define $N$ (the number of measurements), $\ell^0$ (the constant luminosity of our star), $\sigma_\ell$ (the measurement error), our models, and our errors.

In [None]:
np.random.seed(42)
N = 50
luminosity = 10
sigma_L = 0.2

t = np.linspace(0, 1, N)
L_obs = np.random.normal(luminosity, sigma_L, N)

y_vals = [L_obs, L_obs, L_obs, L_obs + 0.5 - t ** 2]
y_errs = [sigma_L, sigma_L * 2, sigma_L / 2, sigma_L]
titles = ['correct errors',
          'overestimated errors',
          'underestimated errors',
          'incorrect model']

In [None]:
# Plot the results
fig = plt.figure(figsize=(14, 14))
fig.subplots_adjust(left=0.1, right=0.95, wspace=0.05,
                    bottom=0.1, top=0.95, hspace=0.05)

for i in range(4):
    ax = fig.add_subplot(2, 2, 1 + i, xticks=[])

    # compute the mean and the chi^2/dof
    mu = np.mean(y_vals[i])
    z = (y_vals[i] - mu) / y_errs[i]
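    chi2_val = np.sum(z ** 2)  # named chi2_val to avoid shadowing scipy.stats.chi2
    chi2dof = chi2_val / (N - 1)

    # number of standard deviations by which chi^2/dof departs from 1
    # (for large N, chi^2/dof has mean 1 and standard deviation sqrt(2 / (N - 1)))
    sigma = np.sqrt(2. / (N - 1))
    nsig = (chi2dof - 1) / sigma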

    # plot the points with errorbars
    ax.errorbar(t, y_vals[i], y_errs[i],fmt='.k', ecolor='gray', lw=1,ms = 12)
    ax.plot([-0.1, 1.3], [luminosity, luminosity], ':k', lw=3)

    # Add labels and text
    ax.text(0.95, 0.95, titles[i], ha='right', va='top',
            transform=ax.transAxes,fontsize = 15,
            bbox=dict(boxstyle='round', fc='w', ec='k'))
    ax.text(0.02, 0.02, r'$\hat{\mu} = %.2f$' % mu, ha='left', va='bottom',
            transform=ax.transAxes, fontsize = 15)
    ax.text(0.98, 0.02, r'$\chi^2_{\rm dof} = %.2f\, (%.2g\,\sigma)$'
            % (chi2dof, nsig), ha='right', va='bottom', 
            transform=ax.transAxes,fontsize = 15)

    # set axis limits
    ax.set_xlim(-0.05, 1.05)
    ax.set_ylim(8.6, 11.4)

    # set ticks and labels
    ax.yaxis.set_major_locator(plt.MultipleLocator(1))
    if i > 1: ax.set_xlabel('observations', fontsize = 18)
    if i % 2 == 0: ax.set_ylabel('Luminosity',fontsize = 18)
    else: ax.yaxis.set_major_formatter(plt.NullFormatter())

$\chi^2_{\text{dof}} \approx 1$ indicates that the model fits the data well (upper-left panel). $\chi^2_{\text{dof}}$ much smaller than 1 (upper-right panel) is an indication that the errors are overestimated. $\chi^2_{\text{dof}}$ much larger than 1 is an indication either that the errors are underestimated (lower-left panel) or that the model is not a good description of the data (lower-right panel). In this last case, it is clear from the data that the star’s luminosity is varying with time.
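
The dependence of $\chi^2_{\text{dof}}$ on the assumed errors can be checked in isolation. The sketch below uses a hypothetical helper, `chi2_per_dof`, and freshly simulated data; it follows the same constant-luminosity model as the plotting cell but is not taken from it.

```python
import numpy as np

def chi2_per_dof(y, y_err):
    """Return chi^2/dof for a constant model and its offset from 1 in sigma units."""
    y = np.asarray(y, dtype=float)
    n = y.size
    z = (y - y.mean()) / y_err              # residuals about the fitted constant
    value = np.sum(z ** 2) / (n - 1)
    n_sigma = (value - 1) / np.sqrt(2.0 / (n - 1))
    return value, n_sigma

rng = np.random.default_rng(42)
true_err = 0.2
obs = rng.normal(10, true_err, 50)

for label, err in [("correct", true_err),
                   ("overestimated", 2 * true_err),
                   ("underestimated", true_err / 2)]:
    value, n_sigma = chi2_per_dof(obs, err)
    print(f"{label:>14} errors: chi2/dof = {value:.2f} ({n_sigma:+.1f} sigma)")
```

Since the residuals are divided by the assumed error, doubling the error divides $\chi^2_{\text{dof}}$ by four and halving it multiplies $\chi^2_{\text{dof}}$ by four.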

## MLE with Gaussian Mixture Model


Imagine we have a Gaussian mixture of distributions $\mathcal{N}(-1,1.5)$, $\mathcal{N}(0,1)$, and $\mathcal{N}(3,0.5)$. In this case, $M = 3$ (the number of separate Gaussian distributions), and $\boldsymbol{\theta}$ includes the normalization factors for each distribution, $\alpha_1$, $\alpha_2$, and $\alpha_3$, as well as the descriptive parameters $\mu_1$, $\sigma_1$, $\mu_2$, $\sigma_2$, $\mu_3$, and $\sigma_3$.


First we will define our distributions and combine them using `numpy.concatenate`. Then we will create models using `sklearn.mixture.GaussianMixture` that range from one component to ten components and calculate the AIC and BIC to find the optimal number of components for our data.

In [None]:
from sklearn.mixture import GaussianMixture
random_state = np.random.RandomState(seed=1)

X = np.concatenate([random_state.normal(-1, 1.5, 350),
                    random_state.normal(0, 1, 500),
                    random_state.normal(3, 0.5, 150)]).reshape(-1, 1)

N = np.arange(1, 11)
# fit one Gaussian mixture model for each number of components in N
models = [GaussianMixture(n).fit(X) for n in N]

AIC = [m.aic(X) for m in models]
BIC = [m.bic(X) for m in models]

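Next, we'll plot our results. Using `np.argmin` on the AIC (or BIC) array, we can find the model with the optimal number of components $M$; the cell below uses the AIC. We then call `.score_samples` on this model to get the log of the mixture PDF (the sum of Gaussians) at each grid point, and `.predict_proba` on the same grid points to get the responsibilities, that is, the probability that each point belongs to the $j$th component. Multiplying the mixture PDF by the responsibilities gives the density of the $j$th component.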

Afterward, we can plot `pdf`, our Gaussian mixture, and `pdf_individual` for the three separate Gaussians along with the histogram of our data.

In [None]:
fig = plt.figure(figsize=(12, 5))
x = np.linspace(-8, 8, 1000)

M_best = models[np.argmin(AIC)] 
logprob = M_best.score_samples(x.reshape(-1, 1))
responsibilities = M_best.predict_proba(x.reshape(-1, 1))
pdf = np.exp(logprob)
pdf_individual = responsibilities * pdf[:, np.newaxis]

plt.hist(X, 100, density=True, histtype='stepfilled', 
        alpha=0.4,color = 'steelblue',edgecolor = 'black')

#Plot the Gaussian mixture
plt.plot(x, pdf, label = 'Best-fit Mixture')

#Plot the individual Gaussians, labelled with their fitted parameters
#(the fitted components come in arbitrary order, and their number is set by the AIC)
for k in range(M_best.n_components):
    mu_k = M_best.means_[k, 0]
    sigma_k = np.sqrt(M_best.covariances_[k, 0, 0])
    plt.plot(x, pdf_individual[:, k],
             label = rf'$\mathcal{{N}}(x|{mu_k:.2f},{sigma_k:.2f})$')
    

plt.xlabel('$x$', fontsize = 14)
plt.ylabel('$p(x)$', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.legend(fontsize = 12);
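
The relation between `score_samples`, `predict_proba`, and the per-component densities can be checked numerically. The following is a minimal sketch on a three-component fit of freshly generated data; the names `X_demo`, `gmm`, `grid`, `total_pdf`, `resp`, and `component_pdf` are illustrative, and `random_state=0` is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(1)
X_demo = np.concatenate([rng.normal(-1, 1.5, 350),
                         rng.normal(0, 1, 500),
                         rng.normal(3, 0.5, 150)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_demo)
grid = np.linspace(-8, 8, 1000).reshape(-1, 1)

total_pdf = np.exp(gmm.score_samples(grid))      # mixture density p(x)
resp = gmm.predict_proba(grid)                   # p(component j | x), rows sum to 1
component_pdf = resp * total_pdf[:, np.newaxis]  # density contributed by each component

print(np.allclose(resp.sum(axis=1), 1.0))
print(np.allclose(component_pdf.sum(axis=1), total_pdf))
```

Because the responsibilities in each row sum to one, the per-component densities add back up to the mixture density.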

In [None]:
plt.plot(N, AIC, '-k', label='AIC')
plt.plot(N, BIC, '--k', label='BIC')

plt.xticks(N)
plt.xlabel('n. components')
plt.ylabel('information criterion')
plt.legend(loc=2)
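
To see what the two criteria measure, they can be recomputed by hand from the total log-likelihood. This sketch assumes the standard definitions $\mathrm{AIC} = -2\ln L + 2k$ and $\mathrm{BIC} = -2\ln L + k\ln N$, with $k = 3M - 1$ free parameters for a one-dimensional mixture of $M$ Gaussians; `manual_aic_bic` and `X_demo` are hypothetical names, and the comparison with the `aic` and `bic` methods of the fitted model is only a consistency check.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(1)
X_demo = np.concatenate([rng.normal(-1, 1.5, 350),
                         rng.normal(0, 1, 500),
                         rng.normal(3, 0.5, 150)]).reshape(-1, 1)

def manual_aic_bic(gmm, X):
    """Recompute AIC and BIC for a fitted 1D Gaussian mixture."""
    log_like = gmm.score_samples(X).sum()   # total ln L over all points
    m = gmm.n_components
    k = 3 * m - 1                            # m means + m variances + (m - 1) free weights
    aic = -2 * log_like + 2 * k
    bic = -2 * log_like + k * np.log(X.shape[0])
    return aic, bic

gmm = GaussianMixture(n_components=3, random_state=0).fit(X_demo)
aic, bic = manual_aic_bic(gmm, X_demo)
print(f"manual AIC = {aic:.1f}, model.aic = {gmm.aic(X_demo):.1f}")
print(f"manual BIC = {bic:.1f}, model.bic = {gmm.bic(X_demo):.1f}")
```

The BIC penalty grows with $\ln N$, so for a sample of this size it penalizes extra components more strongly than the AIC does.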

## Confidence estimation: Bootstrap

In [None]:
from scipy.stats import norm
from astroML.resample import bootstrap
from astroML.stats import sigmaG

m = 1000  # number of points
n = 10000  # number of bootstraps

# sample values from a normal distribution
np.random.seed(123)
data = norm(0, 1).rvs(m)

# Compute bootstrap resamplings of data
mu1_bootstrap = bootstrap(data, n,  np.std, kwargs=dict(axis=1, ddof=1))
mu2_bootstrap = bootstrap(data, n, sigmaG, kwargs=dict(axis=1))

In [None]:
# Compute the theoretical expectations for the two distributions
x = np.linspace(0.8, 1.2, 1000)

# theoretical uncertainty of the standard deviation estimate:
# sigma / sqrt(2 (m - 1)), with sigma = 1
sigma1 = 1. / np.sqrt(2 * (m - 1))
pdf1 = norm(1, sigma1).pdf(x)

# theoretical uncertainty of the sigma_G estimate:
# 1.06 sigma / sqrt(m), with sigma = 1
sigma2 = 1.06 / np.sqrt(m) 
pdf2 = norm(1, sigma2).pdf(x)

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

ax.hist(mu1_bootstrap, bins=50, density=True, histtype='step',
        color='blue', ls='dashed', label=r'$\sigma\ {\rm (std. dev.)}$')
ax.plot(x, pdf1, color='gray')

ax.hist(mu2_bootstrap, bins=50, density=True, histtype='step',
        color='red', label=r'$\sigma_G\ {\rm (quartile)}$')
ax.plot(x, pdf2, color='gray')

ax.set_xlim(0.82, 1.18)

ax.set_xlabel(r'$\sigma$', fontsize = 16)
ax.set_ylabel(r'$p(\sigma|x,I)$', fontsize = 16)

ax.legend()
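
For readers without astroML installed, the resampling step itself is short. The following plain-NumPy sketch is an assumed equivalent of the idea, not the astroML implementation; `bootstrap_statistic` is a hypothetical helper, and the number of resamples is reduced to keep memory use modest.

```python
import numpy as np

def bootstrap_statistic(data, n_boot, statistic, rng):
    """Evaluate `statistic` on n_boot resamples of `data` drawn with replacement."""
    data = np.asarray(data)
    # each row of idx selects one resample of the same size as the data
    idx = rng.integers(0, data.size, size=(n_boot, data.size))
    return statistic(data[idx], axis=1)

def sample_std(a, axis):
    return np.std(a, axis=axis, ddof=1)

rng = np.random.default_rng(123)
sample = rng.normal(0, 1, 1000)
std_boot = bootstrap_statistic(sample, 2000, sample_std, rng)

spread = std_boot.std()
expected = 1 / np.sqrt(2 * (sample.size - 1))
print(f"bootstrap spread = {spread:.4f}, theoretical expectation = {expected:.4f}")
```

The spread of the bootstrap values plays the role of the uncertainty on the estimated $\sigma$, which is what the histograms above compare with the theoretical curves.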

## Confidence estimation: Jackknife

In [None]:
from astroML.resample import jackknife
from astroML.stats import sigmaG

np.random.seed(123)
m = 1000
data = norm(0, 1).rvs(m)

# Compute jackknife resampling

# Standard deviation based
mu_s, sigma_mu_s, mu_s_raw = jackknife(data, np.std,
                                    kwargs=dict(axis=1, ddof=1),
                                    return_raw_distribution=True)

pdf1_theory = norm(1, 1. / np.sqrt(2 * (m - 1)))
pdf1_jackknife = norm(mu_s, sigma_mu_s)

# Sigma_G based
mu_sigG, sigma_mu_sigG, mu_sigG_raw = jackknife(data, sigmaG,
                                    kwargs=dict(axis=1),
                                    return_raw_distribution=True)
pdf2_theory = norm(data.std(), 1.06 / np.sqrt(m))
pdf2_jackknife = norm(mu_sigG, sigma_mu_sigG)


print(f"mu_s = {mu_s:.3}, sigma_mu_s = {sigma_mu_s:.3}")
print(f"mu_sigmaG = {mu_sigG:.3}, sigma_mu_sigmaG = {sigma_mu_sigG:.3}")

In [None]:
fig = plt.figure(figsize=(10, 6))
fig.subplots_adjust(left=0.11, right=0.95, bottom=0.2, top=0.9,
                    wspace=0.25)

ax = fig.add_subplot(121)
ax.hist(mu_s_raw, np.linspace(0.996, 1.008, 100),
        label=r'$\sigma^*\ {\rm (std.\ dev.)}$',
        histtype='stepfilled', fc='white', ec='black', density=False)
ax.hist(mu_sigG_raw, np.linspace(0.996, 1.008, 100),
        label=r'$\sigma_G^*\ {\rm (quartile)}$',
        histtype='stepfilled', fc='gray', density=False)
ax.legend(loc='upper left', handlelength=2, fontsize = 14)

ax.xaxis.set_major_locator(plt.MultipleLocator(0.004))
ax.set_xlabel(r'$\sigma^*$', fontsize = 14)
ax.set_ylabel(r'$N(\sigma^*)$', fontsize = 14)
ax.set_xlim(0.998, 1.008)
ax.set_ylim(0, 550)

ax = fig.add_subplot(122)
x = np.linspace(0.45, 1.15, 1000)
ax.plot(x, pdf1_jackknife.pdf(x),
        color='blue', ls='dashed', label=r'$\sigma\ {\rm (std.\ dev.)}$',
        zorder=2)
ax.plot(x, pdf1_theory.pdf(x), color='gray', zorder=1)

ax.plot(x, pdf2_jackknife.pdf(x),
        color='red', label=r'$\sigma_G\ {\rm (quartile)}$', zorder=2)
ax.plot(x, pdf2_theory.pdf(x), color='gray', zorder=1)
plt.legend(loc='upper left', handlelength=2, fontsize = 14)

ax.set_xlabel(r'$\sigma$', fontsize = 14)
ax.set_ylabel(r'$p(\sigma|x,I)$', fontsize = 14)
ax.set_xlim(0.45, 1.15)
ax.set_ylim(0, 24)

In the right panel, the jackknife result for the standard deviation agrees with the theoretical expectation, but the jackknife result for $\sigma_G$ does not. This failure is a general problem with the standard jackknife method, which performs well for smooth differential statistics such as the mean and standard deviation, but does not perform well for medians, quantiles, and other rank-based statistics. For these sorts of statistics, a jackknife implementation that removes more than one observation can overcome this problem. The reason for this failure becomes apparent upon examination of the left panel of the figure above: for $\sigma_G$, the vast majority of jackknife samples yield one of three discrete values! Because quartiles are insensitive to the removal of outliers, all samples created by the removal of a point larger than $q_{75}$ lead to precisely the same estimate. The same is true for removal of any point smaller than $q_{25}$, and for any point in the range $q_{25} < x < q_{75}$. Because of this, the jackknife cannot accurately sample the error distribution, which leads to a gross misestimate of the result.
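
The discreteness described above is easy to reproduce with a plain-NumPy leave-one-out jackknife. This is a sketch rather than the astroML implementation: `jackknife_std_error` is a hypothetical helper, and the median is used as a stand-in rank-based statistic in place of $\sigma_G$.

```python
import numpy as np

def jackknife_std_error(data, statistic):
    """Leave-one-out jackknife: return the n estimates and the standard error."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # row i holds the data with point i removed
    loo = np.tile(data, (n, 1))[~np.eye(n, dtype=bool)].reshape(n, n - 1)
    theta = statistic(loo, axis=1)
    std_err = np.sqrt((n - 1) / n * np.sum((theta - theta.mean()) ** 2))
    return theta, std_err

rng = np.random.default_rng(123)
sample = rng.normal(0, 1, 1000)

theta_mean, se_mean = jackknife_std_error(sample, np.mean)
theta_med, se_med = jackknife_std_error(sample, np.median)
n_distinct_medians = np.unique(theta_med).size

print(f"mean: jackknife SE = {se_mean:.4f}, s/sqrt(n) = {sample.std(ddof=1) / np.sqrt(sample.size):.4f}")
print(f"median: {n_distinct_medians} distinct leave-one-out values out of {sample.size}")
```

For the mean, the jackknife standard error reduces to $s/\sqrt{n}$, while the leave-one-out medians collapse onto a handful of distinct values, which is the same behavior that spoils the estimate for $\sigma_G$.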

In [None]:
# convenient functions in astropy

from astropy.stats import jackknife_stats

x = np.random.normal(loc=0, scale=1, size=1000)
estimate, bias, stderr, conf_interval = jackknife_stats(x, np.std)
print(estimate , stderr)