# Fitting a Line with MCMC

We are going to start off the semester with a discussion of Markov Chain Monte Carlo, and we will apply it to a simple model: fitting a straight line to a noisy spectrum.

### Import libraries

In [None]:
# Standard numerical and plotting routines
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy import optimize as opt
import time

# We will need this later on
import sys
sys.path.append("../code")

%matplotlib inline

### The spectral model

In [None]:
# Our spectrum model is a straight line with slope m and intercept b
def get_val(x, p):
    m, b = p
    
    return m*x + b

### Input values

In [None]:
# The input values will be called "truths"
m_truth = 0.5
b_truth = 30.5
truths = m_truth, b_truth

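# Draw 100 wavelengths at random between 1 and 60, each with its own error bar between 5 and 15
# Columns of data: wavelength, observed amplitude, uncertainty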
data = np.zeros((100, 3))
data[:,0] = np.random.uniform(1, 60, size=100)
data[:,2] = np.random.uniform(5, 15, size=100)
data[:,1] = get_val(data[:,0], truths) + np.random.normal(loc=0,scale=data[:,2],size=100)


# Plot input values on a regular wavelength grid
x_vals = np.linspace(0.0, 60.0, 100)
plt.plot(x_vals, get_val(x_vals, truths), color='r', label='Input Model')

# Plot data points
plt.errorbar(data[:,0], data[:,1], yerr=data[:,2], label='Observations', fmt='o')


plt.xlabel('Wavelength')
plt.ylabel('Amplitude')

# Show the labels assigned to the model and the observations
plt.legend()
plt.show()

### Now let's recover the input parameters

First, we have to build our prior, likelihood, and posterior probabilities. So, we need some math, starting with Bayes' Theorem:

$$ P(M|x_{\rm obs}) = \frac{P(x_{\rm obs}| M) P(M)}{P(x_{\rm obs})}, $$

where $M$ is the model, which contains the two values we are trying to recover ($m$ and $b$), and $x_{\rm obs}$ are the observed data points. The wavelengths and the errors of those data points are not explicitly included above, but it is implied that we know what those are. We ignore the denominator on the right (the "evidence") for now. We may come back to this in future sessions. For now, we focus on the likelihood function and the prior function in the numerator of the right-hand side of Bayes' Theorem above:

$$ {\rm Likelihood} \equiv P(x_{\rm obs}| M) $$

$$ {\rm Prior} \equiv P(M) $$

### Let's start with the prior

Knowing nothing else about the problem, one typically considers "flat" priors, so $P(x) \sim 1$. We would like these to be normalized (although this is often not necessary for sampling methods), so really this prior should look like: $P(x) = (x_{\rm max} - x_{\rm min})^{-1}$. We want to construct a prior function for each of the two parameters in our model, $m$ and $b$.

### Construct the prior distribution for our model parameters

Consider each parameter. Should it be allowed to vary from $-\infty$ to $\infty$? Maybe it should be greater than zero. Do you want flat priors or something else? Since we haven't gone into more details yet, this is largely a matter of preference right now. 

For reasons not yet obvious, we want to produce a function that calculates the (natural) log of the prior.


In [None]:
def ln_prior(p):
    
    m, b = p
    
    lp = 0.0

    
    # No priors are actually necessary here, but we can set something just to be rigorous
    # Prior on m: reject only absurdly large magnitudes
    if m < -1.0e99 or m > 1.0e99: return -np.inf
    
    # Prior on b: reject only absurdly large magnitudes
    if b < -1.0e99 or b > 1.0e99: return -np.inf
    
    return lp
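
The normalized flat prior described above, $P(x) = (x_{\rm max} - x_{\rm min})^{-1}$, could be sketched as a small helper. This is only an illustration: the name `ln_flat_prior` and the bounds used in the example are assumptions, not part of the notebook.

```python
import numpy as np

def ln_flat_prior(x, x_min, x_max):
    """Log of a normalized flat prior on [x_min, x_max] (hypothetical helper)."""
    if x < x_min or x > x_max:
        return -np.inf              # zero probability outside the bounds
    return -np.log(x_max - x_min)   # ln of 1 / (x_max - x_min)

# Illustrative bounds only: slope in [-10, 10], intercept in [0, 100]
lp_example = ln_flat_prior(0.5, -10.0, 10.0) + ln_flat_prior(30.5, 0.0, 100.0)
print(lp_example)
```

Because the parameters are treated as independent, their log-priors simply add. For sampling, the constant offset makes no difference, which is why the notebook's `ln_prior` can return 0 inside the allowed range.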

### Now, let's construct our likelihood function

We have to simultaneously fit all the data points. From previous lectures, we have learned that, for independent data points, the total likelihood is the product of the likelihoods of the individual data points:

$$
P(\{x_{\rm obs}\}| M) = \prod_i P(x_{\rm obs,i}| M)
$$

What is the probability of fitting one data point? The data point has Gaussian error bars, so it is simply a Gaussian centered on the observed value, evaluated at the amplitude predicted by the model.

$$
P(x_{\rm obs,i}| M) = \mathcal{N}(x_{\rm model,i}| x_{\rm obs,i}, \sigma_{\rm obs,i})
$$

where $x_{\rm model,i}=f(m, b)$ is the amplitude that the straight-line model predicts at the wavelength of the $i$th data point.

Let's math this out a bit:

\begin{aligned}
\ln P(\{x_{\rm obs}\}| M) &= \ln \prod_i P(x_{\rm obs,i}| M) \\
&= \sum_i \ln P(x_{\rm obs,i}| M) \\
&= \sum_i \ln \frac{1}{\sqrt{2 \pi \sigma_{\rm obs,i}^2}} \exp\left[ -\frac{(x_{\rm obs,i} - x_{\rm model,i})^2}{2 \sigma_{\rm obs,i}^2} \right] \\
&= \sum_i \ln \frac{1}{\sqrt{2 \pi \sigma_{\rm obs,i}^2}} + \sum_i  -\frac{(x_{\rm obs,i} - x_{\rm model,i})^2}{2 \sigma_{\rm obs,i}^2} \\
&= \sum_i \frac{-1}{2} \ln \left(2 \pi \sigma_{\rm obs,i}^2\right) + \sum_i  -\frac{(x_{\rm obs,i} - x_{\rm model,i})^2}{2 \sigma_{\rm obs,i}^2} 
\end{aligned}

Now, we code this up below:

### Exercise 1: Complete the likelihood function below, using the equations above

In [None]:
def ln_likelihood(p, data):
    
    m, b = p
    
    x_vals = data[:,0]
    y_vals = data[:,1]
    y_errs = data[:,2]
    
    ll = 
    
    return ll
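
One possible way to complete the exercise follows the last line of the derivation term by term. The sketch below is hedged: `ln_likelihood_sketch` and the synthetic `demo` array are illustrative names, and the result is cross-checked against `scipy.stats.norm.logpdf` rather than taken on faith.

```python
import numpy as np
from scipy import stats

def ln_likelihood_sketch(p, data):
    """Gaussian log-likelihood of a straight line (one possible solution)."""
    m, b = p
    x_vals, y_vals, y_errs = data[:, 0], data[:, 1], data[:, 2]
    y_model = m * x_vals + b
    # Normalization term, then the chi-squared-like term, as in the derivation
    ll = np.sum(-0.5 * np.log(2.0 * np.pi * y_errs**2))
    ll += np.sum(-(y_vals - y_model)**2 / (2.0 * y_errs**2))
    return ll

# Small synthetic data set (columns: x, y, sigma), used only for this check
rng = np.random.default_rng(0)
demo = np.zeros((20, 3))
demo[:, 0] = rng.uniform(1, 60, size=20)
demo[:, 2] = rng.uniform(5, 15, size=20)
demo[:, 1] = 0.5 * demo[:, 0] + 30.5 + rng.normal(0.0, demo[:, 2])

ll_manual = ln_likelihood_sketch((0.5, 30.5), demo)
ll_scipy = np.sum(stats.norm.logpdf(demo[:, 1],
                                    loc=0.5 * demo[:, 0] + 30.5,
                                    scale=demo[:, 2]))
print(ll_manual, ll_scipy)
```

The two numbers should agree to floating-point precision, since `logpdf` evaluates the same Gaussian density in log form.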

### Exercise 2: Complete the posterior function below

We can now combine these two to form a posterior function. Because we work with logs, the product of prior and likelihood becomes a sum.

In [None]:
def ln_posterior(p, data):
    
    lp = ln_prior(p)
    if np.isinf(lp): return -np.inf
    
    ll = ln_likelihood(p, data)

    return lp+ll

### Try running a minimizer to find the best fit parameter values

In [None]:
def neg_ln_posterior(p, data):
    return -ln_posterior(p, data)

p0 = 1.0, 30.0
solution = opt.minimize(neg_ln_posterior, p0 , args=(data,))


# Print solution
print("Best values:", solution.x)
print("Input values:", truths)

# Plot best fit line
x_tmp = np.linspace(0.0, 60.0, 100)
y_tmp = get_val(x_tmp, solution.x)
plt.plot(x_tmp, y_tmp, color='r')

# Plot data
plt.errorbar(data[:,0], data[:,1], yerr=data[:,2], label='Observations', fmt='o')


plt.show()

If you play around with the starting point, you may find that the resulting solution varies. This sensitivity to the starting point is a serious problem for many minimization problems. But even if you have a good fit, what are the uncertainties on the derived parameters?

### How do we determine uncertainties?

Determining uncertainties on your model is a tricky thing. When the data have Gaussian error bars with known uncertainties, our understanding of $\chi^2$ distributions can help us out. Unfortunately, only a small set of problems are amenable to a $\chi^2$ analysis.

### Markov Chain Monte Carlo!

This is one of a few ways of calculating uncertainties, but this one is a flexible, robust, and relatively simple method. 

We will discuss the Metropolis-Hastings algorithm, but note that there are *many* others out there. Here are the steps in the sequence:

1. We start with one set of parameters $\theta_1$. In our case, $\theta_1$ contains the two parameters of our model, $m$ and $b$, but to keep the discussion generalized, we will imagine that $\theta$ can contain 1, 5, or even a million separate parameters. This first value starts our Markov chain. 

2. Using some method we obtain a new trial set of parameters $\theta_2$. It is important that this set is chosen randomly, but based on the first set. This is one place where the "Monte Carlo" in MCMC comes from. The simplest method to obtain our new parameter values will be to add some random (Gaussian?) noise to our current value: $\theta_2 = \theta_1 + \epsilon$. You typically want to tune the scale of $\epsilon$ to optimize the process: steps that are too large are rarely accepted, while steps that are too small explore the posterior slowly.

3. Now, we want to calculate and compare the posterior probabilities for both $\theta_1$ and $\theta_2$. Define the ratio $r = \frac{P(\theta_2|x_{\rm obs})}{P(\theta_1|x_{\rm obs})}$. If $r > 1$ (i.e., the new parameter set is more probable than the current one) we always move the chain to $\theta_2$. If $r < 1$ we *might* move the chain to $\theta_2$; we move to $\theta_2$ with probability $r$. In practice, we draw a random number from a uniform distribution between 0 and 1. If that random number is less than $r$, we move the chain to $\theta_2$. If instead that random number is greater than $r$, we keep $\theta_1$ for another iteration. Because $r$ is a ratio, the evidence $P(x_{\rm obs})$ cancels, which is why we could ignore it earlier.

4. Now that we have our new value for $\theta$, we return to step 2 and repeat for as many iterations as we want. Often this is in the thousands or more.
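
The four steps above can be sketched on a toy problem before tackling the exercise. The example below is a minimal illustration, not the notebook's solution: the target `ln_target` (a one-dimensional standard normal), the name `toy_chain`, and the proposal width are all assumptions. It uses a symmetric Gaussian proposal, for which the acceptance probability reduces to the posterior ratio described in step 3, and it compares log-probabilities so that the ratio becomes a difference.

```python
import numpy as np

def ln_target(theta):
    # Toy log-posterior: a standard normal in one dimension (up to a constant)
    return -0.5 * np.sum(theta**2)

rng = np.random.default_rng(42)
n_steps = 20000
step_scale = 1.0                      # width of the Gaussian proposal (the tunable epsilon)
toy_chain = np.zeros((n_steps, 1))
toy_chain[0] = 3.0                    # step 1: a deliberately poor starting point

for i in range(n_steps - 1):
    theta_1 = toy_chain[i]
    # Step 2: propose a trial point near the current one
    theta_2 = theta_1 + rng.normal(0.0, step_scale, size=theta_1.shape)
    # Step 3: accept with probability min(1, ratio), evaluated in log space
    ln_ratio = ln_target(theta_2) - ln_target(theta_1)
    if np.log(rng.uniform()) < ln_ratio:
        toy_chain[i + 1] = theta_2
    else:
        toy_chain[i + 1] = theta_1    # rejected: repeat the current value
    # Step 4: loop back to step 2

# After discarding the early steps, the samples should resemble a unit normal
print(toy_chain[2000:].mean(), toy_chain[2000:].std())
```

Note that a rejected proposal still adds an entry to the chain: the current value is repeated, and those repeats are what give high-probability regions their extra weight.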

### Exercise 3: Code up a simple Metropolis-Hastings algorithm below

In [None]:
def metro_hastings(ln_posterior, theta_0, scales, N_steps, args=[]):
    
    # Initialize chains
    chain = np.zeros((N_steps, len(theta_0)))

    # Set first chain value to inputs
    chain[0] = theta_0
    
    # Loop over number of steps
    for i in range(N_steps-1):
        
        # Pick a trial position
        theta_trial = 
                
        # Add new (or old) value to chain
        chain[i+1] = one_step(chain[i], theta_trial, ln_posterior, args)
            
    return chain


def one_step(theta_1, theta_2, ln_posterior, args=[]):
    
    # Obtain posterior probabilities on two sets of model parameters
    ln_posterior_1 = ln_posterior(theta_1, *args)
    ln_posterior_2 = ln_posterior(theta_2, *args)

    # Algorithm to pick new theta
    
    
    # Return new (or old) set of model parameters
    return theta_new

### Run MCMC model

In [None]:
start = time.time()

alpha_0 = 1.0, 30.0
scales = 0.02, 0.06
chain = metro_hastings(ln_posterior, alpha_0, scales=scales, args=(data,), N_steps=100000)


end = time.time()
print("Elapsed time:", end-start, "seconds")

### Plot the trace

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(5, 5))

for i in range(2):
    ax[i].plot(chain[:,i], alpha=1, color='k')

    ax[i].axhline(truths[i], color='r')
plt.show()

### Remove burn-in

How many steps did it take before the chain converged? These early steps (the "burn-in") need to be removed before drawing inferences. Do this below, and save the converged chain to a new array.

In [None]:
n_burn = 10000

chain_converged = chain[n_burn:]

### Plot samples

Now, we want to compare our results with our observations. We can do this below.

In [None]:
samples = chain_converged[np.random.randint(len(chain_converged), size=30)]

for s in samples:
    y_out = get_val(data[:,0], s)
    
    plt.plot(data[:,0], y_out, color='k', alpha=0.1)
    
plt.errorbar(data[:,0], data[:,1], yerr=data[:,2], label='Observations', fmt='o')

plt.show()

### Plot covariances

Finally, we can plot the covariances between our model parameters. 

In [None]:
import corner

fig, ax = plt.subplots(2,2, figsize=(6,6))

labels=['$m$', '$b$']
corner.corner(chain_converged, truths=truths, fig=fig, labels=labels)

# Plot the best fit from the minimizer
ax[1,0].axvline(solution.x[0], color='r')
ax[1,0].axhline(solution.x[1], color='r')
ax[0,0].axvline(solution.x[0], color='r')
ax[1,1].axvline(solution.x[1], color='r')

plt.show()

To first order, if we want uncertainties on the parameters, we can take the mean, median, confidence intervals, etc. of the values in the chain. Do this below with numpy's `mean` and `median` functions. To obtain the bounds of a confidence interval, you can use the piece of code below, which returns the value below which a given fraction of the samples fall.

In [None]:
def find_confidence_interval(x, confidence_level):
    return np.sort(x)[int(confidence_level*len(x))]
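
As a usage sketch, the function can be called twice to bracket a central 68% interval. The samples here are synthetic stand-ins for one column of `chain_converged` (an assumption made so the example runs on its own), and the function body is repeated for the same reason.

```python
import numpy as np

def find_confidence_interval(x, confidence_level):
    # Same logic as the notebook's function, repeated so this runs standalone
    return np.sort(x)[int(confidence_level*len(x))]

# Synthetic stand-in for chain_converged[:, 0]
rng = np.random.default_rng(1)
fake_samples = rng.normal(loc=0.5, scale=0.1, size=50000)

lo = find_confidence_interval(fake_samples, 0.16)
med = np.median(fake_samples)
hi = find_confidence_interval(fake_samples, 0.84)
print(f"{med:.3f} +{hi - med:.3f} -{med - lo:.3f}")

# np.percentile gives nearly the same bounds (it interpolates between samples)
lo_np, hi_np = np.percentile(fake_samples, [16, 84])
print(lo_np, hi_np)
```

Note that passing `confidence_level=1.0` would index one past the end of the sorted array, so the function should be called with fractions strictly below 1.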
    
