### ASTR 3970 / 8070: Astrostatistics
***S. R. Taylor***
___

# Homework 3
### Due: Saturday, Feb 1st at 11.59pm CST
---

## Only one problem this week

This problem uses a dataset in `/coursework/homeworks/hw_data/`.

1) Read in `hw3_data_1.npy`. This is a (100 x 2) numpy array, with measurements in the first column and uncertainties in the second column. Using the analytic results for heteroscedastic Gaussian data from lectures, compute the sample mean and the standard error on the sample mean for this data.

2) Reusing some approaches and tools from `Lecture_6`, write a ln-likelihood function for heteroscedastic Gaussian data, and use it in a fitting algorithm to find the best-fit mean. *Remember that scipy optimizers are set up to minimize functions.*

3) Using the same numerical technique from `Lecture_5`, compute the Fisher uncertainty estimate on the mean.

4) Using the bootstrap method, generate $2000$ bootstrap realizations of this dataset. Using an appropriate timing function in python, how long did the generation of these realizations take? 
*DO NOT use the `astroML` code. Write your own bootstrap function from scratch. Also recall that when resampling data, measurements and uncertainties should stay paired together. This code will be graded on efficiency and speed; it should not take more than 1 second to execute.*

5) Repeat (2) with all $2000$ bootstrap datasets to find the distribution of the sample mean. How long did this take? Plot a normalized histogram of these bootstrap means, and overplot a Gaussian pdf with the mean and std found in (1). Do these agree?

6) While we have fitted a heteroscedastic Gaussian to this data, let's try something else. Write some code to define a ln-likelihood for a Laplace distribution evaluated on this data. Fit simultaneously for the Laplace location parameter $\mu$ and scale parameter $\Delta$.

7) Compute the AIC values for the heteroscedastic Gaussian model and the Laplacian model. Which model is favored by the data?

8) Using the $2000$ bootstrap datasets from before, fit for the Laplacian $\mu$ and $\Delta$ for each. Make a nice `corner` plot of the distributions of $\mu$ and $\Delta$ that shows both the marginal $1$D distributions and the joint $2$D distribution. Make sure the plot has labels, shows the titles on each $1$D marginal panel, and has $68\%$ and $95\%$ levels.

9) Let's finish with a Fisher uncertainty estimate of the Laplacian parameters. Use the following code to install `numdifftools`, which provides a simple way to compute derivatives. We can then compute the Hessian matrix, which is the matrix of the second derivatives of the user's function. This should be computed at the best-fit Laplacian parameters $\mu$ and $\Delta$. To finish, invert the matrix, and then take the square root. The diagonal entries will then be the Fisher uncertainties on $\mu$ and $\Delta$. How do these compare to the bootstrap distribution widths found in (8)?

I had a packed weekend when this homework was due, and even Monday was busy; I should have planned an extension earlier in the week. I am submitting what I can, and I've got to plan for the future homeworks properly. 

In [None]:
!pip install numdifftools

In [None]:
import numdifftools as nd
H = nd.Hessian(f_lnlaplace)([beta_laplace[0], beta_laplace[1]])
sigma_laplace = np.linalg.inv(H)**0.5
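The "Hessian, invert, square root of the diagonal" recipe can be sketched on a case where the answer is known analytically. This is only an illustration under stated assumptions: it uses synthetic data, a homoscedastic Gaussian model with parameters $(\mu, \sigma)$ instead of the Laplace model, and a hand-rolled central-difference helper (the hypothetical `numerical_hessian`) in place of `numdifftools`, so that it needs only NumPy. The Hessian is taken of the *negative* ln-likelihood, which makes it positive definite at the best fit. If the function passed in is the ln-likelihood itself, the sign would need to be flipped before inverting.

```python
import numpy as np

def neg_lnlike(theta, x):
    # Negative Gaussian ln-likelihood (constant terms dropped); theta = (mu, sigma).
    mu, sigma = theta
    return np.sum(0.5 * ((x - mu) / sigma) ** 2 + np.log(sigma))

def numerical_hessian(f, theta, step=1e-4):
    # Hypothetical helper: central-difference estimate of the matrix of second derivatives.
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = step
            ej[j] = step
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * step**2)
    return H

# Synthetic stand-in data (an assumption, not the homework file).
rng = np.random.default_rng(0)
x = rng.normal(4.0, 1.0, size=100)

# Maximum-likelihood estimates for a Gaussian: sample mean and ddof=0 standard deviation.
mu_hat = x.mean()
sigma_hat = x.std()

H = numerical_hessian(lambda t: neg_lnlike(t, x), [mu_hat, sigma_hat])
cov = np.linalg.inv(H)               # inverse Hessian approximates the covariance matrix
fisher_sigma = np.sqrt(np.diag(cov)) # square root of the diagonal only

print("Fisher uncertainties (mu, sigma):", fisher_sigma)
print("Analytic expectation:", sigma_hat / np.sqrt(x.size), sigma_hat / np.sqrt(2 * x.size))
```

Taking `np.sqrt(np.diag(cov))` rather than raising the whole matrix to the power `0.5` avoids NaN warnings from negative off-diagonal covariance terms; the diagonal entries are the same either way.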

### Solution

In [17]:
import numpy as np

#PROBLEM 1

filename = "/Users/harvir_d/NEW_repos/astr_8070_s25/coursework/homeworks/hw_data/hw3_data_1.npy" # file path
data = np.load(filename) # easy to load multivariate data if from a numpy file

measurements = data[:,0] # extracting column at a time
uncertainties = data[:,1]

numerator = np.sum(measurements / (uncertainties**2)) # we can apply element wise operations row by row in our data
denominator = np.sum(1 / (uncertainties**2))

sample_mean = numerator / denominator

sample_error_mean = (np.sum(1 / (uncertainties**2)))**(-0.5)

print("The sample mean computed analytically is: %.4f" % sample_mean)
print("The sample error of the mean computed analytically is: %.4f" % sample_error_mean)


The sample mean computed analytically is: 4.0821
The sample error of the mean computed analytically is: 0.0938


In [14]:
#PROBLEM 2

from scipy import optimize
from scipy.stats import norm

# We can compare two extracted mean values for our data to the sample mean from (1).

# The first method defines a squared loss function like the chi-squared in class. Because the data are
# heteroscedastic (each measurement has its own error bar rather than a shared variance), we do not fit a line;
# instead we fit a single "best guess mean" beta to every (measurement, uncertainty) pair.
# Up to an additive constant, this loss is -2 times the Gaussian ln-likelihood.

def squared_loss(y, y_fit, dy):
    # minimizing the sum of squared, uncertainty-weighted residuals gives our best-fit mean
    return np.sum(((y - y_fit) / dy) ** 2, -1) # -1 sums over the last axis, which is the only axis for these 1D arrays

f_squared = lambda beta: squared_loss(data[:,0], beta, data[:,1]) # lambda pre-loads the data so the optimizer only has to vary beta
beta = 4 # guess for fitting algorithm to start with

# the optimizer iteratively adjusts beta from the initial guess until the weighted residual sum is minimized
best_fit_mean = optimize.fmin(f_squared,beta)
print("The mean from our fitting algorithm is: %.4f" % best_fit_mean[0]) # fmin returns a length-1 array, so take element 0



# the second method for extracting the mean of our data is by directly maximizing the Likelihood function

xgrid = np.linspace(3,5,1000) # grid of candidate mean values
L_single = np.ones((data.shape[0],1000)) # allocating array space for individual measurement likelihoods
for i in range(len(data)): # loop through each paired data point
    L_single[i] = norm.pdf(xgrid,loc=data[i,0],scale=data[i,1]) # individual measurement likelihoods stored

L = np.prod(L_single,axis=0) # Likelihood for dataset as a whole

sorted_indices = np.argsort(L)[-1] # argsort returns the indices that sort L in ascending order, so [-1] is the index of the maximum likelihood value
print("The MLE for the mean is: %.4f" % xgrid[sorted_indices]) # precision is limited by the grid spacing of about 0.002

Optimization terminated successfully.
         Current function value: 126.238875
         Iterations: 12
         Function evaluations: 24
The mean from our fitting algorithm is: 4.0821
The MLE for the mean is: 4.0831
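Problem 2 asks for an explicit ln-likelihood function. A minimal sketch of what that could look like is given here, assuming synthetic stand-in data rather than `hw3_data_1.npy`; the function name `neg_lnlike_gauss` and the choice of `scipy.optimize.minimize_scalar` are illustrative assumptions, not the course's prescribed approach. Since scipy optimizers minimize, the function returns the negative ln-likelihood.

```python
import numpy as np
from scipy import optimize

def neg_lnlike_gauss(mu, x, sigma):
    # Negative ln-likelihood for independent Gaussian measurements x with
    # individual (heteroscedastic) uncertainties sigma and a common mean mu.
    return 0.5 * np.sum(((x - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma**2))

# Synthetic stand-in data (an assumption): 100 measurements with varying error bars.
rng = np.random.default_rng(1)
sigma = rng.uniform(0.5, 1.5, size=100)
x = rng.normal(4.0, sigma)

# Minimize the negative ln-likelihood over the single parameter mu.
result = optimize.minimize_scalar(neg_lnlike_gauss, args=(x, sigma), bracket=(3.0, 5.0))
mu_fit = result.x

# Analytic inverse-variance weighted mean for comparison.
w = 1.0 / sigma**2
mu_analytic = np.sum(w * x) / np.sum(w)

print("Fitted mean:   %.4f" % mu_fit)
print("Analytic mean: %.4f" % mu_analytic)
```

Working with the ln-likelihood directly also sidesteps the product of 100 small pdf values used in the grid method, which can underflow for larger datasets.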




In [18]:
#PROBLEM 3

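sigma_mu = np.diff(np.log(L), n=2) # 2nd-order finite difference of lnL
sigma_mu /= (xgrid[1]-xgrid[0])**2 # dividing by the squared grid spacing gives the numerical second derivative
sigma_mu *= -1 # the Fisher information is minus the second derivative of lnL
sigma_mu = 1/np.sqrt(sigma_mu)[0] # lnL is quadratic in the mean here, so the curvature is the same at every grid point and element 0 suffices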

print("Fisher matrix error on estimated mean is %.4f" % sigma_mu)

Fisher matrix error on estimated mean is 0.0938
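A short derivation, assuming independent Gaussian errors with known $\sigma_i$ as in Problem 1, suggests why this numerical value coincides with the analytic standard error of $0.0938$:

$$
\ln L(\mu) = \mathrm{const} - \frac{1}{2}\sum_{i=1}^{N}\frac{(x_i-\mu)^2}{\sigma_i^2},
\qquad
\frac{d^2 \ln L}{d\mu^2} = -\sum_{i=1}^{N}\frac{1}{\sigma_i^2},
\qquad
\sigma_\mu = \left(-\frac{d^2 \ln L}{d\mu^2}\right)^{-1/2} = \left(\sum_{i=1}^{N}\frac{1}{\sigma_i^2}\right)^{-1/2}.
$$

The second derivative does not depend on $\mu$, so the finite-difference estimate should return the same curvature anywhere on the grid, up to floating-point error.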


In [2]:
#PROBLEM 4

import time

num_bootstrap = 2000 # number of bootstrap resamplings we want

num_old_rows = data.shape[0] # 100, number of paired values in original dataset
num_new_rows = 100 # number of paired values we want to extract from the original dataset (in a random way) for each of the 2000 resamplings

bootstrap_resamplings = np.empty((num_bootstrap, num_new_rows, data.shape[1])) # pre-allocating 3D array for our complete bootstrap data set, 2000 partitioned resamples that are num_new_rows x 2 in dimension

total_start_time = time.perf_counter() # start time for running entire loop for bootstrap

for i in range(num_bootstrap):
    loop_time_start = time.perf_counter() # time start for this loop iteration

    # sampling with replacement for bootstrapping
    indices = np.random.choice(data.shape[0], size=num_new_rows, replace=True) # randomly selecting rows from original dataset, results in a list of numbers where each number is a row from *data*

    bootstrap_resamplings[i] = data[indices] # the ith partition of our 3D array of bootstrapping contains the randomly selected rows or paired values from the original dataset

    loop_time_end = time.perf_counter()

    print(f"Loop {i+1:4d}/{num_bootstrap} took: {loop_time_end - loop_time_start:.6f} seconds")

total_end_time = time.perf_counter()

print(f"Total time for {num_bootstrap} loops: {total_end_time - total_start_time:.6f} seconds")




Loop    1/2000 took: 0.087945 seconds
Loop    2/2000 took: 0.000043 seconds
Loop    3/2000 took: 0.000027 seconds
Loop    4/2000 took: 0.000036 seconds
Loop    5/2000 took: 0.000020 seconds
Loop    6/2000 took: 0.000015 seconds
Loop    7/2000 took: 0.000029 seconds
Loop    8/2000 took: 0.000027 seconds
Loop    9/2000 took: 0.000017 seconds
Loop   10/2000 took: 0.000016 seconds
Loop   11/2000 took: 0.000019 seconds
Loop   12/2000 took: 0.000015 seconds
Loop   13/2000 took: 0.000013 seconds
Loop   14/2000 took: 0.000013 seconds
Loop   15/2000 took: 0.000013 seconds
Loop   16/2000 took: 0.000019 seconds
Loop   17/2000 took: 0.000149 seconds
Loop   18/2000 took: 0.000012 seconds
Loop   19/2000 took: 0.000011 seconds
Loop   20/2000 took: 0.000011 seconds
Loop   21/2000 took: 0.000012 seconds
Loop   22/2000 took: 0.000011 seconds
Loop   23/2000 took: 0.000010 seconds
Loop   24/2000 took: 0.000010 seconds
Loop   25/2000 took: 0.000010 seconds
Loop   26/2000 took: 0.000010 seconds
Loop   27/20
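For comparison, the resampling can also be done without a Python loop by drawing all the row indices at once. The sketch below is one possible vectorized version, assuming synthetic stand-in data and a hypothetical helper name `bootstrap_pairs`; timings will vary by machine, so none are quoted. Indexing whole rows keeps each measurement paired with its uncertainty.

```python
import time
import numpy as np

def bootstrap_pairs(data, n_boot, rng):
    # Draw an (n_boot, n_rows) array of row indices with replacement, then use
    # fancy indexing so every resample keeps (measurement, uncertainty) rows intact.
    n_rows = data.shape[0]
    idx = rng.integers(0, n_rows, size=(n_boot, n_rows))
    return data[idx]  # shape (n_boot, n_rows, n_columns)

# Synthetic stand-in for the (100 x 2) homework array (an assumption).
rng = np.random.default_rng(42)
demo = np.column_stack([rng.normal(4.0, 1.0, 100), rng.uniform(0.5, 1.5, 100)])

t0 = time.perf_counter()
boot = bootstrap_pairs(demo, 2000, rng)
elapsed = time.perf_counter() - t0

print("Bootstrap array shape:", boot.shape)
print(f"Generation took {elapsed:.6f} seconds")
```

Timing only the call itself, with no `print` inside the timed region, keeps I/O overhead out of the measurement.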

I am not sure why problem 5 is not working. I think a straightforward reuse of the code from (2), applied to the bootstrap dataset as I have done below, should work. It is possible that the way I constructed the bootstrap array, as a 2000x3 space with resample indices in the first column, is causing the issue. But then again, I only need to access the second and third columns to get the measurement and uncertainty values, and I did that below. Unfortunately, I had to submit this as is without resolving the bug.

In [23]:
#PROBLEM 5

#Repeating (2) for the bootstrap set

def squared_loss_bootstrap(y, y_fit, dy):
    # minimizing the sum of the residuals will result in our best-fit mean
    return np.sum(((y - y_fit) / dy) ** 2, -1) # -1 is a sum over a specific axis for the 1D array in question (for multi-dimensional arrays, we could specify an axis to sum over)

f_squared_bootstrap = lambda beta: squared_loss_bootstrap(bootstrap_resamplings[:,1], beta_bootstrap, bootstrap_resamplings[:,2]) # lambda is an option that allows us to pre-load guesses into our callable function so we do not have to change the function or function call itself when we want to adjust our initial guess beta
beta_bootstrap = 4 # guess for fitting algorithm to start with

# adjusting our initial guess for the mean by iteratively going through selections of beta, until the residual sum is minimized
best_fit_mean_bootstrap = optimize.fmin(f_squared_bootstrap,beta_bootstrap)
print("The mean from our fitting algorithm, from the bootstrap set, is: %.4f" % best_fit_mean_bootstrap)



# the second method for extracting the mean of our data is by directly maximizing the Likelihood function

xgrid_bootstrap = np.linspace(3,5,1000) # array space for possible mean values
L_single_bootstrap = np.ones((bootstrap_resamplings.shape[0],1000)) # allocating array space for individual measurement likelihoods
for i in range(len(bootstrap_resamplings)): # continuing loop through each paired data point
    L_single_bootstrap[i] = norm.pdf(xgrid_bootstrap,loc=bootstrap_resamplings[i,1],scale=bootstrap_resamplings[i,2]) # individual measurement likelihoods stored

L_bootstrap = np.prod(L_single_bootstrap,axis=0) # Likelihood for dataset as a whole

sorted_indices_bootstrap = np.argsort(L_bootstrap)[-1] # argsort rearranges array in ascending order, [-1] chooses the last value in array and that is the maximum likelihood value
print("The MLE for the mean from the bootstrap set is: %.4f" % xgrid_bootstrap[sorted_indices_bootstrap])

ValueError: The user-provided objective function must return a scalar value.
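A likely cause of this error, judging from the code in Problem 4, is the array layout: `bootstrap_resamplings` was allocated with shape `(2000, 100, 2)`, so `bootstrap_resamplings[:,1]` selects the second *row* of every resample (shape `(2000, 2)`) rather than a column of measurements. Summing over the last axis then leaves 2000 values instead of a scalar, which the optimizer rejects. Separately, the lambda passes `beta_bootstrap` instead of its own argument `beta`, so the objective would not vary with the trial value even if the shapes were right. The sketch below shows one way the per-resample fit could be organized; it assumes synthetic stand-in data, only 200 resamples to keep it quick, and `minimize_scalar` rather than `fmin`.

```python
import numpy as np
from scipy import optimize

def chi2(mu, y, dy):
    # Scalar objective: uncertainty-weighted squared residuals for ONE resample.
    return np.sum(((y - mu) / dy) ** 2)

# Synthetic stand-in data and bootstrap array (assumptions, not the homework file).
rng = np.random.default_rng(7)
dy = rng.uniform(0.5, 1.5, 100)
demo = np.column_stack([rng.normal(4.0, dy), dy])
boot = demo[rng.integers(0, 100, size=(200, 100))]  # shape (200, 100, 2)

# Reproducing the shape problem: [:, 1] and [:, 2] pick rows, not columns.
bad = np.sum(((boot[:, 1] - 4.0) / boot[:, 2]) ** 2, -1)
print("Objective shape with row indexing:", bad.shape)  # one value per resample, not a scalar

# Index the LAST axis to get measurements and uncertainties for every resample.
meas = boot[:, :, 0]  # shape (200, 100)
unc = boot[:, :, 1]   # shape (200, 100)

# Fit each resample separately so the objective is always a scalar.
boot_means = np.array([
    optimize.minimize_scalar(chi2, args=(y, s), bracket=(3.0, 5.0)).x
    for y, s in zip(meas, unc)
])

# Closed-form weighted means as a cross-check on the fits.
w = 1.0 / unc**2
closed_form = np.sum(w * meas, axis=1) / np.sum(w, axis=1)

print("Mean and std of bootstrap means: %.4f, %.4f" % (boot_means.mean(), boot_means.std()))
```

The resulting `boot_means` array is what would be histogrammed and compared against the Gaussian pdf from (1).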