In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab06.ipynb")

# Lab 6: Model Selection and Uncertainty Quantification with GLMs
Welcome to the Data 102 Lab 6. In this lab, we are going to continue our discussion of GLMs, and give you the opportunity to practice concepts around model selection and uncertainty quantification with GLMs.

#### The code and responses you need to write are represented by `...`. There is additional documentation for each part as you go along.

##### Please read the introduction and the instructions to each problem carefully.

## Collaboration Policy
Data science is a collaborative activity. While you may talk with others about the labs, we ask that you **write your solutions individually**. If you do discuss the assignments with others, please **include their names** in the cell below.

**Collaborators:**

## Submission
See the [Gradescope Submission Guidelines](https://edstem.org/us/courses/42657/discussion/3350112) for details on how to submit your lab.

**For full credit, this assignment should be completed and submitted before Wednesday, October 11th, 2023 at 11:59 PM PST.**

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import itertools
import hashlib
from scipy.stats import poisson, norm, gamma, uniform, multivariate_normal
import statsmodels.api as sm
import arviz as az
  
sns.set(style="dark")
plt.style.use("ggplot")

# Atlantic Hurricane Season

As in Lab 5 Question 2, we will be exploring hurricane data from the 2020 Atlantic hurricane season. Just like in the last lab, we're investigating the number of named storms recorded since 1880 and examining the relationship between rising Sea Surface Temperature (SST) and the frequency of named storms.

For this lab, we extracted the number of tropical storms from the [HURDAT Database](https://www.nhc.noaa.gov/data/hurdat/hurdat2-1851-2019-052520.txt). We also extracted data on Sea Surface Temperatures from the [National Center for Atmospheric Research](https://climatedataguide.ucar.edu/climate-data/global-surface-temperature-data-gistemp-nasa-goddard-institute-space-studies-giss).

## Load the data

The `Num_Storms` column contains the number of named storms recorded each year between 1880 and 2019. The `Temp_Anomaly` column contains the deviation in yearly SST from the mean of 1951-1980.

In [None]:
# No need to modify: Just run the code to load the data
data_source = "hurricane_data.csv"
df = pd.read_csv(data_source)
df = df[["Year", "Num_Storms", "Temp_Anomaly"]]
df.tail()

## Review: Model Selection Metrics

In Lab 5, we spent some time learning how to build, interpret, and implement GLMs from a frequentist and Bayesian perspective. Here, we expand on that discussion by considering **model selection**: how do we compare the models that we've built?

In class, we looked at a few metrics that can help us decide on a model to use: 

### Log-likelihood:  $\log{p(y|x,\beta)}$
This is simply the log of the likelihood that we're maximizing when we fit a GLM. The motivation behind using this value comes from a simple idea: the more probable our data is under our model (the higher the log-likelihood), the better it explains the data. This metric is generally best when comparing models with similar numbers of parameters.

### Deviance: $(-2) \times \text{log-likelihood}(\text{model}) + \text{constant}$
While you don't need to know the exact details of how this is calculated, the deviance is a measure of how much worse our model is relative to a perfect, or saturated model (i.e. one that perfectly predicts every point). Generally, the lower the deviance, the better the model.
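
As a hedged illustration of where the constant comes from, the sketch below computes the deviance of a Poisson model in two ways: as twice the gap between the saturated and fitted log-likelihoods, and with the usual closed-form expression. The arrays `y` and `mu` and the helper `poisson_loglik` are made up for this example and are not taken from the hurricane data.

```python
import math
import numpy as np

# Hypothetical counts and fitted means (illustrative values only)
y = np.array([3, 0, 7, 12, 5])
mu = np.array([4.0, 1.5, 6.0, 10.0, 5.5])

def poisson_loglik(y, mu):
    """Poisson log-likelihood, using the convention 0 * log(0) = 0."""
    log_factorials = np.array([math.lgamma(v + 1) for v in y])
    # Where y == 0 the term y * log(mu) is 0, so mu can be swapped for 1 safely
    safe_mu = np.where(y > 0, mu, 1.0)
    return np.sum(y * np.log(safe_mu) - mu - log_factorials)

# Saturated model: one parameter per observation, so the fitted mean equals y
loglik_saturated = poisson_loglik(y, y.astype(float))
loglik_model = poisson_loglik(y, mu)
deviance_from_loglik = 2 * (loglik_saturated - loglik_model)

# Closed form: 2 * sum(y * log(y / mu) - (y - mu)), again with 0 * log(0) = 0
ratio = np.where(y > 0, y / mu, 1.0)
deviance_closed_form = 2 * np.sum(y * np.log(ratio) - (y - mu))

print(deviance_from_loglik, deviance_closed_form)
```

In this sketch the "constant" in the heading is twice the saturated log-likelihood, which depends on the data but not on the fitted model.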

### Penalized Likelihood Information Criteria (AIC and BIC)
We also introduced the idea of penalized likelihood information criteria with the **Akaike Information Criterion (AIC)** and the **Bayesian Information Criterion (BIC)**:
$$
\text{AIC}(\text{model}) = (-2) \times \text{log-likelihood}(\text{model}) + 2 \times \text{number of parameters in model}
$$
$$
\text{BIC}(\text{model}) = (-2) \times \text{log-likelihood}(\text{model}) + (\log n) \times \text{number of parameters in model}
$$
The reason why we prefer using metrics like the AIC and BIC is that they penalize models with more parameters. If a model utilizes many features to make a decision, this could lead to overfitting. This would cause an increase in the log-likelihood of the data, even though the model may be a poor representation of the relationship we're trying to understand. As with deviance, smaller AIC and BIC values mean better models!
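
To make the penalty concrete, here is a small sketch of both formulas applied to two hypothetical models. The log-likelihoods, parameter counts, and sample size are invented for illustration; the helper names `aic` and `bic` are ours and are not part of any library.

```python
import math

def aic(log_likelihood, num_params):
    """AIC = -2 * log-likelihood + 2 * (number of parameters)."""
    return -2 * log_likelihood + 2 * num_params

def bic(log_likelihood, num_params, n):
    """BIC = -2 * log-likelihood + log(n) * (number of parameters)."""
    return -2 * log_likelihood + math.log(n) * num_params

n = 140  # hypothetical number of observations

# The larger model fits slightly better but uses three more parameters
aic_small = aic(-380.0, 2)
aic_large = aic(-379.5, 5)
bic_small = bic(-380.0, 2, n)
bic_large = bic(-379.5, 5, n)

print(aic_small, aic_large)  # 764.0 769.0
print(bic_small < bic_large)  # True
```

Even though the larger model has the higher log-likelihood, both criteria prefer the smaller one here because the gain in fit does not cover the penalty.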

## Model Fitting and Selection

For this part of the lab, we'll consider four models for this dataset:

1. Linear regression model for Storms on SST
2. Linear regression model for log(Storms) on SST
3. Poisson regression model for Storms on SST
4. Negative Binomial regression model for Storms on SST

Using the metrics we've learned from class, we can compare and contrast these models against one another, and reach a decision about which best fits our use case.

Before we begin, let's first look at a scatterplot of `Num_Storms` over `Temp_Anomaly` (SST) to get a rough understanding of the relationship between the two variables. 

In [None]:
Y = df['Num_Storms']
X = df['Temp_Anomaly']
plt.scatter(X, Y, s = 10)
plt.xlabel('SST')
plt.ylabel('Number of Storms')
plt.title('Storms plotted against SST')
plt.show()

There seems to be a clear increasing relationship between the two variables. By fitting models, we can obtain a rigorous quantitative understanding of the relationship. 

## Model 1: Linear Regression for Storms on SST

### 1a) Fit the model

In the cell below, use `statsmodels.api` to fit a linear regression (with intercept) for Storms on SST.

**Hint**: To add an intercept, `sm.add_constant` might come in handy.

In [None]:
#Model ONE: linear regression for Storms on SST
model_one = ...
print(model_one.summary())

In [None]:
grader.check("q1")

<!-- BEGIN QUESTION -->

### 1b) Interpreting our Model

Using the model summary you produced in `q1a`, interpret the `Temp_Anomaly` coefficient in the context of the problem.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

Now that we've fit our model, let's visually inspect its fit! As we continue building and comparing models, it's important to confirm their quality visually, rather than relying on metrics alone to compare them.

In [None]:
#Plotting the data along with the regression equation: 
b0_one, b1_one = model_one.params.values
x_vals = np.linspace(df['Temp_Anomaly'].min(), df['Temp_Anomaly'].max(), 100)
y_vals_one = b0_one + b1_one * x_vals
plt.scatter(df['Temp_Anomaly'], Y, s = 10)
plt.plot(x_vals, y_vals_one, color='red',label=f'y = {b0_one:.2f} + {b1_one:.2f}x')
plt.legend()
plt.xlabel('SST')
plt.ylabel('Number of Storms')
plt.title('Storms plotted against SST')
plt.show()

### 1c) Calculating AIC

Now, let's find the AIC for our model for purposes of comparison. Fill out the cell below with the value of the AIC for this model.

**Hint**: This one's pretty straightforward: you can calculate it using the log-likelihood, or use a model attribute to get this!

In [None]:
aic_one = ...

In [None]:
grader.check("q1c")

## Model 2: Linear Regression for $\log(\text{Storms})$ on SST

### 2a) Fit the model

In the cell below, use `statsmodels.api` to fit a linear regression (with intercept) for $\log(\text{Storms})$ on SST.

**Hint**: To add an intercept, `sm.add_constant` might come in handy.

In [None]:
model_two = ...
print(model_two.summary())

In [None]:
grader.check("q2a")

Based on the output, the fitted model equation is $\log(y) = 2.1645 + 0.5197 x$. The coefficient 0.5197 is interpreted as follows: every unit increase in SST is associated with an increase of 0.5197 in the average value of $\log(\text{storms})$. In the next figure, we plot the line fitted by Model One alongside the curve $y = \exp(2.1645 + 0.5197 x)$ fitted by Model Two.

Let's again visualize the fitted model, and compare it to the first model we built:

In [None]:
#Plotting the data along with the regression equations: 
b0_two, b1_two = model_two.params.values
x_vals = np.linspace(df['Temp_Anomaly'].min(), df['Temp_Anomaly'].max(), 100)
y_vals_two = np.exp(b0_two + b1_two * x_vals)
plt.scatter(df['Temp_Anomaly'], Y, s = 10)
plt.plot(x_vals, y_vals_one, color='red',label=f'Model One: y = {b0_one:.2f} + {b1_one:.2f}x')
plt.plot(x_vals, y_vals_two, color='green',label=f'Model Two: y = exp({b0_two:.2f} + {b1_two:.2f}x)')
plt.legend()
plt.xlabel('SST')
plt.ylabel('Number of Storms')
plt.title('Storms plotted against SST')
plt.show()
#The two fitted functions clearly look similar even though the model two fit is a bit lower than model one fit in the middle of the range of SST.

### 2c) Calculating AIC

The AIC for this model is 149.2, which seems substantially smaller than the AIC for Model One (which was 763.1). However, these two AICs are not comparable because the first model treats $y_1, \dots, y_n$ as the dataset while the second model treats $\log(y_1), \dots, \log(y_n)$ as the dataset. To make the AICs comparable, we can convert the second AIC so it applies to $y_1, \dots, y_n$. For this, we can use the density change of variable formula:
\begin{align*}
   f_{Y}(y) = f_{\log Y}(\log y) \left| \frac{d}{dy} \log y \right| = f_{\log Y}(\log y) \frac{1}{y}.
\end{align*}
Thus
\begin{align*}
  \text{likelihood for } Y_1, \dots, Y_n = \left(\text{likelihood for } \log Y_1, \dots, \log Y_n \right) \times \frac{1}{y_1} \frac{1}{y_2} \dots \frac{1}{y_n}
\end{align*}
and
\begin{align*}
   & \text{AIC for model two in terms of } Y_1, \dots, Y_n \\ 
   &= -2  \left(\text{log-likelihood for } Y_1, \dots, Y_n \right) + 2 \times \text{number of parameters} \\
   &= -2  \left(\text{log-likelihood for } \log Y_1, \dots, \log Y_n \right) + 2 \times \text{number of parameters}  - 2 \log \left(\frac{1}{y_1} \frac{1}{y_2} \dots \frac{1}{y_n}\right) \\
   &= \left(\text{AIC for model two in terms of } \log Y_1, \dots, \log Y_n \right)+ 2 \sum_{i=1}^n \log y_i.
\end{align*}
Using this derivation, calculate the AIC of the model in the cell below.

**Note**: To avoid rounding errors and pass the autograder test, you'll want to use the `model.aic` attribute to access the exact AIC value for your model.
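
The derivation above can be checked numerically. The following sketch uses synthetic data (not the hurricane data) and plain NumPy rather than `statsmodels`: it fits a least-squares regression of $\log y$ on $x$, converts the log-scale AIC with the $2 \sum_i \log y_i$ correction, and compares the result against an AIC computed directly from the lognormal density of $y$. Counting only the two regression coefficients as parameters is an assumption made here; any convention works as long as it is the same for every model being compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-0.5, 1.0, size=n)
# Synthetic positive response whose logarithm is linear in x plus Gaussian noise
y = np.exp(2.0 + 0.5 * x + rng.normal(0.0, 0.3, size=n))

# Least-squares fit of log(y) on x with an intercept
X = np.column_stack([np.ones(n), x])
log_y = np.log(y)
beta, *_ = np.linalg.lstsq(X, log_y, rcond=None)
resid = log_y - X @ beta
sigma2 = np.mean(resid**2)  # maximum likelihood estimate of the noise variance

k = 2  # assumption: count only the two regression coefficients

# Gaussian log-likelihood of log(y_1), ..., log(y_n) at the fitted parameters
loglik_log_scale = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
aic_log_scale = -2 * loglik_log_scale + 2 * k

# Conversion from the derivation: add 2 * sum(log y_i)
aic_converted = aic_log_scale + 2 * np.sum(log_y)

# Direct route: log-likelihood of y_1, ..., y_n under the lognormal density
loglik_y = np.sum(-log_y - 0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))
aic_direct = -2 * loglik_y + 2 * k

print(aic_log_scale, aic_converted, aic_direct)
```

The last two printed values agree, while the first one is on a different scale and should not be compared with an AIC computed on $y$ itself.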

In [None]:
aic_two = ...
aic_two

In [None]:
grader.check("q2b")

Now, when we compare model one's AIC and model two's AIC, we can see that Model One has a (slightly) lower AIC compared to Model Two. Even though the two models seem to give similar fits, we can prefer Model One because of the smaller AIC.

In [None]:
#Compare with aic_one
print([aic_one, aic_two])

## Model 3: Poisson Regression for Storms on SST

The Poisson regression model is given by 

$$Y_i \sim \text{Poisson}(\mu_i)$$

$$\mu_i = \exp(\beta_0 + \beta_1 X_i)$$ 

and $X_i$ is the SST deviation in Year $i$. 

### 3a) Fit the model

In the cell below, use `statsmodels.api` to fit a Poisson regression (with intercept) for Storms on SST.

In [None]:
model_three = ...
print(model_three.summary())

In [None]:
grader.check("q3a")

### 3b) Comparing Models 2 and 3
The fitted coefficients give estimates for $\beta_0$ and $\beta_1$. The estimate 0.4866 for $\beta_1$ implies that for a unit increase in `Temp_Anomaly`, the mean number of storms increases by a factor of $\exp(0.4866) \approx 1.63$ (or equivalently, by 63%). Thus, increased SST is associated with more storms. The p-value for SST is close to 0, indicating statistical significance.
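
As a quick worked example of this multiplicative reading, the sketch below converts the slope estimate 0.4866 quoted above into a rate ratio. The half-unit case is added only to show that effects compound on the log scale; the variable names are illustrative.

```python
import math

beta_1 = 0.4866  # slope estimate quoted in the text above

# A one-unit increase in SST multiplies the mean storm count by exp(beta_1)
factor_per_unit = math.exp(beta_1)
percent_increase = 100 * (factor_per_unit - 1)

# A half-unit increase multiplies the mean by exp(beta_1 / 2),
# which is the square root of the one-unit factor, not half of it
factor_half_unit = math.exp(0.5 * beta_1)

print(round(factor_per_unit, 2), round(percent_increase))  # 1.63 63
print(round(factor_half_unit, 3))
```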

Take a moment to compare the coefficients you got for Model 2 and Model 3. They seem very similar! This raises the question: how are these two models any different?

**Choose the option that best fills in the blanks in the following statements. For the second statement, you may use the same choice multiple times:**

The mathematical models fit by Model 2 and Model 3 are \_\_\_**(1)**\_\_\_.

- **A.** Different
- **B.** The same

Mathematically, we can represent Model 2 as fitting the equation \_\_\_**(2)**\_\_\_ whereas we can represent Model 3 as fitting the equation \_\_\_**(3)**\_\_\_.

- **A.** $\mathbb{E} (\log Y) = \beta_0 + \beta_1 X_i$
- **B.** $\log \mathbb{E} (Y) = \beta_0 + \beta_1 X_i$

Your answer should be strings, assigning either `"A"` or `"B"` to each variable.

In [None]:
blank_1 = ...
blank_2 = ...
blank_3 = ...

In [None]:
grader.check("q3b")

Again, let's plot all three regression lines in one figure and compare them visually.

In [None]:
#Plotting the data along with the regression equation: 
b0_three, b1_three = model_three.params.values
x_vals = np.linspace(df['Temp_Anomaly'].min(), df['Temp_Anomaly'].max(), 100)
y_vals_three = np.exp(b0_three + b1_three * x_vals)
plt.figure(figsize = (15, 10))
plt.scatter(df['Temp_Anomaly'], Y)
plt.plot(x_vals, y_vals_one, color='red',label=f'y = {b0_one:.2f} + {b1_one:.2f}x (linear regression of y on x)')
plt.plot(x_vals, y_vals_two, color='green',label=f'y = exp({b0_two:.2f} + {b1_two:.2f}x) (linear regression of log(y) on x)')
plt.plot(x_vals, y_vals_three, color='black',label=f'y = exp({b0_three:.2f} + {b1_three:.2f}x) (Poisson regression of y on x)')
plt.legend()
plt.xlabel('SST')
plt.ylabel('Number of Storms')
plt.title('Storms plotted against SST')
plt.show()

From the above figure, we can see that near the middle of the range of SST, there is little difference between linear regression and Poisson regression. However, near the extremes, Poisson regression gives larger fits compared to linear regression. The model two fitted line is somewhat below the other two fits in the middle region. 

<!-- BEGIN QUESTION -->

### 3c) Is a Change of Variable Needed?

Given your answer to `q3b`, when we do our AIC calculation, do we need to use the density change of variable formula as in `q2c`? Why or why not?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

### 3d) Calculating AIC

Fill out the cell below with the value of the AIC for this model.

In [None]:
aic_three = ...
aic_three

In [None]:
grader.check("q3d")

<!-- END QUESTION -->



In [None]:
print([aic_one, aic_two, aic_three])

So Model Three has the smallest AIC of the three models so far.

## Model 4: Negative Binomial Regression for Storms on SST

The Negative Binomial regression model is given by 

$$Y_i \sim \text{NB}(\mu_i, \alpha)$$

$$\mu_i = \exp(\beta_0 + \beta_1 X_i)$$ 

Here $\text{NB}(\mu_i, \alpha)$ is the Negative Binomial distribution with mean $\mu_i$ and overdispersion parameter $\alpha$. ($\text{NB}(\mu, \alpha)$ and $\text{Poisson}(\mu)$ are very close to each other when $\alpha$ is close to zero.)
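
To build intuition for the overdispersion parameter, the sketch below simulates Negative Binomial draws as a Gamma-Poisson mixture. It assumes the common "NB2" parametrization, in which the variance is $\mu + \alpha \mu^2$; that formula is an assumption of this example rather than something stated above, and the helper `sample_nb` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_nb(mu, alpha, size, rng):
    """Draw NB(mu, alpha) samples as a Gamma-Poisson mixture (NB2 assumption)."""
    # Gamma rate with mean mu and variance alpha * mu**2
    lam = rng.gamma(shape=1.0 / alpha, scale=alpha * mu, size=size)
    # Conditional on the rate, counts are Poisson
    return rng.poisson(lam)

mu = 10.0
draws_overdispersed = sample_nb(mu, alpha=0.5, size=200_000, rng=rng)
draws_near_poisson = sample_nb(mu, alpha=0.01, size=200_000, rng=rng)

# Under NB2 the variance is mu + alpha * mu**2: 60 for alpha = 0.5, 11 for alpha = 0.01
print(draws_overdispersed.mean(), draws_overdispersed.var())
print(draws_near_poisson.mean(), draws_near_poisson.var())
```

With the small $\alpha$, the sample variance sits close to the mean, which is the Poisson behavior; with the larger $\alpha$, the variance is several times the mean.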

### 4a) Fit the model

In the cell below, use `statsmodels.api` to fit a negative binomial regression (with intercept) for Storms on SST.

**Note**: While this can be fit in two ways in statsmodels (`sm.GLM(..., family = sm.families.NegativeBinomial()).fit()` and `sm.NegativeBinomial(...).fit()`), these methods differ slightly in their treatment of $\alpha$. The latter function also treats $\alpha$ as an unknown parameter and estimates it, including a row in the output for the parameter $\alpha$. The former function expects to be given a value of $\alpha$ and takes $\alpha = 1$ as default if not supplied. **For this question, use `sm.NegativeBinomial(...).fit()` to pass the tests.**

In [None]:
model_four = ...
print(model_four.summary())

In [None]:
grader.check("q4a")

This model seems to give essentially the same coefficient estimates as Poisson regression. The estimated overdispersion parameter $\alpha$ is small (0.029); recall that $\text{NB}(\mu, \alpha) \approx \text{Poisson}(\mu)$ when $\alpha \approx 0$.

### 4b) Calculating AIC

Once again, let's find the AIC for our model for purposes of comparison. Fill out the cell below with the value of the AIC for this model.

In [None]:
aic_four = ...
aic_four

In [None]:
grader.check("q4b")

All four AICs are given below:

In [None]:
print(f"AIC of model one: {aic_one}")
print(f"AIC of model two: {aic_two}")
print(f"AIC of model three: {aic_three}")
print(f"AIC of model four: {aic_four}")

<!-- BEGIN QUESTION -->

### 4c) Choosing the best model

Given the AICs that you calculated, which model is the best fit for our data?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

# Model Checking and Posterior Predictive Distributions

So far in this lab, we've talked through issues of *model selection*. Now, we will pivot to viewing GLMs from a Bayesian perspective, and check if our model is right for the task at hand. 

To do this, remember that frequentist models are equivalent to Bayesian ones with a flat prior ($\text{uniform}[-C, C]$ for $C \rightarrow \infty$). That means that we can find a posterior for these models, and eventually find **posterior predictive distributions** for our data (more on that in the next part of this question). With these distributions, we can then perform a series of sanity checks to make sure that our model is at least a plausible explanation for the underlying relationship we see in the data.

## Posterior Normal Approximation

But how do we find the posterior here? Usually, as in Labs 4 and 5, we can use approximate inference to get empirical posteriors. Here, we'll use a different approximation technique that we covered in Lecture 11: **posterior normal approximation**.

To describe this, let us focus on Poisson regression for concreteness where the parameters are $\beta_0$ and $\beta_1$ (the analysis for negative binomial is entirely analogous). We assume a flat prior ($\text{uniform}[-C, C]$ for $C \rightarrow \infty$). The posterior distribution for $\beta_0, \beta_1$ is then approximated by the multivariate normal distribution with mean given by the MLE and the covariance given by the negative of the inverse of the Hessian matrix of the log-likelihood evaluated at the MLE. If that sounded like a lot, don't worry: **this covariance matrix is given by the output of statsmodels!** For example, for model 3, the mean and covariance of the posterior normal distribution can be calculated as follows:

In [None]:
posterior_normal_mean = model_three.params.values  #this is the mean of the posterior normal approximation
print(posterior_normal_mean)
posterior_normal_covariance = model_three.cov_params() #this is the covariance matrix of the posterior normal approximation
print(posterior_normal_covariance)
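
To see what kind of calculation sits behind such a covariance matrix, here is a minimal sketch on synthetic data. It finds the Poisson regression MLE with Newton's method and then takes the negative inverse of the Hessian of the log-likelihood at the MLE. This illustrates the formula described above; it is not a claim about how `statsmodels` computes `cov_params()` internally, and the data and `true_beta` are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(-0.5, 1.0, size=n)
X = np.column_stack([np.ones(n), x])  # design matrix with intercept
true_beta = np.array([2.0, 0.5])      # hypothetical coefficients
y = rng.poisson(np.exp(X @ true_beta))

# Newton's method on the Poisson log-likelihood
beta = np.array([np.log(y.mean()), 0.0])  # start at the intercept-only fit
for _ in range(25):
    mu = np.exp(X @ beta)
    gradient = X.T @ (y - mu)       # score vector
    hessian = -(X.T * mu) @ X       # Hessian: -X^T diag(mu) X
    step = np.linalg.solve(hessian, gradient)
    beta = beta - step
    if np.max(np.abs(step)) < 1e-10:
        break

# Posterior normal approximation: mean = MLE, covariance = -(Hessian)^{-1}
mu = np.exp(X @ beta)
hessian = -(X.T * mu) @ X
posterior_mean_demo = beta
posterior_cov_demo = -np.linalg.inv(hessian)

print(posterior_mean_demo)
print(np.sqrt(np.diag(posterior_cov_demo)))  # approximate posterior standard deviations
```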

The marginal normal densities corresponding to this multivariate (in this case, bivariate) normal density can be plotted as follows. These give the marginal posterior densities for $\beta_0$ and $\beta_1$. 

In [None]:
from scipy.stats import norm
mu_0, mu_1 = posterior_normal_mean
V_0, V_1 = np.diag(posterior_normal_covariance)

# Generate separate x values for plotting each density
x_0 = np.linspace(mu_0 - 3*np.sqrt(V_0), mu_0 + 3*np.sqrt(V_0), 1000)
x_1 = np.linspace(mu_1 - 3*np.sqrt(V_1), mu_1 + 3*np.sqrt(V_1), 1000)

# Initialize subplots
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 6))

# Plot the first normal density
ax[0].plot(x_0, norm.pdf(x_0, mu_0, np.sqrt(V_0)), color='blue', label=f'N({mu_0:.4f}, {V_0:.4f})')
ax[0].set_title(f'Posterior for beta_0: N({mu_0:.4f}, {V_0:.4f})')
ax[0].legend()

# Plot the second normal density
ax[1].plot(x_1, norm.pdf(x_1, mu_1, np.sqrt(V_1)), color='red', label=f'N({mu_1:.4f}, {V_1:.4f})')
ax[1].set_title(f'Posterior for beta_1: N({mu_1:.4f}, {V_1:.4f})')
ax[1].legend()

# Display the plots
plt.tight_layout()
plt.show()

**Note**: Under our flat prior, Bayesian analysis with this normal posterior gives **exactly the same results** as frequentist analysis. For example, posterior mean, median or mode coincide with the MLE. Posterior standard deviations coincide with frequentist standard errors. Credible intervals coincide with confidence intervals and so on.
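
As a small illustration of this correspondence, the sketch below builds a 95% equal-tailed credible interval from a normal posterior and compares it with the usual "estimate plus or minus 1.96 standard errors" interval. The posterior mean and standard deviation are hypothetical numbers, and the standard library's `NormalDist` is used only to keep the example dependency-free (`scipy.stats.norm.ppf` would serve the same purpose).

```python
from statistics import NormalDist

# Hypothetical normal posterior for a slope coefficient (illustrative values)
post_mean = 0.49
post_sd = 0.08

# Bayesian route: 2.5% and 97.5% quantiles of the normal posterior
posterior = NormalDist(mu=post_mean, sigma=post_sd)
credible_interval = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))

# Frequentist route: estimate +/- z * standard error
z = NormalDist().inv_cdf(0.975)  # approximately 1.96
wald_interval = (post_mean - z * post_sd, post_mean + z * post_sd)

print(credible_interval)
print(wald_interval)
```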

### Sampling from the posterior

Now, to properly set ourselves up to do some posterior predictive checks, we'll need to create a function to draw samples from our approximate posteriors on $\beta_0$ and $\beta_1$. Fill in the function below to sample from our posterior distribution!

**Hint**: Given that we're using a normal approximation of our posterior, `np.random.multivariate_normal` may come in handy here!

In [None]:
np.random.seed(42) #Do not change this line!

def sample_from_posterior(N):
    """
    Draws samples from our approximate distribution on beta_0 and beta_1 from Model 3. 
    
    Inputs:
        N: int, number of samples to return
        
    Outputs:
        samples: np.array of shape (N, 2) containing samples from the posterior 
    """
    ...

posterior_samples = sample_from_posterior(1000)
posterior_samples

In [None]:
grader.check("q5a")

Now, let's visualize the samples we've drawn!

In [None]:
#Histograms of samples along with 'density plots': these are plots of estimated densities and are supposed to approximate the posterior normal densities.
# Initialize subplots
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))

# Plot histogram for the first column
axes[0].hist(posterior_samples[:, 0], bins=100, alpha=0.3, color='blue', density = True)
az.plot_kde(posterior_samples[:, 0], ax=axes[0], label="Density Plot")
axes[0].set_title('Histogram for beta_0 samples')

# Plot histogram for the second column
axes[1].hist(posterior_samples[:, 1], bins=100, alpha=0.3, color='green', density = True)
az.plot_kde(posterior_samples[:, 1], ax=axes[1], label="Density Plot")
axes[1].set_title('Histogram for beta_1 samples')

# Display the plots
plt.tight_layout()
plt.show()

## Posterior Predictive Checks

In order to validate our Bayesian Poisson Regression model, we can perform **Posterior Predictive Checks (PPC)**. After performing Bayesian Regression we have access to a generating distribution for the response $Y'_i|X_i$. This generating distribution is the **posterior predictive distribution**, and it gives us the distribution of the *possible* outputs (denoted $Y'_i$) and associated probabilities ($p(Y'_i|X_i)$) that an input $X_i$ could potentially have. The crux of PPC is to sample from this **posterior predictive distribution** and to compare these samples to our original response data.

The code below computes PPC samples and plots their distribution. Note that here, the red band labeled "Posterior Predictive" is a collection of curves, each of which is the density plot of $y$ for a given draw of the coefficient vector $\beta$ from the posterior. Specifically, to generate a red line we follow these steps:
1. First, we take one of our posterior samples of $\beta_0$ and $\beta_1$.
2. Then we draw samples from the distribution $p(Y_i| \beta_0,\beta_1, X_i)$ for each data point $X_i$.
3. Then we plot the density of the resulting $Y$'s.

In [None]:
#Generating Posterior Predictive Samples (using samples from the posterior_samples)
X = sm.add_constant(df['Temp_Anomaly'])

# Reshaping samples and data to vectorize computation
X_reshaped = X.values[:, np.newaxis, :]
posterior_samples_reshaped = posterior_samples[np.newaxis, :, :]

# Step 2: drawing samples from p(Y_i | betas, X_i)
poisson_means = np.exp(np.sum(X_reshaped * posterior_samples_reshaped, axis = -1)) # Deriving mean of each Poisson likelihood under different values of beta and different inputs
poisson_samples_matrix = np.random.poisson(poisson_means) # Drawing samples from each Poisson Distribution

print(poisson_samples_matrix.shape)
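
The reshaping in the cell above can be hard to read at first. The sketch below uses small made-up arrays (`X_demo` and `beta_draws` are hypothetical stand-ins for the design matrix and the posterior samples) to show that the broadcast-and-sum computes the same linear predictors as a plain matrix product, giving one row per observation and one column per posterior draw.

```python
import numpy as np

rng = np.random.default_rng(3)
n_obs, n_draws = 5, 4  # small hypothetical sizes

# Stand-ins for the design matrix (intercept, SST) and posterior draws of (beta_0, beta_1)
X_demo = np.column_stack([np.ones(n_obs), rng.uniform(-0.5, 1.0, size=n_obs)])
beta_draws = rng.normal([2.0, 0.5], 0.05, size=(n_draws, 2))

# Broadcasting: (n_obs, 1, 2) * (1, n_draws, 2) -> (n_obs, n_draws, 2), then sum over coefficients
linear_broadcast = np.sum(X_demo[:, np.newaxis, :] * beta_draws[np.newaxis, :, :], axis=-1)

# Equivalent matrix product: (n_obs, 2) @ (2, n_draws) -> (n_obs, n_draws)
linear_matmul = X_demo @ beta_draws.T

# One Poisson draw for every (observation, posterior draw) pair
predictive_draws = rng.poisson(np.exp(linear_matmul))

print(linear_broadcast.shape, predictive_draws.shape)  # (5, 4) (5, 4)
```

Each column of `predictive_draws` corresponds to one simulated dataset, which is what a single red curve in the posterior predictive plot summarizes.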

With these posterior predictive samples, we can generate KDE plots for each sampled pair of $\beta_0$ and $\beta_1$ (these density plots are generated with the `arviz` package). For purposes of comparison, we also want to superimpose the average posterior predictive distribution (averaged over all $n$ observations in the original dataset). This latter quantity is computed as follows:

In [None]:
#Calculation of the average posterior predictive distribution
from scipy.stats import poisson
num_rows = X.shape[0]
y_values = np.arange(0, 36)
average_probabilities = np.zeros_like(y_values, dtype=float)

for row in range(num_rows):
    poisson_mean = np.exp(np.dot(posterior_normal_mean, X.iloc[row]))
    average_probabilities += poisson.pmf(y_values, poisson_mean)

average_probabilities /= num_rows

Now, we're ready to plot our posterior predictive distributions, and compare them against the distribution of our observed data.

In [None]:
plt.figure(figsize=(10, 6))

# Plot the densities of all samples
for sample in range(poisson_samples_matrix.shape[1]):
    plot_kwargs = {"color": "red", "linewidth": 1, "alpha": 0.2}
    if sample == 0:
        plot_kwargs["label"] = "Posterior Predictive"
    az.plot_kde(poisson_samples_matrix[:, sample], plot_kwargs=plot_kwargs)

# Plot the density of the Y data
az.plot_kde(Y, plot_kwargs={"color": "black", "linewidth": 2, "label": "Observed Response Data"})

# Superimpose the average_probabilities using a blue line
y_values = np.arange(0, 36)  # Corresponding to y = 0, 1, ..., 35
plt.plot(y_values, average_probabilities, color='blue', label="Posterior Predictive Mean for Bayesian Poisson Regression with SST", linewidth=2, linestyle='--', alpha = 1)

plt.legend()
plt.xlabel('Number of Named Storms')
plt.ylabel('Density')
plt.title('Posterior Predictive Check')
plt.show()

When we visualize this data, notice that the black line (i.e., the distribution of our observed $Y$) falls snugly within the red region. This is exactly what we're looking for: the red region, composed of samples from our posterior predictive distribution, represents our model's belief about the distribution of $Y$. Interpreted in the context of our problem, the red region tells us how the model believes frequencies of named storms are distributed, and since the observed distribution falls inside the region, our model's expectations and our observed reality line up nicely.

As a comparison, let's use this posterior predictive check on an obviously poor model: a Poisson regression model that uses only the intercept ($\beta_0$) as a parameter. The analogous plot for this non-SST model can be obtained with the following code.

In [None]:
#Only intercept model
X_intercept_only = sm.add_constant(X.drop('Temp_Anomaly', axis = 1))
model_noSST = sm.GLM(Y, X_intercept_only, family = sm.families.Poisson()).fit()
print(model_noSST.summary()) #note that deviance is now much higher
print(model_noSST.aic)

Note that the AIC of the intercept-only model is 817.9, which is much higher than the AIC (756.5) of Model Three with both intercept and SST.

Now, we'll use the exact same code as above to generate the posterior predictive check visualization on our new model.

In [None]:
#Obtain the posterior normal distribution for this model:
posterior_normal_mean = model_noSST.params.values  #this is the mean of the posterior normal approximation
print(posterior_normal_mean)
posterior_normal_covariance = model_noSST.cov_params() #this is the covariance matrix of the posterior normal approximation
print(posterior_normal_covariance)

#Generate posterior samples:
n_samples = 1000
beta_samples = sample_from_posterior(n_samples)

#Generating Posterior Predictive Samples (using beta_samples)
X = X_intercept_only
n = X.shape[0]
X_reshaped = X.values[:, np.newaxis, :]
beta_samples_reshaped = beta_samples[np.newaxis, :, :]
poisson_means = np.exp(np.sum(X_reshaped * beta_samples_reshaped, axis = -1))
poisson_samples_matrix = np.random.poisson(poisson_means)
print(poisson_samples_matrix.shape)

#Calculation of the average posterior predictive distribution
from scipy.stats import poisson
num_rows = X.shape[0]
y_values = np.arange(0, 36)
average_probabilities = np.zeros_like(y_values, dtype=float)
for row in range(num_rows):
    poisson_mean = np.exp(np.dot(posterior_normal_mean, X.iloc[row]))
    average_probabilities += poisson.pmf(y_values, poisson_mean)
average_probabilities /= num_rows



plt.figure(figsize=(10, 6))

# Plot the densities of all samples
for sample in range(poisson_samples_matrix.shape[1]):
    plot_kwargs = {"color": "red", "linewidth": 1, "alpha": 0.2}
    if sample == 0:
        plot_kwargs["label"] = "Posterior Predictive"
    az.plot_kde(poisson_samples_matrix[:, sample], plot_kwargs=plot_kwargs)

# Plot the density of the Y data
az.plot_kde(Y, plot_kwargs={"color": "black", "linewidth": 2, "label": "Observed Response Data"})

# Superimpose the average_probabilities using a blue line
y_values = np.arange(0, 36)  # Corresponding to y = 0, 1, ..., 35
plt.plot(y_values, average_probabilities, color='blue', label="Posterior Predictive Mean", linewidth=2, linestyle='--', alpha = 1)

plt.legend()
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Posterior Predictive Check for model without SST')
plt.show()

<!-- BEGIN QUESTION -->

### 5b) With and Without SST

Compare the two plots above. In your opinion, which model is a better fit for the observed data? Why?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Congratulations! You have finished Lab 6! ##

Below, you will see two cells. Running the first cell will automatically generate a PDF of all questions that need to be manually graded, and running the second cell will automatically generate a zip with your autograded answers. **You are responsible for submitting both the coding portion (the zip from Lab 6) and the written portion (the PDF of written responses from Lab 6) to their respective Gradescope portals.** The coding portion should be submitted to the `Lab 6` assignment as a single zip file, and the written portion should be submitted to the `Lab 6 PDF` assignment as a single PDF file. When submitting the written portion, please ensure you select pages appropriately.

If there are issues with automatically generating the PDF in the first cell, you can try downloading the notebook as a PDF by clicking on `File -> Save and Export Notebook As... -> PDF`. If that doesn't work either, you can manually take screenshots of your answers to the manually graded questions and submit those. Either way, **you are responsible for ensuring your submission follows our requirements; we will NOT be granting regrade requests for submissions that don't follow instructions.**

In [None]:
import matplotlib.image as mpimg
from otter.export import export_notebook
from os import path
from IPython.display import display, HTML
export_notebook("lab06.ipynb", filtering=True, pagebreaks=True)
if path.exists('lab06.pdf'):
    img = mpimg.imread('cute_flemish.jpg')
    imgplot = plt.imshow(img)
    imgplot.axes.get_xaxis().set_visible(False)
    imgplot.axes.get_yaxis().set_visible(False)
    plt.show()
    display(HTML("Download your PDF <a href='lab06.pdf' download>here</a>."))
else:
    print("\nPDF generation failed; please try the other methods described above.")

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)