# Homework 5: Part I

1. Go get data from kaggle.com and do a ***Bayesian Linear Regression*** analysis

```python
import pymc as pm
import numpy as np

n, p = 100, 10
X, y = np.zeros((n, p)), np.ones(n)
# Replace this made-up data with your data set from kaggle...
with pm.Model() as MLR:
    betas = pm.MvNormal('betas', mu=np.zeros(p), cov=np.eye(p), shape=p)
    sigma = pm.TruncatedNormal('sigma', mu=1, sigma=1, lower=0)  # positive only; a half normal would need mu=0
    y_obs = pm.Normal('y_obs', mu=pm.math.dot(X, betas), sigma=sigma, observed=y)

with MLR:
    idata = pm.sample()
```    

2. Choose ***priors*** that are sensible: you might change the ***hyperparameters***, and you can experiment with different distributional families for `sigma`...

3. [Optional] Assess the performance of the MCMC and note any issues or warnings

    1. Traceplots, inference (credible) intervals, effective sample sizes, energy plots, warnings and other notes... just the usual stuff they do [here](https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/pymc_overview.html#pymc-overview)

4. [Optional] Perform ***Multiple Linear Regression*** diagnostics... residual plots, etc.
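
Sampler warnings about chains that have not mixed are typically based on an R-hat style statistic. As a rough illustration of what such a diagnostic measures, here is a minimal NumPy sketch of the classic (non-split) Gelman-Rubin R-hat. The function name `gelman_rubin_rhat` and the simulated chains are made up for this example, and library implementations use more refined variants, so treat the numbers as illustrative only.

```python
import numpy as np

def gelman_rubin_rhat(draws):
    """Classic (non-split) R-hat for an array of shape (n_chains, n_draws)."""
    draws = np.asarray(draws, dtype=float)
    n_chains, n_draws = draws.shape
    chain_means = draws.mean(axis=1)
    # B: between-chain variance, scaled by the number of draws per chain
    between = n_draws * chain_means.var(ddof=1)
    # W: average within-chain variance
    within = draws.var(axis=1, ddof=1).mean()
    # Pooled estimate of the posterior variance
    var_hat = (n_draws - 1) / n_draws * within + between / n_draws
    return float(np.sqrt(var_hat / within))

rng = np.random.default_rng(0)
# Four hypothetical chains that all explore the same distribution
mixed = rng.normal(size=(4, 1000))
# The same chains, but two of them are shifted to sit around a different value
stuck = mixed + np.array([[0.0], [0.0], [3.0], [3.0]])

rhat_mixed = gelman_rubin_rhat(mixed)
rhat_stuck = gelman_rubin_rhat(stuck)
print(rhat_mixed, rhat_stuck)
```

Values close to 1 suggest the chains agree with each other; values well above 1 suggest they are sampling different regions and the run should not be trusted yet.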


In [1]:
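import pymc as pm
import numpy as np
import pandas as pd

data = pd.read_csv('toronto-movies.csv')
# Drop rows with missing values in the columns used below
data = data[['metascore', 'rated', 'genre', 'imdb_rating']].dropna()
# Assumption: 'rated' and 'genre' are categorical, so they are dummy-coded here
X = pd.get_dummies(data[['metascore', 'rated', 'genre']], drop_first=True).to_numpy(dtype=float)
y = data['imdb_rating'].to_numpy(dtype=float)
n, p = X.shape  # use the real dimensions instead of overwriting X and y with placeholder data

with pm.Model() as MLR:
    betas = pm.MvNormal('betas', mu=np.zeros(p), cov=np.eye(p), shape=p)
    sigma = pm.TruncatedNormal('sigma', mu=1, sigma=1, lower=0)  # positive only; a half normal would need mu=0
    y_obs = pm.Normal('y_obs', mu=pm.math.dot(X, betas), sigma=sigma, observed=y)

with MLR:
    idata = pm.sample()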


# Homework 5: Part II
    
## Answer the following with respect to $p(\boldsymbol \beta |\boldsymbol\Sigma, \mathbf{X},\mathbf{y})$ on the previous slide
    
1. Rewrite $p(\boldsymbol \beta |\boldsymbol\Sigma, \mathbf{X},\mathbf{y})$ in terms of $\sigma^2$ (no longer using $\Sigma$) if $\Sigma=\sigma^2I$

2. What is $E[\boldsymbol \beta |\boldsymbol\Sigma, \mathbf{X},\mathbf{y}]$?

3. What ***hyperparameter*** values (legal or illegal) would make $E[\boldsymbol \beta |\boldsymbol\Sigma, \mathbf{X},\mathbf{y}] = (\mathbf{X^\top X})^{-1}\mathbf{X^\top y}$?

4. What ***hyperparameter*** values (legal or illegal) would make $E[  \mathbf{\hat y} = \mathbf{X}\boldsymbol \beta |\boldsymbol\Sigma, \mathbf{X},\mathbf{y}] = \mathbf{X}(\mathbf{X^\top X})^{-1}\mathbf{X^\top y}$?

5. What is $\text{Var}[\boldsymbol \beta |\boldsymbol\Sigma, \mathbf{X},\mathbf{y}]$?

# Homework 5: Part II Answered in Comment Format for Python

# 1. Rewrite p(beta | Sigma, X, y) in terms of sigma^2 (no longer using Sigma) if Sigma=sigma^2I
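Assume the model on the slide is y ~ MVN(X beta, Sigma) with prior beta ~ MVN(beta_0, Sigma_beta), where
beta_0 and Sigma_beta are the prior hyperparameters (this notation is an assumption).
With Sigma = sigma^2I we have Sigma^{-1} = I/sigma^2, so the errors are homoscedastic and independent with
variance sigma^2, and the posterior is
p(beta | sigma^2, X, y) = MVN(E[beta | sigma^2, X, y], Var[beta | sigma^2, X, y]), where
Var[beta | sigma^2, X, y] = (X^T X / sigma^2 + Sigma_beta^{-1})^{-1} and
E[beta | sigma^2, X, y] = Var[beta | sigma^2, X, y] (X^T y / sigma^2 + Sigma_beta^{-1} beta_0).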
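
A worked sketch of where the $\sigma^2$ form comes from, assuming the prior $\boldsymbol\beta \sim \mathcal{MVN}(\boldsymbol\beta_0, \boldsymbol\Sigma_\beta)$ (the symbols $\boldsymbol\beta_0$ and $\boldsymbol\Sigma_\beta$ are assumed names for the prior hyperparameters). Adding the log likelihood and the log prior and collecting the terms in $\boldsymbol\beta$ gives:

$$
\begin{aligned}
\log p(\boldsymbol\beta \mid \sigma^2, \mathbf{X}, \mathbf{y})
&= -\frac{1}{2\sigma^2}(\mathbf{y}-\mathbf{X}\boldsymbol\beta)^\top(\mathbf{y}-\mathbf{X}\boldsymbol\beta)
 - \frac{1}{2}(\boldsymbol\beta-\boldsymbol\beta_0)^\top\boldsymbol\Sigma_\beta^{-1}(\boldsymbol\beta-\boldsymbol\beta_0) + c \\
&= -\frac{1}{2}\boldsymbol\beta^\top\Big(\tfrac{1}{\sigma^2}\mathbf{X}^\top\mathbf{X} + \boldsymbol\Sigma_\beta^{-1}\Big)\boldsymbol\beta
 + \boldsymbol\beta^\top\Big(\tfrac{1}{\sigma^2}\mathbf{X}^\top\mathbf{y} + \boldsymbol\Sigma_\beta^{-1}\boldsymbol\beta_0\Big) + c'
\end{aligned}
$$

This is the kernel of a multivariate normal: the matrix in the quadratic term is the posterior precision, and the posterior mean is the inverse of that matrix applied to the vector in the linear term.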


# 2. What is E[beta | Sigma, X, y]?
E[beta | Sigma, X, y] is the mean of the posterior distribution of the coefficients. For the likelihood
y ~ MVN(X beta, Sigma) and an assumed prior beta ~ MVN(beta_0, Sigma_beta), Bayes' theorem gives a
multivariate normal posterior with mean
E[beta | Sigma, X, y] = (X^T Sigma^{-1} X + Sigma_beta^{-1})^{-1} (X^T Sigma^{-1} y + Sigma_beta^{-1} beta_0),
a precision-weighted combination of the prior mean beta_0 and the generalized least squares estimate.


# 3. What hyperparameter values (legal or illegal) would make E[beta | Sigma, X, y] = (X^T X)^{-1}X^T y?
With an assumed prior beta ~ MVN(beta_0, Sigma_beta) and Sigma = sigma^2I, the posterior mean is
(X^T X / sigma^2 + Sigma_beta^{-1})^{-1} (X^T y / sigma^2 + Sigma_beta^{-1} beta_0).
It equals the ordinary least squares (OLS) estimator (X^T X)^{-1}X^T y when the prior precision vanishes,
Sigma_beta^{-1} = 0. That is an "illegal" improper flat prior with infinite variance; a proper prior with
a very large variance, such as Sigma_beta = cI with large c, only approximates it. A legal special case is
beta_0 = (X^T X)^{-1}X^T y with any Sigma_beta, although that prior is built from the data.


# 4. What hyperparameter values (legal or illegal) would make E[ y_hat = X beta | Sigma, X, y] = X(X^T X)^{-1}X^T y?
By linearity, E[X beta | Sigma, X, y] = X E[beta | Sigma, X, y], so the posterior mean of the fitted values
equals the OLS fit X(X^T X)^{-1}X^T y whenever E[beta | Sigma, X, y] equals the OLS estimator (X^T X)^{-1}X^T y.
Under an assumed prior beta ~ MVN(beta_0, Sigma_beta) with Sigma = sigma^2I, that again means zero prior
precision, Sigma_beta^{-1} = 0 (an improper flat prior, approximated by a very large prior variance),
or a prior mean beta_0 equal to the OLS estimate.
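
A small numerical check of the two answers above, written as a sketch under stated assumptions: a conjugate prior beta ~ MVN(beta_0, cI), a known sigma^2, and simulated data. The helper `posterior_mean` and all of the numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 50, 3, 0.5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)

# OLS estimate (X^T X)^{-1} X^T y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

def posterior_mean(prior_mean, prior_var_scale):
    """Posterior mean of beta for the prior MVN(prior_mean, prior_var_scale * I)."""
    prior_precision = np.eye(p) / prior_var_scale
    precision = X.T @ X / sigma2 + prior_precision
    return np.linalg.solve(precision, X.T @ y / sigma2 + prior_precision @ prior_mean)

beta_0 = np.full(p, 10.0)  # prior mean deliberately far from the data
# Largest coefficient gap to OLS as the prior variance grows
gaps = [np.max(np.abs(posterior_mean(beta_0, c) - beta_ols)) for c in (1.0, 1e3, 1e6)]
# The fitted values X E[beta | ...] approach the OLS fit as well
fitted_gap = np.max(np.abs(X @ posterior_mean(beta_0, 1e6) - X @ beta_ols))
# A prior centered exactly at the OLS estimate reproduces it for any prior variance
centered_gap = np.max(np.abs(posterior_mean(beta_ols, 1.0) - beta_ols))
print(gaps, fitted_gap, centered_gap)
```

The gap shrinks as c grows but is exactly zero only in the flat-prior limit, which is why the flat prior counts as an "illegal" hyperparameter choice.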


# 5. What is Var[beta | Sigma, X, y]?
Var[beta | Sigma, X, y] is the covariance matrix of the posterior distribution of the coefficients. For the
likelihood y ~ MVN(X beta, Sigma) and an assumed prior beta ~ MVN(beta_0, Sigma_beta), it is
Var[beta | Sigma, X, y] = (X^T Sigma^{-1} X + Sigma_beta^{-1})^{-1},
so the posterior precision is the data precision plus the prior precision. With Sigma = sigma^2I this becomes
(X^T X / sigma^2 + Sigma_beta^{-1})^{-1}. It reduces to sigma^2(X^T X)^{-1}, the covariance matrix of the
OLS estimator, only in the flat-prior limit Sigma_beta^{-1} = 0; any proper prior makes the posterior
covariance smaller. The result does not depend on y or on beta_0: a more informative X^T X, a smaller
error variance sigma^2, or a tighter prior each reduce the uncertainty in the coefficients.
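
The relationship between the posterior covariance and sigma^2(X^T X)^{-1} can be checked numerically. This is a sketch under the same assumptions as the formulas above (conjugate prior with covariance cI, known sigma^2); the helper `posterior_cov` and the simulated design matrix are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 40, 3, 2.0
X = rng.normal(size=(n, p))  # y is not needed: the covariance depends on the data only through X

# Covariance of the OLS estimator, sigma^2 (X^T X)^{-1}
ols_cov = sigma2 * np.linalg.inv(X.T @ X)

def posterior_cov(prior_var_scale):
    """Posterior covariance of beta for the prior covariance prior_var_scale * I."""
    precision = X.T @ X / sigma2 + np.eye(p) / prior_var_scale
    return np.linalg.inv(precision)

tight, diffuse = posterior_cov(0.1), posterior_cov(1e8)
# ols_cov - tight should be positive definite, because the prior adds precision
min_eig_tight = np.linalg.eigvalsh(ols_cov - tight).min()
# With a nearly flat prior the two covariance matrices almost coincide
max_gap_diffuse = np.max(np.abs(ols_cov - diffuse))
print(min_eig_tight, max_gap_diffuse)
```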



# Homework 5: Part III

1. Go get data from kaggle.com and perform inference for a ***Bayesian Multivariate Normal Model***

<SPAN STYLE="font-size:18.0pt">

```python
import numpy as np
import pymc as pm
from scipy import stats

p = 10
Psi = np.eye(p)
a_cov = stats.invwishart(df=p+2, scale=Psi).rvs(1)
n = 1000
y = stats.multivariate_normal(mean=np.zeros(p), cov=a_cov).rvs(size=n)
# Replace this made-up data with your data set from kaggle...

with pm.Model() as MNV_LKJ:
    packed_L = pm.LKJCholeskyCov("packed_L", n=p, eta=2.0,
                                 sd_dist=pm.Exponential.dist(1.0, shape=p), compute_corr=False)
    L = pm.expand_packed_triangular(p, packed_L)
    # Sigma = pm.Deterministic('Sigma', L.dot(L.T)) # Don't use a covariance matrix parameterization
    mu = pm.MvNormal('mu', mu=np.zeros(p), cov=np.eye(p), shape=p)
    # y = pm.MvNormal('y', mu=mu, cov=Sigma, shape=(n,1), observed=y)
    # Figure out how to parameterize this with a Cholesky factor to improve computational efficiency
with MNV_LKJ:
    idata = pm.sample()
```    
</SPAN>

2. As indicated above, don't use a covariance matrix parameterization and instead figure out how to parameterize this with a ***Cholesky factor*** to improve computational efficiency. With ${\boldsymbol \Sigma} = \mathbf{L}\mathbf{L}^\top$ for a $p \times p$ lower triangular $\mathbf{L}$, the ***Cholesky***-based formulation replaces the general $O(p^3)$ computation of $\det({\boldsymbol \Sigma})$ with a simple $O(p)$ product, $\det({\boldsymbol \Sigma}) = \prod_i L_{ii}^2$, and replaces the general $O(p^3)$ computation of ${\boldsymbol \Sigma}^{-1}$ with $O(p^2)$ triangular solves by ***forward and backward substitution***.

3. Specify ***priors*** that work: you'll likely need to change the ***prior hyperparameters*** for $\boldsymbol \mu$ (`mu`) and $\mathbf{R}$ (`packed_L`)...
    1. And you could consider adjusting the ***prior*** for $\boldsymbol \sigma$ using `sd_dist`...

4. [Optional] Assess the performance of the MCMC and note any issues

    1. Traceplots, inference (credible) intervals, effective sample sizes, energy plots, warnings and other notes... just the usual stuff they do [here](https://www.pymc.io/projects/docs/en/stable/learn/core_notebooks/pymc_overview.html#pymc-overview)
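
The Cholesky identities mentioned in the list above can be illustrated outside of PyMC. The following is a minimal NumPy/SciPy sketch, not PyMC's internal implementation: the function `mvn_logpdf_chol` and the test matrix are made up for this example, and the result is compared against `scipy.stats.multivariate_normal`.

```python
import numpy as np
from scipy import stats
from scipy.linalg import solve_triangular

def mvn_logpdf_chol(y, mu, L):
    """Log density of MVN(mu, L L^T) at y, using only the lower Cholesky factor L."""
    p = mu.shape[0]
    # log det(Sigma) = 2 * sum(log(diag(L))): an O(p) sum once L is known
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    # Solve L z = (y - mu) by substitution in O(p^2);
    # then z^T z equals (y - mu)^T Sigma^{-1} (y - mu)
    z = solve_triangular(L, y - mu, lower=True)
    return -0.5 * (p * np.log(2.0 * np.pi) + log_det + z @ z)

rng = np.random.default_rng(3)
p = 5
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)   # a made-up positive definite covariance
L = np.linalg.cholesky(Sigma)     # lower triangular, Sigma = L @ L.T
mu, y = np.zeros(p), rng.normal(size=p)

logp_chol = mvn_logpdf_chol(y, mu, L)
logp_ref = stats.multivariate_normal(mean=mu, cov=Sigma).logpdf(y)
print(logp_chol, logp_ref)
```

When the model is parameterized by `L` directly, as with the `chol=` argument of `pm.MvNormal`, the factorization step `np.linalg.cholesky` is not needed at all, and only the cheap sum and triangular solve remain.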



In [4]:
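import pymc as pm
import numpy as np
from scipy import stats

# Simulated stand-in data; replace y with an (n, p) array from a kaggle data set
p = 10
Psi = np.eye(p)
a_cov = stats.invwishart(df=p+5, scale=Psi).rvs(1)
n = 1000
y = stats.multivariate_normal(mean=np.zeros(p), cov=a_cov).rvs(size=n)

with pm.Model() as MNV_LKJ:
    # LKJ prior on the correlation matrix (eta=1.0 is uniform over correlation matrices),
    # Exponential(1) prior on each of the p standard deviations
    packed_L = pm.LKJCholeskyCov("packed_L", n=p, eta=1.0,
                                 sd_dist=pm.Exponential.dist(1.0, shape=p), compute_corr=False)
    L = pm.expand_packed_triangular(p, packed_L)
    # Sigma is stored for reporting only; the likelihood below uses L directly
    Sigma = pm.Deterministic('Sigma', L.dot(L.T))
    # cov=np.eye(p)*1e-5 is a very tight prior that pins mu near 0; widen it for real data
    mu = pm.MvNormal('mu', mu=np.zeros(p), cov=np.eye(p)*1e-5, shape=p)
    y_obs = pm.MvNormal('y_obs', mu=mu, chol=L, observed=y)

with MNV_LKJ:
    trace = pm.sample(500)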