# Linear Regression Exercises

In [1]:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import seaborn as sns
import scipy
from scipy import stats

What I want people to do:

* Load the Palmer Penguins dataset
* Run a linear regression on body mass using only Adelie penguins
* Use the posterior predictive to understand how many penguins are above X grams
* Use two other features to predict penguin mass

* Expand the regression to all species, using only bill length
  * How many beta parameters do we have now?
  * What is the slope for each species?

## Load Penguins Dataset

In [2]:
penguins = pd.read_csv("https://raw.githubusercontent.com/BayesianModelingandComputationInPython/BookCode_Edition1/main/data/penguins.csv")

In [3]:
penguins.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | male | 2007 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | female | 2007 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | female | 2007 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 2007 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | female | 2007 |


### Set random seed
Just for reproducibility in this class

In [4]:
RANDOM_SEED = 8296
np.random.seed(RANDOM_SEED)

## Exercise 1: Intercept-only regression
The Palmer Penguin dataset has been loaded for you. For the first exercise we want to focus just on the Adelie penguins. 

Specifically, we would like you to:

1. Perform some exploratory data analysis to get a sense of the data
2. Filter to only the observations of Adelie penguins that have *complete* data
3. Write the model for an intercept-only regression for `body_mass_g`
4. Sample from the prior predictive
5. Use MCMC sampling to estimate the model parameters
6. Using the ArviZ posterior plot, plot the posterior estimates
   * What is the 75% HDI for these parameters?
7. Plot the posterior predictive
   * How does it compare to the prior predictive?


### Beta Feedback question
In Exercise 1 we left some code blocks with commented hints below. Let us know whether these hints help "just the right amount" or are too helpful. The exercises should be challenging, because that is how you learn, but we want to strike the right balance. Your feedback will help us do that.

In [5]:
# EDA

In [6]:
# Adelie Filter

In [7]:
# Missing Data Filter
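
If you get stuck on the two filtering steps, here is one possible sketch. To stay self-contained it rebuilds only the five rows shown by `penguins.head()` above as a stand-in for the full dataset; the names `penguins_sample`, `adelie`, and `adelie_complete` are illustrative choices, not names the notebook requires.

```python
import numpy as np
import pandas as pd

# Stand-in for the full dataset: the five rows displayed by penguins.head()
penguins_sample = pd.DataFrame(
    {
        "species": ["Adelie"] * 5,
        "island": ["Torgersen"] * 5,
        "bill_length_mm": [39.1, 39.5, 40.3, np.nan, 36.7],
        "bill_depth_mm": [18.7, 17.4, 18.0, np.nan, 19.3],
        "flipper_length_mm": [181.0, 186.0, 195.0, np.nan, 193.0],
        "body_mass_g": [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
        "sex": ["male", "female", "female", None, "female"],
        "year": [2007] * 5,
    }
)

# Adelie filter: boolean mask on the species column
adelie = penguins_sample[penguins_sample["species"] == "Adelie"]

# Missing data filter: drop every row with a missing value in any column
adelie_complete = adelie.dropna(how="any")

print(len(adelie), len(adelie_complete))
```

On the real `penguins` DataFrame the same two lines should apply unchanged; whether you drop rows with a missing value in any column or only in the columns your model uses is a design choice worth stating explicitly.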

In [19]:
# Model, including prior predictive

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [intercept, eps]


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 13 seconds.


In [1]:
# Plot Prior predictive

In [None]:
# Sample from model

In [None]:
# Plot Posterior
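
For the 75% HDI question, the following sketch shows the relevant ArviZ calls on stand-in data. It assumes the ArviZ 0.x-style API (`az.from_dict(posterior=...)`, `hdi_prob`), and the posterior draws are synthetic numbers generated with NumPy rather than real MCMC output; with your own results you would pass the `InferenceData` returned by `pm.sample` instead of `idata_demo`.

```python
import arviz as az
import numpy as np

rng = np.random.default_rng(8296)

# Synthetic posterior draws shaped (chain, draw); stand-ins for real MCMC output
idata_demo = az.from_dict(
    posterior={
        "intercept": rng.normal(3700, 40, size=(4, 1000)),
        "eps": np.abs(rng.normal(450, 30, size=(4, 1000))),
    }
)

# 75% highest density interval for every parameter, returned as an xarray Dataset
hdi_75 = az.hdi(idata_demo, hdi_prob=0.75)
print(hdi_75)

# The same interval drawn on the posterior plot
axes = az.plot_posterior(idata_demo, var_names=["intercept", "eps"], hdi_prob=0.75)
```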

In [2]:
# Sample from posterior predictive

In [None]:
# Plot Posterior Predictive

## Exercise 2: Adding Regressors

Still using the Adelie dataset, perform two regressions. [This diagram](https://github.com/allisonhorst/palmerpenguins#bill-dimensions) will give you an intuition for what each bill measurement means.

From a code perspective, be sure to use `dims` and `coords`!

_Hint_: When plotting the posterior, the `var_names` argument is quite helpful for focusing on the plots you care about

1. Perform a regression with `bill_length_mm` only
2. Perform a regression with `bill_length_mm` and `bill_depth_mm`

For each regression, plot the posterior parameters.

## Exercise 3: Multiple Species
In this exercise we'll introduce all the species and perform a regression for each.

* Expand the regression to include all penguins, still using `bill_length_mm` and `bill_depth_mm` as the regressors
* Plot the posterior distribution. Verify the slopes for the Adelie penguins are the same as the previous model
  * _Hint_: the xarray `.sel` method will be helpful here