<img src="../../shared/img/banner.svg" width=2560></img>

# Lab 09 - Bayesian Linear Regression

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path

from client.api.notebook import Notebook
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns

import shared.src.utils.util as shared_util

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
ok = Notebook("ok/config")

## Learning Objectives

1. Become comfortable designing linear regression models based on visualizations of data.
2. Practice using PyMC3 to specify a linear regression model in terms of priors and likelihoods.
3. Draw inferences from the posterior of a linear regression model and relate the parameters to predictions.

In this week's lab,
you'll develop a linear regression model:
a model in which the parameters of one variable's distribution
are a linear function of another variable.
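
For one common choice, a Normal likelihood, the definition above can be written out as follows.
The symbols $\alpha$ (intercept), $\beta$ (slope), and $\sigma$ (noise scale)
are notation assumed here for illustration; your own model may use other names or another likelihood.

$$
y_i \sim \text{Normal}(\mu_i, \sigma), \qquad \mu_i = \alpha + \beta x_i
$$

In this version, the parameter of $y$'s distribution that depends linearly on $x$ is the mean $\mu_i$.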

You will be given the choice of dataset,
the opportunity to select variables to relate to one another,
and the freedom to design an accompanying linear regression model.

## Loading the Datasets

The `seaborn` library can download a number of "demonstration" datasets,
many of which are classic datasets in statistics.

The sections below load and describe two datasets from this collection.

They are saved as `csv` files in the `content/shared/data` folder of this course
and loaded into the Python workspace as `DataFrame`s.

In [None]:
shared_data_dir = Path(".") / ".." / ".." / "shared" / "data"

#### `iris`

The `iris` dataset was also used in Lab 06. The description there is reproduced below.

The `iris` dataset has a long history:
it was introduced by Ronald Fisher in the 1930s
to develop early ideas in statistical classification.

In [None]:
iris = sns.load_dataset("iris", data_home=shared_data_dir)
iris.columns

The dataset contains anatomical measurements of three different `species` of the iris flower:

In [None]:
iris["species"].unique()

The measurements are of the `length` and `width` of two components of the flower:
the `petal` and `sepal`, pictured below.

In [None]:
Image(url="https://upload.wikimedia.org/wikipedia/commons/7/78/Petal-sepal.jpg", width=250)

Petals are the component that most people associate with flowers.
Sepals are a more leaf-like, typically green component that primarily serves to protect flowers before they bloom.

The question behind the `iris` dataset is whether
these anatomical features can be used to predict the `species`.

For more information about this dataset, see [Kaggle](https://www.kaggle.com/arshid/iris-flower-dataset).

#### `mpg`

The `auto-mpg` dataset is another common example dataset in data science,
less classic than `iris` but still widely used.

In [None]:
mpg = sns.load_dataset("mpg", data_home=shared_data_dir)
mpg.columns

It contains information on the engines of cars produced between 1970 and 1982 in the United States, Europe, and Japan.

In [None]:
mpg["model_year"].unique(), mpg["origin"].unique()

The information includes the number of
[`cylinders`](https://en.wikipedia.org/wiki/Cylinder_(engine)) in the engine
(the famous [V8 engine](https://en.wikipedia.org/wiki/V8_engine) is named for its 8 cylinders),
the size of the engine
(technically [`displacement`](https://en.wikipedia.org/wiki/Engine_displacement),
a measure of the volume acted on by the pistons),
the `horsepower`, a measure of the engine's ability to move mass over distance and time,
the overall `weight` of the vehicle in pounds,
and the `acceleration`, whose meaning and units are somewhat opaque.

In [None]:
mpg.sample(10)

These values are typically used to try to predict the `mpg` variable:
the number of `m`iles the car travels `p`er `g`allon of gas
([the city mpg, rather than highway](https://www.kbb.com/what-is/highway-mpg/)),
also known as the _gas mileage_ or _mileage_.

You are free to relate any pair of numerical variables to one another in your regression model.

For more information about this dataset, see [Kaggle](https://www.kaggle.com/uciml/autompg-dataset).

## Visualizing the Data

The first thing to do when you embark on a new analysis,
especially with new data,
is to visualize the data.

#### Use `pairplot` to visualize the dataset you've chosen to work with.

#### Q Select a pair of variables that appear to be linearly related to one another. Why do you suspect that the relationship is linear? Use the visual features of the `pairplot`.

## Creating a Model of the Data

#### For the variables you described above, define a linear regression model and use it to infer a posterior distribution over the slope and intercept parameters of the relationship.

You are encouraged, but not required,
to use some of the advanced linear regression concepts
covered in the second set of slides,
e.g. robust regression, ridge regression, and LASSO regression.

You do not need to standardize your data with $z$-scoring,
but doing so might make it easier to set a ROPE below.
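
If you do choose to standardize, a minimal sketch of $z$-scoring is below.
The helper name `z_score` and the example `weights` values are hypothetical, chosen only for illustration.

```python
import numpy as np


def z_score(values):
    """Subtract the mean and divide by the standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# hypothetical values standing in for one column of your dataset
weights = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0])
z_weights = z_score(weights)  # now has mean 0 and standard deviation 1
```

When both variables are standardized, the slope is measured in standard deviations of one variable
per standard deviation of the other,
so a ROPE such as $\pm 0.1$ keeps the same interpretation whatever the original units were.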

#### Q Identify the "prior" component of your model and explain your choice of distribution for each random variable in it.

#### Q Identify the "likelihood" component of your model and explain your choice of distribution for each random variable in it.

## Drawing an Inference about the Presence of a Relationship

Draw samples from the posterior of your model
(you'll want at least 1000, perhaps more).
Then visualize the posterior over the
slope parameter of your model
and obtain a 95% highest posterior density interval (HPDI).

Include a Region of Practical Equivalence (ROPE) using the `rope` kwarg of `pm.plot_posterior`.
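
To make the interval and the ROPE concrete, here is a hand-rolled sketch
that computes both directly from an array of samples.
It is an illustration, not the method used inside `pm.plot_posterior`:
the helpers `hpd_interval` and `fraction_in_rope` are hypothetical,
the interval calculation assumes a unimodal posterior,
and `slope_samples` is random stand-in data rather than output from a real model.

```python
import numpy as np


def hpd_interval(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples (assumes one mode)."""
    sorted_samples = np.sort(np.asarray(samples, dtype=float))
    n = sorted_samples.size
    n_in = int(np.ceil(mass * n))  # number of samples inside the interval
    # width of every candidate interval spanning n_in consecutive sorted samples
    widths = sorted_samples[n_in - 1:] - sorted_samples[: n - n_in + 1]
    start = int(np.argmin(widths))
    return sorted_samples[start], sorted_samples[start + n_in - 1]


def fraction_in_rope(samples, rope):
    """Fraction of samples that fall inside the ROPE (inclusive bounds)."""
    low, high = rope
    samples = np.asarray(samples, dtype=float)
    return float(np.mean((samples >= low) & (samples <= high)))


rng = np.random.default_rng(0)
# stand-in for posterior samples of the slope; replace with samples from your trace
slope_samples = rng.normal(loc=0.8, scale=0.1, size=4000)

rope = (-0.1, 0.1)  # hypothetical ROPE around "no relationship"
hpd_low, hpd_high = hpd_interval(slope_samples)
in_rope = fraction_in_rope(slope_samples, rope)
```

If the interval lies entirely outside the ROPE and `in_rope` is close to zero,
the samples give little support to a slope that is practically equivalent to zero.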

#### Use the visualization of the posterior samples, the ROPE, and the 95% HPDI to answer the question(s) below.

#### Q What does your posterior tell you about the probable nature of the relationship between the two variables? Relate your claim to the features of the posterior, the 95% HPDI, and the ROPE, or to the value of a function computed on the posterior samples.

Regression models are more commonly used for prediction
than for hypothesis testing,
and so maximum a posteriori (MAP) estimation is often natural.

#### Find the MAP parameters of your model and plot the predictions associated with those parameters on top of the data.
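
A minimal sketch of turning point estimates into predictions is below.
It assumes the MAP estimate is available as a dictionary keyed by parameter name
(in PyMC3, `pm.find_MAP` returns a dictionary of this kind).
The parameter names `intercept` and `slope`, the numerical values, and the helper `predict_line`
are hypothetical; use the names from your own model.

```python
import numpy as np


def predict_line(x, intercept, slope):
    """Predicted mean of y at each x for a straight-line model."""
    return intercept + slope * np.asarray(x, dtype=float)


# hypothetical MAP values; in practice these come from your fitted model
map_estimate = {"intercept": 1.5, "slope": -0.5}

# evenly spaced x values covering the (hypothetical) range of the data
x_grid = np.linspace(0.0, 4.0, 5)
y_hat = predict_line(x_grid, map_estimate["intercept"], map_estimate["slope"])
```

Passing `x_grid` and `y_hat` to `plt.plot` after drawing a scatter plot of the data
overlays the predicted line on the observations.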