<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Regression 01

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path
import random

from IPython.display import HTML, Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

# Over the past few weeks, we've focused on inference in models with categorical independent variables and continuous dependent variables.

For example, how does the performance of subjects on a working memory task depend on their age and the difficulty of the task?

# This approach is very general, because we can turn anything into a category.

For the first part of this lecture, we'll work with a famous dataset:
Sir Francis Galton's parent-child height dataset ([source](https://doi.org/10.7910/DVN/T0HSJ1)),
for which the technique of regression was invented and named
([original paper](http://www.stat.ucla.edu/~nchristo/statistics100C/history_regression.pdf)).

In [None]:
df = pd.read_csv("./data/galton_height.csv", index_col=0)

It contains the heights of nearly 1000 English individuals, their sex, and the heights of both their parents, collected in 1885.

Following Galton, we summarize the parental heights by averaging them to obtain a "`midparental_height`".

In [None]:
df.head()

Galton did not have access to `matplotlib`, so he took stock of his data by means of a table:

In [None]:
Image("img/galton_table.png")

The table in the center is the joint distribution.
The horizontal axis of the table (aka the columns)
is the adult height of the children,
while the vertical axis (aka the rows)
is the midparental height.

The totals within each column and row--the
marginal, unnormalized histograms--
are indicated in the second row from the bottom (**Totals**)
and in the third column from the right (**Total Number of Mid-parents**).

Note: Galton didn't have `matplotlib` or Python,
but he did have a "computer":
a clerk who did all of his calculations for him!

He also wasn't able to make a kernel density estimate,
but he did the next best thing:
he averaged the values in each cell with its four neighbors,
which he referred to as "smoothing".
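
To make that smoothing step concrete, here is a minimal sketch of averaging each cell of a table with its four neighbors.
The function name `smooth_with_neighbors` and the toy table are made up for illustration,
and the handling of cells on the border (reusing the cell's own value for missing neighbors) is an assumption:
we don't know how Galton treated the edges of his table.

```python
import numpy as np


def smooth_with_neighbors(counts):
    """Average each cell with its up, down, left, and right neighbors."""
    counts = np.asarray(counts, dtype=float)
    # Assumption: border cells reuse their own value for missing neighbors.
    padded = np.pad(counts, 1, mode="edge")
    return (
        padded[1:-1, 1:-1]   # the cell itself
        + padded[:-2, 1:-1]  # neighbor above
        + padded[2:, 1:-1]   # neighbor below
        + padded[1:-1, :-2]  # neighbor to the left
        + padded[1:-1, 2:]   # neighbor to the right
    ) / 5


# A toy table with a single spike in the middle
toy_table = np.array([[0, 0, 0],
                      [0, 5, 0],
                      [0, 0, 0]])
smoothed = smooth_with_neighbors(toy_table)
print(smoothed)
```

The spike of 5 is spread evenly over the center cell and its four neighbors, which is the same qualitative effect a kernel density estimate has.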

We might make a similar summary by means of `jointplot`:

In [None]:
sns.jointplot(x="midparental_height", y="height",  data=df, kind="hex");

Unlike Galton, we don't modify the heights of the female children,
so our results are slightly different.

Notice the "marginal" distributions appear in roughly the same place in this plot
as they do in Galton's table!
_Plus ça change_.

It seems like the child's height can be predicted from the parents' height:
if the parents are taller than average, the child is also taller than average.

But how can we quantify this?

If the data were categorical, we'd know what to do.

## We can redefine numerical variables as categorical variables by binning them.

Let's think of the bins as group labels:
if I am in the bin of heights between 64 and 66 inches,
then I am in the "group" of "64-66 inches".

In [None]:
def categorize_height(height):
    if height <= 64:
        return "<64"
    if height <= 66:
        return "64-66"
    if height <= 68:
        return "66-68"
    if height <= 70:
        return "68-70"
    return ">70"

This function converts a "numerical height" into a "categorical height":
the inputs are numbers and the outputs are strings indicating to which bin the numerical height belonged.
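
As an aside, `pandas` has a built-in for this kind of binning, `pd.cut`.
The sketch below should reproduce the behavior of `categorize_height` on a handful of made-up heights;
the names `bin_edges`, `bin_labels`, and `example_heights` are ours, not part of the lecture's data.

```python
import numpy as np
import pandas as pd

bin_edges = [-np.inf, 64, 66, 68, 70, np.inf]
bin_labels = ["<64", "64-66", "66-68", "68-70", ">70"]

# Made-up heights, chosen to land on and around the bin edges
example_heights = pd.Series([63.0, 64.0, 65.5, 66.0, 70.0, 70.1])

# right=True (the default) makes each bin include its upper edge,
# matching the `<=` comparisons in categorize_height.
example_categories = pd.cut(example_heights, bins=bin_edges, labels=bin_labels)
print(example_categories.tolist())
```

We stick with the hand-written function in this lecture because it makes the binning rule explicit.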

In [None]:
df["midparental_height_category"] = df["midparental_height"].apply(
    categorize_height)

In [None]:
df.sample(5)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.pointplot(x="midparental_height_category", y="height", data=df, ax=ax);

If we visualize the resulting data with one of our categorical visualizations,
like the `pointplot` or the `violinplot`,
we can apply the techniques we developed for other categorical problems.

Do the means appear different across the levels of the midparental height variable?

For convenience,
we'd like to work with indices:
category labels that are numbers beginning at 0.

In [None]:
category_to_idx = {
    "<64"   : 0,
    "64-66" : 1,
    "66-68" : 2,
    "68-70" : 3,
    ">70"   : 4}

df["midparental_height_category_idx"] = df["midparental_height_category"].map(category_to_idx)
df.sample(5)

This cell uses an alternative to `apply` called `map`.
The result is similarly a `Series` with transformed values.

In addition to taking in a function,
[`map`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html)
can take in a dictionary.
Each entry in the calling `Series` is passed as a key to the dictionary,
and the resulting value becomes the entry in the output `Series`.
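
One thing to watch for when passing a dictionary to `map`:
entries that are missing from the dictionary do not raise an error, they become `NaN`.
A small sketch with made-up labels (`toy_labels` and `toy_lookup` are illustrative names):

```python
import pandas as pd

toy_labels = pd.Series(["<64", ">70", "tall"])  # "tall" is not a key below
toy_lookup = {"<64": 0, ">70": 4}

toy_indices = toy_labels.map(toy_lookup)
print(toy_indices)

# Counting missing values is a quick check that every label was mapped
print(toy_indices.isna().sum())
```

Because `NaN` is a floating-point value, the output `Series` also switches from integers to floats, which would break its later use as an index.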

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.stripplot(x="midparental_height_category_idx", y="height", data=df, ax=ax);

A `stripplot` is a convenient, easy-to-read alternative to the `violinplot`.
It still displays the entire distribution of the data,
not just summary statistics,
but it doesn't do any smoothing.
All of the data from a given group are plotted as a "strip".
In this case, the strips are vertical.
Within a strip, the horizontal position is meaningless:
points are simply "jittered" in order to make the density easier to see.

It is analogous to the `rugplot`,
but more useful for continuous data that also varies along a categorical variable.

As a reminder:
in the language of mixture distributions,
we are thinking of the marginal distribution of individuals' heights
as a mixture of distributions, one for each `midparental_height_category`:

In [None]:
f, axs = plt.subplots(figsize=(12, 12), nrows=2, sharex=True, sharey=True)
sns.distplot(df["height"], color="k", ax=axs[0], label="Marginal", axlabel=""); axs[0].legend();
[sns.distplot(df["height"][selector], ax=axs[1])
 for selector in [df["midparental_height_category_idx"]  == idx for idx in range(5)]];

## Now that we have a categorical variable, we can use our categorical effect modeling tools.

At this point,
writing down a model like this should be straightforward for you.

Make sure it's clear why we've chosen each of the components,
which parts are the prior and which are the likelihood,
and which parts we'll get a posterior over when we sample.

$$
\text{group_means} \sim \text{Normal}(\mu_g, 1e3, \text{shape}=5)\\
\sigma \sim \text{Exponential}\left(\frac{1}{\sigma_p}\right)\\
\text{height} \sim \text{Normal}\left(\text{group_means}[i], \sigma\right)
$$

Where $\mu_g$ is the grand mean and
$\sigma_p$ is the pooled standard deviation,
as below:

In [None]:
grand_mean = df.groupby("midparental_height_category_idx")["height"].mean().mean()
pooled_sd = df.groupby("midparental_height_category_idx")["height"].std().mean()

In [None]:
with pm.Model() as categorical_model:
    group_means = pm.Normal("mus", mu=grand_mean, sd=1e3, shape=5)
    sd = pm.Exponential("sigma", lam=1 / pooled_sd)
    
    heights = pm.Normal("heights",
                        mu=group_means[df["midparental_height_category_idx"]],
                        sd=sd,
                        observed=df["height"])

In [None]:
with categorical_model:
    categorical_trace = pm.sample(tune=1000)

categorical_posterior_df = shared_util.samples_to_dataframe(categorical_trace)

In [None]:
pm.plot_posterior(categorical_trace, figsize=(12, 18), text_size=16,
                  ref_val=grand_mean, varnames=["mus"]);

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.pointplot(x="midparental_height_category_idx", y="height", data=df, ax=ax);
[plt.plot(sample["mus"], color="C0", alpha=0.05)
 for _, sample in categorical_posterior_df.iloc[:100, :].iterrows()];

In the terms of the models we've built so far,
there appears to be "an effect" of `midparental_height_category`
on the `height` of the child:
the means of the heights of the children in each category of midparental height
appear to differ.
Think about how you might quantify this.

Note that we can't say anything more cogent about the relationship,
because the categorical index is arbitrary -- more on that below.

## With this model in hand, we can make predictions about the height of children from their mid-parental height.

By "prediction" I don't necessarily mean "statement about the future"
-- though in this case, you could perhaps predict the height of a pair's children before they are born.

Instead, in statistical modeling we can call a statement about any unknown quantity a "prediction",
not just quantities that are unknown because they are realized in the future.

Other roughly equivalent terms you might use here include "guess" and "estimate".

A good guess/estimate/prediction is "close", in some sense, to the correct answer.

In regression models,
prediction/estimation is often more important than effect detection.

## In a categorical model with a Normal likelihood, the mean is used as the prediction.

$$\text{height} \approx \text{group_means}[i]$$

where $\approx$ is pronounced "is approximately equal to".
It means here something like "is close to", in some loose sense.

If our predictions are good, then our predictions are typically close to the true values.

Note that this is different from $\sim$, the symbol we use to mean that the variable on the left side
has the distribution on the right.

In [None]:
def predict_height_categorical(midparental_height, parameters):
    means = parameters["mus"]
    midparental_height_category = categorize_height(midparental_height)
    midparental_height_idx = category_to_idx[midparental_height_category]
    group_mean = means[midparental_height_idx]
    
    return group_mean

In [None]:
predict_height_categorical(66, categorical_posterior_df.iloc[-1])

# But for data where numerical values are meaningful, categories are unnatural.

In truly categorical data,
the order of the values doesn't matter.

Perhaps we might be interested in whether
there are differences in the rate at which cars of different colors get tickets.
Colors don't have a natural order to them, so it makes sense to treat color as a category.

But if we were interested in whether
the rate of getting tickets is dependent on the speed a car is traveling,
the situation is very different:
we have reason to believe there is a _natural order_ to speeds,
from low to high, and that the relationship is simplest in terms of this natural order.

To demonstrate this point,
let's look at the midparental height data
with the category indices reordered.

Changing the order of the categories
shouldn't affect the analysis of categorical data,
since our indices are arbitrary,
but for this data, the picture that emerges is very different.

In [None]:
new_order = [4, 2, 3, 1, 0]

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.pointplot(x="midparental_height_category_idx", y="height", data=df, ax=ax,
              order=new_order);
[plt.plot(sample["mus"][new_order], color="C0", alpha=0.05)
 for _, sample in categorical_posterior_df.iloc[:100, :].iterrows()];

The relationship between midparental height and the height of the child now looks complicated.

# If we instead retain the numerical meaning of the independent variable, the relationship often appears simpler.

Let's instead plot the categories in the original, natural order,
along with the numerical values to which they correspond.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.pointplot(x="midparental_height_category", y="height", data=df, ax=ax,
              order=list(category_to_idx));  # same order as the indices 0 through 4
[plt.plot(sample["mus"], color="C0", alpha=0.05)
 for _, sample in categorical_posterior_df.iloc[:100, :].iterrows()];
# plt.plot([0, 4], [65, 70], color="k", lw=6);

In the right order,
there appears to be a relatively simple relationship between
midparent height and child height:
increasing the midparent height from one category to the next,
that is, by about two inches,
appears to increase the average height by something slightly less
than two inches.
This effect appears reasonably consistent across the categories.

Whenever a fixed change in one variable produces a fixed change in another,
no matter where we start, the relationship between the two is described by a line:
it is _linear_.

By uncommenting the final line in the above cell,
you can plot a straight line over the data
that passes close by the average heights of each group.

We can plot the same information on top of the raw data, for comparison.
It does appear that the line passes
close to the middle of each group's values.

Furthermore, the groups appear to be fairly clustered around their central values.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.stripplot(x="midparental_height_category_idx", y="height", data=df, ax=ax);
[plt.plot(sample["mus"], color="C0", alpha=0.05)
 for _, sample in categorical_posterior_df.iloc[:100, :].iterrows()];
plt.plot([0, 4], [64, 70], color="k", lw=6);

We can plot that same line over the original data:

In [None]:
g = sns.jointplot(x="midparental_height", y="height", data=df)
g.ax_joint.plot([63, 71], [64, 70], color="k", lw=6);

For any given choice of midparental height,
this line appears to come close to most of the heights of the children.

# As with the categorical model, we can use this relationship to make predictions.

For the categorical model, we had

$$\text{height} \approx \text{group_means}[i] $$

Consider the classic formula for a line:

$$y = m \cdot x + b$$

In [None]:
slope = 1.5; intercept = 0.75
xs = pd.Series(range(-3, 4))
ys = slope * xs + intercept

In [None]:
f, ax = plt.subplots(figsize=(10, 10)); #plt.axis("equal");
ax.vlines(0, *[-4, 4], lw=6); ax.hlines(0, *[-4, 4], lw=6);
ax.plot(xs, ys, lw=4, label=rf"$y = {slope}\cdot x + {intercept}$", zorder=3);  # raw string keeps \cdot intact
ax.set_ylim([-3, 3]); ax.set_xlim([-3, 3]); ax.legend();

So we write the predicted height for the individual as a function of the midparental height like so:

$$
\text{predicted_height} = \text{slope} \cdot \text{midparental_height} + \text{intercept}
$$

That is, if we want to guess the child's adult height,
we need to multiply their parents' average height by some value and add another.

This is equivalent to saying

$$
\text{height} \approx \text{slope} \cdot \text{midparental_height} + \text{intercept}
$$

In Python, that looks like:

In [None]:
def predict_height(midparental_heights, slope, intercept):
    return slope * midparental_heights + intercept

hand_slope, hand_intercept = 0.75, 17

where `hand_slope` and `hand_intercept` are values picked "by hand",
(or rather, "by eye")
to give decent predictions.

The predictions of this model look like so,
compared to the original data:

In [None]:
g = sns.jointplot(x="midparental_height", y="height", data=df)
g.ax_joint.scatter(
    df["midparental_height"],
    predict_height(df["midparental_height"], hand_slope, hand_intercept),
    color="C1", lw=6);

## There are two big issues with this process:

- we didn't define what we meant by "$\approx$".

This left us with no choice but to pick the parameters of the line by hand.

- we no longer have a probabilistic model for our data.

It's unclear how we might express our uncertainty about the size of our errors
or our uncertainty about the parameters of the line.

# We solve both of these issues at once by specifying a likelihood.

Or, alternatively, we define $\approx$ in terms of a distribution, $\sim$.

$$
\text{height} \sim \text{Normal}\left(\text{predicted_height}, \sigma\right)
$$

$$
\text{height} \sim \text{Normal}\left(\text{slope} \cdot \text{midparental_height} + \text{intercept}, \sigma\right)
$$

That is, if our predictions are good,
we expect that
- we are most likely to observe `height`s close to our `predicted_height`
- we will observe `height`s both above and below our prediction in equal proportion
- the average `height` for children of parents with a given `midparental_height` is equal to our `predicted_height`
- the spread of observed `height`s around our `predicted_height` is given by $\sigma$
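
To see how this likelihood scores a candidate line, here is a minimal sketch in `numpy`.
The function name `normal_log_likelihood` and the five data points are made up for illustration (they are not the Galton data),
and we work with the logarithm of the likelihood, which is the usual choice for numerical stability.

```python
import numpy as np


def normal_log_likelihood(heights, predicted_heights, sigma):
    """Sum of log Normal(predicted_height, sigma) densities over all observations."""
    residuals = np.asarray(heights) - np.asarray(predicted_heights)
    return np.sum(
        -0.5 * np.log(2 * np.pi * sigma ** 2) - residuals ** 2 / (2 * sigma ** 2)
    )


# Made-up numbers for illustration only, not the Galton data
toy_midparental = np.array([64.0, 66.0, 68.0, 70.0, 72.0])
toy_heights = np.array([65.2, 66.1, 68.3, 69.4, 71.0])

close_line = 0.75 * toy_midparental + 17  # the hand-picked line from above
far_line = 0.75 * toy_midparental + 25    # same slope, shifted 8 inches up

print(normal_log_likelihood(toy_heights, close_line, sigma=2.0))
print(normal_log_likelihood(toy_heights, far_line, sigma=2.0))
```

The line that passes closer to the observations gets the higher log-likelihood, so "$\approx$" now has a precise meaning.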

Meaning of intercept: what is the predicted height of a child whose parents' heights average to 0 inches?

Meaning of slope: if the parents' average height increases by 1 inch, by how many inches (and in which direction!) should I change my prediction of the child's height?

This gives us a likelihood for our data,
and we just need to define a prior over the parameters:

$$
\text{slope} \sim \text{Normal}(0, 1)\\
\text{intercept} \sim \text{Normal}(0, 10)
$$

The prior on the slope means that we expect the heights of children to change by somewhere between -3 and +3 inches
for each inch of change in midparental height.

The prior on the intercept means that we expect a child whose parents average to 0 inches tall
to be between -30 and 30 inches tall.
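
We can check these statements by drawing from the priors directly.
This is a sketch using `numpy`'s random number generator rather than `pyMC`;
the variable names and the choice of 68 inches as an example midparental height are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
prior_slopes = rng.normal(loc=0, scale=1, size=10_000)
prior_intercepts = rng.normal(loc=0, scale=10, size=10_000)

# Fraction of prior draws inside the ranges described above
print(np.mean(np.abs(prior_slopes) < 3))
print(np.mean(np.abs(prior_intercepts) < 30))

# Heights these priors imply for a child whose midparental height is 68 inches
prior_predictions = prior_slopes * 68 + prior_intercepts
print(np.percentile(prior_predictions, [2.5, 97.5]))
```

Nearly all draws fall inside the stated ranges,
but the implied predictions at 68 inches span negative heights as well as heights over ten feet.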

These priors are possibly too loose
(we might suspect that heights are more likely to increase than decrease,
that they don't increase by more than one inch per parental inch,
etc.),
but let's work with them for now.

Think about how you might incorporate those beliefs into a prior,
and try them out in the model below.

$$
\sigma \sim \text{Exponential}\left(1 / \sigma_{h}\right)
$$

In [None]:
height_sd = df["height"].std()

As usual, we select a weak, overestimating prior for the standard deviation.

Here, we use the standard deviation of the height variable, $\sigma_h$,
aka `height_sd`,
as the mean for the prior for the standard deviation of our errors.

Since our predictions can't increase this variability,
the value $\sigma_h$ is something of an upper bound.
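
The rate parameter `lam` of an Exponential is one over its mean, which is why we write $1 / \sigma_h$.
The sketch below checks this by simulation with `numpy`.
The value `assumed_height_sd = 3.5` is a placeholder, not the value computed from the data,
and note that `numpy` asks for the scale (the mean) rather than the rate.

```python
import numpy as np

rng = np.random.default_rng(0)
assumed_height_sd = 3.5  # placeholder value, not computed from the data
rate = 1 / assumed_height_sd

# NumPy parametrizes the Exponential by its scale, which is 1 / rate
sigma_draws = rng.exponential(scale=1 / rate, size=100_000)

print(sigma_draws.mean())                          # close to assumed_height_sd
print(np.mean(sigma_draws > assumed_height_sd))    # share of draws above the mean
```

Only a bit more than a third of the prior mass sits above the mean, so the prior leans toward smaller values while still allowing larger ones.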

That done, we've defined prior distributions for all of the random variable components of the likelihood:

$$
\text{height} \sim \text{Normal}\left(\text{slope} \cdot \text{midparental_height} + \text{intercept}, \sigma\right)
$$

## Once we have a likelihood and a prior, we can use `pyMC` to define a model and sample from its posterior. 

In [None]:
with pm.Model() as linear_model:
    intercept = pm.Normal("intercept", mu=0, sd=1e1)
    slope = pm.Normal("slope", mu=0, sd=1)
    
    sigma = pm.Exponential("sigma", lam=1 / height_sd)
    
    height = pm.Normal("heights",
                       mu=slope * df["midparental_height"] + intercept,
                       sd=sigma,
                       observed=df["height"])

In [None]:
with linear_model:
    linear_trace = pm.sample(tune=1000)
    
linear_posterior_df = shared_util.samples_to_dataframe(linear_trace)

## The resulting posterior represents our uncertainty about the true values of the parameters.

As always, the posterior has two pieces:

$$
\color{green}{p(\text{slope}, \text{intercept}, \sigma \vert \text{data})}
\propto \color{darkgoldenrod}{p(\text{data} \vert \text{slope}, \text{intercept}, \sigma)}
\cdot \color{darkblue}{p(\text{slope}, \text{intercept}, \sigma)}
$$

That is, our

$\color{green}{\text{updated belief about the plausibility of a setting of the parameters}}$

is proportional to

$\color{darkgoldenrod}{\text{how likely the data is with that choice of the parameters}}$

multiplied by

$\color{darkblue}{\text{how plausible we thought those parameters were before we saw the data}}$.
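
To make the formula concrete, here is a sketch that evaluates prior times likelihood on a grid of slope values.
This is not how `pyMC` computes the posterior (it draws samples instead).
To keep the grid one-dimensional, we assume the intercept and $\sigma$ are known, and the data are synthetic;
every name and number in the block is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known slope; intercept and sigma are treated as known
true_slope, known_intercept, known_sigma = 0.7, 20.0, 2.0
toy_x = rng.uniform(64, 74, size=200)
toy_y = true_slope * toy_x + known_intercept + rng.normal(0, known_sigma, size=200)


def log_normal_pdf(values, mean, sd):
    """Log density of Normal(mean, sd) evaluated at values."""
    return -0.5 * np.log(2 * np.pi * sd ** 2) - (values - mean) ** 2 / (2 * sd ** 2)


slope_grid = np.linspace(0, 1.5, 301)

log_prior = log_normal_pdf(slope_grid, 0, 1)  # slope ~ Normal(0, 1)
log_likelihood = np.array([
    log_normal_pdf(toy_y, slope * toy_x + known_intercept, known_sigma).sum()
    for slope in slope_grid
])

# Multiplying probabilities is adding log-probabilities
log_posterior = log_likelihood + log_prior
posterior = np.exp(log_posterior - log_posterior.max())
posterior /= posterior.sum()  # normalize over the grid

print(slope_grid[np.argmax(posterior)])
```

With 200 observations the likelihood dominates the prior, and the grid value with the highest posterior weight lands next to the slope used to generate the data.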

## For a `Normal` distribution, the likelihood is driven by how close the data is to the mean.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
xs = np.linspace(-3, 3)
ax.plot(xs, scipy.stats.norm(0, 1).pdf(xs), lw=6);

The `Normal` likelihood is simple:
there are no extra bumps or other strange shapes;
the distribution is symmetric,
so it doesn't matter whether you're above or below the center.
The closer the value is to the mean (0 in the plot above),
the greater its probability under a `Normal` distribution.
No other information is needed.

Therefore our likelihood term will push the model to predict a `height` as close as possible
to the value observed.

Because we restricted it to making predictions using a line,
the model will try to find lines
whose $y$-values pass as close as possible to the observed `height`s.
Our prior further indicates that it should prefer lines whose
slope and intercept are not too large.

In [None]:
pm.plot_posterior(linear_trace, figsize=(12, 12), text_size=16,
                  ref_val=[0, 0, height_sd]);

The fact that the `slope` is far away from 0
indicates that the height of the parents is useful for
predicting the height of the child in adulthood.

If the posterior put substantial weight on both positive and negative slopes,
then we would suspect that midparental height is not useful for prediction.

## From a single sample, we can produce predictions.

We define a quick function for predicting heights using parameters in the format of our posterior:

In [None]:
def predict_height_from_parameters(midparent_heights, parameters):
    return predict_height(midparent_heights, parameters["slope"], parameters["intercept"])

In [None]:
predict_height_from_parameters(0, linear_trace[0]), linear_trace[0]["intercept"]

In [None]:
predict_height_from_parameters(1, linear_trace[0]) - linear_trace[0]["intercept"], linear_trace[0]["slope"]

In [None]:
predict_height_from_parameters(0, linear_trace[-1]), linear_trace[-1]["intercept"]

In [None]:
xs = np.linspace(60, 75)  # evenly-spaced numbers between 60 and 75
g = sns.jointplot(y="height", x="midparental_height", data=df);
g.ax_joint.plot(xs, predict_height_from_parameters(xs, linear_trace[0]), lw=4, color="C1");

## The standard deviation parameter indicates the expected variability in observed values.

In [None]:
xs = np.linspace(60, 75)  # evenly-spaced numbers between 60 and 75
g = sns.jointplot(y="height", x="midparental_height", data=df);
g.ax_joint.plot(xs, predict_height_from_parameters(xs, linear_trace[0]), lw=4, color="C1");
g.ax_joint.errorbar(xs,
                    predict_height_from_parameters(xs, linear_trace[0]),
                    yerr=2 * linear_trace[0]["sigma"],
                    lw=5, color="C1", alpha=0.3);

The transparent gold background here represents
the middle 95% of the likelihood component of this sample from the posterior,
overlaid on top of the data and underneath the predictions according to this sample's parameters.
We should expect about 95% of observations to fall inside this region.

It is obtained by taking the value of `sigma` and multiplying it by two--
the middle four standard deviations of a Normal cover 95% of the distribution 
--and plotting it using `errorbar` to get an approximation to the true shape.
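
The "two standard deviations covers 95%" rule is a rounded version of an exact calculation,
which we can do with the error function from Python's standard library.
The helper name `normal_coverage` is ours.

```python
import math


def normal_coverage(num_sds):
    """Probability that a Normal draw lands within num_sds standard deviations of its mean."""
    return math.erf(num_sds / math.sqrt(2))


print(normal_coverage(2))     # roughly 0.954
print(normal_coverage(1.96))  # roughly 0.950
```

So plus or minus two standard deviations covers slightly more than 95%, which is close enough for a plot.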

But a single sample isn't necessarily representative,
nor does it indicate anything about our uncertainty in the prediction.

## If we combine predictions across many of our samples, we can get a sense of our uncertainty in what the correct prediction is.

This is a separate issue from our uncertainty in the values, given the prediction, as plotted above.

In [None]:
posterior_samples = linear_posterior_df.sample(n=100)
xs = np.linspace(60, 75)  # evenly-spaced numbers between 60 and 75
g = sns.jointplot(y="height", x="midparental_height", data=df);
[g.ax_joint.plot(xs, predict_height_from_parameters(xs, sample), lw=4, color="C1", alpha=0.05)
 for _, sample in posterior_samples.iterrows()];
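
The cloud of lines can also be summarized as a band:
at each value of $x$, take the middle 95% of the predictions across samples.
The sketch below does this with `numpy`.
The function `prediction_band` is a hypothetical helper,
and the `fake_slopes` and `fake_intercepts` are stand-in draws invented for illustration, not samples from `linear_trace`.

```python
import numpy as np


def prediction_band(xs, slopes, intercepts, level=0.95):
    """Pointwise interval for the line's value at each x, across parameter samples."""
    # One row per sample, one column per x value
    lines = np.outer(slopes, xs) + np.asarray(intercepts)[:, None]
    tail = 100 * (1 - level) / 2
    lower, upper = np.percentile(lines, [tail, 100 - tail], axis=0)
    return lower, upper


rng = np.random.default_rng(0)
# Stand-in draws, NOT the real posterior: lines that pivot around the point (68, 67)
fake_slopes = rng.normal(0.7, 0.05, size=500)
fake_intercepts = 67 - 68 * fake_slopes + rng.normal(0, 0.1, size=500)

xs = np.linspace(60, 75)
lower, upper = prediction_band(xs, fake_slopes, fake_intercepts)
widths = upper - lower

print(widths[0], widths.min(), widths[-1])
```

With these stand-in draws the band is narrowest in the middle of the range and widens toward the ends,
the same bow-tie shape the overlapping lines make in the plot above.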

# In regression, the goal is typically prediction, rather than hypothesis testing.

That is, we want to get high quality guesses for the values of the "$y$-variable".

The presumption is that if we can guess the values of the $y$-variable well
based only on the value of the $x$-variable
for the data we measured,
then in the future if we can only measure $x$,
we can use our model to obtain a good guess for $y$.

## To obtain our _best guess_ about the true value of the parameters, we use MAP inference.

That is, we select the parameters with the highest probability under the posterior.

In other words, we maximize the value on the left-hand side of the equation below
by changing the values of the slope, intercept, and $\sigma$ and calculating the value on the right-hand side.

$$
\color{green}{p(\text{slope}, \text{intercept}, \sigma \vert \text{data})}
\propto \color{darkgoldenrod}{p(\text{data} \vert \text{slope}, \text{intercept}, \sigma)}
\cdot \color{darkblue}{p(\text{slope}, \text{intercept}, \sigma)}
$$

In [None]:
linear_MAP = pm.find_MAP(start=linear_trace[0], model=linear_model)

linear_MAP

In [None]:
posterior_samples = linear_posterior_df.sample(n=100)
xs = np.linspace(60, 75)  # evenly-spaced numbers between 60 and 75
g = sns.jointplot(y="height", x="midparental_height", data=df);
[g.ax_joint.plot(xs, predict_height_from_parameters(xs, sample), lw=4, color="C1", alpha=0.1)
 for _, sample in posterior_samples.iterrows()];
g.ax_joint.plot(xs, predict_height_from_parameters(xs, linear_MAP), lw=4, color="k");

This plot shows the MAP prediction
(the predictions based on the maximum probability parameters a posteriori)
in black over the sampled posterior predictions in transparent gold.

# A model where the parameters of one random variable change continuously as a function of another is known as a _regression_ model.

Regression means "backwards movement". What does a continuous prediction have to do with backwards movement?

This terminology is silly, and comes directly from Galton and his unfortunate historical position.

He invented this technique to answer a question
about the inheritance of complex traits,
like height,
and in particular to explain a phenomenon of great concern to him:
the children of tall parents, though taller than average,
were usually shorter than their parents.

This implied that, after multiple generations,
a particularly tall individual's descendants would be no more likely
to be tall than the descendants of an individual of average height.

As a Victorian aristocrat, his opinion of average things was quite low,
so he referred to this phenomenon as _regression to mediocrity_.

## This shows up in our model as the slope being less than 1.

The result is that if we start with an individual whose height is far from the average,
e.g. 72 inches/six feet or 62 inches/5'2'',
then after several generations,
presuming they have children
without specifically picking mates based on height,
then the expected height of their great-great-great-grandchildren is most of the way
back to the population average,
as the simulation shows.

In [None]:
parameters = linear_MAP
original_height = 72; num_generations = 5

predicted_height = original_height

for _ in range(num_generations):
    predicted_height = predict_height_from_parameters(predicted_height, parameters)
    print(predicted_height)

In [None]:
df["height"].mean()
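
The loop above applies the same line over and over, and that repeated application has a closed form.
The sketch below uses the hand-picked slope of 0.75 and intercept of 17 from earlier in the lecture, not the MAP estimates,
and the function name `height_after_generations` is ours.
It assumes the slope is not exactly 1.

```python
def height_after_generations(start_height, slope, intercept, num_generations):
    """Closed form for applying height -> slope * height + intercept repeatedly."""
    fixed_point = intercept / (1 - slope)  # the height the line maps to itself
    return fixed_point + slope ** num_generations * (start_height - fixed_point)


# Hand-picked values from earlier in the lecture, not the MAP estimates
example_slope, example_intercept = 0.75, 17

print(example_intercept / (1 - example_slope))
for generation in range(6):
    print(generation,
          height_after_generations(72, example_slope, example_intercept, generation))
```

Each generation, the distance from the fixed point shrinks by a factor equal to the slope,
so any slope between 0 and 1 pulls the predictions back toward that height.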

# If the function relating the two variables is linear, the model is a _linear regression_ model.

This model is, with its generalizations,
the workhorse of statistics and data science.

We'll spend the next lecture covering it in detail.

# We are not restricted to linear relationships between variables and parameters.

## Let's consider the probability that a golf putt goes into a hole.

This section is based on a [pyMC translation](https://nbviewer.jupyter.org/github/pymc-devs/pymc3/blob/master/docs/source/notebooks/putting_workflow.ipynb)
of an original case study by [Andrew Gelman](https://mc-stan.org/users/documentation/case-studies/golf.html),
written for Stan, a probabilistic programming language with interfaces for R, Python, and other languages.

The data is from [_Statistics: A Bayesian Perspective_, by Donald Berry](https://www.jstor.org/stable/2684909),
and represents measurements of the putting outcomes for a number of professional golfers. 

The original goes into much greater depth and demonstrates some very cool pyMC tricks!

In [None]:
golf = pd.read_csv("data/golf.csv", index_col=0)

golf

If we take a look at this data,
we might be able to convince ourselves that a line comes "close enough"
to predicting the probability of getting the putt in:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
ax.set_ylim([0, 1])
ax.scatter(golf["distance"], golf["successes"] / golf["tries"]);
ax.plot([2, 20], [0.95, 0.05], lw=2, color="C1");

You might quibble with the parameters, but these are just meant as a proof-of-principle.

## A line is inadequate to capture the relationship between distance and putt success chance.

In order to adapt our original model,
it seems like we'd only need to change our likelihood:

$$
\text{success} \sim \text{Bernoulli}(p=\text{slope}\cdot\text{distance} + \text{intercept})
$$

$$
\text{successes} \sim \text{Binomial}(n=\text{tries}, p=\text{slope}\cdot\text{distance} + \text{intercept})
$$

In [None]:
with pm.Model() as golf_linear:
    slope = pm.Normal("slope", mu=0, sd=1)
    intercept = pm.Normal("intercept", mu=0, sd=1)
    
    successes = pm.Binomial("successes",
                            n=golf["tries"],
                            p=slope * golf["distance"] + intercept,
                            observed=golf["successes"])

But if we try to sample from this model's posterior,
we get an error:

In [None]:
try:
    with golf_linear:
        golf_linear_trace = pm.sample(target_accept=0.9, draws=1000)
except pm.parallel_sampling.ParallelSamplingError:
    print("Error!")

Let's try and understand why this is the case.

First, let's plot the line from above again,
then determine its slope and intercept.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
ax.set_ylim([-1, 2]); ax.hlines(0, 0, 30); ax.set_xlim(0, 30);
ax.scatter(golf["distance"], golf["successes"] / golf["tries"]);
ax.plot([2, 20], [0.95, 0.05], lw=4, color="C1");

In [None]:
golf_slope = -0.05
golf_intercept = 1.05

If we extend that line past 20 feet,
we see that the predicted value goes below 0 --
meaning that we predict a _negative_ probability of putting success!
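
We can work out exactly where the line leaves the valid range by solving for the distances at which it equals 0 and 1.
The sketch below uses the slope and intercept read off the line above; the helper `linear_probability` is ours.

```python
golf_slope, golf_intercept = -0.05, 1.05  # values read off the line above


def linear_probability(distance):
    """The line's 'probability' of success at a given distance in feet."""
    return golf_slope * distance + golf_intercept


# Solve slope * distance + intercept = target for distance
distance_at_zero = (0 - golf_intercept) / golf_slope
distance_at_one = (1 - golf_intercept) / golf_slope

print(distance_at_zero, distance_at_one)
print(linear_probability(25), linear_probability(0.5))
```

The line crosses 0 at about 21 feet and crosses 1 at about 1 foot,
so it also predicts probabilities above 1 for very short putts.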

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
xs = np.linspace(0, 30)  # evenly-spaced numbers between 0 and 30
ax.set_ylim([-1, 2]); ax.hlines(0, 0, 30, lw=6); ax.set_xlim(0, 30);
ax.scatter(golf["distance"], golf["successes"] / golf["tries"]);
ax.plot(xs, xs * golf_slope +  golf_intercept, lw=4, color="C1");

If the output values were restricted to being between 0 and 1,
then we'd still be in business.

## We can introduce other functions to our models in order to get other relationships besides linear ones.

In this case,
we introduce a "squasher" function
to make sure our predictions never go below 0 or above 1.

$$
\text{successes} \sim \text{Binomial}(n=\text{tries}, p=\text{squash}\left(\text{slope}\cdot\text{distance} + \text{intercept}\right))
$$

The typical choice for a "squasher" when we want values between 0 and 1 is the function below:

In [None]:
def squash(xs):
    return 1 / (1 + np.exp(-xs))

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
xs = np.linspace(-5, 5)
ax.plot(xs, squash(xs), lw=4);

It's sometimes called the _sigmoid_ (meaning "S-shaped") function
or the logistic function.
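
A few properties of this function are worth checking numerically:
it maps 0 to one half, it is symmetric around that point, and it can be undone.
In the sketch below, `squash` is repeated so the block runs on its own,
and `unsquash` is our own name for the inverse, which is usually called the logit or log-odds.

```python
import numpy as np


def squash(xs):
    return 1 / (1 + np.exp(-xs))


def unsquash(ps):
    """Inverse of squash, often called the logit or log-odds function."""
    return np.log(ps / (1 - ps))


example_inputs = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
example_outputs = squash(example_inputs)

print(squash(0.0))
# Flipping the sign of the input flips the probability around one half
print(np.allclose(squash(-example_inputs), 1 - example_outputs))
# unsquash recovers the original inputs
print(np.allclose(unsquash(example_outputs), example_inputs))
```

The existence of an inverse means the slope and intercept still describe a line, just on the log-odds scale instead of the probability scale.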

If we apply this function to our parameters from before,
we see that the predictions no longer go outside of the acceptable range:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
xs = np.linspace(0, 30)  # evenly-spaced numbers between 0 and 30
ax.set_ylim([-1, 2]); ax.hlines(0, 0, 30, lw=6); ax.set_xlim(0, 30);
ax.scatter(golf["distance"], golf["successes"] / golf["tries"]);
ax.plot(xs, xs * golf_slope +  golf_intercept, lw=4, color="C1", label="before squashing");
ax.plot(xs, squash(golf_slope * xs + golf_intercept), color="C2", lw=4, label="after squashing");
ax.legend();

## The functions have to be provided by `pyMC` in order to work with sampling and MAP inference.

Technically, you can write your own,
but this requires some heavy-duty math work using the guts of pyMC.

In [None]:
pymc_squash = pm.math.invlogit

## Once that's taken care of, we simply include them inside of our model and we are ready to go:

In [None]:
with pm.Model() as golf_binomial:
    slope = pm.Normal("slope", mu=0, sd=1)
    intercept = pm.Normal("intercept", mu=0, sd=10)
    
    successes = pm.Binomial("successes",
                            n=golf["tries"],
                            p=pymc_squash(slope * golf["distance"] + intercept),
                            observed=golf["successes"])

In [None]:
with golf_binomial:
    golf_binomial_trace = pm.sample(target_accept=0.9, draws=1000)
    
golf_binomial_posterior = shared_util.samples_to_dataframe(golf_binomial_trace)

In [None]:
pm.plot_posterior(golf_binomial_trace, figsize=(12, 6), text_size=16);

## Our tools of MAP inference and posterior visualization transfer directly.

In [None]:
golf_MAP = pm.find_MAP(start=golf_binomial_trace[0], model=golf_binomial)

We just need to make sure to define our prediction function correctly:

In [None]:
def predict_putt_success(distance, parameters):
    return squash(parameters["slope"] * distance + parameters["intercept"])

This function was already present in our model:

$$
\text{successes} \sim \text{Binomial}(n=\text{tries}, p=\text{squash}\left(\text{slope}\cdot\text{distance} + \text{intercept}\right))
$$

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
xs = np.linspace(0, 30)  # evenly-spaced numbers between 0 and 30
MAP_predictions = predict_putt_success(xs, golf_MAP)
ax.set_ylim([0, 1.1]); ax.scatter(golf["distance"], golf["successes"] / golf["tries"]);
[ax.plot(xs, predict_putt_success(xs, sample), lw=4, color="C1", alpha=0.1)
 for _, sample in golf_binomial_posterior[::20].iterrows()];
ax.plot(xs, MAP_predictions, color="k", lw=4, label="MAP Prediction"); ax.legend();

Again, the predictions from the posterior samples are plotted in transparent gold
underneath the MAP prediction and above the data, in dark blue.

Important note:
the notion of a good prediction has changed -- it's based on a `Binomial` likelihood,
not a `Normal` one.

It's still in general good for predictions to be "close to" observations,
but the notion of distance can change.
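
One way to see the change:
under a `Binomial` likelihood, the penalty for a prediction that misses the observed success rate depends on the number of tries.
The sketch below uses only the standard library.
The function name `binomial_log_likelihood` and the counts are made up for illustration; they are not rows from the golf data.

```python
import math


def binomial_log_likelihood(successes, tries, p):
    """Log probability of `successes` out of `tries` when each try succeeds with probability p."""
    return (
        math.log(math.comb(tries, successes))
        + successes * math.log(p)
        + (tries - successes) * math.log(1 - p)
    )


# Made-up counts, not rows from the golf data
for tries in (10, 1000):
    successes = tries // 2  # observed success rate of 0.5
    gap = (binomial_log_likelihood(successes, tries, 0.5)
           - binomial_log_likelihood(successes, tries, 0.4))
    print(tries, gap)
```

Predicting 0.4 when the observed rate is 0.5 costs far more log-likelihood with 1000 tries than with 10,
so distances with many recorded putts pull harder on the fitted curve.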