Bayes Rules! book:

https://www.bayesrulesbook.com/chapter-13.html

Materials from the Bayes Rules! GitHub repository:

https://github.com/bayes-rules/bayesrules

# Imports

In [1]:
import math, pyreadr
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from scipy.stats import norm, beta, binom, mode
from os.path import exists

import pyro
import torch as t
import pyro.distributions as dist
import pyro.distributions.constraints as constraints
from pyro.infer import MCMC
from pyro.infer.mcmc.nuts import HMC, NUTS

device = t.device("cuda" if t.cuda.is_available() else "cpu")
t.set_default_tensor_type(t.FloatTensor)
if t.cuda.is_available():
    t.set_default_tensor_type(t.cuda.FloatTensor)

Chapter 13 covers logistic regression.

The techniques are very similar to Normal regression. I chose to replicate the examples from this chapter anyway because they employ one new trick: a logit link between the linear predictor and the probability of the outcome.

Specifically:

$Y_{i}|\beta_{0},\beta_{1} \sim Bern(\pi_{i}) ~ with ~ log(\frac{\pi_{i}}{1-\pi_{i}}) = \beta_{0} + \beta_{1}X_{i}$

The intuition behind *why* we use a logit link is worth restating here.

Let's assume we have an event with probability $\pi_{i}$, odds $\frac{\pi_{i}}{1-\pi_{i}}$, and a single feature $X_{i}$. Which is the most appropriate model?

1. $\pi_{i} = \beta_{0} + \beta_{1}X_{i}$
2. $odds_{i} = \beta_{0} + \beta_{1}X_{i}$
3. $log(\pi_{i}) = \beta_{0} + \beta_{1}X_{i}$
4. $log(odds_{i}) = \beta_{0} + \beta_{1}X_{i}$

Keeping in mind that the regression parameters $\beta_{0}, \beta_{1} \in \mathbb{R}$, so the right-hand side can take any real value, let's go through each individually:

1. $\pi_{i}$ is our probability, which means it is constrained to the range $[0, 1]$.
2. $odds_{i}$ are confined to positive numbers.
3. $log(\pi_{i})$, whose input lies between 0 and 1, is limited to negative numbers.
4. $log(odds_{i})$: while odds are confined to positive numbers, log(odds) can be any real number. This matches the range of the right-hand side, so it is our best match.
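To make the range argument concrete, here is a small sketch in plain Python. The helper name `transform_ranges` and the grid size are my own choices, not the book's. It evaluates each candidate left-hand side over a grid of probabilities strictly between 0 and 1 and reports the smallest and largest value seen.

```python
import math

def transform_ranges(n=999):
    """Evaluate each candidate left-hand side on a grid of probabilities in (0, 1)."""
    probs = [(k + 1) / (n + 1) for k in range(n)]  # 0.001, 0.002, ..., 0.999 for n=999
    candidates = {
        "pi": lambda p: p,
        "odds": lambda p: p / (1 - p),
        "log(pi)": lambda p: math.log(p),
        "log(odds)": lambda p: math.log(p / (1 - p)),
    }
    # (min, max) of each transform over the grid
    return {name: (min(map(f, probs)), max(map(f, probs))) for name, f in candidates.items()}

ranges = transform_ranges()
for name, (lo, hi) in ranges.items():
    print(f"{name:>9}: min={lo:9.3f}  max={hi:9.3f}")
```

Only log(odds) takes both signs, and its extremes keep growing in both directions as the grid gets closer to 0 and 1.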

We can rearrange the equation to solve for $\pi_{i}$:

- $log(\frac{\pi_{i}}{1-\pi_{i}}) = \beta_{0} + \beta_{1}X_{i}$
- $\frac{\pi_{i}}{1-\pi_{i}} = e^{\beta_{0} + \beta_{1}X_{i}}$
- $\pi_{i} = \frac{e^{\beta_{0} + \beta_{1}X_{i}}}{1+e^{\beta_{0} + \beta_{1}X_{i}}}$
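As a quick numerical check of the rearrangement, the sketch below follows the three steps literally and confirms that applying the logit to the resulting $\pi_{i}$ recovers the linear predictor. The function names and the coefficient and humidity values are illustrative assumptions, not fitted results.

```python
import math

def logit(pi):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(pi / (1 - pi))

def inv_logit(z):
    """pi = e^z / (1 + e^z), following the rearrangement step by step."""
    odds = math.exp(z)
    return odds / (1 + odds)

# assumed values for illustration only (not posterior estimates)
beta0, beta1, x = -4.5, 0.07, 60.0
z = beta0 + beta1 * x   # linear predictor = log-odds
pi = inv_logit(z)       # back on the probability scale
print(f"log-odds={z:.3f}  pi={pi:.3f}  logit(pi)={logit(pi):.3f}")
```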

The book also gives some intuition for what $\beta_0...\beta_{n}$ mean in the context of logistic regression:

- $\beta_0$: when all predictors $x_1...x_n$ are 0, $e^{\beta_0}$ is the odds of the event.
- $\beta_1$: $e^{\beta_1}$ is the multiplicative change in odds for a one-unit increase in $x_1$, since $\beta_1 = log(odds_{x+1}) - log(odds_x)$, holding all other predictors constant.
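Both interpretations can be checked numerically. The sketch below uses assumed coefficient values (chosen for illustration, not taken from a fit) and a hypothetical helper `odds`; it shows that the odds at $x = 0$ equal $e^{\beta_0}$ and that moving $x$ up by one unit multiplies the odds by $e^{\beta_1}$, regardless of the starting point.

```python
import math

def odds(beta0, beta1, x):
    """Odds of the event implied by a one-predictor logistic model at value x."""
    return math.exp(beta0 + beta1 * x)

# assumed coefficients for illustration only (not fitted results)
beta0, beta1 = -4.5, 0.07

baseline_odds = odds(beta0, beta1, 0.0)                       # equals exp(beta0)
ratio = odds(beta0, beta1, 61.0) / odds(beta0, beta1, 60.0)   # equals exp(beta1)

print(f"odds at x=0: {baseline_odds:.4f}  (exp(beta0) = {math.exp(beta0):.4f})")
print(f"odds ratio per unit of x: {ratio:.4f}  (exp(beta1) = {math.exp(beta1):.4f})")
```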

# Data

In [2]:
# using the same data as ch11
file_name = 'weather_WU'
folder = 'ch11'

data_url = f"https://github.com/bayes-rules/bayesrules/raw/master/data/{file_name}.rda"
# local path without extension, shared by the cached .csv and the downloaded .rda
base_path = f"/Users/zr/Geek/tutorials/bayesian_rules/{folder}/{file_name}"

if exists(f"{base_path}.csv"):
    df = pd.read_csv(f"{base_path}.csv")
else:
    # pyreadr downloads the remote file and saves it locally; read_r then converts the RDA data file to a pandas DataFrame
    file_path = f"{base_path}.rda"
    pyreadr.download_file(data_url, file_path)
    result = pyreadr.read_r(file_path)
    df = result[file_name]
    # index=False avoids an extra "Unnamed: 0" column when the cached CSV is read back
    df.to_csv(f"{base_path}.csv", index=False)

In [8]:
df = df[['humidity9am', 'raintomorrow']]
df = df.dropna()

# Model

- $Y_i|\beta_0,\beta_1 \sim Bern(\pi_i) ~ with ~ log(\frac{\pi_{i}}{1-\pi_{i}}) = \beta_{0} + \beta_{1}X_{i}$
- $\beta_1 \sim N(0.05, 0.05)$, matching the prior used in the code (the second argument is the standard deviation)
- $\beta_0 \sim N(-4.5, 5)$

In [98]:
x_data = t.tensor(df.humidity9am.values, dtype=t.float)
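# encode the outcome as 1.0 for 'Yes' and 0.0 for 'No' (works for both string and categorical columns)
y_data = t.tensor((df.raintomorrow == 'Yes').astype(float).values, dtype=t.float)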

def model(x, y=None):
    beta0 = pyro.sample('beta_0', dist.Normal(-4.5, 5))
    beta1 = pyro.sample('beta_1', dist.Normal(0.05, 0.05))

    logits = beta0 + beta1*x # these can be fed directly into dist.Bernoulli(logits=logits), but for reference:
    odds   = t.exp(logits)
    pi     = odds / (1+odds)

    with pyro.plate('data', len(x)):
        return pyro.sample('obs', dist.Bernoulli(pi), obs=y)

pyro.clear_param_store()
mcmc = MCMC(NUTS(model), num_samples=1000, warmup_steps=200)
mcmc.run(x_data, y_data)
mcmc.summary(prob=.8)

Sample: 100%|██████████| 1200/1200 [00:39, 30.72it/s, step size=1.76e-01, acc. prob=0.938]


                mean       std    median     10.0%     90.0%     n_eff     r_hat
    beta_0     -5.50      0.83     -5.50     -6.45     -4.33    191.64      1.00
    beta_1      0.06      0.01      0.06      0.04      0.07    190.74      1.00

Number of divergences: 0
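The comment in the model notes that the logits could be passed straight to the Bernoulli distribution. One reason that can matter is floating-point rounding: for extreme log-odds, `odds / (1 + odds)` rounds to exactly 1.0 and the log-probability of the other outcome breaks. The plain-Python sketch below illustrates this; it is not Pyro's implementation, and both function names are hypothetical.

```python
import math

def log_prob_via_pi(y, z):
    """Bernoulli log-probability computed through pi = odds / (1 + odds)."""
    odds = math.exp(z)
    pi = odds / (1 + odds)
    return math.log(pi) if y == 1 else math.log(1 - pi)

def log_prob_via_logit(y, z):
    """Same quantity written as y*z - log(1 + e^z), using a stable softplus."""
    softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return y * z - softplus

z = 40.0  # an extreme log-odds value chosen to trigger the rounding problem
try:
    naive = log_prob_via_pi(0, z)
except ValueError:
    naive = None  # pi rounded to exactly 1.0, so log(1 - pi) = log(0) fails
stable = log_prob_via_logit(0, z)
print(naive, stable)
```

For moderate log-odds the two functions agree, so with humidity values in the usual 0 to 100 range and small coefficients the explicit `pi` version in the model is unlikely to hit this problem.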





From the book:

- $\beta_0 \in [-5.0879, -4.1345]$
- $\beta_1 \in [0.04147, 0.05487]$

In [100]:
n_samples = 50
sample_betas = pd.DataFrame(mcmc.get_samples(n_samples))[['beta_0', 'beta_1']].values
x = np.sort(np.unique(x_data.numpy()))

fig = go.Figure()
for b0, b1 in sample_betas:
    logits = b0 + b1*x
    odds   = np.exp(logits)
    pi     = odds / (1+odds)
    fig.add_trace(
        go.Scatter(x=x, y=pi, marker_color='lightblue')
    )
fig.update_layout(showlegend=False)
fig

In [204]:
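results = []
for x in df.humidity9am.values:
    # NOTE: calling model() directly draws beta_0 and beta_1 from their priors, not from the MCMC posterior;
    # pred is the rounded mean of 100 such simulated outcomes at this humidity
    pred   = np.array([model(t.tensor([x], dtype=t.float)) for _ in range(100)]).mean().round()
    # actual is the rounded mean outcome over every row sharing this humidity value, not this row's own label
    actual = df.replace({'Yes':1, 'No':0}).loc[df.humidity9am==x, 'raintomorrow'].mean().round()
    results.append((pred, actual, 1 if pred==actual else 0))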

results = pd.DataFrame(results, columns=['pred', 'actual', 'result'])

grid_0_0 = len(results.loc[(results.pred==0)&(results.actual==0)]) / len(results)
grid_1_0 = len(results.loc[(results.pred==1)&(results.actual==0)]) / len(results)
grid_0_1 = len(results.loc[(results.pred==0)&(results.actual==1)]) / len(results)
grid_1_1 = len(results.loc[(results.pred==1)&(results.actual==1)]) / len(results)

print(f'Accuracy: {(results.result.sum()*100 / len(results)):.2f}%')

pd.DataFrame(
    [
        [grid_0_0, grid_0_1, grid_0_0 + grid_0_1],
        [grid_1_0, grid_1_1, grid_1_0 + grid_1_1],
        [grid_0_0 + grid_1_0, grid_0_1 + grid_1_1, 1],
    ],
    index=['Pred 0', 'Pred 1', 'Total'],
    columns=['Actual 0', 'Actual 1', 'Total'],
).round(2)

Accuracy: 92.39%


        Actual 0  Actual 1  Total
Pred 0      0.91      0.02   0.93
Pred 1      0.06      0.01   0.07
Total       0.97      0.03   1.00
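The headline numbers can be read straight off this table. The sketch below recomputes them from the rounded proportions shown above, so the results are only as precise as that rounding; the variable names are my own.

```python
# Proportions copied from the rounded confusion table above: (predicted, actual) -> share of rows
table = {
    (0, 0): 0.91, (0, 1): 0.02,
    (1, 0): 0.06, (1, 1): 0.01,
}

accuracy = table[(0, 0)] + table[(1, 1)]                          # share of rows where pred == actual
always_zero_accuracy = table[(0, 0)] + table[(1, 0)]              # share of rows with actual == 0
sensitivity = table[(1, 1)] / (table[(0, 1)] + table[(1, 1)])     # of actual 1s, share predicted 1
specificity = table[(0, 0)] / (table[(0, 0)] + table[(1, 0)])     # of actual 0s, share predicted 0

print(f"accuracy={accuracy:.2f}  always-0 baseline={always_zero_accuracy:.2f}")
print(f"sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
```

Splitting accuracy into sensitivity and specificity shows where the errors sit: with so few actual 1s, overall accuracy is dominated by how the 0s are handled.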


Well... a baseline model that always predicts 0 would get 97% accuracy against these labels, which beats our 92%. Welp.
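One thing worth trying next is to classify from posterior draws of `beta_0` and `beta_1` rather than calling `model()` directly, which samples the betas afresh from their priors. Pyro's `pyro.infer.Predictive` class is the library's tool for this; the plain-Python sketch below only illustrates the idea. The function name, the 0.5 cutoff, and the draws themselves are hypothetical stand-ins, not values from the fitted model.

```python
import math
import random

def posterior_predictive_class(draws, x, cutoff=0.5, rng=None):
    """Simulate one Bernoulli outcome per (beta0, beta1) draw and classify by the share of 1s."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    outcomes = []
    for b0, b1 in draws:
        pi = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))   # inverse logit of the linear predictor
        outcomes.append(1 if rng.random() < pi else 0)
    share = sum(outcomes) / len(outcomes)
    return (1 if share >= cutoff else 0), share

# HYPOTHETICAL draws for illustration only; in the notebook these would come from
# the beta_0 / beta_1 entries of mcmc.get_samples()
draws = [(-5.5, 0.06), (-5.0, 0.055), (-6.0, 0.065), (-4.8, 0.05)] * 250

for humidity in (40.0, 99.0):
    label, share = posterior_predictive_class(draws, humidity)
    print(humidity, label, round(share, 3))
```

Comparing these labels with each row's own `raintomorrow` value, instead of the rounded per-humidity average, would give a confusion matrix that is directly comparable to the always-0 baseline.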