<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_III_7_GammaRegressionExample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gamma Regression

We consider a dataset that records miles per gallon (mpg) for cars together with horsepower, acceleration, and car origin (we take an excerpt from the [UCI Auto MPG dataset](https://archive.ics.uci.edu/dataset/9/auto+mpg), which has various other features). Our objective is to predict mpg. One possible use case is checking the mpg figures reported by producers and assessing whether a car is efficient relative to its characteristics.

We will first use OLS linear regression, illustrate potential issues, and then showcase Gamma regression as an alternative.

Let's start by loading the libraries that are going to be helpful. We're again going to rely on the package [statsmodels](https://www.statsmodels.org/stable/index.html) and the statistical learning toolkit [scikit-learn](https://scikit-learn.org/stable/index.html), which provide GLM functionality.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import GammaRegressor

### Load Data

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
dat_mpg = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_III_7_auto-mpg.csv')
dat_mpg.head()

In [None]:
dat_mpg.describe()

Data explanation:
* mpg - miles per gallon
* horsepower - Engine horsepower
* acceleration - Time to accelerate from 0 to 60 mph (sec.)
* origin - Origin of car (A. American, B. European, C. Japanese)

## Run an OLS Regression and Analyze

Typically, we would do some data exploration. However, instead we are going to run an OLS linear regression as a baseline. We will then check model validity, and plot some relationships that may be tricky. This is to showcase that running a model "blind" can be problematic.

In [None]:
y = dat_mpg['mpg']
X = dat_mpg.drop(columns=['mpg','origin'])
X = pd.concat([X,pd.get_dummies(dat_mpg['origin'], drop_first=True)], axis =1)
X = sm.add_constant(X) # Add a constant term as the default model doesn't include one
model_ols = sm.OLS(y, X.astype(float)).fit()
model_ols.summary()

So, we see that some of the "fit metrics" indicate a poor fit. The skew of the residuals is positive (i.e., their distribution seems right-skewed) and the kurtosis is larger than three (i.e., the tails seem heavier than those of a normal distribution). Furthermore, the Jarque-Bera test seems to reject the normality assumption (although the sample size is somewhat small for the JB test).
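To make these diagnostics concrete, here is a minimal sketch of how skewness, kurtosis, and the Jarque-Bera statistic are usually defined, $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$. The function name `jarque_bera_stats` and the simulated residuals are assumptions for illustration only; they are not the mpg residuals.

```python
import numpy as np

def jarque_bera_stats(resid):
    """Return (skewness, kurtosis, JB statistic) for a 1-D array of residuals."""
    r = np.asarray(resid, dtype=float)
    n = r.size
    centered = r - r.mean()
    m2 = np.mean(centered**2)   # second central moment (biased variance)
    m3 = np.mean(centered**3)   # third central moment
    m4 = np.mean(centered**4)   # fourth central moment
    skew = m3 / m2**1.5
    kurt = m4 / m2**2           # equals 3 for a normal distribution
    jb = n / 6.0 * (skew**2 + (kurt - 3.0)**2 / 4.0)
    return skew, kurt, jb

# Hypothetical right-skewed "residuals" (simulated, not the mpg data)
rng = np.random.default_rng(0)
fake_resid = rng.gamma(shape=2.0, scale=1.0, size=400) - 2.0
skew, kurt, jb = jarque_bera_stats(fake_resid)
print(f"skew={skew:.2f}, kurtosis={kurt:.2f}, JB={jb:.1f}")
```

Under normality, JB is approximately chi-squared distributed with two degrees of freedom, so values well above roughly 6 (the 5% critical value) speak against the normal assumption.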

Let's look at the residual plots for model validation:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(model_ols.resid, bins=10, ax=axes[0], kde=True)
axes[0].set_xlabel('Residuals')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Histogram of Residuals')
sm.qqplot(model_ols.resid, line='s', ax=axes[1])
axes[1].set_title('Normal Q-Q Plot of Residuals')
plt.tight_layout()
plt.show()

We notice the skewness, and the tails of the distribution do not seem to be appropriately captured by the normal distribution.

Let's also plot the errors as a function of the feature variables horsepower and acceleration. Recall that OLS linear regression assumes i.i.d. errors with a constant variance:

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].scatter(dat_mpg['horsepower'], model_ols.resid)
axes[0].set_xlabel('Horsepower')
axes[0].set_ylabel('Residuals')
axes[0].axhline(0, color='red')

axes[1].scatter(dat_mpg['acceleration'], model_ols.resid)
axes[1].set_xlabel('Acceleration')
axes[1].set_ylabel('Residuals')
axes[1].axhline(0, color='red')

plt.tight_layout()
plt.show()

We notice that a constant variance around zero does not seem like an appropriate assumption.
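A rough numerical companion to these plots is to compare the spread of the residuals across bins of a feature. The sketch below uses a hypothetical helper `binned_residual_std` and simulated data whose noise grows with the feature; it is an informal check under these assumptions, not a formal test for heteroscedasticity.

```python
import numpy as np

def binned_residual_std(x, resid, n_bins=4):
    """Standard deviation of residuals within equal-count bins of x."""
    x = np.asarray(x, dtype=float)
    resid = np.asarray(resid, dtype=float)
    order = np.argsort(x)  # indices that sort the observations by x
    return np.array([resid[idx].std(ddof=1) for idx in np.array_split(order, n_bins)])

# Hypothetical data whose noise grows with x (simulated, not the mpg data)
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=800)
resid = rng.normal(0.0, 0.5 * x)  # standard deviation proportional to x
stds = binned_residual_std(x, resid)
print(np.round(stds, 2))
```

With the notebook's objects, one could call `binned_residual_std(dat_mpg['horsepower'], model_ols.resid)`; roughly equal values across bins would be in line with a constant variance, while a clear trend would not.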

Let's also plot the target variable as a function of the features horsepower and acceleration -- recall that the OLS linear regression assumption is that $y$ depends linearly on $x_k$:

In [None]:
# Plot mpg vs horsepower with linear trendline
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
sns.regplot(x=dat_mpg['horsepower'], y=dat_mpg['mpg'], line_kws={'color': 'red'})
# Add quadratic trendline (order=2 fits a second-degree polynomial)
sns.regplot(x=dat_mpg['horsepower'], y=dat_mpg['mpg'], order=2, line_kws={'color': 'blue'})

# Plot mpg vs acceleration with linear trendline
plt.subplot(1, 2, 2)
sns.regplot(x=dat_mpg['acceleration'], y=dat_mpg['mpg'], line_kws={'color': 'red'})
# Add quadratic trendline (order=2 fits a second-degree polynomial)
sns.regplot(x=dat_mpg['acceleration'], y=dat_mpg['mpg'], order=2, line_kws={'color': 'blue'})

plt.show()

So we see that the data are not well matched by the red linear regression line, further indicating potential issues with the OLS linear regression assumption.

Finally, let's compare the outcomes with the predictions:

In [None]:
plt.scatter(dat_mpg['mpg'], model_ols.predict())
plt.plot([min(dat_mpg['mpg']), max(dat_mpg['mpg'])], [min(dat_mpg['mpg']), max(dat_mpg['mpg'])], 'r--')
plt.xlabel('Actual mpg')
plt.ylabel('Predicted mpg')
plt.title('Actual vs. Predicted mpg')
plt.show()

So it doesn't look very satisfactory overall...

## Gamma Regression

So let's try a Gamma regression with a log-link function.

### Why and what does that mean?

Formally, a Gamma regression with log-link assumes:

$$
y_i \mid x_i \sim \text{Gamma}(\mu_i, \varphi), \text{ where } \log(\mu_i) = \beta_0 + \beta_1\,x_{i1}+\ldots+\beta_p\,x_{ip} = x_i \, \beta
$$

Note that this implies the following for the expected value and the variance of an observation $Y_i$ with features $x_i$:

\begin{eqnarray*}
E[Y_i | x_i] &=& \mu_i = e^{ \beta_0 + \beta_1\,x_{i1}+\ldots+\beta_p\,x_{ip}} = e^{x_i \, \beta}\\
Var[Y_i | x_i ] &=& \mu_i^2 \, \varphi.
\end{eqnarray*}

Hence, the assumptions differ from OLS linear regression in several ways:

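* We assume $Y$ is Gamma distributed. The Gamma distribution takes only positive values, is right-skewed, and has a heavier right tail than the Normal distribution.
* We assume that the expected value of $Y$ depends on $x$ in an exponential fashion, as the equation for the expected value makes clear. The blue trendline in the scatter plots above (a quadratic fit) already hints at this curvature.
* We assume that the variance is not constant but grows with the square of the expected value, as the equation for the variance makes clear.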
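To see the mean-variance relationship in action, here is a small simulation sketch. It assumes the shape/scale parameterization used by NumPy, with shape $1/\varphi$ and scale $\mu\,\varphi$, so that the mean is $\mu$ and the variance is $\mu^2\,\varphi$. The helper `sample_gamma` and the chosen values of $\mu$ and $\varphi$ are hypothetical.

```python
import numpy as np

def sample_gamma(mu, phi, size, rng):
    """Draw Gamma variates with mean mu and variance mu**2 * phi."""
    shape = 1.0 / phi   # shape k = 1 / phi
    scale = mu * phi    # scale theta = mu * phi, so k * theta = mu
    return rng.gamma(shape=shape, scale=scale, size=size)

rng = np.random.default_rng(3)
phi = 0.05                    # hypothetical dispersion parameter
for mu in (15.0, 30.0):       # hypothetical expected mpg values
    draws = sample_gamma(mu, phi, size=200_000, rng=rng)
    print(f"mu={mu}: mean={draws.mean():.2f}, "
          f"var={draws.var():.2f}, mu^2*phi={mu**2 * phi:.2f}")
```

Doubling $\mu$ quadruples the variance, which is exactly the kind of non-constant spread that OLS linear regression rules out.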

So let's run the Gamma regression using the statsmodels GLM routines (note that we need to specify the log link explicitly, because the default link function for the Gamma family in statsmodels is the inverse link):

In [None]:
# Log link; the lowercase alias links.log is deprecated in recent statsmodels versions
link_g = sm.families.links.Log()
model_gamma = sm.GLM(y, X.astype(float), family=sm.families.Gamma(link=link_g)).fit()
model_gamma.summary()
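To get a feel for what happens under the hood, here is a minimal sketch of iteratively reweighted least squares (IRLS) for a Gamma regression with log link. For this combination of family and link the working weights are all equal to one, so each step is an ordinary least-squares fit to the working response $z_i = \eta_i + (y_i - \mu_i)/\mu_i$. The function `fit_gamma_log_irls` and the simulated data are assumptions for illustration; this is not the statsmodels implementation.

```python
import numpy as np

def fit_gamma_log_irls(X, y, n_iter=25, tol=1e-10):
    """Estimate beta for a Gamma GLM with log link via IRLS.

    X must already contain a column of ones for the intercept; y must be positive.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Starting values: least-squares regression of log(y) on X
    beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    for _ in range(n_iter):
        eta = X @ beta                  # linear predictor
        mu = np.exp(eta)                # inverse of the log link
        z = eta + (y - mu) / mu         # working response
        # Working weights are 1 for Gamma + log link, so plain least squares
        beta_new, *_ = np.linalg.lstsq(X, z, rcond=None)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Hypothetical simulated data (not the mpg data)
rng = np.random.default_rng(4)
n = 2000
x1 = rng.uniform(0.0, 1.0, size=n)
X_sim = np.column_stack([np.ones(n), x1])
beta_true = np.array([3.0, -0.5])
phi = 0.05
mu_sim = np.exp(X_sim @ beta_true)
y_sim = rng.gamma(shape=1.0 / phi, scale=mu_sim * phi)

beta_hat = fit_gamma_log_irls(X_sim, y_sim)
print(beta_hat)
```

The estimates should land close to `beta_true`. The dispersion $\varphi$ does not enter the coefficient estimates; it only affects their standard errors.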

### So, is it better?

The short answer is yes. The pseudo-R-squared (determined as the squared correlation of predictions and observations, as we discussed before) is 93%, substantially higher than the OLS R-squared of 68%. The Pearson chi2 goodness-of-fit metric of 11.9 isn't gigantic (though it's also not very small, which would indicate a very good fit).
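As a quick sketch, the squared correlation between observations and predictions can be computed directly. The function `squared_correlation` and the numbers below are hypothetical and only illustrate the calculation.

```python
import numpy as np

def squared_correlation(y_obs, y_pred):
    """Squared Pearson correlation between observations and predictions."""
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    return r**2

# Hypothetical observed and predicted mpg values (not from the dataset)
y_obs = np.array([18.0, 15.0, 36.0, 26.0, 31.0])
y_pred = np.array([19.5, 14.0, 33.0, 27.5, 29.0])
r2 = squared_correlation(y_obs, y_pred)
print(round(r2, 3))
```

Applying it to both `model_ols.predict()` and `model_gamma.predict()` puts the two models on the same footing. This can be a useful cross-check because, depending on the statsmodels version, the GLM summary may report a likelihood-based pseudo-R-squared (labeled "Pseudo R-squ. (CS)") that is not directly comparable to the OLS R-squared.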

Let's compare predictions and outcomes here:

In [None]:
plt.scatter(dat_mpg['mpg'], model_gamma.predict())
plt.plot([min(dat_mpg['mpg']), max(dat_mpg['mpg'])], [min(dat_mpg['mpg']), max(dat_mpg['mpg'])], 'r--')
plt.xlabel('Actual mpg')
plt.ylabel('Predicted mpg')
plt.title('Actual vs. Predicted mpg')
plt.show()

So the fit seems better, although we still seem to underpredict somewhat for very high mpg values.

Let's compare the linear regression and the gamma regression predictions:

In [None]:
plt.scatter(model_ols.predict(), model_gamma.predict())
plt.xlabel('Linear Regression Predictions')
plt.ylabel('Gamma Regression Predictions')
plt.plot([min(model_ols.predict()), max(model_ols.predict())], [min(model_ols.predict()), max(model_ols.predict())], 'r--')
plt.show()

### How do we interpret the Gamma regression?

The directional interpretation is analogous: A positive coefficient indicates an increasing relationship, and a negative coefficient a decreasing one. Furthermore, we interpret the standard errors, p-values, and confidence intervals as before.

However, for the magnitude we have to take the link function into account. Using the expression for the expected value above, a one-unit increase in acceleration gives:
$$
E[\text{mpg} | \text{acceleration} +1] = E[\text{mpg} | \text{acceleration}] \times e^{-0.0235} =
$$

In [None]:
np.exp(-0.0235)

so expected mpg is multiplied by a *factor* of about 0.977, i.e., it is reduced by about 2.3% per acceleration unit.

Similarly, going from origin A to origin B changes expected mpg by:
$$
E[\text{mpg} | \text{origin B}] = E[\text{mpg} | \text{origin A}] \times e^{0.0893} =
$$

In [None]:
np.exp(0.0893)

So mpg increases by roughly 9.3%.
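The same calculation can be wrapped in a small helper that turns any log-link coefficient into a percentage change of the expected value. The function name `pct_change_from_log_coef` is hypothetical; the coefficients are the ones quoted above.

```python
import numpy as np

def pct_change_from_log_coef(beta, delta=1.0):
    """Percent change in the expected value when a feature increases by delta."""
    return 100.0 * (np.exp(beta * delta) - 1.0)

print(pct_change_from_log_coef(-0.0235))           # one more acceleration unit
print(pct_change_from_log_coef(0.0893))            # origin B relative to origin A
print(pct_change_from_log_coef(-0.0235, delta=5))  # five more acceleration units
```

Note that the effects compound multiplicatively: five additional acceleration units correspond to the factor $e^{5 \times (-0.0235)}$, not to five times the one-unit percentage.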