# Exercise set 5


>This exercise aims to give you more experience with
least squares and the [statsmodels](https://www.statsmodels.org)
library. In particular, you will check for outliers in a
least squares model and perform least squares
regression for a case where we have several variables (some
of which may be irrelevant or correlated with other variables).
The last point should give you some experience in judging if
certain variables should be included or not in a least squares model.
>
>The last part of the exercise aims to give you experience
with calculating effects from the results of a full factorial
experimental design.

## Exercise 5.1

We will revisit [Forbes' data](https://doi.org/10.1017/S0080456800032075) from exercise 4 and investigate
if we have any outliers. Below you will find some code that will help you along.

### 5.1(a)

Create a linear model that predicts the atmospheric pressure
from the boiling point
with `statsmodels`. Plot your model together with the raw data, and plot the residuals. Do you have
any comments about the residuals?

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import pandas as pd

%matplotlib inline
sns.set_context("notebook")

In [None]:
# If you do not have statsmodels installed you can uncomment
# and run this line to try to install it:
#!pip install statsmodels

In [None]:
# Load data
data_forbes = pd.read_csv("Data/forbes.csv")
data_forbes.head()

In [None]:
# Create a least squares model with statsmodels:
x = data_forbes["Temperature (F)"]
y = data_forbes["Pressure (inches Hg)"]

X = sm.add_constant(x)  # Make a matrix with a column of ones and then x

model = sm.OLS(y, X)  # Set up for OLS = Ordinary Least Squares
results = model.fit()  # Find parameters!
y_hat = results.predict(X)  # Use the model to predict y_hat

In [None]:
# Print a small summary to show you R², and the coefficients:
print(results.summary(slim=True))

**Note:** A description of the summary from statsmodels can be found in the [Appendix](#Appendix:-The-summary-results-from-statsmodels).

In [None]:
# Your code here for plotting the model, raw data, and the residuals

#### Your answer to question 5.1(a): Do you have any comments about the residuals?
*Double click here*

### 5.1(b)

Calculate [influence/outlier measures with statsmodels](https://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.OLSInfluence.html#statsmodels.stats.outliers_influence.OLSInfluence) and plot the following:

* (i) the studentized residuals,
* (ii) the leverage ($h_{ii}$ from the $\mathbf{H}$-matrix),
* (iii) the Cook's distance, and
* (iv) the [influence plot](https://www.statsmodels.org/dev/generated/statsmodels.graphics.regressionplots.influence_plot.html).

Based on these plots: Do you see any potential outliers?
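
The leverage values in (ii) are the diagonal elements of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top$. As a minimal sketch of what the `hat_diag` column contains, the following computes them by hand for four hypothetical $x$ values (made up for illustration, not Forbes' data); the variable names are only illustrative:

```python
import numpy as np

# Hypothetical x values, for illustration only (not Forbes' data)
x = np.array([0.0, 1.0, 2.0, 3.0])

# Design matrix with a column of ones (the constant) and then x
X = np.column_stack([np.ones_like(x), x])

# Hat matrix H = X (X^T X)^{-1} X^T; the leverage is its diagonal
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(leverage)  # points furthest from the mean of x get the highest leverage
print(leverage.sum())  # sums to the number of fitted parameters (2 here)
```

For a real model, the values from `influence.summary_frame()` should agree with such a manual calculation.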

In [None]:
# The influence measures can be computed with:
influence = results.get_influence()

In [None]:
# A table of results is:
influence_table = influence.summary_frame()
influence_table

In [None]:
# We can get several things from this table, to use for the plotting:

# (i) studentized residuals:
studentized_residual = influence_table["student_resid"]

# (ii) the leverage:
hii = influence_table["hat_diag"]

# (iii) Cook's distance
cooks_distance = influence_table["cooks_d"]

In [None]:
# Create your plots (i)-(iii) here:

In [None]:
# The influence plot, part (iv) can be created with:
fig = influence.plot_influence()

#### Your answer to question 5.1(b): Do you see any potential outliers?
*Double click here*

### 5.1(c)

Run a hypothesis test for outliers - use the [outlier test](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLSResults.outlier_test.html)
method from `statsmodels`. This test outputs:
* `student_resid`: The studentized residuals.
* `unadj_p`: The unadjusted p-value for the test of the null hypothesis that point *i* is not an outlier, that is, that the expected value of its studentized residual is zero.
* `bonf(p)`: A Bonferroni corrected p-value, which adjusts for the increased risk of Type I errors (mistaken rejection of a true null hypothesis) due to multiple comparisons.

The `outlier_test` method tests the null hypothesis that each point is not an outlier by considering if its studentized residual is significantly different from zero. Since we perform this test *N* times for *N* points, the risk of incorrectly labelling at least one point as an outlier (Type I error) increases. To mitigate this, `outlier_test` will apply a [correction](https://en.wikipedia.org/wiki/Bonferroni_correction), and we should base our decisions on these corrected p-values.

Are any of the points classified as outliers? If yes, remake the model without these points and
compare it with the model you made in part [5.1(a)](#5.1(a)). Did removing the point(s) change the
model substantially?

**Hint:** You can mark a point as an outlier if the corrected p-value is smaller than the significance level, set to `alpha = 0.05` by default in the `outlier_test()` method.
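
The Bonferroni correction itself is simple: each unadjusted p-value is multiplied by the number of tests and capped at 1. A minimal sketch with made-up p-values (the helper name `bonferroni` is hypothetical and not part of `statsmodels`):

```python
def bonferroni(p_values):
    """Return Bonferroni-corrected p-values: min(1, N * p) for N tests."""
    n_tests = len(p_values)
    return [min(1.0, n_tests * p) for p in p_values]


# Made-up unadjusted p-values for four points, for illustration only
unadjusted = [0.001, 0.02, 0.3, 0.8]
corrected = bonferroni(unadjusted)
alpha = 0.05

for p_raw, p_corr in zip(unadjusted, corrected):
    flag = "potential outlier" if p_corr < alpha else ""
    print(f"{p_raw:.3f} -> {p_corr:.3f} {flag}")
```

In this made-up example, the second point would be flagged at the 0.05 level if we used the unadjusted p-value (0.02), but not after the correction (0.08).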

In [None]:
# This is how you run the hypothesis test:
test = results.outlier_test()
test

In [None]:
# Your code here

#### Your answer to question 5.1(c):
*Double click here*

## Exercise 5.2

The file [Data/bloodpress.csv](Data/bloodpress.csv) contains data for 20 individuals with high blood pressure.
Table 1 describes the columns in this file.
Your goal is to create a least squares model for predicting 
the blood pressure (from the other variables) that achieves $R^2 > 0.95$ with a maximum of two variables.


| Column | Description              |             Unit |
|:-------|:-------------------------|-----------------:|
| BP     | Blood pressure           |             mmHg |
| Age    | Age                      |            years |
| Weight | Weight                   |               kg |
| BSA    | Body surface area        |            m$^2$ |
| DUR    | Duration of hypertension |            years |
| BHR    | Basal heart rate         | beats per minute |
| Stress | Stress index             |              --- |

**Table 1:** *Data columns present in the file [Data/bloodpress.csv](Data/bloodpress.csv).*

### 5.2(a)

Begin by exploring the data. Do this by creating
the [scatter plot matrix](https://seaborn.pydata.org/examples/scatterplot_matrix.html)
and the [correlation heatmap](https://seaborn.pydata.org/generated/seaborn.heatmap.html).

Based on your plots, do you find any promising correlations between blood pressure and the
other variables?

In [None]:
data = pd.read_csv("Data/bloodpress.csv")
data.head()

In [None]:
# Example for creating the scatter plot matrix:
grid = sns.pairplot(
    data,
    kind="reg",  # Show a regression line
    #y_vars=["BP",],  # Uncomment to just use BP as y (this makes a smaller plot!)
    #corner=True,  # Uncomment to just plot under the diagonal!
)
grid.fig.tight_layout()  # Clean up the plot

In [None]:
# Example to create the correlation heat map:

# Simple, with just pandas:
data.corr()

In [None]:
# Correlation heat map with seaborn (please experiment with colors etc):
fig, ax = plt.subplots(constrained_layout=True)
sns.heatmap(
    data.corr(),
    cmap="vlag",  # Select color scheme
    annot=True,  # Annotate with numbers
    ax=ax,  # Axis to plot in
    #annot_kws={"size": "small"},  # Font size for numbers (in case they are too big/small)
    linewidths=4,  # Just to separate the squares
    square=True,
    fmt=".2f"  # Formatting to the numbers
)

In [None]:
# Your code here

#### Your answer to question 5.2(a): Do you find any promising correlations between blood pressure and the other variables?
*Double click here*

### 5.2(b)

Create a linear model in which you predict the blood pressure ($y$) from
all six available variables (Age, Weight, BSA, DUR, BHR, and Stress).
Use the `statsmodels` package
and the ordinary least squares (OLS) method
to create the model. Scale the variables before making the model (if you think
this is a good idea!).

Print out the summary created by `statsmodels` (please see [Appendix](#Appendix:-The-summary-results-from-statsmodels) for a short overview of the summary output from `statsmodels`).
What values do you get for $R^2$ and $R^2_\text{adjusted}$?

In [None]:
from sklearn.preprocessing import scale  # If you want to scale the data

# Here is an example to get you started:
y = data["BP"]
X = data[["Dur", "Stress"]]  # Select two of the variables
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary())


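# And another example if you want to scale the variables:
y = data["BP"]
X = data[["Dur", "Stress"]]  # Select two of the variables
y = scale(y)  # Center to mean 0 and scale to unit variance
X = scale(X)  # Note: scale() returns NumPy arrays, so the column names are lost
# No constant is added here: after centering y and X, the intercept is zero
model = sm.OLS(y, X).fit()
print(model.summary())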

In [None]:
# Your code here

#### Your answer to question 5.2(b): What values do you get for $R^2$ and $R^2_\text{adjusted}$?
*Double click here*

### 5.2(c)
Inspect the $p$-values for the different variables in the summary you just printed.
Use this to create a new model with fewer variables, recalculate $R^2$
and  $R^2_\text{adjusted}$, and compare with the previous results.

In [None]:
# Your code here

#### Your answer to question 5.2(c): (Any comments for your comparison with the previous model?)
*Double click here*

### 5.2(d)
Simplify the model to two variables and recalculate
$R^2$ and $R^2_\text{adjusted}$. What variables are you using now,
and do you still get an $R^2 > 0.95$?

In [None]:
# Your code here

#### Your answer to question 5.2(d): What variables are you using now?
*Double click here*

## Exercise 5.3

The growth rate of a particular bacterium species depends
on the concentration of nutrients such as phosphate ($P$),
sucrose ($S$), and nitrate ($N$). 
Table 2 displays the experimental design
used to investigate how these three concentrations
influence the growth rate.


|$P$  | $S$ | $N$ | $PS$ | $PN$ | $SN$ | $PSN$ | **Growth rate**  |
|:---:|:---:|:---:|:----:|:----:|:----:|:-----:|:----------------:|
| $+$ | $-$ | $-$ | $-$  | $-$  | $+$  | $+$   | $7$              |  
| $-$ | $+$ | $-$ | $-$  | $+$  | $-$  | $+$   | $10$             | 
| $+$ | $-$ | $+$ | $-$  | $+$  | $-$  | $-$   | $8$              | 
| $-$ | $+$ | $+$ | $-$  | $-$  | $+$  | $-$   | $11$             |  
| $-$ | $-$ | $-$ | $+$  | $+$  | $+$  | $-$   | $11$             |
| $+$ | $+$ | $+$ | $+$  | $+$  | $+$  | $+$   | $12$             |
| $+$ | $+$ | $-$ | $+$  | $-$  | $-$  | $-$   | $7$              |
| $-$ | $-$ | $+$ | $+$  | $-$  | $-$  | $+$   | $7$              | 

**Table 2:** *Experimental design matrix for the growth rate of the investigated bacteria. The factors are the concentration of phosphate ($P$), the concentration of sucrose ($S$), and the concentration of nitrate ($N$).*

### 5.3(a)
Compute the main effects.
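
As a reminder, the effect of a factor is the average response at its high level ($+$) minus the average response at its low level ($-$). A minimal sketch for a smaller, hypothetical $2^2$ design with two factors $A$ and $B$ and made-up responses (not the data in Table 2):

```python
import numpy as np

# Hypothetical 2^2 design (+1 = high level, -1 = low level)
A = np.array([-1, +1, -1, +1])
B = np.array([-1, -1, +1, +1])
# Made-up responses, for illustration only
response = np.array([10.0, 14.0, 12.0, 20.0])


def effect(column, y):
    """Mean response at the high level minus mean response at the low level."""
    return y[column == 1].mean() - y[column == -1].mean()


effect_A = effect(A, response)
effect_B = effect(B, response)
# The interaction column is the element-wise product of the factor columns
effect_AB = effect(A * B, response)

print(effect_A, effect_B, effect_AB)
```

The same function can be applied to each column of a larger design once the signs are stored as $\pm 1$.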

In [None]:
# Your code here

#### Your answer to question 5.3(a):
*Double click here*

### 5.3(b)
Verify that the columns for the 2-factor and 3-factor interaction effects are correct in Table 2 and compute the interaction effects.

In [None]:
# Your code here

#### Your answer to question 5.3(b):
*Double click here*

### 5.3(c)
What factors and interactions seem
to increase the growth rate?

In [None]:
# Your code here

#### Your answer to question 5.3(c):
*Double click here*

## Appendix: The summary results from `statsmodels`


The summary method in `statsmodels` prints out a lot of information.
We have fitted a model $y=a + bx$ to 10 $(x, y)$ points with `statsmodels`
and the resulting summary output is printed below. This output is described in the
sections below. The most important one for this exercise is [Information about the coefficients](#Information-about-the-coefficients).

```code
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.956
Model:                            OLS   Adj. R-squared:                  0.951
Method:                 Least Squares   F-statistic:                     175.6
Date:                Tue, 14 Feb 2023   Prob (F-statistic):           1.00e-06
Time:                        08:42:06   Log-Likelihood:                -16.957
No. Observations:                  10   AIC:                             37.91
Df Residuals:                       8   BIC:                             38.52
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.4248      0.931      4.754      0.001       2.278       6.571
x1             1.9235      0.145     13.253      0.000       1.589       2.258
==============================================================================
Omnibus:                        3.674   Durbin-Watson:                   2.067
Prob(Omnibus):                  0.159   Jarque-Bera (JB):                0.755
Skew:                           0.464   Prob(JB):                        0.686
Kurtosis:                       3.975   Cond. No.                         13.0
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

### Information about the model


- **Dep. Variable:** The dependent variable (the variable we are predicting, $y$) in the model.
- **Model:** The type of model we have created (OLS = Ordinary Least Squares).
- **Method:** We have used Least squares to find the parameters.
- **Date & Time:** The date and time for when we created the model.
- **No. Observations:** The number of observations in the data set (we had 10 ($x$,$y$) values here).

### Information about the calculation
- **Df Residuals:** Degrees of freedom for the residuals (sum of squares). 
  This is equal to $n - k - 1$ where $n$ is the number of observations and $k$ is
  the number of variables. In our case: $n - k - 1 = 10 - 1 - 1 = 8$. If we did the
  fitting without the constant term (for instance, by centering the data first), this
  number would be $n-k = 10-1=9$.
- **Df Model:** Degrees of freedom for the model (number of variables in the model).
- **Covariance type:** Calculations of standard errors assume homoscedastic errors.
  If this is not the case, then the standard error is not computed correctly. There
  are alternative ways of calculating the standard error; this field tells you
  if statsmodels used a more robust method.

### Information about the overall quality
- **R-squared:** Coefficient of determination ($R^2$) for the model.
- **Adj. R-squared:** The adjusted $R^2$ for the model. Useful for comparing
  models, as this one will only increase (when adding more variables) if the
  increase in $R^2$ is more than one would expect by chance.
- **F-statistic:** This is the result of an F-test where the null hypothesis is that all
  regression coefficients (except the constant term) are equal to zero. Effectively, this compares the model we
  have just made to an alternative model consisting only of the constant intercept term.
  To use this value, we would have to decide on an $\alpha$ level and look up a critical F-value.
  This is some extra work for us, so we typically focus on the **Prob (F-statistic)** instead.
- **Prob (F-statistic):** This is the probability of getting an **F-statistic** at
  least as extreme as the one above if all regression coefficients (except the constant term) are zero.
  It is the $p$-value of the F-test.
  If we have selected an $\alpha$ value, we reject the null hypothesis if
  the $p$-value is smaller than $\alpha$. Here, we have a very small $p$-value, and we reject the
  null hypothesis: We conclude that at least one regression parameter is
  significant for predicting $y$.
- **Log-Likelihood:** In least squares, we are minimizing the squared error.
  This is equivalent (if the errors are normally distributed)
  to maximizing the likelihood. The value printed here is the
  logarithm of the likelihood for the model.
- **AIC and BIC:** The
  [Akaike information criterion (AIC)](https://en.wikipedia.org/wiki/Akaike_information_criterion) and
  [Bayesian information criterion (BIC)](https://en.wikipedia.org/wiki/Bayesian_information_criterion).
  These can be directly calculated from the Log-Likelihood and are useful for comparing alternative
  models. Generally, we prefer models with lower AIC and BIC.
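
Several of these numbers are connected. As a small consistency check, the sketch below recomputes the adjusted $R^2$, AIC, and BIC from other values in the summary printed above. It assumes the standard definitions $R^2_\text{adjusted} = 1 - (1 - R^2)\frac{n-1}{n-k-1}$, $\text{AIC} = 2p - 2\ln L$, and $\text{BIC} = p \ln n - 2 \ln L$, where $p = k + 1$ is the number of fitted parameters (the variables plus the constant). Small deviations from the printed summary are expected because the inputs are rounded:

```python
import math

# Values read from the summary above
n_obs = 10  # No. Observations
k = 1  # Df Model (number of variables)
r_squared = 0.956  # R-squared (rounded)
log_likelihood = -16.957  # Log-Likelihood

n_params = k + 1  # the variables plus the constant term

adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - k - 1)
aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(f"Adj. R-squared: {adj_r_squared:.4f}")
print(f"AIC: {aic:.2f}")
print(f"BIC: {bic:.2f}")
```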

### Information about the coefficients

- **coef:** The determined coefficients for the model.

- **std err:** The standard error of the coefficients. This
  is the square root of the estimated variance of each coefficient, calculated from,
  
  \begin{equation*}
  \text{Var}(\mathbf{b}) = s^2 \cdot \text{diag} \left[ \left(\mathbf{X}^\top \mathbf{X} \right)^{-1} \right],
  \end{equation*}
  
  where,
  
  \begin{equation*}
  s^2 = \frac{SSE}{n - k - 1},
  \end{equation*}
  
  and $SSE$ is the sum of squared error/residuals, $n$ the number of data points (10 in this case)
  and $k$ the number of variables (1 in this case).

- **t, P>|t|, and [0.025 0.975]:** Some statistics for the
  coefficients. **t** is the $t$ statistic, which is obtained by dividing
  the coefficient by the standard error.
  This is the statistic in a test where the null hypothesis is that the coefficient is zero.
  To use the $t$ statistic we would have to consult a table with critical $t$-values for $n-k-1$
  degrees of freedom. The **P>|t|** is the $p$-value for such a $t$-test.
  Here, the $t$ statistic
  is high (and the p-value is low) and we would reject this null hypothesis for both the
  constant and x1. In other words, these coefficients are significantly different from
  zero.
  Finally, the **[0.025 0.975]**
  represents a $100(1-\alpha)\%$ confidence interval for the coefficients. We did not specify 
  $\alpha$ here, but we can give it as a parameter. The default is $\alpha=0.05$.
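
The quantities in the coefficient table can be reproduced step by step. The following is a minimal sketch with NumPy and SciPy for four hypothetical data points (chosen so that the numbers are easy to check by hand; this is not the data behind the summary above). All variable names are illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical data, for illustration only: y = 1 + 2x plus a small error
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 4.0, 8.0])

X = np.column_stack([np.ones_like(x), x])  # constant + one variable
n, n_params = X.shape
k = n_params - 1  # number of variables (1 here)

# coef: least squares solution b = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y

# std err: sqrt(s^2 * diag((X^T X)^{-1})) with s^2 = SSE / (n - k - 1)
residuals = y - X @ b
sse = np.sum(residuals**2)
dof = n - k - 1
s2 = sse / dof
std_err = np.sqrt(s2 * np.diag(XtX_inv))

# t and P>|t|: two-sided t-test of the null hypothesis "coefficient = 0"
t_values = b / std_err
p_values = 2 * stats.t.sf(np.abs(t_values), dof)

# [0.025 0.975]: confidence interval for alpha = 0.05
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, dof)
ci_low = b - t_crit * std_err
ci_high = b + t_crit * std_err

print("coef:    ", np.round(b, 3))
print("std err: ", np.round(std_err, 3))
print("t:       ", np.round(t_values, 3))
print("P>|t|:   ", np.round(p_values, 3))
print("[0.025:  ", np.round(ci_low, 3))
print("0.975]:  ", np.round(ci_high, 3))
```

Fitting the same data with `sm.OLS` should give matching values in the summary table.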

### Information about residuals
- **Omnibus and Prob(Omnibus):** This is a
  [statistical test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html)
  that checks if
  the residuals are normally distributed. **Prob(Omnibus)** is the $p$-value for the null
  hypothesis that the residuals are normally distributed; a small value is evidence against normality.
- **Skew:** This is a measure of the asymmetry of the residuals. For a normal distribution, the skewness is 0.
- **Kurtosis:** This measures the "tailedness" of the residuals. For a normal distribution, the kurtosis is 3.
- **Jarque-Bera (JB) and Prob(JB):** This is a statistical test that checks the
  same thing as the **Omnibus** (but the test itself is different). Ideally, it should
  agree with the **Omnibus** test.
- **Durbin-Watson:** This is a test statistic that essentially checks if there
  is some serial correlation (a relationship between consecutive residuals). The value is between 0 and 4.
  A value close to 2 indicates no serial correlation. Values close to 0 indicate
  a positive serial correlation. Values close to 4 indicate 
  a negative serial correlation.
- **Cond. No.:** The [condition number](https://numpy.org/doc/stable/reference/generated/numpy.linalg.cond.html)
  measures the sensitivity
  of the solution (our parameters) to small perturbations in the input data. With just
  one variable, this value is not so important. Statsmodels will print a warning if
  the condition number is larger than 1000; statsmodels interprets this as an indication
  of multicollinearity or numerical problems.
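
To make the Durbin-Watson values more concrete: the statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch with three made-up residual sequences (the helper `durbin_watson_stat` is hypothetical; `statsmodels` provides its own implementation in `statsmodels.stats.stattools.durbin_watson`):

```python
def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e**2 for e in residuals)
    return numerator / denominator


# Made-up residual sequences, for illustration only
trending = [-1.0, -1.0, 1.0, 1.0]  # neighbours tend to have the same sign
mixed = [1.0, -1.0, -1.0, 1.0]
alternating = [1.0, -1.0, 1.0, -1.0]  # neighbours always change sign

print(durbin_watson_stat(trending))  # below 2: positive serial correlation
print(durbin_watson_stat(mixed))  # equal to 2: no serial correlation
print(durbin_watson_stat(alternating))  # above 2: negative serial correlation
```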