# Lab 7: Linear Regression Analysis

For this first section, we will be using our NHANES BMX dataset again.

***

### Load packages and dataset

In [None]:
# import packages

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from statsmodels.regression.linear_model import OLS
import statsmodels.api as sm

In [None]:
# load dataset

bmx_data = pd.read_csv("datasets/BMX_data.csv")
display(bmx_data.head())

## Scatter Plots with Simple Regression Line

### Task 1

Below, we recreate the scatter plot of weight vs. height from Lab 6. **Run the cell to display this scatter plot.**

In [None]:
# pull out the height and weight values from the dataframe
# these values are stored in the columns "BMXHT" and "BMXWT"

height_values = bmx_data["BMXHT"]
weight_values = bmx_data["BMXWT"]

# create the scatter plot

fig, axs = plt.subplots(figsize=(10,7))
axs.scatter(height_values, weight_values, c="SkyBlue")
plt.title("Weight vs Height", fontsize=20)
axs.set_xlabel("Height (cm)", fontsize=15)
axs.set_ylabel("Weight (kg)", fontsize=15)
axs.tick_params(labelsize=10)
plt.show()

Now, to further describe the relationship between weight and height, we will **fit a simple regression model to our data**. This is done in the cell below. **Run this cell.**

<p>
<details><summary>Click to show explanation of code in the cell below</summary><br>

```python 
reg_model_1 = OLS(bmx_data["BMXWT"], sm.add_constant(bmx_data["BMXHT"].values)).fit()
```
- `OLS(...)` sets up an ordinary least squares regression model from the supplied y- and x-values. The first argument is the set of y-values (`bmx_data["BMXWT"]`).
- The x-values (`bmx_data["BMXHT"].values`) are first passed to the `sm.add_constant()` function, which adds a column of ones so that the model includes a constant (the intercept, $b_0$).
- The result of `sm.add_constant()` is passed to `OLS` as the second argument.
- Finally, we call `fit()` to estimate the parameters of the regression line.

</details>
</p>
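
To make the role of `sm.add_constant()` more concrete, the sketch below builds the same kind of input by hand with NumPy. The height values are made up for illustration, and `np.column_stack` is only a stand-in for `sm.add_constant()`, which with its default settings also places the column of ones first.

```python
import numpy as np

# hypothetical height values (cm), for illustration only
heights = np.array([150.0, 160.0, 170.0, 180.0])

# prepend a column of ones: each row becomes [1, height],
# so the model can estimate an intercept b0 alongside the slope b1
design_matrix = np.column_stack([np.ones(len(heights)), heights])

print(design_matrix.shape)  # one row per individual, two columns
print(design_matrix[0])     # first row: the constant and the first height
```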

In [None]:
# generate a simple linear regression model

# specify x_vals and y_vals

x_vals = height_values.values # .values converts the pandas Series to a NumPy array
y_vals = weight_values

reg_model_1 = OLS(y_vals, sm.add_constant(x_vals)).fit()

We can extract the parameters estimated by the regression model above and use them to add a regression line to our scatter plot. **Run the cell below to do this.**

<p>
<details><summary>Click to show explanation of code in the cell below</summary><br>

- The first block of code below is just the regular scatter plot code.

```python
x_vals_1 = list(bmx_data["BMXHT"])
x_vals_1.sort()
x_plot_1 = np.linspace(x_vals_1[0], x_vals_1[-1], len(x_vals_1))
```
- The above code first creates a list of all x-values (height values) and then sorts this list. The last line uses `np.linspace()` to create an array of evenly spaced x-values between the smallest and largest heights.

```python
b0 = reg_model_1.params[0]
b1 = reg_model_1.params[1]
plt.plot(x_plot_1, x_plot_1*b1 + b0, color='black')
```
- Finally, we can add the regression line to the plot. This bit of code will be explained in the next section.

</details>
</p>
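
Because the fitted relationship is a straight line, evaluating the equation at evenly spaced x-values gives evenly spaced y-values, so any number of points (even just the two endpoints) traces the same line. Here is a small sketch of that idea; `b0_demo` and `b1_demo` are hypothetical values chosen for easy arithmetic, not the fitted parameters.

```python
import numpy as np

# hypothetical intercept and slope, for illustration only
b0_demo, b1_demo = -90.0, 1.0

# five evenly spaced heights between a made-up minimum and maximum
x_plot = np.linspace(150.0, 190.0, 5)

# evaluate the line equation y = b0 + b1 * x at each x-value
y_plot = b0_demo + b1_demo * x_plot

print(x_plot)
print(y_plot)
```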

In [None]:
# create scatter plot of weight vs height
fig, axs = plt.subplots(figsize=(10,7))
axs.scatter(x_vals, y_vals, c="SkyBlue")
plt.title("Weight vs Height", fontsize=20)
axs.set_xlabel("Height (cm)", fontsize=15)
axs.set_ylabel("Weight (kg)", fontsize=15)
axs.tick_params(labelsize=10)

# create x-vals for regression line
x_vals_1 = list(x_vals)
x_vals_1.sort()
x_plot_1 = np.linspace(x_vals_1[0], x_vals_1[-1], len(x_vals_1)) 

# add line to scatter plot
b0 = reg_model_1.params[0]
b1 = reg_model_1.params[1]
plt.plot(x_plot_1, x_plot_1*b1 + b0, color='black')

plt.show()

So now we have our resulting regression line, which is an estimate of the linear relationship between an individual’s height and weight.

## Linear Regression Output Analysis

### Task 2

Now, let us look at a description of our simple linear regression model fit in the previous section. **Run the cell which prints a summary of the linear regression model.**

In [None]:
# print a summary of our linear regression model created above
# this model is stored in the reg_model_1 variable

display(reg_model_1.summary())

Our model is characterized by two important values, called parameters: $b_0$ (the constant, or intercept) and $b_1$ (the coefficient, or slope). These values are indicated in a screenshot of the output below:
![](img/model-output.png)
So, the resulting equation for our weight-height regression line is: $\hat{w}_i = -89.6894 + 0.9965h_i$ (where $w$ is weight in kg and $h$ is height in cm).
> Because the coefficient (0.9965) is so close to 1, each additional centimeter of height is associated with an increase of almost 1 kg in predicted weight, on average.

To access the values of these important parameters we use the following code:
```python
b0 = reg_model_1.params[0]
b1 = reg_model_1.params[1]
```
> Then, to plot this regression line characterized by these parameters we use the line of code from the previous section: `plt.plot(x_plot_1, x_plot_1*b1 + b0, color='black')`. We supply the plot() function with (1) a list of x-values over which to draw the line, (2) the equation for the line, (3) a color for the line (optional).

Another important value in our regression model output is the *R-squared value* (top value in the right column).
> This indicates the proportion of the variation in our response variable (weight) that is explained by our model.
> In our case, about 64% of the variation in our observed weight values can be explained by height. The rest of the variation may be explained by other factors that our model does not capture.
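
As a rough sketch of where this number comes from, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The observed and predicted values below are made up for illustration and are not taken from the NHANES data.

```python
# hypothetical observed values and model predictions, for illustration only
observed = [60.0, 70.0, 80.0, 90.0]
predicted = [62.0, 69.0, 79.0, 90.0]

mean_observed = sum(observed) / len(observed)

# residual sum of squares: variation left unexplained by the model
ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

# total sum of squares: variation of the observed values around their mean
ss_tot = sum((y - mean_observed) ** 2 for y in observed)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```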

Now, we can use our regression equation to predict someone's weight given their height. It is best to predict the y-value only for an x-value that is within the range of our data. For example, we shouldn't really predict the weight of an individual who is 70cm tall. Predicting outside the observed range is called extrapolation, and it is risky because we have no observed data there to tell us whether the linear relationship still holds. Indeed, at a height of 70cm our equation would give a negative weight (about $-19.9$ kg).
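
One simple way to guard against extrapolation is to check a new x-value against the smallest and largest observed x-values before predicting. The helper below is a hypothetical sketch (`is_within_observed_range` is not part of any library), and the heights are made up for illustration.

```python
def is_within_observed_range(x_new, observed_x):
    """Return True if x_new lies between the smallest and largest observed x."""
    return min(observed_x) <= x_new <= max(observed_x)

# hypothetical observed heights (cm), for illustration only
observed_heights = [152.0, 160.5, 170.0, 181.3, 190.2]

print(is_within_observed_range(170, observed_heights))  # inside the range
print(is_within_observed_range(70, observed_heights))   # would be extrapolation
```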

In our observed data there is an individual with the following measurements:
![](img/individual1.png)

Let's use the parameters of our linear regression model to see what our model predicts for the weight value of this individual who is 170cm tall.

In the next code cell **calculate the predicted weight of an individual with a height of 170cm**.

In [None]:
# calculate the predicted weight of an individual with a height of 170cm
# using the regression equation

b0 = reg_model_1.params[0]
b1 = reg_model_1.params[1]

pred_weight = 

print("Predicted weight:", pred_weight)

<p>
<details><summary>Click to show answer</summary><br>

`Predicted weight: 79.72367707822576`

</details>
</p>

<p>
<details><summary>Click to show solution</summary><br>

```python
pred_weight = b0 + b1*170
```

</details>
</p>

Our model predicts 79.72kg to be the weight of this individual. This is somewhat close to the observed value of 75.3kg!
> There is some difference between our observed and predicted values. This difference is called the residual deviation of our predicted value, and is equal to: $e_i = y_i - \hat{y}_i = 75.3 - 79.72 = -4.42$
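
The residual calculation can be sketched in a few lines. This example uses the rounded parameter values printed in the model summary rather than the full-precision values in `reg_model_1.params`, so the prediction differs slightly in the later decimal places; the variable names are chosen here for illustration.

```python
# rounded parameter values as shown in the model summary
b0_rounded = -89.6894
b1_rounded = 0.9965

observed_weight = 75.3  # observed weight (kg) of the 170cm individual

# predicted weight from the regression equation
predicted_weight = b0_rounded + b1_rounded * 170

# residual = observed value - predicted value
residual = observed_weight - predicted_weight

print(round(predicted_weight, 2), round(residual, 2))
```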

Linear regression chooses the line that gets as close as possible to the data points overall by minimizing the *sum of the squared residuals*.
> We want to minimize this value: $\sum_{i=1}^n(y_i - \hat{y}_i)^2$
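
For simple linear regression, the parameters that minimize this sum have a closed form: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept makes the line pass through the point of means. The sketch below applies these formulas to a small made-up dataset and checks that a slightly different line has a larger sum of squared residuals; the function name `sum_squared_residuals` is hypothetical.

```python
# small made-up dataset, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)

# least-squares slope and intercept
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_mean - b1 * x_mean

def sum_squared_residuals(intercept, slope):
    """Sum of squared differences between observed and predicted y-values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

print(b0, b1)
print(sum_squared_residuals(b0, b1))    # least-squares line
print(sum_squared_residuals(2.0, 0.7))  # a nearby line does worse
```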

## Multiple Regression Analysis

### Task 3

So, we know that in our dataset, height explains approximately 64% of the variation in weight. What other factors could explain the variation in weight?

Waist circumference is another variable captured in the NHANES dataset, and it is possible that there is some association between an individual’s waist circumference and weight, as well as between height and weight.

Modeling the relationship between two or more explanatory variables and one response variable is called *multiple regression*. Again, we will focus on linear relationships.

With an additional explanatory variable, our linear regression model will now look like this: $\hat{y}_i = b_0 + b_1x_{i1} + b_2x_{i2}$

The next cell contains code to create a multiple linear regression model for these variables. **Run the cell.**

<p>
<details><summary>Click to show explanation of code in the cell below</summary><br>

```python 
X = bmx_data[['BMXHT', 'BMXWAIST']]
X = sm.add_constant(X.values)
```
- This code extracts from the dataset all values in the columns which correspond to our explanatory variables.
- Then we add a constant to our x-values so that we include an intercept in our model.

```python
y = bmx_data['BMXWT']
```
- This code extracts from the dataset our y-values.

```python
multiple_reg_model = OLS(y, X).fit()

display(multiple_reg_model.summary())
```
- Then we fit the model and print out a summary of its parameters.

</details>
</p>
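
To see what fitting with two explanatory variables involves, here is a minimal sketch on made-up data. It uses `np.linalg.lstsq` as a stand-in for `OLS(...).fit()`, and the weights are generated exactly from assumed parameters (-100, 0.5 and 1.0), so the fit should recover those values. None of these numbers come from the NHANES data.

```python
import numpy as np

# made-up heights (cm) and waist circumferences (cm), for illustration only
heights = np.array([150.0, 160.0, 170.0, 180.0, 175.0])
waists = np.array([70.0, 85.0, 92.0, 100.0, 80.0])

# weights generated exactly from assumed parameters b0=-100, b1=0.5, b2=1.0
weights = -100.0 + 0.5 * heights + 1.0 * waists

# design matrix: a column of ones (intercept), then one column per variable
X_demo = np.column_stack([np.ones(len(heights)), heights, waists])

# least-squares solution for [b0, b1, b2]
coefs, _, _, _ = np.linalg.lstsq(X_demo, weights, rcond=None)
b0_demo, b1_demo, b2_demo = coefs

# prediction for a new individual: 170cm tall, 92cm waist
pred_demo = b0_demo + b1_demo * 170 + b2_demo * 92
print(np.round(coefs, 4), round(float(pred_demo), 2))
```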

In [None]:
# generate a multiple linear regression model

# extract the x-values (explanatory variables)
X = bmx_data[['BMXHT', 'BMXWAIST']]
X = sm.add_constant(X.values) # add a column of ones so the model includes an intercept

# extract the y-values (response variable)
y = bmx_data['BMXWT']

# fit the model
multiple_reg_model = OLS(y, X).fit()

# display a summary description of the model
display(multiple_reg_model.summary())

The first thing we notice in our multiple linear regression model is that the R-squared value is now 94% (it was 64% without waist circumference as an explanatory variable). With the addition of waist circumference as an explanatory variable, our model is able to explain a lot more of the variation in weight.

Our model output now provides us with three parameters ($b_0$, $b_1$ and $b_2$), which we can access with the following code:
```python
b0 = multiple_reg_model.params[0]
b1 = multiple_reg_model.params[1]
b2 = multiple_reg_model.params[2]
```

Here is the data on our 170cm tall individual again:
![](img/individual2.png)

This individual's waist circumference is 92.6cm. **Use our multiple linear regression model to estimate the weight of this individual.**

In [None]:
# calculate the predicted weight of an individual who is
# 170cm tall and has waist circumference 92.6cm

b0 = multiple_reg_model.params[0]
b1 = multiple_reg_model.params[1]
b2 = multiple_reg_model.params[2]

pred_weight = 

print("Predicted weight:", pred_weight)

<p>
<details><summary>Click to show answer</summary><br>

`Predicted weight: 76.52849876781235`

</details>
</p>

<p>
<details><summary>Click to show solution</summary><br>

```python
pred_weight = b0 + b1*170 + b2*92.6
```

</details>
</p>

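Notice that the predicted weight is now much closer to the observed weight of this individual (75.3kg): the residual shrinks from about $-4.42$ kg with height alone to about $75.3 - 76.53 = -1.23$ kg.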