<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Objectives" data-toc-modified-id="Objectives-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Dataset" data-toc-modified-id="Dataset-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Dataset</a></span></li><li><span><a href="#Build-and-Evaluate-a-Quadratic-Model" data-toc-modified-id="Build-and-Evaluate-a-Quadratic-Model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Build and Evaluate a Quadratic Model</a></span></li><li><span><a href="#Build-and-Evaluate-a-4th-Degree-Polynomial-Model" data-toc-modified-id="Build-and-Evaluate-a-4th-Degree-Polynomial-Model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Build and Evaluate a 4th Degree Polynomial Model</a></span></li><li><span><a href="#Build-and-Evaluate-an-8th-Degree-Polynomial-Model" data-toc-modified-id="Build-and-Evaluate-an-8th-Degree-Polynomial-Model-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Build and Evaluate an 8th Degree Polynomial Model</a></span></li><li><span><a href="#Plot-All-Models" data-toc-modified-id="Plot-All-Models-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Plot All Models</a></span><ul class="toc-item"><li><span><a href="#Interpret-Findings" data-toc-modified-id="Interpret-Findings-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Interpret Findings</a></span></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

# Polynomial Regression - Lab

## Introduction

In this lab, you'll practice adding polynomial terms to a regression model!

## Objectives

You will be able to:

* Determine if polynomial regression would be useful for a specific model or set of data
* Create polynomial terms out of independent variables in linear regression

## Dataset

For this lab you'll be using some generated data:

In [2]:
# Run this cell without changes
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('sample_data.csv')
df.head()


Let's check out a scatter plot of `x` vs. `y`: 

In [None]:
# Run this cell without changes
df.plot.scatter(x="x", y="y");

You will notice that the data clearly follow a non-linear shape. Begin to think about what degree of polynomial you believe will fit them best.

You will fit several models with different polynomial degrees, then plot them together at the end.

In [None]:
# Your code here - import StatsModels and separate the data into X and y
import statsmodels.api as sm

X = df.drop("y", axis=1)
y = pd.DataFrame(df["y"])


In [None]:
# Plot x raised to several powers to preview the shapes polynomial terms can take
import matplotlib as mpl
mpl.rcParams["axes.grid"] = True

fig, axes = plt.subplots(ncols=4, figsize=(20, 5))

for i, ax in enumerate(axes):
    ax.axhline(0, color="black")
    ax.axvline(0, color="black")
    degree = i + 2
    ax.plot(X, X**degree)
    ax.set_title(f"$x^{degree}$", fontsize=24)

fig.tight_layout()


## Build and Evaluate a Quadratic Model

This model should include a constant, `x`, and `x` squared. You can use `pandas` or `PolynomialFeatures` to create the squared term.

In [None]:
# Your code here - prepare quadratic data and fit a model

# adding polynomial terms 
X_quad = X.copy()
X_quad["X_squared"] = X_quad["x"]**2
print(X_quad)

quad_model = sm.OLS(y, sm.add_constant(X_quad))
quad_results = quad_model.fit()
print(quad_results.summary())

In [None]:
# Your code here - evaluate (adjusted) R-Squared and coefficient p-values
print(f"Adjusted r-squared: {quad_results.rsquared_adj}")

#check whether the coefficients are statistically significant
pvalues_df = pd.DataFrame(quad_results.pvalues, columns=["p-value"])
pvalues_df["p < 0.05"] = pvalues_df["p-value"] < 0.05
pvalues_df[pvalues_df["p < 0.05"]]

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>
    
This is not a good model. Because we have multiple terms and are explaining so little of the variance in `y`, we actually have a negative adjusted R-Squared.

None of the coefficients are statistically significant at an alpha of 0.05.
    
</details>
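To see how an adjusted R-Squared can drop below zero, it may help to compute it by hand. The sketch below uses the standard adjustment formula for a model with an intercept; the function name and the input numbers are hypothetical and are not taken from the lab data.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept plus n_predictors terms."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical values: a tiny R-squared, 100 observations, 2 predictors (x and x squared)
example = adjusted_r_squared(r_squared=0.01, n_obs=100, n_predictors=2)
print(round(example, 4))  # -0.0104
```

The penalty for the extra terms outweighs the small amount of variance explained, so the adjusted value ends up slightly negative.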

## Build and Evaluate a 4th Degree Polynomial Model

In other words, the model should include $x^0$ (intercept), $x^1$, $x^2$, $x^3$, and $x^4$ terms.

At this point we recommend importing and using `PolynomialFeatures` if you haven't already!

In [None]:
# Your code here - prepare 4th degree polynomial data and fit a model
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(4)
X_poly_4 = poly.fit_transform(X)

# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out
feature_names = poly.get_feature_names_out(X.columns)

X_poly_4 = pd.DataFrame(X_poly_4, columns=feature_names, index=X.index)

poly_4_model = sm.OLS(y, sm.add_constant(X_poly_4))
poly_4_results = poly_4_model.fit()
print(poly_4_results.summary())

In [None]:
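# Your code here - evaluate (adjusted) R-Squared and coefficient p-values
print(f"Adjusted r-squared: {poly_4_results.rsquared_adj}")

# Check whether the coefficients are statistically significant
pvalues_df = pd.DataFrame(poly_4_results.pvalues, columns=["p-value"])
pvalues_df["p < 0.05"] = pvalues_df["p-value"] < 0.05
pvalues_df[pvalues_df["p < 0.05"]]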

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>
    
This is much better. We are explaining 57-58% of the variance in the target and all of our coefficients are statistically significant at an alpha of 0.05.
    
</details>

## Build and Evaluate an 8th Degree Polynomial Model

This model should include $x^0$ through $x^8$.

In [None]:
# Your code here - prepare 8th degree polynomial data and fit a model
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(8)
X_poly_8 = poly.fit_transform(X)

# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out
feature_names = poly.get_feature_names_out(X.columns)

X_poly_8 = pd.DataFrame(X_poly_8, columns=feature_names, index=X.index)

poly_8_model = sm.OLS(y, sm.add_constant(X_poly_8))
poly_8_results = poly_8_model.fit()
print(poly_8_results.summary())

In [None]:
# Your code here - evaluate (adjusted) R-Squared and coefficient p-values
print(f"Adjusted r-squared: {poly_8_results.rsquared_adj}")

# Check whether the coefficients are statistically significant
pvalues_df = pd.DataFrame(poly_8_results.pvalues, columns=["p-value"])
pvalues_df["p < 0.05"] = pvalues_df["p-value"] < 0.05
pvalues_df[pvalues_df["p < 0.05"]]

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>
    
Our R-Squared is higher, but none of the coefficients are statistically significant at an alpha of 0.05 any more. If what we care about is an inferential understanding of the data, this polynomial degree is too high.
    
</details>
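One likely contributor to the loss of significance is that high powers of the same variable are strongly correlated with one another, which makes the individual coefficients hard to estimate precisely. The sketch below illustrates this with the condition number of the design matrix at each degree; `x_grid` is a hypothetical grid rather than the lab's `x` values, and `design_matrix` is a helper introduced only for this example.

```python
import numpy as np

# Hypothetical x values; the lab's real data come from sample_data.csv
x_grid = np.linspace(-3, 3, 50)


def design_matrix(x, degree):
    """Columns x^0, x^1, ..., x^degree, matching the terms in each model."""
    return np.vander(x, N=degree + 1, increasing=True)


# A larger condition number means the columns are closer to being linearly dependent
condition_numbers = {
    degree: np.linalg.cond(design_matrix(x_grid, degree)) for degree in (2, 4, 8)
}

for degree, cond in condition_numbers.items():
    print(f"degree {degree}: condition number {cond:.3g}")
```

The StatsModels summary also reports a condition number in its footer, so you can make the same comparison on the fitted models themselves.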

## Plot All Models

Build a single plot that shows the raw data as a scatter plot, as well as all of the models you have developed as line graphs. Make sure that everything is labeled so you can tell the different models apart!

In [None]:
# Your code here

fig, ax = plt.subplots(figsize=(10, 6))

models = [quad_results, poly_4_results, poly_8_results]
data = [X_quad, X_poly_4, X_poly_8]  # Use the correct DataFrames
colors = ['orange', 'green', 'blue']

ax.scatter(X, y, label="data points", color="black")
for i, model in enumerate(models):
    ax.plot(
        X,  # Plot same x values for every model
        model.predict(sm.add_constant(data[i])),  # Generate predictions using relevant preprocessed data with constant
        label=f"polynomial degree {2 ** (i + 1)}",  # Degrees are 2, 4, and 8, i.e. 2 ** (i + 1)
        color=colors[i],  # Select color from list declared earlier
        linewidth=5,
        alpha=0.7
    )

ax.legend()
plt.show()

### Interpret Findings

Based on the metrics as well as the graphs, which model do you think is the best? Why?

<details>
    <summary style="cursor: pointer"><b>Answer (click to reveal)</b></summary>
    
The quadratic model (polynomial degree 2) is definitely not the best based on all of the evidence we have. It has the worst R-Squared, the coefficient p-values are not significant, and you can see from the graph that there is a lot of variance in the data that it is not picking up on.

Visual inspection agrees with the metrics: the 4th degree polynomial has a lower R-Squared than the 8th degree polynomial, and its curve is flatter and does not capture the extremes of the data as well.
    
However, if we wanted to interpret the coefficients, then only the 4th degree polynomial has statistically significant results. The interpretation would be challenging because of the number of terms, but we could apply some calculus techniques to describe inflection points.

Overall it appears that this dataset is not particularly well suited to an inferential linear regression approach, even with polynomial transformations. So the "best" model could be either the 4th or 8th degree polynomial depending on which aspect of the model is more important to you, but either way it will be challenging to translate it into insights for stakeholders.
    
</details>
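The calculus idea mentioned in the answer can be sketched with NumPy's `Polynomial` class: candidate inflection points are the real roots of the second derivative. The coefficients below are hypothetical, not the fitted ones; with a fitted model you would pass in its parameters instead, assuming they are ordered from the constant term upward. `real_roots` is a helper written only for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical coefficients in increasing order of power: y = -6x^2 + x^4
quartic = Polynomial([0.0, 0.0, -6.0, 0.0, 1.0])


def real_roots(poly, tol=1e-9):
    """Return the sorted real roots of poly, dropping any complex ones."""
    roots = np.atleast_1d(poly.roots())
    return np.sort(roots[np.abs(roots.imag) < tol].real)


# Turning points: where the first derivative (4x^3 - 12x) is zero
turning_points = real_roots(quartic.deriv(1))

# Candidate inflection points: where the second derivative (12x^2 - 12) is zero
inflection_points = real_roots(quartic.deriv(2))

print("turning points:", turning_points)
print("inflection points:", inflection_points)
```

A root of the second derivative is only an inflection point if the curvature actually changes sign there, so it is worth checking the sign on either side before describing it to stakeholders.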

## Summary

Great job! You now know how to include polynomial terms in your linear models, as well as the limitations of polynomial regression.