# Video: Adding Polynomial Features

This video shows how to add polynomial features to allow polynomial regression using a linear model.

In [None]:
import pandas as pd

In [None]:
abalone = pd.read_csv("https://raw.githubusercontent.com/bu-cds-omds/dx602-examples/main/data/abalone.tsv", sep="\t")
abalone

Script:
* A common example of feature engineering is to combine two existing features to highlight their interactions.
* For example, you can multiply two columns to get a feature that has particularly large values when both of the original values are large.
* We can generalize this idea to polynomial features, where each new feature is a product of powers of the original features.
* With this context, adding a single polynomial feature to a data set is easy.
* You just calculate it from the existing columns and add it as a new column.

In [None]:
abalone["rough_volume"] = abalone["Length"] * abalone["Diameter"] * abalone["Height"]

Script:
* In this case, I added a degree-three term that is the product of three of the original columns.
* In general, there are two likely cases where you will add polynomial features.
* In the first, there are one or more combinations of variables and exponents that you think will be particularly helpful to your modeling problem.
* This will generally be driven by problem-specific concerns and intuitions.
* In this first case, you just add the specific terms that you think will be helpful.
* In the second case, you are adding all the combinations of variables up to some maximum degree.
* In this case, you loop from degree 2 up to your maximum degree, generate every combination of columns and exponents at that degree, and add each one as a new column.
* That loop starts from degree 2 because degree 1 is the original columns.
* And you will want to implement this programmatically, not one at a time like the previous example, since the number of new columns will grow very quickly with the maximum degree.
* Scikit-learn has a class to handle this too, so let's try it out now.
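
For comparison, the manual loop described above might look like the sketch below. The toy DataFrame and the helper `add_polynomial_columns` are hypothetical names for illustration, and `itertools.combinations_with_replacement` is one way to enumerate the column and exponent combinations, not necessarily how scikit-learn does it internally.

```python
from itertools import combinations_with_replacement

import pandas as pd

# Hypothetical toy data standing in for the numeric abalone columns.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})


def add_polynomial_columns(data, max_degree):
    """Return a copy of data with every product of columns from degree 2 up to max_degree."""
    base_columns = list(data.columns)
    result = data.copy()
    # Degree 1 is the original columns, so the loop starts at 2.
    for degree in range(2, max_degree + 1):
        # Choosing columns with replacement covers powers (a*a) as well as interactions (a*b).
        for combo in combinations_with_replacement(base_columns, degree):
            product = data[combo[0]].copy()
            for name in combo[1:]:
                product = product * data[name]
            result["*".join(combo)] = product
    return result


expanded = add_polynomial_columns(df, max_degree=3)
print(list(expanded.columns))
```

Even with two input columns and a maximum degree of three, the frame grows from 2 to 9 columns, which is why doing this by hand does not scale.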

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
pf = PolynomialFeatures(degree=2)
pf.fit(abalone.drop("Sex", axis=1))

Script:
* As with all of these numeric transforms, the data should first be filtered to just the numeric columns.

In [None]:
pf.transform(abalone.drop("Sex", axis=1))

Script:
* Running the transform for polynomial features will give you a lot more columns.

In [None]:
pf.transform(abalone.drop("Sex", axis=1)).shape

In [None]:
abalone.shape
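
The expected width of the transformed array can be worked out before running the transform. As a sketch, assuming scikit-learn's default behavior of keeping every term up to the chosen degree plus one constant column, the count for n input columns and maximum degree d is the binomial coefficient C(n + d, d). The helper name `polynomial_feature_count` below is hypothetical, not part of scikit-learn.

```python
from math import comb


def polynomial_feature_count(n_features, degree, include_bias=True):
    """Count the monomials of total degree <= degree in n_features variables."""
    total = comb(n_features + degree, degree)
    # Dropping the constant (bias) column removes exactly one term.
    return total if include_bias else total - 1


print(polynomial_feature_count(9, 2))
print(polynomial_feature_count(9, 3))
```

Comparing the degree 2 and degree 3 counts shows how quickly the number of columns grows with the maximum degree.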

Script:
* So from 9 numeric input columns, a total of 55 columns were created: one constant column, the 9 original columns, and 45 degree-two terms.
* Scikit-learn made this easy.
* Just create the features object, fit it, and transform the data.
* Figuring out when to use these features is the trickier part.
* If we just add lots of polynomial features using the highest degrees that we can handle, we will be prone to overfitting.
* If we keep the maximum degree low, the added features are more likely to help than to hurt.
* Bear in mind that "low" is contextual, and you should check that the data has more rows than the final number of columns.
* These features will work best if you can select them carefully based on the problem.
* For example, you might add a quadratic time term when modeling an object in free fall.
* If you do not have a specific intuition, trying low-degree polynomials is OK, but make sure to validate your work.
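
One way to validate is to compare cross-validated scores with and without the polynomial features. The sketch below uses synthetic free-fall data, which is an assumption made purely for illustration (distance = 0.5 * g * t^2 plus noise), and compares a plain linear model against a pipeline that adds degree-two features first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic free-fall data (assumed for illustration, not from the abalone set).
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 3.0, size=200)
distance = 0.5 * 9.81 * t**2 + rng.normal(0.0, 0.5, size=200)
X = t.reshape(-1, 1)

linear_model = LinearRegression()
# include_bias=False because LinearRegression already fits an intercept.
quadratic_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

# The default score for regressors is R^2; higher is better.
linear_r2 = cross_val_score(linear_model, X, distance, cv=5).mean()
quadratic_r2 = cross_val_score(quadratic_model, X, distance, cv=5).mean()
print(f"linear R^2: {linear_r2:.3f}, quadratic R^2: {quadratic_r2:.3f}")
```

Putting the feature expansion inside the pipeline means it is refit on each training fold, so the held-out fold never influences the transform.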