Nonlinear features let us turn simple linear models into powerful tools that can learn curved decision boundaries and complex patterns, without immediately jumping to “heavy” models like deep nets. [inria.github](https://inria.github.io/scikit-learn-mooc/python_scripts/linear_models_feature_engineering_classification.html)

### 1. From linear to nonlinear: the core idea

#### 1.1 Linear models and straight boundaries

A basic linear classifier (like logistic regression) predicts using a score of the form  
$\text{score}(x) = w_0 + w_1 x_1 + \dots + w_d x_d$. 

The decision boundary where the model flips class is given by setting this score to 0, which produces a line in 2D, a plane in 3D, or a hyperplane in higher dimensions. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2022-extra/modules/machine-learning/non-linear-features.pdf)

This works well when classes can be separated by such a straight surface, but fails on patterns like:

- Circles or rings around a center.
- XOR patterns, where diagonally opposite corners share a class.
- Curved clusters that bend around each other. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2022-extra/modules/machine-learning/non-linear-features.pdf)

In these cases, no single straight line (or hyperplane) can separate the data well.
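
The XOR case can be checked directly. The following is a minimal sketch, assuming a four-point toy dataset (not part of the wine example) and scikit-learn's default `LogisticRegression` settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The four XOR corners: diagonally opposite corners share a label.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

clf = LogisticRegression().fit(X, y)
accuracy = clf.score(X, y)

# A single straight line can get at most 3 of the 4 corners right,
# so the training accuracy can never reach 1.0 here.
print("Training accuracy on XOR:", accuracy)
```

Whatever the fitted weights turn out to be, a linear boundary cannot classify all four corners correctly, which is why the accuracy is capped at 0.75.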

#### 1.2 Nonlinear features: bending the space

The trick is: instead of changing the model, change the **features**.  
You build new features from the original ones using nonlinear functions, for example:

- Polynomial terms: $x_1^2$, $x_2^2$, $x_1 x_2$. [inria.github](https://inria.github.io/scikit-learn-mooc/python_scripts/linear_models_feature_engineering_classification.html)
- Higher powers: $x_1^3$, $x_1^2 x_2$, etc. [data36](https://data36.com/polynomial-regression-python-scikit-learn/)
- Other transforms: $\log(x)$, $\exp(x)$, $\sin(x)$, $\cos(x)$. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2020-extra/modules/machine-learning/non-linear-features.pdf)

The model is still linear in these new features, but since the features themselves are nonlinear in the original inputs, the decision boundary in the original space becomes curved. [data36](https://data36.com/polynomial-regression-python-scikit-learn/)

Formally: you define a feature map $\phi(x)$ (vector of nonlinear features) and apply a linear model to $\phi(x)$ instead of $x$. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2022-extra/modules/machine-learning/non-linear-features.pdf)
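
As an illustration of a feature map, here is a small sketch on synthetic data. The function name `phi`, the unit-circle labels, and the hand-picked weights are all assumptions made for the example; nothing here is learned from data:

```python
import numpy as np

def phi(X):
    """Illustrative feature map: the two original coordinates plus their squares."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2, X[:, 1] ** 2])

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
# Label 1 inside the unit circle, 0 outside: not linearly separable in (x1, x2).
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

# Hand-picked weights (not learned): score = 1 - x1^2 - x2^2
w0 = 1.0
w = np.array([0.0, 0.0, -1.0, -1.0])

scores = w0 + phi(X) @ w          # linear in phi(X)
y_pred = (scores > 0).astype(int)  # circular boundary in the original space
print("Agreement with true labels:", (y_pred == y).mean())
```

The score is a plain weighted sum of the entries of `phi(X)`, yet the region where it is positive is the inside of a circle in the original coordinates.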

### 2. The wine example: understanding the geometry

#### 2.1 Dataset and setup

The example uses the classic scikit-learn wine dataset: 178 wines, each with 13 numeric attributes (alcohol, malic acid, magnesium, total phenols, flavonoids, color intensity, hue, etc.), and a class label 0, 1, or 2 for three wine types. [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)

For visualization, pick only:

- $x_0$: total_phenols  
- $x_1$: color_intensity  

and plot the wines in 2D with color-coded classes. [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)

From the scatterplot:

- Class 1: low color_intensity (bottom of the plot).
- Class 2: higher color_intensity but low total_phenols.
- Class 0: higher color_intensity and higher total_phenols than class 2. [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)

So each class occupies a different region, and these regions are clearly not separated by a single straight line.
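
One rough way to check these regions without a plot is to compare per-class averages of the two features. This is only a sketch; the variable names `df` and `class_means` are chosen for illustration:

```python
import pandas as pd
from sklearn.datasets import load_wine

data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Mean of the two plotted features within each wine class
class_means = df.groupby("target")[["total_phenols", "color_intensity"]].mean()
print(class_means)
```

Class means only summarize where each cluster is centered; they say nothing about overlap, so the scatterplot remains the better diagnostic.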

#### 2.2 Trying existing models

Compare:

- A decision tree of depth 2.
- k-nearest neighbors with k=5.
- Multinomial logistic regression (a linear model in the original features). [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)

Observations:

- **Decision tree (depth=2)**  
  - Splits the plane with a small number of axis-aligned cuts:  
    “If color_intensity < threshold ⇒ class 1; else if total_phenols < threshold ⇒ class 2; else ⇒ class 0.”  
  - Matches human intuition and is easy to explain. [emagine](https://www.emagine.org/blogs/example-of-non-linear-machine-learning-algorithms-decision-trees/)
  - If you increase depth (say to 10), you get many tiny, irregular rectangles and overfitting. [emagine](https://www.emagine.org/blogs/example-of-non-linear-machine-learning-algorithms-decision-trees/)

- **k-nearest neighbors (k=5)**  
  - Decision boundary “wraps” around points, making many tight turns to classify training examples correctly. [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)
  - High training accuracy but very wiggly boundaries → high variance, likely overfitting. [willett.psd.uchicago](https://willett.psd.uchicago.edu/research/nonlinear-models-in-machine-learning/)

- **Multinomial logistic regression (linear features only)**  
  - Very stable; main hyperparameter is regularization strength, but here changing it doesn’t change the boundary much. [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)
  - Boundaries are almost straight lines, so the model underfits the curved structure and misclassifies many points. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2020-extra/modules/machine-learning/non-linear-features.pdf)
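
The comparison above could be reproduced roughly as follows. This is a sketch, assuming 5-fold cross-validation as the evaluation scheme and standardized inputs for the two distance- and gradient-based models; the source does not specify either choice:

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
X = X[["total_phenols", "color_intensity"]]

models = {
    "tree (depth=2)": DecisionTreeClassifier(max_depth=2, random_state=0),
    "kNN (k=5)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Mean 5-fold cross-validated accuracy for each model
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validated accuracy measures generalization, so it will not show the near-perfect training fit that makes kNN boundaries look so wiggly in the plot.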

This motivates nonlinear features: we want something as stable and well-behaved as logistic regression, but flexible enough to bend the boundaries.

### 3. Building nonlinear features in the wine example

#### 3.1 From two linear features to five nonlinear ones

We start with the two original features:

- $x_0$: total_phenols  
- $x_1$: color_intensity. [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)

They are **linear** features because they enter the model only to the first power. We then construct three **quadratic** features:

- $\phi_0 = x_0 x_1$ (interaction term).
- $\phi_1 = x_0^2$.
- $\phi_2 = x_1^2$. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2020-extra/modules/machine-learning/non-linear-features.pdf)

The new feature vector is:

$$
\phi(x) = [x_0, x_1, x_0 x_1, x_0^2, x_1^2].
$$

We build a new DataFrame $X$ containing these five columns and keep the class label as the target. [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)

#### 3.2 Effect on the decision boundary

Training multinomial logistic regression on these five features gives decision boundaries that are no longer straight lines in the original 2D plot. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2020-extra/modules/machine-learning/non-linear-features.pdf)

Instead, they become smooth curves that can wrap more closely around the three clusters, which can improve classification accuracy over the purely linear case. [inria.github](https://inria.github.io/scikit-learn-mooc/python_scripts/linear_models_feature_engineering_classification.html)

Mathematically, the model is still linear in $\phi(x)$ (it’s logistic regression), but the mapping $x \mapsto \phi(x)$ is nonlinear, so the boundary in the original $x_0$-$x_1$ space is curved.
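
To make the curvature explicit, here is a short worked equation. The per-class weight notation $w_{c,j}$ is an assumption introduced for this sketch. Suppose each class $c$ has a score

$$
s_c(x) = w_{c,0} + w_{c,1} x_0 + w_{c,2} x_1 + w_{c,3} x_0 x_1 + w_{c,4} x_0^2 + w_{c,5} x_1^2 .
$$

The boundary between two classes $a$ and $b$ is the set of points where $s_a(x) = s_b(x)$. Writing $\Delta_j = w_{a,j} - w_{b,j}$, this is

$$
\Delta_0 + \Delta_1 x_0 + \Delta_2 x_1 + \Delta_3 x_0 x_1 + \Delta_4 x_0^2 + \Delta_5 x_1^2 = 0 ,
$$

a quadratic equation in $(x_0, x_1)$. Its solution set is a conic section (an ellipse, parabola, or hyperbola), and it reduces to a straight line only when $\Delta_3 = \Delta_4 = \Delta_5 = 0$.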

### 4. Coding it in Python (scikit-learn)

#### 4.1 Loading and preparing the wine data

Below is a beginner-friendly implementation that mirrors the transcript’s logic using scikit-learn’s wine dataset. [stackoverflow](https://stackoverflow.com/questions/55937244/how-to-implement-polynomial-logistic-regression-in-scikit-learn)

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# 1. Load the wine dataset
data = load_wine()
X_all = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target  # 0, 1, or 2

# 2. Keep only two features: total_phenols and color_intensity
X_two = X_all[["total_phenols", "color_intensity"]]

# 3. Train/test split for evaluation (stratified so class proportions are preserved)
X_train, X_test, y_train, y_test = train_test_split(
    X_two, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.head())
print(y_train[:5])
```

```text
     total_phenols  color_intensity
12            2.60             5.60
30            3.00             5.70
36            2.60             4.60
31            2.86             6.90
120           2.90             3.25
[0 0 0 0 1]
```

This replicates the “reduced table” with two features plus the class column. [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)

#### 4.2 Plain multinomial logistic regression (linear features)
This gives the “straight boundary” version. [stat.cmu](https://www.stat.cmu.edu/~cshalizi/dm/20/lectures/08/lecture-08.html)

```python
# Standardize the inputs, then fit logistic regression on the two raw features.
# With lbfgs, multiclass targets are fit as a multinomial (softmax) model by default.
linear_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(
        solver="lbfgs",
        max_iter=1000
    )),
])

linear_clf.fit(X_train, y_train)

y_pred_lin = linear_clf.predict(X_test)
print("Linear logistic regression accuracy:", accuracy_score(y_test, y_pred_lin))
```

```text
Linear logistic regression accuracy: 0.8703703703703703
```

#### 4.3 Adding quadratic nonlinear features

You can manually create the three quadratic features:

```python
# Add the three quadratic columns to the training set
X_train_nl = X_train.copy()
X_train_nl["phi0"] = X_train_nl["total_phenols"] * X_train_nl["color_intensity"]
X_train_nl["phi1"] = X_train_nl["total_phenols"] ** 2
X_train_nl["phi2"] = X_train_nl["color_intensity"] ** 2

# Apply exactly the same transformation to the test set
X_test_nl = X_test.copy()
X_test_nl["phi0"] = X_test_nl["total_phenols"] * X_test_nl["color_intensity"]
X_test_nl["phi1"] = X_test_nl["total_phenols"] ** 2
X_test_nl["phi2"] = X_test_nl["color_intensity"] ** 2

# Same model as before; only the input columns have changed.
# With lbfgs, multiclass targets are fit as a multinomial (softmax) model by default.
nonlinear_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(
        solver="lbfgs",
        max_iter=1000
    )),
])

nonlinear_clf.fit(X_train_nl, y_train)

y_pred_nl = nonlinear_clf.predict(X_test_nl)
print("Logistic regression with quadratic features accuracy:", accuracy_score(y_test, y_pred_nl))
```

```text
Logistic regression with quadratic features accuracy: 0.8703703703703703
```

Or, more generally, use `PolynomialFeatures` to automatically generate all polynomial terms up to a chosen degree: [stackoverflow](https://stackoverflow.com/questions/55937244/how-to-implement-polynomial-logistic-regression-in-scikit-learn)

```python
# PolynomialFeatures runs inside the pipeline, so train and test data
# are expanded in exactly the same way.
# With lbfgs, multiclass targets are fit as a multinomial (softmax) model by default.
poly_clf = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression(
        solver="lbfgs",
        max_iter=1000
    )),
])

poly_clf.fit(X_train, y_train)
y_pred_poly = poly_clf.predict(X_test)
print("Logistic regression with PolynomialFeatures (degree=2):",
      accuracy_score(y_test, y_pred_poly))
```

```text
Logistic regression with PolynomialFeatures (degree=2): 0.8703703703703703
```

With `include_bias=False`, `PolynomialFeatures(degree=2)` generates exactly the original features, their squares, and their pairwise products (interaction terms), so it produces the same five columns as the hand-built $x_0$, $x_1$, $\phi_0$, $\phi_1$, $\phi_2$. Without that flag, it also prepends a constant column of ones. [stackoverflow](https://stackoverflow.com/questions/55937244/how-to-implement-polynomial-logistic-regression-in-scikit-learn)

### 5. Exploring richer nonlinear feature families

The transcript mentions several families of nonlinear features that can be tried in exactly the same way: add columns, then fit logistic regression. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2020-extra/modules/machine-learning/non-linear-features.pdf)

#### 5.1 Higher-degree polynomials

You can go beyond degree 2:

- Degree 3: includes terms like $x_0^3$, $x_0^2 x_1$, $x_0 x_1^2$, $x_1^3$. [data36](https://data36.com/polynomial-regression-python-scikit-learn/)
- Higher degrees: more complex shapes but also higher risk of overfitting. [data36](https://data36.com/polynomial-regression-python-scikit-learn/)

Code example: [inria.github](https://inria.github.io/scikit-learn-mooc/python_scripts/linear_models_feature_engineering_classification.html)
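
A minimal sketch of how different degrees might be compared on the two wine features, assuming 5-fold cross-validation as the yardstick (the choice of degrees 1 to 3 and the variable names are illustrative):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)
X = X[["total_phenols", "color_intensity"]]

degree_scores = {}
for degree in (1, 2, 3):
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LogisticRegression(max_iter=5000),
    )
    # Mean 5-fold cross-validated accuracy for this degree
    degree_scores[degree] = cross_val_score(model, X, y, cv=5).mean()

for degree, score in degree_scores.items():
    print(f"degree={degree}: mean CV accuracy = {score:.3f}")
```

If accuracy stops improving (or drops) as the degree grows, that is a sign the extra terms are fitting noise rather than structure.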

#### 5.2 Exponential and log features

For some problems, exponential or logarithmic growth fits domain knowledge.
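
As a sketch of how such features could be added, the helper below appends $\log(x)$ and $\exp(-x)$ columns. The helper name `add_log_exp` and the choice of $\exp(-x)$ rather than $\exp(x)$ are assumptions for illustration, and the log requires strictly positive inputs (true for the two wine features used here):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_log_exp(X):
    """Illustrative helper: append log(x) and exp(-x) for every column.

    Assumes all inputs are strictly positive so that the log is defined.
    """
    X = np.asarray(X, dtype=float)
    return np.hstack([X, np.log(X), np.exp(-X)])

X, y = load_wine(return_X_y=True, as_frame=True)
X = X[["total_phenols", "color_intensity"]]

model = make_pipeline(
    FunctionTransformer(add_log_exp),   # 2 columns -> 6 columns
    StandardScaler(),
    LogisticRegression(max_iter=5000),
)
log_exp_score = cross_val_score(model, X, y, cv=5).mean()
print(f"Mean CV accuracy with log/exp features: {log_exp_score:.3f}")
```

For features that can be zero or negative, a shifted variant such as `np.log1p` on non-negative data would be needed instead.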

#### 5.3 Trigonometric and Fourier features

Trigonometric features are useful when you suspect periodicity or want very flexible wavy boundaries. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2022-extra/modules/machine-learning/non-linear-features.pdf)

Fourier features with multiple frequencies produce “wavy” decision boundaries. [stanford-cs221.github](https://stanford-cs221.github.io/autumn2020-extra/modules/machine-learning/non-linear-features.pdf)
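
A small sketch of Fourier-style features, assuming integer frequencies 1 to 3 applied to standardized inputs. The helper name `fourier_features` and the frequency set are illustrative choices, not taken from the source:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def fourier_features(X, frequencies=(1, 2, 3)):
    """Illustrative helper: keep X and append sin(k*x), cos(k*x) for each frequency k."""
    X = np.asarray(X, dtype=float)
    blocks = [X]
    for k in frequencies:
        blocks.append(np.sin(k * X))
        blocks.append(np.cos(k * X))
    return np.hstack(blocks)

X, y = load_wine(return_X_y=True, as_frame=True)
X = X[["total_phenols", "color_intensity"]]

model = make_pipeline(
    StandardScaler(),                       # scale first so frequencies are comparable
    FunctionTransformer(fourier_features),  # 2 columns -> 2 + 2*3*2 = 14 columns
    LogisticRegression(max_iter=5000),
)
fourier_score = cross_val_score(model, X, y, cv=5).mean()
print(f"Mean CV accuracy with Fourier features: {fourier_score:.3f}")
```

Higher frequencies allow tighter wiggles in the boundary, so the frequency set plays a role similar to the polynomial degree.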

### 6. How do we choose which nonlinear features?

#### 6.1 Domain knowledge

If you know something about your problem, use it:

- Biology/chemistry → saturating or sigmoidal relationships, logs, and exponentials may make sense.
- Physics → polynomials, inverse-square laws, or specific formulas.
- Periodic phenomena → sinusoids and Fourier features. [aml4td](https://aml4td.org/chapters/interactions-nonlinear.html)

This reduces the search space to functions that reflect real-world behavior.

#### 6.2 Regularization and feature selection

You can also start with many candidate nonlinear features and let regularization prune them. [aml4td](https://aml4td.org/chapters/interactions-nonlinear.html)

- L2 regularization (ridge): shrinks all coefficients but rarely sets them exactly to zero. Helps control overfitting but doesn’t explicitly drop features. [aml4td](https://aml4td.org/chapters/interactions-nonlinear.html)
- L1 regularization (LASSO): encourages many coefficients to be exactly zero, effectively selecting a subset of features and ranking them by importance. [aml4td](https://aml4td.org/chapters/interactions-nonlinear.html)

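A practical workflow is to use LASSO to **rank features by importance** and then trim them by increasing the regularization weight: the features whose coefficients survive the strongest penalty are the most important. [aml4td](https://aml4td.org/chapters/interactions-nonlinear.html)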
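
The pruning effect can be sketched with L1-penalized logistic regression on degree-3 polynomial features. The `saga` solver, the grid of `C` values, and the "nonzero for any class" counting rule are assumptions of this example; note that in scikit-learn a smaller `C` means a stronger penalty:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_wine(return_X_y=True, as_frame=True)
X = X[["total_phenols", "color_intensity"]]

nonzero_counts = {}
for C in (0.01, 0.1, 1.0, 10.0):
    model = make_pipeline(
        PolynomialFeatures(degree=3, include_bias=False),  # 9 candidate features
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=10000),
    )
    model.fit(X, y)
    coef = model[-1].coef_  # shape: (n_classes, n_features)
    # Count a feature as kept if at least one class gives it a nonzero weight
    nonzero_counts[C] = int(np.any(coef != 0, axis=0).sum())

for C, count in nonzero_counts.items():
    print(f"C={C}: {count} of 9 features kept")
```

Sweeping `C` from large to small and noting the order in which features drop out gives a rough importance ranking.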

### 7. Where SVMs with kernels fit into this story

- Instead of explicitly constructing nonlinear features $\phi(x)$, use a **kernel function** $k(x, x')$ that computes inner products in an implicit high-dimensional feature space. [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/linear-vs-non-linear-classification-analyzing-differences-using-the-kernel-trick/)
- You solve for the SVM in terms of these kernel evaluations; the feature mapping can even be infinite-dimensional, but you never explicitly construct it. [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/linear-vs-non-linear-classification-analyzing-differences-using-the-kernel-trick/)
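
The equivalence between a kernel and an explicit feature map can be verified numerically in a small case. This sketch uses the degree-2 polynomial kernel $(x^\top x' + 1)^2$ in two dimensions; the function names and the two test points are made up for the example:

```python
import numpy as np

def phi_quadratic(x):
    """Explicit 6-dimensional feature map matching (x . z + 1)^2 for 2D inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel with gamma = 1 and r = 1."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi_quadratic(x), phi_quadratic(z))  # inner product in feature space
implicit = poly_kernel(x, z)                           # same value, no features built
print(explicit, implicit)  # both are approximately 4.0
```

The kernel needs one dot product in 2D, while the explicit route builds six features per point; the gap widens quickly with the degree and input dimension.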

Common kernels:

- Polynomial kernel: $k(x, x') = (\gamma x^\top x' + r)^d$. Captures polynomial interactions up to degree $d$. [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/linear-vs-non-linear-classification-analyzing-differences-using-the-kernel-trick/)
- RBF/Gaussian kernel: $k(x, x') = \exp(-\gamma \|x - x'\|^2)$. Gives very flexible, smooth, local decision boundaries. [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/linear-vs-non-linear-classification-analyzing-differences-using-the-kernel-trick/)
- Sigmoid kernel: related to neural networks; used less commonly in practice. [geeksforgeeks](https://www.geeksforgeeks.org/machine-learning/linear-vs-non-linear-classification-analyzing-differences-using-the-kernel-trick/)

Selecting kernels and their hyperparameters is usually done via cross-validation, similar to selecting polynomial degrees and regularization strengths. [inria.github](https://inria.github.io/scikit-learn-mooc/python_scripts/linear_models_feature_engineering_classification.html)
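
A sketch of such a search with scikit-learn's `GridSearchCV` on the two wine features. The parameter grid below is an arbitrary illustration, not a recommended default:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True, as_frame=True)
X = X[["total_phenols", "color_intensity"]]

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# One sub-grid per kernel, since each kernel has its own hyperparameters
param_grid = [
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1, 1]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10],
     "svc__degree": [2, 3], "svc__coef0": [0, 1]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```

Here `coef0` corresponds to $r$ and `gamma` to $\gamma$ in the kernel formulas above.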

### 8. Where “nonlinear feature selection” fits

**Filter-based nonlinear feature selection**:

- Each feature is scored individually using some measure of association with the target (e.g., correlation coefficient, mutual information, dependence measure). [aml4td](https://aml4td.org/chapters/interactions-nonlinear.html)
- Features are ranked by this score, and those below a threshold are discarded. [aml4td](https://aml4td.org/chapters/interactions-nonlinear.html)

These methods:

- Are cheap and scale well to very high dimensions because they look at one feature at a time. [aml4td](https://aml4td.org/chapters/interactions-nonlinear.html)
- Help address multicollinearity by removing redundant or weak features, but they **ignore interactions**, so they may miss features that are only useful in combination. [aml4td](https://aml4td.org/chapters/interactions-nonlinear.html)
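
A filter of this kind can be sketched with mutual information on the full wine feature set. The 0.3 cutoff is an arbitrary value chosen for illustration; in practice the threshold would be tuned:

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif

X, y = load_wine(return_X_y=True, as_frame=True)

# Score each feature individually against the class label
mi = mutual_info_classif(X, y, random_state=0)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(mi_scores)

# Keep only the features whose score clears the (arbitrary) threshold
threshold = 0.3
selected = mi_scores[mi_scores >= threshold].index.tolist()
print("Selected features:", selected)
```

Because every feature is scored in isolation, two features that are informative only jointly would both receive low scores here.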

In contrast, embedded methods (like LASSO or tree-based models) consider the model structure and can capture interactions, often giving better predictive performance, though at higher computational cost. [emagine](https://www.emagine.org/blogs/example-of-non-linear-machine-learning-algorithms-decision-trees/)