### What are nonlinear features?

Nonlinear features are input variables whose relationship with the target cannot be captured well by a straight line (or linear decision boundary). Instead, the relationship follows curves, thresholds, interactions, or more complex patterns in the feature space.

Examples include:

- The target changes only after a feature crosses a certain threshold.
- The effect of a feature increases or decreases at different rates (e.g., quadratic or exponential growth).
- The outcome depends on combinations or interactions of features (e.g., XOR patterns).

Recognizing such **nonlinear** structure is important because purely linear models will underfit in these settings.
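
To make the underfitting concrete, here is a minimal sketch on synthetic XOR-style data. The data, the variable names, and the choice of scikit-learn's `LogisticRegression` are assumptions made for illustration only. It compares a linear model on the raw features with the same model after adding the interaction term `x1 * x2`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic XOR-style data: the label is 1 when the two features have opposite signs.
X = rng.uniform(-1.0, 1.0, size=(1000, 2))
y = (X[:, 0] * X[:, 1] < 0).astype(int)

# A plain linear model sees only x1 and x2, so no single line separates the classes.
linear_acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

# Adding the interaction term x1 * x2 makes the classes linearly separable.
X_inter = np.column_stack([X, X[:, 0] * X[:, 1]])
inter_acc = LogisticRegression(C=100.0, max_iter=1000).fit(X_inter, y).score(X_inter, y)

# Training accuracy is reported on purpose: underfitting shows up even on the data the model was fit to.
print(f"raw features:          {linear_acc:.2f}")
print(f"with interaction term: {inter_acc:.2f}")
```

The raw-feature model should land near chance level, while the model with the interaction term should classify almost every point correctly.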

### Why they matter for classification

When features and labels are related nonlinearly, linear classifiers (like plain logistic regression or linear SVM) cannot find a single hyperplane that separates the classes well. To model these patterns, we either:

- Transform the features into a richer, nonlinear feature space and apply a linear model there.
- Use inherently nonlinear models that can learn curved or piecewise (for example, axis-aligned stepwise) decision boundaries directly.

This choice strongly affects predictive performance, interpretability, and computational cost.
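
The two routes can be sketched side by side. This example assumes scikit-learn and a synthetic concentric-circles dataset; the dataset, the degree-2 expansion, and the tree depth are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

# Two concentric rings: no straight line separates the inner class from the outer one.
X, y = make_circles(n_samples=600, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    # Baseline: linear model on the raw features.
    "linear baseline": LogisticRegression(),
    # Route 1: expand the features (x1^2, x1*x2, x2^2, ...) and stay linear.
    "degree-2 features + linear": make_pipeline(
        PolynomialFeatures(degree=2), LogisticRegression()
    ),
    # Route 2: use a model that is nonlinear by construction.
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:28s} test accuracy = {score:.2f}")
```

Both routes should clearly beat the linear baseline here. The feature-expansion route keeps a linear model whose coefficients can still be inspected, while the tree needs no manual feature design.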

### Common nonlinear classification algorithms

Several supervised learning methods are well suited to nonlinear feature–label relationships:

- Decision trees  
  - Recursively split the feature space into regions using axis-aligned thresholds.  
  - Naturally capture stepwise, threshold-based, and interaction effects.

- Random forests  
  - Ensembles of many decision trees trained on bootstrapped samples and feature subsets.  
  - Aggregate the predictions of many trees (by voting or averaging class probabilities) to reduce variance and improve generalization.

- Gradient boosting machines (GBMs, e.g., XGBoost, LightGBM, CatBoost)  
  - Build trees sequentially, where each new tree corrects the residual errors of the previous ones.  
  - Very flexible, often state-of-the-art for tabular nonlinear data.

- Support Vector Machines (SVMs) with nonlinear kernels  
  - Use a kernel function to implicitly map inputs into a high-dimensional feature space.  
  - Learn a linear separator in that space, which corresponds to a nonlinear boundary in the original space.

- Neural networks (e.g., multilayer perceptrons, deep networks)  
  - Stack layers of linear transformations and nonlinear activations.  
  - Can approximate highly complex nonlinear functions given enough data, capacity, and regularization.

- Gaussian processes (GPs)  
  - Define a distribution over functions, specified by a mean and covariance (kernel) function.  
  - With appropriate kernels, can flexibly model smooth nonlinear relationships and provide uncertainty estimates.

These methods differ in their bias–variance characteristics, scalability, interpretability, and typical application domains.
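
As a rough illustration, the sketch below cross-validates one representative of each family on a small synthetic two-moons dataset. It assumes scikit-learn; `GradientBoostingClassifier` stands in for the GBM libraries named above, and every hyperparameter is an arbitrary illustrative default rather than a tuned value.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Two interleaving half-moons: a standard toy problem with a curved class boundary.
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    # Scale-sensitive models are wrapped in a pipeline with standardization.
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "neural network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
    ),
    "Gaussian process": GaussianProcessClassifier(random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in cv_scores.items():
    print(f"{name:18s} {score:.3f}")
```

On a toy problem this small the scores will be close together; the differences in scalability and interpretability only become visible on larger, messier data.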

### Role of feature engineering and regularization

Even nonlinear models benefit from careful feature design:

- Feature engineering  
  - Create polynomial terms, interaction terms, or domain-specific transformations (e.g., log, ratios, cyclic encodings).  
  - Use learned representations (e.g., embeddings) for complex inputs like text or categorical variables.

- Regularization  
  - Penalize or constrain model complexity (e.g., weight decay, max depth, dropout, early stopping) to combat overfitting.  
  - Choose kernel parameters, tree depth, learning rates, and network width via cross-validation.
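
Two of the transformations from the feature-engineering bullet, a cyclic encoding and a log transform, can be sketched with NumPy. The helper `cyclic_encode` and the sample values are hypothetical and exist only for this example.

```python
import numpy as np

def cyclic_encode(values, period):
    """Map a periodic quantity (e.g., hour of day) onto the unit circle."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

# Hypothetical hour-of-day values.
hours = np.array([0, 6, 12, 23])
encoded = cyclic_encode(hours, period=24)

# On the raw scale 23:00 and 00:00 are 23 units apart; after encoding they are neighbors.
gap_23_to_0 = np.linalg.norm(encoded[3] - encoded[0])
gap_12_to_0 = np.linalg.norm(encoded[2] - encoded[0])
print(f"23h vs 0h: {gap_23_to_0:.3f}   12h vs 0h: {gap_12_to_0:.3f}")

# A log transform compresses heavy-tailed, non-negative features such as counts or prices.
amounts = np.array([1.0, 10.0, 100.0, 10_000.0])
log_amounts = np.log1p(amounts)  # log(1 + x), defined at x = 0
print(log_amounts.round(2))
```

The encoded columns can be appended to the feature matrix in place of the raw periodic value, so that any downstream model sees the wrap-around structure.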

Balancing expressive power with appropriate regularization is central to successful nonlinear classification.
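
The trade-off can be seen in a small experiment: an unconstrained decision tree versus a depth-limited one on noisy synthetic data. This is a sketch assuming scikit-learn; the dataset, noise level, and depth of 3 are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-moons data, so a fully grown tree has plenty of noise to memorize.
X, y = make_moons(n_samples=500, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

results = {}
for depth in (None, 3):  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.2f}  test={test_acc:.2f}")
```

The unconstrained tree fits the training set perfectly but shows a wider gap between training and test accuracy, which is the signature of overfitting that regularization is meant to reduce.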

### Emphasis of this module: SVMs with nonlinear kernels

This module will focus on using SVMs to handle nonlinear features through kernels. The key ideas are:

- Map inputs into a high-dimensional feature space implicitly via a **kernel function** $k(x, x')$.
- Learn a maximum-margin hyperplane in that feature space.
- Interpret the resulting classifier as a nonlinear decision boundary in the original input space.
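
The word "implicitly" in the first idea can be checked numerically. The sketch below uses the homogeneous degree-2 polynomial kernel $k(x, x') = (x \cdot x')^2$ on 2-D inputs, chosen here only because its explicit feature map is small enough to write out; the function names `phi` and `poly2_kernel` are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, z) = (x . z)^2 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Same inner product, computed without ever building phi(x)."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)             # kernel evaluated in the original 2-D space

print(explicit, implicit)
```

Both values agree, which is the kernel trick in miniature: the SVM only needs inner products in the feature space, and the kernel supplies them at the cost of a computation in the input space.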

Typical kernels to be covered:

- Polynomial kernel: captures interaction and polynomial effects up to a chosen degree.
- Radial Basis Function (RBF) or Gaussian kernel: captures smooth, localized nonlinear patterns.
- Other kernels (e.g., sigmoid) as special-purpose choices.
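
For reference, the kernels listed above can be written in a few lines of NumPy. The parameterization with `gamma`, `coef0`, and `degree` follows scikit-learn's convention, which is an assumption about the tooling used later in the module; the final check compares each hand-written function with `sklearn.metrics.pairwise`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

def poly_k(x, z, degree=3, gamma=1.0, coef0=1.0):
    # (gamma * <x, z> + coef0) ** degree
    return (gamma * np.dot(x, z) + coef0) ** degree

def rbf_k(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2): close to 1 for nearby points, decays toward 0 with distance
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_k(x, z, gamma=1.0, coef0=0.0):
    # tanh(gamma * <x, z> + coef0)
    return np.tanh(gamma * np.dot(x, z) + coef0)

x = np.array([1.0, 0.5])
z = np.array([0.0, -1.0])
X, Z = x.reshape(1, -1), z.reshape(1, -1)  # scikit-learn expects 2-D arrays

matches = {
    "polynomial": np.isclose(
        poly_k(x, z), polynomial_kernel(X, Z, degree=3, gamma=1.0, coef0=1.0)[0, 0]
    ),
    "rbf": np.isclose(rbf_k(x, z), rbf_kernel(X, Z, gamma=1.0)[0, 0]),
    "sigmoid": np.isclose(
        sigmoid_k(x, z), sigmoid_kernel(X, Z, gamma=1.0, coef0=0.0)[0, 0]
    ),
}
print(matches)
```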

We will see how kernel choice and hyperparameters (such as degree, gamma, and the regularization parameter) control the complexity of the learned nonlinear boundary, and how to tune them in practice using cross-validation.
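
A minimal sketch of such a tuning loop, assuming scikit-learn's `SVC` and `GridSearchCV` on a synthetic dataset. The grid values are illustrative placeholders, not recommended ranges; `C` is scikit-learn's name for the SVM regularization parameter.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Standardize inside the pipeline so scaling is re-fit on each training fold.
pipe = make_pipeline(StandardScaler(), SVC())

# One sub-grid per kernel, since each kernel has its own hyperparameters.
param_grid = [
    {"svc__kernel": ["rbf"], "svc__C": [0.1, 1, 10], "svc__gamma": [0.1, 1, 10]},
    {"svc__kernel": ["poly"], "svc__C": [0.1, 1, 10], "svc__degree": [2, 3]},
]

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the pipeline refit on all of the data with the selected settings, ready for prediction on new inputs.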