## Scikit-Learn Transformers

**Transformers** in Scikit-Learn are classes that *transform* data before passing it to models. They are central to data preprocessing pipelines — automating transformations like scaling, encoding, imputation, and feature engineering.

Each transformer has two main actions:
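- **fit()** – learns whatever the transformation needs from the training data (e.g., column means for scaling, or the number of input features and their possible combinations for polynomial expansion).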
- **transform()** – applies that transformation (e.g., generating polynomial features, scaling values, or filling missing data).
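
As a minimal sketch of the fit/transform split, the example below uses `StandardScaler` on hypothetical toy data (`X_train` and `X_test` are made up for illustration). The scaler learns its mean and standard deviation from the training rows only, then reuses them on new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, separate train and test sets
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0]])

scaler = StandardScaler()
scaler.fit(X_train)                       # learns mean_ and scale_ from the training data
X_test_scaled = scaler.transform(X_test)  # reuses those learned statistics on new data

print(scaler.mean_)    # mean of the training column
print(X_test_scaled)   # test value expressed in training-set standard deviations
```

The test value is scaled with the training mean, not its own, which is the behavior you want when evaluating on unseen data.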

### Polynomial Features — Capturing Nonlinearity

Polynomial features help linear models capture nonlinear relationships by adding powers or combinations of existing features.

#### Concept

If you start with one feature `x`, the generated terms (leaving aside the constant term for now) are:
- Degree 1: `x`
- Degree 2: `x`, `x²`
- Degree 3: `x`, `x²`, `x³`

If you have two features, `x1` and `x2`, and use degree 2, the generated features are:
- `1` (a bias or constant term)
- `x1`, `x2`
- `x1²`, `x1*x2`, `x2²`

These additional terms represent nonlinear relationships and interactions between features.
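
To make the degree-2 expansion concrete, here is a small sketch that builds the six terms by hand and compares them with what `PolynomialFeatures` produces. The helper `degree2_terms` is a hypothetical name introduced for illustration, not part of Scikit-Learn.

```python
from itertools import combinations_with_replacement

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def degree2_terms(x1, x2):
    """Hand-built degree-2 expansion: 1, x1, x2, x1^2, x1*x2, x2^2."""
    features = [x1, x2]
    terms = [1.0]                 # bias / constant term
    terms += features             # degree-1 terms
    # degree-2 terms: every unordered pair, including a feature with itself
    terms += [a * b for a, b in combinations_with_replacement(features, 2)]
    return terms


manual = degree2_terms(2.0, 3.0)
library = PolynomialFeatures(degree=2).fit_transform([[2.0, 3.0]])[0]

print(manual)
print(library)
```

Both lists contain the same values in the same order, which is the column order described above.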

#### Example from the fuel efficiency case
In the previous example, the relationship between **fuel efficiency (mpg)** and **horsepower** was nonlinear. A linear model couldn’t easily fit that curve. Adding a squared term (`horsepower²`) allowed the model to better follow that curvature — improving prediction quality.
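
The effect can be sketched on synthetic data. The numbers below are not the fuel efficiency dataset: `x` is only a stand-in for horsepower, and `y` is generated from an assumed quadratic curve plus noise, purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption): a curved, decreasing relationship
rng = np.random.default_rng(0)
x = np.linspace(50, 200, 60).reshape(-1, 1)
y = 60 - 0.5 * x[:, 0] + 0.0012 * x[:, 0] ** 2 + rng.normal(0, 1.0, 60)

# Straight-line fit on the raw feature
r2_linear = LinearRegression().fit(x, y).score(x, y)

# Same model, with a squared term added (the regressor fits its own intercept)
x_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
r2_quad = LinearRegression().fit(x_quad, y).score(x_quad, y)

print(f"R^2 linear:    {r2_linear:.4f}")
print(f"R^2 quadratic: {r2_quad:.4f}")
```

On this data the quadratic fit scores a higher training R² than the straight line. Training scores alone do not prove better generalization, so a held-out set is still needed for a fair comparison.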



### PolynomialFeatures Transformer — Step-by-Step

Scikit-Learn provides **PolynomialFeatures** to automate this process.

In [8]:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Original dataset with two features
df = pd.DataFrame({
    "alpha": [1, 2, 3, 4],
    "beta": [5, 6, 7, 8]
})

# Create transformer for degree 2 polynomial features
poly_transform = PolynomialFeatures(degree=2)
poly_transform.fit(df[["alpha", "beta"]])

# Transform data
transformed_data = poly_transform.transform(df[["alpha", "beta"]])
transformed_data

array([[ 1.,  1.,  5.,  1.,  5., 25.],
       [ 1.,  2.,  6.,  4., 12., 36.],
       [ 1.,  3.,  7.,  9., 21., 49.],
       [ 1.,  4.,  8., 16., 32., 64.]])

#### What’s happening under the hood
1. **fit(df)** — determines possible feature combinations given the degree.
2. **transform(df)** — generates new columns:
   - 1 (bias)
   - alpha
   - beta
   - alpha²
   - alpha×beta
   - beta²

For instance, for the row where `alpha=2` and `beta=6`:
- α² = 4
- α×β = 12
- β² = 36

So the full transformed row is `[1, 2, 6, 4, 12, 36]`.

In [9]:
# By default, `transform()` outputs a numeric NumPy array. To recover the column labels, use:

poly_transform.get_feature_names_out()

array(['1', 'alpha', 'beta', 'alpha^2', 'alpha beta', 'beta^2'],
      dtype=object)

In [10]:
# To create a clearly labeled DataFrame:

transformed_data = pd.DataFrame(
    poly_transform.transform(df[["alpha", "beta"]]),
    columns=poly_transform.get_feature_names_out()
)
print(transformed_data)

     1  alpha  beta  alpha^2  alpha beta  beta^2
0  1.0    1.0   5.0      1.0         5.0    25.0
1  1.0    2.0   6.0      4.0        12.0    36.0
2  1.0    3.0   7.0      9.0        21.0    49.0
3  1.0    4.0   8.0     16.0        32.0    64.0


In [11]:
# Combine `fit()` and `transform()` in one step with `fit_transform()`:

transformed_data = poly_transform.fit_transform(df[["alpha", "beta"]])
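
A quick sketch, assuming the same two-column `df` as above, to check that the one-step call gives the same result as calling `fit()` and then `transform()`. The names `two_step`, `one_step`, and `same` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"alpha": [1, 2, 3, 4], "beta": [5, 6, 7, 8]})
X = df[["alpha", "beta"]]

# fit() returns the fitted transformer, so the two calls can be chained
two_step = PolynomialFeatures(degree=2).fit(X).transform(X)
one_step = PolynomialFeatures(degree=2).fit_transform(X)

same = np.array_equal(two_step, one_step)
print(same)  # True
```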

### Using Polynomial Features in a Regression Model

Suppose we want to predict a target variable `y` based on features `alpha` and `beta`.

In [12]:
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "alpha": [1, 2, 3, 4],
    "beta": [5, 6, 7, 8],
    "y": [7.9, 16.1, 29.9, 42.1]
})

poly_transform = PolynomialFeatures(degree=2)
X_poly = poly_transform.fit_transform(df[["alpha", "beta"]])

model = LinearRegression()
model.fit(X_poly, df["y"])

### Building a Pipeline for Automation

Now `model` has learned coefficients for six input terms: bias, alpha, beta, alpha², alpha×beta, and beta².

However, if you try:
```python
model.predict([[3, 5]])
```
you will get a `ValueError`, because the model was trained on six features, not two.
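
The sketch below reproduces the failure and shows the manual workaround: run the new point through the fitted transformer before predicting. It rebuilds the same `df`, `poly_transform`, and `model` as above; `new_point`, `raised`, and `prediction` are names introduced here for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "alpha": [1, 2, 3, 4],
    "beta": [5, 6, 7, 8],
    "y": [7.9, 16.1, 29.9, 42.1],
})

poly_transform = PolynomialFeatures(degree=2)
X_poly = poly_transform.fit_transform(df[["alpha", "beta"]])
model = LinearRegression().fit(X_poly, df["y"])

# Passing two raw features to a model trained on six fails
try:
    model.predict([[3, 5]])
    raised = False
except ValueError as err:
    raised = True
    print("Prediction failed:", err)

# Manual workaround: expand the new point with the fitted transformer first
new_point = pd.DataFrame({"alpha": [3], "beta": [5]})
prediction = model.predict(poly_transform.transform(new_point))
print(prediction)
```

This works, but every caller has to remember the extra `transform()` step.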

To fix this and make preprocessing automatic, use a **Pipeline**:

In [13]:
from sklearn.pipeline import Pipeline

# Create a pipelined model that includes both transformer and regressor
pipelined_model = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),
    ('linear_regression', LinearRegression())
])

# Fit on original two features
pipelined_model.fit(df[["alpha", "beta"]], df["y"])

# Now you can predict directly using only the original features:
pipelined_model.predict([[3, 5]])



array([66.39181554])

Here’s what happens internally:
1. The pipeline takes the input `[[3, 5]]`.
2. It sends it to the `'poly_features'` transformer → generates polynomial terms.
3. Then it sends those transformed features into `'linear_regression'`.
4. Finally, it returns the prediction.
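
The steps above can be sketched by hand: call each stage yourself and compare with the pipeline's own prediction. This assumes the same data and step names as the pipeline above; `new_point`, `expanded`, and `manual_prediction` are names introduced for illustration. The new point is passed as a DataFrame so its column names match those seen during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({
    "alpha": [1, 2, 3, 4],
    "beta": [5, 6, 7, 8],
    "y": [7.9, 16.1, 29.9, 42.1],
})

pipelined_model = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("linear_regression", LinearRegression()),
])
pipelined_model.fit(df[["alpha", "beta"]], df["y"])

new_point = pd.DataFrame({"alpha": [3], "beta": [5]})

# Steps 1-2: the transformer expands the two raw features
expanded = pipelined_model.named_steps["poly_features"].transform(new_point)

# Steps 3-4: the regressor predicts from the expanded features
manual_prediction = pipelined_model.named_steps["linear_regression"].predict(expanded)

print(expanded.shape)
print(np.allclose(manual_prediction, pipelined_model.predict(new_point)))
```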

In [14]:
# Accessing Stages Inside a Pipeline
# You can inspect or extract parts of the pipeline. For example:

pipelined_model.named_steps['linear_regression'].coef_

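# This retrieves the learned coefficients from the linear regression step inside the pipeline.
# Because this pipeline uses degree=3, two input features expand to ten polynomial terms,
# so there are ten coefficients here rather than the six from the degree-2 example.
# With only four training rows, these coefficients are not uniquely determined.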

array([ 8.00470801e-14, -2.17241379e-01, -2.17241379e-01,  1.07619443e+00,
        2.07228916e-01, -6.61736602e-01, -3.28130453e+00,  1.02347320e+00,
        1.85238887e+00, -7.94557541e-01])

### Why Pipelines Are Important

Pipelines ensure the same sequence of steps occurs every time you train or make predictions. This:
- Prevents mistakes (like forgetting to apply the same preprocessing on test data),
- Keeps code cleaner and more maintainable,
- Allows chaining more complex processes (e.g., scaling → polynomial transform → regression).
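
As a sketch of the longer chain mentioned in the last bullet (scaling → polynomial transform → regression), the example below fits a three-step pipeline on synthetic data and scores it on a held-out split. The data and the step names (`"scale"`, `"poly"`, `"reg"`) are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): a quadratic surface in two features plus noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 2))
y = (1.0 + 2.0 * X[:, 0] - X[:, 1]
     + 0.5 * X[:, 0] * X[:, 1] + X[:, 1] ** 2
     + rng.normal(0, 0.1, 200))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("reg", LinearRegression()),
])

# fit() learns the scaling statistics and coefficients from training data only
pipe.fit(X_train, y_train)

# score() applies the same fitted scaling and expansion to the test data
test_r2 = pipe.score(X_test, y_test)
print(f"Test R^2: {test_r2:.3f}")
```

Because the test rows only ever pass through `transform()`, nothing about them leaks into the fitted preprocessing steps.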

### Summary of Key Concepts

| Concept | Description | Example |
|----------|--------------|----------|
| fit() | Learns data structure or computes parameters | Determine feature combinations |
| transform() | Applies the learned transformation | Generate polynomial features |
| fit_transform() | Combines the two steps | One-liner preprocessing |
| get_feature_names_out() | Returns names of new features | Verify polynomial terms |
| Pipeline() | Chains transformations and model training | Polynomial + Regression |

### Visual Analogy

Imagine your data as ingredients (alpha, beta). The **PolynomialFeatures transformer** blends and combines them into richer mixtures (alpha², alpha×beta, etc.). The **Pipeline** is your recipe that ensures you always mix and bake in the correct order, producing consistent results every time.