# Day-17 Feature Engineering Basics

# Topics Covered

- What is feature engineering?
- Data Encoding
    - Label Encoding
    - One-Hot Encoding
- Numeric Prep
    - Binning
    - Scaling (Standard, MinMax)
- Interaction
    - Polynomial features
    - Crossed features
- Integration with pipelines (Ref: Day 16)

## What is feature engineering?

Feature engineering is the process of turning raw data into inputs a model can use effectively: encoding categories, preparing numeric columns, and building interaction features.

## Data Encoding

### Label Encoding

Most models can't work directly with text values like gender = 'Male'. Label encoding maps each category to an integer.

In [4]:
from sklearn.preprocessing import LabelEncoder

gender = ['Male', 'Female', 'Female', 'Male']
le = LabelEncoder()
le.fit_transform(gender)  # Output: [1, 0, 0, 1]


array([1, 0, 0, 1])

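Use when categories are ordinal or for tree-based models (Decision Tree, Ref: Day 13). Note that `LabelEncoder` assigns integers in sorted (alphabetical) order and is designed for target labels, so for ordinal features make sure the integer order matches the real category order.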
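
When the category order matters, the mapping can be set explicitly rather than left to alphabetical sorting. A minimal sketch, assuming a hypothetical `sizes` feature (not from these notes) and scikit-learn's `OrdinalEncoder` with an explicit `categories` list:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature, made up for illustration
sizes = np.array([['Medium'], ['Low'], ['High'], ['Low']])

# Pass the order explicitly so Low < Medium < High maps to 0 < 1 < 2
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())  # [1. 0. 2. 0.]
```

Without the `categories` argument the encoder would sort alphabetically (High, Low, Medium), which does not match the intended order.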

### One-Hot Encoding

In [3]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male']})
ohe = OneHotEncoder(sparse_output=False)
ohe.fit_transform(df[['gender']])


array([[0., 1.],
       [1., 0.],
       [1., 0.],
       [0., 1.]])

#### Use when:

- Categories are nominal

- You're working with linear/logistic regression (Ref: Days 9–12)

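Avoid the dummy variable trap: drop one column (for example with `OneHotEncoder(drop='first')`) so the encoded columns are not perfectly collinear in linear models.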

## Numeric Prep

### Binning (Bucketing)

Binning is useful when you want to simplify a continuous feature or dampen the effect of outliers.

In [6]:
import pandas as pd

ages = pd.Series([18, 25, 30, 40, 60])
bins = [0, 18, 35, 60, 100]  # intervals are right-inclusive: (0, 18], (18, 35], ...
labels = ['Teen', 'Young Adult', 'Adult', 'Senior']

age_binned = pd.cut(ages, bins=bins, labels=labels)
print(age_binned)


0           Teen
1    Young Adult
2    Young Adult
3          Adult
4          Adult
dtype: category
Categories (4, object): ['Teen' < 'Young Adult' < 'Adult' < 'Senior']
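
In the output above, 18 lands in 'Teen' and 60 in 'Adult' because `pd.cut` includes the right edge of each bin by default. A small sketch of the alternative, using the same data with `right=False` so each bin includes its left edge instead:

```python
import pandas as pd

ages = pd.Series([18, 25, 30, 40, 60])
bins = [0, 18, 35, 60, 100]
labels = ['Teen', 'Young Adult', 'Adult', 'Senior']

# right=False gives left-closed bins: [0, 18), [18, 35), [35, 60), [60, 100)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)
print(left_closed.tolist())
# ['Young Adult', 'Young Adult', 'Young Adult', 'Adult', 'Senior']
```

Which convention is right depends on how the age groups are defined, so it is worth checking values that sit exactly on a bin edge.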


#### Example Use Case:

- Instead of predicting an exact age, classify into an age group.

- Great for decision trees (Ref: Day 13)

### Scaling

Scaling is critical for:

- Distance-based algorithms (KNN, SVM)

- Gradient descent-based models (Linear/Logistic Regression – Day 9, 12)

#### StandardScaler (Z-score normalization)

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit_transform([[1], [2], [3]])


array([[-1.22474487],
       [ 0.        ],
       [ 1.22474487]])
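
The values above follow from the z-score formula, z = (x - mean) / std. A minimal check, assuming NumPy's default population standard deviation (`ddof=0`), which is what `StandardScaler` uses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0]])

# z = (x - mean) / std, with the population standard deviation (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))  # True
```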

#### MinMaxScaler (0 to 1 normalization)

In [8]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit_transform([[10], [20], [30]])


array([[0. ],
       [0.5],
       [1. ]])
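
The same result can be reproduced by hand with (x - min) / (max - min). The sketch below also shows that, with default settings, a value outside the range seen during `fit` maps outside [0, 1]; the value 40 is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10.0], [20.0], [30.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

scaler = MinMaxScaler()
scaled = scaler.fit_transform(x)
print(np.allclose(manual, scaled))  # True

# A new value beyond the fitted maximum is not clipped by default
outside = scaler.transform(np.array([[40.0]]))
print(outside)  # [[1.5]]
```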

Scale numeric features when using Logistic Regression, SVM, or KNN.
Putting the scaler inside a Pipeline (Day 16) applies it consistently: it is fit on the training data and reused unchanged at prediction time.

## Integration with Pipelines

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Let's say age and salary are numeric, gender is categorical
numeric = ['age', 'salary']
categorical = ['gender']

numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipe, numeric),
    ('cat', cat_pipe, categorical)
])
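
The `preprocessor` above is only defined, not used. A minimal end-to-end sketch follows, assuming a hypothetical toy DataFrame and a `LogisticRegression` classifier; the data, the target `y`, and the step names `preprocess` and `clf` are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing numeric values
X = pd.DataFrame({
    'age': [25, 32, np.nan, 51, 46, 29],
    'salary': [30000, 42000, 58000, np.nan, 75000, 36000],
    'gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female'],
})
y = [0, 0, 1, 1, 1, 0]

numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
])
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer([
    ('num', numeric_pipe, ['age', 'salary']),
    ('cat', cat_pipe, ['gender']),
])

# Chain preprocessing and the model so both are fit together
model = Pipeline([
    ('preprocess', preprocessor),
    ('clf', LogisticRegression()),
])
model.fit(X, y)

# An unseen category ('Other') is encoded as all zeros by handle_unknown='ignore'
new_rows = pd.DataFrame({'age': [40], 'salary': [50000], 'gender': ['Other']})
predictions = model.predict(new_rows)
print(predictions)
```

Because imputation, scaling, and encoding live inside the pipeline, calling `model.predict` on new rows reuses the statistics learned during `fit` instead of recomputing them.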


| Technique         | Use Case                                            |
| ----------------- | --------------------------------------------------- |
| Label Encoding    | Ordinal data; tree-based models (Ref: Day 13)       |
| One-Hot Encoding  | Nominal categories for linear/logistic models       |
| Binning           | Simplifying continuous features (e.g., age groups)  |
| Scaling           | Needed for Linear, Logistic, KNN, SVM               |
| Interaction Terms | Polynomial regression, feature crossing             |
| Pipelines         | When combining preprocessing + models (Ref: Day 16) |
