<center><h1><strong>Feature Engineering</strong></h1></center>

Feature engineering is the process of transforming raw data into meaningful input variables (features) that help machine learning models learn patterns and make more accurate predictions. In practice, it often affects final model performance more than the choice of algorithm.

1. Core idea and motivation
- A feature is an input variable used by a model, e.g. `age`, `salary`, `city`, `last_7_days_clicks`.

- Feature engineering means selecting, cleaning, transforming, and creating such variables so that the model can separate classes or predict values effectively.

- Good feature engineering leads to:

  - Higher accuracy and more stable models.
  - Faster training and simpler models that generalize better.
  - Less need for very complex algorithms to get good results.
<br/>
In applied ML, a commonly cited rule of thumb is that 60–80% of “model building” time goes to feature engineering and data preparation.

2. Typical feature engineering workflow
The workflow is best treated as a repeatable pipeline with the following steps:

- Understand the problem and data

  - Define target (e.g. churn yes/no, house price) and business goal.
  - Inspect data types, distributions, data quality, and possible sources of leakage.

- Data cleaning

  - Handle missing values, inconsistent formats, wrong types, and duplicates.
  - Remove or cap extreme outliers if they are errors or distort training.

- Basic preprocessing

  - Encode categorical variables, scale/normalize numeric variables, convert dates, and parse text or IDs.
  - Split into train/validation/test sets before fitting any imputer, encoder, or scaler, so that information from held-out data does not leak into training.

- Feature creation & transformation

  - Create domain-specific features, ratios, aggregations, interactions, and time-based features.
  - Apply mathematical transforms (log, square root, binning, etc.) to reduce skew and stabilize distributions.

- Feature selection & reduction

  - Remove redundant, noisy, or low-importance features using statistical tests or model-based methods.
  - Optionally use dimensionality reduction like PCA for high-dimensional data.
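
The feature creation and transformation step can be sketched on a small, made-up table. The column names (`age`, `income`, `debt`, `signup_date`), the values, and the bin edges below are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data, invented for this illustration
raw = pd.DataFrame({
    "age": [25, 32, 47, 40, 29],
    "income": [40000, 52000, 61000, 250000, 45000],
    "debt": [10000, 13000, 0, 50000, 9000],
    "signup_date": pd.to_datetime(
        ["2023-01-15", "2023-03-02", "2023-06-20", "2023-11-05", "2024-02-29"]
    ),
})

features = pd.DataFrame(index=raw.index)

# Ratio feature: debt relative to income
features["debt_to_income"] = raw["debt"] / raw["income"]

# Log transform to compress the long right tail of income
features["log_income"] = np.log1p(raw["income"])

# Binning: turn continuous age into ordered groups (bin edges are arbitrary)
features["age_group"] = pd.cut(
    raw["age"], bins=[0, 30, 45, 120], labels=["young", "middle", "senior"]
)

# Time-based features extracted from a date column
features["signup_month"] = raw["signup_date"].dt.month
features["signup_dayofweek"] = raw["signup_date"].dt.dayofweek  # Monday = 0

print(features)
```

Which of these derived columns actually helps depends on the problem, so each one is best treated as a candidate to validate rather than a guaranteed improvement.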



In [1]:
import pandas as pd
import numpy as np

In [2]:
# Small toy dataset with missing values in every feature column
df = pd.DataFrame({
    "age": [25, 32, np.nan, 40, 29],
    "income": [40000, 52000, 61000, np.nan, 45000],
    "city": ["Delhi", "Mumbai", "Delhi", np.nan, "Bangalore"],
    "bought": [0, 1, 0, 1, 0]
})

Separate features and target

In [3]:
x = df.drop("bought",axis=1)
y = df["bought"]

Handling missing values

In [4]:
from sklearn.impute import SimpleImputer

# Identify numeric and categorical columns
numeric_features = ["age", "income"]
categorical_features = ["city"]

# Create imputers: median for numeric columns, most frequent value for categorical ones
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# Fit and transform (both calls return NumPy arrays)
x_num = num_imputer.fit_transform(x[numeric_features])
x_cat = cat_imputer.fit_transform(x[categorical_features])
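To check what the imputers learned, the fill values can be read from the `statistics_` attribute after fitting. The sketch below rebuilds toy columns modeled on the DataFrame above so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy columns modeled on the DataFrame defined earlier
x = pd.DataFrame({
    "age": [25, 32, np.nan, 40, 29],
    "income": [40000, 52000, 61000, np.nan, 45000],
    "city": ["Delhi", "Mumbai", "Delhi", np.nan, "Bangalore"],
})

num_imputer = SimpleImputer(strategy="median").fit(x[["age", "income"]])
cat_imputer = SimpleImputer(strategy="most_frequent").fit(x[["city"]])

# statistics_ holds one fill value per input column
print(num_imputer.statistics_)  # medians of age and income
print(cat_imputer.statistics_)  # most frequent city
```

Here the missing age is filled with 30.5, the missing income with 48500, and the missing city with "Delhi".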

Encoding categorical variables

In [5]:
from sklearn.preprocessing import OneHotEncoder

# Dense output; categories not seen during fit are encoded as all zeros
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
x_cat_encoded = encoder.fit_transform(x_cat)

Scaling numeric features

In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_num_scaled = scaler.fit_transform(x_num)
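As a quick sanity check, each standardized column should have mean 0 and standard deviation 1. The sketch below hard-codes the values that median imputation would produce for the toy data, so it runs without the earlier cells; note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and income after median imputation (values assumed from the toy data)
x_num = np.array([
    [25.0, 40000.0],
    [32.0, 52000.0],
    [30.5, 61000.0],
    [40.0, 48500.0],
    [29.0, 45000.0],
])

scaler = StandardScaler()
x_num_scaled = scaler.fit_transform(x_num)

print(scaler.mean_)               # per-column means learned during fit
print(x_num_scaled.mean(axis=0))  # approximately 0 for each column
print(x_num_scaled.std(axis=0))   # approximately 1 for each column

# Equivalent manual computation: (value - mean) / population std
manual = (x_num - x_num.mean(axis=0)) / x_num.std(axis=0)
```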

Putting numeric and categorical features back together

In [7]:
x_processed = np.hstack([x_num_scaled,x_cat_encoded])

In [None]:
x_processed