# 🧱 Feature Engineering in Machine Learning

## 🧠 What is Feature Engineering?

**Feature Engineering** is the process of using domain knowledge and data understanding to **create, transform, or select** features that improve the performance of machine learning models.

---

## 🎯 Objectives of Feature Engineering

- Enhance **model accuracy** by providing meaningful input
- Reduce **noise and redundancy** in features
- Enable **simpler models** by removing irrelevant data
- Make models more **interpretable** and **generalizable**

---

## ⚙️ Key Steps in Feature Engineering

---

## 1️⃣ Feature Creation

**Definition**: Creating new features by combining or transforming existing ones.

### 📌 Examples:
- Total cost = Quantity × Price
- Age from Date of Birth
- Text length = `len(text)`

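```python
# Combine two numeric columns into a new feature
df['total_cost'] = df['quantity'] * df['unit_price']
# .str.len() leaves missing reviews as NaN instead of raising a TypeError
df['text_length'] = df['review'].str.len()
```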
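
The "Age from Date of Birth" example in the list above can be sketched as follows. This is a minimal illustration, not the only way to do it: the `date_of_birth` column, the sample rows, and the fixed `reference_date` are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data; the column name `date_of_birth` is an assumption.
df = pd.DataFrame({"date_of_birth": ["1990-06-15", "2000-01-01", "1985-12-31"]})
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])

# A fixed reference date keeps the result reproducible.
reference_date = pd.Timestamp("2024-01-01")

# True when the birthday has already occurred in the reference year.
dob = df["date_of_birth"]
had_birthday = (dob.dt.month < reference_date.month) | (
    (dob.dt.month == reference_date.month) & (dob.dt.day <= reference_date.day)
)

# Year difference, minus one if the birthday is still to come.
df["age"] = reference_date.year - dob.dt.year - (~had_birthday).astype(int)
print(df)
```

In practice the reference date would usually be the date of the event being modeled (for example, the purchase date) rather than a constant.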

---

## 2️⃣ Feature Transformation
Definition: Converting features into forms suitable for modeling.

### 🔸 a. Scaling (Normalization / Standardization)

**Standardization: Mean = 0, SD = 1**
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['age', 'salary']] = scaler.fit_transform(df[['age', 'salary']])
```
**Min-Max Scaling: Rescales data to [0, 1]**
```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['age', 'income']] = scaler.fit_transform(df[['age', 'income']])
```
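
The snippets above fit the scaler on the whole DataFrame. When data is split into training and test sets, a common practice is to fit on the training rows only and reuse those statistics for the test rows, so that test information does not leak into preprocessing. A minimal sketch, assuming hypothetical `age` and `salary` values:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for the `age` and `salary` columns.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 38, 44],
    "salary": [28000, 52000, 61000, 75000, 39000, 82000, 58000, 67000],
})

train, test = train_test_split(df, test_size=0.25, random_state=0)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only...
train_scaled = scaler.fit_transform(train[["age", "salary"]])
# ...then apply the same statistics to the test rows.
test_scaled = scaler.transform(test[["age", "salary"]])

print(train_scaled.mean(axis=0))  # approximately 0 for each column
```

The same fit-on-train, transform-on-test pattern applies to `MinMaxScaler`.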
### 🔸 b. Log Transform / Power Transform
Useful for right-skewed data or heteroscedasticity. `np.log1p` computes `log(1 + x)`, so it handles zeros but requires values greater than -1.
```python
import numpy as np
df['log_income'] = np.log1p(df['income'])
```
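
The heading also mentions power transforms. One option in scikit-learn is `PowerTransformer`, sketched here under assumptions: the `income` values are synthetic, and Yeo-Johnson is chosen only because it also accepts zero and negative values (Box-Cox requires strictly positive data).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed income values (in thousands), generated for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=3, sigma=1, size=500)})

# Yeo-Johnson estimates a power parameter from the data;
# standardize=True also rescales the output to mean 0 and unit variance.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
df["income_pt"] = pt.fit_transform(df[["income"]]).ravel()

# Compare skewness before and after the transform.
print(df["income"].skew(), df["income_pt"].skew())
```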

---

## 3️⃣ Encoding Categorical Features

### 🔸 a. Label Encoding
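Assigns an integer to each category. `LabelEncoder` numbers categories in sorted (alphabetical) order and is intended for target labels, so the integers only carry meaning when that order matches a real ranking. Use it with care on nominal features such as `gender`.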

```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
```
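
When a feature has a genuine order, one option is scikit-learn's `OrdinalEncoder` with the category order spelled out explicitly. A minimal sketch, assuming a hypothetical `shirt_size` column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; the column name and values are assumptions.
df = pd.DataFrame({"shirt_size": ["medium", "small", "large", "small"]})

# State the order explicitly; without `categories`, values are sorted alphabetically.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_encoded"] = encoder.fit_transform(df[["shirt_size"]]).ravel()

print(df)  # small -> 0.0, medium -> 1.0, large -> 2.0
```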
### 🔸 b. One-Hot Encoding
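Converts each category into a separate binary column; suited to nominal features. `drop_first=True` drops one column per feature to avoid perfectly collinear columns.
```python
# get_dummies returns a new DataFrame, so assign the result back
df = pd.get_dummies(df, columns=['city'], drop_first=True)
```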

---

## 4️⃣ Binning
Definition: Converting continuous variables into categorical ones by dividing their range into bins.

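```python
# Right-inclusive bins: [0, 18], (18, 35], (35, 60], (60, 100]; include_lowest keeps age 0
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 35, 60, 100],
                         labels=['Teen', 'Young Adult', 'Adult', 'Senior'], include_lowest=True)
```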
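
`pd.cut` uses fixed bin edges. When equally populated bins are preferred, `pd.qcut` splits at quantiles instead. A minimal sketch, assuming a hypothetical `income` column:

```python
import pandas as pd

# Hypothetical income values for illustration.
df = pd.DataFrame({"income": [12, 18, 25, 31, 40, 52, 75, 120]})

# q=4 splits at the quartiles, so each bin holds roughly the same number of rows.
df["income_quartile"] = pd.qcut(df["income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(df["income_quartile"].value_counts().sort_index())
```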
---


## 5️⃣ Date-Time Feature Extraction
Extract useful components from date fields like year, month, day, hour, weekday.

```python
# Parse the column as datetime so the .dt accessor is available
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
df['month'] = df['purchase_date'].dt.month
df['weekday'] = df['purchase_date'].dt.day_name()  # e.g. 'Monday'
df['hour'] = df['purchase_date'].dt.hour
```
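
Two further date-derived features that are sometimes useful are a weekend flag and a cyclical encoding of the hour, so that 23:00 and 00:00 end up close together. A minimal sketch; the sample timestamps and the new column names are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps for illustration.
df = pd.DataFrame({"purchase_date": pd.to_datetime([
    "2024-01-06 23:00",  # Saturday
    "2024-01-08 00:00",  # Monday
    "2024-01-10 12:00",  # Wednesday
])})

# dayofweek runs from Monday=0 to Sunday=6.
df["is_weekend"] = df["purchase_date"].dt.dayofweek >= 5

# Map the hour onto a circle using sine and cosine.
hour = df["purchase_date"].dt.hour
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)

print(df)
```

Whether the cyclical encoding helps depends on the model; tree-based models often work fine with the raw hour.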

---

## 6️⃣ Text Feature Extraction
### 🔸 a. Word Count / Character Count
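```python
# Fill missing reviews with '' so both counts are 0 rather than an error or a count of 'nan'
reviews = df['review'].fillna('')
df['char_count'] = reviews.str.len()
df['word_count'] = reviews.str.split().str.len()
```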
### 🔸 b. TF-IDF Vectorization (for ML models)
```python
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=100)
X = tfidf.fit_transform(df['review'])
```

---

## 7️⃣ Feature Selection
Definition: Selecting the most relevant features for the model.

Methods:
- Correlation Matrix
- Univariate Feature Selection (SelectKBest)
- Recursive Feature Elimination (RFE)
- Tree-based importance (e.g., RandomForestClassifier)

```python
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)
```
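
Two of the other methods listed above, RFE and tree-based importance, can be sketched as follows. The data is synthetic (`make_classification`), and the choice of `LogisticRegression` as the RFE estimator and of 3 features to keep are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# RFE: repeatedly fit the estimator and drop the weakest feature
# until only n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print("RFE keeps features:", rfe.support_.nonzero()[0])

# Tree-based importance: impurity-based scores that sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print("Importances:", forest.feature_importances_.round(3))
```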


---

## 📋 Summary

| Technique          | Purpose                       | Example Tool                     |
| ------------------ | ----------------------------- | -------------------------------- |
| Feature Creation   | Add domain knowledge          | `df['total'] = A * B`            |
| Scaling            | Normalize data                | `StandardScaler`, `MinMaxScaler` |
| Encoding           | Convert categories to numbers | `LabelEncoder`, `get_dummies()`  |
| Binning            | Discretize continuous data    | `pd.cut()`                       |
| Text Processing    | Extract info from text        | `TfidfVectorizer()`              |
| Date-Time Features | Add time-based columns        | `.dt.month`, `.dt.hour`          |
| Feature Selection  | Remove irrelevant variables   | `SelectKBest`, `RFE`             |


## 🧾 Final Notes
- Feature Engineering is often more impactful than choosing the algorithm.
- Good features can simplify models, improve accuracy, and reduce overfitting.
- Combine domain knowledge + statistical insight + EDA for best results.