

#### 🧠 Intermediate Machine Learning

This course prepares you to **build more accurate and robust models** by tackling common data issues and exploring more advanced techniques.

##### 1. **Introduction**

* Quick recap of what you've learned so far (fitting models, validation, random forests).
* Overview of **key real-world challenges**:

  * Missing values
  * Categorical variables
  * Feature engineering
  * Model tuning

---

##### 2. **Handling Missing Values**

Real-world data often has **missing entries** (NaNs). This lesson covers:

* How to **detect** missing values
* Strategies to handle them:

  * Drop rows/columns
  * **Imputation** (filling in with mean, median, or constant)

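```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Replace each missing value with its column mean (X must contain only numeric columns).
imputer = SimpleImputer(strategy='mean')

# fit_transform returns a NumPy array, so restore the column names and index.
imputed_X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns, index=X.index)
```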
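
Detecting and dropping missing values can be sketched as follows. The toy DataFrame and its column names (`rooms`, `area`, `year`) are made up for illustration and are not part of the course material.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are for illustration only.
X = pd.DataFrame({
    "rooms": [3, 4, np.nan, 5],
    "area": [70.0, np.nan, 55.0, 120.0],
    "year": [1990, 2005, 2011, 1978],
})

# Detect: count the missing entries in each column and show the affected ones.
missing_counts = X.isnull().sum()
print(missing_counts[missing_counts > 0])

# Option 1: drop every column that contains at least one missing value.
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X_dropped_cols = X.drop(columns=cols_with_missing)

# Option 2: drop every row that contains at least one missing value.
X_dropped_rows = X.dropna(axis=0)
```

Dropping is simple but discards information: here, dropping columns leaves only `year`, which is why imputation is often the better starting point.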

---

##### 3. **Handling Categorical Variables**

Most models can’t work directly with text (categorical) data, so it has to be converted to numbers first. You’ll learn two main methods:

* **Label Encoding**: Assigns a number to each category

  * Good for **ordinal** (ordered) data
* **One-Hot Encoding**: Creates binary columns for each category

  * Good for **nominal** (unordered) data

```python
import pandas as pd

# One-Hot Encoding: each categorical column becomes one binary column per category.
X = pd.get_dummies(X)
```
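
For ordinal data, one possible sketch of label-style encoding uses scikit-learn's `OrdinalEncoder` with an explicit category order. This is an assumption about tooling, not necessarily what the course uses, and the `size` column is a hypothetical example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: the categories have a natural order.
X = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Pass the order explicitly; otherwise the categories are sorted alphabetically.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])

# fit_transform expects a 2D input and returns a 2D float array.
X["size_encoded"] = encoder.fit_transform(X[["size"]]).ravel()
print(X)
```

Stating the order explicitly matters: an alphabetical default would rank `large` below `medium`, which misrepresents the data.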

You’ll also learn how to align the encoded training and test data so both have the same columns; otherwise the model fails at prediction time.
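
A minimal sketch of that alignment step, assuming pandas `get_dummies` is used for encoding. The two small DataFrames are hypothetical; the validation set deliberately contains a category (`Purple`) that the training set never saw.

```python
import pandas as pd

# Hypothetical train/validation data with different sets of categories.
X_train = pd.DataFrame({"color": ["Blue", "Red", "Green"], "age": [12, 13, 14]})
X_valid = pd.DataFrame({"color": ["Red", "Purple"], "age": [15, 16]})

# Encoding each set separately produces different columns.
X_train_enc = pd.get_dummies(X_train, columns=["color"], dtype=int)
X_valid_enc = pd.get_dummies(X_valid, columns=["color"], dtype=int)

# Keep exactly the training columns (join="left"): columns missing from the
# validation set are added and filled with 0, unseen categories are dropped.
X_train_enc, X_valid_enc = X_train_enc.align(
    X_valid_enc, join="left", axis=1, fill_value=0
)
print(X_valid_enc)
```

After alignment both frames have the same columns in the same order, so a model trained on `X_train_enc` can score `X_valid_enc`.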

---

##### 4. **Pipelines**

Pipelines let you **bundle preprocessing and modeling** steps together. This helps:

* Keep code cleaner
* Avoid data leakage
* Automate preprocessing during predictions

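```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Each step is a (name, transformer or estimator) pair; the final step is the model.
pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor())
])

# Fitting the pipeline fits the imputer and the model on the training data only.
pipeline.fit(X_train, y_train)
```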
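
When the data mixes numeric and categorical columns, the preprocessing step can itself be a `ColumnTransformer`. The sketch below is one possible arrangement, not the course's exact code; the data and the column names `area` and `color` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data: one numeric column with a gap, one categorical column.
X_train = pd.DataFrame({
    "area": [70.0, np.nan, 55.0, 120.0, 95.0, 60.0],
    "color": ["Blue", "Red", "Green", "Green", "Red", "Blue"],
})
y_train = pd.Series([200, 260, 150, 400, 310, 180])

# Route each column type to its own preprocessing.
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipeline.fit(X_train, y_train)

# The same preprocessing is applied automatically at prediction time,
# even for a missing value and a category not seen during training.
X_new = pd.DataFrame({"area": [np.nan], "color": ["Purple"]})
preds = pipeline.predict(X_new)
print(preds)
```

With `handle_unknown="ignore"`, an unseen category is encoded as all zeros instead of raising an error.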

---

##### 5. **Cross-Validation**

Instead of a single train/test split, **cross-validation** helps you better estimate your model’s performance by training and testing it multiple times on different splits of the data.

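```python
from sklearn.model_selection import cross_val_score

# `model` is any scikit-learn estimator or pipeline, e.g. the pipeline from the previous section.
# scikit-learn returns negated MAE (higher is better), so use -scores to read it as MAE.
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
```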
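
A self-contained sketch of cross-validating a whole pipeline. The synthetic data from `make_regression` and the model settings are assumptions chosen only to make the example runnable.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data, used here only for illustration.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

pipeline = Pipeline(steps=[
    ("preprocessor", SimpleImputer()),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# Multiply by -1 because scikit-learn reports negated MAE.
scores = -1 * cross_val_score(
    pipeline, X, y, cv=5, scoring="neg_mean_absolute_error"
)
print("MAE per fold:", scores)
print("Average MAE:", scores.mean())
```

Passing the pipeline rather than a bare model means the preprocessing is refit inside each fold, so no fold's validation rows influence its training.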

---

##### 6. **XGBoost (Extreme Gradient Boosting)**

A **powerful ML algorithm** that often outperforms random forests on structured (tabular) data.

* Based on gradient-boosted decision trees
* Requires the data to be converted to a numeric format

```python
from xgboost import XGBRegressor
model = XGBRegressor()
model.fit(X_train, y_train)
```

You’ll also learn how to use:

* Early stopping
* Parameter tuning for XGBoost

---

##### 7. **Data Leakage**

One of the most important real-world ML concepts.

**Data leakage** happens when the model accidentally learns from information it **shouldn’t have access to during training**, usually leading to **over-optimistic results**.

You’ll learn:

* How to detect leakage
* How to prevent it by properly splitting data and excluding features
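
One common form of leakage is fitting a preprocessing step on all the data before splitting. A minimal sketch of the safer order, using hypothetical data with a single `area` column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical data with missing values.
X = pd.DataFrame({"area": [50.0, 60.0, np.nan, 80.0, 90.0, 100.0, np.nan, 120.0]})
y = pd.Series([150, 180, 200, 240, 270, 300, 330, 360])

# Split first, so the validation rows stay unseen during preprocessing.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the imputer on the training rows only, then reuse it on the validation rows.
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_valid_imp = pd.DataFrame(
    imputer.transform(X_valid), columns=X.columns, index=X_valid.index
)

print("Mean learned from training rows only:", imputer.statistics_[0])
```

Calling `fit_transform` on the full `X` before the split would let validation values shape the imputed mean, which is exactly the kind of contamination that putting the imputer inside a pipeline prevents.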

---

##### ✅ Skills You'll Gain

* Handle missing and categorical data like a pro
* Build clean, reusable pipelines
* Use cross-validation for better model evaluation
* Train advanced models like XGBoost
* Avoid common pitfalls like data leakage


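The following example applies both encodings from lesson 3 to a small DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"S-No": [1, 2, 3], "Age": [12, 13, 14], "Color_Group": ["Blue", "Red", "Green"]}
)

# Label encoding: factorize assigns integer codes in order of first appearance.
# Note that the codes come from Color_Group, despite the column name Age_Rank.
df["Age_Rank"] = pd.factorize(df["Color_Group"])[0]

# One-hot encoding: replace Color_Group with one boolean column per category.
df = pd.get_dummies(data=df, columns=["Color_Group"])
df
```

Output:

|   | S-No | Age | Age_Rank | Color_Group_Blue | Color_Group_Green | Color_Group_Red |
|---|------|-----|----------|------------------|-------------------|-----------------|
| 0 | 1    | 12  | 0        | True             | False             | False           |
| 1 | 2    | 13  | 1        | False            | False             | True            |
| 2 | 3    | 14  | 2        | False            | True              | False           |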
