## CatBoost Classifier

**CatBoost** is a gradient boosting algorithm designed to efficiently handle datasets containing categorical features and to reduce overfitting during training.

### Key Features

#### 1. Native Categorical Feature Support

* Unlike many traditional algorithms that require preprocessing categorical variables into numeric representations (e.g., one-hot encoding), CatBoost **processes categorical features natively**.
* It employs a technique known as **Ordered Target Statistics**: each example's category is encoded using target statistics computed only from examples that precede it in a random permutation, so the example's own target never enters its encoding. This limits target leakage and reduces the risk of overfitting.

#### 2. Ordered Boosting Technique

* Conventional gradient boosting methods can suffer from **prediction shift**: at each step the residuals are computed with a model that was trained on those same examples, so they are biased relative to what the model would produce on unseen data, which can lead to overfitting.
* CatBoost mitigates this issue through **ordered boosting**: examples are arranged in a random permutation, and the prediction used to compute each example's residual comes from a model trained only on the examples that precede it, never on its own target.
* This approach improves generalization and model robustness.
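
The difference between the two ways of computing residuals can be illustrated with a deliberately tiny sketch. It is not CatBoost's implementation: the "model" is only a running mean of the targets, and the names `in_sample_residuals`, `ordered_residuals`, and the `prior` argument are hypothetical. It shows residuals taken against a model that has already seen each instance's own target next to residuals taken against a model built from preceding instances only.

```python
def in_sample_residuals(targets):
    """Residuals against a mean fitted on ALL targets, including each instance's own."""
    mean = sum(targets) / len(targets)
    return [y - mean for y in targets]


def ordered_residuals(targets, prior=0.0):
    """Residuals against a mean fitted only on the instances that come earlier."""
    residuals = []
    running_sum = 0.0
    for i, y in enumerate(targets):
        # Prediction for instance i uses targets[0:i] only (or the prior if i == 0)
        prediction = running_sum / i if i > 0 else prior
        residuals.append(y - prediction)
        running_sum += y  # instance i becomes visible only to later instances
    return residuals


targets = [3.0, 1.0, 4.0, 1.0, 5.0]
print(in_sample_residuals(targets))
print(ordered_residuals(targets))
```

In this example the in-sample residuals are smaller in magnitude overall, because the fitted mean has already absorbed every target it is evaluated against. The ordered residuals are closer to what the model would produce on data it has not seen.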


## Mathematical Calculation of CatBoost’s Categorical Encoding

### Dataset

| Index | Color | Class (Y) |
| ----- | ----- | --------- |
| 1     | Red   | 0         |
| 2     | Blue  | 1         |
| 3     | Green | 0         |
| 4     | Blue  | 1         |

---

### Goal

Convert the categorical feature **Color** into numbers, but **without “peeking” at the current example’s target value** (to avoid overfitting).
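
To see what "peeking" looks like, here is a small sketch of a naive (greedy) target encoding applied to the same four rows. The helper name `greedy_target_mean` is hypothetical and is not part of CatBoost; it is the approach the ordered statistic is meant to avoid.

```python
from collections import defaultdict

rows = [("Red", 0), ("Blue", 1), ("Green", 0), ("Blue", 1)]


def greedy_target_mean(rows):
    """Encode each category as the mean target over ALL rows with that category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for color, target in rows:
        totals[color] += target
        counts[color] += 1
    # Every row's encoding includes that row's own target: this is the leak
    return [totals[color] / counts[color] for color, _ in rows]


encoded = greedy_target_mean(rows)
for (color, target), value in zip(rows, encoded):
    print(f"{color:5} target={target} encoded={value}")
```

On this dataset every encoded value equals the row's own label, so a model could read the target straight off the feature during training and would gain nothing from it on new data.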

---

### Step-by-step Calculation

For each example, the encoded value of **Color** is the **average target value** of all **previous** examples with the same category.

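If there is no previous example with that category, use a **prior value**, for example, the global average target. This walkthrough is a simplification: CatBoost also smooths the average toward the prior and orders the examples by a random permutation rather than by dataset position.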
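
Written as a formula, the rule reads as follows. The symbols $x_j$ (category of example $j$), $\hat{x}_i$ (encoded value of example $i$), and $p$ (the prior) are introduced here for illustration only:

$$
\hat{x}_i =
\begin{cases}
\dfrac{\sum_{j < i} \mathbb{1}[x_j = x_i]\, y_j}{\sum_{j < i} \mathbb{1}[x_j = x_i]} & \text{if some } j < i \text{ has } x_j = x_i, \\[2ex]
p & \text{otherwise.}
\end{cases}
$$

For comparison, the smoothed form adds the prior to both numerator and denominator with a weight $a > 0$, so no special case is needed for unseen categories:

$$
\hat{x}_i = \frac{\sum_{j < i} \mathbb{1}[x_j = x_i]\, y_j + a\,p}{\sum_{j < i} \mathbb{1}[x_j = x_i] + a}
$$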

---

### Calculate global average target

$$
\text{Global mean} = \frac{0 + 1 + 0 + 1}{4} = \frac{2}{4} = 0.5
$$

---

### Encoding for each example

| Index | Color | Target $y_i$ | Previous targets with same Color | Encoding (Ordered Target Statistic) |
| ----- | ----- | ------------ | -------------------------------- | ----------------------------------- |
| 1     | Red   | 0            | None                             | Prior = 0.5                         |
| 2     | Blue  | 1            | None                             | Prior = 0.5                         |
| 3     | Green | 0            | None                             | Prior = 0.5                         |
| 4     | Blue  | 1            | Index 2 (target = 1)             | $\frac{1}{1} = 1.0$                 |

---

### Summary Table

| Index | Color | Encoded Value |
| ----- | ----- | ------------- |
| 1     | Red   | 0.5           |
| 2     | Blue  | 0.5           |
| 3     | Green | 0.5           |
| 4     | Blue  | 1.0           |

---

### Explanation

* For **example 1**, “Red” has no previous occurrence, so we use the global average, 0.5.
* For **example 2**, “Blue” appears for the first time, so we again use 0.5.
* For **example 3**, “Green” appears for the first time, so we use 0.5.
* For **example 4**, “Blue” appeared before at example 2 with target 1, so the encoding is 1.0.


```python
# The four-row dataset from the table above
data = [
    {'Index': 1, 'Color': 'Red', 'Target': 0},
    {'Index': 2, 'Color': 'Blue', 'Target': 1},
    {'Index': 3, 'Color': 'Green', 'Target': 0},
    {'Index': 4, 'Color': 'Blue', 'Target': 1},
]
data
```

```
[{'Index': 1, 'Color': 'Red', 'Target': 0},
 {'Index': 2, 'Color': 'Blue', 'Target': 1},
 {'Index': 3, 'Color': 'Green', 'Target': 0},
 {'Index': 4, 'Color': 'Blue', 'Target': 1}]
```

```python
# Global mean of the target, used as the prior for categories not seen yet
global_mean = sum(row['Target'] for row in data) / len(data)

# Targets seen so far, keyed by category
category_history = {}

# Encoded value for each row, in dataset order
encoded_values = []

for row in data:
    cat = row['Color']
    history = category_history.setdefault(cat, [])
    # Mean of earlier targets for this category, or the prior if there are none
    encoded = sum(history) / len(history) if history else global_mean
    encoded_values.append(encoded)
    # Record the current target only after encoding, so a row never sees its own label
    history.append(row['Target'])
```


```python
# Print results
print("Index | Color  | Target | Encoded Value")
print("---------------------------------------")
for row, encoded in zip(data, encoded_values):
    print(f"{row['Index']:5} | {row['Color']:6} | {row['Target']:6} | {encoded:13.3f}")
```

```
Index | Color  | Target | Encoded Value
---------------------------------------
    1 | Red    |      0 |         0.500
    2 | Blue   |      1 |         0.500
    3 | Green  |      0 |         0.500
    4 | Blue   |      1 |         1.000
```
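
The calculation above falls back to the prior only when a category has no history. A variant closer in spirit to the smoothed statistic blends the prior into every estimate. The sketch below is an assumption-laden illustration, not CatBoost's code: the function name `ordered_target_statistic` and the parameters `prior_weight` and `order` are hypothetical, and the processing order is passed in explicitly instead of being drawn at random.

```python
from collections import defaultdict


def ordered_target_statistic(categories, targets, prior, prior_weight=1.0, order=None):
    """Smoothed ordered target statistic.

    Each row is encoded as
        (sum_prev + prior_weight * prior) / (count_prev + prior_weight),
    where sum_prev and count_prev cover only rows that appear earlier in `order`.
    """
    if order is None:
        order = range(len(categories))  # default: dataset order
    sums = defaultdict(float)
    counts = defaultdict(int)
    encoded = [None] * len(categories)
    for i in order:
        cat = categories[i]
        encoded[i] = (sums[cat] + prior_weight * prior) / (counts[cat] + prior_weight)
        # Update the running totals only after row i has been encoded
        sums[cat] += targets[i]
        counts[cat] += 1
    return encoded


colors = ["Red", "Blue", "Green", "Blue"]
labels = [0, 1, 0, 1]
prior = sum(labels) / len(labels)  # 0.5, the global mean of the four targets

print(ordered_target_statistic(colors, labels, prior))
# [0.5, 0.5, 0.5, 0.75]
print(ordered_target_statistic(colors, labels, prior, order=[3, 2, 1, 0]))
# [0.5, 0.75, 0.5, 0.5]
```

With smoothing, the second "Blue" row is encoded as (1 + 0.5) / (1 + 1) = 0.75 rather than 1.0, since a single earlier observation is not treated as conclusive. Reversing the processing order moves that value to the other "Blue" row, which shows how much the encoding depends on the chosen permutation.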
