## LightGBM Classifier: Simple Formal Explanation

**LightGBM** (Light Gradient Boosting Machine) is a popular machine learning algorithm used for classification and regression problems. It is known for being fast and accurate, especially with large datasets.

### Key Features

#### 1. Histogram-based Tree Growth

* Instead of checking every possible value of a feature, LightGBM first **groups the values into bins** (like dividing them into ranges).
* This process is similar to making a histogram in statistics.
* By using bins, LightGBM can find the best splits much faster, which saves time and memory.
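The saving can be illustrated with a small sketch. It is not LightGBM's implementation: it assumes equal-width bins (LightGBM chooses its bin boundaries differently), and the helper name `build_histogram` is made up for this example.

```python
def build_histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


values = list(range(1000))  # 1000 distinct feature values
counts = build_histogram(values, n_bins=10)

# Without binning, a threshold between every pair of neighbouring values
# has to be checked; with binning, only the boundaries between bins.
exact_candidates = len(set(values)) - 1
binned_candidates = len(counts) - 1
print("thresholds without bins:", exact_candidates)
print("thresholds with bins:   ", binned_candidates)
```

The split search then only needs per-bin statistics, so its cost depends on the number of bins rather than on the number of distinct values.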

#### 2. Leaf-wise Splitting

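* When building a decision tree, many algorithms grow the tree level by level (depth-wise), splitting every node at the current depth.
* LightGBM, however, **splits the leaf node that will reduce the prediction error the most** (called “leaf-wise” growth).
* For the same number of leaves, this usually reduces the training loss more than level-wise growth. The trade-off is that trees can become deep and unbalanced and may overfit small datasets, which is why `num_leaves` and `max_depth` are typically limited.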
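The selection rule can be sketched with a priority queue. The leaf names and gain values below are hypothetical numbers chosen for illustration; in a real learner the gain of each leaf would be computed from the data.

```python
import heapq

# Hypothetical gains: how much the loss would drop if a leaf were split.
# Splitting a leaf produces two children named by appending "L" and "R".
GAINS = {"root": 10.0, "rootL": 6.0, "rootR": 1.0, "rootLL": 0.5, "rootLR": 4.0}


def grow_leaf_wise(max_leaves):
    """Return the order in which leaves are split under leaf-wise growth."""
    heap = [(-GAINS["root"], "root")]  # max-heap via negated gains
    n_leaves = 1
    split_order = []
    while heap and n_leaves < max_leaves:
        neg_gain, leaf = heapq.heappop(heap)  # leaf with the largest gain
        if -neg_gain <= 0:
            break  # no remaining split improves the loss
        split_order.append(leaf)
        n_leaves += 1  # one leaf is replaced by two children
        for child in (leaf + "L", leaf + "R"):
            heapq.heappush(heap, (-GAINS.get(child, 0.0), child))
    return split_order


print(grow_leaf_wise(max_leaves=4))
```

With these numbers the third split goes to `rootLR` (gain 4.0) rather than `rootR` (gain 1.0), whereas level-wise growth would finish the second level first.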



## Formal Mathematical Example of LightGBM

### Dataset

Consider the following dataset:

| Index | Height (cm) | Weight (kg) | Class (Y) |
| ----- | ----------- | ----------- | --------- |
| 1     | 150         | 45          | 0 (Short) |
| 2     | 160         | 55          | 0 (Short) |
| 3     | 170         | 70          | 1 (Tall)  |
| 4     | 180         | 80          | 1 (Tall)  |

---

### Step 1: Histogram-based Binning

LightGBM applies histogram-based binning to discretize continuous features into a fixed number of bins (at most `max_bin`, which defaults to 255). For this example, assume two bins per feature:

* **Height:**

  * Bin 1: 150–165 (containing values 150, 160)
  * Bin 2: 166–180 (containing values 170, 180)

* **Weight:**

  * Bin 1: 45–62 (containing values 45, 55)
  * Bin 2: 63–80 (containing values 70, 80)

Thus, the binned dataset is:

| Index | Height Bin | Weight Bin | Class |
| ----- | ---------- | ---------- | ----- |
| 1     | 1          | 1          | 0     |
| 2     | 1          | 1          | 0     |
| 3     | 2          | 2          | 1     |
| 4     | 2          | 2          | 1     |

---

### Step 2: Leaf-wise Splitting Using Gini Impurity

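LightGBM evaluates each candidate split by how much it improves the training objective. Its actual criterion is a gain computed from the gradients and Hessians of the loss, not Gini impurity. This example uses **Gini impurity** as a simpler stand-in that selects the same split on this data. For a node, it is defined as: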

$$
Gini = 1 - \sum_{k=1}^K p_k^2
$$

where $p_k$ is the proportion of class $k$ instances in the node.

#### Before Split

* Total samples: 4
* Class distribution: 2 Short (class 0), 2 Tall (class 1)
* Class probabilities: $p_0 = \frac{2}{4} = 0.5$, $p_1 = 0.5$
* Gini impurity:

$$
Gini_{parent} = 1 - (0.5^2 + 0.5^2) = 1 - (0.25 + 0.25) = 0.5
$$

#### Splitting on Height Bin

* **Left child (Height Bin 1):** 2 samples, both class 0

  * $p_0 = 1, p_1 = 0$
  * $Gini_{left} = 1 - (1^2 + 0^2) = 0$

* **Right child (Height Bin 2):** 2 samples, both class 1

  * $p_0 = 0, p_1 = 1$
  * $Gini_{right} = 1 - (0^2 + 1^2) = 0$

* Weighted Gini after split:

$$
Gini_{split} = \frac{2}{4} \times 0 + \frac{2}{4} \times 0 = 0
$$

#### Gini Reduction

$$
\Delta Gini = Gini_{parent} - Gini_{split} = 0.5 - 0 = 0.5
$$

The reduction equals the parent's impurity, so the split is perfect: both child nodes are pure.
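The same arithmetic can be checked in code. The sketch below implements the Gini formula from this section directly; `gini` and `gini_reduction` are helper names introduced for this example, and the Weight Bin column is included because it partitions the samples identically.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_reduction(bins, labels, left_bin=1):
    """Impurity drop when samples in left_bin go left and all others go right."""
    left = [y for b, y in zip(bins, labels) if b == left_bin]
    right = [y for b, y in zip(bins, labels) if b != left_bin]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


labels = [0, 0, 1, 1]
height_bins = [1, 1, 2, 2]
weight_bins = [1, 1, 2, 2]

print("parent Gini:", gini(labels))
print("Height Bin reduction:", gini_reduction(height_bins, labels))
print("Weight Bin reduction:", gini_reduction(weight_bins, labels))
```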

---

### Step 3: Termination

Since both child nodes are pure (Gini impurity 0), no further splitting is required.

---

## Summary

| Feature    | Split Condition | Gini After Split | Gini Reduction |
| ---------- | --------------- | ---------------- | -------------- |
| Height Bin | Bin 1 vs Bin 2  | 0                | 0.5            |
| Weight Bin | Bin 1 vs Bin 2  | 0                | 0.5            |

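Weight Bin partitions the samples exactly as Height Bin does, so both features give the same perfect split. The learner chooses the split with the largest impurity reduction; here the two candidates are tied, so either one can be selected.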

---

### Key Points

* **Histogram-based binning** discretizes continuous features to reduce computational complexity.
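* **Leaf-wise splitting** prioritizes the leaf whose split yields the greatest improvement, which tends to reduce the loss faster than level-wise growth for the same number of leaves.
* **Gini impurity** measures node purity and is used in this example to illustrate split quality; LightGBM itself scores splits with a gradient-based gain.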
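For comparison with the Gini walkthrough, the sketch below scores the same split with a gradient-based gain of the kind used in gradient boosting. It makes several assumptions: binary log loss, an initial predicted probability of 0.5, no regularization (`lam=0`), and a gain of the form $G_L^2/H_L + G_R^2/H_R - G^2/H$. Constant factors and regularization terms differ between libraries, and the helper names are made up for this example.

```python
def grad_hess(y, p):
    """Gradient and Hessian of binary log loss with respect to the raw score."""
    return p - y, p * (1.0 - p)


def leaf_term(indices, grads, hesses, lam=0.0):
    """(sum of gradients)^2 / (sum of Hessians + lam) for one leaf."""
    g_sum = sum(grads[i] for i in indices)
    h_sum = sum(hesses[i] for i in indices)
    return g_sum * g_sum / (h_sum + lam)


def split_gain(left, right, grads, hesses, lam=0.0):
    """Gain of splitting a node into the sample index lists left and right."""
    parent = left + right
    return (
        leaf_term(left, grads, hesses, lam)
        + leaf_term(right, grads, hesses, lam)
        - leaf_term(parent, grads, hesses, lam)
    )


labels = [0, 0, 1, 1]
p0 = 0.5  # assumed initial prediction for every sample
grads, hesses = zip(*(grad_hess(y, p0) for y in labels))

# Split from the example: samples 1-2 (Short) vs samples 3-4 (Tall).
good = split_gain([0, 1], [2, 3], grads, hesses)
# A split that mixes the classes in both children.
bad = split_gain([0, 2], [1, 3], grads, hesses)
print("gain of the example split:", good)
print("gain of a mixed split:    ", bad)
```

Under these assumptions the split that separates the classes gets a strictly larger gain than the mixed one, which matches the ranking produced by Gini reduction on this dataset.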


```python
import lightgbm as lgb
import pandas as pd

# Prepare tiny dataset
data = pd.DataFrame({
    'Height': [150, 160, 170, 180],
    'Weight': [45, 55, 70, 80],
    'Class':  [0, 0, 1, 1]  # 0 = Short, 1 = Tall
})

X = data[['Height', 'Weight']]
y = data['Class']

# Create and train LightGBM model
model = lgb.LGBMClassifier(
    max_bin=2,         # Use 2 bins to mimic histogram binning in example
    num_leaves=3       # Small number of leaves for simplicity
)
model.fit(X, y)

# Predict on new samples
test_samples = [[155, 50], [175, 75]]
predictions = model.predict(test_samples)

for sample, pred in zip(test_samples, predictions):
    label = "Short" if pred == 0 else "Tall"
    print(f"Input: Height={sample[0]}, Weight={sample[1]} -> Predicted class: {label}")
```


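```
[LightGBM] [Info] Number of positive: 2, number of negative: 2
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 4, number of used features: 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Input: Height=155, Weight=50 -> Predicted class: Short
Input: Height=175, Weight=75 -> Predicted class: Short
```

Note that the second sample is predicted as Short, which does not match the worked example. The log shows why: `Total Bins 0` and `number of used features: 0` mean that LightGBM discarded both features and built no splits, so every input receives the initial score of 0 (probability 0.5), which `predict` reports as class 0. With only four rows, the defaults `min_child_samples=20` and `min_data_in_bin=3` cannot be satisfied, and the very small `max_bin=2` may contribute as well. These settings have to be relaxed before the model can learn the split derived above.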


