I want to see how tree ensembles behave on the wine dataset. I will try Random Forest and Extra Trees. Then I will move to boosting with Gradient Boosting and Histogram-based Gradient Boosting. I will finish with XGBoost and LightGBM. I will keep notes inline so the reasoning is easy to follow later.

The machine learning algorithms I have learned so far work best on structured data. Among them, ensemble learning, which is mostly built on decision trees, tends to give the best results on structured data.

Now let's dig into ensemble learning, the go-to family of algorithms for structured data in scikit-learn.

# Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Load data and split

I'll reuse the same wine set.

In [2]:
wine = pd.read_csv('https://bit.ly/wine_csv_data')

X = wine[['alcohol', 'sugar', 'pH']]
y = wine['class']

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=42
)

wine.head()

Unnamed: 0,alcohol,sugar,pH,class
0,9.4,1.9,3.51,0.0
1,9.8,2.6,3.2,0.0
2,9.8,2.3,3.26,0.0
3,9.8,1.9,3.16,0.0
4,9.4,1.9,3.51,0.0


# Random Forest

**Random forest** is one of the representative ensemble methods and is widely used thanks to its stable performance. As the name suggests, a random forest builds a forest of randomized decision trees and combines the predictions of the individual trees into a final prediction.

First, a random forest randomizes the data used to train each tree. It builds each tree's training data by randomly drawing samples from the training set we pass in, with replacement, so the same sample can be drawn more than once.

The resulting set is called a **bootstrap sample**, and it normally has the same size as the training set.
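
A quick sketch to convince myself how bootstrap sampling behaves. It only uses the standard library; the helper name `bootstrap_indices` and the sample count of 1000 are my own choices for illustration and have nothing to do with how scikit-learn implements this internally.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)
n_samples = 1000  # hypothetical training-set size
indices = bootstrap_indices(n_samples, rng)
counts = Counter(indices)

# The bootstrap sample is as large as the training set,
# but some rows appear several times and others not at all.
print("bootstrap size:", len(indices))
print("distinct rows drawn:", len(counts))
print("rows drawn more than once:", sum(1 for c in counts.values() if c > 1))
```

Each tree in the forest would get its own draw like this, which is one of the reasons the trees end up different from each other.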

In [3]:
# RandomForestClassifier builds 100 trees by default, so n_jobs=-1 puts every CPU core to work.
rf = RandomForestClassifier(n_jobs=-1, random_state=42)

# return_train_score=True also reports the training score,
# which makes over- and underfitting easier to spot.
scores = cross_validate(rf, train_X, train_y,
                        return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

# The train score is much higher than the validation score, so the model overfits the training set.

0.9973541965122431 0.8903229806766861


In [4]:
# A random forest is an ensemble of decision trees, so it exposes the main DecisionTreeClassifier parameters and also computes feature importances.

rf.fit(train_X, train_y)
print(rf.feature_importances_)

[0.23183515 0.50059756 0.26756729]


In [5]:
# RandomForestClassifier can also evaluate itself using OOB samples.
# OOB (out of bag): the samples left out of a tree's bootstrap sample, which act like a validation set for that tree.

rf = RandomForestClassifier(oob_score=True, n_jobs=-1, random_state=42)  # oob_score=True computes the OOB score
rf.fit(train_X, train_y)

print(rf.oob_score_)

0.8945545507023283
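
To get a feel for how many samples are actually available as OOB samples, here is a small standard-library sketch. With n rows, the chance that a given row is never drawn in n draws with replacement is (1 - 1/n)^n, which approaches 1/e. The function name `oob_fraction`, the row count, and the number of repetitions are assumptions I made for the experiment.

```python
import math
import random


def oob_fraction(n_samples, rng):
    """Fraction of rows that a single bootstrap sample leaves out."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples


rng = random.Random(0)
n_samples = 5000  # hypothetical training-set size
fractions = [oob_fraction(n_samples, rng) for _ in range(20)]
mean_fraction = sum(fractions) / len(fractions)

# Probability that one particular row is missed by all n draws.
expected = (1 - 1 / n_samples) ** n_samples

print("simulated OOB fraction:", round(mean_fraction, 3))
print("(1 - 1/n)^n:", round(expected, 3))
print("1/e:", round(math.exp(-1), 3))
```

So each tree leaves out roughly a third of the training rows, which is enough to score it without setting aside a separate validation set.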


# Extra Trees

Extra Trees uses more randomization at each split, which often gives similar accuracy with a different bias-variance tradeoff.

Extra Trees builds each tree like a DecisionTreeClassifier with the parameter **splitter = 'random'**: instead of searching for the best threshold, it draws split thresholds at random. Unlike a random forest, it does not use bootstrap samples by default, so each tree sees the whole training set.

Splitting at random would hurt the performance of a single decision tree. However, Extra Trees ensembles many such trees, which curbs overfitting and raises the validation score.
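
To see what a random split gives up compared with a best-split search, I sketched both on a single synthetic feature. This is only an illustration of the idea: the data, the helper names (`gini`, `split_impurity`), and the use of binary Gini impurity are my assumptions, and real Extra Trees do this per node across several candidate features.

```python
import numpy as np


def gini(labels):
    """Binary Gini impurity: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    return 2 * p * (1 - p)


def split_impurity(x, y, threshold):
    """Size-weighted impurity of the two children after splitting at threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / y.size


rng = np.random.default_rng(42)
x = rng.normal(size=200)                                    # one synthetic feature
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)   # noisy binary target

# Best-split search: try every midpoint between neighbouring values.
values = np.unique(x)
candidates = (values[:-1] + values[1:]) / 2
best_t = min(candidates, key=lambda t: split_impurity(x, y, t))

# Random split: a single threshold drawn uniformly within the feature's range.
random_t = rng.uniform(x.min(), x.max())

best_impurity = split_impurity(x, y, best_t)
random_impurity = split_impurity(x, y, random_t)
print("best split impurity:  ", best_impurity)
print("random split impurity:", random_impurity)
print("thresholds evaluated: ", candidates.size, "vs 1")
```

The random split can never beat the exhaustive one on impurity, but it only evaluates one threshold instead of 199, which is where the speed advantage comes from.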

In [6]:
et = ExtraTreesClassifier(n_jobs=-1, random_state=42)
scores = cross_validate(et, train_X, train_y,
                        return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9974503966084433 0.8887848893166506


The score is similar to the random forest's. That is not surprising, since this dataset is too small to bring out the difference between the two models. Extra Trees usually needs more trees than a random forest because of its extra randomness, but it has the advantage of faster computation because it splits nodes at random instead of searching for the best split.

In [7]:
et.fit(train_X, train_y)

print(et.feature_importances_)

[0.20183568 0.52242907 0.27573525]


# Gradient Boosting

Gradient boosting builds an ensemble of shallow decision trees, each one added to compensate for the errors of the trees before it. By default, scikit-learn's GradientBoostingClassifier uses 100 decision trees of depth 3.

Because it uses shallow trees, we can expect it to resist overfitting and to generalize well in most cases.
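
Here is a minimal from-scratch sketch of the "each tree corrects the previous errors" idea. To keep it short I made several simplifying assumptions: it is regression with squared loss (not the classification loss GradientBoostingClassifier uses), the weak learners are depth-1 stumps on a single synthetic feature, and the helper names `fit_stump` and `predict_stump` are mine.

```python
import numpy as np


def fit_stump(x, residual):
    """Fit a depth-1 tree: pick the threshold that best fits residual with two constants."""
    best = None
    values = np.unique(x)
    for t in (values[:-1] + values[1:]) / 2:
        left = x <= t
        left_mean, right_mean = residual[left].mean(), residual[~left].mean()
        sse = np.sum((residual - np.where(left, left_mean, right_mean)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)


rng = np.random.default_rng(42)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)   # synthetic regression target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())            # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    residual = y - prediction                     # negative gradient of squared loss
    stump = fit_stump(x, residual)                # new weak learner targets the current errors
    prediction += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - prediction) ** 2))

print("MSE before boosting:", mse_history[0])
print("MSE after 100 stumps:", mse_history[-1])
```

The learning rate shrinks each stump's contribution, so the model improves in many small steps. That is also why the number of trees and the learning rate have to be tuned together.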

In [8]:
gb = GradientBoostingClassifier(random_state = 42)
scores = cross_validate(gb, train_X, train_y,
                        return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

# The gap between train and validation scores is small: gradient boosting resists overfitting.

0.8881086892152563 0.8720430147331015


In [9]:
# Defaults are n_estimators=100 and learning_rate=0.1.
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.2, random_state=42)

scores = cross_validate(gb, train_X, train_y,
                        return_train_score=True, n_jobs=-1)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9464595437171814 0.8780082549788999


In [10]:
gb.fit(train_X, train_y)

print(gb.feature_importances_)

[0.15881044 0.67988912 0.16130044]


It still resists overfitting even though n_estimators is five times larger. Also, as the importances show, gradient boosting concentrates on a single feature (sugar) more than the random forest does.

# Histogram-based Gradient Boosting

Histogram-based gradient boosting is one of the most popular algorithms for structured data.

It first bins each input feature into 256 intervals, so it can find the best split at each node very quickly.
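
A rough sketch of the binning step, to see why it speeds up split finding. I am assuming quantile-based bin edges and 255 bins for observed values (scikit-learn's `max_bins` default is 255, with one extra bin reserved for missing values); the synthetic feature and variable names are mine, and this is not scikit-learn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.lognormal(size=10_000)       # synthetic continuous feature

n_bins = 255  # assumption: bins for observed values; missing values would get one more

# Interior bin edges at evenly spaced quantiles, so bins hold similar numbers of samples.
quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
edges = np.quantile(feature, quantiles)

# Replace every raw value with the index of its bin; the index fits in one byte.
binned = np.searchsorted(edges, feature, side="right").astype(np.uint8)

print("distinct raw values:", np.unique(feature).size)
print("distinct bins:", np.unique(binned).size)
```

After binning, a node only has to consider at most 254 split points per feature instead of one per distinct raw value, no matter how many rows there are.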

In [11]:
hgb = HistGradientBoostingClassifier(random_state = 42)
scores = cross_validate(hgb, train_X, train_y,
                        return_train_score=True)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

# Overfitting is kept in check and the validation score is slightly higher than gradient boosting's.

0.9321723946453317 0.8801241948619236


In [12]:
# HistGradientBoostingClassifier doesn't provide feature_importances_, so we use permutation_importance() to estimate feature importance instead.

hgb.fit(train_X, train_y)
result = permutation_importance(hgb, train_X, train_y,
                                n_repeats=10, random_state=42, n_jobs=-1)

print(result.importances_mean)

# result holds three arrays: importances, importances_mean, and importances_std

[0.08876275 0.23438522 0.08027708]


In [13]:
result = permutation_importance(hgb, test_X, test_y,
                                n_repeats=10, random_state=42, n_jobs=-1)

print(result.importances_mean)

[0.05969231 0.20238462 0.049     ]
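
To make sure I understand what permutation importance measures, I wrote a hand-rolled version: shuffle one column at a time and record how much the accuracy drops. Everything here is an assumption for illustration: the function name `permutation_importance_sketch`, the synthetic data, and the stand-in `model`, which is a plain function that only looks at column 1 rather than a fitted estimator.

```python
import numpy as np


def accuracy(model, X, y):
    return np.mean(model(X) == y)


def permutation_importance_sketch(model, X, y, n_repeats, rng):
    """Mean drop in accuracy when each column is shuffled on its own."""
    baseline = accuracy(model, X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between column j and y
            drops.append(baseline - accuracy(model, X_perm, y))
        importances[j] = np.mean(drops)
    return importances


rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = (X[:, 1] > 0).astype(int)   # the target depends on column 1 only


def model(X):
    """Hypothetical stand-in for a fitted classifier: uses column 1 and ignores the rest."""
    return (X[:, 1] > 0).astype(int)


importances = permutation_importance_sketch(model, X, y, n_repeats=10, rng=rng)
print(importances)
```

Shuffling a column the model ignores changes nothing, while shuffling the column it relies on costs a lot of accuracy. Because the method only needs predictions, it works for any model, including ones without feature_importances_.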


In [14]:
hgb.score(test_X, test_y)

0.8723076923076923

We get about 87% accuracy on the test set.

Of course, in real-world use the performance will probably be a bit lower than this.

Still, ensemble learning can achieve a higher score than a single decision tree.

# RandomForest vs ExtraTrees vs GradientBoosting vs HistGradientBoosting

## RandomForestClassifier
- Builds many decision trees using bootstrap samples and random feature subsets  
- Parallel, stable, low overfitting risk  
- Strong baseline, works well with noisy data

## ExtraTreesClassifier
- Similar to RandomForest, but split thresholds are drawn at random instead of searched for  
- Faster training, more variance reduction  
- Often slightly less accurate but more efficient

## GradientBoostingClassifier
- Sequentially builds trees that correct previous errors  
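- High accuracy potential, but trees are built one after another so training is slower, and it can overfit if the number of trees or the learning rate is pushed too high  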
- Requires tuning (learning_rate, depth)

## HistGradientBoostingClassifier
- Histogram-based version of Gradient Boosting  
- Much faster on large datasets  
- Handles missing values natively and scales better than standard GradientBoosting

# XGBoost

In [15]:
xgb = XGBClassifier(tree_method='hist', random_state=42)
xgb._estimator_type = "classifier"  # ensure compatibility with sklearn utilities

scores = cross_validate(
    xgb, train_X, train_y,
    return_train_score=True, n_jobs=-1
)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

0.9567059184812372 0.8783915747390243


# LightGBM

LightGBM is another fast gradient boosting implementation, developed by Microsoft. It keeps adopting newer techniques, which is making it increasingly popular.

In [16]:
lgb = LGBMClassifier(random_state=42)
scores = cross_validate(
    lgb, train_X, train_y,
    return_train_score=True, n_jobs=-1
)

print(np.mean(scores['train_score']), np.mean(scores['test_score']))

[LightGBM] [Info] Number of positive: 3151, number of negative: 1006
[LightGBM] [Info] Number of positive: 3151, number of negative: 1007
[LightGBM] [Info] Number of positive: 3151, number of negative: 1007
[LightGBM] [Info] Number of positive: 3152, number of negative: 1006
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000895 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 372
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000829 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000862 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000865 seconds.
You can set `force_col_wise=true` to 

## XGBoost
- Optimized gradient boosting with regularization, tree pruning, and efficient handling of sparse data  
- Very strong performance, widely used in competitions and structured data tasks

## LightGBM
- Gradient boosting with histogram-based splitting and leaf-wise tree growth for faster training and lower memory use  
- Scales extremely well on large datasets and often outperforms other boosting methods in speed

# What I learned

On this wine dataset the tree ensembles form a clear pattern. Random Forest and Extra Trees achieve high training accuracy and strong test accuracy without scaling. Gradient Boosting, Histogram-based Gradient Boosting, XGBoost, and LightGBM sit in a similar performance band while offering different speed and tuning behavior. Sugar is consistently the most informative feature, which aligns with the decision trees from earlier chapters. For a quick and reliable baseline I can start with Random Forest. When I need a bit more headroom or more control I can switch to a boosting model and tune the number of trees and the learning rate.