stacking

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# load dataset
numeric_df = pd.read_csv("numeric_dataset.csv")
# synthetic binary target: drawn at random, independent of the features
np.random.seed(42)
numeric_df["target"] = np.random.choice([0, 1], size=len(numeric_df))
# separate features and target
X = numeric_df.drop("target", axis=1)
y = numeric_df["target"]

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# stacking model
base_models = [("dt", DecisionTreeClassifier()), ("rf", RandomForestClassifier())]
stacking_clf = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(), passthrough=False)

# train
stacking_clf.fit(X_train, y_train)

# predict
y_pred = stacking_clf.predict(X_test)

print("Stacking Accuracy (Numeric):", accuracy_score(y_test, y_pred))


Stacking Accuracy (Numeric): 0.48


This notebook cell demonstrates stacking, an ensemble learning technique, on a numeric dataset. It imports pandas and numpy for data handling, and scikit-learn modules for model training, ensembling, and evaluation. The dataset is loaded from "numeric_dataset.csv" into a DataFrame called numeric_df. A synthetic binary "target" column is then generated with numpy.random.choice to simulate classification labels, and np.random.seed(42) fixes the seed so that the same labels are drawn on every run. Because these labels are random and unrelated to the features, there is no signal for any model to learn, so an accuracy near 0.5 (here 0.48) is the expected chance-level result rather than a measure of how well stacking works.
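A quick way to check a chance-level score is to compare it with a baseline that ignores the features entirely. The following is a minimal sketch, not part of the original notebook: the stand-in random features, the `_demo` variable names, and the choice of `DummyClassifier` are assumptions made for illustration, since "numeric_dataset.csv" is not available here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in features (assumption): 500 rows, 4 numeric columns.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))
# Labels drawn independently of the features, as in the notebook.
y_demo = rng.integers(0, 2, size=500)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(Xd_train, yd_train)
baseline_acc = accuracy_score(yd_test, baseline.predict(Xd_test))
print("Majority-class baseline accuracy:", baseline_acc)
```

If a trained model scores about the same as this baseline, it has most likely learned nothing from the features.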

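The features (X) and target (y) are separated, and the data is split into training and test sets in an 80:20 ratio with train_test_split (test_size=0.2, random_state=42). The StackingClassifier is defined with two base models: a DecisionTreeClassifier and a RandomForestClassifier. Both are tree-based, so they are less diverse than a mix of different model families would be, but they can still make different errors. Their outputs become the input features for a meta-model, here LogisticRegression, which learns how to weight and combine them. By default, scikit-learn trains the meta-model on out-of-fold predictions obtained by 5-fold cross-validation, which reduces the risk of the meta-model overfitting to base models that have memorized the training data. For classifiers that provide predict_proba, those outputs are class probabilities rather than hard labels. Setting passthrough=False means that only the base-model outputs, not the original features, are passed to the meta-model.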
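The effect of passthrough can be seen by inspecting the feature matrix that the meta-model receives, which StackingClassifier exposes through .transform(). This is a minimal sketch under stated assumptions: the data comes from make_classification rather than the notebook's CSV, and build_stack is a hypothetical helper introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): 300 rows, 6 numeric features, binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)


def build_stack(passthrough):
    """Hypothetical helper: same base models as the notebook, seeded for repeatability."""
    base = [
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
    return StackingClassifier(
        estimators=base,
        final_estimator=LogisticRegression(max_iter=1000),
        passthrough=passthrough,
    )


# transform() returns the matrix that is fed to the meta-model.
meta_only = build_stack(False).fit(X_demo, y_demo).transform(X_demo)
meta_plus_raw = build_stack(True).fit(X_demo, y_demo).transform(X_demo)

print("passthrough=False:", meta_only.shape)
print("passthrough=True: ", meta_plus_raw.shape)
```

For a binary problem each base model contributes a single column (the positive-class probability), so the first matrix has 2 columns. With passthrough=True the 6 original features are appended, giving 8 columns.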

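The stacking model is fitted on the training data with .fit(), predictions for the test set are obtained with .predict(), and accuracy is computed with accuracy_score. Stacking combines the strengths of several models, ideally heterogeneous ones, and can improve predictive performance over a single model or over homogeneous ensembles such as bagging or boosting. The gain is not guaranteed: it depends on the base models making different errors and on the data containing real signal, which the random target used here does not.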
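To judge whether stacking helps on a given problem, it is useful to score the base models and the stacked model side by side on data that actually contains signal. The sketch below assumes a make_classification dataset in place of the notebook's CSV, and all variable names are illustrative. It makes no claim about which model comes out ahead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with learnable structure (assumption, not the notebook's CSV).
X_sig, y_sig = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_sig, y_sig, test_size=0.2, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[
            ("dt", DecisionTreeClassifier(random_state=0)),
            ("rf", RandomForestClassifier(random_state=0)),
        ],
        final_estimator=LogisticRegression(),
    ),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(Xs_train, ys_train)
    scores[name] = accuracy_score(ys_test, model.predict(Xs_test))
    print(f"{name}: {scores[name]:.3f}")
```

A single train/test split gives a noisy comparison. Repeating it with cross-validation would give a more reliable picture.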