Bagging

In [None]:
import pandas as pd
import numpy as np

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# --- load data ---
df = pd.read_csv("categorical_dataset.csv")

# --- create a placeholder target: random 0/1 labels, unrelated to the features ---
np.random.seed(42)
df["target"] = np.random.choice([0, 1], size=len(df))

X = df.drop("target", axis=1)
y = df["target"]

# --- OneHotEncoder compatibility (sparse_output vs sparse) ---
try:
    encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
except TypeError:
    encoder = OneHotEncoder(handle_unknown="ignore", sparse=False)

# preprocessor using column NAMES (works because pipeline.fit receives a DataFrame)
preprocessor = ColumnTransformer(
    [("cat", encoder, X.columns.tolist())],
    remainder="passthrough"
)

# --- sklearn version compatibility for BaggingClassifier param name ---
ver = sklearn.__version__.split(".")
major = int(ver[0])
minor = int(ver[1]) if len(ver) > 1 else 0
param_name = "estimator" if (major > 1 or (major == 1 and minor >= 2)) else "base_estimator"

# Build the pipeline: PREPROCESSOR first, then BAGGING (with DecisionTree as the estimator)
bagging_kwargs = {param_name: DecisionTreeClassifier(), "n_estimators": 50, "random_state": 42}
pipeline = Pipeline([
    ("pre", preprocessor),
    ("bag", BaggingClassifier(**bagging_kwargs))
])

# train / test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print("sklearn version:", sklearn.__version__)
print("Using BaggingClassifier parameter:", param_name)
print("Bagging Accuracy (Categorical):", accuracy_score(y_test, y_pred))


Because the dataset contains only categorical columns, the preprocessing step applies OneHotEncoder to every column through a ColumnTransformer. The encoder turns each category into a binary indicator column, which gives the decision trees numeric input, and handle_unknown="ignore" encodes categories that were not seen during fitting as all zeros instead of raising an error. For compatibility, the code first tries the sparse_output argument (scikit-learn 1.2 and later) and falls back to the older sparse argument if that raises a TypeError.
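
To make the encoding step concrete, here is a minimal sketch on a made-up two-column table. The column names and values are hypothetical and are not taken from categorical_dataset.csv. It uses the encoder's default sparse output and converts it with .toarray(), which sidesteps the sparse_output / sparse naming difference.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, not taken from categorical_dataset.csv
train = pd.DataFrame({"color": ["red", "green", "red"], "size": ["S", "M", "L"]})
test = pd.DataFrame({"color": ["blue"], "size": ["M"]})  # "blue" was never seen in training

# Default output is a sparse matrix; .toarray() makes it dense for inspection
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

# Categories are sorted within each column: color -> [green, red], size -> [L, M, S]
print([list(c) for c in encoder.categories_])
print(train_encoded)  # 3 rows, 5 indicator columns
print(test_encoded)   # the unseen "blue" leaves both color indicators at 0
```

The unseen category produces a row with no active color indicator, so prediction on new data does not fail when a category is missing from the training split.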

Next, a BaggingClassifier is set up with a DecisionTreeClassifier as its base learner. Bagging (bootstrap aggregating) trains each of the 50 trees on a bootstrap sample of the training data, drawn with replacement, and then combines their predictions, which reduces variance and makes the model more stable than a single tree. The code also parses sklearn.__version__ to pick the right parameter name: estimator from scikit-learn 1.2 onward, base_estimator in earlier versions.
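
The idea behind bagging can be sketched by hand. The helpers below (fit_bagged_trees and predict_majority) are hypothetical names written for illustration, not part of scikit-learn, and the data is synthetic. The sketch uses a hard majority vote over binary labels for simplicity, whereas BaggingClassifier averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_bagged_trees(X, y, n_estimators=25, seed=0):
    """Fit one decision tree per bootstrap sample (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_estimators):
        # Bootstrap sample: n row indices drawn with replacement
        idx = rng.integers(0, n, size=n)
        tree = DecisionTreeClassifier(random_state=int(rng.integers(0, 2**31 - 1)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_majority(trees, X):
    """Combine the trees with a hard majority vote (binary 0/1 labels only)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # (n_estimators, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic binary features standing in for one-hot columns;
# the label simply copies the first feature, so it is learnable.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = X[:, 0]

trees = fit_bagged_trees(X, y)
accuracy = float((predict_majority(trees, X) == y).mean())
print("Training accuracy of the hand-rolled ensemble:", accuracy)
```

An odd number of trees avoids tied votes. In practice BaggingClassifier handles the sampling, seeding, and aggregation, so a loop like this is only useful for seeing what happens inside.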

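The model is wrapped in a Pipeline in which preprocessing runs first, followed by the bagging step, so the encoder is fitted on the training split only. The dataset is split 80/20 into training and test sets, the pipeline is fitted, and predictions are made on the test set. Finally, the test accuracy is printed. Because the target here is randomly generated and unrelated to the features, that accuracy should land near 0.5 (chance level); it confirms that the pipeline runs end to end rather than measuring how well bagging performs on categorical features.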
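
One way to check the chance-level expectation is to compare the bagged model against a trivial baseline. The sketch below is an assumption-laden stand-in: it generates hypothetical categorical columns and random labels instead of reading categorical_dataset.csv, and it relies on BaggingClassifier's default base learner (a decision tree) so that no version-specific parameter name is needed.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data: two categorical columns and random labels
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=n),
    "size": rng.choice(["S", "M", "L"], size=n),
})
y = rng.integers(0, 2, size=n)  # labels are independent of X

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagged decision trees (the default base learner) on one-hot features
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    BaggingClassifier(n_estimators=50, random_state=42),
)
model.fit(X_train, y_train)
bagging_acc = accuracy_score(y_test, model.predict(X_test))

# Baseline that always predicts the most frequent training label
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

print("Bagging accuracy: ", bagging_acc)
print("Baseline accuracy:", baseline_acc)
```

With labels that carry no signal, both numbers should sit close to 0.5. A real target column is needed before the accuracy says anything about bagging itself.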