Boosting

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# load dataset
categorical_df = pd.read_csv("categorical_dataset.csv")

# create synthetic target
np.random.seed(42)
categorical_df["target"] = np.random.choice([0, 1], size=len(categorical_df))

X = categorical_df.drop("target", axis=1)
y = categorical_df["target"]

# automatically detect categorical features
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()

# preprocessing
preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)]
)

# adaboost
ada = Pipeline([("pre", preprocessor),
                ("clf", AdaBoostClassifier(n_estimators=100, random_state=42))])
ada.fit(X, y)
y_pred = ada.predict(X)
print("AdaBoost Accuracy (Categorical):", accuracy_score(y, y_pred))

# gradient boosting
gb = Pipeline([("pre", preprocessor),
               ("clf", GradientBoostingClassifier(n_estimators=100, random_state=42))])
gb.fit(X, y)
y_pred = gb.predict(X)
print("GradientBoosting Accuracy (Categorical):", accuracy_score(y, y_pred))



This code applies two boosting techniques, AdaBoost and Gradient Boosting, to a categorical dataset. The dataset is read from categorical_dataset.csv, and since it lacks a target column, a synthetic binary target (0 or 1) is drawn at random with a fixed seed, so the target is unrelated to the features. Features (X) are then separated from the target (y).

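Because the features are categorical, they must be encoded before a classifier can use them. The script treats every column with the object dtype as categorical and passes those columns to OneHotEncoder through a ColumnTransformer, which converts each category into a binary indicator feature. The ColumnTransformer keeps its default remainder="drop", so any non-object column (for example, a numeric one) is discarded; the approach therefore assumes that all useful features are categorical. Setting handle_unknown="ignore" makes the encoder output all zeros for a category it did not see during fitting instead of raising an error.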
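To make the encoding step concrete, here is a minimal sketch on a made-up three-row table. The column names and values are hypothetical and are not taken from categorical_dataset.csv; sparse_threshold=0 is added only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "color": pd.Series(["red", "blue", "red"], dtype="object"),
    "size": pd.Series(["S", "M", "L"], dtype="object"),
    "count": [1, 2, 3],  # numeric, so it is not picked up as categorical
})

# Same detection rule as the script: object-dtype columns only.
cat_cols = toy.select_dtypes(include=["object"]).columns.tolist()

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)
encoded = pre.fit_transform(toy)

# 2 color categories + 3 size categories = 5 indicator columns; "count" is dropped.
print(cat_cols)
print(encoded.shape)
print(encoded)

# A category unseen during fit ("green") becomes all zeros for that feature.
new_row = pd.DataFrame({
    "color": pd.Series(["green"], dtype="object"),
    "size": pd.Series(["S"], dtype="object"),
    "count": [9],
})
unseen = pre.transform(new_row)
print(unseen)
```

Each training row has exactly two ones (one per categorical column), while the row with the unseen color has only one.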

Two boosting models are then built within Pipelines. The first is AdaBoostClassifier, which trains weak learners (by default decision stumps, i.e. depth-1 decision trees) sequentially. After each round, the misclassified samples receive larger weights, so the next learner concentrates on them; the final prediction is a weighted vote of all learners. The second model is GradientBoostingClassifier, which also builds trees sequentially, but each new tree is fit to the negative gradient of the loss with respect to the current predictions (the pseudo-residuals), and its output is added to the ensemble scaled by a learning rate. This amounts to gradient descent in function space and lets the model capture complex patterns.
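The two update rules can be sketched in a few lines. This is an illustration under simplifying assumptions, not scikit-learn's internal implementation: the first part shows one round of the classic discrete AdaBoost reweighting on hand-made labels, and the second part shows gradient boosting for squared-error regression (where the negative gradient is simply the residual) on synthetic data. All data and variable names here are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# --- One round of classic (discrete) AdaBoost reweighting ---
y_true = np.array([1, 1, -1, -1, 1])
y_hat = np.array([1, -1, -1, -1, 1])    # weak learner output with one mistake
w = np.full(len(y_true), 1 / len(y_true))  # start with uniform sample weights

miss = y_hat != y_true
err = w[miss].sum()                      # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)    # learner's vote weight
w = w * np.exp(np.where(miss, alpha, -alpha))  # up-weight mistakes, down-weight hits
w /= w.sum()                             # renormalize to a distribution
print("AdaBoost weights after one round:", w)

# --- Gradient boosting with squared error on synthetic data ---
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

learning_rate = 0.1
pred = np.full_like(y, y.mean())         # initial model: a constant
initial_mse = np.mean((y - pred) ** 2)

for _ in range(50):
    residuals = y - pred                 # negative gradient of 0.5 * (y - pred)**2
    stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residuals)
    pred += learning_rate * stump.predict(X)  # small step in function space

final_mse = np.mean((y - pred) ** 2)
print("MSE before and after boosting:", initial_mse, final_mse)
```

After the AdaBoost update, the misclassified sample carries half of the total weight, which is what forces the next learner to attend to it. In the gradient boosting loop, each stump removes part of the remaining residual, so the training error shrinks round by round.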

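Both models are trained (fit) on the full dataset, and predictions are made on that same training data, so the accuracies printed with accuracy_score measure only how well each model fits data it has already seen. Because the target here is random and unrelated to the features, any accuracy above roughly 0.5 reflects memorized noise, and accuracy on held-out data would be expected to stay near 0.5. The script therefore demonstrates the mechanics of applying boosting to categorical data with one-hot preprocessing; it does not demonstrate predictive performance. For a meaningful estimate, use a real target and evaluate on a held-out set, for example with the already imported train_test_split.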
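A held-out evaluation could look like the sketch below. Since categorical_dataset.csv is not available here, the example generates a hypothetical categorical table with a random target; the column names, category values, and the 70/30 split are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the CSV: three categorical columns, random target.
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "city": rng.choice(["A", "B", "C", "D"], size=n),
    "plan": rng.choice(["free", "pro"], size=n),
    "device": rng.choice(["web", "ios", "android"], size=n),
}).astype(object)
y = rng.integers(0, 2, size=n)  # random labels, unrelated to X

# Hold out 30% of the rows; the models never see them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

classifiers = [
    ("AdaBoost", AdaBoostClassifier(n_estimators=100, random_state=42)),
    ("GradientBoosting", GradientBoostingClassifier(n_estimators=100, random_state=42)),
]

results = {}
for name, clf in classifiers:
    # Build a fresh preprocessor per pipeline so nothing is shared between models.
    pre = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), list(X.columns))]
    )
    model = Pipeline([("pre", pre), ("clf", clf)])
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    results[name] = (train_acc, test_acc)
    print(f"{name}: train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```

With a random target, the test accuracy should hover around 0.5 regardless of the training accuracy, which is the gap the training-set evaluation hides.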