voting

In [2]:
# Voting Classifier on Categorical Dataset
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# 1. Load Categorical Dataset

categorical_df = pd.read_csv("categorical_dataset.csv")

# Generate synthetic target variable (binary classification).
# The labels are random and unrelated to the features, so accuracy near 0.5 is expected.
np.random.seed(42)
categorical_df["target"] = np.random.choice([0, 1], size=len(categorical_df))

X = categorical_df.drop("target", axis=1)
y = categorical_df["target"]

# 2. Preprocessing for Categorical Data

categorical_features = X.columns.tolist()
preprocessor = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)]
)

# 3. Define Models inside Pipeline

dt_clf = Pipeline(steps=[("pre", preprocessor),
                         ("clf", DecisionTreeClassifier(random_state=42))])

rf_clf = Pipeline(steps=[("pre", preprocessor),
                         ("clf", RandomForestClassifier(random_state=42))])

voting_clf = VotingClassifier(
    estimators=[("dt", dt_clf), ("rf", rf_clf)],
    voting="hard"
)


# 4. Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Train & Evaluate

voting_clf.fit(X_train, y_train)
y_pred = voting_clf.predict(X_test)

print("Voting Classifier Accuracy (Categorical Dataset):", accuracy_score(y_test, y_pred))


Voting Classifier Accuracy (Categorical Dataset): 0.50875


The code applies a Voting Classifier to a categorical dataset for binary classification. First, the dataset is read from categorical_dataset.csv. Since it has no target column, a synthetic binary target is created with NumPy’s random choice: each row receives a 0 or 1 drawn at random (seeded with 42 for reproducibility), independently of the row's features. The dataset is then divided into features (X) and target labels (y).
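
The target-creation step can be tried without the CSV file. The following is a minimal sketch, assuming a small hypothetical DataFrame; `toy_df` and its columns are invented for illustration, since the contents of categorical_dataset.csv are not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for categorical_dataset.csv (columns are invented).
toy_df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "green"],
    "size": ["S", "M", "L", "M", "S", "L"],
})

# Same approach as the notebook: one random 0/1 label per row.
np.random.seed(42)
toy_df["target"] = np.random.choice([0, 1], size=len(toy_df))

# Separate the features from the labels.
X_toy = toy_df.drop("target", axis=1)
y_toy = toy_df["target"]

print(X_toy.columns.tolist())  # feature columns only
print(y_toy.tolist())          # labels drawn independently of the features
```

Because the labels are generated without looking at the features, there is no pattern for a classifier to learn from them.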

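Because all features are categorical, they must be converted into numerical form. This is handled by a ColumnTransformer that applies a OneHotEncoder to every column, turning each distinct category into its own binary (0/1) column. Setting handle_unknown="ignore" means a category that appears only at prediction time is encoded as all zeros for that feature instead of raising an error. This preprocessing step lets the classifiers, which expect numeric input, work with categorical variables.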
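
As an illustration of the encoding, here is a small sketch that uses OneHotEncoder directly on made-up data (the `train` and `new` frames are hypothetical). It shows the binary columns that are produced and how an unseen category is handled.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with two categorical features.
train = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})
# A new row whose color ("green") was never seen during fitting.
new = pd.DataFrame({"color": ["green"], "size": ["M"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoded_train = encoder.fit_transform(train).toarray()
encoded_new = encoder.transform(new).toarray()

# Categories are learned per feature, in sorted order.
print([list(c) for c in encoder.categories_])
print(encoded_train)  # one column per (feature, category) pair
print(encoded_new)    # the color columns are all zero for the unseen "green"
```

In the notebook the same encoder sits inside a ColumnTransformer so that it is fitted only on the training rows passed to each pipeline.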

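Two models are defined inside pipelines: a DecisionTreeClassifier and a RandomForestClassifier. Each pipeline includes the preprocessing and classification steps. A VotingClassifier is then created to combine these two models. With voting="hard", each model casts one vote for a class label and the label with the most votes is returned. Because there are only two voters here, any disagreement is a 1-1 tie; scikit-learn breaks such ties by choosing the class that comes first in ascending sort order, which is class 0 in this dataset. An odd number of estimators would avoid ties in a binary problem.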
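
The hard-voting rule, including the tie case, can be sketched in a few lines of NumPy. This is an illustration of the rule under the assumptions stated in the comments, not scikit-learn's own code; `hard_vote` and the vote arrays are hypothetical.

```python
import numpy as np

def hard_vote(predictions):
    """Majority vote per sample; ties go to the smallest class label.

    `predictions` has shape (n_models, n_samples) and holds
    non-negative integer class labels.
    """
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count the votes for each class per sample; argmax returns the
    # first maximum, so a tie resolves to the lowest label.
    return np.array([
        np.bincount(sample_votes, minlength=n_classes).argmax()
        for sample_votes in predictions.T
    ])

# Hypothetical votes from two models on four samples.
dt_votes = np.array([0, 1, 1, 0])
rf_votes = np.array([0, 1, 0, 1])

combined = hard_vote([dt_votes, rf_votes])
print(combined)  # [0 1 0 0]
```

The last two samples are disagreements, and both resolve to class 0, so with two voters the ensemble predicts class 1 only when both models agree on it.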

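The dataset is split into training and test sets, with 80% for training and 20% for evaluation. The voting classifier is fitted on the training set and then predicts labels for the test set. Finally, accuracy_score reports the fraction of test labels predicted correctly. The result here, 0.50875, is essentially chance level for a binary target whose two classes are equally likely. That is expected, because the synthetic target was generated at random and has no relationship to the features. The run therefore demonstrates the mechanics of the pipeline rather than real predictive performance.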
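
To make the split and the metric concrete, here is a small sketch on made-up arrays (`X_demo`, `y_demo` and the label lists are hypothetical). It shows the 80/20 row counts and how accuracy is computed as the share of matching labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, alternating labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# test_size=0.2 holds out 2 of the 10 rows; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))  # 8 2

# Accuracy is correct predictions divided by total predictions.
true_labels = [0, 1, 1, 0]
predicted = [0, 1, 0, 0]
acc = accuracy_score(true_labels, predicted)
print(acc)  # 0.75 (3 of 4 correct)
```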