Random Forest

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

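# load dataset; it has no label column, so a synthetic binary target is
# generated at random (seeded for reproducibility), purely for demonstration
categorical_df = pd.read_csv("categorical_dataset.csv")
np.random.seed(42)
categorical_df["target"] = np.random.choice([0, 1], size=len(categorical_df))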
X = categorical_df.drop("target", axis=1)
y = categorical_df["target"]

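# preprocessing: one-hot encode every object-dtype column. Columns not listed
# here are dropped, because ColumnTransformer defaults to remainder="drop".
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()
preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)]
)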

# pipeline
rf = Pipeline([("pre", preprocessor), ("clf", RandomForestClassifier(n_estimators=100, random_state=42))])

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train
rf.fit(X_train, y_train)

# predict
y_pred = rf.predict(X_test)

# accuracy
print("Random Forest Accuracy (Categorical):", accuracy_score(y_test, y_pred))


This program uses a Random Forest Classifier to classify categorical data. The dataset is first loaded from categorical_dataset.csv. Since it lacks a target variable, a synthetic binary target (0 or 1) is generated randomly for demonstration purposes.

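The features (X) are separated from the target (y). As all features are categorical, they must be converted to numbers before the model can use them. A ColumnTransformer applies a OneHotEncoder to the categorical columns, turning each category into its own binary indicator column. Setting handle_unknown="ignore" means a category that appears only at prediction time is encoded as all zeros instead of raising an error.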

A Pipeline is then built, consisting of two steps:

1. Preprocessing step: converts categorical data into one-hot encoded vectors.

2. Random Forest Classifier: an ensemble method that builds multiple decision trees, each on a bootstrap sample of the training data, and aggregates their predictions. This usually improves accuracy and reduces overfitting compared to a single decision tree.
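
The aggregation in the second step can be checked directly. The sketch below uses synthetic numeric data from make_classification (an assumption made only so the example runs on its own, not the dataset above) and confirms that scikit-learn's forest reports the mean of its trees' class probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the categorical dataset used in the program.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Collect each fitted tree's class probabilities: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging over the tree axis should reproduce the forest's own probabilities.
mean_proba = tree_probas.mean(axis=0)
agrees = bool(np.allclose(mean_proba, forest.predict_proba(X_demo)))
print(len(forest.estimators_), agrees)
```

The final class prediction is then the class with the highest averaged probability.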

The dataset is divided into training and test sets using an 80/20 split. The Random Forest model is trained on the training data with fit(). Predictions are generated on the test set using predict().
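
As a quick illustration of the 80/20 split, the sketch below runs train_test_split on a small made-up array of 50 rows (hypothetical data, chosen only so the sizes are easy to verify). The stratify argument shown at the end is an optional extra that the program above does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 samples, 2 features, perfectly balanced binary labels.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # 40 training rows, 10 test rows

# Optional: stratify=y keeps the class proportions the same in both parts.
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(int(y_te_strat.sum()), len(y_te_strat))
```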

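Finally, model performance is evaluated with accuracy_score, the fraction of test samples whose predicted label matches the true label. Because the target here was generated at random and is unrelated to the features, the accuracy should land near 0.5 (chance level); the number confirms that the pipeline runs end to end, not that the model has learned anything. On real labels, Random Forests typically generalize better than a single decision tree because averaging many decorrelated trees reduces variance.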
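
To show what accuracy_score computes and what to expect when labels are random, here is a small sketch. The arrays are generated on the spot and the sample size of 10,000 is an arbitrary choice for illustration; none of it comes from the dataset above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Random binary labels and independent random guesses of the same length.
y_true = rng.choice([0, 1], size=10_000)
y_guess = rng.choice([0, 1], size=10_000)

# accuracy_score is the fraction of positions where the two arrays agree.
acc = accuracy_score(y_true, y_guess)
manual = float(np.mean(y_true == y_guess))
print(acc, manual)  # both values are identical and close to 0.5
```

Any classifier scored against labels that carry no relationship to the features should land in the same neighbourhood, which is a useful sanity baseline.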