Decision Tree

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

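# load dataset
categorical_df = pd.read_csv("categorical_dataset.csv")

# the CSV has no label column, so attach a random binary target (demonstration only)
np.random.seed(42)
categorical_df["target"] = np.random.choice([0, 1], size=len(categorical_df))

# separate features (X) from the target (y)
X = categorical_df.drop("target", axis=1)
y = categorical_df["target"]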

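# preprocessing: one-hot encode every object-dtype column
# (columns not listed are dropped, since ColumnTransformer defaults to remainder="drop")
categorical_features = X.select_dtypes(include=["object"]).columns.tolist()
preprocessor = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)])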

# pipeline
dt = Pipeline([("pre", preprocessor), ("clf", DecisionTreeClassifier(random_state=42))])

# split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train
dt.fit(X_train, y_train)

# predict
y_pred = dt.predict(X_test)

# accuracy
print("Decision Tree Accuracy (Categorical):", accuracy_score(y_test, y_pred))


This program applies a Decision Tree Classifier to a dataset with categorical features. First, the dataset is loaded from categorical_dataset.csv. Since the dataset does not originally contain a target variable, a synthetic binary target (0 or 1) is generated randomly for demonstration purposes.

The independent variables (X) are separated from the target (y). Because the features are categorical, they must be encoded before training. A ColumnTransformer applies OneHotEncoder to every object-dtype column, turning each category into its own 0/1 indicator column. This step is needed because scikit-learn's DecisionTreeClassifier accepts only numerical input. Setting handle_unknown="ignore" means a category that appears only in the test set is encoded as all zeros instead of raising an error.

Next, a Pipeline is defined. It applies the preprocessing step and then fits a DecisionTreeClassifier with a fixed random_state for reproducibility. Wrapping both steps in one pipeline means the encoder is fitted only on the data passed to .fit(), so no information from the test set leaks into preprocessing.
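One place where the pipeline pays off is cross-validation, since the encoder is refitted on the training portion of each fold. The sketch below assumes a small hand-built dataset (toy_X, toy_y) whose target depends on one feature; it is an illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 60 rows, two categorical columns.
toy_X = pd.DataFrame({
    "color": ["red", "blue", "green"] * 20,
    "shape": ["circle", "square"] * 30,
})
# Illustrative target: 1 when the color is red, else 0.
toy_y = (toy_X["color"] == "red").astype(int)

pipe = Pipeline([
    ("pre", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "shape"]),
    ])),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

# cross_val_score clones the pipeline and fits it on each training fold,
# so the encoder never sees the rows held out for scoring.
scores = cross_val_score(pipe, toy_X, toy_y, cv=5)
print(scores)
```

Encoding the full dataset first and cross-validating only the classifier would fit the encoder on the held-out rows as well, which is the leakage the pipeline avoids.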

The dataset is then split into training and testing subsets using an 80/20 split. The model is trained on the training set with .fit() and predictions are generated on the test set using .predict().
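The effect of test_size=0.2 can be checked on a small array. The sketch below uses made-up data (X_demo, y_demo) and also passes stratify, which the original code does not use; it is shown as an optional way to keep class proportions equal in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows, 2 features, perfectly balanced labels.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0, 1] * 25)

# stratify is an addition for illustration; the original split omits it.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # 40 training rows, 10 test rows
print(np.bincount(y_te))       # class counts in the test set
```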

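Finally, accuracy_score compares the predictions with the true test labels and reports the fraction classified correctly. Because the target here is random and unrelated to the features, the score should land near 0.5; on a dataset with a real target, the same code would measure how well the tree generalizes to unseen data. This serves as a baseline model before exploring more advanced ensemble methods.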
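An accuracy figure is easier to interpret next to a trivial reference model. The sketch below compares a decision tree with scikit-learn's DummyClassifier on randomly generated data (X_demo and y_demo are assumptions standing in for the encoded dataset); with a random target, neither model has real signal to learn.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: integer-coded features and a random binary target.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 3, size=(500, 4))
y_demo = rng.integers(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Reference model: always predicts the most frequent class seen in training.
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

dummy_acc = accuracy_score(y_te, dummy.predict(X_te))
tree_acc = accuracy_score(y_te, tree.predict(X_te))
print("Dummy accuracy:", dummy_acc)
print("Tree accuracy:", tree_acc)
```

If the tree does not clearly beat the dummy model, the features carry little information about the target, which is the expected outcome whenever the target is generated at random.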