In [1]:
import pixiedust

Pixiedust database opened successfully


In [2]:
import pandas as pd
import numpy as np
import plotly_express as px

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# from bayesian_decision_tree.classification import BinaryClassificationNode
from bayesian_decision_tree.classification import MultiClassificationNode
# from bayesian_decision_tree.regression import RegressionNode

np.random.seed(42)

# Experiment 2 - Benchmark with bigger data set

Let's compare the performance of Bayesian decision trees against standard decision trees and Random Forests. In [the paper](https://arxiv.org/abs/1901.03214), the authors claim performance similar to that of Random Forests.

---

## Data set

I will use the [clean Craigslist cars data set](https://github.com/beeva-labs/tabular-dataset):

In [3]:
df = pd.read_csv("data/clean_craiglist_4.csv")

In [4]:
df.columns

Index(['price', 'year', 'condition', 'cylinders', 'fuel', 'odometer',
       'transmission', 'size', 'type', 'paint_color', 'manufacturer_cluster'],
      dtype='object')

In [5]:
X = df.drop(columns="condition")
y = df[['condition']]

In [6]:
X.shape, y.shape

((103818, 10), (103818, 1))

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

---

## Bayesian Decision Tree model

### Model settings
The implementation lets us choose the prior probability of each class and the prior probability of splitting a node into sub-nodes.
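
To make the role of the class prior concrete, here is a minimal sketch of how a prior combines with the class counts observed in a leaf. It assumes the `prior` argument acts as Dirichlet pseudo-counts (one per class), as in the standard Dirichlet-multinomial model; the helper `posterior_class_probs` is hypothetical and is not part of the `bayesian_decision_tree` package.

```python
import numpy as np

def posterior_class_probs(class_counts, prior):
    """Posterior mean class probabilities under a Dirichlet prior.

    Hypothetical helper for illustration; assumes `prior` holds
    Dirichlet pseudo-counts, one per class.
    """
    counts = np.asarray(class_counts, dtype=float)
    alpha = np.asarray(prior, dtype=float)
    posterior = counts + alpha          # conjugate update: add counts to pseudo-counts
    return posterior / posterior.sum()  # posterior mean of each class probability

# A leaf with 8 samples of class 0 and 2 of class 1, out of 6 classes
probs = posterior_class_probs([8, 2, 0, 0, 0, 0], prior=[1, 1, 1, 1, 1, 1])
print(np.round(probs, 3))
```

With a uniform prior of ones, classes that never appear in the leaf still receive a small non-zero probability, and the prior's influence shrinks as the leaf collects more samples.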

In [12]:
y_train.condition.unique()

array([3, 4, 5, 6, 2, 1])

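We need to convert the target variable classes to the range 0 .. N-1. The same shift must be applied to the test labels, otherwise the accuracies below compare predictions in 0 .. 5 against labels in 1 .. 6:

In [13]:
y_train = y_train - 1
y_test = y_test - 1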
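
Subtracting 1 works here because the labels happen to be the consecutive integers 1 to 6. As a hedged alternative for labels that are not consecutive, the sketch below uses `np.unique` with `return_inverse=True` to map arbitrary labels to 0 .. N-1 and reuses the mapping learned on the training labels for the test labels. The helper names and the toy arrays are hypothetical.

```python
import numpy as np

def fit_label_mapping(labels):
    """Return the sorted unique labels and the 0..N-1 code of each input label."""
    classes, codes = np.unique(labels, return_inverse=True)
    return classes, codes

def apply_label_mapping(classes, labels):
    """Encode labels with a previously learned mapping; unseen labels raise."""
    labels = np.asarray(labels)
    # searchsorted returns an insertion index, so clip it and then
    # verify that every label really exists in `classes`
    codes = np.clip(np.searchsorted(classes, labels), 0, len(classes) - 1)
    if not np.array_equal(classes[codes], labels):
        raise ValueError("labels contain classes not seen during fitting")
    return codes

train_labels = np.array([3, 4, 5, 6, 2, 1, 3])  # toy stand-in for y_train
test_labels = np.array([1, 6, 2])               # toy stand-in for y_test

classes, train_codes = fit_label_mapping(train_labels)
test_codes = apply_label_mapping(classes, test_labels)
print(classes, train_codes, test_codes)
```

Keeping `classes` around also makes it possible to translate predictions back to the original labels with `classes[predicted_codes]`.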

In [14]:
bdt_model = MultiClassificationNode(partition_prior=0.8, prior=[1,1,1,1,1,1])

### Training the model

In [None]:
%%pixie_debugger
bdt_model.fit(X_train, y_train)

### Exploring the model

 * Is the obtained model interpretable?  
 * What is its performance?

In [None]:
print(bdt_model)

The model is as interpretable as a standard decision tree.
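
For a side-by-side check of that claim, a standard scikit-learn tree can be printed as text rules with `sklearn.tree.export_text`. The sketch below uses a small synthetic data set and hypothetical feature names rather than the Craigslist data, and caps the depth so that the output stays readable.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: 200 rows, 4 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# A shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
demo_tree.fit(X_demo, y_demo)

# Hypothetical feature names, for illustration only
rules = export_text(demo_tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```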

In [None]:
bdt_predictions = bdt_model.predict(X_test)
# Flatten both sides so the comparison is element-wise on 1-D arrays
bdt_acc = np.mean(np.asarray(bdt_predictions).ravel() == y_test.values.ravel())
print("Bayesian Decision Tree Accuracy = {}".format(np.round(bdt_acc, decimals=3)))
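
Because `y_test` is a one-column DataFrame while predictions typically come back as a flat array, the element-wise comparison is sensitive to shapes. A minimal sketch of a shape-safe accuracy helper follows; `accuracy` is a hypothetical helper and the toy inputs are made up for illustration.

```python
import numpy as np
import pandas as pd

def accuracy(y_true, y_pred):
    """Fraction of matching labels after flattening both inputs to 1-D."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same number of labels")
    return float(np.mean(y_true == y_pred))

# Toy example: labels as a one-column DataFrame, predictions as a flat array
y_true_demo = pd.DataFrame({"condition": [0, 1, 2, 2, 5]})
y_pred_demo = np.array([0, 1, 1, 2, 5])
demo_acc = accuracy(y_true_demo, y_pred_demo)
print("Accuracy = {}".format(np.round(demo_acc, decimals=3)))
```

Using the same helper for every model keeps the three accuracies directly comparable.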

## Decision Tree Model

In [None]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train, y_train.values.ravel())  # 1-D labels, as scikit-learn expects
dt_acc = np.round(dt_model.score(X_test, y_test), decimals=3)
print("Decision Tree Accuracy = {}".format(dt_acc))

## Random Forest Model

In [None]:
rf_model = RandomForestClassifier(n_estimators=10)
rf_model.fit(X_train, y_train.values.ravel())  # 1-D labels, as scikit-learn expects
rf_acc = np.round(rf_model.score(X_test, y_test), decimals=3)
print("Random Forest Accuracy = {}".format(rf_acc))
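
Since all three accuracies come from a single train/test split, a more robust comparison might repeat the evaluation over several folds. The sketch below does this for the two scikit-learn models on synthetic data using `cross_val_score`. Whether `MultiClassificationNode` can be passed to `cross_val_score` in the same way is an assumption that has not been verified here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and labels
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=10, random_state=42),
}

# 5-fold cross-validated accuracy: mean and spread per model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print("{} Accuracy = {:.3f} +/- {:.3f}".format(name, scores.mean(), scores.std()))
```

Reporting the spread alongside the mean helps judge whether a difference between models is larger than the fold-to-fold noise.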

---

# Conclusions

The Bayesian decision tree appears to perform better than the alternatives. It is also fully interpretable, so these results are promising.

## Next steps
Experiments with larger and more complex data sets should be done.