# Bayesian DT vs TrainWeights

As we saw in [previous experiments](https://github.com/beeva-labs/profweight-experiments), the performance of a Decision Tree using TrainWeights techniques reached:

* An accuracy of 0.924 when predicting the Breast Cancer Wisconsin data set.
* An MAE of 42.27 when predicting the diabetes data set (scikit-learn's `load_diabetes`).

The goal of this notebook is to train Bayesian Decision Trees on both data sets and compare their performance with that of TrainWeights.

## Classification: Breast Cancer Wisconsin data set

### Prepare the data set and libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from bayesian_decision_tree.classification import BinaryClassificationNode

np.random.seed(11) # fix NumPy's global seed for reproducibility (the split itself is controlled by random_state below)

In [2]:
X, y = load_breast_cancer(return_X_y=True)
X.shape, y.shape

((569, 30), (569,))

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Train the Bayesian Decision Tree model

#### Model settings
The implementation lets us choose the prior over the class probabilities (`prior`) and the prior probability of splitting a node into sub-nodes (`partition_prior`).

In [4]:
prior = np.array([1,1])
delta = 0
bdt_model = BinaryClassificationNode(partition_prior=0.8, prior=prior)
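
To make the role of `prior` in the cell above more concrete, here is a minimal sketch. It assumes the two entries act as pseudo-counts of a Beta prior over the class probability, which this notebook does not verify against the library. Under that assumption `[1, 1]` is a uniform prior, and the posterior is obtained by adding the observed class counts. The helper `posterior_class_probs` is hypothetical and is not part of `bayesian_decision_tree`.

```python
import numpy as np

def posterior_class_probs(prior_counts, labels):
    """Posterior mean class probabilities for pseudo-count priors (illustrative helper)."""
    prior_counts = np.asarray(prior_counts, dtype=float)
    # Count how many training labels fall into each class.
    observed = np.bincount(labels, minlength=len(prior_counts))
    # Conjugate update: posterior pseudo-counts = prior pseudo-counts + observed counts.
    posterior = prior_counts + observed
    return posterior / posterior.sum()

toy_labels = np.array([0, 1, 1, 1])  # toy labels, not taken from the data set
toy_probs = posterior_class_probs([1, 1], toy_labels)
print(toy_probs)
```

With one sample of class 0 and three of class 1, the posterior pseudo-counts are `[2, 4]`, so the posterior means are 1/3 and 2/3. A larger prior such as `[10, 10]` would pull these values closer to 0.5.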

In [5]:
bdt_model.fit(X_train, y_train, delta)

In [6]:
bdt_predictions = bdt_model.predict(X_test)
bdt_acc = np.sum(bdt_predictions == y_test) / X_test.shape[0]
print("Bayesian Decision Tree Accuracy = {}".format(np.round(bdt_acc, decimals=3)))

Bayesian Decision Tree Accuracy = 0.965


The Bayesian Decision Tree improves on TrainWeights by about four percentage points of accuracy (0.965 vs. 0.924).
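
To get a feeling for how large this gap is, we can translate both accuracies into counts of correctly classified test samples and attach a rough binomial standard error. The sketch assumes that both models were evaluated on a test set of 171 samples (30% of 569, rounded up as `train_test_split` does). That holds for this notebook but is only an assumption for the TrainWeights experiment. The helper `accuracy_std_error` is hypothetical.

```python
import math

def accuracy_std_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated on n_samples test points."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

n_test = 171  # assumed test-set size for both experiments
for name, acc in [("Bayesian Decision Tree", 0.965), ("TrainWeights", 0.924)]:
    se = accuracy_std_error(acc, n_test)
    n_correct = round(acc * n_test)
    print("{}: {:.3f} +/- {:.3f} ({} of {} correct)".format(name, acc, se, n_correct, n_test))
```

Under these assumptions the difference amounts to 165 versus 158 correct predictions, with standard errors of roughly 0.014 and 0.020, so a single split gives only limited evidence.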

## Regression: Diabetes data set

### Prepare the data set and libraries

In [7]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

from bayesian_decision_tree.regression import RegressionNode

np.random.seed(42) # fix NumPy's global seed for reproducibility (the split itself is controlled by random_state below)

In [8]:
X, y = load_diabetes(return_X_y=True)
X.shape, y.shape

((442, 10), (442,))

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Train the Bayesian Decision Tree model

#### Model settings
In regression, the prior is a tuple of four hyperparameters, `(mu, kappa, alpha, beta)`, which we derive from the training targets:

In [10]:
mu = y_train.mean()
sd_prior = y_train.std() / 10
prior_obs = 1
kappa = prior_obs
alpha = prior_obs/2
var_prior = sd_prior**2
tau_prior = 1/var_prior
beta = alpha/tau_prior

prior = (mu, kappa, alpha, beta)

# Bayesian decision tree parameters
partition_prior = 0.9
delta = 0

regression_model = RegressionNode(partition_prior, prior)

# train
regression_model.fit(X_train, y_train, delta)
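
The hyperparameters in the cell above are easier to read if we assume that `(mu, kappa, alpha, beta)` parameterise a Normal-gamma prior, with `beta` as the rate of the Gamma distribution over the precision. That is an assumption about the library, not something checked here. Under it, the expected precision is `alpha / beta`, and the construction above makes it equal to `1 / sd_prior**2`. The sketch below wraps the same arithmetic in a hypothetical helper, `normal_gamma_prior`, and checks that identity on toy targets.

```python
import numpy as np

def normal_gamma_prior(y, prior_obs=1.0, sd_fraction=0.1):
    """Rebuild (mu, kappa, alpha, beta) the same way as the notebook cell (illustrative helper)."""
    mu = float(np.mean(y))                       # prior mean: mean of the training targets
    sd_prior = float(np.std(y)) * sd_fraction    # prior scale: a fraction of the target std
    kappa = prior_obs                            # weight of the prior mean in pseudo-observations
    alpha = prior_obs / 2
    beta = alpha * sd_prior ** 2                 # same as alpha / tau_prior with tau_prior = 1 / sd_prior**2
    return mu, kappa, alpha, beta

toy_targets = np.array([100.0, 150.0, 200.0, 250.0])  # toy values, not the real data
mu, kappa, alpha, beta = normal_gamma_prior(toy_targets)

expected_precision = alpha / beta
implied_sd = 1.0 / np.sqrt(expected_precision)
print("mu = {}, implied prior sd = {:.3f}".format(mu, implied_sd))
```

The implied standard deviation is a tenth of the standard deviation of the targets, so a smaller `sd_fraction` encodes a stronger belief that values inside a leaf are tightly concentrated.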

In [11]:
y_pred = regression_model.predict(X_test)
print("Test MAE = {}".format(np.round(mean_absolute_error(y_test, y_pred), decimals=2)))

Test MAE = 43.69


In this case, the result is slightly worse than that of TrainWeights (an MAE of 43.69 vs. 42.27).
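
One way to put both MAE values in context is a trivial baseline that always predicts the mean of the training targets. The sketch below uses scikit-learn's `DummyRegressor` on the same split as above. It was not part of the original experiments, so no result is quoted here.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Same data and split as in the regression cells of this notebook.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Baseline that ignores the features and predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
print("Mean-prediction baseline MAE = {}".format(round(baseline_mae, 2)))
```

If the gap between the two models is small compared with their distance from this baseline, the difference between them is less likely to matter in practice.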

# Conclusions

* More experiments would be needed to fairly compare both methods.
* The best option may be to use both methods, compare the resulting trees and their performance, and keep the one we feel more comfortable with.
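
As a starting point for such experiments, the sketch below repeats the train/test split with several seeds and reports the mean and spread of the accuracy instead of a single number. The helper `accuracy_over_splits` is hypothetical, and scikit-learn's `DecisionTreeClassifier` is used only as a stand-in model so that the example runs without extra dependencies.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def accuracy_over_splits(make_model, X, y, seeds, test_size=0.3):
    """Fit a fresh model on each random split and collect the test accuracies."""
    scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = make_model()
        model.fit(X_tr, y_tr)
        scores.append(np.mean(model.predict(X_te) == y_te))
    return np.array(scores)

X, y = load_breast_cancer(return_X_y=True)
scores = accuracy_over_splits(
    lambda: DecisionTreeClassifier(random_state=0),  # stand-in model
    X, y, seeds=range(10),
)
print("accuracy = {:.3f} +/- {:.3f}".format(scores.mean(), scores.std()))
```

The Bayesian tree in this notebook is fitted with an extra `delta` argument, so using it here would require a small wrapper whose `fit(X, y)` forwards to `fit(X, y, delta)`.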