<img src="data/images/div/lecture-notebook-header.png" />

# Classification & Regression II: Decision Trees

Decision Trees are a fundamental model used for classification and regression. While they typically do not yield state-of-the-art performance, their inner workings lay the foundation for more sophisticated models based on tree ensembles.

The construction of a decision tree involves recursively partitioning the data based on the values of the input features. The goal is to create homogeneous subsets of data at each internal node, where the instances within each subset share similar characteristics. The process continues until a stopping criterion is met, such as reaching a maximum depth or a minimum number of instances in a leaf node. In the case of classification, decision trees make predictions by assigning the majority class of the instances in a leaf node to new, unseen instances that follow the same path down the tree. In regression, the predicted value is typically the mean or median value of the instances in the leaf node.
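To make the idea of "homogeneous subsets" concrete, the following is a minimal sketch of how a single split could be chosen using the Gini impurity. The helper names `gini()` and `best_split()` as well as the tiny dataset are illustrative assumptions; this is not how `scikit-learn` implements the search internally.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search for the (feature, threshold) with the lowest weighted Gini impurity."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct feature values
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

# Hypothetical toy data: the second feature separates the two classes perfectly
X_demo = np.array([[0.1, 0.2], [0.9, 0.3], [0.2, 0.8], [0.8, 0.9]])
y_demo = np.array([0, 0, 1, 1])

feature, threshold, impurity = best_split(X_demo, y_demo)
print('Best split: feature {}, threshold {:.2f}, weighted impurity {:.2f}'.format(feature, threshold, impurity))
```

Building a full tree would repeat this search recursively on the left and right subsets until a stopping criterion is met.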

One of the key advantages of decision trees is their interpretability. The resulting tree structure can be visualized, allowing users to understand the decision-making process and gain insights into the important features. Decision trees can also handle a mix of continuous and categorical features without requiring extensive data preprocessing.

However, decision trees can be prone to overfitting, especially when they grow deep and complex. Overfitting occurs when the tree becomes too specific to the training data and fails to generalize well to unseen data. This issue can be addressed through techniques like pruning, which involves removing or merging nodes to simplify the tree and reduce overfitting. To improve the performance of decision trees, ensemble methods like random forests and gradient boosting are often used. These methods combine multiple decision trees to make more accurate predictions and enhance generalization. We will cover those more advanced tree ensemble techniques in later notebooks.

As mentioned above, Decision Trees can handle both numerical and categorical features. However, `scikit-learn`'s implementation "does not support categorical variables for now" (see [documentation](https://scikit-learn.org/stable/modules/tree.html)). You need to be aware of such details when applying off-the-shelf implementations of classification or regression algorithms to your own data. For example, a categorical feature that "looks" like a number, such as `postal_code`, will be treated as a numerical feature when using the `DecisionTreeClassifier` or the `DecisionTreeRegressor` provided by `scikit-learn`. While the model will train without errors, the results will be off due to the misinterpretation of the data.
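One common way to avoid this misinterpretation is to one-hot encode such a feature before training. The sketch below uses `pandas.get_dummies()` on a small made-up table; the column names `postal_code` and `area_sqm` and all values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data: `postal_code` is categorical although it looks numeric
df_demo = pd.DataFrame({
    'postal_code': [10115, 80331, 10115, 20095],
    'area_sqm': [55.0, 72.5, 40.0, 63.0],
})

# Mark the column as categorical and replace it with one binary column per category
df_demo['postal_code'] = df_demo['postal_code'].astype('category')
df_encoded = pd.get_dummies(df_demo, columns=['postal_code'])

print(df_encoded.columns.tolist())
```

Each postal code now has its own binary column, so a split like "postal code is 10115 or not" no longer implies any ordering between codes. For features with many distinct values this can create a large number of columns, so it is a trade-off rather than a universal fix.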

As you will see in the examples below, `DecisionTreeClassifier` and `DecisionTreeRegressor` will only create binary decision trees, i.e., each non-leaf node will only have 2 child subtrees. Note that Decision Trees do not require the data to be normalized since each decision (i.e., node in the tree) is based on only a single feature. On the other hand, this also means that Decision Trees do not consider the relationship between features. We will explore the consequences in this notebook.
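The claim that normalization is not required can be checked with a small experiment. The sketch below trains two trees, one on synthetic raw data and one on a rescaled and shifted copy, and compares their predictions; the data, the scaling factors, and the variable names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: two features in [0, 1], label depends on both features
X_demo = rng.uniform(size=(200, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# Rescale and shift each feature independently (order of values is preserved)
scale, shift = np.array([1000.0, 50.0]), np.array([5.0, -3.0])
X_scaled = X_demo * scale + shift

clf_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_demo, y_demo)
clf_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y_demo)

# Compare predictions on new points, transformed the same way for the second tree
X_new = rng.uniform(size=(50, 2))
same = np.array_equal(clf_raw.predict(X_new), clf_scaled.predict(X_new * scale + shift))
print('Identical predictions:', same)
```

Since each split only compares one feature against a threshold, rescaling a feature moves the threshold but does not change which data points end up on which side.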

## Setting up the Notebook

### Specify How Plots Get Rendered

In [None]:
%matplotlib inline

### Make all Required Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

from tqdm import tqdm

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn import tree
from sklearn.model_selection import train_test_split

from matplotlib.colors import ListedColormap

from sklearn.metrics import f1_score, mean_squared_error

### Auxiliary Code

The method `plot_decision_boundaries()` below plots the decision boundaries of a trained Decision Tree (or any other classification model), assuming the input is 2-dimensional, i.e., the dataset has 2 input features. Looking at the decision boundaries of a Decision Tree helps to understand its inner workings as well as its limitations and challenges such as overfitting.

In [None]:
# All classification datasets in this notebook have no more than 3 labels, so 3 colors are enough
colors = ['blue', 'red', 'green']

# Method to plot the decision boundaries (for classification)
# Only applicable if there are 2 input features
def plot_decision_boundaries(clf, X, y, resolution=0.01):

    plt.figure()
    margin = 0.05
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
    xx, yy = np.meshgrid(np.arange(x_min, x_max, resolution), np.arange(y_min, y_max, resolution))
    cmap = ListedColormap(colors[:len(np.unique(y))])
    Z = clf.predict(np.array([xx.ravel(), yy.ravel()]).T)
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.4, cmap=cmap)

    plt.scatter(X[:,0], X[:,1], c=[colors[int(c)] for c in y], s=100)
    plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)   
    plt.tight_layout()
    plt.show()

---

## Working with Toy Data

To better understand the basic characteristics of Decision Trees, we first look at the 2 small examples covered in the lecture.

### Linear Relationships Between Features

The first example is a small classification dataset comprising 26 data points and 2 features. 

#### Create and Visualize Data

In [None]:
data = np.array([
    (0.05, 0.65, 0), (0.65, 0.2, 0), (0.15, 0.5, 0), (0.25, 0.55, 0), (0.2, 0.4, 0), (0.3, 0.35, 0),
    (0.4, 0.45, 0), (0.45, 0.35, 0), (0.5, 0.25, 0), (0.85, 0.05, 0), (0.6, 0.3, 0), (0.7, 0.25, 0),
    (0.85, 0.3, 1), (0.05, 0.95, 1), (0.2, 0.9, 1), (0.35, 0.85, 1), (0.4, 0.7, 1), (0.5, 0.65, 1), 
    (0.1, 0.85, 1), (0.6, 0.5, 1), (0.7, 0.45, 1), (0.8, 0.4, 1), (0.25, 0.7, 1), (0.35, 0.85, 1), 
    (0.7, 0.6, 1), (0.8, 0.5, 1), 
])

X = data[:,0:2]
y = data[:,2]

num_samples, num_features = X.shape

print('The dataset consists of {} data points, each with {} features.'.format(num_samples, num_features))

Let's plot the data points.

In [None]:
plt.figure()
plt.scatter(X[:,0], X[:,1], c=[colors[int(c)] for c in y], s=100)
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)   
plt.tight_layout()
plt.show()

Just from looking at this plot, we can see that the classes could be easily separated by a diagonal line, as there is some linear relationship between the features. However, Decision Trees do not capture such relationships between features as each decision is based only on a single feature.

#### Train a Decision Tree Classifier

Since we have numerical values only, we can use the [Decision Tree implementation of scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). The implementation considers a wide range of input parameters, but we consider here only 2: `max_depth` to specify the maximum depth of the Decision Tree, and `criterion` to specify which scoring function to use to find the best split.

Try changing `max_depth` and see what the resulting Decision Tree looks like. Since this is a very small dataset, it won't be very deep anyway; `max_depth=100` is just to guarantee its maximum size.

In [None]:
clf = DecisionTreeClassifier(max_depth=100, criterion='gini').fit(X, y)

print('The Decision Tree has {} nodes.'.format(clf.tree_.node_count))

plt.figure()
tree.plot_tree(clf)
plt.show()

The figure of the Decision Tree above already gives you useful insights. For example, since the second feature `X[1]` is used in the root node, this feature is the most "valuable" as it creates the best first split of the complete dataset. The figure also shows the respective thresholds, e.g., `0.575` in the case of the root node.
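The same information can also be read out programmatically. As a minimal sketch, `export_text()` prints the learned rules as text, and the attribute `feature_importances_` summarizes how much each feature contributed to the splits. The tiny dataset and the feature names `x0` and `x1` are made up for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: the second feature alone separates the classes
X_demo = np.array([[0.1, 0.2], [0.9, 0.3], [0.2, 0.8], [0.8, 0.9], [0.5, 0.1], [0.4, 0.7]])
y_demo = np.array([0, 0, 1, 1, 0, 1])

clf_demo = DecisionTreeClassifier(criterion='gini', random_state=0).fit(X_demo, y_demo)

# Text representation of the tree: one line per node, indented by depth
print(export_text(clf_demo, feature_names=['x0', 'x1']))

# Relative importance of each feature (the values sum to 1)
print(clf_demo.feature_importances_)
```

For larger trees, the text output is often easier to search and compare than the figure.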

#### Plot Decision Boundaries

Again, try different values for `max_depth` and see how the decision boundaries change.

In [None]:
plot_decision_boundaries(clf, X, y)

We can see that Decision Trees can only generate decision boundaries made out of vertical and horizontal sections -- in the context of this plot. Each section represents a single decision, i.e., a single node in the Decision Tree. That means that any more intricate decision boundary has to be approximated by a series of simple decision boundaries, potentially requiring large/deep Decision Trees.
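One way to work around axis-aligned boundaries is to hand the tree a feature that already encodes the relationship. The sketch below compares a depth-1 tree on two raw features with a depth-1 tree on their sum; the synthetic data with the diagonal class boundary `x0 + x1 > 1` is an assumption chosen to mimic the toy example above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic data with a diagonal class boundary
X_demo = rng.uniform(size=(300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# Depth-1 tree on the raw features: a single vertical or horizontal cut
stump_raw = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)
acc_raw = stump_raw.score(X_demo, y_demo)

# Depth-1 tree on one engineered feature: the sum of both raw features
X_sum = (X_demo[:, 0] + X_demo[:, 1]).reshape(-1, 1)
stump_sum = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_sum, y_demo)
acc_sum = stump_sum.score(X_sum, y_demo)

print('Training accuracy (raw features): {:.3f}'.format(acc_raw))
print('Training accuracy (sum feature):  {:.3f}'.format(acc_sum))
```

With the engineered feature, a single threshold is enough to separate the training data, whereas the tree on the raw features needs many small steps to approximate the diagonal.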

### Overfitting & Underfitting

We now perform the same steps as above for a different toy dataset to illustrate the notion of overfitting and underfitting in the context of Decision Trees. This dataset again reflects the example used in the lecture.

#### Create and Visualize Data

In [None]:
data = np.array([
    (0.05, 0.4, 0), (0.15, 0.1, 0), (0.15, 0.35, 0), (0.2, 0.25, 0), (0.4, 0.4, 0), (0.45, 0.3, 0), 
    (0.95, 0.4, 1), (0.8, 0.4, 1), (0.65, 0.05, 0), (0.7, 0.15, 0), (0.85, 0.1, 0), (0.8, 0.3, 1),
    (0.6, 0.42, 0), (0.4, 0.1, 0), (0.63, 0.32, 0),
    (0.1, 0.55, 1), (0.08, 0.7, 1), (0.32, 0.55, 1), (0.53, 0.75, 1), (0.25, 0.78, 1), (0.9, 0.9, 1),
    (0.38, 0.85, 1), (0.65, 0.9, 1), (0.95, 0.6, 1), (0.80, 0.55, 1), (0.55, 0.6, 1), (0.05, 0.85, 1),
    (0.85, 0.7, 1), (0.32, 0.89, 1), (0.95, 0.05, 0), (0.95, 0.15, 0), (0.92, 0.3, 1)
])

# Add "outlier" point
data = np.concatenate((data, np.array([(0.32, 0.7, 0)])))

X = data[:,0:2]
y = data[:,2]

num_samples, num_features = X.shape

print('The dataset consists of {} data points, each with {} features.'.format(num_samples, num_features))

And again, we first plot the data points to have a look.

In [None]:
plt.figure()
plt.scatter(X[:,0], X[:,1], c=[colors[int(c)] for c in y], s=100)
plt.tick_params(top=False, bottom=False, left=False, right=False, labelleft=False, labelbottom=False)   
plt.tight_layout()
plt.show()

The most noticeable data point is arguably the blue data point in an area surrounded by red data points. This point is likely to be an outlier. From our understanding of Decision Trees, we can already tell that it would require several additional splits just to classify this single data point correctly -- even though it might be better to ignore this outlier.

#### Train a Decision Tree Classifier

As above, play with the value of `max_depth` and see what the resulting Decision Tree looks like.

In [None]:
clf = DecisionTreeClassifier(max_depth=3, criterion='gini')

clf.fit(X, y)

print('The Decision Tree has {} nodes.'.format(clf.tree_.node_count))

plt.figure()
tree.plot_tree(clf)
plt.show()

#### Plot Decision Boundaries

Using different values for `max_depth` will again change the decision boundaries. See how different values affect the area around the outlier.

In [None]:
plot_decision_boundaries(clf, X, y)

Judging by intuition alone, `max_depth=3` arguably yields the most meaningful decision boundaries, which ignore the outlier and are likely to generalize better to unseen data points. If we reduce `max_depth`, we lose quite some separation power in the bottom-right area of the data distribution. In contrast, if we increase `max_depth`, we start taking the outlier into account, which introduces a "blue area" around that data point. Unseen data points falling into this area would likely be misclassified.

In short, there is a best choice of `max_depth` and other parameters that results in the best classification with respect to a chosen evaluation metric (e.g., f1 score) and evaluation technique such as cross-validation. We cover this below when using a real-world dataset.
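Besides limiting `max_depth`, the pruning mentioned in the introduction is another way to keep a tree from fitting individual outliers. `scikit-learn` offers minimal cost-complexity pruning via the `ccp_alpha` parameter. The following is a rough sketch on synthetic data with a few flipped labels; the data and the value `ccp_alpha=0.01` are arbitrary assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: the class depends on the second feature only
X_demo = rng.uniform(size=(200, 2))
y_demo = (X_demo[:, 1] > 0.5).astype(int)

# Flip a few labels to simulate noisy points / outliers
flip = rng.choice(len(y_demo), size=10, replace=False)
y_demo[flip] = 1 - y_demo[flip]

# Fully grown tree: keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Candidate alpha values at which subtrees would get pruned
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)

# Pruned tree: subtrees whose benefit does not justify their size are removed
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_demo, y_demo)

print('Nodes in full tree:   {}'.format(full_tree.tree_.node_count))
print('Nodes in pruned tree: {}'.format(pruned_tree.tree_.node_count))
print('Number of candidate alphas: {}'.format(len(path.ccp_alphas)))
```

Like `max_depth`, the value of `ccp_alpha` is a hyperparameter that should be selected with a proper evaluation procedure rather than by looking at the training data alone.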

---

## Decision Tree Classification Using Vessel Details Dataset (Predict Type)

### Load Data

Using `pandas`, we first load the dataset from the comma-separated file into a DataFrame. We also perform 2 additional steps:

* Remove all rows with missing values

* Convert the string class labels in the `Type` column to numeric class labels 0, 1, 2, ...

The records are shuffled later when `train_test_split()` creates the training and test set (it shuffles by default), so that both sets feature a similar distribution (see below).

In [None]:
df = pd.read_csv('data/datasets/vessels/vessel-details.csv')

# We just ignore all rows with missing values here
df = df.dropna()

# Convert the vessel type to numerical categories 0, 1, 2, ... (expected input for most classifiers)
df['Type'] = pd.factorize(df['Type'])[0]

# Show the first 5 rows
df.head()
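As a small illustration of what `pd.factorize()` returns, the sketch below applies it to a handful of made-up vessel types; the labels are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical vessel types (not taken from the dataset)
types = pd.Series(['Tanker', 'Cargo', 'Tanker', 'Passenger', 'Cargo'])

# codes: one integer per row; uniques: lookup table from integer back to label
codes, uniques = pd.factorize(types)

print(codes)
print(list(uniques))
```

The integers are assigned in order of first appearance. The code cell above only keeps the codes (index `[0]`); storing `uniques` as well would allow mapping predicted class numbers back to the original type names.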

### Consideration of all Numerical Features

To avoid any additional preprocessing steps here such as encoding categorical features, let's focus on only the numerical features.

#### Create Training and Test Data

We use an 80/20 split for creating the training and test set.

In [None]:
# Convert data to numpy arrays
X = df[['Build Year', 'Length', 'Width', 'Gross Tonnage', 'Deadweight Tonnage', 'Efficiency']].to_numpy()
y = df[['Type']].to_numpy().squeeze()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("Size of training set: {}".format(len(X_train)))
print("Size of test set: {}".format(len(X_test)))
print(len(X_test), len(y_test))

#### Train a Single Decision Tree Classifier

Let's just first pick a value for `max_depth` and check out the resulting decision tree. You can change the value to observe the effect on the resulting tree.

In [None]:
clf = DecisionTreeClassifier(max_depth=5, criterion='gini', random_state=10)

clf.fit(X_train, y_train)

print('The Decision Tree has {} nodes.'.format(clf.tree_.node_count))

plt.figure()
tree.plot_tree(clf)
plt.show()

#### Finding the Best Value for `max_depth`

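The code cell below tries to find the best choice for `max_depth` by training one Decision Tree for each candidate value and comparing the resulting f1 scores. Note that we are very sloppy here by using the test set for this selection to keep it simple; in practice, the selection should be based on a separate validation set or on cross-validation.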

In [None]:
max_depth = 50

# Keep track of depth and f1 scores for plotting
ds, f1s = [], []

# Loop over all values for max_depth
for d in tqdm(range(1, max_depth+1)):
    ds.append(d)
    # Train Decision Tree classifier for current value of max_depth
    clf = DecisionTreeClassifier(max_depth=d, criterion='gini', random_state=10).fit(X_train, y_train)
    # Predict class labels for test set
    y_pred = clf.predict(X_test)
    # Calculate f1 score between predictions and ground truth
    f1 = f1_score(y_test, y_pred, average='micro')
    f1s.append(f1)
    
print('A maximum depth of {} yields the best f1 score of {:.3f}'.format(ds[np.argmax(f1s)], np.max(f1s), ))
    
plt.figure()
plt.plot(ds, f1s)
plt.xlabel('Maximum Depth')
plt.ylabel('F1 Score')
plt.show()
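A cleaner alternative to selecting `max_depth` on the test set is cross-validation on the training data only. The following is a minimal sketch using `GridSearchCV`; since the vessel dataset is not assumed to be available here, it uses synthetic data from `make_classification()`, and the grid of depths 1 to 20 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset: 6 numerical features, 3 classes
X_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=4,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# 5-fold cross-validation over max_depth, using the training data only
search = GridSearchCV(
    DecisionTreeClassifier(criterion='gini', random_state=10),
    param_grid={'max_depth': list(range(1, 21))},
    scoring='f1_micro',
    cv=5,
)
search.fit(X_tr, y_tr)

best_depth = search.best_params_['max_depth']
test_score = search.score(X_te, y_te)

print('Best max_depth according to cross-validation: {}'.format(best_depth))
print('f1 score on the held-out test set: {:.3f}'.format(test_score))
```

This way the test set is touched only once, for the final estimate, so the reported score is not biased by the hyperparameter search.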

---

## Decision Tree Regression Using Vessel Details Dataset (Predict Efficiency)

### Load Data

We are loading the same data file as above.

In [None]:
df = pd.read_csv('data/datasets/vessels/vessel-details.csv')

# We just ignore all rows with missing values here
df = df.dropna()

# Show the first 5 rows
df.head()

### Create Training and Test Data

Again, we only consider numerical features for convenience here, and use an 80/20 split for creating the training and test set.

In [None]:
# Convert data to numpy arrays
# Note: the target 'Efficiency' must not be part of the input features (target leakage)
X = df[['Build Year', 'Length', 'Width', 'Gross Tonnage', 'Deadweight Tonnage']].to_numpy()
y = df[['Efficiency']].to_numpy().squeeze()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("Size of training set: {}".format(len(X_train)))
print("Size of test set: {}".format(len(X_test)))
print(len(X_test), len(y_test))

### Train a Decision Tree Regressor

Training a regressor is basically the same as training a classifier. We have seen in the lecture that a Decision Tree for regression and for classification are very similar; the core difference is only in the calculation of the impurity since we now have real values instead of labels as outputs.

In [None]:
reg = DecisionTreeRegressor(max_depth=5, random_state=10).fit(X_train, y_train)

print('The Decision Tree has {} nodes.'.format(reg.tree_.node_count))

plt.figure()
tree.plot_tree(reg)
plt.show()
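To illustrate the regression case, the sketch below fits a depth-1 regression tree on a tiny made-up dataset and checks by hand that each leaf predicts the mean of its training targets. The helper `weighted_mse()` and the data are assumptions for this illustration; the impurity used is the default squared-error criterion, i.e., the variance of the targets within a node.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one feature, two clearly separated groups of targets
X_demo = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_demo = np.array([1.0, 1.2, 0.8, 5.0, 5.5, 4.5])

def weighted_mse(y_left, y_right):
    """Impurity of a split: size-weighted variance of the targets on both sides."""
    n = len(y_left) + len(y_right)
    return (len(y_left) * np.var(y_left) + len(y_right) * np.var(y_right)) / n

# Depth-1 regression tree: a single split and two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_demo, y_demo)
threshold = stump.tree_.threshold[0]

# Reproduce the two leaves by hand
left = y_demo[X_demo[:, 0] <= threshold]
right = y_demo[X_demo[:, 0] > threshold]
predictions = stump.predict([[2.5], [10.5]])

print('Threshold: {:.2f}'.format(threshold))
print('Leaf means: {:.2f} and {:.2f}'.format(left.mean(), right.mean()))
print('Impurity before split: {:.3f}, after split: {:.3f}'.format(np.var(y_demo), weighted_mse(left, right)))
print('Predictions:', predictions)
```

Apart from the impurity measure and the leaf value, the tree construction is the same as for classification.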

### Finding the Best Value for `max_depth`

We can use almost the same code as above to find the best value of `max_depth`. We only have to change the evaluation metric from f1 to RMSE (root mean squared error).

In [None]:
max_depth = 50

# Keep track of depth and RMSEs for plotting
ds, rmses = [], []

for d in tqdm(range(1, max_depth+1)):
    ds.append(d)
    # Train Decision Tree regressor for current value of max_depth
    reg = DecisionTreeRegressor(max_depth=d, random_state=10).fit(X_train, y_train)
    # Predict output values for test set
    y_pred = reg.predict(X_test)
    # Calculate RMSE between predictions and ground truth
    # (taking the square root of the MSE avoids the `squared` argument, which is deprecated in newer scikit-learn releases)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    rmses.append(rmse)

print('A maximum depth of {} yields the best RMSE of {:.3f}'.format(ds[np.argmin(rmses)], np.min(rmses)))

plt.figure()
plt.plot(ds, rmses)
plt.xlabel('Maximum Depth')
plt.ylabel('RMSE')
plt.show()

In our run, the best value for `max_depth` was 17 (see the output and plot above); the exact value depends on the input features and the train/test split. Since we use RMSE as our metric, the lower the better.

## Summary

This notebook introduced and experimented with Decision Trees. Decision trees are a popular machine learning algorithm known for their simplicity and interpretability. They provide a hierarchical structure that mimics a tree, where each internal node represents a feature or attribute, each branch corresponds to a decision based on that attribute, and each leaf node represents a class label or a predicted value. Decision trees offer several advantages, but they also have limitations.

One of the key advantages of decision trees is their interpretability. The resulting tree structure can be easily visualized and understood, allowing users to gain insights into the decision-making process. Decision trees provide clear rules that can be explained and communicated to stakeholders, making them useful in domains where interpretability is crucial.

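Another advantage is that decision trees can, in principle, handle a mix of continuous and categorical features without requiring extensive data preprocessing (although, as noted in the introduction, `scikit-learn`'s implementation expects numerical inputs). They are relatively robust to outliers in the input features, since splits depend only on the ordering of the feature values, and some implementations can handle missing values automatically, for example by utilizing surrogate splits. Decision trees can be effective even with a relatively small amount of training data and can handle high-dimensional feature spaces.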

However, decision trees have some limitations. One major drawback is their tendency to overfit, especially when the trees grow deep and complex. Overfitting occurs when the tree becomes too specific to the training data and fails to generalize well to unseen data. Techniques like pruning, which involves removing or merging nodes, can help alleviate this issue.

Additionally, decision trees may struggle with capturing complex relationships in the data compared to other algorithms like neural networks or ensemble methods. They can be sensitive to small variations in the training data and may lead to different tree structures for similar datasets. Decision trees also struggle with handling class imbalance in classification tasks, as they tend to favor majority classes.

In summary, decision trees are simple and interpretable machine learning models that offer advantages such as interpretability, ease of use with mixed data types, and robustness to outliers and missing values. However, they are prone to overfitting, may struggle with complex relationships, and can be sensitive to small data variations. Despite their limitations, decision trees are widely used and form the basis for more advanced ensemble methods like random forests and gradient boosting.