**Section 5: Classification Trees**

Notebook for "Introduction to Data Science and Machine Learning"

version 1.1, May 19 2025

In order to use the relevant packages we need the following import statements: 

In [None]:
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
from sklearn import tree

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

from sklearn.metrics import ConfusionMatrixDisplay

# 1. Classification / Decision Tree: Self-Study Exercise

## 1.1 The Data

We will create a classification tree for the data of the self-study exercise:


| X | Y | Z | C1 | C2|
|-----|-----|-----|-----|----|
|0|0|0|5|40|
|0|0|1|0|15|
|0|1|0|10|5|
|0|1|1|45|0|
|1|0|0|10|5|
|1|0|1|25|0|
|1|1|0|5|20|
|1|1|1|0|15|


Please load the data file:

In [None]:
df=pd.read_csv('data/classTreeExercise.csv', delimiter=',')

Please take a look at the first lines of the data frame.

In [None]:
# your code

Please call the following code.

In [None]:
df.value_counts()

In order to verify whether the data is consistent with the table of our exercise, we would prefer a different display. This can be achieved using the `groupby()` method (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby).

Please run the following code, check its output and compare it with the table of the exercise. Can you explain the output?

In [None]:
print(df.groupby(["X","Y","Z",'Class']).size())
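To compare the counts with the exercise table more directly, the grouped counts can be pivoted so that each class becomes a column. The sketch below does not read the CSV file; it rebuilds the data from the table above, and the class labels `"C1"` and `"C2"` as well as the names `demo` and `counts` are assumptions made for illustration.

```python
import pandas as pd

# (X, Y, Z, count of C1, count of C2), copied from the exercise table
table_rows = [
    (0, 0, 0, 5, 40), (0, 0, 1, 0, 15), (0, 1, 0, 10, 5), (0, 1, 1, 45, 0),
    (1, 0, 0, 10, 5), (1, 0, 1, 25, 0), (1, 1, 0, 5, 20), (1, 1, 1, 0, 15),
]

# expand the counts into one record per sample (assumed class labels "C1"/"C2")
records = []
for x, y, z, c1, c2 in table_rows:
    records += [(x, y, z, "C1")] * c1 + [(x, y, z, "C2")] * c2
demo = pd.DataFrame(records, columns=["X", "Y", "Z", "Class"])

# size() counts the rows per group; unstack() turns the classes into columns,
# combinations that do not occur are filled with 0
counts = demo.groupby(["X", "Y", "Z", "Class"]).size().unstack("Class", fill_value=0)
print(counts)
```

Combinations with a count of 0 do not appear in the output of `size()` alone, which is one reason why the stacked output can look shorter than the table.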

We now create two data sets: one with the features, i.e. the X, Y and Z values, and one with the target (the labels), that is the Class values. We use the following code. It should be familiar to you.

In [None]:
x_data=df.copy()
y_data=x_data.pop('Class')

## 1.2 Decision Tree Induction

We use the decision tree that is implemented in the class `DecisionTreeClassifier`. 

We use the same steps as with `MinMaxScaler` and `Regression`:

1. instantiate an object of the class that implements a "model"
2. use `fit()` to train / adapt the "model" to specific data ("learning")
3. use the "model" on new data (test data)

What `fit()` learns depends on the model:
- The method `fit()` of `MinMaxScaler` determines the minimum and maximum values of the data.
- The method `fit()` in `Regression` learns the coefficients of the regression function using gradient descent.
- `fit()` in the `DecisionTreeClassifier` learns the tree (the split criterion and labels in the leaves). 
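
The shared pattern can be sketched on a tiny, made-up data set. The arrays `X_toy` and `y_toy` are hypothetical and only serve to show what each `fit()` stores in the fitted object.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# hypothetical toy data: one feature, two classes
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 1, 1])

# step 1 and 2 for the scaler: fit() stores the minimum and maximum per feature
scaler = MinMaxScaler()
scaler.fit(X_toy)
print(scaler.data_min_, scaler.data_max_)

# step 1 and 2 for the tree: fit() builds the tree structure
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_toy, y_toy)
print(clf.tree_.node_count)

# step 3: use the fitted model on new data
predictions = clf.predict(np.array([[1.5], [3.5]]))
print(predictions)
```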

You find the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

First we create a basic decision tree classifier. Without any further specification, this tree uses the Gini index to determine the best splitting criterion.

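$$GiniIndex=1-\sum_{i=0}^{c-1}p_i(t)^2 $$

Here $p_i(t)$ is the fraction of samples at node $t$ that belong to class $i$, and $c$ is the number of classes.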
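
As a worked example, the Gini index can be computed by hand for the exercise table. The following sketch uses the counts from section 1.1; the helper names `gini` and `weighted_gini` are made up for this illustration and are not part of `sklearn`.

```python
import numpy as np

def gini(counts):
    """Gini index of a node, given the number of samples per class."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

# (X, Y, Z, count of C1, count of C2), copied from the exercise table
rows = [
    (0, 0, 0, 5, 40), (0, 0, 1, 0, 15), (0, 1, 0, 10, 5), (0, 1, 1, 45, 0),
    (1, 0, 0, 10, 5), (1, 0, 1, 25, 0), (1, 1, 0, 5, 20), (1, 1, 1, 0, 15),
]

def weighted_gini(feature_index):
    """Weighted Gini index of the two child nodes after splitting on one feature."""
    total = sum(r[3] + r[4] for r in rows)
    result = 0.0
    for value in (0, 1):
        c1 = sum(r[3] for r in rows if r[feature_index] == value)
        c2 = sum(r[4] for r in rows if r[feature_index] == value)
        result += (c1 + c2) / total * gini([c1, c2])
    return result

# Gini index of the root node (all samples)
root = gini([sum(r[3] for r in rows), sum(r[4] for r in rows)])
# weighted Gini index after a split on X, Y or Z
split_gini = {name: weighted_gini(i) for i, name in enumerate("XYZ")}
print(root, split_gini)
```

The feature with the lowest weighted Gini index gives the best first split; you can compare it with the root of the tree that is induced below.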


We instantiate the object:

In [None]:
classif = tree.DecisionTreeClassifier()

And train it on our data:

In [None]:
classif = classif.fit(x_data, y_data)

Now we can display the decision tree as text:

In [None]:
theTree=tree.export_text(classif)
print(theTree)

## 1.3 Plotting the Decision Tree

We can also display the tree as a figure. In order to save the figure we need to create a figure object that contains the displayed tree. With `dpi=600` (dots per inch) we set a reasonable resolution. The resulting plot is stored in the directory `plots`, which must already exist.

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif)# here we plot the tree
fig.suptitle("Decision tree, version 1") # this is the title for the figure
fig.savefig('plots/tree1a.png') # and here we save it


The documentation for plotting is found here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html#sklearn.tree.plot_tree

The tree is not so easy to understand. E.g. we might not know what `X[0]` signifies. We can specify the feature names:

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif,feature_names=["X","Y","Z"])# here we plot the tree
fig.suptitle("Decision tree, version 1 with feature names") # this is the title for the figure
fig.savefig('plots/tree1b.png') # and here we save it

As well as the class names:

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif,feature_names=["X","Y","Z"], class_names=['C1','C2'])# here we plot the tree
fig.suptitle("Decision tree, version 1, feature and class names") # this is the title for the figure
fig.savefig('plots/tree1c.png') # and here we save it

And color the tree:

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif,feature_names=["X","Y","Z"], class_names=['C1','C2'], filled=True)# here we plot the tree
fig.suptitle("Decision tree, version 1, names and colors") # this is the title for the figure
fig.savefig('plots/tree1d.png') # and here we save it

**Question:**

Please look at the tree above:
- Explain the different colors and shades of the colors.
- How is the class attribute in each node (internal nodes as well as leaves) determined?
  

**Answer:**

## 1.4 Impurity Measure

To determine the best criterion for splitting of a node, we can use the Gini index as impurity measure. This is the default.

Another measure is the **Entropy** which is defined as
$$-\sum_{i=0}^{c-1}p_i(t) \log_2 p_i(t)$$
If we wish to use the entropy as a splitting criterion, we need to define a new classifier and specify the parameter `criterion`.

In [None]:
classifE = tree.DecisionTreeClassifier(criterion="entropy")
classifE = classifE.fit(x_data, y_data)
fig=plt.figure(dpi=600)
tree.plot_tree(classifE)# here we plot the tree
fig.suptitle("Decision tree, version 2") # this is the title for the figure
fig.savefig('plots/tree2a.png') # and here we save it
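
To make the entropy formula above concrete, the following sketch evaluates it for two nodes of the self-study exercise: the root (100 samples of C1, 100 of C2) and the child node with Z = 0 (30 of C1, 70 of C2). The helper name `entropy` is chosen for this illustration only.

```python
import numpy as np

def entropy(counts):
    """Entropy (in bits) of a node, given the number of samples per class."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # by convention 0 * log2(0) counts as 0
    return -np.sum(p * np.log2(p))

root_entropy = entropy([100, 100])  # both classes equally frequent
child_entropy = entropy([30, 70])   # node with Z = 0 in the exercise table
print(root_entropy, child_entropy)
```

For two classes the entropy ranges from 0 (pure node) to 1 (both classes equally frequent), whereas the Gini index ranges from 0 to 0.5.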

Please enhance the tree for the second classifier with class names, feature names as well as colors.

In [None]:
# your code


**Question:**

Please download the two figures to compare them. Can you see any differences between the trees induced using the Gini index and the entropy?

**Answer:**

# 2. Classification / Decision Tree: Classroom Assignment Example 

Now let's take a look at the data set used in the classroom assignment. First we load the data set:

In [None]:
df=pd.read_csv('../data/classTreeExercise2.csv')
#df=pd.read_csv('data/classTreeExercise2.csv')
df

Now we prepare the `X` data (features) and the `y` data (labels). We will use a version where `Customer ID` is part of the features:

In [None]:
df.columns


In [None]:
x_data=df.copy()
y_data=x_data.pop('Class')
x_data

Create the decision tree classifier:

In [None]:
# Your code


Induce / Train the decision tree classifier:

In [None]:
# Your code


You will see an **error message** `ValueError: could not convert string to float: 'M'`.


## 2.1 Data Preparation

While the decision tree algorithm in general can handle categorical (nominal or ordinal) data like the gender and shirt size in our example, the `DecisionTreeClassifier` implementation in `sklearn` **can't**.

In order to use decision trees we need to convert all features to numbers or Boolean values (remember that the data type `bool` is a numeric data type).

Of course we could devise an encoding that replaces each categorical value with a number. But we should be careful with this! As we learned, there are different data types: categories do not imply an order, while numbers do.

If a feature has only two values, like `Gender`, we might replace e.g. 'M' with 0 and 'F' with 1. But we should be careful when we continue using the data later on, as 0 and 1 have an implicit order which gender does not have. There is no order among men and women!

`Shirt Size` on the other hand has a clear order among the different sizes. We might replace the categorical values with numerical values, e.g. 1 for `Small`, 2 for `Medium`, 3 for `Large` and 4 for `Extra Large`. 

First let's create a copy of the data before making any changes:

In [None]:
x_data2=x_data.copy()

Then we replace Shirt Size by the values 1 to 4 as described above:

In [None]:
# we use a with block to avoid the FutureWarning
with pd.option_context('future.no_silent_downcasting', True):
    x_data2.replace({"Shirt Size":{"Small":1,"Medium":2,"Large":3,"Extra Large":4}},inplace=True)

We then change the data type of the column `Shirt Size` to `int32`.

In [None]:
x_data2['Shirt Size']=x_data2['Shirt Size'].astype('int32')
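
As an alternative to `replace()`, the same ordinal encoding can be sketched with `Series.map()` and an explicit dictionary. The data frame `sizes` below is a made-up example, not the classroom data.

```python
import pandas as pd

# hypothetical example column with the four shirt sizes
sizes = pd.DataFrame({"Shirt Size": ["Small", "Large", "Medium", "Extra Large"]})

# explicit order of the ordinal values
size_order = {"Small": 1, "Medium": 2, "Large": 3, "Extra Large": 4}

# map() looks up every value in the dictionary; astype() fixes the data type
sizes["Shirt Size"] = sizes["Shirt Size"].map(size_order).astype("int32")
print(sizes)
print(sizes.dtypes)
```

Note that `map()` yields `NaN` for every value that is missing in the dictionary, so the conversion to an integer type fails if a size was forgotten. This can serve as a useful check.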

And check the information:

In [None]:
x_data2.info()

**Note:** With the above change we transformed an ordinal attribute into an interval-scaled one, which may lead to wrong results for many models. The numeric encoding implies that the difference between `Small` and `Medium` is the same as between `Medium` and `Large`, which is of course not the case. In most cases this is therefore not a feasible solution and should be avoided unless we are sure that the algorithm we use can deal with this change. In many cases we should use one-hot-encoding instead.

**Note 2:** The transformation to a numerical value is also problematic because the classifier always makes a two-way split based on a threshold on the value. It can therefore separate smaller from larger shirt sizes, but it cannot group the more "extreme" and the more medium shirt sizes together ({small, extra large} versus {medium, large}), nor separate one middle shirt size from the rest ({large} versus {small, medium, extra large}). Such splits are part of the general attribute splitting procedure for categorical attributes. Therefore, even for apparently ordered attribute values, the one-hot-encoding method is often preferred (see next paragraph).
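
The limitation described in Note 2 can be illustrated with a small experiment. In the made-up data below only `Large` belongs to class 1. A tree restricted to a single split (`max_depth=1`) cannot isolate `Large` on the ordinal encoding, but it can on the one-hot-encoded data. All names and data in this sketch are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# hypothetical data: 10 samples per shirt size, only "Large" is class 1
size_names = ["Small", "Medium", "Large", "Extra Large"] * 10
labels = [1 if s == "Large" else 0 for s in size_names]

# ordinal encoding: one numeric column with the values 1 to 4
size_order = {"Small": 1, "Medium": 2, "Large": 3, "Extra Large": 4}
ordinal = pd.DataFrame({"Shirt Size": pd.Series(size_names).map(size_order)})

# one-hot-encoding: one indicator column per shirt size
one_hot = pd.get_dummies(pd.DataFrame({"Shirt Size": size_names}))

# trees with exactly one split
stump_ordinal = DecisionTreeClassifier(max_depth=1, random_state=0).fit(ordinal, labels)
stump_one_hot = DecisionTreeClassifier(max_depth=1, random_state=0).fit(one_hot, labels)

acc_ordinal = stump_ordinal.score(ordinal, labels)
acc_one_hot = stump_one_hot.score(one_hot, labels)
print("ordinal:", acc_ordinal, "one-hot:", acc_one_hot)
```

With more levels the tree on the ordinal encoding can still separate `Large` (using two thresholds), so the difference concerns what a single split can express, not what the tree as a whole can learn.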

The column `Car Type` has more than 2 values (it is not binary). If the types do not have an implicit order, i.e. are not ordinal, we should not impose such an order by replacing the categorical values with numbers. In this case we can use the so-called **one-hot-encoding**. With one-hot-encoding we create a separate column for each categorical value and assign to it the value 1 (the data record has this value) or 0 (the data record does not have this value). The function `get_dummies()` in `pandas` returns a data set in which all categorical columns are one-hot-encoded. In our data set we thus create one-hot-encoding columns for `Gender` and `Car Type`:

In [None]:
x_data2=pd.get_dummies(x_data2)
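
What `get_dummies()` does can be seen on a very small, made-up data frame. The values in `example` are assumptions for illustration and are not taken from the classroom data file.

```python
import pandas as pd

# hypothetical records: one numeric and two categorical columns
example = pd.DataFrame({
    "Shirt Size": [1, 3, 2],
    "Gender": ["M", "F", "M"],
    "Car Type": ["Family", "Sports", "Luxury"],
})

# numeric columns are kept, every categorical column is replaced
# by one indicator column per distinct value
encoded = pd.get_dummies(example)
print(encoded)
```

The new column names are built from the original column name and the value, e.g. `Car Type_Sports`.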

Let's take a look at the data:

In [None]:
x_data2.head()

and compare it with the original data:

In [None]:
x_data.head()

and the types:

In [None]:
x_data.info()

In [None]:
x_data2.info()

## 2.2 Decision Tree Induction

Now we can induce the decision tree:

In [None]:
# create a new classifier and induce the decision tree
classif2 = tree.DecisionTreeClassifier()
classif2.fit(x_data2,y_data)

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif2,feature_names=x_data2.columns,class_names=['C0','C1'], filled=True)# here we plot the tree
fig.suptitle("Decision tree, Example 2, version 1, names and colors") # this is the title for the figure
fig.savefig('plots/treeEx2_1.png') # and here we save it

The decision tree above classifies our data perfectly! But it is based on `Customer ID`. As discussed in the lecture, the attribute `Customer ID` should not be used for the split, even if its attribute test condition yields the lowest impurity.

Therefore, we will remove the attribute from the data:

In [None]:
x_data2.pop('Customer ID')
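
Why an identifier is a misleading split attribute can be sketched with random labels: since every ID is unique, an unrestricted tree can memorize any labeling, even one without any pattern. The data below is randomly generated and purely hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# 20 unique IDs as the only feature and labels drawn at random
ids = np.arange(1, 21).reshape(-1, 1)
random_labels = rng.integers(0, 2, size=20)

# an unrestricted tree separates all IDs and thereby memorizes the labels
id_tree = DecisionTreeClassifier(random_state=0).fit(ids, random_labels)
train_accuracy = id_tree.score(ids, random_labels)
print(train_accuracy)
```

The perfect training accuracy says nothing about new customers, whose IDs the tree has never seen.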

Create a decision tree for the data without `Customer ID` and plot it:
- create a new object from the class `DecisionTreeClassifier`
- train the object / induce the tree using `x_data2` and `y_data`
- display the tree using feature names, class names and colored nodes

In [None]:
# your code

The tree above looks different from the tree we would have constructed in the lecture. Based on the values calculated in the lecture we would use `Gender` as the first splitting criterion. In the classroom assignment we determined the Gini index for a multiway split, whereas `DecisionTreeClassifier` in `sklearn` always uses binary splits on numerical values. As all sports cars belong to class C0, the attribute `Sports` leads to the best split on the first level.

# 3. Classification / Decision Tree: Breast Cancer data set

## 3.1 The Data Set

We will now derive a decision tree for the Breast Cancer data set that is often used as an example for machine learning: https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset

First we load the data set:

In [None]:
cancer=load_breast_cancer(as_frame=True)

`cancer` is a dictionary-like object (a `Bunch`). With `keys()` we can display its entries.

In [None]:
cancer.keys()

The `cancer` object contains several entries (key-value pairs) such as the feature data, the feature names, the target and the target names, the description and the complete data as a data frame. As we have worked so far with data frames containing the features as well as the target, we will continue doing so. Therefore we use the `frame` and store it in a data frame variable:

In [None]:
cancerDF=cancer.frame.copy()

**Question:**

Please answer the following questions:

- how many features are in the data set?
- how many samples are there?
- what are the data types of the features?
- how many distinct values are in the target column?

In [None]:
# Code to determine the answers to above questions


**Answers:**

- how many features are in the data set?
- how many samples are there?
- what are the data types of the features?
- how many distinct values are in the target column?

In [None]:
cancerDF.info()

Please note that using the values of the `target` column in `cancerDF` as index in `cancer.target_names` one can obtain the categorical class values.

In [None]:
cancer.target_names
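
The lookup described above can be sketched as follows. The variable names `cancer_demo` and `frame` and the new column name `diagnosis` are chosen for this illustration only.

```python
from sklearn.datasets import load_breast_cancer

cancer_demo = load_breast_cancer(as_frame=True)
frame = cancer_demo.frame.copy()

# use the numeric target values (0 or 1) as index into the array of class names
frame["diagnosis"] = cancer_demo.target_names[frame["target"].to_numpy()]

# show which number corresponds to which class name
print(frame[["target", "diagnosis"]].drop_duplicates())
```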

## 3.2 The Decision Tree

Let's create `X` as our data and `y` as our target. 

In [None]:
X=cancerDF.copy()
y=X.pop("target")

and build the decision tree:

In [None]:
classif = tree.DecisionTreeClassifier()
classif = classif.fit(X, y)

In [None]:
theTree=tree.export_text(classif,feature_names=cancer.feature_names) # translate the tree to a set of rules
print(theTree)

And we can display the tree in a figure:


In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif,feature_names=cancer.feature_names) # or X.columns as feature_names
fig.suptitle('Cancer Data Set')
fig.savefig('plots/cancer1.png')

It is difficult to read the tree. But when you open the saved image file, you can easily zoom in and explore the tree.

Let's output the accuracy of the decision tree with `score()`.

In [None]:
print(classif.score(X, y))

Perfect accuracy! Wow!

But is this realistic? **No!** There is something wrong. When we look at the leaves in the tree, there is no impurity. The Gini index in all leaves is 0. We achieved a minimum **bias**, i.e. minimum training errors. But the score is calculated on the same data. That is, we have no idea how well the tree can classify data it has not seen before.

We did not split the data into a training and a test data set!


## 3.3 Splitting the Data into Training and Test Data Sets

Use the function `train_test_split()` to split the data into training and test data using the following parameter values:
- `test_size` is 0.33
- `random_state` is 33

Name the resulting sets `X_train`, `X_test`, `y_train` and `y_test`.

In [None]:
# your code

We now train a new model using the training data `X_train` and `y_train`. To allow for reproducibility we also set the `random_state` in the classifier:

In [None]:
classif2 = tree.DecisionTreeClassifier(random_state=33)
classif2 = classif2.fit(X_train, y_train)

and we plot the result:

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif2,feature_names=cancer.feature_names)
fig.suptitle('Cancer Data Set')
fig.savefig('plots/cancer2.png')

In order to determine the accuracy we use the test data:

In [None]:
print("accuracy classif2:",classif2.score(X_test, y_test))

As before, all leaves in the decision tree have an impurity of 0, i.e. the Gini index is 0. We also observe that the accuracy on the test data is less than 100%, which makes sense as the test data has not been seen before. The gap between the perfect training accuracy and this lower test accuracy reflects the **variance** of the model.

In order to determine the **confusion matrix** we need to predict the classes for the test data (`X_test`). To do so we use the method `predict()` of the classifier.
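
A single split yields only one estimate of the accuracy on unseen data. As a complementary sketch, `cross_val_score()` repeats training and evaluation on several different splits (here 5 folds). The variable names and the choice `cv=5` are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_breast_cancer(return_X_y=True, as_frame=True)

# accuracy on the data the tree was trained on
full_tree = DecisionTreeClassifier(random_state=33).fit(X_cv, y_cv)
train_accuracy = full_tree.score(X_cv, y_cv)

# accuracy on held-out folds: each fold is used once as test data
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=33), X_cv, y_cv, cv=5)

print("training accuracy:", train_accuracy)
print("cross-validation accuracies:", cv_scores)
print("mean:", cv_scores.mean(), "standard deviation:", cv_scores.std())
```

The spread of the fold accuracies gives an impression of how strongly the result depends on the particular split.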

In [None]:
y_predict=classif2.predict(X_test)

And compute the confusion matrix:

In [None]:
cm=confusion_matrix(y_test,y_predict)
print(cm)

The matrix is a `numpy` array and we can access its individual elements.

In [None]:
# row 0 = actual malignant, column 1 = predicted benign (malignant treated as the positive class)
print('False negatives:',cm[0,1])
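
The layout of the confusion matrix can be checked on a small hand-made example: rows correspond to the actual classes and columns to the predicted classes, both in sorted label order. The label lists below are made up; as in the cancer data, 0 stands for malignant and 1 for benign.

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels: 0 = malignant, 1 = benign
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

cm_demo = confusion_matrix(y_true, y_pred)
print(cm_demo)

# row = actual class, column = predicted class
missed_malignant = cm_demo[0, 1]  # actual malignant, predicted benign
false_alarms = cm_demo[1, 0]      # actual benign, predicted malignant
accuracy = cm_demo.trace() / cm_demo.sum()  # correct predictions are on the diagonal
print(missed_malignant, false_alarms, accuracy)
```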

In [None]:

cmd = ConfusionMatrixDisplay(cm, display_labels=['WDBC-Malignant','WDBC-Benign'])
cmd.plot()

The classifier can be configured using different parameters. These can influence the variance.


In [None]:
classif3 = tree.DecisionTreeClassifier( min_samples_leaf=3, max_depth=4, max_features=8,random_state=33)
classif3 = classif3.fit(X_train, y_train)
y_predict=classif3.predict(X_test)
cm2=confusion_matrix(y_test,y_predict)
print('accuracy classif3:',classif3.score(X_test, y_test))
print(cm2)
cmd = ConfusionMatrixDisplay(cm2, display_labels=['WDBC-Malignant','WDBC-Benign'])
cmd.plot()
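
One way to explore the effect of such a parameter is to vary it in a loop and compare the accuracy on training and test data. The sketch below does this for `max_depth`; the split parameters (`test_size=0.25`, `random_state=0`), the depth values and the variable names are arbitrary choices for this illustration and differ from the split used above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

results = {}
for depth in (1, 2, 4, None):  # None means: no limit on the depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    results[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print("max_depth:", depth, "train:", results[depth][0], "test:", results[depth][1])
```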

**Question**

Use the documentation https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html to understand the different parameters.

- What does `min_samples_leaf` mean?
- What does `max_depth` mean?
- What does `max_features` mean?

Please try to explain the potential effect of each parameter on the performance (quality and learning speed).

**Answer:**
- `min_samples_leaf`
- `max_depth`
- `max_features`

# 4. Exercise

Load the penguins data set and induce a decision tree to predict the species.

Please note that decision tree classifiers do not accept input with `NaN` values. Use `dropna()` to remove samples with `NaN` values.
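
How `dropna()` behaves can be tried on a small, made-up data frame first. The frame `birds` below only imitates a few columns of the penguins data and is not the real data set.

```python
import numpy as np
import pandas as pd

# hypothetical records, two of them with a missing value
birds = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 46.5, 50.0],
    "sex": ["Male", "Female", None, "Female"],
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
})

# by default dropna() removes every row that contains at least one missing value
clean = birds.dropna()
print(clean)
```

`dropna()` returns a new data frame and leaves the original unchanged, so the result has to be assigned to a variable.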

In [None]:
import seaborn as sns

penguins=sns.load_dataset('penguins')

# Your code



*End of the Notebook*

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This notebook was created by Christina B. Class for teaching at EAH Jena and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.