**Section 5: Classification Trees - Solution**

Notebook for "Introduction to Data Science and Machine Learning"

version 1.0, May 28 2024

In order to use the relevant packages we need the following import statements: 

In [None]:
import matplotlib.pyplot as plt
import numpy as np

import pandas as pd
from sklearn import tree

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

# 1. Classification / Decision Tree: Self-Study Exercise

We will create a classification tree for the data of the self-study exercise:


| X | Y | Z | C1 | C2|
|-----|-----|-----|-----|----|
|0|0|0|5|40|
|0|0|1|0|15|
|0|1|0|10|5|
|0|1|1|45|0|
|1|0|0|10|5|
|1|0|1|25|0|
|1|1|0|5|20|
|1|1|1|0|15|


Please load the data file:

In [None]:
df=pd.read_csv('data/classTreeExercise.csv', delimiter=',')

Please call the command to display the first rows of the data.

In [None]:
# your code


Please call the following code.

In [None]:
df.value_counts()

In order to verify whether the data is consistent with the table of our exercise, we would prefer a different display. This can be achieved using the `groupby()` method (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby).

Please run the following code, check its output and compare it with the table of the exercise. Can you explain the output?

In [None]:
print(df.groupby(["X","Y","Z",'Class']).size())
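
The long, indexed output of `groupby(...).size()` can be reshaped so that it resembles the table of the exercise. The following is a minimal sketch on a small hypothetical data set (the rows and the class labels `C1`/`C2` are assumptions for illustration, not the content of the exercise file). It uses `unstack()` to move the class level of the index into columns.

```python
import pandas as pd

# Hypothetical miniature data set in the same long format as the exercise file:
# one row per record with the features X, Y, Z and the label Class.
demo = pd.DataFrame({
    "X":     [0, 0, 0, 0, 1, 1],
    "Y":     [0, 0, 0, 1, 1, 1],
    "Z":     [0, 0, 0, 1, 0, 0],
    "Class": ["C1", "C2", "C2", "C1", "C2", "C2"],
})

# size() counts the records per (X, Y, Z, Class) combination;
# unstack() turns the innermost index level (Class) into columns,
# fill_value=0 is used for combinations that do not occur.
counts = demo.groupby(["X", "Y", "Z", "Class"]).size().unstack(fill_value=0)
print(counts)
```

Each row of `counts` then corresponds to one combination of X, Y and Z, with one column per class.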

We now create two objects: a data frame with the features, i.e. the X, Y and Z values, and a series with the target, i.e. the labels in the column `Class`. We use the following code. It should be familiar to you.

In [None]:
x_data=df.copy() # work on a copy so that df stays unchanged
y_data=x_data.pop('Class') # pop() removes the column from x_data and returns it

We use the decision tree that is implemented in the class `DecisionTreeClassifier`. As with `MinMaxScaler` and `Regression`, we must first instantiate an object of the class and then use `fit()` to train / adapt the object.

The method `fit()` of `MinMaxScaler` determines the minimum and maximum values of the data. The method `fit()` in `Regression` learns the coefficients of the regression function using gradient descent. `fit()` in the classification tree learns the tree.

You find the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

We can then use the classifier to classify new data. We can equally display it either as text or in a figure. In the following we will not only display the figures in the notebook but also store them on the disk, so that you can download them, open them as images, examine them in detail and compare them.



First we create a basic decision tree classifier. Without any further specification, this tree uses the Gini index to determine the best splitting criterion.

$$GiniIndex=1-\sum_{i=0}^{c-1}p_i(t)^2 $$
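
To make the formula concrete, here is a minimal sketch that evaluates the Gini index by hand for the exercise table above. The helper names `gini` and `weighted_gini` are introduced only for this illustration and are not part of `sklearn`. The class counts are summed from the C1 and C2 columns of the table.

```python
def gini(counts):
    """Gini index of a node, given the number of records per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Weighted average of the Gini indices of the child nodes of a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


# Class counts [C1, C2], summed from the exercise table.
root = [100, 100]
split_x = [[60, 60], [40, 40]]   # X = 0, X = 1
split_y = [[40, 60], [60, 40]]   # Y = 0, Y = 1
split_z = [[30, 70], [70, 30]]   # Z = 0, Z = 1

print("root:", gini(root))
for name, split in [("X", split_x), ("Y", split_y), ("Z", split_z)]:
    print(name, round(weighted_gini(split), 4))
```

The root node has a Gini index of 0.5. A split on X leaves it at 0.5, a split on Y gives 0.48 and a split on Z gives 0.42, so Z is the best first split according to this criterion.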


In [None]:
classif = tree.DecisionTreeClassifier()

And fit it to our data, i.e. induce the tree:

In [None]:
classif = classif.fit(x_data, y_data)

Now we can display the decision tree as text:

In [None]:
theTree=tree.export_text(classif)
print(theTree)

And we can equally display the tree as a figure. In order to save the figure we need to create a figure object that contains the displayed tree. With `dpi=600` (dots per inch) we set a reasonable resolution.

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif)# here we plot the tree
fig.suptitle("Decision tree, version 1") # this is the title for the figure
fig.savefig('plots/tree1a.png') # and here we save it


The documentation for plotting is found here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html#sklearn.tree.plot_tree

The tree is not so clear. E.g. we might not know what `X[0]` signifies. We can specify the feature names:

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif,feature_names=["X","Y","Z"])# here we plot the tree
fig.suptitle("Decision tree, version 1 with feature names") # this is the title for the figure
fig.savefig('plots/tree1b.png') # and here we save it

As well as the class names:

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif,feature_names=["X","Y","Z"], class_names=['C1','C2'])# here we plot the tree
fig.suptitle("Decision tree, version 1, feature and class names") # this is the title for the figure
fig.savefig('plots/tree1c.png') # and here we save it

And color the tree:

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif,feature_names=["X","Y","Z"], class_names=['C1','C2'], filled=True)# here we plot the tree
fig.suptitle("Decision tree, version 1, names and colors") # this is the title for the figure
fig.savefig('plots/tree1d.png') # and here we save it

Please look at the tree above and explain the different colors and shades of the colors.

We now have a "beautiful" decision tree. But we should not continue too fast. Please make sure to look carefully at the tree and the information in all nodes. Please do not continue until you can explain all information.

To determine the splitting of a node, we can use the Gini index.

Another measure is the **Entropy** which is defined as
$$-\sum_{i=0}^{c-1}p_i(t) \log_2 p_i(t)$$
If we wish to use the entropy as a splitting criterion, we need to define a new classifier and specify the parameter `criterion`.
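
As a small illustration of the entropy formula, the following sketch computes it from class counts. The function name `entropy` is chosen for this example only; classes without records are skipped, following the usual convention that they contribute 0.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a node, given the number of records per class."""
    total = sum(counts)
    # Classes with zero records are skipped: 0 * log2(0) is taken as 0.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


print(entropy([100, 100]))          # two equally frequent classes: maximal impurity
print(round(entropy([30, 70]), 4))  # a mixed node
print(entropy([45, 0]))             # a pure node
```

For two classes the entropy ranges from 0 (pure node) to 1 (both classes equally frequent), whereas the Gini index ranges from 0 to 0.5.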

In [None]:
classifE = tree.DecisionTreeClassifier(criterion="entropy")
classifE = classifE.fit(x_data, y_data)
fig=plt.figure(dpi=600)
tree.plot_tree(classifE)# here we plot the tree
fig.suptitle("Decision tree, version 2") # this is the title for the figure
fig.savefig('plots/tree2a.png') # and here we save it

Please enhance the tree for the second classifier with class names, feature names as well as colors.

In [None]:
# your code


Can you see any differences when comparing the trees induced using Gini index and Entropy?

# 2. Classification / Decision Tree: Classroom Assignment Example 

Now let's take a look at the data set used in the classroom assignment. First we load the data set:

In [None]:
df=pd.read_csv('../data/classTreeExercise2.csv')
#df=pd.read_csv('data/classTreeExercise2.csv')
df

Now we prepare the `X` data (features) and the `y` data (labels). We will use a version where `Customer ID` is part of the features:

In [None]:
df.columns


In [None]:
x_data=df.copy()
y_data=x_data.pop('Class')
x_data

In [None]:
# Create the decision tree classifier


In [None]:
# induce the decision Tree (there will be an error, don't worry)


Now you will see an **error message** `ValueError: could not convert string to float: 'M'`.


## 2.1 Data Preparation

While the decision tree algorithm itself can handle categorical (nominal or ordinal) data like the gender and shirt size in our example, the implementation `DecisionTreeClassifier` in `sklearn` **can't**.

In order to use decision trees we need to convert all features to numbers.

Of course we could devise an encoding by replacing each categorical value with a number. But we should be careful with this! As we learned, there are different data types. Categories do not imply an order, while numbers do.

If a feature has only two values, like `Gender`, we can replace them with numbers for such algorithms. In this case we might replace 'M' with 0 and 'F' with 1. But we should be careful when we continue using the data later on, as 0 and 1 have an implicit order which gender does not have. There is no order among men and women!

Now let's replace these values:

In [None]:
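# we use the option context to avoid the FutureWarning about silent downcasting
with pd.option_context('future.no_silent_downcasting', True):
    x_data.replace({"Gender":{"M":0,"F":1}},inplace=True)
# with downcasting disabled the column may keep the dtype object,
# therefore we infer the numeric type explicitly
x_data=x_data.infer_objects()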

Now we see that the data type of `Gender` was changed to int.

In [None]:
x_data.info()

Let's take a look at the column `Shirt Size`. A shirt size has a clear order. We can use 1 for `Small` to 4 for `Extra Large` as follows: 

In [None]:
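# we use the option context to avoid the FutureWarning about silent downcasting
with pd.option_context('future.no_silent_downcasting', True):
    x_data.replace({"Shirt Size":{"Small":1,"Medium":2,"Large":3,"Extra Large":4}},inplace=True)
# with downcasting disabled the column may keep the dtype object,
# therefore we infer the numeric type explicitly
x_data=x_data.infer_objects()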

And check the types again.

In [None]:
x_data.info()

The column `Car Type` has more than 2 values (it is not binary). If the types do not have an implicit order, i.e. are not ordinal, we should not impose such an order by replacing the categorical values with numbers. In this case we can use the so-called **one-hot-encoding**. With one-hot-encoding we create a separate column for each categorical value and assign values of 0 (the data record does not have this value) or 1 (the data record has this value) to it. The function `get_dummies()` in `pandas` creates a data set with a one-hot-encoder for all categorical columns:  
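
The effect of one-hot encoding is easiest to see on a tiny example. The following sketch uses a hypothetical data frame (the three records and the car types are assumptions for illustration). The parameter `dtype=int` is used here only so that the new columns show 0 and 1 instead of `False` and `True`.

```python
import pandas as pd

# Hypothetical records; the values are illustrative only.
cars = pd.DataFrame({
    "Gender": [0, 1, 1],                          # already numeric: left untouched
    "Car Type": ["Family", "Sports", "Luxury"],   # categorical: will be encoded
})

# One new column per category value, named <column>_<value>.
encoded = pd.get_dummies(cars, dtype=int)
print(encoded)
```

Every record has exactly one 1 among the new `Car Type_...` columns, so no order between the car types is introduced.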

In [None]:
x_data2=pd.get_dummies(x_data)

Let's take a look at the data:

In [None]:
x_data2.head()

and compare it with the original data:

In [None]:
x_data.head()

and the types:

In [None]:
x_data2.info()

## 2.2 Decision Tree Induction

Now we can create a decision tree:

In [None]:
# induce the decision tree


In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif2,feature_names=x_data2.columns,class_names=['C0','C1'], filled=True)# here we plot the tree
fig.suptitle("Decision tree, Example 2, version 1, names and colors") # this is the title for the figure
fig.savefig('plots/treeEx2_1.png') # and here we save it

And now we have a decision tree that classifies our data perfectly! But it is based on `Customer ID`. As discussed in the lecture, the attribute `Customer ID` should not be used for a split, even if its attribute test condition yields the lowest impurity: an ID identifies a single record and does not generalize to new customers.

Therefore we will remove the attribute from the data:

In [None]:
x_data2.pop('Customer ID')

In [None]:
classif2=tree.DecisionTreeClassifier()
classif2.fit(x_data2,y_data)

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif2,feature_names=x_data2.columns,class_names=['C0','C1'], filled=True)# here we plot the tree
fig.suptitle("Decision tree, Example 2, version 2, names and colors") # this is the title for the figure
fig.savefig('plots/treeEx2_2.png') # and here we save it

You will see that the tree above looks different from the tree we would have constructed in the lecture. Based on the calculated values we would use `Gender` as the first splitting criterion. In the classroom assignment we determined the Gini index for a multiway split. `DecisionTreeClassifier` in `sklearn` always uses binary splits on numerical values. As all sports cars belong to class C0, the one-hot column for the car type `Sports` leads to the best split on the first level.

# 3. Classification / Decision Tree: Breast Cancer data set

We will now derive a decision tree for the Breast Cancer data set that is often used as an example for machine learning: https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset

First we need to load the data set:

In [None]:
cancer=load_breast_cancer()

Please note that `cancer` is a data set object (a dictionary-like `Bunch`) and not a data frame. It has, among others, the following elements:
- `cancer.data` : the data in the form of a matrix
- `cancer.feature_names` : the names of the attributes as an array
- `cancer.target` : the labels
- `cancer.target_names` : the values of the class labels

Please use the following cell to look at this data.

In [None]:
# your code


`X` is our data, `y` is our target. We create a decision tree and fit the data.

In [None]:
X,y=cancer.data, cancer.target
classif = tree.DecisionTreeClassifier()
classif = classif.fit(X, y)

We output the data as text:

In [None]:
# the feature_names array is not accepted by the function, therefore I use a list comprehension
# in the following line of code to translate the array to a list of strings
names=[i for i in cancer.feature_names]
theTree=tree.export_text(classif,feature_names=names) # translate the tree to a set of rules
print(theTree)

And we can display the tree in a figure:


In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif,feature_names=cancer.feature_names)
fig.suptitle('Cancer Data Set')
fig.savefig('plots/cancer1.png')

It is difficult to read the tree. But when you open the graphics on the disk, you can easily zoom in and explore the tree.

We can now output the accuracy of the decision tree with `score()`.

In [None]:
print(classif.score(X, y))

Perfect accuracy! Wow!

But is this realistic? **No!** There is something wrong. When we look at the leaves in the tree, there is no impurity. The Gini index in all leaves is 0. We achieved a minimum **bias**, i.e. minimum training errors. But the score is calculated on the same data. That is, we have no idea how well the tree can classify data it has not seen before.

Therefore we now split our data into two sets: a test set containing 33% of the data and a training set with the remaining data. We will derive a classification tree based on the training data and then see how well it performs on the (so far unseen) test data.

We now use the function `train_test_split()` to split the data into training and test data. It receives `X` and `y` as parameters. Optional parameters allow for the specification of the size of the two sets. The optional parameter `random_state` sets a reproducible state of the random number generator, as the splitting is controlled by random numbers. When the state of the random number generator is set, the same sequence of random numbers can be reproduced and, thus, "experiments" can be repeated.
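
The role of `random_state` can be checked with a small experiment. The following sketch uses synthetic arrays (`X_demo` and `y_demo` are assumptions for illustration, not the cancer data) and splits them twice with the same seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 records with 2 features and alternating labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# Two splits with the same random_state ...
first = train_test_split(X_demo, y_demo, test_size=0.33, random_state=33)
second = train_test_split(X_demo, y_demo, test_size=0.33, random_state=33)

# ... contain exactly the same records in the same order.
same = all(np.array_equal(a, b) for a, b in zip(first, second))

X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = first
print(X_tr_demo.shape, X_te_demo.shape, same)
```

Without `random_state` (or with a different value) the two splits would in general contain different records.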

The split is produced in the following line of code:

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33, random_state=33)

We now train a new model using the training data `X_train` and `y_train`. To allow for reproducibility we equally set the `random_state` in the classifier:

In [None]:
classif2 = tree.DecisionTreeClassifier(random_state=33)
classif2 = classif2.fit(X_train, y_train)

and we plot the result:

In [None]:
fig=plt.figure(dpi=600)
tree.plot_tree(classif2,feature_names=cancer.feature_names)
fig.suptitle('Cancer Data Set')
fig.savefig('plots/cancer2.png')

In order to determine the accuracy we use the test data:

In [None]:
print("accuracy classif2:",classif2.score(X_test, y_test))

As before, all leaves in the decision tree have an impurity of 0, i.e. the Gini index is 0. We equally observe that the accuracy is less than 100%, which makes sense as the test data has not been seen before. The gap between the perfect training accuracy and this lower test accuracy is a symptom of **variance**: the fully grown tree has also fitted peculiarities of the training data.
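
A single train/test split gives only one estimate of the accuracy on unseen data. As a hedged sketch of an alternative, `cross_val_score` (imported at the top of the notebook) repeats the experiment on several folds. The choice of `cv=5` and the variable names are assumptions for this example.

```python
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X_cv, y_cv = load_breast_cancer(return_X_y=True)
clf_cv = tree.DecisionTreeClassifier(random_state=33)

# 5-fold cross-validation: the tree is trained five times, each time on four
# folds, and evaluated on the remaining fold that it has not seen.
scores = cross_val_score(clf_cv, X_cv, y_cv, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

The spread of the fold accuracies gives an impression of how much the result depends on the particular split of the data.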

In order to determine the **confusion matrix** we need to predict the classes for the test data (`X_test`). To do so we use the method `predict()` of the classifier.

In [None]:
y_predict=classif2.predict(X_test)

And compute the confusion matrix:

In [None]:
cm=confusion_matrix(y_test,y_predict)
print(cm)

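The matrix is a `numpy` array and we can access its individual elements. The rows correspond to the true classes and the columns to the predicted classes; class 0 is malignant and class 1 is benign. Treating malignant as the positive class, the false negatives are found in row 0, column 1.

In [None]:
print('False negatives:',cm[0,1]) # true class malignant (row 0), predicted as benign (column 1)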

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
cmd = ConfusionMatrixDisplay(cm, display_labels=['WDBC-Malignant','WDBC-Benign'])
cmd.plot()

The classifier can be specified using different parameters. These can influence the variance.


In [None]:
classif3 = tree.DecisionTreeClassifier( min_samples_leaf=3, max_depth=4, max_features=8,random_state=33)
classif3 = classif3.fit(X_train, y_train)
y_predict=classif3.predict(X_test)
cm2=confusion_matrix(y_test,y_predict)
print('accuracy classif3:',classif3.score(X_test, y_test))
print(cm2)
cmd = ConfusionMatrixDisplay(cm2, display_labels=['WDBC-Malignant','WDBC-Benign'])
cmd.plot()

Use the documentation https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html to understand the different parameters.
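
To see what such parameters do to the tree itself, the depth and the number of leaves can be compared. The following sketch assumes the same split as above and restricts only `max_depth` and `min_samples_leaf`; the variable names `full` and `small` are chosen for this example.

```python
from sklearn import tree
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.33, random_state=33)

# An unrestricted tree and a tree with a limited depth and leaf size.
full = tree.DecisionTreeClassifier(random_state=33).fit(X_tr, y_tr)
small = tree.DecisionTreeClassifier(max_depth=4, min_samples_leaf=3,
                                    random_state=33).fit(X_tr, y_tr)

for name, model in [("unrestricted", full), ("restricted", small)]:
    print(name,
          "depth:", model.get_depth(),
          "leaves:", model.get_n_leaves(),
          "train accuracy:", model.score(X_tr, y_tr),
          "test accuracy:", model.score(X_te, y_te))
```

The restricted tree no longer reproduces the training data perfectly; whether its test accuracy improves depends on the data and the chosen values.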

Call the function `error_measures()` for `cm2` and compare the results.  
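
The function `error_measures()` is not defined in this notebook. As a rough, hypothetical stand-in, the following sketch derives a few common measures from a 2x2 confusion matrix. The name `error_measures_sketch`, the parameter `positive` and the example matrix are assumptions for this illustration; the course's own function may compute different values.

```python
import numpy as np


def error_measures_sketch(cm, positive=0):
    """Accuracy, precision and recall from a 2x2 confusion matrix.

    Rows are true classes, columns are predicted classes (sklearn convention).
    `positive` is the index of the class treated as positive.
    """
    cm = np.asarray(cm)
    negative = 1 - positive
    tp = cm[positive, positive]   # positive, predicted positive
    fn = cm[positive, negative]   # positive, predicted negative
    fp = cm[negative, positive]   # negative, predicted positive
    tn = cm[negative, negative]   # negative, predicted negative
    return {
        "accuracy": (tp + tn) / cm.sum(),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical confusion matrix, not a result from the notebook.
demo_cm = np.array([[50, 10],
                    [5, 35]])
print(error_measures_sketch(demo_cm))
```

The sketch does not guard against a division by zero, which occurs if no record is predicted as positive.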

# 4. Exercise

Load the penguins data set and induce a decision tree to predict the species.

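Please note that, depending on the version of `sklearn`, decision tree classifiers may not accept input with `nan` values. It is therefore safest to remove (or fill in) such values first.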
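
As a hint, missing values can be found with `isna()` and removed with `dropna()`. The following sketch uses a small hypothetical data frame; the column names follow the penguins data set, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical records with some missing measurements.
birds = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 46.5, 50.0],
    "body_mass_g": [3750.0, 3800.0, np.nan, 5700.0],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
})

print(birds.isna().sum())   # number of missing values per column

# Remove every row that contains at least one missing value.
clean = birds.dropna()
print(len(birds), "->", len(clean))
```

Dropping rows is the simplest option; it is only reasonable if few records are affected.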

In [None]:
import seaborn as sns

penguins=sns.load_dataset('penguins')

# Your code


*End of the Notebook*

<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-nd/4.0/88x31.png" /></a><br />This notebook was created by Christina B. Class for teaching at EAH Jena and is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>.