# Decision Tree Induction with scikit-learn

**CS5483 Data Warehousing and Data Mining**
___

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets, tree

%matplotlib widget
plt.close("all")

## Decision Tree Induction

We will use the [*iris dataset*](https://en.wikipedia.org/wiki/Iris_flower_data_set) from the [`sklearn.datasets` package](https://scikit-learn.org/stable/datasets/index.html).

In [None]:
iris = datasets.load_iris()

Recall that the classification task is to train a model that can automatically classify the species (*target*) based on the lengths and widths of the petals and sepals (*input features*).

To build a decision tree, we simply create a tree using `DecisionTreeClassifier` from `sklearn.tree` and apply its method `fit` on the training set.

In [None]:
clf_gini = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)
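
As a quick illustration of what the fitted classifier can do, the following minimal sketch calls the estimator methods `predict` and `score` on the training data. The variable names `clf`, `pred`, and `train_acc` are chosen here for illustration only, and the accuracy is measured on the training set itself, so it is an optimistic figure rather than an estimate of performance on unseen data.

```python
from sklearn import datasets, tree

iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Predict the class indices of the first five samples (all iris setosa)
pred = clf.predict(iris.data[:5])
print(iris.target_names[pred])

# Accuracy on the training set itself; optimistic because the tree
# has already seen these samples during fitting.
train_acc = clf.score(iris.data, iris.target)
print(f"Training accuracy: {train_acc:.3f}")
```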

To display the decision tree, we can use the function `plot_tree` from `sklearn.tree`:

In [None]:
plt.figure()
tree.plot_tree(clf_gini)
plt.show()

To make the decision tree look better, we can provide additional options:

In [None]:
options = {
    "feature_names": iris.feature_names,
    "class_names": iris.target_names,
    "label": "root",
    "filled": True,
    "node_ids": True,
    "proportion": True,
    "rounded": True,
    "fontsize": 7,
}  # store options as dict for reuse

plt.figure(figsize=(9, 6))
tree.plot_tree(clf_gini, **options)  # ** unpacks dict as keyword arguments
plt.show()

For each node:
- `___ <= ___` is the splitting criterion for internal nodes, satisfied only by samples going left.
- `gini = ...` shows the impurity index. By default, the algorithm uses the Gini impurity index to find the best binary split. Observe that the index decreases down the tree towards the leaves.
- `value = [_, _, _]` shows the number/fraction of examples for each of the three classes, and `class = ...` indicates the majority class, which may be used as the decision for a leaf node. The majority classes are also color coded. Observe that the color gets lighter towards the root, as the class distribution is more impure. 

In particular, check that iris setosa is distinguished immediately after checking the petal width/length.
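
The same splitting criteria can also be read in plain text. The sketch below uses `export_text` from `sklearn.tree`, which prints one line per node with its threshold or predicted class. It refits the classifier so that the block is self-contained; the names `clf` and `text` are illustrative.

```python
from sklearn import datasets, tree

iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

# Text rendering of the tree: internal nodes show "feature <= threshold",
# leaves show the predicted class index.
text = tree.export_text(clf, feature_names=iris.feature_names)
print(text)
```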

All the information about the decision tree is stored in the `tree_` attribute of the classifier. For more details:

In [None]:
help(clf_gini.tree_)
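
For instance, a few of the arrays held by `tree_` can be inspected directly. This is a minimal sketch that assumes node `0` is the root and that leaves are marked by a left-child index of `-1`. The names `t`, `root`, and `is_leaf` are illustrative.

```python
from sklearn import datasets, tree

iris = datasets.load_iris()
clf = tree.DecisionTreeClassifier(random_state=0).fit(iris.data, iris.target)

t = clf.tree_
root = 0  # assumption: node 0 is the root of the tree
print("number of nodes:", t.node_count)
print("root feature:", iris.feature_names[t.feature[root]])
print("root threshold:", t.threshold[root])
print("root impurity:", t.impurity[root])
print("samples at root:", t.n_node_samples[root])

# Leaves have no children; their left-child index is -1.
is_leaf = t.children_left == -1
print("number of leaves:", is_leaf.sum())
```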

**Exercise** Assign to `clf_entropy` the decision tree classifier created using *entropy* as the impurity measure. You can do so with the keyword argument `criterion='entropy'` in `DecisionTreeClassifier`. Furthermore, use `random_state=0` and fit the classifier on the entire iris dataset. Check whether the resulting decision tree is the same as the one created using the Gini impurity index.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

plt.figure(figsize=(9, 6))
tree.plot_tree(clf_entropy, **options)
plt.show()

YOUR ANSWER HERE

It is important to note that, although one can specify whether to use Gini impurity or entropy, `sklearn` implements neither C4.5 nor the full CART algorithm. According to its documentation, it uses an optimized version of CART that does not support categorical attributes. In particular, it supports only binary splits on numeric input attributes, unlike C4.5, which supports multi-way splits using information gain ratio.  
(See a [workaround][categorical].)

[categorical]: https://stackoverflow.com/questions/38108832/passing-categorical-data-to-sklearn-decision-tree
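
One common workaround is to one-hot encode a categorical attribute into 0/1 columns, each of which can then be split with a numeric threshold. The sketch below assumes this approach (the linked discussion may suggest others) and uses a hypothetical toy dataset: the names `toy`, `outlook`, and `play` are made up for illustration.

```python
import pandas as pd
from sklearn import tree

# Hypothetical toy data: one categorical feature and a binary target
toy = pd.DataFrame(
    {
        "outlook": ["sunny", "rainy", "overcast", "sunny", "rainy", "overcast"],
        "play": [0, 0, 1, 0, 0, 1],
    }
)

# One indicator column per category value
X_onehot = pd.get_dummies(toy[["outlook"]])
print(list(X_onehot.columns))

clf_toy = tree.DecisionTreeClassifier(random_state=0).fit(X_onehot, toy["play"])
print(tree.export_text(clf_toy, feature_names=list(X_onehot.columns)))
```

Note that a split on an indicator column separates one category value from all the others, which is still a binary split rather than a C4.5-style multi-way split.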

## Splitting Criterion

To induce a good decision tree efficiently, the splitting criterion is chosen 
- greedily to maximize the reduction in impurity and 
- recursively starting from the root.
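
The greedy, recursive procedure can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm `sklearn` actually runs: the helper names `gini`, `best_split`, and `grow` are hypothetical, only binary splits `x <= s` at midpoints are considered, and the tree is returned as nested tuples `(feature index, threshold, left subtree, right subtree)` with class labels at the leaves.

```python
import numpy as np


def gini(y):
    """Gini impurity of the empirical class distribution of y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - (p ** 2).sum()


def best_split(X, y):
    """Return (drop, feature index, threshold) of the best split, or None."""
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])  # sorted unique feature values
        for s in (values[1:] + values[:-1]) / 2:  # candidate midpoints
            left = X[:, j] <= s
            q = left.mean()
            drop = gini(y) - q * gini(y[left]) - (1 - q) * gini(y[~left])
            if best is None or drop > best[0]:
                best = (drop, j, s)
    return best


def grow(X, y, depth=0, max_depth=2):
    """Greedily pick the best split at this node, then recurse on the children."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None or split[0] <= 0:
        # Leaf: return the majority class
        values, counts = np.unique(y, return_counts=True)
        return values[np.argmax(counts)]
    _, j, s = split
    left = X[:, j] <= s
    return (
        j,
        s,
        grow(X[left], y[left], depth + 1, max_depth),
        grow(X[~left], y[~left], depth + 1, max_depth),
    )


# Toy example: one feature, two classes separated at 2.5
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 1, 1])
model = grow(X_toy, y_toy)
print(model)
```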

### Overview using pandas

To get a rough idea of which features are good to split on, we will use a [pandas](https://pandas.pydata.org/docs/user_guide/index.html) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas.DataFrame) to operate on the iris dataset.

In [None]:
# write the input features first
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# append the target values to the last column
df["target"] = iris.target
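df["target"] = df["target"].astype("category")
# rename the categories 0, 1, 2 to the species names
# (assigning to .cat.categories is not supported in recent pandas versions)
df["target"] = df["target"].cat.rename_categories(iris.target_names)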
df

To display some statistics of the input features for different classes:

In [None]:
df.groupby("target").boxplot(rot=90, layout=(1, 3), figsize=(7, 9))
df.groupby("target").agg(["mean", "std"]).round(2)

**Exercise** Identify good feature(s) based on the above statistics. Does your choice agree with the decision tree generated by `DecisionTreeClassifier`?

YOUR ANSWER HERE

### Measuring impurity

If nearly all instances of a dataset belong to the same class, i.e., the class distribution is nearly pure, we can simply return the majority class as the decision without further splitting. One measure of the impurity of a distribution is the Gini impurity index, defined as follows:

---

**Definition**

Given a dataset $D$ with a class attribute (discrete target), the Gini impurity index is defined as 
$$
\operatorname{Gini}(D):= g(p_0,p_1,\dots)
$$ (Gini)

where $(p_0,p_1,\dots)$ are probability masses corresponding to the empirical class distribution of $D$, and

$$
\begin{align}
g(p_0,p_1,\dots) &:= \sum_k p_k(1-p_k)\\
&= 1- \sum_k p_k^2.
\end{align}
$$ (g)

---

---

**Note**

For convenience, we may also write

- $g(\boldsymbol{p})$ for the stochastic vector $\boldsymbol{p}=\begin{bmatrix}p_0 & p_1 & \dots\end{bmatrix}$ of probability masses, and
- $g(p)$ for the probability mass function $p: k \mapsto p_k$.

---

We can represent a distribution simply as a numpy array. To return the empirical class distributions of the iris dataset:

In [None]:
def dist(values):
    """Returns the empirical distribution of the given 1D array of values as a
    1D array of probabilities. (The ordering is immaterial.)"""
    counts = np.unique(values, return_counts=True)[-1]
    return counts / counts.sum()


print(f"Distribution of target: {dist(iris.target).round(3)}")

The Gini impurity index can be implemented as follows:

In [None]:
def g(p):
    """Returns the Gini impurity of the distribution p."""
    return 1 - (p ** 2).sum()


print(f"Gini(D) = {g(dist(iris.target)):.3g}")
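
As a check by hand: the iris dataset has 50 instances in each of its three classes, so the empirical class distribution is uniform and

$$
\begin{align}
\operatorname{Gini}(D) = g\left(\tfrac13,\tfrac13,\tfrac13\right) = 1 - 3\left(\tfrac13\right)^2 = \tfrac23 \approx 0.667.
\end{align}
$$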

Another measure of impurity uses the information measure called entropy in information theory:

---

**Definition**

The information content is defined as 

$$
\operatorname{Info}(D):= h(p_0,p_1,\dots),
$$ (Info)

which is the entropy of the class distribution

$$
\begin{align}
h(\boldsymbol{p}) = h(p_0,p_1,\dots) &= \sum_{k:p_k>0} p_k \log_2 \frac1{p_k}.
\end{align}
$$ (h)

---

---

**Note**

- For convenience, we often omit the constraint $p_k>0$ by regarding $0 \log \frac10$ as the limit $\lim_{p\to 0} p \log \frac1{p} = 0$.
- Unless otherwise specified, all logarithms are base 2, i.e., $\log = \lg$, in which case the information quantities are measured in *bits* (binary digits). A popular alternative is to use the natural logarithm, $\log = \ln$, in which case the unit is the *nat*.

---
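
As a worked example of the definition, consider the distribution $\left(\tfrac12,\tfrac14,\tfrac14\right)$:

$$
\begin{align}
h\left(\tfrac12,\tfrac14,\tfrac14\right) &= \tfrac12 \log_2 2 + \tfrac14 \log_2 4 + \tfrac14 \log_2 4\\
&= \tfrac12 + \tfrac12 + \tfrac12 = 1.5 \text{ bits}.
\end{align}
$$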

**Exercise** Complete the following function to compute the entropy of a distribution. You may use the function `log2` from `numpy` to calculate the logarithm base 2.

---

**Hint**

Consider the solution template:

```Python
def h(p):
    ...
    return (p * ___ * ___).sum()
```

---

In [None]:
def h(p):
    """Returns the entropy of distribution p (1D array)."""
    p = np.array(p)
    p = p[(p > 0) & (p < 1)]  # 0 log 0 = 1 log 1 = 0
    # YOUR CODE HERE
    raise NotImplementedError()


print(f"Info(D): {h(dist(iris.target)):.3g}")

In [None]:
# tests
assert np.isclose(h([1 / 2, 1 / 2]), 1)
assert np.isclose(h([1, 0]), 0)
assert np.isclose(h([1 / 2, 1 / 4, 1 / 4]), 1.5)

In [None]:
# hidden tests

### Drop in impurity

---

**Definition**

The drop in Gini impurity for a splitting criterion $A$ on a dataset $D$ with class attribute is defined as

$$
\begin{align}
\Delta \operatorname{Gini}_A(D) &:= \operatorname{Gini}(D) - \operatorname{Gini}_A(D)\\
\operatorname{Gini}_A(D) &:= \sum_{j} \frac{|D_j|}{|D|} \operatorname{Gini}(D_j),
\end{align}
$$ (Delta-Gini)

where $D$ is split by $A$ into $D_j$ for different outcomes $j$ of the split.

---

We will consider the binary splitting criterion $X\leq s$ in particular, which gives

$$
\begin{align}
\Delta \operatorname{Gini}_A(D) = g(\hat{P}_Y) - \left[\hat{P}\{X\leq s\} g(\hat{P}_{Y|X\leq s}) + \hat{P}\{X> s\}g(\hat{P}_{Y|X> s})\right]
\end{align}
$$ (Delta-Gini-binary)

where 

- $Y$ denotes the target,
- $\hat{P}$ denotes the empirical distribution, and
- $\hat{P}_{Y|X\leq s}$, $\hat{P}_{Y|X> s}$, and $\hat{P}_{Y}$ denote the empirical probability mass functions of $Y$ with or without conditioning.

In [None]:
def drop_in_gini(X, Y, s):
    """Returns the drop in Gini impurity of the target Y
    for the binary splitting criterion X <= s.

    Parameters
    ----------
    X: 1D array
        Input feature values for different instances.
    Y: 1D array
        Target values corresponding to X.
    s: Splitting point for X.
    """
    S = X <= s
    q = S.mean()
    return g(dist(Y)) - q * g(dist(Y[S])) - (1 - q) * g(dist(Y[~S]))


X, Y = df["petal width (cm)"], df.target
print(f"Drop in Gini: {drop_in_gini(X, Y, 0.8):.4g}")
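
The value for the split on petal width at $s=0.8$ can be verified by hand. All 50 iris setosa instances satisfy $X\leq 0.8$ and the other 100 instances, split evenly between the two remaining classes, do not. Hence,

$$
\begin{align}
\Delta \operatorname{Gini}_A(D) &= g\left(\tfrac13,\tfrac13,\tfrac13\right) - \left[\tfrac13\, g(1) + \tfrac23\, g\left(\tfrac12,\tfrac12\right)\right]\\
&= \tfrac23 - \left[\tfrac13 \cdot 0 + \tfrac23 \cdot \tfrac12\right] = \tfrac13 \approx 0.3333.
\end{align}
$$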

To compute the best splitting point for a given input feature, we check the midpoint of every pair of consecutive observed feature values:

In [None]:
def find_best_split_pt(X, Y, gain_function):
    """Return the best splitting point s and the maximum gain evaluated using
    gain_function for the split X <= s and target Y.

    Parameters
    ----------
    X: 1D array
        Input feature values for different instances.
    Y: 1D array
        Target values corresponding to X.
    gain_function: function of (X, Y, s)
        A function such as drop_in_gini for evaluating a
        splitting criterion X <= s.

    Returns
    -------
    tuple: (s, g) where s is the best split point and g is the maximum gain.

    See also
    --------
    drop_in_gini
    """
    values = np.unique(X)  # np.unique already returns the sorted unique values
    split_pts = (values[1:] + values[:-1]) / 2
    gain = np.array([gain_function(X, Y, s) for s in split_pts])
    i = np.argmax(gain)
    return split_pts[i], gain[i]


print(
    """Best split point: {0:.3g}
Maximum gain: {1:.3g}""".format(
        *find_best_split_pt(X, Y, drop_in_gini)
    )
)
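
To see how the candidate splitting points are formed, the following small sketch applies the same midpoint computation to a made-up array (the values in `x` are illustrative and not taken from the iris dataset):

```python
import numpy as np

x = np.array([1.0, 1.0, 2.0, 4.0])  # made-up feature values with a duplicate

values = np.unique(x)  # sorted unique values: 1, 2, 4
split_pts = (values[1:] + values[:-1]) / 2  # midpoints of consecutive values
print(split_pts)
```

Duplicates are removed first, so $n$ distinct feature values give $n-1$ candidate splitting points.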

The following ranks the features according to the gains of their best binary splits:

In [None]:
rank_by_gini = pd.DataFrame(
    {
        "feature": feature,
        **(lambda s, g: {"split point": s, "gain": g})(
            *find_best_split_pt(df[feature], df.target, drop_in_gini)
        ),
    }
    for feature in iris.feature_names
).sort_values(by="gain", ascending=False)
rank_by_gini

Using the entropy to measure impurity, we have the following alternative gain function:

---

**Definition**

The information gain is defined as 

$$
\begin{align}
\operatorname{Gain}_A(D) &:= \operatorname{Info}(D) - \operatorname{Info}_A(D) && \text{where}\\
\operatorname{Info}_A(D) &:= \sum_{j} \frac{|D_j|}{|D|} \operatorname{Info}(D_j),
\end{align}
$$ (Gain)

---

We will again consider the binary splitting criterion $X\leq s$ in particular, which gives

$$
\begin{align}
\operatorname{Gain}_A(D) = h(\hat{P}_Y) - \left[\hat{P}\{X\leq s\} h(\hat{P}_{Y|X\leq s}) + \hat{P}\{X> s\}h(\hat{P}_{Y|X> s})\right]
\end{align}
$$ (Gain-binary)
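
For example, consider again the split on petal width at $s=0.8$, which separates the 50 iris setosa instances from the other 100 instances. The left branch is pure and the right branch has two equally likely classes, so

$$
\begin{align}
\operatorname{Gain}_A(D) &= h\left(\tfrac13,\tfrac13,\tfrac13\right) - \left[\tfrac13\, h(1) + \tfrac23\, h\left(\tfrac12,\tfrac12\right)\right]\\
&= \log_2 3 - \left[\tfrac13 \cdot 0 + \tfrac23 \cdot 1\right] \approx 1.585 - 0.667 = 0.918 \text{ bits}.
\end{align}
$$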

**Exercise** Complete the following function to calculate the information gain on the target $Y$ for a binary split $X\leq s$. You may use `dist` and `h` defined previously.

In [None]:
def gain(X, Y, s):
    """Returns the information Gain of Y for the split X <= s.

    Parameters
    ----------
    X: 1D array
        Input feature values for different instances.
    Y: 1D array
        Target values corresponding to X.
    s: Splitting point for X.
    """
    S = X <= s
    q = S.mean()
    # YOUR CODE HERE
    raise NotImplementedError()


print(f"Information gain: {gain(X, Y, 0.8):.4g}")

In [None]:
# tests
rank_by_entropy = pd.DataFrame(
    {
        "feature": feature,
        **(lambda s, g: {"split point": s, "gain": g})(
            *find_best_split_pt(df[feature], df.target, gain)
        ),
    }
    for feature in iris.feature_names
).sort_values(by="gain", ascending=False)
rank_by_entropy

In [None]:
# hidden tests

The C4.5 induction algorithm uses information gain ratio instead of information gain:

---

**Definition**

The information gain ratio is defined as 

$$
\begin{align}
\operatorname{GainRatio}_A(D) &:= \frac{\operatorname{Gain}_A(D)}{\operatorname{SplitInfo}_A(D)}
\end{align}
$$ (GainRatio)

which is normalized by 

$$
\begin{align}
\operatorname{SplitInfo}_A(D) &:= h\left(j\mapsto \frac{|D_j|}{|D|} \right)=\sum_j \frac{|D_j|}{|D|}\log \frac{|D|}{|D_j|}.
\end{align}
$$ (SplitInfo)

---

For the binary split $X\leq s$,

$$
\operatorname{SplitInfo}_A(D) := h\left(\hat{P}\{X\leq s\}, \hat{P}\{X> s\}\right)
$$ (SplitInfo-binary)

in terms of the empirical distribution.
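
For example, the split on petal width at $s=0.8$ sends 50 of the 150 instances to the left branch and 100 to the right branch, so

$$
\begin{align}
\operatorname{SplitInfo}_A(D) &= h\left(\tfrac13,\tfrac23\right) = \tfrac13 \log_2 3 + \tfrac23 \log_2 \tfrac32\\
&= \log_2 3 - \tfrac23 \approx 0.918 \text{ bits}.
\end{align}
$$

This equals the information gain $\log_2 3 - \tfrac23$ of the same split, because the outcome of the split is completely determined by the class. The information gain ratio of this particular split is therefore 1.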

**Exercise** Complete the following function to calculate the *information gain ratio* for a binary split $X\leq s$ and target $Y$.

In [None]:
def gain_ratio(X, Y, split_pt):
    """Returns the information gain ratio of Y for the split X <= split_pt."""
    S = X <= split_pt
    q = S.mean()
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
# tests
rank_by_gain_ratio = pd.DataFrame(
    {
        "feature": feature,
        **(lambda s, g: {"split point": s, "gain": g})(
            *find_best_split_pt(df[feature], df.target, gain_ratio)
        ),
    }
    for feature in iris.feature_names
).sort_values(by="gain", ascending=False)
rank_by_gain_ratio

In [None]:
# hidden tests

**Exercise** Does the information gain ratio give a different ranking of the features? Why?

Information gain ratio gives the same ranking as the information gain in this case. This is because the split is restricted to be binary and so the normalization by split information has little effect on the ranking.