# Prepare Environment
The first step is to import the modules needed for this Colab notebook.

In [None]:
# Load libraries
import numpy as np
import pandas as pd

# Create a Simple Dataset
We start by creating a small fruit dataset that can be used to train a decision tree.
Here, we will create the dataset using DataFrame from the Pandas library.

In [None]:
df = pd.DataFrame({
    'color': ['green','yellow','red','red','yellow'],
    'diameter': [3,3,1,1,3],
    'label': ['apple','apple','grape','grape','lemon']  # 0: Apple, 1:Grape, 2: Lemon
})
df

As you can see from the output above, we have created a fruit dataset consisting of three columns: `color`, `diameter` and `label`.

The first two columns (i.e., `color` and `diameter`) are the **features**, or the characteristics of each fruit, while the last column (i.e., `label`) is the **label**, or the answer that we expect the decision tree to give when it receives the color and diameter values.

# Categorical Columns

It should be emphasized that most ML algorithms expect numerical features (e.g., integers and floating-point numbers) as input.

However, two columns in our dataset are NOT numerical: `color` and `label`. We need to convert them into numerical ones.

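For categorical features (i.e., `color`), we commonly convert them into what is called **one-hot** format, in which each category gets its own 0/1 column. We **DO NOT** use a single number such as 0, 1, 2, because that would impose an artificial order on the categories (e.g., it would suggest that `yellow` is "greater than" `green`).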

```
color=green  --> 0 --> [1, 0, 0]
color=red    --> 1 --> [0, 1, 0]
color=yellow --> 2 --> [0, 0, 1]
```

It should be noted that it **DOES NOT** matter which number you assign for `red`, `green` and `yellow` as long as they are consistent.
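
The mapping above can also be sketched by hand with NumPy, which may help to see what one-hot encoding does before relying on a library helper. This is only an illustration: the names `colors`, `index_of` and `one_hot` are not part of the notebook, and the alphabetical ordering of the categories is an arbitrary choice.

```python
import numpy as np

colors = ['green', 'yellow', 'red', 'red', 'yellow']

# Assign each category an integer index (alphabetical here; any consistent order works)
categories = sorted(set(colors))  # ['green', 'red', 'yellow']
index_of = {c: i for i, c in enumerate(categories)}

# Row i of the identity matrix is the one-hot vector for category i
codes = np.array([index_of[c] for c in colors])
one_hot = np.eye(len(categories), dtype=int)[codes]
print(one_hot)
```

Each row contains exactly one `1`, in the column that belongs to that row's color.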

The following code shows how to use the `get_dummies` function to convert a categorical feature into one-hot format.

In [None]:
# dtype=int shows the columns as 0/1 (recent pandas versions default to True/False)
pd.get_dummies(df['color'], prefix='color', dtype=int)

We then append these one-hot columns to the dataframe.

In [None]:
# dtype=int stores the columns as 0/1 (recent pandas versions default to True/False)
color_code_df = pd.get_dummies(df['color'], prefix='color', dtype=int)
df = pd.concat([df, color_code_df], axis=1)
df

Next, we will convert the `label` column into integer numbers (e.g., 0, 1 and 2). Again, it **DOES NOT** matter which number you assign as long as you are consistent for the task.

Here we will use `LabelEncoder` from scikit-learn to convert from string to class numbers, and then create a new column, named `label_code`, to keep the output.

In [None]:
from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
df['label_code'] = label_enc.fit_transform(df['label'])
print(label_enc.classes_)

df
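
To make the label mapping explicit, the following minimal sketch builds a name-to-code dictionary from a fitted encoder. It relies on the fact that `LabelEncoder` stores the sorted class names in `classes_`, so a class's position in that array is its integer code. The variable names `labels`, `enc` and `mapping` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['apple', 'apple', 'grape', 'grape', 'lemon']

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ is sorted; the index of a class in classes_ is its integer code
mapping = {name: code for code, name in enumerate(enc.classes_)}
print(mapping)
print(codes)
```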

Now that we have converted the categorical columns into numerical ones, we can drop the original categorical columns from the dataframe.

In [None]:
data_df = df.drop(columns=['color','label'])
data_df

# Prepare a Training Set

As mentioned in the slides, a training set consists of pairs of data (or features) and labels. We will therefore extract the features and the labels from the dataframe.

We typically use `X` for features and `y` for labels.

In [None]:
# Prepare the training set
X = data_df.drop(columns=['label_code']).values
y = data_df['label_code'].values
print(X)
print(y)

The following code gets the name of each feature column and stores the names in a `feature_names` variable.

In [None]:
feature_names = data_df.drop(columns=['label_code']).columns.values
feature_names

# Train a Decision Tree

In this section, we will create and train a decision tree model using [scikit-learn](https://scikit-learn.org/), which is one of the most popular Python packages for machine learning.

The class that we will use is [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create the decision tree classifier object
clf = DecisionTreeClassifier(random_state=42)

To train the model, we simply call the `fit` method with the training set that we have prepared: `X` and `y`.

In [None]:
# Train the decision tree classifier
clf = clf.fit(X,y)

# Prediction

In this section, we will use the *trained* decision tree to predict the types of fruit based on the `color` and `diameter`.

To make predictions, we call the `predict` method with the input features. Let's try it on the training set.

In [None]:
# Predict the classes for the training set
y_pred = clf.predict(X)
print(y_pred)

It can be seen that the predictions are still class numbers. If we want to know the name of each class, we can use the same `LabelEncoder` to map the predicted class numbers back to strings.

In [None]:
print(label_enc.inverse_transform(y_pred))
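
Predicting for a fruit that is not in the training set requires encoding it with exactly the same columns, in the same order, as the training features. The following self-contained sketch shows one way to do that. It rebuilds the same small dataset, and the names `train`, `new_fruit`, `X_new` and `model`, as well as the use of `reindex` to fill in missing one-hot columns, are illustrative assumptions rather than part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy dataset as in the notebook
train = pd.DataFrame({
    'color': ['green', 'yellow', 'red', 'red', 'yellow'],
    'diameter': [3, 3, 1, 1, 3],
    'label': ['apple', 'apple', 'grape', 'grape', 'lemon'],
})
X_train = pd.get_dummies(train[['color', 'diameter']], columns=['color'], dtype=int)
enc = LabelEncoder()
y_train = enc.fit_transform(train['label'])
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A hypothetical new fruit: red, diameter 1
new_fruit = pd.DataFrame({'color': ['red'], 'diameter': [1]})
X_new = pd.get_dummies(new_fruit, columns=['color'], dtype=int)

# Align with the training columns; colors absent from the new data become 0
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)

print(enc.inverse_transform(model.predict(X_new)))
```

Without the `reindex` step, the new sample would have only the `diameter` and `color_red` columns, and the model would reject it because the features do not match those seen during training.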

One common metric for evaluating the performance of a classifier is **accuracy**, which is the fraction of predictions that match the true labels. Comparing `y_pred` with `y` element-wise and taking the mean gives exactly this fraction.

In [None]:
np.mean(y_pred == y)
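
The same quantity is available as `accuracy_score` in scikit-learn. The sketch below compares the two on hard-coded arrays; the values in `y_hat` are illustrative and are not claimed to be the notebook's actual output. Note that the dataset contains two fruits with identical features (yellow, diameter 3) but different labels (apple and lemon), so no model that relies only on these features can reach 100% accuracy on this training set.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative values: true class codes and one possible set of predictions
y_true = np.array([0, 0, 1, 1, 2])
y_hat = np.array([0, 0, 1, 1, 0])

manual = np.mean(y_hat == y_true)        # fraction of matching entries
library = accuracy_score(y_true, y_hat)  # same metric from scikit-learn
print(manual, library)
```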

# Visualize the Trained Decision Tree

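It is also helpful to understand the criteria that the model uses to make predictions. For a decision tree, we can use the `export_graphviz` function to visualize the tree. The cell below also requires the Graphviz `dot` command-line tool to convert the exported file into an image.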

In [None]:
from sklearn.tree import export_graphviz
from subprocess import call
import matplotlib.pyplot as plt

# Export the decision tree
export_graphviz(
    clf,                             # the trained decision tree here
    feature_names=feature_names,     # the list of feature names here
    class_names=label_enc.classes_,  # the list of labels here
    out_file='tree.dot',
    rounded=True, proportion=False, precision=2, filled=True)

# Convert to png
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in python
plt.figure(figsize=(10,12))
plt.imshow(plt.imread('tree.png'))
plt.axis('off')
plt.show()
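
If the Graphviz `dot` tool is not available, a plain-text view of the tree is a possible alternative. The sketch below uses `export_text` from `sklearn.tree` on a small stand-alone copy of the data; the names `X_demo`, `y_demo`, `names` and `tree` are illustrative, and the column order is assumed to be `diameter` followed by the three one-hot color columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: diameter, color_green, color_red, color_yellow
X_demo = np.array([
    [3, 1, 0, 0],
    [3, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [3, 0, 0, 1],
])
y_demo = np.array([0, 0, 1, 1, 2])  # 0: apple, 1: grape, 2: lemon
names = ['diameter', 'color_green', 'color_red', 'color_yellow']

tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Render the learned split rules as indented text
rules = export_text(tree, feature_names=names)
print(rules)
```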

We can also see which features are the most important for predicting the type of fruit.

Here we can use the attribute `feature_importances_` from the trained model `clf`.

In [None]:
for i in range(len(feature_names)):
    print(f'{feature_names[i]}: {clf.feature_importances_[i]}')
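
For easier reading, the importances can be put into a pandas `Series` and sorted. The sketch below does this on a stand-alone copy of the data (the names `X_demo`, `y_demo`, `names` and `importances` are illustrative). For a tree with at least one split, scikit-learn normalizes the importances so that they sum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Columns: diameter, color_green, color_red, color_yellow
X_demo = np.array([
    [3, 1, 0, 0],
    [3, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [3, 0, 0, 1],
])
y_demo = np.array([0, 0, 1, 1, 2])
names = ['diameter', 'color_green', 'color_red', 'color_yellow']

tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# Label each importance with its feature name and sort from most to least important
importances = pd.Series(tree.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```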