# Classification with Decision Trees

This notebook will show how to use a decision tree model to classify banknotes as either authentic or inauthentic. 

## What is a Decision Tree?
A decision tree is a flowchart-like structure in which each internal node represents a test on a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node is known as the root node. The tree is built by recursive partitioning: the algorithm first chooses a feature value that splits the data, then repeats the same process on each resulting subset. Because the result can be drawn as a flowchart that mirrors how people reason through a sequence of questions, decision trees are easy to understand and interpret.

![Decision Tree Diagram](./res/dtree.jpg 'Decision Tree')
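
To make the flowchart analogy concrete, here is a minimal hand-written sketch of a depth-2 tree expressed as nested conditionals. The function name `classify_note` and both thresholds are made up for illustration; they are not learned from the banknote data.

```python
def classify_note(variance: float, skewness: float) -> str:
    """Toy depth-2 decision tree with made-up thresholds (illustration only)."""
    if variance <= 0.0:              # root node: first decision rule
        if skewness <= 5.0:          # internal node: second decision rule
            return "inauthentic"     # leaf node: outcome
        return "authentic"           # leaf node: outcome
    return "authentic"               # leaf node reached directly from the root


print(classify_note(variance=-2.0, skewness=1.0))   # inauthentic
print(classify_note(variance=3.5, skewness=-1.0))   # authentic
```

A trained tree has the same shape; the difference is that the features and thresholds are chosen by the learning algorithm rather than written by hand.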


In this notebook, we start by importing the necessary library modules and functions as follows:

* From the `matplotlib` library, we import the `pyplot` module for creating the visualizations we need.
* We import the `pandas` library for loading the data from a CSV file.
* From `sklearn.model_selection`, we import the `train_test_split()` function, which we use to split our data into training and testing data.
* From `sklearn`, we import the `tree` module, which we use when plotting the decision tree diagram.
* From `sklearn.tree`, we import the `DecisionTreeClassifier` class, which we use as the classification model in this notebook.
* Also from `sklearn.tree`, we import the `export_text()` function, which allows us to build a text report showing the rules of a decision tree.
* From `sklearn.metrics`, we import the `f1_score()`, `accuracy_score()`, `precision_score()` and `recall_score()` functions to evaluate our model.

In [None]:
import matplotlib.pyplot as plt # the visualization framework
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split # split the data in train/test
from sklearn import tree   # used for plotting the tree
from sklearn.tree import DecisionTreeClassifier # our classifier
from sklearn.tree import export_text # export the decision tree as a text report
from sklearn.metrics import f1_score, accuracy_score, \
                            precision_score, recall_score   # evaluation metrics

## Loading the Data

The data set we use here is stored in the CSV file `"./data/bank_notes.csv"`. We call the `read_csv` function through the short name of the `pandas` library (i.e., `pd`), passing the file path as an argument. The function returns a data frame, which we assign to the variable `data`. We use that variable to show the top five rows (using the `head` method) and then print the summary statistics (count, mean, standard deviation, etc.) using the `describe` method.

In [None]:
# load the dataset
data = pd.read_csv("./data/bank_notes.csv")

print("First five rows:")
print(data.head())  # display the first 5 rows

print("\n\n")

print("Summary statistics:")
print(data.describe())  # show the summary statistics for the dataset

The [BankNote Authentication Dataset](https://archive.ics.uci.edu/ml/datasets/banknote+authentication) is about distinguishing genuine from forged banknotes. The data were extracted from images of genuine and forged banknote-like specimens. The images were digitized with an industrial camera of the kind usually used for print inspection. The final images are 400 x 400 pixels. Given the object lens and the distance to the investigated object, the result was gray-scale pictures with a resolution of about 660 dpi. A Wavelet Transform tool was used to extract features from these images.

### Attribute Information

* **variance** of Wavelet Transformed image (continuous).
* **skewness** of Wavelet Transformed image (continuous).
* **curtosis** of Wavelet Transformed image (continuous).
* **entropy** of image (continuous).
* **class** The target label (discrete): $0$ for *authentic* and $1$ for *inauthentic*.

The number of examples in the data set (as shown in the summary statistics) is 1372. 

Read [James D. McCaffrey's blog post](https://bit.ly/3pc6eQt) to learn about the label of this dataset.

## Splitting the Data into Train/Test

Before splitting the data, we take two slices from the data frame (i.e., `data`). The first slice contains the features, which we assign to a variable named `X`. The second slice contains the target (i.e., `class`), which we assign to the `y` variable.

We then use `train_test_split()` to randomly split our data into training and testing data. The function expects two sequences of data:

* `X` sequence containing the features.
* `y` sequence containing the target label.

In addition to the sequences, we pass two parameters:

* `test_size` defines the size of the test set. A float between 0 and 1 is read as the proportion of the data to hold out (here, `0.30` means 30%).
* `random_state` is an integer that seeds the random shuffling applied before the split. **To make your tests reproducible, it is essential to set this parameter. Otherwise, you will get different splits each time you run your code.**

`train_test_split()` performs the split and returns four sequences in this order:

1. `X_train`: The training part of the first sequence (`X`)
2. `X_test`: The test part of the first sequence (`X`)
3. `y_train`: The training part of the second sequence (`y`)
4. `y_test`: The test part of the second sequence (`y`)
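
As a quick check of the behavior described above, the following sketch splits a small made-up array and confirms that a fixed `random_state` reproduces the same split. The names `X_toy` and `y_toy` are hypothetical stand-ins, not the banknote data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (illustration only): 10 examples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# 30% of 10 examples goes to the test set, the rest to the training set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.30,
                                          random_state=0)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# calling again with the same random_state returns the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy,
                                              test_size=0.30,
                                              random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```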

In [None]:
X = data.iloc[:,:-1] # the features are all the columns except the last one
y = data.iloc[:,-1] # the last column is the label

# split the data into training (70%) and testing (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.30, 
                                                    random_state=0)

## Creating the Model
Now that the data is split, it is time to fit the model. With scikit-learn, we do this in two lines of code. The first line creates a classifier object (i.e., `clf`) using the `DecisionTreeClassifier` class we imported earlier. Here we pass two parameters to the `DecisionTreeClassifier()` constructor. The first (`max_depth`) specifies the maximum depth of the tree. The second (`random_state`) is an integer that seeds the estimator's random number generator. The decision tree algorithm randomly permutes the features at each split, so to get reproducible results across different runs, we must specify a `random_state`.

The second line calls the `clf` object's `fit()` method by passing two parameters: `X_train` and `y_train`. 
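
To see what `max_depth` controls, the sketch below fits trees of different depths on synthetic data and prints the resulting depth and leaf count. The data comes from scikit-learn's `make_classification()` and is only a stand-in; the names `X_syn`, `y_syn` and `models` are hypothetical and the numbers say nothing about the banknote data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (NOT the banknote data): 200 rows, 4 features
X_syn, y_syn = make_classification(n_samples=200, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)

models = {}
for depth in (1, 2, None):  # None places no limit on the depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_syn, y_syn)
    models[depth] = model
    print(f"max_depth={depth}: depth={model.get_depth()}, "
          f"leaves={model.get_n_leaves()}")
```

A shallow tree is easier to read and plot, which is one reason to keep `max_depth` small in a notebook like this one.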

In [None]:
# create a classifier object and fit the model
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

## Visualizing the Model

To better understand the learned model, it helps to visualize the decision tree. To produce a tree figure, we use the `plot_tree` function of the `tree` module. We pass this function the classifier object (i.e., `clf`), the list of feature names (`fn`), the list of class names (`cn`), and the axes to plot on (`axes`, passed as the `ax` argument), and we set `filled` to `True` to color the nodes. See the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html) for more information on this function.

In [None]:
# plot the decision tree

# feature names - used to label the decision rules in the tree
fn = ['variance','skewness','curtosis','entropy']

# class names - used to label the predicted class in each node
cn = ['authentic', 'inauthentic']

# create a figure object for plotting
fig, axes = plt.subplots(nrows=1, ncols=1,  figsize = (15,10))

# plot the tree
tree.plot_tree(clf,
               feature_names=fn, 
               class_names=cn,
               ax=axes,
               filled=True)

plt.show()

In addition to plotting the tree, we can get a textual representation of the decision tree using the `export_text` function. We pass two arguments to this function: the classifier object and the feature names.

See the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.export_text.html?highlight=export_text#sklearn.tree.export_text) for the `export_text` function.
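
To get a feel for the report format before running it on the banknote model, here is a minimal sketch on a made-up one-feature dataset. The names `X_tiny`, `y_tiny`, `tiny_clf` and the feature name `x` are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# tiny made-up dataset: small x values are class 0, large ones are class 1
X_tiny = [[0.0], [1.0], [2.0], [3.0]]
y_tiny = [0, 0, 1, 1]

# a depth-1 tree needs only one rule to separate the two classes
tiny_clf = DecisionTreeClassifier(max_depth=1, random_state=0)
tiny_clf.fit(X_tiny, y_tiny)

# each indented line in the report is a rule or a leaf with its class
report = export_text(tiny_clf, feature_names=["x"])
print(report)
```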

In [None]:
# export the decision tree as a report
r = export_text(clf, feature_names=fn)
print(r)

## Model Evaluation
Last, we evaluate our model using various classification metrics by comparing our model's predictions against the ground truth. We first call the `predict()` method, passing the `X_test` split of the data. The returned predictions for each of the examples in `X_test` are saved in the `y_pred` array. We then pass the ground truth (i.e., `y_test`) along with the `y_pred` array to the scoring functions (e.g., `accuracy_score()` and `precision_score()`).
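
For reference, the four metrics can be computed by hand from the counts of true/false positives and negatives. The sketch below does this on made-up labels, treating class $1$ as the positive class, which is the default `pos_label` for binary targets in scikit-learn. The arrays `y_true_toy` and `y_pred_toy` are hypothetical.

```python
# made-up labels for illustration (1 = positive class, 0 = negative class)
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 0, 1, 1, 1]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)          # share of correct predictions
precision = tp / (tp + fp)                 # of predicted positives, how many are right
recall = tp / (tp + fn)                    # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy : {accuracy:.2f}")   # 0.62
print(f"Precision: {precision:.2f}")  # 0.60
print(f"Recall   : {recall:.2f}")     # 0.75
print(f"F1 Score : {f1:.2f}")         # 0.67
```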

In [None]:
# now we make some predictions
y_pred = clf.predict(X_test)

# let us evaluate the predictions
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy : {acc:.2f}")

pr = precision_score(y_test, y_pred)
print(f"Precision: {pr:.2f}")

re = recall_score(y_test, y_pred)
print(f"Recall   : {re:.2f}")

f1 = f1_score(y_test, y_pred)
print(f"F1 Score : {f1:.2f}")


As you can see, the model is $90\%$ accurate, which means it predicts the correct class $90\%$ of the time on the test set.