# A computer vision example with decision trees

In this notebook we study a case related to computer vision. We do not process any images here, however: the creators of the dataset have already done that and extracted numerical features from them. The example uses the dataset described below.

## Banknote data

Data were extracted from images taken of genuine and forged banknote-like specimens. The specimens were digitized with an industrial camera of the kind normally used for print inspection. The final images are 400 x 400 pixels. Because of the object lens and the distance to the investigated object, the resulting gray-scale pictures have a resolution of about 660 dpi. A wavelet transform tool was used to extract features from the images.

<img src="banknote.jpg" style="width: 30%; height: 20%">

Each record has five attributes:

1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. curtosis (kurtosis) of Wavelet Transformed image (continuous)
4. entropy of image (continuous)
5. class (integer)

Courtesy of: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

## First step
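Open a terminal in this environment and run the following command:

pip install graphviz

This installs the Python bindings only; rendering `.dot` files to images also requires the Graphviz executables on the system.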

## Fire up the libraries

In [2]:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import graphviz 
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import math
%matplotlib inline
# Seed NumPy's global RNG so that the train/test split below is reproducible
np.random.seed(1234)

## Import and check the data

In [3]:
notes = pd.read_csv('data_banknote_authentication.txt',header=None)
notes.sample(5)

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 266 | -0.016103 | 9.7484 | 0.15394 | -1.6134 | 0 |
| 760 | 3.2414 | 0.40971 | 1.4015 | 1.1952 | 0 |
| 804 | -0.28015 | 3.0729 | -3.3857 | -2.9155 | 1 |
| 1136 | -0.41645 | 0.32487 | -0.33617 | -0.36036 | 1 |
| 1016 | -2.0042 | -9.3676 | 9.3333 | -0.10303 | 1 |


In [4]:
notes.shape

(1372, 5)

## Tidy the data
The last column (label 4, the fifth column) is the target. Now split the target from the features.

In [8]:
y = notes[4]
y.sample(10)

432     0
1061    1
846     1
857     1
337     0
1230    1
903     1
1164    1
928     1
847     1
Name: 4, dtype: int64

In [9]:
notes.drop(4,axis=1,inplace=True)

In [10]:
notes.columns = ['Variance','Skewness','Curtosis','Entropy']
notes.head()

| | Variance | Skewness | Curtosis | Entropy |
|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 |


## Split the data
Hold out 20% of the rows as the test set.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(notes, y, test_size=0.2)
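
As a rough check of what this split produces, the sketch below works out the expected sizes of the two subsets. It assumes that scikit-learn rounds the test share up to a whole number of rows, which is an assumption about `train_test_split` rather than something shown in this notebook; `n_rows` is taken from the `notes.shape` output above.

```python
import math

n_rows = 1372      # number of rows reported by notes.shape
test_size = 0.2    # same fraction passed to train_test_split

# Assumption: the test share is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test

print(n_train, n_test)  # 1097 275
```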

# Decision tree
A decision tree classifier repeatedly divides the feature space into sub-regions by choosing split boundaries, each one a threshold on a single feature. The division has to be repeated because one class may occupy two distant regions that are separated by another class.

So when does it terminate?
* Either every region is pure, that is, it contains members of a single class only,
* or a stopping criterion set through the classifier's parameters is met (for example `max_depth` or `min_samples_leaf` in scikit-learn).
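
The two stopping conditions can be seen on a toy problem. The sketch below is only an illustration: the six one-dimensional points are made up, with class 0 at both ends and class 1 in the middle, so a single cut cannot separate them. `max_depth` is one of several scikit-learn parameters that can stop the splitting early.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up 1-D data: class 0 sits at both ends, class 1 in the middle
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 1, 1, 0, 0]

# No limit: the tree keeps splitting until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# max_depth=1: splitting stops after one cut, leaving an impure leaf
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Depth and training accuracy of each tree
print(full.get_depth(), full.score(X, y))
print(shallow.get_depth(), shallow.score(X, y))
```

The unrestricted tree needs two levels to isolate the middle region and fits the training points perfectly, while the depth-limited tree stops with a mixed leaf and therefore misclassifies some of them.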

In [15]:
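decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(X_train, y_train)
# Predicted classes for the test set (score() below repeats this prediction internally)
y_pred = decision_tree_model.predict(X_test)
# Mean accuracy on the test set
print(decision_tree_model.score(X_test, y_test))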

0.9890909090909091
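
For a classifier, `score` reports the mean accuracy, that is, the fraction of test rows predicted correctly. The sketch below reads the printed value back as a count of errors. The test-set size of 275 rows is an assumption (20% of 1372 rows, rounded up), and `accuracy` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Toy labels: 4 of 5 predictions are right
print(accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))  # 0.8

# Assumption: the test set holds 275 rows
n_test = 275
reported = 0.9890909090909091
n_correct = round(reported * n_test)
print(n_correct, n_test - n_correct)  # 272 3
```

Under that assumption, the score corresponds to 3 misclassified banknotes out of 275.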


In [18]:
# Write the fitted tree to resume.dot. With out_file set, export_graphviz
# returns None, so there is no result to keep.
tree.export_graphviz(decision_tree_model, out_file="resume.dot",
                     class_names=['forged', 'genuine'],
                     feature_names=['Variance', 'Skewness', 'Curtosis', 'Entropy'],
                     filled=True, rounded=True, special_characters=True)

<img src="TreeBankNote.png" style="width: 100%; height: 100%">

## Quality of fit
In scikit-learn, the parameter that selects the function used to measure the quality of a split is called `criterion`. Its default value is `gini`.
### Impurity
A node is impure when it contains traces of one class mixed into another. This can arise for the following reasons:
* We run out of available features to divide the classes on.
* We tolerate some percentage of impurity and stop dividing further for faster performance (there is always a trade-off between accuracy and performance). For example, we may stop dividing once a node has fewer than x elements left.

A common measure of this impurity is the Gini impurity, which is 0 for a pure node:

$$ G = 1 - \sum p(x)^2 $$

where p(x) is the proportion of class x in the node.
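
As an illustration, the Gini impurity of a node can be computed from the class labels it contains as one minus the sum of the squared class proportions. The function name `gini_impurity` and the label lists are made up for this sketch; it is not the code scikit-learn runs internally.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity([1, 1, 1, 1]))  # 0.0   (pure node)
print(gini_impurity([0, 0, 1, 1]))  # 0.5   (evenly mixed, the two-class maximum)
print(gini_impurity([0, 1, 1, 1]))  # 0.375
```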

### Entropy
Entropy is the degree of randomness of the elements; in other words, it is another measure of impurity. It is calculated from the probabilities of the items as:

$$ H = - \sum_x p(x) \log p(x) $$

where p(x) is the probability of item x (here, the proportion of class x in a node).
That is, entropy is the negative sum, over all items, of the probability times the log of that probability.
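
The formula can be checked by hand on small label lists. The sketch below assumes a base-2 logarithm, so the result is in bits; `entropy` and `information_gain` are hypothetical helpers written for this illustration. The second function shows how entropy scores a split: the entropy of the parent node minus the size-weighted entropy of its two children.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy -sum(p_k * log2(p_k)) of a list of class labels, in bits."""
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in probs)

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

print(entropy([1, 1, 1, 1]))            # 0.0 (pure node)
print(entropy([0, 0, 1, 1]))            # 1.0 (evenly mixed)
print(round(entropy([0, 1, 1, 1]), 3))  # 0.811

# A split that separates the two classes perfectly removes all the entropy
print(information_gain([0, 0, 1, 1], [0, 0], [1, 1]))  # 1.0
```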

In [21]:
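# Same model, but splits are scored by entropy instead of the default criterion
decision_tree_model = DecisionTreeClassifier(criterion='entropy', random_state=1234)
decision_tree_model.fit(X_train, y_train)
# Predicted classes for the test set (score() below repeats this prediction internally)
y_pred = decision_tree_model.predict(X_test)
# Mean accuracy on the test set
print(decision_tree_model.score(X_test, y_test))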

0.9927272727272727


In [22]:
# Write the entropy-based tree to resume2.dot. With out_file set, export_graphviz
# returns None, so there is no result to keep.
tree.export_graphviz(decision_tree_model, out_file="resume2.dot",
                     class_names=['forged', 'genuine'],
                     feature_names=['Variance', 'Skewness', 'Curtosis', 'Entropy'],
                     filled=True, rounded=True, special_characters=True)

<img src="TreeBankNote2.png" style="width: 100%; height: 100%">