## Decision Trees

### A decision tree is another form of supervised learning.

### Decision trees can be used for both classification and regression.

### The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.

### To demonstrate how a decision tree classifier works, we use the balance scale dataset (stored in the data folder). The objective is to classify whether the scale tips to the left, tips to the right, or is balanced.

### The attributes are the left weight, the left distance, the right weight, and the right distance.
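
### The sketch below shows how the class label relates to the four attributes. It assumes the file is the standard UCI Balance Scale data, in which each attribute takes the values 1 to 5 and the label is decided by comparing left weight x left distance with right weight x right distance. The names `balance_class` and `demo_df` are illustrative and are not used elsewhere in this notebook.

```python
from itertools import product

import pandas as pd


def balance_class(left_weight, left_distance, right_weight, right_distance):
    """Return 'L', 'R' or 'B' by comparing the torque on each side (assumed rule)."""
    left = left_weight * left_distance
    right = right_weight * right_distance
    if left > right:
        return "L"
    if left < right:
        return "R"
    return "B"  # balanced


# Enumerate every combination of the four attributes (values 1..5 each).
rows = [
    (balance_class(lw, ld, rw, rd), lw, ld, rw, rd)
    for lw, ld, rw, rd in product(range(1, 6), repeat=4)
]
demo_df = pd.DataFrame(
    rows,
    columns=["Class Name", "Left-Weight", "Left-Distance",
             "Right-Weight", "Right-Distance"],
)

print(len(demo_df))                            # 625 rows
print(demo_df["Class Name"].value_counts())    # L and R: 288 each, B: 49
```

### Under this assumption the balanced class is rare (49 of 625 rows), which is worth keeping in mind when reading the accuracy figures later on.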

### Let us load the data into a dataframe and then split it into training and testing sets.

### This assumes the dataset is a CSV file stored in the data folder under the working directory.

In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [28]:
cols = ['Class Name', 'Left-Weight', 'Left-Distance', 'Right-Weight', 'Right-Distance']

In [29]:
df = pd.read_csv("./data/Decision_Tree_DataSet.csv", header = None, names = cols)

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 625 entries, 0 to 624
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Class Name      625 non-null    object
 1   Left-Weight     625 non-null    int64 
 2   Left-Distance   625 non-null    int64 
 3   Right-Weight    625 non-null    int64 
 4   Right-Distance  625 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 24.5+ KB


In [31]:
df.isnull().sum()

Class Name        0
Left-Weight       0
Left-Distance     0
Right-Weight      0
Right-Distance    0
dtype: int64

In [32]:
X = df[cols[1:]].values      # Features: the four weight and distance columns
Y = df['Class Name'].values  # Target: the class label

In [33]:
# Hold out 30% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
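
### A plain random split does not guarantee that a rare class is represented in the same proportion in both sets. The sketch below shows an optional stratified split using the `stratify` argument of `train_test_split`. It rebuilds the data from the assumed UCI rule (attributes 1 to 5, label from the larger weight x distance product) so that it runs on its own; `grid`, `labels` and the `*_s` names are illustrative.

```python
from itertools import product

import numpy as np
from sklearn.model_selection import train_test_split

# Rebuild the 625 attribute combinations and their labels (assumed rule).
grid = np.array(list(product(range(1, 6), repeat=4)))
left = grid[:, 0] * grid[:, 1]
right = grid[:, 2] * grid[:, 3]
labels = np.where(left > right, "L", np.where(left < right, "R", "B"))

# stratify=labels keeps the class proportions roughly equal in both sets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    grid, labels, test_size=0.3, random_state=42, stratify=labels
)

print(len(X_train_s), len(X_test_s))
print(dict(zip(*np.unique(y_test_s, return_counts=True))))
```

### In this notebook the equivalent call would pass `stratify=Y`. The rest of the walkthrough keeps the unstratified split.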

### A decision tree classifier can use different splitting criteria; two common ones are Gini impurity and entropy. Let us fit one model with each criterion and generate predictions.
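
### To make the two criteria concrete, here is a minimal sketch of how each measure can be computed for the labels in a single node. Gini impurity is 1 minus the sum of squared class proportions, and entropy is the negative sum of each proportion times its base-2 logarithm. The function names are illustrative and are not part of scikit-learn.

```python
import numpy as np


def class_probabilities(labels):
    """Proportion of each distinct label in the node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()


def gini_impurity(labels):
    """1 - sum(p_k^2): 0 for a pure node, larger for mixed nodes."""
    p = class_probabilities(labels)
    return 1.0 - np.sum(p ** 2)


def entropy(labels):
    """-sum(p_k * log2(p_k)): 0 for a pure node, 1 for a 50/50 two-class node."""
    p = class_probabilities(labels)
    return -np.sum(p * np.log2(p))


mixed = ["L", "L", "R", "R"]
pure = ["L", "L", "L", "L"]
print(gini_impurity(mixed), entropy(mixed))  # 0.5 and 1.0
print(gini_impurity(pure), entropy(pure))    # both zero
```

### At each split the tree picks the feature and threshold that reduce the chosen measure the most, so both criteria favor splits that produce purer child nodes.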

#### Gini Impurity

In [34]:
lm_gini = DecisionTreeClassifier(criterion = "gini", random_state = 42,
                              max_depth=3, 
                              min_samples_leaf=5) # min. samples req. at leaf node
lm_gini.fit(X_train, y_train)
y_pred = lm_gini.predict(X_test)

#### Information Gain

In [35]:
lm_ig = DecisionTreeClassifier(criterion = "entropy", random_state = 42,
                              max_depth=3, 
                              min_samples_leaf=5)
lm_ig.fit(X_train, y_train)
y_pred_ig = lm_ig.predict(X_test)
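
### A fitted tree can be printed as a set of if/else rules with `export_text` from `sklearn.tree`, which helps to see the decision rules the model has learned. The sketch below is self-contained: it fits a tree with the same settings on data rebuilt from the assumed UCI rule, so the rules it prints may differ from those of `lm_gini` and `lm_ig`. The names `demo_tree`, `grid` and `labels` are illustrative.

```python
from itertools import product

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["Left-Weight", "Left-Distance", "Right-Weight", "Right-Distance"]

# Rebuild the 625 attribute combinations and their labels (assumed rule).
grid = np.array(list(product(range(1, 6), repeat=4)))
left = grid[:, 0] * grid[:, 1]
right = grid[:, 2] * grid[:, 3]
labels = np.where(left > right, "L", np.where(left < right, "R", "B"))

demo_tree = DecisionTreeClassifier(criterion="gini", random_state=42,
                                   max_depth=3, min_samples_leaf=5)
demo_tree.fit(grid, labels)

# Each indented line is one threshold test; leaves show the predicted class.
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

### In this notebook the same call would be `export_text(lm_gini, feature_names=cols[1:])`.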

### The predictions from the two models are stored in the variables y_pred and y_pred_ig.

### Run the following snippet to print the predicted class labels.

In [36]:
print(y_pred)
print(y_pred_ig)

['L' 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L'
 'R' 'R' 'R' 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'L' 'R' 'L'
 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'R' 'R' 'L' 'R' 'R'
 'R' 'R' 'R' 'L' 'L' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'R' 'L'
 'L' 'L' 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R'
 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'R'
 'L' 'R' 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'R' 'R'
 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'R'
 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'R' 'R' 'R' 'L' 'L' 'L' 'L' 'R' 'L'
 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'L']
['L' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L'
 'R' 'R' 'R' 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
 'L' 'L' 'R' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L' 'L' 'R' 'L'
 'L' 'L' 'R' 'L' 
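
### Printed labels alone do not show how confident a tree is in each prediction. One way to inspect that is `predict_proba`, which returns the fraction of training samples of each class in the leaf a row falls into. The sketch below is self-contained and fits on data rebuilt from the assumed UCI rule; `demo_tree`, `grid` and `labels` are illustrative names.

```python
from itertools import product

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Rebuild the 625 attribute combinations and their labels (assumed rule).
grid = np.array(list(product(range(1, 6), repeat=4)))
left = grid[:, 0] * grid[:, 1]
right = grid[:, 2] * grid[:, 3]
labels = np.where(left > right, "L", np.where(left < right, "R", "B"))

demo_tree = DecisionTreeClassifier(criterion="gini", random_state=42,
                                   max_depth=3, min_samples_leaf=5)
demo_tree.fit(grid, labels)

# One row per sample, one column per class, in the order of classes_.
proba = demo_tree.predict_proba(grid[:5])
print(demo_tree.classes_)
print(proba.round(3))
```

### In this notebook the equivalent calls would be `lm_gini.predict_proba(X_test)` and `lm_ig.predict_proba(X_test)`.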

### We can now compare the accuracy of the two classifiers using accuracy_score.

In [37]:
print("Accuracy of Gini Impurity model: ", 
      accuracy_score(y_test, y_pred) * 100)
print("Accuracy of Information Gain model: ", 
      accuracy_score(y_test, y_pred_ig) * 100)

Accuracy of Gini Impurity model:  70.2127659574468
Accuracy of Information Gain model:  68.08510638297872
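
### Accuracy is a single number and can hide how each class is handled. In the Gini predictions printed earlier only 'L' and 'R' appear, so a per-class view is useful. The sketch below uses `confusion_matrix` and `classification_report` from `sklearn.metrics`. It is self-contained and rebuilds the data from the assumed UCI rule, so its numbers may differ from the accuracies above because the row order of the CSV file is not known; the `demo_*` names are illustrative.

```python
from itertools import product

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the 625 attribute combinations and their labels (assumed rule).
grid = np.array(list(product(range(1, 6), repeat=4)))
left = grid[:, 0] * grid[:, 1]
right = grid[:, 2] * grid[:, 3]
labels = np.where(left > right, "L", np.where(left < right, "R", "B"))

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    grid, labels, test_size=0.3, random_state=42
)

demo_tree = DecisionTreeClassifier(criterion="gini", random_state=42,
                                   max_depth=3, min_samples_leaf=5)
demo_tree.fit(demo_X_train, demo_y_train)
demo_pred = demo_tree.predict(demo_X_test)

class_order = ["B", "L", "R"]
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(demo_y_test, demo_pred, labels=class_order)
print(cm)
# zero_division=0 avoids warnings for a class that is never predicted.
print(classification_report(demo_y_test, demo_pred,
                            labels=class_order, zero_division=0))
```

### In this notebook the same calls would take `y_test` with `y_pred` or `y_pred_ig`.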
