# Implementing a Decision Tree from Scratch, Without a Standard ML Library Such as scikit-learn

### What is the Decision Tree Algorithm?

The Decision Tree algorithm is a supervised learning algorithm. Unlike some other supervised learning algorithms, it can be used to solve both regression and classification problems. The goal is to build a model that predicts the class or value of the target variable by learning simple decision rules inferred from the training data.

It is a tree-structured classifier in which each internal node represents a test on a feature of the dataset, each branch represents a decision rule, and each leaf node represents an outcome.
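To make the structure concrete, here is a minimal hand-written sketch of such a tree as nested rules. The function name `classify_iris` and both thresholds are illustrative assumptions, not values learned from data.

```python
def classify_iris(petal_length_cm, petal_width_cm):
    """Toy decision tree written by hand (thresholds are illustrative)."""
    # Internal node: test on the petal-length feature.
    if petal_length_cm <= 2.5:
        # Leaf node: the outcome for this branch.
        return "setosa"
    # Internal node: test on the petal-width feature.
    if petal_width_cm <= 1.7:
        return "versicolor"
    return "virginica"


print(classify_iris(1.4, 0.2))
print(classify_iris(4.5, 1.4))
print(classify_iris(6.0, 2.3))
```

Each `if` plays the role of an internal node, each branch of the `if` is a rule, and each `return` is a leaf.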


### Types of Decision Tree

Decision trees are categorized by the type of their target variable. There are two types:
   1. Categorical Variable Decision Tree: a decision tree that has a categorical target variable.
   2. Continuous Variable Decision Tree: a decision tree that has a continuous target variable.
   
###### For example:
Suppose a company wants to predict whether a customer will renew a subscription to an OTT service (yes or no). The analyst knows that customer income is a significant variable, but the company does not have income details for all of its customers. Because income is important, the analyst can build a decision tree to predict customer income from occupation and other relevant variables. Here the tree predicts a continuous variable, so a Continuous Variable Decision Tree is needed.
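In practice, the two types differ mainly in how a leaf turns its training samples into a prediction. The sketch below assumes the common convention of a majority vote for categorical targets and a mean for continuous targets. The helper names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean


def leaf_value_classification(targets):
    # Categorical target: predict the most frequent class in the leaf.
    return Counter(targets).most_common(1)[0][0]


def leaf_value_regression(targets):
    # Continuous target: predict the average of the values in the leaf.
    return mean(targets)


renewals_in_leaf = ["yes", "no", "yes"]         # hypothetical labels
incomes_in_leaf = [30000.0, 40000.0, 50000.0]   # hypothetical incomes

print(leaf_value_classification(renewals_in_leaf))
print(leaf_value_regression(incomes_in_leaf))
```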


### Important Terminology related to Decision Trees 

1. Root Node: It represents the entire population or sample, which is then divided into two or more homogeneous sets.
2. Splitting: The process of dividing a node into two or more sub-nodes.
3. Decision Node: A sub-node that splits into further sub-nodes.
4. Leaf / Terminal Node: A node that does not split.
5. Pruning: The process of removing the sub-nodes of a decision node. It can be seen as the opposite of splitting.
6. Branch / Sub-Tree: A subsection of the entire tree.
7. Parent and Child Node: A node that is divided into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children.


### Assumptions for creating a Decision Tree

First, the entire training set is considered as the root node. Feature values are generally expected to be categorical; if the data is continuous, it is discretized before building the model. Records are distributed recursively on the basis of attribute values, and the order in which attributes are placed as the root or internal nodes of the tree is determined by a statistical measure.
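As a minimal sketch of the discretization step, continuous values can be mapped to categories with `np.digitize`. The income values, bin edges, and category names below are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical continuous feature (annual income).
incomes = np.array([18_000, 42_000, 75_000, 130_000])

# Assumed bin edges; a value x falls in bin i when edges[i-1] <= x < edges[i].
bin_edges = np.array([30_000, 60_000, 100_000])
labels = np.array(["low", "medium", "high", "very_high"])

bins = np.digitize(incomes, bin_edges)  # integer bin index per value
categories = labels[bins]               # categorical version of the feature

print(bins)
print(categories)
```

How the bin edges are chosen (equal width, quantiles, domain knowledge) is a separate design decision that this sketch leaves open.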


### Advantages and Disadvantages

#### Advantages
1. A decision tree does not require normalization of data.
2. A decision tree does not require scaling of data.
3. Missing values in the data do not affect the tree-building process to a considerable extent.
4. A decision tree model is very easy to explain.
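The scaling claim can be checked numerically. Threshold splits depend only on the ordering of feature values, so a positive rescaling of a feature should leave the best achievable information gain unchanged. The sketch below uses made-up data and hypothetical helper names (`entropy`, `best_gain`).

```python
import numpy as np


def entropy(y):
    # Shannon entropy (bits) of non-negative integer class labels.
    counts = np.bincount(y)
    p = counts[counts > 0] / len(y)
    return float(-(p * np.log2(p)).sum())


def best_gain(x, y):
    # Try every threshold that leaves both sides non-empty.
    best = 0.0
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        best = max(best, entropy(y) - weighted)
    return best


x = np.array([1.0, 2.0, 3.0, 4.0])  # made-up feature values
y = np.array([0, 0, 1, 1])          # made-up class labels

gain_raw = best_gain(x, y)
gain_scaled = best_gain(1000 * x + 5, y)  # same ordering, different scale
print(gain_raw, gain_scaled)
```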

#### Disadvantages
1. Small changes in the data can cause large changes in the structure of the decision tree, which makes it unstable.
2. Training a decision tree often takes comparatively more time.
3. The calculations for a decision tree can sometimes become far more complex than those of other algorithms.
4. Decision tree training is relatively expensive because of its higher complexity and training time.


### How do Decision Trees work?
The choice of splits heavily affects a tree’s accuracy. The splitting criteria differ for classification and regression trees. Several algorithms exist for deciding how to split; the one discussed here is ID3, which builds decision trees using a top-down greedy approach.
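ID3 scores candidate splits with entropy and information gain. As a reference sketch, using notation that is assumed here rather than taken from the text ($p_c$ is the fraction of samples in $S$ that belong to class $c$, and $S_v$ is the subset of $S$ in which attribute $A$ takes the value $v$):

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} \, H(S_v)
```

A node whose samples all share one class has $H(S) = 0$, while a two-class node split 50/50 has $H(S) = 1$ bit.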

#### Steps involved in the ID3 algorithm:
1. Start with the set S as the root node.
2. On each iteration, calculate the entropy (H) and information gain (IG) of every unused attribute of S.
3. Select the attribute with the smallest entropy, or equivalently the largest information gain.
4. Split S by the selected attribute to produce subsets of the data.
5. Recurse on each subset, considering only attributes that have not been selected before.
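The steps above can be sketched for purely categorical attributes as follows. This is a minimal illustration, not a complete ID3 implementation: the nested-dict tree representation, the function names, and the toy weather data are all assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    # Parent entropy minus the size-weighted entropy of each value's subset.
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, attributes):
    # Stop when the node is pure or no attributes remain: return the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 2-3: pick the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Steps 4-5: split on the chosen attribute and recurse on each subset.
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx], remaining)
    return tree


# Toy data (hypothetical): "outlook" fully determines the label, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

On this toy data the gain for `outlook` is 1 bit and the gain for `windy` is 0, so the tree splits once on `outlook` and stops.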


### Code for a Decision Tree from scratch

In [1]:
import numpy as np
# scikit-learn is used only to load the data, split it, and score predictions;
# the tree itself is built with NumPy and the standard library.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
iris = load_iris()
X = iris['data']
y = iris['target']

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [4]:
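# Entropy
from collections import Counter
def entropy(s):
    # Shannon entropy (in bits) of a sequence of non-negative integer class labels.
    counts = np.bincount(s)
    percentages = counts / len(s)
    total = 0
    for pct in percentages:
        # Skip classes with no samples: log2(0) is undefined.
        if pct > 0:
            total += pct * np.log2(pct)
    return -total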
s = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(f'Entropy: {np.round(entropy(s), 5)}')

Entropy: 0.88129


In [5]:
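# Information Gain
def information_gain(parent, left_child, right_child):
    # Fraction of the parent's samples that fall into each child.
    weight_left = len(left_child) / len(parent)
    weight_right = len(right_child) / len(parent)

    # Gain = parent entropy minus the weighted entropy of the children.
    gain = entropy(parent) - (weight_left * entropy(left_child) + weight_right * entropy(right_child))
    return gain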

parent = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
left_child = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
right_child = [0, 0, 0, 0, 1, 1, 1, 1]

print(f'Information Gain: {np.round(information_gain(parent, left_child, right_child), 5)}')

Information Gain: 0.18094


In [6]:
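# Node class
class Node:
    # A decision node stores feature, threshold, gain, and both subtrees;
    # a leaf node stores only value (the predicted class).
    def __init__(self, feature=None, threshold=None, data_left=None, data_right=None, gain=None, value=None):
        self.feature = feature
        self.threshold = threshold
        self.data_left = data_left
        self.data_right = data_right
        self.gain = gain
        self.value = value
# Decision tree classifier built on entropy and information gain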
class DecisionTree:
    def __init__(self, min_samples_split=2, max_depth=5):
        self.min_samples_split = min_samples_split
        self.max_depth = max_depth
        self.root = None
    @staticmethod
    def _entropy(s):
        counts = np.bincount(np.array(s, dtype=np.int64))
        percentages = counts / len(s)
        entropy = 0
        for pct in percentages:
            if pct > 0:
                entropy += pct * np.log2(pct)
        return -entropy
    
    def _information_gain(self, parent, left_child, right_child):
        num_left = len(left_child) / len(parent)
        num_right = len(right_child) / len(parent)
        return self._entropy(parent) - (num_left * self._entropy(left_child) + num_right * self._entropy(right_child))
    
    def _best_split(self, X, y):
        # Try every (feature, threshold) pair and keep the split with the highest gain.
        best_split = {}
        best_info_gain = -1
        n_rows, n_cols = X.shape
        # Append the labels as the last column once, so rows and labels are split together.
        df = np.concatenate((X, np.asarray(y).reshape(-1, 1)), axis=1)
        labels = df[:, -1]
        for f_idx in range(n_cols):
            for threshold in np.unique(X[:, f_idx]):
                df_left = df[df[:, f_idx] <= threshold]
                df_right = df[df[:, f_idx] > threshold]
                # Only consider splits that leave samples on both sides.
                if len(df_left) > 0 and len(df_right) > 0:
                    gain = self._information_gain(labels, df_left[:, -1], df_right[:, -1])
                    if gain > best_info_gain:
                        best_split = {
                            'feature_index': f_idx,
                            'threshold': threshold,
                            'df_left': df_left,
                            'df_right': df_right,
                            'gain': gain
                        }
                        best_info_gain = gain
        return best_split
    
    def _build(self, X, y, depth=0):
        n_rows, n_cols = X.shape
        if n_rows >= self.min_samples_split and depth <= self.max_depth:
            best = self._best_split(X, y)
            # best is empty when no threshold separates the rows, so default the gain to 0.
            if best.get('gain', 0) > 0:
                left = self._build(
                    X=best['df_left'][:, :-1], 
                    y=best['df_left'][:, -1], 
                    depth=depth + 1
                )
                right = self._build(
                    X=best['df_right'][:, :-1], 
                    y=best['df_right'][:, -1], 
                    depth=depth + 1
                )
                return Node(
                    feature=best['feature_index'], 
                    threshold=best['threshold'], 
                    data_left=left, 
                    data_right=right, 
                    gain=best['gain']
                )
        return Node(
            value=Counter(y).most_common(1)[0][0]
        )
    
    def fit(self, X, y):
        self.root = self._build(X, y)
        
    def _predict(self, x, tree):
        # A leaf stores a class label; note that label 0 is a valid value, hence "is not None".
        if tree.value is not None:
            return tree.value
        feature_value = x[tree.feature]
        if feature_value <= tree.threshold:
            return self._predict(x=x, tree=tree.data_left)
        return self._predict(x=x, tree=tree.data_right)
        
    def predict(self, X):
        return [self._predict(x, self.root) for x in X]

In [7]:
model = DecisionTree()
model.fit(X_train, y_train)
predict = model.predict(X_test)
accuracy = accuracy_score(y_test, predict) * 100
print("Accuracy score for the built model in the Iris dataset is: ",accuracy,"%")

Accuracy score for the built model in the Iris dataset is:  100.0 %
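One benefit of a hand-built tree is that its rules can be read directly. The sketch below walks any node object that exposes the same attributes as the `Node` class above (`feature`, `threshold`, `data_left`, `data_right`, `value`) and renders it as nested rules. The `describe` and `make_node` helpers and the toy tree are hypothetical. With a fitted model, `describe(model.root)` would presumably be the equivalent call.

```python
from types import SimpleNamespace


def describe(node, indent=""):
    """Return the tree rooted at node as a list of indented text lines."""
    if node.value is not None:
        # Leaf: report the predicted class.
        return [f"{indent}predict {node.value}"]
    lines = [f"{indent}if feature[{node.feature}] <= {node.threshold}:"]
    lines.extend(describe(node.data_left, indent + "    "))
    lines.append(f"{indent}else:")
    lines.extend(describe(node.data_right, indent + "    "))
    return lines


def make_node(**fields):
    # Stand-in for the Node class, so this sketch runs on its own.
    defaults = dict(feature=None, threshold=None,
                    data_left=None, data_right=None, value=None)
    return SimpleNamespace(**{**defaults, **fields})


# Hand-built toy tree with one split and two leaves (not a fitted model).
toy = make_node(feature=2, threshold=2.5,
                data_left=make_node(value=0),
                data_right=make_node(value=1))

print("\n".join(describe(toy)))
```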
