# Decision Tree Algorithm

### Learning Agenda of this Notebook


- What are decision trees?
- Basic concepts of decision trees, such as nodes, branches, root, leaves, etc.
- Types of decision trees - classification and regression.
- How decision trees work and their advantages and disadvantages.
- How to build decision trees - algorithms and techniques.
- Entropy and information gain.
- Gini index and Gini impurity.
- Pruning and overfitting.
- Hyperparameter tuning and grid search.
- Ensemble methods with decision trees - Random Forest and Gradient Boosting.
- Python libraries for decision trees - scikit-learn, with pandas for data handling.
- Decision tree visualization using Graphviz.
- Real-world applications of decision trees.

A **decision tree** is a supervised learning algorithm used for both classification and regression tasks. It builds a tree-like model of decisions by recursively splitting the data based on the values of the features. The goal is a model that predicts the target variable by making a series of decisions based on the values of the input features.

### Idea behind the decision tree algorithm

The decision tree algorithm works by recursively partitioning the feature space into smaller regions, such that each region contains data points with similar values of the target variable. The prediction for a new data point is then read off from the region it falls into.

At each node of the tree, a decision is made based on the value of a single feature, and the data is split into two or more subsets based on the value of that feature. This process continues until the data in each leaf node is homogeneous with respect to the target variable, or until a stopping criterion is met (e.g., maximum depth of the tree, minimum number of samples required to split a node).

The decision tree algorithm is useful for both classification and regression problems. In a classification problem, the goal is to predict the class label of a data point, while in a regression problem, the goal is to predict a continuous-valued target variable.

The decision tree algorithm is intuitive, easy to interpret, and can handle both numerical and categorical data. However, it can be prone to overfitting if the tree is too deep or the data is noisy. To address these issues, techniques such as pruning and ensemble methods can be used.
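The link between tree depth and overfitting can be illustrated with a minimal sketch. It assumes scikit-learn and its bundled iris dataset; the `max_depth=2` limit and the `random_state` values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unrestricted tree keeps splitting until its leaves are pure.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limiting the depth is one simple way to constrain model complexity.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

for name, model in [("unrestricted", deep_tree), ("max_depth=2", shallow_tree)]:
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"train acc={model.score(X_train, y_train):.3f}, "
        f"test acc={model.score(X_test, y_test):.3f}"
    )
```

A large gap between training and test accuracy for the unrestricted tree would be a sign of overfitting. On a small, clean dataset such as iris the gap may be negligible, so the comparison is more informative on noisy data.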

### Step-by-Step workings of the decision tree algorithm

Here are the step-by-step workings of the decision tree algorithm:

**Selecting the best feature:** At each node, the algorithm evaluates the candidate splits and selects the feature (and, for a numerical feature, the threshold) that best splits the data, based on some criterion such as information gain or Gini impurity.

**Splitting data:** The data at the node is then split into two or more subsets based on the values of the selected feature.

**Recursive partitioning:** The process of selecting the best feature and splitting the data is repeated recursively for each subset until a stopping criterion is met, such as a maximum depth of the tree or a minimum number of samples required to split a node.

**Building the tree:** Each split becomes an internal node of the tree, and each subset that is not split further becomes a leaf node. A leaf stores the prediction for its region, typically the majority class for classification or the mean target value for regression.

**Predicting the target variable:** Once the tree is built, it can be used to predict the value of the target variable for new data points. This is done by traversing the tree from the root node to a leaf node, based on the values of the input features for the new data point.

**Handling missing values:** Missing values can be handled either by imputing them before training or, in implementations that support it, by ignoring them when a split is evaluated.

**Handling categorical data:** Categorical data can be handled either by converting it to numerical data (for example, with one-hot encoding) or by using specialized splitting criteria such as the chi-square test, as in the CHAID algorithm.

Overall, the decision tree algorithm is a powerful and flexible machine learning algorithm that can handle a variety of data types and can be used for both regression and classification tasks. However, care must be taken to prevent overfitting, and the algorithm may require hyperparameter tuning to achieve optimal performance.
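The handling of missing values and categorical data described above can be sketched with a scikit-learn pipeline. This is one common approach, not the only one: it imputes a numeric column with the median and one-hot encodes a categorical column before fitting the tree. The toy DataFrame and its column names (`age`, `region`, `bought`) are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one numeric feature with a missing value,
# one categorical feature, and a binary target.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "region": ["north", "south", "north", "east", "south", "east"],
    "bought": [0, 0, 1, 1, 1, 0],
})
X, y = df[["age", "region"]], df["bought"]

preprocess = ColumnTransformer([
    # Replace missing ages with the median of the observed ages.
    ("num", SimpleImputer(strategy="median"), ["age"]),
    # Convert each region into a 0/1 indicator column.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
model.fit(X, y)

predictions = model.predict(X)
print(predictions)
```

Keeping the preprocessing inside the pipeline means the same imputation and encoding are applied automatically to any new data passed to `model.predict`.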

### Terminology used in the decision tree algorithm

Here are some common terms used in the decision tree algorithm:

- Root node: The topmost node of the decision tree, which represents the entire dataset.

- Internal nodes: Nodes in the decision tree that represent a subset of the dataset and contain a split on one of the input features.

- Leaf nodes: Nodes in the decision tree that represent the final outcome or decision of the model.

- Splitting: The process of dividing the dataset into smaller subsets based on the value of a particular input feature.

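- Branches: The connections between a node and its child nodes, each representing one outcome of the split at that node. A section of the tree below a node (a subtree) is also called a branch.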

- Entropy: A measure of the impurity or randomness in a set of data, used to determine the quality of a split in the decision tree.

- Information gain: The reduction in entropy achieved by a particular split, used to determine the best feature to split on.

- Gini impurity: A measure of the probability of misclassifying a random data point from a subset if it were randomly labeled according to the class distribution in the subset.

- Pruning: The process of removing branches or nodes from the decision tree to prevent overfitting.

- Hyperparameters: Parameters that control the behavior of the decision tree algorithm, such as the maximum depth of the tree or the minimum number of samples required to split a node. Hyperparameter tuning involves selecting the optimal values for these parameters to achieve the best performance of the model.

- Decision nodes: Another name for internal nodes. When a sub-node splits into further sub-nodes, it is called a decision node.
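Several of the terms above can be seen on a fitted model. The following is a minimal sketch that assumes scikit-learn and the iris dataset; `max_depth=2` is an arbitrary choice to keep the tree small. It reads the fitted structure from the estimator's `tree_` attribute.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

structure = tree.tree_
# In scikit-learn's array representation, node 0 is the root and a leaf
# is a node whose child index is -1.
is_leaf = structure.children_left == -1

print("total nodes:", structure.node_count)
print("leaf nodes:", int(is_leaf.sum()))
print("decision (internal) nodes:", int((~is_leaf).sum()))
print("depth:", tree.get_depth())

# The root node holds the first split, made on the whole training set.
root_feature = iris.feature_names[structure.feature[0]]
print(f"root splits on: {root_feature} <= {structure.threshold[0]:.2f}")
print("Gini impurity at the root:", round(float(structure.impurity[0]), 3))
```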

### Entropy

Entropy is a measure of the impurity or randomness in a set of data. In the context of decision trees, entropy is used to determine the quality of a split based on the information gain achieved by that split.

The formula for entropy is:

$E = -\sum_{i=1}^{n} p_i \log_2(p_i)$

where $n$ is the number of classes in the dataset, $p_i$ is the proportion of the data points that belong to class $i$, and $\log_2$ is the base-2 logarithm. A class with $p_i = 0$ contributes zero to the sum. The entropy is zero when all the data points belong to the same class, and it reaches its maximum, $\log_2 n$, when the data points are evenly distributed across all classes.
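The entropy formula can be checked numerically with a short sketch. The helper name `entropy` and the toy label lists are illustrative assumptions; only NumPy is required.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    # Every class returned by np.unique has a nonzero count, so log2 is safe.
    # Adding 0.0 turns a possible -0.0 into 0.0 for cleaner printing.
    return float(-np.sum(proportions * np.log2(proportions))) + 0.0

print(entropy(["yes"] * 6))               # pure node
print(entropy(["yes"] * 3 + ["no"] * 3))  # two classes, evenly split
print(entropy(["a", "b", "c", "d"]))      # four classes, evenly split
```

The three calls return 0, 1, and 2 bits respectively, matching the minimum of zero for a pure node and the maximum of $\log_2 n$ for $n$ evenly distributed classes.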

In the context of decision trees, entropy is used to measure the impurity of a subset of data before and after a split. The **information gain** achieved by a split is defined as the difference between the entropy of the parent node and the weighted average of the entropies of the child nodes, where the weights are the proportions of the data points in each child node.

The decision tree algorithm aims to maximize the information gain achieved by each split, in order to minimize the entropy of the resulting subsets and increase the homogeneity of the data within each subset.

In practice, entropy is a widely used criterion for building decision trees, but other measures such as Gini impurity or misclassification error can also be used, depending on the specific problem and the nature of the data.
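To compare the criteria mentioned above, the sketch below evaluates all three on the same class distributions. It assumes the standard definitions of Gini impurity, $1 - \sum_i p_i^2$, and misclassification error, $1 - \max_i p_i$; the function name `impurities` is hypothetical.

```python
import numpy as np

def impurities(proportions):
    """Return entropy, Gini impurity and misclassification error for class proportions."""
    p = np.asarray(proportions, dtype=float)
    nonzero = p[p > 0]  # classes with zero proportion contribute nothing to entropy
    return {
        "entropy": float(-np.sum(nonzero * np.log2(nonzero))) + 0.0,
        "gini": float(1.0 - np.sum(p ** 2)),
        "misclassification": float(1.0 - p.max()),
    }

for dist in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(dist, impurities(dist))
```

All three measures are zero for a pure node and largest for the evenly split node; they differ in how strongly they penalize the intermediate cases.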

<img src="images/Entropy.png">
<img src="images/Entropy1.png">

### Information Gain

Information gain is a measure of the reduction in entropy achieved by a particular split in a decision tree. It is used to determine the best feature to split on among all possible features.

The formula for information gain is:

$IG(D, F) = E(D) - \sum_{f \in \mathrm{values}(F)} \frac{|D_f|}{|D|} E(D_f)$

where $IG(D, F)$ is the information gain of the dataset $D$ with respect to the feature $F$, the sum runs over every value $f$ that the feature $F$ takes in $D$, $E(D)$ is the entropy of the entire dataset $D$, $D_f$ is the subset of data points in $D$ that have the value $f$ for feature $F$, and $E(D_f)$ is the entropy of the subset $D_f$. The information gain is high when the entropy of the subsets is low, meaning that the split is effective in separating the data into homogeneous groups.
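The information gain formula can be traced on a small example. The sketch below uses only the Python standard library; the helper names and the toy weather-style data are hypothetical and chosen so that one feature separates the classes perfectly while the other carries no information.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy E(D) in bits of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(D, F): parent entropy minus the weighted entropy of each subset D_f."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy data: six observations, two candidate features.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
windy = ["yes", "no", "yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes", "yes", "yes"]

print("E(D)          :", round(entropy(play), 3))
print("IG(D, outlook):", round(information_gain(outlook, play), 3))
print("IG(D, windy)  :", round(information_gain(windy, play), 3))
```

Splitting on `outlook` produces three pure subsets, so its information gain equals the full parent entropy. Splitting on `windy` leaves both subsets with the same class mix as the parent, so its gain is zero and `outlook` would be chosen.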

In the context of decision trees, the information gain of each feature is calculated and compared, and the feature with the highest information gain is chosen as the splitting criterion. The algorithm recursively applies this process to create a tree until all the data points in each leaf node belong to the same class or a stopping criterion is met.
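The recursive process described above can be sketched as a small ID3-style builder. This is an illustrative simplification that assumes categorical features and multiway splits; it is not how scikit-learn builds trees (scikit-learn uses binary splits). All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    return entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())

def build_tree(rows, labels, features, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure, no features remain, or the depth limit is hit.
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return majority
    # Choose the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) <= 0:
        return majority
    node = {"feature": best, "children": {}, "default": majority}
    remaining = [f for f in features if f != best]
    # Create one child per observed value of the chosen feature and recurse.
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining, max_depth - 1)
    return node

def predict(tree, row):
    # Walk from the root to a leaf; fall back to the node's majority class
    # when the row has a feature value that was never seen in training.
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree

rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
]
labels = ["no", "no", "yes", "yes", "yes", "no"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
print(predict(tree, {"outlook": "rain", "windy": "yes"}))
```

On this toy data the root splits on `outlook`, because it has the higher information gain, and only the mixed `rain` subset is split again on `windy`.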

In practice, information gain is a widely used criterion for building decision trees, but it is biased towards features with many distinct values: such features split the data into many small subsets, which tend to look pure even when the feature has little predictive value. Gain ratio corrects for this bias by normalizing the information gain, and the Gini index is a common alternative criterion; which one works best depends on the specific problem and the nature of the data.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a decision tree classifier object
dt = DecisionTreeClassifier()

# Train the decision tree classifier on the training data
dt.fit(X_train, y_train)

# Predict the class labels for the test data
y_pred = dt.predict(X_test)

# Evaluate the performance of the classifier using accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

```
Accuracy: 1.0
```
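To connect this classifier back to entropy and information gain, the sketch below refits the tree with `criterion="entropy"` (scikit-learn's default criterion is Gini impurity) and prints the learned rules as text with `export_text`. The `random_state` value is an arbitrary choice for reproducibility.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# criterion="entropy" selects splits by information gain
# instead of the default Gini impurity.
dt = DecisionTreeClassifier(criterion="entropy", random_state=42)
dt.fit(X_train, y_train)

# Print the tree as indented if/else rules, one line per node.
rules = export_text(dt, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a decision node with its feature and threshold, and each `class:` line is a leaf, so the printout can be read from top to bottom the same way a prediction traverses the tree from the root.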
