## Decision Tree


A decision tree is a supervised machine learning algorithm for classification and regression that makes predictions by splitting the data into branches based on feature values. Each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label or a continuous value. The tree is built by selecting the best feature to split on at each step, using criteria such as Gini impurity or entropy for classification, or variance reduction for regression. Although decision trees are simple and interpretable, they are prone to overfitting and sensitive to noisy data; techniques such as pruning mitigate both problems.
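
To make the node, branch, and leaf vocabulary concrete, here is a minimal sketch of how a finished tree could be stored and used for prediction. The nested-dictionary layout, the `toy_tree` contents, and the `predict` helper are illustrative assumptions, not the representation built later in this notebook.

```python
# Hypothetical toy tree: internal nodes are dicts, leaves are plain labels.
toy_tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {"yes": "stay in", "no": "play"},
        },
        "overcast": "play",
        "rainy": "stay in",
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following the branch that matches the sample."""
    while isinstance(node, dict):
        value = sample[node["attribute"]]  # outcome of the test at this node
        node = node["branches"][value]     # follow the matching branch
    return node                            # a leaf: the predicted label


print(predict(toy_tree, {"outlook": "sunny", "windy": "no"}))   # play
print(predict(toy_tree, {"outlook": "rainy", "windy": "yes"}))  # stay in
```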



In [1]:
from argparse import ArgumentParser
from sys import argv, exit
import numpy as np
import pandas as pd

## Entropy

Entropy in the context of decision trees is a measure of impurity or disorder used to determine how a dataset should be split at each node. It quantifies the uncertainty or randomness in the data. In a decision tree, entropy helps to identify the attribute that will best split the data into distinct classes.
### Definition and Formula

For a binary classification problem, the entropy $H$ is defined as:

$$H(S) = -p_0 \log_2(p_0) - p_1 \log_2(p_1)$$

where:

- $S$ is the dataset.
- $p_0$ is the proportion of the first class in the dataset.
- $p_1$ is the proportion of the second class in the dataset.

For a multi-class classification problem, entropy is generalized to:

$$H(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where:

- $n$ is the number of classes.
- $p_i$ is the proportion of instances in class $i$.

### Interpretation

- **High entropy:** Indicates high disorder and mixed classes. If the dataset is evenly split among different classes, entropy is high, meaning the dataset is impure.
- **Low entropy:** Indicates low disorder and mostly one class. If the dataset contains instances of a single class, entropy is zero, meaning the dataset is pure.
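
The two extremes can be checked numerically. The sketch below is a standalone illustration that works on class counts rather than on a DataFrame; the helper name `entropy_from_counts` and the example counts are assumptions made for this illustration.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    # Classes with zero instances contribute nothing (0 * log2(0) is taken as 0).
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)


print(entropy_from_counts([5, 5]))            # evenly split, two classes: 1.0
print(entropy_from_counts([10, 0]))           # single class (pure): 0.0
print(round(entropy_from_counts([9, 5]), 3))  # mostly one class: 0.94
```

With two classes the entropy ranges from 0 (pure) to 1 bit (evenly split); a skewed split such as 9 to 5 falls in between.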

### Using Entropy in Decision Trees

1. **Calculate the entropy of the entire dataset:** Measure the overall impurity before any split.
2. **Calculate the entropy after a split:** For each candidate attribute, calculate the weighted entropy of the resulting subsets:

   $$H(S, A) = \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v)$$

   where $S_v$ is the subset for which attribute $A$ has value $v$, and $\frac{|S_v|}{|S|}$ is the proportion of subset $S_v$ relative to the original dataset $S$.
3. **Calculate the information gain:** Subtract the entropy after the split from the entropy before the split.

#### Information Gain
    
$$\text{Information Gain}(S, A) = H(S) - H(S, A)$$

The attribute with the highest information gain is chosen for the split, as it best reduces the uncertainty.

By selecting the split that maximizes information gain (equivalently, the one that minimizes the weighted entropy after the split), the algorithm greedily makes the child nodes of each split as homogeneous as possible. This tends to produce compact trees that classify the training data well, although a greedy choice at each node does not guarantee the best overall tree.
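
As a worked illustration of the formulas above, the following standalone sketch computes the information gain of one attribute on a small hypothetical dataset. It uses only the standard library and is independent of the pandas-based functions in this notebook; the names `entropy_of` and `information_gain` and the toy rows are assumptions for this example.

```python
from collections import Counter, defaultdict
from math import log2


def entropy_of(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows):
    """Information gain of one attribute; rows are (attribute_value, label) pairs."""
    labels = [label for _, label in rows]
    subsets = defaultdict(list)
    for value, label in rows:
        subsets[value].append(label)  # group the labels by attribute value
    # H(S, A): entropy of each subset, weighted by its share of the rows
    weighted = sum(len(s) / len(rows) * entropy_of(s) for s in subsets.values())
    return entropy_of(labels) - weighted


# Hypothetical toy data: 14 rows of (outlook, play)
rows = (
    [("sunny", "yes")] * 2 + [("sunny", "no")] * 3
    + [("overcast", "yes")] * 4
    + [("rainy", "yes")] * 3 + [("rainy", "no")] * 2
)
print(round(information_gain(rows), 3))  # 0.247
```

Here $H(S) \approx 0.940$ and the weighted entropy after the split is about $0.694$, so the split removes roughly a quarter of a bit of uncertainty.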

In [2]:
def entropy(examples, label, possible_labels):
    """Return the entropy (in bits) of the `label` column of `examples`, rounded to 4 decimals."""
    number_rows = examples.shape[0]
    entropy_value = 0

    for label_value in possible_labels:
        number_label_cases = examples[examples[label] == label_value].shape[0]
        label_entropy = 0
        # Labels that do not occur contribute nothing (0 * log2(0) is taken as 0)
        if number_label_cases > 0:
            label_prob = number_label_cases / number_rows
            label_entropy = -(label_prob * np.log2(label_prob))
        entropy_value += label_entropy
    return round(entropy_value, 4)


def info_gain(attribute, examples, label, possible_labels):
    """Return the information gain obtained by splitting `examples` on `attribute`."""
    attr_possible_values = examples[attribute].unique()
    number_rows = examples.shape[0]
    weighted_entropy = 0.0  # H(S, A): entropy after the split

    for attr_value in attr_possible_values:
        attr_value_examples = examples[examples[attribute] == attr_value]
        attr_value_number_rows = attr_value_examples.shape[0]
        attr_value_entropy = entropy(attr_value_examples, label, possible_labels)
        attr_value_prob = attr_value_number_rows / number_rows
        weighted_entropy += attr_value_prob * attr_value_entropy

    # Information gain = entropy before the split - weighted entropy after it
    return entropy(examples, label, possible_labels) - weighted_entropy



The `most_common_label` function identifies the most frequently occurring label in a set of examples. It takes the labels from the last column of the DataFrame, counts the occurrences of each unique label with `value_counts()`, and uses `idxmax()` to return the label with the highest count. Decision tree algorithms use this majority label to assign a class to leaf nodes, for example when a branch has no examples, and it can also guide pruning.

In [None]:
def most_common_label(parent_examples):
    """Return the most frequent label, assuming the label is the last column."""
    labels = parent_examples.iloc[:, -1]
    return labels.value_counts().idxmax()


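The `calculate_best_split_value` function finds the best threshold for a numeric attribute. It takes the midpoints between consecutive distinct values of the attribute as candidate thresholds, splits the examples into a `<=` subset and a `>` subset for each candidate, and returns the candidate with the highest information gain.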

In [None]:

def calculate_best_split_value(examples, attribute, label, possible_labels):
    """
    Calculates the best value to split the examples into two subsets, <= and >
    :return: The best split value
    """
    attribute_values = sorted(examples[attribute].unique().tolist())

    if len(attribute_values) == 1:
        # All instances have the same value for the attribute
        return attribute_values[0]

    # Candidate thresholds: midpoints between consecutive distinct values
    middle_values = [
        (attribute_values[i] + attribute_values[i + 1]) / 2
        for i in range(len(attribute_values) - 1)
    ]

    # The entropy before the split is the same for every candidate
    parent_entropy = entropy(examples, label, possible_labels)
    best_split_value = None
    best_information_gain = float('-inf')

    for value in middle_values:
        less_equal = examples[examples[attribute] <= value]
        bigger = examples[examples[attribute] > value]

        q1 = len(less_equal) / len(examples)
        q2 = len(bigger) / len(examples)

        entropy1 = entropy(less_equal, label, possible_labels)
        entropy2 = entropy(bigger, label, possible_labels)

        information_gain = parent_entropy - (q1 * entropy1) - (q2 * entropy2)

        if information_gain > best_information_gain:
            best_information_gain = information_gain
            best_split_value = value

    # Note: rounding to 2 decimals can shift the threshold when values differ by less than 0.01
    return round(best_split_value, 2)
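
The threshold search can be traced on a tiny example. The sketch below is a simplified, standard-library version of the same idea, not the notebook's function: the names `entropy_of` and `best_threshold` and the toy temperature data are assumptions, and it returns `None` instead of the single value when there is only one distinct value.

```python
from collections import Counter
from math import log2


def entropy_of(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def best_threshold(values, labels):
    """Return (best threshold, candidate thresholds) for a numeric attribute."""
    distinct = sorted(set(values))
    # Candidates are the midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    parent = entropy_of(labels)
    best_value, best_gain = None, float("-inf")
    for threshold in candidates:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * entropy_of(left) + len(right) * entropy_of(right)) / len(labels)
        gain = parent - weighted
        if gain > best_gain:
            best_value, best_gain = threshold, gain
    return best_value, candidates


# Hypothetical toy data: temperature readings and whether a game was played
temps = [18, 20, 24, 30]
play = ["no", "no", "yes", "yes"]
threshold, candidates = best_threshold(temps, play)
print(candidates)  # [19.0, 22.0, 27.0]
print(threshold)   # 22.0
```

Only the threshold 22.0 separates the two classes perfectly, so it yields the full 1 bit of information gain and is selected.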

In [None]:

def generate_branch(attribute, examples, label, possible_labels, parent_examples):
    """
    Generates a branch of the decision tree as a dictionary, with the attribute value
    as the key and a tuple of the label value and the number of examples that
    have that attribute value; values whose examples have mixed labels map to ('?', -1)
    :param parent_examples: Parent examples of the examples dataframe
    :return: The resulting branch and the next examples that satisfy the branch condition
    """
    attr_values_dict = examples[attribute].value_counts(sort=False)
    global attribute_possible_values
    possible_val = attribute_possible_values[attribute]
    for value in possible_val:
        if value not in attr_values_dict.keys():
            attr_values_dict[value] = 0
    branch = {}
    next_examples = examples.copy()  # Create a copy of the examples

    for attr_value, positives in attr_values_dict.items():
        attr_value_examples = examples[examples[attribute] == attr_value]
        isPure = False

        for label_value in possible_labels:
            label_positives = attr_value_examples[attr_value_examples[label] == label_value].shape[0]

            if label_positives == positives:
                if label_positives == 0 and positives == 0:
                    label_value = most_common_label(parent_examples)
                branch[attr_value] = (label_value, label_positives)
                next_examples = next_examples[next_examples[attribute] != attr_value]
                isPure = True

        if not isPure:
            branch[attr_value] = ('?', -1)

    if branch:
        return branch, next_examples
    else:
        return None, None