In [None]:
'''
Decision Tree:

- Decision trees are a type of supervised learning algorithm used for both classification and regression tasks. 
- They work by splitting the data into subsets based on the value of input features, 
creating a tree-like structure where each internal node tests a feature, each branch represents an outcome of that test, and each leaf holds a prediction. 
- The process continues until a stopping criterion is met, 
such as reaching a maximum depth or having a minimum number of samples in a leaf node.
- Decision trees are easy to interpret and visualize, making them a popular choice for many applications. 
- However, they are prone to overfitting, especially when the tree is deep and complex. 
To mitigate this, apply pruning (removing branches that contribute little to predictive accuracy) or 
constrain the tree's depth. 


Algorithms to construct decision trees:
a. ID3:
- ID3 (Iterative Dichotomiser 3) is an algorithm used to create a decision tree for classification tasks.
- It uses a top-down, greedy approach to select the feature that provides the highest information gain at each node.
- The algorithm continues to split the data based on the selected feature until all samples in a node belong to the same class or a stopping criterion is met.
- ID3 is simple, but it has no built-in pruning step and information gain favors features with many distinct values, so it can overfit.

b. CART:
- CART (Classification and Regression Trees) is a decision tree algorithm that can be used for both classification and regression tasks.
- It constructs binary trees by recursively choosing the feature and split point that minimize a cost function
    (e.g., Gini impurity for classification or mean squared error for regression).
- CART can handle both categorical and continuous features. Like any decision tree it can overfit, 
which is usually countered with pruning.

'''
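The ID3 procedure described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`entropy`, `information_gain`, `id3`), the nested-dict tree layout, the `max_depth` parameter, and the toy weather data are all assumptions made for this example. It handles categorical features only and does no pruning.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Parent entropy minus the sample-weighted entropy of the child groups."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted


def id3(rows, labels, features, max_depth=3):
    """Build a tree greedily, top-down. Leaves are plain class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure, no features remain, or the depth limit is hit.
    if len(set(labels)) == 1 or not features or max_depth == 0:
        return majority
    # Greedy step: pick the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    if information_gain(rows, labels, best) <= 1e-12:
        return majority  # no feature improves purity
    tree = {"feature": best, "children": {}, "default": majority}
    remaining = [f for f in features if f != best]
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        tree["children"][value] = id3(sub_rows, sub_labels, remaining, max_depth - 1)
    return tree


# Hypothetical toy data, invented for this sketch.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "play", "stay"]

tree = id3(rows, labels, ["outlook", "windy"])
print(tree)
```

On this data the root splits on `outlook`, because it removes more entropy than `windy`; only the `rainy` branch needs a second split.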

In [None]:
'''
Decision Tree Classifier: 
- A decision tree classifier is a specific type of decision tree used for classification tasks.
- It predicts the class label of an input sample by traversing the tree from the root to a leaf node,
    where each internal node tests a feature and each branch represents an outcome of that test.
- The leaf nodes contain the predicted class labels.
Example: 
If a decision tree is trained on a dataset of animals, it might classify an animal as "Mammal" or "Bird" 
based on features like "Has Fur" or "Can Fly".


'''
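The root-to-leaf traversal described above can be sketched with a small hand-built tree. The tree below is written by hand for illustration and is not learned from data; the dictionary layout, the `predict` helper, the feature keys `has_fur` and `can_fly`, and the extra "Other" class are assumptions of this sketch.

```python
# Hypothetical hand-built tree for the animal example (not trained on data).
animal_tree = {
    "feature": "has_fur",
    "branches": {
        True: {"label": "Mammal"},
        False: {
            "feature": "can_fly",
            "branches": {
                True: {"label": "Bird"},
                False: {"label": "Other"},  # assumed fallback class
            },
        },
    },
}


def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        answer = sample[node["feature"]]   # evaluate the test at this node
        node = node["branches"][answer]    # follow the matching branch
    return node["label"]


print(predict(animal_tree, {"has_fur": True, "can_fly": False}))
print(predict(animal_tree, {"has_fur": False, "can_fly": True}))
```

Each loop iteration moves one level down the tree, so the cost of a prediction is proportional to the depth of the tree, not to the number of training samples.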

In [None]:
'''
A few things to remember:
a. Purity:
- Purity refers to the homogeneity of the samples in a node of a decision tree.
- A node is considered pure if all samples belong to the same class.
- High purity indicates that the splits leading to the node have effectively separated the classes.
- Purity is usually quantified through its opposite, impurity; common impurity measures are Gini impurity and entropy.
    i. Gini impurity:
        - Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled 
            if it were randomly labeled according to the distribution of labels in the subset.
        - It ranges from 0 (perfectly pure) to 0.5 for binary classification, and more generally up to 1 - 1/k for k classes.
        - The formula for Gini impurity is:
          Gini = 1 - Σ(p_i^2)
          where p_i is the proportion of samples belonging to class i in the node.
    ii. Entropy:
        - Entropy is a measure of uncertainty or randomness in a set of samples.
        - It is used to quantify the impurity of a node in a decision tree.
        - The formula for entropy is:
          Entropy = -Σ(p_i * log2(p_i))
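          where p_i is the proportion of samples belonging to class i in the node (terms with p_i = 0 count as 0).
        - It ranges from 0 (perfectly pure) to 1 for binary classification, and more generally up to log2(k) for k classes.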

b. Which feature to split on:
- The feature chosen for a split is the one that maximizes the reduction in impurity of the resulting nodes.
- The reduction is usually measured as information gain (based on entropy) or as the decrease in Gini impurity.
    i. Information gain:
        - Information gain measures the reduction in entropy after a split.
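        - It is calculated as the entropy of the parent node minus the weighted average entropy of the child nodes,
            where each child is weighted by its share of the parent's samples:
          Gain = Entropy(parent) - Σ((n_k / n) * Entropy(child_k))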
        - A higher information gain indicates a better feature for splitting.

c. How to split:
- The splitting process involves dividing the dataset into subsets based on the values of the selected feature.
- For categorical features, the split can create one branch per category (as in ID3) or a binary split on whether the value belongs to a subset of categories (as in CART).
- For continuous features, the split can be based on a threshold value (e.g., "Is feature X <= threshold?").


'''
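The formulas for Gini impurity and entropy, and the threshold-based split for a continuous feature, can be tried out with a short script. This is a sketch under stated assumptions: the helper names (`class_proportions`, `gini`, `entropy`, `impurity_reduction`, `best_threshold`), the choice of midpoints between consecutive distinct values as candidate thresholds, and the sample data are all invented for this example.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return p_i for every class that is present in the node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)); absent classes contribute nothing."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the sample-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


def best_threshold(values, labels, impurity=gini):
    """Test "value <= threshold" at midpoints; return (threshold, reduction)."""
    best = (None, 0.0)
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        reduction = impurity_reduction(labels, [left, right], impurity)
        if reduction > best[1]:
            best = (threshold, reduction)
    return best


# Hypothetical data: one continuous feature and two balanced classes.
feature_x = [1.0, 2.0, 3.0, 4.0]
classes = ["A", "A", "B", "B"]

print(gini(classes))                               # 0.5
print(entropy(classes))                            # 1.0
print(best_threshold(feature_x, classes))          # (2.5, 0.5)
print(best_threshold(feature_x, classes, entropy)) # (2.5, 1.0)
```

With `impurity=entropy` the reduction returned by `best_threshold` is the information gain; with the default it is the decrease in Gini impurity. Both criteria pick the same threshold on this data, which is common but not guaranteed in general.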