## Entropy
In machine learning, particularly in decision trees and related algorithms, entropy is a measure of impurity or disorder. It quantifies how mixed the classes in a dataset are, and it guides the choice of feature to split on at each level of the tree.

### Formula:

The entropy $H$ of a dataset $S$ is calculated using the formula:

$ H(S) = - \sum_{i=1}^{c} p_i \log_2(p_i) $

Where:
- $c$ is the number of classes.
- $p_i$ is the proportion of instances of class $i$ in the dataset $S$. A class with $p_i = 0$ contributes 0 to the sum, by the convention $0 \log_2(0) = 0$.
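
As a minimal sketch, the formula can be computed directly from a list of class labels. The function name `entropy` and the choice to return 0 for an empty list are assumptions made for illustration, not part of the definition above.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: an empty dataset is treated as having zero entropy
    counts = Counter(labels)
    # Counter only holds classes that occur, so every proportion is > 0.
    # -p * log2(p) is written as p * log2(1 / p), which is the same quantity.
    return sum((n / total) * log2(total / n) for n in counts.values())
```

For instance, `entropy(["A", "A", "B", "B"])` evaluates to `1.0`, and a list containing a single class evaluates to `0.0`.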

### Explanation:

- If all examples in the dataset belong to a single class (i.e., the dataset is pure), the entropy is 0.
- If the examples are evenly distributed among all classes, the entropy is at its maximum value (which is $ \log_2(c) $ for $c$ classes).
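
Both extremes follow from the formula. For a pure dataset, one class has $p_i = 1$ and the rest have $p_i = 0$, so:

$ H(S) = - 1 \cdot \log_2(1) = 0 $

For an even distribution, every class has $p_i = \frac{1}{c}$, so:

$ H(S) = - \sum_{i=1}^{c} \frac{1}{c} \log_2\left(\frac{1}{c}\right) = - \log_2\left(\frac{1}{c}\right) = \log_2(c) $

With two classes, the maximum is therefore $ \log_2(2) = 1 $.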

### Working:

In the context of building a decision tree:

1. **Calculate Entropy of the Dataset:**
   - Calculate the entropy of the entire dataset.
  
2. **Calculate Entropy After Split:**
   - For each attribute, split the dataset on that attribute and calculate the weighted average entropy of the resulting subsets, weighting each subset by its share of the instances.
  
3. **Calculate Information Gain:**
   - Information gain is the difference between the original entropy and the entropy after the split.
   - $ \text{Information Gain} = \text{Entropy before split} - \text{Entropy after split} $
  
4. **Choose the Best Split:**
   - Choose the attribute that results in the highest information gain to make the split.
   - Repeat the process for the sub-datasets after the split.
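
The steps above can be sketched in Python as follows. This is an illustrative sketch rather than a full tree builder: it assumes categorical attributes stored in dictionaries, takes "entropy after split" to be the size-weighted average of the subset entropies, and uses hypothetical names (`information_gain`, `best_attribute`) and a made-up toy dataset.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the size-weighted entropy after it."""
    # Group the labels by the value each row takes for `attribute`.
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attribute]].append(label)
    total = len(labels)
    after = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - after


def best_attribute(rows, labels, attributes):
    """Return the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rainy", "windy": True},
    {"outlook": "rainy", "windy": False},
]
labels = ["no", "no", "yes", "yes"]

chosen = best_attribute(rows, labels, ["outlook", "windy"])
```

On this toy data, splitting on `"outlook"` yields two pure subsets (information gain 1), while splitting on `"windy"` leaves both subsets evenly mixed (information gain 0), so `chosen` is `"outlook"`. A full tree builder would then repeat the same selection on each subset.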

### Example:

Consider a dataset with two classes, A and B. If the dataset contains 8 instances of class A and 2 of class B, the entropy of the dataset is:

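$ H(S) = - \left( \frac{8}{10} \log_2\left(\frac{8}{10}\right) + \frac{2}{10} \log_2\left(\frac{2}{10}\right) \right) \approx 0.722 $

This is below the two-class maximum of $ \log_2(2) = 1 $ because class A dominates the dataset.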
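
The arithmetic for this example can be checked with a few lines of Python. The variable names `p_a`, `p_b`, and `h` are chosen here only for illustration.

```python
from math import log2

# Class proportions for 8 instances of A and 2 instances of B.
p_a, p_b = 8 / 10, 2 / 10

# Two-class entropy, written exactly as in the formula.
h = -(p_a * log2(p_a) + p_b * log2(p_b))

print(round(h, 3))  # 0.722
```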

### Applications:

- **Decision Trees:**
  - Used in algorithms like ID3 (Iterative Dichotomiser 3) to decide the best feature for splitting the dataset.
- **Feature Selection:**
  - Helps in selecting important features by calculating information gain.
  
By choosing splits that minimize the entropy of the resulting subsets (equivalently, maximize information gain), these algorithms separate the classes more cleanly at each step of model construction.