#### Understanding Entropy = -Σ(pᵢ × log(pᵢ))

- **The Story Behind the Formula**
    - Claude Shannon introduced this formula in 1948, in the paper that founded information theory. He asked: "How do we mathematically measure uncertainty or information?"

- **Shannon's Requirements**
    - Shannon said any good measure of uncertainty should satisfy these properties:

    - Continuity: Small changes in probabilities → small changes in uncertainty
    - Maximum for uniform distribution: Most uncertain when all outcomes equally likely
    - Additive for independent events: If you have two independent uncertainties, they should add up
    - Monotonicity: More possible outcomes → more uncertainty

    - Amazingly, up to a constant factor (the choice of logarithm base), only one formula satisfies all these requirements: **-Σ(pᵢ × log(pᵢ))**

    - **Intuitive Derivation (Simplified)**
        - Let me build up to the formula step by step:
        
        - **Step 1: Start with "Surprise"**
            - When an event with probability p happens, how "surprised" should you be?  
            - If p = 1 (certain) → No surprise → Surprise = 0
            - If p = 0.5 (coin flip) → Some surprise
            - If p = 0.01 (rare) → Very surprised

        - **Key insight**: Rarer events should give MORE surprise.
            - We need a function S(p) where:
                - S(1) = 0 (no surprise for certain events)
                - S(p) increases as p decreases
                - Small p → Large S(p)
            - **What function does this?** → S(p) = -log(p), equivalently log(1/p)

            - Let's verify (using log base 2, so surprise is measured in bits):
                - S(1) = -log₂(1) = 0 ✓
                - S(0.5) = -log₂(0.5) = 1 ✓
                - S(0.1) = -log₂(0.1) ≈ 3.32 ✓ (more surprise)
                - S(0.01) = -log₂(0.01) ≈ 6.64 ✓ (even more!)

        - **Step 2: Average Surprise = Entropy**
            - Now, if you have multiple possible outcomes (like rolling a die), what's the expected surprise on average?
            - Expected value formula: **E[X] = Σ(probability × value)**
            - So expected surprise (entropy):
                - ``` 
                    Entropy = Σ(pᵢ × Surprise of event i)
                          = Σ(pᵢ × (-log(pᵢ)))
                          = -Σ(pᵢ × log(pᵢ))  
                ```
        - Think of it this way:
            - Entropy = Average amount of surprise you experience
            - Each outcome has:
                - Probability pᵢ (how often it happens)
                - Surprise -log(pᵢ) (how unexpected it is)


        - Multiply them: pᵢ × (-log(pᵢ)) = "weighted surprise contribution"
        - Sum all contributions: That's your average surprise = Entropy!
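
The derivation above can be checked numerically. The following is a minimal sketch using only the standard library; the helper names `surprise` and `expected_surprise` are illustrative choices, not part of any library, and base-2 logarithms (bits) are assumed throughout. It also spot-checks two of the properties listed earlier: the uniform distribution is the most uncertain, and independent uncertainties add up.

```python
import math

def surprise(p):
    """Surprise (in bits) of an event with probability p, i.e. log2(1/p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return math.log2(1 / p)

def expected_surprise(probs):
    """Entropy in bits: the probability-weighted average surprise.

    Outcomes with probability 0 are skipped, since they never occur
    and contribute nothing to the average.
    """
    return sum(p * surprise(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
fair_die = [1 / 6] * 6

print("surprise(0.5):", surprise(0.5))
print("fair coin:", expected_surprise(fair_coin))
print("biased coin:", expected_surprise(biased_coin))
print("fair die:", expected_surprise(fair_die))

# Additivity: a coin flip and a die roll that are independent of each other.
# The joint probability of each (coin, die) pair is the product of the two.
joint = [p * q for p in fair_coin for q in fair_die]
additive = math.isclose(
    expected_surprise(joint),
    expected_surprise(fair_coin) + expected_surprise(fair_die),
)
print("entropies add for independent events:", additive)
```

The biased coin comes out below the fair coin, which matches the "maximum for uniform distribution" requirement for two outcomes.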


- #### Why Logarithm
    - The logarithm is there to measure "surprise" or "information content" in a mathematically meaningful way.

- #### Intuitive Explanation
    - Think about guessing games:
    - Scenario 1: Coin flip (2 outcomes)
        - You need 1 yes/no question to figure out the result:
            "Is it heads?"

    - Scenario 2: Rolling a 4-sided die (4 outcomes)
        - You need 2 yes/no questions to figure out the result:
        "Is it 1 or 2?" → If yes: "Is it 1?" (yes → 1, no → 2)
        "Is it 1 or 2?" → If no: "Is it 3?" (yes → 3, no → 4)

    - Scenario 3: Rolling an 8-sided die (8 outcomes)
        - You need 3 yes/no questions, since each question halves the remaining possibilities (8 → 4 → 2 → 1)

    - **Notice the pattern?**
        - 2 outcomes → 1 question → log₂(2) = 1
        - 4 outcomes → 2 questions → log₂(4) = 2
        - 8 outcomes → 3 questions → log₂(8) = 3

    - **Logarithm converts "number of possibilities" into "number of questions needed", which is a natural measure of information!**
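
The pattern above can be reproduced with a small simulation. This is a rough sketch, assuming equally likely outcomes and a strategy in which every question splits the remaining candidates as evenly as possible; `questions_needed` is a hypothetical helper name introduced here for illustration.

```python
import math

def questions_needed(n_outcomes):
    """Worst-case number of yes/no questions when each question halves the candidates."""
    questions = 0
    remaining = n_outcomes
    while remaining > 1:
        # Split the candidates in two and keep the larger half (the worst case).
        remaining = (remaining + 1) // 2
        questions += 1
    return questions

for n in (2, 4, 8, 16):
    print(f"{n} outcomes -> {questions_needed(n)} questions, log2({n}) = {math.log2(n)}")
```

For outcome counts that are not powers of two, this halving strategy needs the next whole number of questions above log₂(n) in the worst case.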

- **The Math Simplified**
    - Without Log (doesn't work well):
        - If I say "Impurity" = p₁ + p₂ + ... → this always equals 1, so it is useless as an impurity measure.
        - If I say "Impurity" = p₁ × p₂ × ... → this shrinks toward 0 very quickly as outcomes are added and doesn't capture uncertainty well.

    - With Log (works well):
        - If I say "Impurity" = -Σ(pᵢ × log(pᵢ)) → this is 0 for a pure set, largest when all outcomes are equally likely, and adds up across independent events.
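
To make the comparison concrete, the sketch below evaluates the three candidate "impurity" measures on a few hand-picked distributions. The function name `candidate_measures` and the example distributions are assumptions chosen for illustration; entropy is computed in bits.

```python
import math

def candidate_measures(probs):
    """Return (sum, product, entropy in bits) for a list of probabilities."""
    total = sum(probs)
    product = math.prod(probs)
    entropy_bits = sum(p * math.log2(1 / p) for p in probs if p > 0)
    return total, product, entropy_bits

# Ordered from least to most uncertain.
distributions = {
    "pure": [1.0],
    "skewed coin": [0.9, 0.1],
    "fair coin": [0.5, 0.5],
    "fair 4-sided die": [0.25] * 4,
}

for name, probs in distributions.items():
    total, product, h = candidate_measures(probs)
    print(f"{name:>16}: sum={total:.2f}  product={product:.4f}  entropy={h:.3f} bits")
```

The sum stays at 1 in every row, the product moves up and down without following the ordering, and the entropy rises steadily as the distributions become more uncertain.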

 


In [None]:
import numpy as np
from scipy.stats import entropy

def entropy_of(data, base=2):
    """
    Accepts:
      - a 1-D array / list of class labels
      - a pandas Series
    Returns the Shannon entropy (bits by default).
    """
    # Convert to a NumPy array (works for list, np.ndarray, pd.Series)
    arr = np.asarray(data).ravel()

    # Compute frequencies → probabilities
    _, counts = np.unique(arr, return_counts=True)
    probs = counts / counts.sum()

    return entropy(probs, base=base)

# Usage
print(entropy_of([1,1,0,0,0,2,2,2,2])) # medium
print(entropy_of([1,1,1,1,1,1])) # low
print(entropy_of([1,2,3,4,5,6,7,8,9]))  # high


1.5304930567574826
0.0
3.1699250014423126


In [6]:
import pandas as pd
from scipy.stats import entropy

def entropy_from_series(series, base=2):
    # value_counts() returns frequencies; normalize=True gives probabilities
    probs = series.value_counts(normalize=True).values
    return entropy(probs, base=base)

# Example
df = pd.DataFrame({
    "target": ["cat", "dog", "cat", "mouse", "dog", "cat"]
})
print("Entropy (bits):", entropy_from_series(df["target"], base=2))

df1 = pd.DataFrame({
    "target": ["cat", "cat", "cat"]
})

df2 = pd.DataFrame({
    "target": ["dog", "mouse", "dog"]
})
print("Entropy (df1):", entropy_from_series(df1["target"], base=2))
print("Entropy (df2):", entropy_from_series(df2["target"], base=2))

parent_entropy = entropy_from_series(df["target"], base=2)
# Information gain uses the size-weighted average of the child entropies, not their plain sum
n_total = len(df1) + len(df2)
child_entropy = (
    len(df1) / n_total * entropy_from_series(df1["target"], base=2)
    + len(df2) / n_total * entropy_from_series(df2["target"], base=2)
)
print("Information gain:", round(parent_entropy - child_entropy, 4))



Entropy (bits): 1.459147917027245
Entropy (df1): 0.0
Entropy (df2): 0.9182958340544894
Information gain: 1.0


In [None]:
df = pd.DataFrame({
    "target": ["green","green","green","green","green","green","green","green","green","green",
     "blue","blue","blue","blue","blue","blue","blue","blue","blue","blue"]
})