# Lab 4 - Information Theory in Machine Learning

Welcome to this week's lab on Information Theory as applied to Machine Learning! We will focus on two key concepts: Entropy and Information Gain. These principles are fundamental to understanding how decision trees choose the splits that organize data.

### Entropy
- Entropy, in the context of information theory, measures the level of uncertainty or disorder within a set of data.
- In machine learning, particularly in decision trees, entropy helps to determine how a dataset should be split. High entropy means the class labels are mixed, so the class of a randomly drawn sample is hard to predict. Conversely, low entropy means most samples share the same class.
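
As a quick illustration of these two points, the sketch below computes Shannon entropy (in bits) for three small label arrays. The helper name `shannon_entropy` and the toy arrays are illustrative assumptions, not part of the lab code.

```python
import numpy as np

def shannon_entropy(labels):
    """Entropy in bits of the empirical class distribution of `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

pure = np.array(["a", "a", "a", "a"])      # one class only
balanced = np.array(["a", "a", "b", "b"])  # two classes, 50/50
skewed = np.array(["a", "a", "a", "b"])    # two classes, 75/25

print(shannon_entropy(pure))      # zero bits: no uncertainty
print(shannon_entropy(balanced))  # one bit: the maximum for two classes
print(shannon_entropy(skewed))    # somewhere between 0 and 1 bit
```

The more evenly the labels are spread across classes, the higher the entropy.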

### Information Gain
- Information Gain measures the reduction in entropy after the dataset is split on an attribute.
- It is crucial in building decision trees as it helps to decide the order of attributes the tree will use for splitting the data. The attribute with the highest Information Gain is chosen as the splitting attribute at each node.
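
In symbols, using the standard definitions (the notation $S$, $A$, $S_v$ and $p_c$ is introduced here for illustration and does not appear elsewhere in the lab):

$$
H(S) = -\sum_{c} p_c \log_2 p_c
$$

$$
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
$$

Here $S$ is the dataset, $p_c$ is the fraction of samples in $S$ that belong to class $c$, $A$ is an attribute, and $S_v$ is the subset of $S$ in which $A$ takes the value $v$.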

## Part 1: Entropy and Information Gain in Decision Trees
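Decision trees use these concepts to create branches. At each node, the tree chooses the split that maximizes Information Gain (equivalently, minimizes the weighted entropy of the resulting subsets), so that each branch contains a purer mix of classes. Entropy-based splitting applies to classification trees; regression trees typically use a variance-based criterion instead.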
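
As a preview of how a library applies this idea, the sketch below fits a one-level scikit-learn tree on Iris with `criterion="entropy"` and inspects the root split. The values `max_depth=1` and `random_state=0` are assumptions chosen to keep the output small and repeatable. Note that scikit-learn splits numeric features at a threshold (`feature <= value`), which differs from the per-value partitioning used later in this lab.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# criterion="entropy" scores candidate splits by entropy reduction.
# max_depth=1 keeps only the root split so it is easy to inspect.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=1, random_state=0)
clf.fit(iris.data, iris.target)

root_feature = iris.feature_names[clf.tree_.feature[0]]
root_threshold = clf.tree_.threshold[0]
root_entropy = clf.tree_.impurity[0]  # entropy of the full training set

print(f"Root split: {root_feature} <= {root_threshold:.2f}")
print(f"Entropy at the root: {root_entropy:.4f}")
```

The root split lands on one of the two petal measurements, because either one separates setosa from the other two species perfectly.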

### Step 1: Import Necessary Libraries

In [160]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

### Step 2: Load and Explore the Iris Dataset

In [162]:
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

### Step 3: Calculate Entropy
To calculate the `entropy` we need to:
- First, extract the target variable `y` from your dataset (like the 'target' column in the Iris dataset).
- Then, call `calculate_entropy(y)` to get the entropy.

The `calculate_entropy` function defined below computes the entropy of a target variable `y`. It first finds the unique classes in `y`, then computes the proportion of samples in each class, and finally sums `-p * log2(p)` over those proportions. The result quantifies the uncertainty in the labels, measured in bits.

In [163]:
# Extract the target variable y from the iris dataset
y = iris.target

# Print the values of y
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [164]:
def calculate_entropy(y):
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        probability = len(y[y == label]) / len(y)
        entropy -= probability * np.log2(probability)
    return entropy


In [165]:
# Call the calculate_entropy function on y
entropy = calculate_entropy(y)

# Print the calculated entropy
print(f"Entropy of the target variable(y): {entropy:.6f}")

Entropy of the target variable(y): 1.584963


### What is your observation about the calculated Entropy?

In the Iris dataset, the target variable (species) has an entropy of 1.584963 bits. This equals log2(3) ≈ 1.585, which is the maximum possible entropy for three classes.

Entropy ranges from 0 (all samples belong to one class, no uncertainty) up to log2(k) for k classes (all classes equally likely, maximum uncertainty).

The target represents three species (setosa, versicolor, virginica), and the dataset contains 50 samples of each, so the class distribution is perfectly uniform.

A uniform distribution gives the highest entropy, while a distribution concentrated in a single class gives an entropy close to 0.

In summary, the calculated entropy reflects maximal uncertainty about the species of a randomly drawn sample: before any split, and without looking at the features, each species is equally likely.
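
A useful reference point when interpreting an entropy value is the maximum possible for k classes, log2(k). The short check below compares that ceiling with the entropy of a 50/50/50 class split such as Iris has; the variable names are illustrative.

```python
import numpy as np

n_classes = 3
max_entropy = np.log2(n_classes)  # entropy of a uniform distribution over 3 classes

# Iris has 50 samples per class, so every class probability is 1/3.
probs = np.array([50, 50, 50]) / 150
iris_entropy = -np.sum(probs * np.log2(probs))

print(f"Maximum entropy for {n_classes} classes: {max_entropy:.6f}")
print(f"Entropy of a 50/50/50 split:   {iris_entropy:.6f}")
```

Both lines print the same number, which confirms that a balanced three-class target sits at the top of the entropy range.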

### Step 4: Calculate Information Gain
There are three steps for calculating the Information Gain:
1. Compute Overall Entropy: Use the entropy function from Step 3 on the entire target dataset.
2. Calculate Weighted Entropy for Each Attribute: For each unique value in the attribute, partition the dataset and calculate its entropy. Then calculate the weighted sum of these entropies, where the weights are the proportions of instances in each partition.
3. Compute Information Gain: Subtract the weighted entropy of the split from the original entropy.

The attribute with the highest Information Gain is generally chosen for splitting, as it provides the most significant reduction in uncertainty. This step is critical in constructing an effective decision tree, as it directly influences the structure and depth of the tree.
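
The three steps can be traced by hand on a tiny categorical example. The data below is made up for illustration, and `entropy_bits` is a stand-in helper rather than the lab's `calculate_entropy`.

```python
import numpy as np

def entropy_bits(labels):
    """Entropy in bits of the empirical class distribution of `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

# Toy data: does a customer buy, given the weather?
weather = np.array(["sun", "sun", "sun", "rain", "rain", "rain", "rain", "sun"])
buys = np.array(["yes", "yes", "yes", "no", "no", "no", "yes", "no"])

# 1. Overall entropy of the target (4 "yes" and 4 "no").
total = entropy_bits(buys)

# 2. Weighted entropy of the partitions induced by the attribute.
weighted = 0.0
for value in np.unique(weather):
    subset = buys[weather == value]
    weighted += len(subset) / len(buys) * entropy_bits(subset)

# 3. Information gain is the reduction in entropy.
gain = total - weighted

print(f"total={total:.4f} weighted={weighted:.4f} gain={gain:.4f}")
```

Each weather value leaves a 3-to-1 mix of outcomes, so the split removes only part of the original uncertainty.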

In [166]:
def calculate_information_gain(df, attribute, target_name):
    # Entropy of the target before splitting
    total_entropy = calculate_entropy(df[target_name])
    # Partition the rows by each distinct value of the attribute
    values, counts = np.unique(df[attribute], return_counts=True)
    weighted_entropy = 0
    for value, count in zip(values, counts):
        subset = df.loc[df[attribute] == value, target_name]
        # Weight each partition's entropy by its share of the rows
        weighted_entropy += (count / counts.sum()) * calculate_entropy(subset)
    return total_entropy - weighted_entropy
    

In [169]:
iris_information_gain = calculate_information_gain(df, 'sepal length (cm)', 'target')
print(f"Information Gain for 'sepal length': {iris_information_gain:.4f}")

iris_information_gain = calculate_information_gain(df, 'petal length (cm)', 'target')
print(f"Information Gain for 'petal length': {iris_information_gain:.4f}")

iris_information_gain = calculate_information_gain(df, 'sepal width (cm)', 'target')
print(f"Information Gain for 'sepal width': {iris_information_gain:.4f}")

iris_information_gain = calculate_information_gain(df, 'petal width (cm)', 'target')
print(f"Information Gain for 'petal width': {iris_information_gain:.4f}")

Information Gain for 'sepal length': 0.8769
Information Gain for 'petal length': 1.4463
Information Gain for 'sepal width': 0.5166
Information Gain for 'petal width': 1.4359
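
The gains above treat every distinct measurement as its own category, which tends to favour attributes with many distinct values. Tree learners in the CART family usually split a numeric attribute at a single threshold instead. The sketch below shows one way to compute the best threshold-based gain; `best_threshold_gain` and the toy arrays are hypothetical and not part of the lab code.

```python
import numpy as np

def entropy_bits(labels):
    """Entropy in bits of the empirical class distribution of `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def best_threshold_gain(x, y):
    """Return (threshold, gain) for the best binary split `x <= threshold`."""
    total = entropy_bits(y)
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = (values[:-1] + values[1:]) / 2
    best = (None, 0.0)
    for t in thresholds:
        left, right = y[x <= t], y[x > t]
        weighted = (len(left) * entropy_bits(left)
                    + len(right) * entropy_bits(right)) / len(y)
        gain = total - weighted
        if gain > best[1]:
            best = (float(t), gain)
    return best

# Toy numeric attribute whose low values belong to class 0 and high values to class 1.
x = np.array([1.0, 1.5, 2.0, 4.0, 4.5, 5.0])
y = np.array([0, 0, 0, 1, 1, 1])

threshold, gain = best_threshold_gain(x, y)
print(f"best threshold={threshold}, gain={gain:.4f}")
```

With a binary split, the gain can never exceed the entropy that a two-way partition is able to remove, which makes values comparable across attributes.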


Discuss your findings here.

## Part 2: Apply Entropy and Information Gain on a different dataset

Your task is to choose a new dataset and implement what you learned in `Part 1` on this new dataset.

### Task 1: Implement Entropy and Information Gain

In [172]:
# Your code goes here
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

In [181]:
# Load the diabetes data
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df['target'] = diabetes.target


In [175]:
# Extract the target variable y from the diabetes dataset
y = diabetes.target

# Print the values of y
print(y)

[151.  75. 141. 206. 135.  97. 138.  63. 110. 310. 101.  69. 179. 185.
 118. 171. 166. 144.  97. 168.  68.  49.  68. 245. 184. 202. 137.  85.
 131. 283. 129.  59. 341.  87.  65. 102. 265. 276. 252.  90. 100.  55.
  61.  92. 259.  53. 190. 142.  75. 142. 155. 225.  59. 104. 182. 128.
  52.  37. 170. 170.  61. 144.  52. 128.  71. 163. 150.  97. 160. 178.
  48. 270. 202. 111.  85.  42. 170. 200. 252. 113. 143.  51.  52. 210.
  65. 141.  55. 134.  42. 111.  98. 164.  48.  96.  90. 162. 150. 279.
  92.  83. 128. 102. 302. 198.  95.  53. 134. 144. 232.  81. 104.  59.
 246. 297. 258. 229. 275. 281. 179. 200. 200. 173. 180.  84. 121. 161.
  99. 109. 115. 268. 274. 158. 107.  83. 103. 272.  85. 280. 336. 281.
 118. 317. 235.  60. 174. 259. 178. 128.  96. 126. 288.  88. 292.  71.
 197. 186.  25.  84.  96. 195.  53. 217. 172. 131. 214.  59.  70. 220.
 268. 152.  47.  74. 295. 101. 151. 127. 237. 225.  81. 151. 107.  64.
 138. 185. 265. 101. 137. 143. 141.  79. 292. 178.  91. 116.  86. 122.
  72. 

In [176]:
def calculate_entropy(y):
    class_labels = np.unique(y)
    entropy = 0
    for label in class_labels:
        probability = len(y[y == label]) / len(y)
        entropy -= probability * np.log2(probability)
    return entropy


In [177]:
# Call the calculate_entropy function on y
entropy = calculate_entropy(y)

# Print the calculated entropy
print(f"Entropy of the target variable(y): {entropy:.6f}")

Entropy of the target variable(y): 7.541840
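
An entropy of about 7.54 bits is large because the diabetes target is a numeric score with many distinct values, and `calculate_entropy` treats each distinct value as its own class. For n samples the ceiling is log2(n), reached when every value is unique. The sketch below uses synthetic data (an assumption, not the diabetes data) to show the effect and one common workaround: binning the target into quantile groups first. The choice of four bins is arbitrary.

```python
import numpy as np

def entropy_bits(values):
    """Entropy in bits, treating each distinct value as a class."""
    _, counts = np.unique(values, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

rng = np.random.default_rng(0)
# Synthetic stand-in for a numeric target with 442 samples.
y_continuous = rng.normal(loc=150, scale=75, size=442)

# Every value is distinct, so the entropy reaches its ceiling of log2(n).
raw_entropy = entropy_bits(y_continuous)
ceiling = np.log2(len(y_continuous))
print(f"raw entropy={raw_entropy:.4f}, log2(n)={ceiling:.4f}")

# Bin into quartiles and recompute; labels are 0, 1, 2, 3.
edges = np.quantile(y_continuous, [0.25, 0.5, 0.75])
y_binned = np.digitize(y_continuous, edges)
binned_entropy = entropy_bits(y_binned)
print(f"binned entropy={binned_entropy:.4f}")  # close to log2(4) = 2 bits
```

After binning, the entropy describes uncertainty over a handful of meaningful ranges rather than over individual measurements.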


In [178]:
def calculate_information_gain(df, attribute, target_name):
    # Entropy of the target before splitting
    total_entropy = calculate_entropy(df[target_name])
    # Partition the rows by each distinct value of the attribute
    values, counts = np.unique(df[attribute], return_counts=True)
    weighted_entropy = 0
    for value, count in zip(values, counts):
        subset = df.loc[df[attribute] == value, target_name]
        # Weight each partition's entropy by its share of the rows
        weighted_entropy += (count / counts.sum()) * calculate_entropy(subset)
    return total_entropy - weighted_entropy


In [179]:
for feature in diabetes.feature_names:
    diabetes_information_gain = calculate_information_gain(df, feature, 'target')
    print(f" Information Gain for '{feature}': {diabetes_information_gain:.4f}")

 Information Gain for 'age': 4.3541
 Information Gain for 'sex': 0.4839
 Information Gain for 'bmi': 5.8400
 Information Gain for 'bp': 4.7758
 Information Gain for 's1': 5.5924
 Information Gain for 's2': 6.8312
 Information Gain for 's3': (not recorded)
 Information Gain for 's4': 2.4058
 Information Gain for 's5': 5.9613
 Information Gain for 's6': 4.2252

Note: the recorded run printed the `sex` gain (0.4839) under the `s3` label and repeated the `age` gain under `sex`, because two `print` calls referenced the wrong variable. The `s3` gain was never printed; rerun the cell to obtain it.


### Task 2: Discuss your findings in detail
Provide a detailed explanation and discussion about your findings.


• The findings show that the information gain of each feature varies with how much it reduces the entropy of the target variable. Note that the target in scikit-learn's diabetes dataset is a quantitative measure of disease progression one year after baseline, not a yes/no diagnosis, so the gains describe how well each feature's values isolate particular progression scores rather than whether a patient has diabetes.

• The feature with the highest information gain is the blood serum measurement `s2`, with a value of 6.8312 out of a total target entropy of 7.5418. Splitting on `s2` therefore removes most of the measured uncertainty about the progression score.

• The feature with the second highest information gain is the blood serum measurement `s5`, with a value of 5.9613. Splitting on `s5` also removes a large share of the target entropy.

• The feature with the third highest information gain is Body Mass Index (`bmi`), with a value of 5.8400, followed by `s1` at 5.5924. A link between BMI and diabetes outcomes is plausible, but this measure alone does not establish it, for the reason given below.

• The smallest gain in the output is 0.4839, followed by `s4` at 2.4058 and `s6` at 4.2252. A feature with only two distinct values, such as `sex`, can never gain more than 1 bit, whereas the other features take many distinct values. Gains are therefore not directly comparable across features with different numbers of distinct values.

• These gains must be read with caution. `calculate_information_gain` treats every distinct feature value as its own category, and both the features and the target are numeric. A feature with many distinct values splits the 442 rows into very small groups, each with near-zero entropy, so its gain is high regardless of how predictive it is. The ranking largely reflects how many distinct values each feature has.

• By this measure `s2`, `s5`, and `bmi` rank highest, and the features with the smallest gains rank lowest. Before using the ranking to prioritize or exclude features in a decision tree, the comparison should be repeated with binned values or threshold-based splits. Because the target is numeric, a regression tree is the more natural model than a decision tree classifier.
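
One caveat worth checking when ranking attributes by information gain: if an attribute has almost as many distinct values as there are rows, each partition holds one or two rows, its entropy is close to zero, and the gain approaches the full entropy of the target whether or not the attribute is predictive. The sketch below demonstrates this on synthetic data; the `info_gain` helper and all arrays are hypothetical.

```python
import numpy as np

def entropy_bits(labels):
    """Entropy in bits of the empirical class distribution of `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def info_gain(attribute, target):
    """Information gain when partitioning `target` by each distinct attribute value."""
    total = entropy_bits(target)
    weighted = 0.0
    for value in np.unique(attribute):
        subset = target[attribute == value]
        weighted += len(subset) / len(target) * entropy_bits(subset)
    return total - weighted

rng = np.random.default_rng(42)
target = rng.integers(0, 2, size=200)        # random binary labels
noise_unique = rng.normal(size=200)          # pure noise, every value distinct
noise_binary = rng.integers(0, 2, size=200)  # pure noise, only two values

gain_unique = info_gain(noise_unique, target)
gain_binary = info_gain(noise_binary, target)

print(f"target entropy:           {entropy_bits(target):.4f}")
print(f"gain, all-distinct noise: {gain_unique:.4f}")  # equals the target entropy
print(f"gain, two-valued noise:   {gain_binary:.4f}")  # close to 0
```

Both attributes are unrelated to the target, yet the all-distinct one scores a "perfect" gain. Binning numeric attributes or using threshold splits avoids this artifact.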


## Submission
Submit a link to your completed Jupyter Notebook file hosted on your private GitHub repository through the submission link in Blackboard.