

<div align="center">
  <h1> Feature selection using information gain </h1>
</div>

### A glimpse at Entropy  : 
https://youtu.be/YtebGVx-Fxw?si=8WeYydAUV6rlan4I

In 1948, Claude Shannon introduced the concept of entropy in his paper "A Mathematical Theory of Communication". In information theory, entropy is a measure of the uncertainty or surprise associated with a random variable.
- Entropy formula :
$$ H(X) = \sum_{i=1}^{n} P(x_i) \cdot \log \left(\frac{1}{P(x_i)}\right) $$

- Conditional entropy formula :
$$ H(Y|X) = - \sum_{i=1}^{m} \sum_{j=1}^{n} P(x_i, y_j) \cdot \log \left(\frac{P(x_i, y_j)}{P(x_i)}\right) $$
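
The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `entropy` and `conditional_entropy` are hypothetical, probabilities are estimated as relative frequencies, and the logarithm is assumed to be base 2 (entropy in bits).

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(X) in bits, estimated from a list of observations."""
    counts = Counter(values)
    total = len(values)
    # Sum of P(x) * log2(1 / P(x)) over the distinct values x
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def conditional_entropy(xs, ys):
    """H(Y|X) in bits, estimated from paired observations (x, y)."""
    total = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    marginal = Counter(xs)        # counts of each x
    # Sum of P(x, y) * log2(P(x) / P(x, y)), i.e. -sum P(x, y) * log2(P(x, y) / P(x))
    return sum(
        (c / total) * math.log2(marginal[x] / c) for (x, _), c in joint.items()
    )


coin = ["H", "T", "H", "T"]
print(entropy(coin))                    # fair coin: 1.0 bit
print(conditional_entropy(coin, coin))  # Y is fully determined by X: 0.0
```

A fair coin carries one bit of uncertainty, and that uncertainty disappears once the variable it is conditioned on determines it completely.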




In the context of information gain, entropy is a measure of impurity or uncertainty in a set of data. Specifically, it quantifies how mixed the class labels are within a collection of examples. In decision tree algorithms, entropy is used to evaluate how effective it is to split a dataset on a given attribute. The goal is to find the attribute whose split leaves the lowest weighted entropy in the resulting subsets, thereby maximizing information gain and improving the efficiency of the decision tree in classifying or predicting outcomes.

### A glimpse at mutual information :
https://youtu.be/eJIp_mgVLwE?si=T6gEhOWskObv_a8s

The mutual information between two random variables measures the statistical dependence between them, whether that dependence is linear or non-linear. It indicates how much information about one random variable can be obtained by observing the other.

It is closely linked to the concept of entropy: mutual information is the reduction in the uncertainty of one random variable when the other is known. Therefore, a high mutual information value indicates a large reduction of uncertainty, whereas a low value indicates a small reduction. The mutual information is zero if and only if the two random variables are independent.
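
The link with entropy can be illustrated with the identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$. The sketch below is only an illustration under simple assumptions: the helper names are hypothetical, probabilities are relative frequencies, and entropies are measured in bits.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits, estimated from a list of observations."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


x = [0, 0, 1, 1]
independent_y = [0, 1, 0, 1]  # every (x, y) combination is equally frequent
copy_of_x = [0, 0, 1, 1]      # knowing x removes all uncertainty about y

print(mutual_information(x, independent_y))  # 0.0
print(mutual_information(x, copy_of_x))      # 1.0
```

The first pair shares no information, while in the second pair observing one variable reveals the full bit carried by the other.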

### <u> Exercise 01 : </u><br>

Consider the following dataset describing the conditions observed on seven days
(temperature, cloud cover and humidity) together with the resulting weather:

| Temperature | Cloud cover |Humidity | Weather |
| ----------- | ------------- |-------------|---------- |
| 75 | Sunny | Low | Sunny |
| 80 | Partly Cloudy | High | Sunny |
| 85 | Overcast | High | Rainy |
| 70 | Sunny | Medium | Sunny |
| 65 | Overcast | Medium | Stormy |
| 60 | Partly Cloudy | Low | Sunny |
| 90 | Overcast | High | Rainy |

Your task is to select the most relevant features (Temperature, Cloud cover, Humidity) to predict
the target feature Weather. To do this, you will calculate the information
gain for each feature.

1. Discretize the Temperature feature into three levels: warm, hot, and very hot.
2. Calculate the Entropy of the target class Weather, using the formula: <br>

$$
H(X) = - \sum_{i=1}^{n} P(x_i) \cdot \log_2(P(x_i))
$$

Where $P(x_i)$ is the probability (relative frequency) of the value $x_i$ among the records of the feature $X$, and $n$ is the number of distinct values that $X$ takes.

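3. Calculate the information gain of each feature with respect to the target
   feature Weather, using the information gain formula: <br>

   $$
   \text{IG}(A, X) = H(X) - \sum_{j=1}^{m} P(a_j) \cdot H(X \mid A = a_j)
   $$

   Where $X$ is the target feature, $A$ is a feature taking $m$ distinct values $a_1, \dots, a_m$, $P(a_j)$ is the fraction of records where $A = a_j$, and $H(X \mid A = a_j)$ is the entropy of the target $X$ calculated on the portion of the data where the feature $A$ has the value $a_j$.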

4. Order the features according to their IG.


### <u> Solution : </u>

#### Step 1 : Discretize the feature Temperature; so we have three levels of temperature: warm, hot, and very hot.


```python
import numpy as np


def calculate_percentiles(lst):
    # The 33rd and 66th percentiles split the values into three groups of similar size
    percentile_33 = np.percentile(lst, 33.33)
    percentile_66 = np.percentile(lst, 66.66)
    return round(percentile_33), round(percentile_66)


temp_list = [75, 80, 85, 70, 65, 60, 90]
print(calculate_percentiles(temp_list))
```

```
(70, 80)
```


So the temperature column can be discretized accordingly :  <br>
<b>Warm: [60, 70)</b> <br>
<b>Hot: [70, 80)</b> <br>
<b>Very Hot: [80, 90]</b> <br>
<br>
We obtain this table : <br>

| Temperature | Cloud cover   |Humidity | Weather |
| ----------- | ------------- |-------------|---------- |
| Hot         | Sunny | Low | Sunny |
| Very Hot | Partly Cloudy | High | Sunny |
| Very Hot | Overcast | High | Rainy |
| Hot | Sunny | Medium | Sunny |
| Warm | Overcast | Medium | Stormy |
| Warm | Partly Cloudy | Low | Sunny |
| Very Hot | Overcast | High | Rainy |
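
As a quick check of the table above, the mapping from raw temperatures to levels can be sketched with the standard library. The names `cut_points` and `labels` are illustrative; the thresholds 70 and 80 are the percentiles computed in Step 1, and each lower bound is assumed to be inclusive.

```python
from bisect import bisect_right

cut_points = [70, 80]                 # thresholds from the percentile step
labels = ["Warm", "Hot", "Very Hot"]  # one label per interval

temperatures = [75, 80, 85, 70, 65, 60, 90]

# bisect_right counts how many cut points are <= t, which is the index of the interval
levels = [labels[bisect_right(cut_points, t)] for t in temperatures]
print(levels)
# ['Hot', 'Very Hot', 'Very Hot', 'Hot', 'Warm', 'Warm', 'Very Hot']
```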

#### Step 2 : Calculate the Entropy of the target class Weather


$$
H(X) = -( P(\text{Sunny}) \cdot \log_2(P(\text{Sunny})) + P(\text{Rainy}) \cdot \log_2(P(\text{Rainy})) + P(\text{Stormy}) \cdot \log_2(P(\text{Stormy})) )
$$
Where $P(\text{Sunny})$, $P(\text{Rainy})$ and $P(\text{Stormy})$ are the probabilities of each value that the 'Weather' class takes. <br>
Out of the $7$ records, Sunny appears 4 times, Rainy 2 times and Stormy once, so : <br>

$P(\text{Sunny})= \frac{4}{7} $ <br>
$P(\text{Rainy}) = \frac{2}{7}$ <br>
$P(\text{Stormy}) = \frac{1}{7}$ <br>

The entropy of the 'Weather' class is then given by : <br>
$$
H(\text{Weather}) = - \left( \frac{4}{7} \cdot \log_2\left(\frac{4}{7}\right) + \frac{2}{7} \cdot \log_2\left(\frac{2}{7}\right) + \frac{1}{7} \cdot \log_2\left(\frac{1}{7}\right) \right) 
$$

$$H(\text{Weather})\approx1.3788$$

Term by term (in bits): $0.4613 + 0.5164 + 0.4011 \approx 1.3788$
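
The value of $H(\text{Weather})$ can be verified numerically. The snippet below is a small sketch that recomputes it from the Weather column of the table, using base-2 logarithms as in the formula above; the variable names are illustrative.

```python
import math
from collections import Counter

# Weather column of the dataset, in row order
weather = ["Sunny", "Sunny", "Rainy", "Sunny", "Stormy", "Sunny", "Rainy"]

counts = Counter(weather)
n_records = len(weather)

# H = sum of P(x) * log2(1 / P(x)) over the distinct weather values
h_weather = sum((c / n_records) * math.log2(n_records / c) for c in counts.values())
print(round(h_weather, 4))  # 1.3788
```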

#### Step 3 : Calculate the information gain of each feature in regard to the target feature Weather

1.  The <b> 'Cloud Cover' </b> feature : 

For each value of Cloud Cover, we compute the entropy of Weather on the matching rows (classes that do not occur in a subset contribute nothing). The 2 'Sunny' rows and the 2 'Partly Cloudy' rows all have Sunny weather; the 3 'Overcast' rows have 2 Rainy and 1 Stormy.

\begin{align*}
H(\text{Weather} \mid \text{Cloud Cover} = \text{Sunny}) &= -\left(\frac{2}{2} \cdot \log_2\left(\frac{2}{2}\right)\right) \\
&= 0
\end{align*}

\begin{align*}
H(\text{Weather} \mid \text{Cloud Cover} = \text{Partly Cloudy}) &= -\left(\frac{2}{2} \cdot \log_2\left(\frac{2}{2}\right)\right) \\
&= 0
\end{align*}

\begin{align*}
H(\text{Weather} \mid \text{Cloud Cover} = \text{Overcast}) &= -\left(\frac{2}{3} \cdot \log_2\left(\frac{2}{3}\right) + \frac{1}{3} \cdot \log_2\left(\frac{1}{3}\right)\right) \\
&\approx 0.92
\end{align*}

With $H(\text{Weather}) \approx 1.38$ from Step 2:

\begin{align*}
\text{IG}(\text{Cloud Cover, Weather}) &= 1.38 - \left(\frac{2}{7} \cdot 0 + \frac{2}{7} \cdot 0 + \frac{3}{7} \cdot 0.92\right) \\
&\approx 1.38 - \left(0 + 0 + 0.39\right) \\
&\approx 0.99
\end{align*}

2. The <b> 'Humidity'  </b>feature : 

The 2 'Low' rows both have Sunny weather; the 3 'High' rows have 1 Sunny and 2 Rainy; the 2 'Medium' rows have 1 Sunny and 1 Stormy.

\begin{align*}
H(\text{Weather} \mid \text{Humidity} = \text{Low}) &= -\left(\frac{2}{2} \cdot \log_2\left(\frac{2}{2}\right)\right) \\
&= 0
\end{align*}

\begin{align*}
H(\text{Weather} \mid \text{Humidity} = \text{High}) &= -\left(\frac{1}{3} \cdot \log_2\left(\frac{1}{3}\right) + \frac{2}{3} \cdot \log_2\left(\frac{2}{3}\right)\right) \\
&\approx 0.92
\end{align*}

\begin{align*}
H(\text{Weather} \mid \text{Humidity} = \text{Medium}) &= -\left(\frac{1}{2} \cdot \log_2\left(\frac{1}{2}\right) + \frac{1}{2} \cdot \log_2\left(\frac{1}{2}\right)\right) \\
&= 1
\end{align*}

With $H(\text{Weather}) \approx 1.38$ from Step 2:

\begin{align*}
\text{IG}(\text{Humidity, Weather}) &= 1.38 - \left(\frac{2}{7} \cdot 0 + \frac{3}{7} \cdot 0.92 + \frac{2}{7} \cdot 1\right) \\
&\approx 1.38 - \left(0 + 0.39 + 0.29\right) \\
&\approx 0.70
\end{align*}

3. The <b>'Temperature' </b> feature : 

The 2 'Warm' rows have 1 Stormy and 1 Sunny; the 2 'Hot' rows both have Sunny weather; the 3 'Very Hot' rows have 1 Sunny and 2 Rainy.

\begin{align*}
H(\text{Weather} \mid \text{Temperature} = \text{Warm}) &= -\left(\frac{1}{2} \cdot \log_2\left(\frac{1}{2}\right) + \frac{1}{2} \cdot \log_2\left(\frac{1}{2}\right)\right) \\
&= 1
\end{align*}

\begin{align*}
H(\text{Weather} \mid \text{Temperature} = \text{Hot}) &= -\left(\frac{2}{2} \cdot \log_2\left(\frac{2}{2}\right)\right) \\
&= 0
\end{align*}

\begin{align*}
H(\text{Weather} \mid \text{Temperature} = \text{Very Hot}) &= -\left(\frac{1}{3} \cdot \log_2\left(\frac{1}{3}\right) + \frac{2}{3} \cdot \log_2\left(\frac{2}{3}\right)\right) \\
&\approx 0.92
\end{align*}

With $H(\text{Weather}) \approx 1.38$ from Step 2:

\begin{align*}
\text{IG}(\text{Temperature, Weather}) &= 1.38 - \left(\frac{2}{7} \cdot 1 + \frac{2}{7} \cdot 0 + \frac{3}{7} \cdot 0.92\right) \\
&\approx 1.38 - \left(0.29 + 0 + 0.39\right) \\
&\approx 0.70
\end{align*}
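
The hand calculations in this step can be cross-checked programmatically. The following is a minimal sketch, assuming the discretized table from Step 1 is stored as a list of tuples; the names `rows`, `features`, `entropy` and `information_gain` are illustrative and not part of the exercise.

```python
import math
from collections import Counter, defaultdict

rows = [
    # (Temperature, Cloud cover, Humidity, Weather)
    ("Hot", "Sunny", "Low", "Sunny"),
    ("Very Hot", "Partly Cloudy", "High", "Sunny"),
    ("Very Hot", "Overcast", "High", "Rainy"),
    ("Hot", "Sunny", "Medium", "Sunny"),
    ("Warm", "Overcast", "Medium", "Stormy"),
    ("Warm", "Partly Cloudy", "Low", "Sunny"),
    ("Very Hot", "Overcast", "High", "Rainy"),
]
features = {"Temperature": 0, "Cloud cover": 1, "Humidity": 2}
TARGET = 3  # column index of Weather


def entropy(values):
    """Shannon entropy in bits of a list of class labels."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def information_gain(rows, feature_idx, target_idx):
    """H(target) minus the weighted entropy of the target within each feature value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature_idx]].append(row[target_idx])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target_idx] for row in rows]) - remainder


gains = {name: information_gain(rows, idx, TARGET) for name, idx in features.items()}
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gain:.3f}")
```

Grouping the target labels by feature value mirrors the definition of information gain directly: each group contributes its own entropy, weighted by the share of records it contains.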


Let's recapitulate the results : <br>
- Information Gain for <b>Cloud Cover </b>: 0.99
- Information Gain for <b>Temperature </b>: 0.70
- Information Gain for <b>Humidity</b>: 0.70

NOTE : Information Gain is never negative. It is the entropy of the target minus the weighted average entropy of the target after the split, and knowing a feature cannot increase that average uncertainty. The larger the Information Gain, the more significant the reduction in uncertainty when considering that feature.

### Conclusion and results interpretation : <br>
- The IG values indicate how much uncertainty in predicting the Weather is reduced by considering each feature. Higher IG suggests that a feature is more relevant for predicting the target variable (Weather in this case).

- In this dataset, the feature <b>'Cloud Cover' </b> has the highest Information Gain (0.99), indicating that it is the most relevant feature to choose to best predict the Weather: every Sunny or Partly Cloudy day has Sunny weather, and only Overcast days are Rainy or Stormy. Temperature and Humidity come next, tied with an Information Gain of about 0.70 each.

