<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Filtering-Methods" data-toc-modified-id="Filtering-Methods-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Filtering Methods</a></span><ul class="toc-item"><li><span><a href="#Chi-Squared-Test" data-toc-modified-id="Chi-Squared-Test-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Chi Squared Test</a></span></li><li><span><a href="#Information-Gain,-Mutual-Information" data-toc-modified-id="Information-Gain,-Mutual-Information-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Information Gain, Mutual Information</a></span></li><li><span><a href="#Mutual-Information" data-toc-modified-id="Mutual-Information-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Mutual Information</a></span></li></ul></li></ul></div>

# Filtering Methods

In this example we check how to filter a dataset composed mainly of categorical features, using the Chi Squared method and the Information Gain method.

The data used to illustrate the example is the toy dataset used in [Wikipedia](https://en.wikipedia.org/wiki/Chi-squared_test) to explain how the Chi Squared test works. In that example, suppose there is a city of 1,000,000 residents with four neighborhoods: A, B, C, and D. A random sample of 650 residents of the city is taken and their occupation is recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. The data are tabulated as:

| |A  |B  | C | D |
|-|:-:|:-:|:-:|:-:|
|White collar|90|60|104|95|
|Blue collar |30|50| 51|20|
|No collar   |30|40|45|35|

From these two features (neighborhood and occupation), we will compute the Chi Sq. statistic and the information gain, to decide whether the two variables are **dependent** or not.

In [1]:
import numpy as np

from math import log2
from sklearn.feature_selection import mutual_info_classif
from scipy import stats

white_collar = [90, 60, 104, 95]
blue_collar = [30, 50, 51, 20]
no_collar = [30, 40, 45, 35]

occupation = np.array([white_collar, blue_collar, no_collar])

## Chi Squared Test

Let's start by computing the Chi Sq. value. To do so, we call `chi2_contingency()`, which returns four values: the test statistic, the p-value, the degrees of freedom, and the table of frequencies expected under independence.

In [2]:
chi2, p_value, dof, expected = stats.chi2_contingency(occupation)

print('Chi sq. stats: {:.2f}'.format(chi2))
print('Degrees of freedom: {}'.format(dof))
print('P-value: {:.4f}'.format(p_value))
print('Expected frequencies:\n', expected)

Chi sq. stats: 24.57
Degrees of freedom: 6
P-value: 0.0004
Expected frequencies:
 [[ 80.53846154  80.53846154 107.38461538  80.53846154]
 [ 34.84615385  34.84615385  46.46153846  34.84615385]
 [ 34.61538462  34.61538462  46.15384615  34.61538462]]


Remember that the _null hypothesis_ states that the two variables are independent. If the p-value obtained from the $\chi^2$ statistic is lower than the significance level, then we should **reject** that hypothesis.

In this case the p-value ($0.0004$) is below the $0.05$ significance level, which means that neighbourhood and occupation are **dependent**.
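
To make the test less of a black box, the statistic can be recomputed by hand from the observed counts. The following is a minimal sketch in plain Python, not the way SciPy implements it. The names `observed`, `chi2_manual`, `dof_manual`, `p_manual` and `alpha` are illustrative, and the p-value uses a closed-form tail expression that only holds when the degrees of freedom are even (here, 6).

```python
from math import exp, factorial

# Observed counts from the table above (rows: occupation, columns: neighborhood)
observed = [
    [90, 60, 104, 95],   # white collar
    [30, 50, 51, 20],    # blue collar
    [30, 40, 45, 35],    # no collar
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof_manual = (len(row_totals) - 1) * (len(col_totals) - 1)

# Closed-form upper tail, valid only for an even number of degrees of freedom
half = chi2_manual / 2
p_manual = exp(-half) * sum(half ** j / factorial(j) for j in range(dof_manual // 2))

alpha = 0.05  # assumed significance level
print('chi2 = {:.2f}, dof = {}, p = {:.4f}'.format(chi2_manual, dof_manual, p_manual))
print('Reject independence:', p_manual < alpha)
```

The values should agree with those reported by `chi2_contingency()`; for an odd number of degrees of freedom, `stats.chi2.sf` would be the safer way to get the p-value.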

## Information Gain, Mutual Information

To compute information gain I wrote a very simple function that takes as arguments the probabilities of the two values of a binary class, and returns the entropy as per the following formula:

$$ H(class) = - \sum_{i=0}^{1} P(e_i) \cdot \log_2 P(e_i) = -\left[ P(e_0) \log_2 P(e_0) + P(e_1) \log_2 P(e_1) \right] $$

It will be easy to transform this function into a more general one that will accept multinomial classes.
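
As an illustration of that generalization, here is one possible sketch that accepts any number of class probabilities. The name `entropy_multiclass` is hypothetical and is not used elsewhere in this notebook; zero probabilities are skipped, following the usual convention that $0 \cdot \log_2 0 = 0$.

```python
from math import log2

def entropy_multiclass(probabilities):
    """Shannon entropy in bits of a discrete distribution.

    Terms with zero probability are skipped (0 * log2(0) is taken as 0).
    """
    total = sum(p * log2(p) for p in probabilities if p > 0)
    # Return a plain 0.0 instead of -0.0 when the distribution is pure
    return -total if total else 0.0

print('Binary (0.3, 0.7): {:.4f}'.format(entropy_multiclass([0.3, 0.7])))
print('Pure class:        {:.4f}'.format(entropy_multiclass([1.0, 0.0])))
print('Four equal classes: {:.4f}'.format(entropy_multiclass([0.25] * 4)))
```

For two classes this gives the same value as the binary formula above.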

The data corresponds to the example used in class:

|Sex|Pulse|
|:-|-:|
|Female|100|
|Male|25|
|Male|100|
|Male|25|
|Male|50|
|Female|75|
|Male|100|
|Female|75|
|Male|75|
|Male|100|


In [3]:
# Entropy (in bits) of a binary class, given the probabilities of its two values
def entropy(class0, class1):
    if class0 == 0.:           # Prevent from crashing at log(0)
        class0_entropy = 0.
    else:
        class0_entropy = class0 * log2(class0)
        
    if class1 == 0.:           # Prevent from crashing at log(0)
        class1_entropy = 0.
    else:
        class1_entropy = class1 * log2(class1)
        
    return -(class0_entropy + class1_entropy)

In [4]:
s_entropy = entropy(3./10., 7./10.)
print('Sex Entropy: {:>16.4f}'.format(s_entropy))

p25_entropy = entropy(2./2., 0./2.)
print('Pulse  25 Entropy: {:>10.4f}'.format(p25_entropy))
p50_entropy = entropy(1./1., 0./1.)
print('Pulse  50 Entropy: {:>10.4f}'.format(p50_entropy))
p75_entropy = entropy(1./3., 2./3.)
print('Pulse  75 Entropy: {:>10.4f}'.format(p75_entropy))
p100_entropy = entropy(3./4., 1./4.)
print('Pulse 100 Entropy: {:>10.4f}'.format(p100_entropy))

gain = s_entropy - (
    (2./10.* p25_entropy) + 
    (1./10.* p50_entropy) + 
    (3./10.* p75_entropy) + 
    (4./10.* p100_entropy))
print('--\nInformation gain.: {:>10.4f}'.format(gain))

Sex Entropy:           0.8813
Pulse  25 Entropy:    -0.0000
Pulse  50 Entropy:    -0.0000
Pulse  75 Entropy:     0.9183
Pulse 100 Entropy:     0.8113
--
Information gain.:     0.2813


In conclusion, there is no general rule for deciding whether to reject the feature `Pulse` based on its IG value alone. What we normally do is compare the IG values of all the features in our dataset and try different threshold values below which features are rejected.
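
Comparing several features is easier when the gain is computed directly from the raw columns rather than from hand-typed probabilities. The sketch below is one way to do that; the helpers `label_entropy` and `information_gain` are hypothetical names introduced only for this illustration, and the data is the Sex/Pulse table above.

```python
from collections import Counter, defaultdict
from math import log2

def label_entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, target):
    """H(target) minus the weighted entropy of target within each feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature, target):
        groups[value].append(label)
    n = len(target)
    conditional = sum(len(g) / n * label_entropy(g) for g in groups.values())
    return label_entropy(target) - conditional

# Same rows as the Sex/Pulse table, in the same order
pulse_column = [100, 25, 100, 25, 50, 75, 100, 75, 75, 100]
sex_column = ['F', 'M', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'M']

pulse_gain = information_gain(pulse_column, sex_column)
print('Information gain of Pulse: {:.4f}'.format(pulse_gain))
```

With more feature columns available, the same function could be applied to each one and the resulting gains sorted before choosing a threshold.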

## Mutual Information

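Just for the sake of completeness, I include here how to use mutual information from the scikit-learn library. Results from MI and IG should be equivalent up to the logarithm base: for discrete features scikit-learn reports MI in nats (natural logarithm), whereas the entropy function above uses base 2, so the MI value equals the information gain multiplied by $\ln 2$.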

In [5]:
pulse = np.array([[100, 25, 100, 25, 50, 75, 100, 75, 75, 100]])
sex = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0])  # 1 = Female, 0 = Male

In [6]:
# n_neighbors only affects continuous features, so it is omitted for a discrete one
feature_scores = mutual_info_classif(pulse.T, sex, discrete_features=True)
print('MI = {:.4f}'.format(feature_scores[0]))

MI = 0.1950
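
The gap between this value and the information gain of 0.2813 comes only from the logarithm base. The sketch below recomputes mutual information by hand with the natural logarithm and then converts it to bits. It is an independent check in plain Python rather than a reproduction of scikit-learn's internals, and the variable names are illustrative.

```python
from collections import Counter
from math import log

pulse_values = [100, 25, 100, 25, 50, 75, 100, 75, 75, 100]
sex_labels = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]   # 1 = Female, 0 = Male

n = len(sex_labels)
joint = Counter(zip(pulse_values, sex_labels))
pulse_counts = Counter(pulse_values)
sex_counts = Counter(sex_labels)

# MI = sum over observed (x, y) pairs of p(x, y) * ln( p(x, y) / (p(x) * p(y)) )
mi_nats = sum(
    (c / n) * log(c * n / (pulse_counts[x] * sex_counts[y]))
    for (x, y), c in joint.items()
)

# Dividing by ln(2) converts nats to bits
mi_bits = mi_nats / log(2)

print('MI in nats: {:.4f}'.format(mi_nats))
print('MI in bits: {:.4f}'.format(mi_bits))
```

The value in nats should match the scikit-learn score, and the value in bits should match the information gain computed earlier.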
