# Mutual information

The purpose of this notebook is to find the features that share the most mutual information with both the target `y` and the sensitive feature `s9` (here, the loan applicant's sex).
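
For intuition about the score used in the cells below, here is a minimal sketch on hand-made label lists (the lists are illustrative and not taken from the dataset). `normalized_mutual_info_score` compares two labelings of the same rows: a relabelled copy of a column should score at or near 1, while a column that carries no information about it should score at or near 0.

```python
from sklearn.metrics import normalized_mutual_info_score

# illustrative labels only, not taken from the credit dataset
a = [0, 0, 1, 1]

# same grouping of rows as `a`, with the label names swapped
same_partition = normalized_mutual_info_score(a, [1, 1, 0, 0])

# knowing this column tells us nothing about `a`
independent = normalized_mutual_info_score(a, [0, 1, 0, 1])

print(same_partition, independent)
```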

In [29]:
# import everything required
import numpy as np
import pandas as pd
import keras as ks
import matplotlib.pyplot as plt
from sklearn import feature_selection
from sklearn.metrics.cluster import normalized_mutual_info_score as mi

# for reproducibility
np.random.seed(123)

# load data
PATH = "datasets/german_credit.csv"
raw_data = pd.read_csv(PATH, index_col=False)
df = pd.DataFrame(raw_data)

In [58]:
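def mutual_info_stats(df, sensitive_name):
    """Score every other column of `df` against the column `sensitive_name`.

    Uses sklearn's normalized_mutual_info_score (imported above as `mi`),
    prints the best-scoring column and returns the scores as an (n, 1) array.
    """
    # reference column, removed from the candidates so it is not scored against itself
    sensitive = df[sensitive_name]
    df = df.drop(sensitive_name, axis=1)

    # one score per remaining column, in column order
    mutual_informations = np.zeros((df.shape[1], 1))

    for i, col in enumerate(df.columns):
        mutual_informations[i] = mi(df[col], sensitive)

    maximum = np.amax(mutual_informations)
    index_of_max = np.argmax(mutual_informations)

    print("There is maximum mutual information of", maximum, "between", sensitive_name, "and", df.columns[index_of_max])

    return mutual_informations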

print(mutual_info_stats(df, 's9'))
print(mutual_info_stats(df, 'y'))

There is maximum mutual information of 0.37008620506805123 between s9 and ns5
[[0.0056476 ]
 [0.03453689]
 [0.01194647]
 [0.02432974]
 [0.37008621]
 [0.00523346]
 [0.03399761]
 [0.01329708]
 [0.00362657]
 [0.01488925]
 [0.01970804]
 [0.07196294]
 [0.00455454]
 [0.04670449]
 [0.01101005]
 [0.00946846]
 [0.06826694]
 [0.00472697]
 [0.00564488]
 [0.00586109]]
There is maximum mutual information of 0.2802530349639538 between y and ns5
[[7.51771477e-02]
 [3.47819316e-02]
 [3.55114453e-02]
 [1.62383444e-02]
 [2.80253035e-01]
 [2.30526511e-02]
 [9.50704284e-03]
 [3.14594511e-03]
 [5.86109088e-03]
 [6.96359572e-03]
 [4.25781454e-04]
 [1.29640471e-02]
 [2.11531997e-02]
 [1.02862592e-02]
 [1.27290198e-02]
 [1.97841899e-03]
 [1.19826892e-03]
 [8.86908694e-06]
 [1.04052779e-03]
 [1.29799384e-02]]
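
The arrays above are indexed by column position, which makes them hard to read against feature names. The following is a minimal sketch of a labelled ranking, assuming a hypothetical helper `rank_shared_features` and a hypothetical `min_of_both` column that are not part of the original notebook. Unlike `mutual_info_stats`, it drops both the target and the sensitive column from the candidates, and it runs on a small made-up frame rather than the credit data.

```python
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score


def rank_shared_features(df, target_name, sensitive_name):
    """Hypothetical helper: score each remaining column against both names."""
    features = df.drop(columns=[target_name, sensitive_name])
    scores = pd.DataFrame({
        target_name: features.apply(
            lambda col: normalized_mutual_info_score(col, df[target_name])),
        sensitive_name: features.apply(
            lambda col: normalized_mutual_info_score(col, df[sensitive_name])),
    })
    # a feature only ranks high here if it scores high against BOTH columns
    scores["min_of_both"] = scores.min(axis=1)
    return scores.sort_values("min_of_both", ascending=False)


# made-up data: "a" mirrors both y and s, "b" is independent of them
toy = pd.DataFrame({
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
    "y": [0, 0, 1, 1, 0, 0, 1, 1],
    "s": [0, 0, 1, 1, 0, 0, 1, 1],
})
ranking = rank_shared_features(toy, "y", "s")
print(ranking)
```

On the notebook's data the call would presumably be `rank_shared_features(df, 'y', 's9')`. Taking the minimum of the two scores is one possible design choice; a sum or a product would rank features differently.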


I found that the feature with the highest normalized mutual information with the target `y` (about 0.28) is also the feature with the highest normalized mutual information with the sensitive feature `s9` (about 0.37). That feature is `ns5`, i.e. the applicant's credit score.
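
When reading a result like this, it may be worth checking how many distinct values the winning feature has. `normalized_mutual_info_score` treats every distinct value as its own label, so a column with many distinct values can score high against almost anything. The sketch below illustrates the effect on synthetic data only; whether it applies to `ns5` is not established here. The names `ids`, `target` and the choice of 5 quantile bins are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
n = 1000

# synthetic binary target and an unrelated integer column with n distinct values
target = rng.integers(0, 2, size=n)
ids = rng.permutation(n)

# each value of `ids` appears once, so it "identifies" the target row by row
raw_score = normalized_mutual_info_score(target, ids)

# after quantile-binning into 5 buckets the spurious association mostly disappears
binned = pd.qcut(ids, q=5, labels=False)
binned_score = normalized_mutual_info_score(target, binned)

print("distinct values:", len(np.unique(ids)))
print("score on raw column:   ", raw_score)
print("score on binned column:", binned_score)
```

A quick check on the real data would be `df['ns5'].nunique()`; if that number is large relative to the number of rows, re-scoring a binned version of the column is a reasonable sanity check.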