## <span style="color:green"> Feature Selection using Mutual Information (MI)</span>

<span style="color:blue">Video Explanation : https://youtu.be/HrnFWU-lNhc

- Mutual information is a measure of the mutual dependence between two random variables (X and Y).
- It measures the amount of information obtained about one variable by observing the other. In other
  words, it quantifies how much we can know about one variable from another. It is a little like
  correlation, but more general: it also captures non-linear dependence.
- In machine learning, mutual information measures how much information the presence/absence of a feature contributes
  to making the correct prediction on Y.
- Mutual information (MI) between two random variables is a non-negative value that measures the dependency between
  the variables. It is equal to zero if and only if the two random variables are independent, and higher values mean higher
  dependency.
- The mutual information between two random variables X and Y can be stated formally as follows:

  **I(X ; Y) = H(X) – H(X | Y)**
  
  - Where I(X ; Y) is the mutual information for X and Y, 
  - H(X) is the entropy for X and H(X | Y) is the conditional entropy for X given Y.
  
- Mutual information is symmetric, meaning that I(X ; Y) = I(Y ; X).
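
The definition can be checked numerically on a small example. The sketch below assumes a made-up joint distribution over two binary variables (the table values and the helper names `entropy` and `mutual_information` are illustrative, not part of the notebook) and computes I(X ; Y) = H(X) – H(X | Y) in bits, using H(X | Y) = H(X, Y) – H(Y).

```python
import numpy as np


def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def mutual_information(joint):
    """I(X ; Y) = H(X) - H(X | Y) for a joint table with rows = x and columns = y."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)  # marginal distribution of X
    p_y = joint.sum(axis=0)  # marginal distribution of Y
    h_x_given_y = entropy(joint.ravel()) - entropy(p_y)  # H(X | Y) = H(X, Y) - H(Y)
    return entropy(p_x) - h_x_given_y


# Hypothetical joint distribution P(X, Y): X and Y agree 80% of the time.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])

mi_xy = mutual_information(joint)      # I(X ; Y)
mi_yx = mutual_information(joint.T)    # I(Y ; X), same table with the roles swapped

# An independent pair: the joint table is the outer product of the marginals.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
mi_indep = mutual_information(independent)

print(f"I(X;Y) = {mi_xy:.4f} bits, I(Y;X) = {mi_yx:.4f} bits, independent = {mi_indep:.4f} bits")
```

For this table the two directions give the same value (about 0.278 bits), and the independent pair gives zero up to floating-point error, which matches the symmetry and independence properties listed above.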

## <span style="color:green"> Relation between 'Information Gain' and 'Mutual Information'</span>

Mutual Information and Information Gain are the same thing, although the context or usage of the measure often gives rise to the different names.

For example:

- Effect of Transforms to a Dataset (decision trees): Information Gain.
- Dependence Between Variables (feature selection): Mutual Information.
- Notice the similarity in the way that the mutual information is calculated and the way that information gain is 
  calculated; they are equivalent:

  **I(X ; Y) = H(X) – H(X | Y)**
  
  and
  
  **IG(S, a) = H(S) – H(S | a)**

  where S is the dataset before the split and a is the attribute used to split it.
- Mutual information is sometimes used as a synonym for information gain. Technically, they calculate the same quantity 
  if applied to the same data.
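
As a rough check of this equivalence, the sketch below computes the information gain of a split by hand and compares it with `sklearn.metrics.mutual_info_score` on the same labels. The toy arrays `credit` and `status` and the helper names are invented for illustration. Natural logarithms are used because `mutual_info_score` reports its result in nats.

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def entropy_nats(labels):
    """Entropy (natural log) of the empirical label distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def information_gain(labels, attribute):
    """IG(S, a) = H(S) - H(S | a), where the split is made on each value of a."""
    labels = np.asarray(labels)
    attribute = np.asarray(attribute)
    h_conditional = 0.0
    for value in np.unique(attribute):
        mask = attribute == value
        # Weight each subset's entropy by the fraction of rows it contains.
        h_conditional += mask.mean() * entropy_nats(labels[mask])
    return entropy_nats(labels) - h_conditional


# Hypothetical toy data: a binary attribute and a binary class label.
credit = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
status = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 1])

ig = information_gain(status, credit)
mi = mutual_info_score(status, credit)
print(f"information gain = {ig:.6f}, mutual information = {mi:.6f}")
```

The two numbers agree up to floating-point error, since both are computed from the same empirical joint distribution.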

### <span style="color:blue"> Feature Selection for Classification Problem using Mutual Information (MI)</span>

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import math

In [2]:
# Load the dataset (source: https://www.kaggle.com/burak3ergun/loan-data-set)
df_loan = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Loan_Dataset/loan_data_set.csv")
df_loan.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
# Drop every row that contains a missing value
df_loan.dropna(inplace=True)
# Drop the identifier column "Loan_ID", which carries no predictive information
df_loan.drop(labels=["Loan_ID"],axis=1,inplace=True)
df_loan.reset_index(drop=True,inplace=True)
df_loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             480 non-null    object 
 1   Married            480 non-null    object 
 2   Dependents         480 non-null    object 
 3   Education          480 non-null    object 
 4   Self_Employed      480 non-null    object 
 5   ApplicantIncome    480 non-null    int64  
 6   CoapplicantIncome  480 non-null    float64
 7   LoanAmount         480 non-null    float64
 8   Loan_Amount_Term   480 non-null    float64
 9   Credit_History     480 non-null    float64
 10  Property_Area      480 non-null    object 
 11  Loan_Status        480 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 45.1+ KB


In [4]:
# Encode the categorical variables as integer labels
from sklearn.preprocessing import LabelEncoder
class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

In [5]:
cat_cols = df_loan.select_dtypes(include=["object"]).columns
df_loan = MultiColumnLabelEncoder(columns = cat_cols).fit_transform(df_loan)

In [6]:
df_loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             480 non-null    int32  
 1   Married            480 non-null    int32  
 2   Dependents         480 non-null    int32  
 3   Education          480 non-null    int32  
 4   Self_Employed      480 non-null    int32  
 5   ApplicantIncome    480 non-null    int64  
 6   CoapplicantIncome  480 non-null    float64
 7   LoanAmount         480 non-null    float64
 8   Loan_Amount_Term   480 non-null    float64
 9   Credit_History     480 non-null    float64
 10  Property_Area      480 non-null    int32  
 11  Loan_Status        480 non-null    int32  
dtypes: float64(4), int32(7), int64(1)
memory usage: 32.0 KB


In [7]:
from sklearn.feature_selection import SelectKBest,mutual_info_classif
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html#id4

In [8]:
X = df_loan.iloc[:,0:-1]
y = df_loan["Loan_Status"]

In [9]:
mic = SelectKBest(score_func=mutual_info_classif,k=3)

In [10]:
# Fit the selector and collect the mutual information score of each feature
mic.fit(X,y)
feature_MI_score = pd.Series(mic.scores_,index=X.columns)
feature_MI_score.sort_values(ascending=False)

Credit_History       0.152815
Loan_Amount_Term     0.019970
Property_Area        0.018120
Education            0.013380
Self_Employed        0.005400
CoapplicantIncome    0.002864
Gender               0.000000
Married              0.000000
Dependents           0.000000
ApplicantIncome      0.000000
LoanAmount           0.000000
dtype: float64
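
Two caveats apply to scores like these. With a dense `X`, `mutual_info_classif` defaults to `discrete_features='auto'`, which treats every column as continuous, including label-encoded categories. The continuous estimator is also randomized, so scores can change between runs unless `random_state` is fixed. The sketch below shows one way to pass both options through `SelectKBest` with `functools.partial`. It runs on synthetic data (the column names `flag`, `category`, `amount` and the mask are assumptions for illustration), not on the loan dataset.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data (hypothetical columns, not the loan dataset).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "flag": rng.integers(0, 2, size=n),       # discrete, informative
    "category": rng.integers(0, 3, size=n),   # discrete, pure noise
    "amount": rng.normal(size=n),             # continuous, pure noise
})

# The target copies "flag" but is flipped for roughly 10% of the rows.
flip = rng.random(n) < 0.1
target = np.where(flip, 1 - toy["flag"], toy["flag"])

# Boolean mask marking which columns hold discrete values (same order as toy.columns).
discrete_mask = [True, True, False]

# Bind the extra keyword arguments so SelectKBest can call score_func(X, y).
score_func = partial(mutual_info_classif,
                     discrete_features=discrete_mask,
                     random_state=0)

selector = SelectKBest(score_func=score_func, k=1).fit(toy, target)
scores = pd.Series(selector.scores_, index=toy.columns)

# Fitting a second time with the same random_state reproduces the scores.
repeat = SelectKBest(score_func=score_func, k=1).fit(toy, target).scores_

print(scores.sort_values(ascending=False))
```

For the loan data, the same pattern would need a mask that marks the label-encoded columns as discrete. Which columns to mark is a modelling decision that the original notebook does not make.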

In [11]:
# Alternatively, call fit_transform to keep only the top k features ranked by mutual information score
df_loan_afs=mic.fit_transform(X,y)
df_loan_afs.shape


(480, 3)
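
`fit_transform` returns a plain array by default, so the names of the selected columns are lost. A minimal sketch of recovering them with `get_support()` follows, again on synthetic data (the `signal_*` and `noise_*` columns are made up for illustration):

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data: two informative columns and two noise columns.
rng = np.random.default_rng(1)
n = 1000
X_toy = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y_toy = (X_toy["signal_a"] + X_toy["signal_b"] > 0).astype(int)

selector = SelectKBest(score_func=partial(mutual_info_classif, random_state=0), k=2)
selected = selector.fit_transform(X_toy, y_toy)

# get_support() returns a boolean mask over the input columns.
selected_names = X_toy.columns[selector.get_support()]

# Rebuild a labelled DataFrame from the selected values.
X_selected = pd.DataFrame(np.asarray(selected), columns=selected_names, index=X_toy.index)
print(list(selected_names), X_selected.shape)
```

The same two lines (`get_support()` plus a `DataFrame` rebuild) would apply to `mic` and `X` from the cells above.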

### <span style="color:blue"> Feature Selection for Regression Problem using Mutual Information (MI) or Information Gain (IG)</span>

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import math

In [13]:
# Load the dataset (source: https://data.world/nrippner/ols-regression-challenge)
df_cancer = pd.read_csv("https://raw.githubusercontent.com/atulpatelDS/Data_Files/master/Cancer/cancer_reg.csv",
                     encoding = "ISO-8859-1")

In [14]:
df_cancer.head()

Unnamed: 0,avgAnnCount,avgDeathsPerYear,TARGET_deathRate,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,binnedInc,MedianAge,...,PctPrivateCoverageAlone,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate
0,1397.0,469,164.9,489.8,61898,260131,11.2,499.748204,"(61494.5, 125635]",39.3,...,,41.6,32.9,14.0,81.780529,2.594728,4.821857,1.843479,52.856076,6.118831
1,173.0,70,161.3,411.6,48127,43269,18.6,23.111234,"(48021.6, 51046.4]",33.0,...,53.8,43.6,31.1,15.3,89.228509,0.969102,2.246233,3.741352,45.3725,4.333096
2,102.0,50,174.7,349.7,49348,21026,14.6,47.560164,"(48021.6, 51046.4]",45.0,...,43.5,34.9,42.1,21.1,90.92219,0.739673,0.465898,2.747358,54.444868,3.729488
3,427.0,202,194.8,430.4,44243,75882,17.1,342.637253,"(42724.4, 45201]",42.8,...,40.3,35.0,45.3,25.0,91.744686,0.782626,1.161359,1.362643,51.021514,4.603841
4,57.0,26,144.4,350.1,49955,10321,12.5,0.0,"(48021.6, 51046.4]",48.3,...,43.9,35.1,44.0,22.7,94.104024,0.270192,0.66583,0.492135,54.02746,6.796657


In [15]:
# Drop the non-numeric columns and move the target to the last column
df_Tar = df_cancer["TARGET_deathRate"]
df_cancer.drop(labels=["Geography","binnedInc","TARGET_deathRate"],axis=1,inplace=True)
df_cancer = pd.concat([df_cancer,df_Tar],axis=1)

In [16]:
df_cancer.head()

Unnamed: 0,avgAnnCount,avgDeathsPerYear,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,MedianAge,MedianAgeMale,MedianAgeFemale,...,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate,TARGET_deathRate
0,1397.0,469,489.8,61898,260131,11.2,499.748204,39.3,36.9,41.7,...,41.6,32.9,14.0,81.780529,2.594728,4.821857,1.843479,52.856076,6.118831,164.9
1,173.0,70,411.6,48127,43269,18.6,23.111234,33.0,32.2,33.7,...,43.6,31.1,15.3,89.228509,0.969102,2.246233,3.741352,45.3725,4.333096,161.3
2,102.0,50,349.7,49348,21026,14.6,47.560164,45.0,44.0,45.8,...,34.9,42.1,21.1,90.92219,0.739673,0.465898,2.747358,54.444868,3.729488,174.7
3,427.0,202,430.4,44243,75882,17.1,342.637253,42.8,42.2,43.4,...,35.0,45.3,25.0,91.744686,0.782626,1.161359,1.362643,51.021514,4.603841,194.8
4,57.0,26,350.1,49955,10321,12.5,0.0,48.3,47.8,48.9,...,35.1,44.0,22.7,94.104024,0.270192,0.66583,0.492135,54.02746,6.796657,144.4


In [17]:
# Fill the missing values in each column with that column's mean
df_cancer = df_cancer.apply(lambda x: x.fillna(x.mean()),axis = 0)

In [18]:
df_cancer.head(3)

Unnamed: 0,avgAnnCount,avgDeathsPerYear,incidenceRate,medIncome,popEst2015,povertyPercent,studyPerCap,MedianAge,MedianAgeMale,MedianAgeFemale,...,PctEmpPrivCoverage,PctPublicCoverage,PctPublicCoverageAlone,PctWhite,PctBlack,PctAsian,PctOtherRace,PctMarriedHouseholds,BirthRate,TARGET_deathRate
0,1397.0,469,489.8,61898,260131,11.2,499.748204,39.3,36.9,41.7,...,41.6,32.9,14.0,81.780529,2.594728,4.821857,1.843479,52.856076,6.118831,164.9
1,173.0,70,411.6,48127,43269,18.6,23.111234,33.0,32.2,33.7,...,43.6,31.1,15.3,89.228509,0.969102,2.246233,3.741352,45.3725,4.333096,161.3
2,102.0,50,349.7,49348,21026,14.6,47.560164,45.0,44.0,45.8,...,34.9,42.1,21.1,90.92219,0.739673,0.465898,2.747358,54.444868,3.729488,174.7


In [19]:
X = df_cancer.iloc[:,0:-1]
y = df_cancer["TARGET_deathRate"]

In [20]:
print(X.shape)
print(y.shape)
print(type(X))
print(type(y))

(3047, 31)
(3047,)
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [21]:
# Compute the mutual information between each feature and the continuous target
from sklearn.feature_selection import mutual_info_regression
mir = mutual_info_regression(X,y)
mrs_score = pd.Series(mir,index=X.columns)
mrs_score.sort_values(ascending=False)

PctBachDeg25_Over          0.192336
povertyPercent             0.168725
PctPublicCoverageAlone     0.167971
PctHS25_Over               0.160077
medIncome                  0.157472
incidenceRate              0.153339
PctPrivateCoverage         0.145701
avgAnnCount                0.129016
PctPublicCoverage          0.119232
PctEmployed16_Over         0.117489
avgDeathsPerYear           0.115158
PctUnemployed16_Over       0.101061
popEst2015                 0.094793
PctPrivateCoverageAlone    0.086464
PctBachDeg18_24            0.077057
PctBlack                   0.071954
PctEmpPrivCoverage         0.070925
PctAsian                   0.069798
PctMarriedHouseholds       0.058008
PercentMarried             0.051933
MedianAgeFemale            0.043443
AvgHouseholdSize           0.031267
PctSomeCol18_24            0.030564
PctWhite                   0.028548
PctOtherRace               0.028532
MedianAgeMale              0.024139
studyPerCap                0.021278
MedianAge                  0

In [22]:
# Keep the 10 features with the highest mutual information scores
from sklearn.feature_selection import mutual_info_regression,SelectKBest
mir = SelectKBest(score_func=mutual_info_regression,k=10)
df_cancer_amr=mir.fit_transform(X,y)

In [23]:
df_cancer_amr.shape

(3047, 10)
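
One reason to rank regression features by mutual information rather than by correlation is that MI also responds to non-linear relationships. The sketch below illustrates this on synthetic data (the variables `x_nonlinear`, `x_noise` and the quadratic target are assumptions made for the example, unrelated to the cancer dataset): the target depends on the square of one feature, so the Pearson correlation is close to zero while the MI estimate is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: y depends on x_nonlinear only through its square.
rng = np.random.default_rng(42)
n = 1000
x_nonlinear = rng.uniform(-1, 1, size=n)
x_noise = rng.uniform(-1, 1, size=n)            # unrelated to y
y = x_nonlinear ** 2 + rng.normal(scale=0.05, size=n)

X_toy = np.column_stack([x_nonlinear, x_noise])

# Pearson correlation of each feature with the target.
pearson = [np.corrcoef(X_toy[:, j], y)[0, 1] for j in range(X_toy.shape[1])]

# Mutual information estimate; random_state fixes the estimator's internal noise.
mi = mutual_info_regression(X_toy, y, random_state=0)

for name, r, score in zip(["x_nonlinear", "x_noise"], pearson, mi):
    print(f"{name:12s} pearson = {r:+.3f}   MI = {score:.3f}")
```

A correlation-based filter would likely discard `x_nonlinear` here, whereas `mutual_info_regression` ranks it well above the noise feature.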