# Feature filtering based on Mutual Information for classification

In [None]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# os.path.join builds the path with the separator of the current OS
data_path = os.path.join('..', 'data', 'nutrient-file-release2-jan22.xlsx')
data_nutr = pd.read_excel(data_path, sheet_name='All solids & liquids per 100g')

# dataframe with only data columns


# print first 5 results.
data_nutr.head()

Most columns in the dataset are continuous, so we discretise the features before computing mutual information with `discrete_features=True`.
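
For reference, the standard definition of the mutual information between a discrete feature $X$ and the class label $Y$ is given below. It is the textbook formula, not something specific to this dataset. The joint and marginal probabilities are estimated from counts of observed values, which is why continuous features are binned first.

```latex
I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \, \log \frac{p(x, y)}{p(x) \, p(y)}
```

$I(X; Y) = 0$ when the feature and the label are independent, and it grows as knowing $X$ removes more uncertainty about $Y$. With the natural logarithm the value is expressed in nats.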

## Variable discretisation

There are different methods to discretise continuous variables:

- **Equal-width binning**: divides the range of observed values into N bins of the same width.
- **Equal-frequency binning**: divides the range of observed values into N bins, each containing approximately the same number of samples.
- **Domain knowledge binning**: divides the range of observed values into bins whose edges are chosen using domain knowledge.
<!-- There are other methods too listed below.

- ChiMerge: merges the bins using the Chi2 test to evaluate the statistical dependence of the classes and the feature.
- Entropy-based binning: merges the bins using the entropy of the classes and the feature.
- K-means binning: merges the bins using the K-means algorithm.
- Gaussian mixture binning: merges the bins using a Gaussian Mixture Model.
- Quantile binning: merges the bins so that each bin contains the same number of samples.
- Uniform binning: merges the bins so that each bin contains the same width.
- Recursive partitioning: merges the bins using a decision tree.
- Discretisation using decision trees: merges the bins using a decision tree.
- Discretisation using clustering: merges the bins using a clustering algorithm.
- Discretisation using support vector machines: merges the bins using a support vector machine.
- Discretisation using linear models: merges the bins using a linear model.
- Discretisation using nearest neighbours: merges the bins using a nearest neighbours algorithm.
- Discretisation using kernel density estimation: merges the bins using a kernel density estimation.
- Discretisation using fuzzy logic: merges the bins using a fuzzy logic algorithm.
- Discretisation using genetic algorithms: merges the bins using a genetic algorithm.
- Discretisation using simulated annealing: merges the bins using a simulated annealing algorithm.
- Discretisation using a neural network: merges the bins using a neural network.
- Discretisation using a random forest: merges the bins using a random forest.
- Discretisation using a linear discriminant analysis: merges the bins using a linear discriminant analysis.
- Discretisation using a quadratic discriminant analysis: merges the bins using a quadratic discriminant analysis.
- Discretisation using a principal component analysis: merges the bins using a principal component analysis.
- Discretisation using a factor analysis: merges the bins using a factor analysis.
- Discretisation using a canonical correlation analysis: merges the bins using a canonical correlation analysis.
- Discretisation using a partial least squares regression: merges the bins using a partial least squares regression.
- Discretisation using a ridge regression: merges the bins using a ridge regression. -->
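
The three approaches listed above can be sketched with pandas as follows. The values and the hand-picked bin edges are made up for illustration and are not taken from the nutrient file.

```python
import pandas as pd

# Toy, right-skewed values (illustrative only)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 100])

# Equal-width: 4 bins that each span the same range of values
equal_width = pd.cut(values, bins=4, labels=False)

# Equal-frequency: 4 bins that each hold roughly the same number of samples
equal_freq = pd.qcut(values, q=4, labels=False)

# Domain knowledge: hand-picked edges (hypothetical thresholds)
domain = pd.cut(values, bins=[0, 5, 20, 100], labels=['low', 'medium', 'high'])

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(domain.value_counts().sort_index())
```

On skewed data like this, equal-width binning puts most samples in the first bin, while equal-frequency binning spreads them evenly at the cost of bins with very different widths.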

WARNING: The choice of binning method and number of bins influences the mutual information scores, and therefore which features pass the filter.
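
A minimal sketch of this effect, using synthetic data rather than the nutrient file: the same continuous feature is binned with different numbers of equal-width bins and scored each time. The data, the seed, and the bin counts are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Synthetic data: two classes whose continuous feature differs in mean
y = np.repeat([0, 1], 500)
x = pd.Series(rng.normal(loc=y * 1.5, scale=1.0))

mi_by_bins = {}
for n_bins in (2, 5, 20, 100):
    # equal-width binning, reshaped to the (n_samples, n_features) layout
    binned = pd.cut(x, n_bins, labels=False).to_numpy().reshape(-1, 1)
    mi_by_bins[n_bins] = mutual_info_classif(binned, y, discrete_features=True)[0]

for n_bins, mi in mi_by_bins.items():
    print(f'{n_bins:>3} bins: MI = {mi:.4f}')
```

The scores differ between runs of the loop even though the underlying feature is identical. On a finite sample, very fine binning tends to inflate the estimate, so a threshold tuned for one bin count may not transfer to another.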

In [None]:
# variable discretisation using pandas.cut (equal-width binning into 20 bins)

# add new column with discretised values
data_nutr['Discretised Energy with dietary fibre, equated \n(kJ)'] = pd.cut(data_nutr['Energy with dietary fibre, equated \n(kJ)'], 20, labels=False)

# print the first few rows of the data for the two columns
data_nutr[['Energy with dietary fibre, equated \n(kJ)', 'Discretised Energy with dietary fibre, equated \n(kJ)']].head(10)

data_nutr.head()
# print the first few rows of the sorted data for the two columns
# data_nutr[['Energy with dietary fibre, equated \n(kJ)', 'Discretised Energy with dietary fibre, equated \n(kJ)']].sort_values(by='Energy with dietary fibre, equated \n(kJ)', ascending=False).head(10)

In [None]:
# Discretise all nutrient columns

# identifier and label columns, plus the demonstration column created above
# (it duplicates the energy feature and would otherwise be scored twice)
ignored_columns = ['Public Food Key', 'Classification', 'Food Name',
                   'Discretised Energy with dietary fibre, equated \n(kJ)']
label = 'Classification' # label to test
test_col = [] # names of the feature columns
data_nutr = data_nutr.fillna(0) # NaN values prevent calculation of MI scores

for nutrient in data_nutr.columns:
    if nutrient in ignored_columns:
        continue # not a feature
    test_col.append(nutrient) # used to select the features below
    # discretise each column with equal-width binning (20 bins);
    # pd.qcut raised issues relating to the size of bins, so pd.cut is used instead
    data_nutr[nutrient] = pd.cut(data_nutr[nutrient], 20, labels=False)

# as follows in Week 9 Workshop - Feature filtering based on Mutual Information for classification
features = data_nutr[test_col]
features = features.fillna(0)
class_label = data_nutr[label]

data_nutr.head()
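
The cell above notes issues with `pd.qcut` relating to the size of bins. One common cause, assumed here rather than confirmed for this dataset, is a column dominated by a single value such as 0: several quantile edges then coincide and `pd.qcut` raises a `ValueError`. The sketch below reproduces this on a toy column and shows the `duplicates='drop'` option, which merges the repeated edges and returns fewer bins than requested.

```python
import pandas as pd

# Toy zero-inflated column (illustrative only)
values = pd.Series([0] * 8 + [1, 2, 3, 4])

qcut_failed = False
try:
    # three of the five quantile edges are 0, so the edges are not unique
    pd.qcut(values, q=4, labels=False)
except ValueError as err:
    qcut_failed = True
    print('pd.qcut failed:', err)

# duplicates='drop' merges the repeated edges instead of raising
binned = pd.qcut(values, q=4, labels=False, duplicates='drop')
print(binned.value_counts().sort_index())
```

With this option the number of bins can vary from column to column, so `pd.cut` remains the simpler choice when every feature should have the same number of bins.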

In [None]:


# Following in Week 9 Workshop - Feature filtering based on Mutual Information for classification
filtered_features = []
THRESHOLD = 0.2 # arbitrary cut-off, adjust as needed

mi_arr = mutual_info_classif(X=features, y=class_label, discrete_features=True)


for feature, mi in zip(features.columns, mi_arr):
    print(f'MI value for feature "{feature}": {mi:.4f}')

    if mi >= THRESHOLD:
        filtered_features.append(feature)

print('\nFeature set after filtering with MI:', filtered_features)
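
Because the threshold is not fixed, an alternative is to rank the features by their MI score and keep the top k. The sketch below does this on synthetic data. The column names, the data, and `k = 2` are hypothetical and stand in for the discretised nutrient features.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_samples = 600

# Synthetic stand-in for discretised features (hypothetical column names)
y = rng.integers(0, 3, size=n_samples)
demo_features = pd.DataFrame({
    'informative': y,  # exact copy of the label
    'noisy': np.where(rng.random(n_samples) < 0.7, y,
                      rng.integers(0, 3, size=n_samples)),  # partly corrupted copy
    'random': rng.integers(0, 3, size=n_samples),  # unrelated to the label
})

# Score every feature, then sort from most to least informative
mi_scores = pd.Series(
    mutual_info_classif(demo_features, y, discrete_features=True),
    index=demo_features.columns,
).sort_values(ascending=False)
print(mi_scores)

# Keep the k highest-scoring features instead of applying a fixed threshold
k = 2
top_k = mi_scores.head(k).index.tolist()
print('Top features:', top_k)
```

Ranking avoids choosing an absolute cut-off, but it always returns k features even when none of them is informative, so it is worth inspecting the scores as well.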