# Feature Importance

In this notebook, we measure Feature Importance through four different metrics:
- Explained Variance of our Bayesian Gaussian Mixture Model
- Permuted Change in the Bayesian Information Criterion (BIC) Score
- Random Forest Feature Importance
- Mutual Information Score for Classification

Finally, we standardise each metric and average the standardised scores to form a final Feature Importance score.

In [1]:
# Importing Relevant Libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import BayesianGaussianMixture
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import mutual_info_classif

In [2]:
# Reading in the scaled and unscaled versions of the 1024 molecule data
df_1024 = pd.read_csv("data/1024_unscaled.csv")
df_1024_scaled = pd.read_csv("data/1024_scaled.csv")

In [4]:
# Fitting the BGMM Model on our dataset
features = ['Sk_all','LSI_all','zeta_all', 'q_all', 'Q6_all', 'd5_all']
gmm = BayesianGaussianMixture(covariance_type="full",
                              n_components=2,
                              tol=1e-3,
                              max_iter=200,
                              mean_precision_prior=None,
                              weight_concentration_prior=None)
gmm.fit(df_1024_scaled[features])
labels = gmm.predict(df_1024_scaled[features])

# Adjust labels so 0 is the larger cluster
zero_count = (labels == 0).sum()
one_count = (labels == 1).sum()
if zero_count < one_count:
    labels = 1 - labels  # This flips 0s to 1s and 1s to 0s
df_1024_scaled['labels'] = labels

X = df_1024_scaled[features]
n_components = 2

# Explained Variance
For each feature, we compute the variance of its values about each Gaussian component's mean, weight that variance by the component's mixture weight, and sum over the components (clusters). The totals are then normalised to sum to 1 across features. A higher explained variance is read as the feature playing a larger role in differentiating between the components; a lower value suggests that the feature contributes little to the overall structure captured by the model.
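
As a minimal sketch of this weighted sum, the snippet below uses synthetic two-feature data with hand-picked stand-ins for `gmm.means_` and `gmm.weights_` (all values and names here are assumptions for illustration, not taken from the dataset). It also checks one property worth keeping in mind: `np.var` is unchanged by subtracting a constant, whereas the mean squared deviation about a component mean is not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): two clusters that differ along feature 0 only
X_toy = np.vstack([
    rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(300, 2)),
    rng.normal(loc=[2.0, 0.0], scale=1.0, size=(100, 2)),
])
toy_means = np.array([[-2.0, 0.0], [2.0, 0.0]])  # stand-in for gmm.means_
toy_weights = np.array([0.75, 0.25])             # stand-in for gmm.weights_

# Variance ignores a constant shift, so this row is identical for every component
shifted_var = np.array([np.var(X_toy - m, axis=0) for m in toy_means])

# Mean squared deviation about each component mean does depend on the component
msd = np.array([np.mean((X_toy - m) ** 2, axis=0) for m in toy_means])

# Weighted sum over components, normalised to sum to 1 across features
weighted_msd = toy_weights @ msd
share = weighted_msd / weighted_msd.sum()
print(share)
```

In this toy setup the feature that separates the clusters receives the larger share, because points from the other cluster lie far from each component mean along that axis.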

In [5]:
# Accumulate into a Series so the feature names are kept for sorting below
explained_variance = pd.Series(0.0, index=X.columns)

for component in range(n_components):
    # Variance of each feature after subtracting this component's mean.
    # np.var is unchanged by a constant shift, so this term equals
    # np.var(X, axis=0) for every component.
    component_variance = np.var(X - gmm.means_[component], axis=0)
    explained_variance += gmm.weights_[component] * component_variance

# Normalize explained variance to sum to 1 for interpretation
explained_variance /= np.sum(explained_variance)

explained_variance = explained_variance.sort_values(ascending=False)
explained_variance = explained_variance.to_dict()
print(explained_variance)

{'d5_all': 0.29446726850933036, 'Q6_all': 0.230576323274261, 'q_all': 0.20745303735819598, 'zeta_all': 0.11114752210450479, 'Sk_all': 0.08944442699178166, 'LSI_all': 0.06691142176192612}


# Permuted Change in BIC Score
The BIC balances model fit against complexity: it rewards a high likelihood of the observed data and penalises models with many parameters. We apply the idea of permutation importance to the BIC by randomly permuting one feature at a time and comparing the fitted model's BIC on the permuted data with its BIC on the original data (the model is not refit). A substantial change in the BIC after permuting a feature suggests that the feature is important, as it strongly influences how well the model fits. Conversely, if the BIC remains relatively unchanged, the feature plays a limited role in determining the clustering structure.
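
The idea can be sketched by hand on synthetic data. The example below is an illustration under stated assumptions, not the notebook's method: it uses a two-component Gaussian mixture whose weights, means and covariances are fixed by hand rather than fitted, and the helper names `gaussian_logpdf` and `mixture_bic` are hypothetical. Because the penalty term does not depend on the data values, the change in BIC comes entirely from the change in log-likelihood.

```python
import numpy as np

def gaussian_logpdf(X, mean, cov):
    """Log density of a multivariate normal at each row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def mixture_bic(X, weights, means, covs):
    """BIC of a full-covariance Gaussian mixture with the given parameters."""
    n, d = X.shape
    per_component = np.stack(
        [np.log(w) + gaussian_logpdf(X, m, c) for w, m, c in zip(weights, means, covs)],
        axis=1,
    )
    log_likelihood = np.logaddexp.reduce(per_component, axis=1)
    n_comp = len(weights)
    k = n_comp * (d + d * (d + 1) // 2) + (n_comp - 1)  # free parameters
    return -2 * log_likelihood.sum() + k * np.log(n)

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = np.array([[-3.0, -3.0, 0.0], [3.0, 3.0, 0.0]])
covs = np.array([np.eye(3), np.eye(3)])

# Features 0 and 1 move together with the cluster; feature 2 is independent noise
X_toy = np.vstack([rng.normal(m, 1.0, size=(200, 3)) for m in means])

baseline = mixture_bic(X_toy, weights, means, covs)
delta_bic = []
for j in range(X_toy.shape[1]):
    X_perm = X_toy.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column only
    delta_bic.append(mixture_bic(X_perm, weights, means, covs) - baseline)
print(delta_bic)
```

Shuffling feature 0 or 1 breaks its alignment with the other and raises the BIC sharply, while shuffling the independent noise feature leaves the BIC essentially unchanged. For a density model, permutation importance therefore reflects how strongly a feature is tied to the others under the model.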

In [6]:
bic_scores = {}

def custom_metric(model, data, y=None):
    # Scorer for permutation_importance: it is called as scorer(estimator, X, y),
    # and higher values must mean a better fit
    log_likelihood = model.score_samples(data)
    n_samples, n_features = data.shape
    # Free parameters of a full-covariance mixture: means, covariances, weights
    k = n_components * (n_features + n_features * (n_features + 1) // 2) + (n_components - 1)
    bic = -2 * np.sum(log_likelihood) + k * np.log(n_samples)
    # Negate (lower BIC is better) and divide by 2 * n_samples so the score is
    # in per-sample log-likelihood units; the penalty term cancels in differences
    return -bic / (2 * n_samples)

feature_names = X.columns.tolist()

# Calculate permutation-based feature importances using the BIC-based score.
# The scorer must be passed as `scoring`; the third positional argument is y,
# which is None here because the mixture model is unsupervised.
perm_importance = permutation_importance(gmm, X, y=None, scoring=custom_metric,
                                         n_repeats=30, random_state=42)

# Rank features by the mean drop in score (i.e. the mean rise in BIC)
sorted_features = np.argsort(perm_importance.importances_mean)[::-1]

# Store the importances in ranked order, keyed by feature name
print("Feature Ranking based on Permutation Importance with BIC:")
for feature_idx in sorted_features:
    feature_name = feature_names[feature_idx]
    bic_scores[feature_name] = perm_importance.importances_mean[feature_idx]
print(bic_scores)

Feature Ranking based on Permutation Importance with BIC:
{'zeta_all': 0.646148517890445, 'd5_all': 0.6357782875765131, 'LSI_all': 0.4502411126967725, 'Sk_all': 0.38044228107657824, 'q_all': 0.19929059669334176, 'Q6_all': 0.0013903515146025983}


# Random Forest Feature Importance
A Random Forest is an ensemble of decision trees, and the importance of each feature is gauged by how much it reduces impurity at the tree nodes where it is used to split the data. Impurity refers to the diversity of class labels within a node: a node is 'pure' (zero impurity) when all its samples belong to the same class, or cluster in this case. To compute feature importance, we train the forest to predict the BGMM cluster labels and assess the decrease in impurity attributable to each feature. A higher importance score indicates that the feature strongly influences the forest's decisions, underlining its importance in predicting the cluster label.
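
To make the impurity decrease concrete, here is a small sketch for a single split using Gini impurity. The helper names `gini` and `impurity_decrease`, the toy labels and the threshold are assumptions for illustration; `RandomForestClassifier.feature_importances_` aggregates decreases of this kind over all splits and trees and normalises them to sum to 1.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(feature, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children

toy_labels = np.array([0, 0, 0, 1, 1, 1])
separating = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # splits the classes cleanly
mixed = np.array([1.0, 6.0, 2.0, 5.0, 3.0, 4.0])       # only weakly related to the class

gain_separating = impurity_decrease(separating, toy_labels, threshold=3.5)
gain_mixed = impurity_decrease(mixed, toy_labels, threshold=3.5)
print(gain_separating, gain_mixed)
```

The cleanly separating feature removes all of the parent's impurity, while the mixed feature removes only a small fraction of it.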

In [7]:
# Convert to NumPy array if X is a DataFrame
X_array = X.values if isinstance(X, pd.DataFrame) else X

# Initialize and train the classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_array, labels)

# Extract and display feature importances
importances = rf.feature_importances_
feature_names = X.columns if isinstance(X, pd.DataFrame) else [f"Feature {i}" for i in range(X_array.shape[1])]
feature_importances = pd.Series(importances, index=feature_names)
print(feature_importances.sort_values(ascending=False))

LSI_all     0.452972
Sk_all      0.301161
zeta_all    0.124863
d5_all      0.100934
q_all       0.017076
Q6_all      0.002993
dtype: float64


# Mutual Information Score for Classification
The Mutual Information Score quantifies the relationship between continuous features (the order parameters) and a discrete target variable, here the binary state of a water molecule (represented as clusters 0 and 1). It measures the amount of information shared between a feature and the target, that is, the reduction in uncertainty about the class label once the feature's value is known. A higher score indicates that knowing the feature greatly reduces uncertainty about the label, highlighting its significance in the classification task.
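
The definition can be illustrated with a simple binned (plug-in) estimate on synthetic data. This is a sketch under assumptions: `mutual_info_classif` itself uses a nearest-neighbour estimator rather than binning, and the helper `binned_mutual_info`, the bin count and the toy variables are hypothetical.

```python
import numpy as np

def binned_mutual_info(x, y, bins=10):
    """Plug-in mutual information (in nats) between binned x and binary y."""
    inner_edges = np.histogram_bin_edges(x, bins=bins)[1:-1]
    x_bin = np.digitize(x, inner_edges)   # bin index in 0..bins-1
    joint = np.zeros((bins, 2))
    np.add.at(joint, (x_bin, y), 1)       # contingency table of counts
    joint /= joint.sum()                  # joint probabilities p(x, y)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])))

rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=2000)                # binary cluster label
x_informative = 3.0 * y_toy + rng.normal(size=2000)  # shifts with the label
x_noise = rng.normal(size=2000)                      # unrelated to the label

mi_informative = binned_mutual_info(x_informative, y_toy)
mi_noise = binned_mutual_info(x_noise, y_toy)
print(mi_informative, mi_noise)
```

The informative feature scores well above the noise feature, and neither can exceed the entropy of a binary label, log 2 (about 0.69 nats).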

In [8]:
# Convert to NumPy array if X is a DataFrame
X_array = X.values if isinstance(X, pd.DataFrame) else X
feature_names = X.columns if isinstance(X, pd.DataFrame) else [f"Feature {i}" for i in range(X_array.shape[1])]

# Calculate mutual information
mi_scores = mutual_info_classif(X_array, labels)

# Display mutual information scores
mi_scores_series = pd.Series(mi_scores, index=feature_names)
print(mi_scores_series.sort_values(ascending=False))

LSI_all     0.273289
Sk_all      0.142949
zeta_all    0.128553
d5_all      0.128103
q_all       0.000864
Q6_all      0.000085
dtype: float64


# Combining the Metrics
Before combining the four metrics (explained variance, BIC scores, Random Forest feature importances, and mutual information scores), we standardise each one. Standardisation rescales a metric's values so that they have a mean of zero and a standard deviation of one. This matters because the metrics have different scales and ranges; standardising ensures that each contributes equally to the final score, rather than the metric with the largest scale dominating. The standardised scores are then averaged to give each feature's relative importance across all metrics.
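
The arithmetic can be sketched with plain NumPy using the raw scores printed in the outputs above. The names `feature_order`, `raw` and `zscore` are assumptions for illustration; the sample standard deviation (`ddof=1`) is used because that is the default of pandas' `.std()`.

```python
import numpy as np

feature_order = ["Sk_all", "LSI_all", "zeta_all", "q_all", "Q6_all", "d5_all"]

# Raw scores copied from the outputs above (rounded), in the order of feature_order
raw = {
    "ExplainedVariance": [0.089444, 0.066911, 0.111148, 0.207453, 0.230576, 0.294467],
    "BIC": [0.380442, 0.450241, 0.646149, 0.199291, 0.001390, 0.635778],
    "FeatureImportances": [0.301161, 0.452972, 0.124863, 0.017076, 0.002993, 0.100934],
    "MIScores": [0.142949, 0.273289, 0.128553, 0.000864, 0.000085, 0.128103],
}

def zscore(values):
    """Standardise to mean 0 and sample standard deviation 1 (ddof=1)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)

# One standardised column per metric, then the row-wise mean across metrics
standardized = np.column_stack([zscore(v) for v in raw.values()])
average = standardized.mean(axis=1)

ranking = [feature_order[i] for i in np.argsort(average)[::-1]]
print(ranking)
```

Only the standardised columns enter the average here, so each metric carries the same weight regardless of its original scale.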

In [11]:
# Convert each metric (dict or Series) to a DataFrame with Feature and Value columns
df_explained_variance = pd.DataFrame(list(explained_variance.items()), columns=['Feature', 'Value'])
df_bic_scores = pd.DataFrame(list(bic_scores.items()), columns=['Feature', 'Value'])
df_feature_importances = pd.DataFrame(list(feature_importances.items()), columns=['Feature', 'Value'])
df_mi_scores = pd.DataFrame(list(mi_scores_series.items()), columns=['Feature', 'Value'])

# Function to standardize a dataframe
def standardize(df):
    df['Standardized'] = (df['Value'] - df['Value'].mean()) / df['Value'].std()
    return df

# Standardize each DataFrame
df_explained_variance_std = standardize(df_explained_variance)
df_bic_scores_std = standardize(df_bic_scores)
df_feature_importances_std = standardize(df_feature_importances)
df_mi_scores_std = standardize(df_mi_scores)

# Combine the standardized scores by averaging
combined_std = pd.concat([df_explained_variance_std.set_index('Feature'), df_bic_scores_std.set_index('Feature'),
                          df_feature_importances_std.set_index('Feature'), df_mi_scores_std.set_index('Feature')],
                         axis=1, keys=['ExplainedVariance', 'BIC', 'FeatureImportances', 'MIScores'])

# Average only the standardized scores (the raw 'Value' columns are excluded)
combined_std['Average'] = combined_std.xs('Standardized', axis=1, level=1).mean(axis=1)

combined_std.sort_values('Average', ascending=False)

| Feature | ExplainedVariance (Value) | ExplainedVariance (Standardized) | BIC (Value) | BIC (Standardized) | FeatureImportances (Value) | FeatureImportances (Standardized) | MIScores (Value) | MIScores (Standardized) | Average |
|---|---|---|---|---|---|---|---|---|---|
| LSI_all | 0.066911 | -1.100570 | 0.450241 | 0.256924 | 0.452972 | 1.624565 | 0.273289 | 1.572805 | 0.5884 |
| d5_all | 0.294467 | 1.409986 | 0.635778 | 0.993777 | 0.100934 | -0.372980 | 0.128103 | 0.154329 | 0.5463 |
| zeta_all | 0.111148 | -0.612526 | 0.646149 | 1.034962 | 0.124863 | -0.237202 | 0.128553 | 0.158727 | 0.0860 |
| Sk_all | 0.089444 | -0.851970 | 0.380442 | -0.020279 | 0.301161 | 0.763153 | 0.142949 | 0.299371 | 0.0476 |
| q_all | 0.207453 | 0.449984 | 0.199291 | -0.739715 | 0.017076 | -0.848813 | 0.000864 | -1.088811 | -0.5568 |
| Q6_all | 0.230576 | 0.705096 | 0.001390 | -1.525668 | 0.002993 | -0.928722 | 0.000085 | -1.096421 | -0.7114 |
