Automatic Feature Importances & Selection

Intro

Feature importance and selection is a critical topic: it helps us understand how the data drives a machine learning algorithm, and it helps us select promising features from a large dataset.

In this repo, I introduce several techniques, with Python code, for feature importance and selection:

  1. Data-based feature importances strategies
  2. Model-based feature importance strategies
  3. Automatic comparing strategy for feature importances
  4. Automatic feature selection algorithm
  5. Variance for feature importances
  6. Empirical p-values for feature importances

Link to code: click here.

Datasets

Breast Cancer Wisconsin (Diagnostic) Data Set:

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

The mean, standard error, and "worst" or largest (mean of the three largest values) of these measurements were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius. All feature values are recorded with four significant digits.

Link to dataset: click here.

Factor | DataType | Detail
ID number | Numeric | the image ID
Diagnosis | Boolean (Target) | 0 = malignant, 1 = benign
radius | Numeric | mean of distances from center to points on the perimeter
texture | Numeric | standard deviation of gray-scale values
perimeter | Numeric | the perimeter of the core tumor
area | Numeric | the area of the core tumor
smoothness | Numeric | local variation in radius lengths
compactness | Numeric | perimeter^2 / area - 1.0
concavity | Numeric | severity of concave portions of the contour
concave points | Numeric | number of concave portions of the contour
symmetry | Numeric | the symmetry of the core tumor
fractal dimension | Numeric | "coastline approximation" - 1

The dataset contains:

  • 569 rows and 32 columns (the 30 computed features plus ID and diagnosis)
  • Class distribution: 357 benign, 212 malignant
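
For reference, the same dataset also ships with scikit-learn, so a quick way to inspect it outside this repo is the snippet below. Note that the scikit-learn copy drops the ID column and uses slightly different feature names (e.g. "mean radius") than the Kaggle CSV used in this repo (e.g. "radius_mean").

from sklearn.datasets import load_breast_cancer

# Load the Wisconsin Diagnostic Breast Cancer data as a pandas DataFrame.
bunch = load_breast_cancer(as_frame=True)
data = bunch.frame                       # 569 rows, 30 features + 'target'
data = data.rename(columns={'target': 'diagnosis'})

print(data.shape)                        # (569, 31) -- no ID column in this copy
print(data['diagnosis'].value_counts())  # 1 = benign (357), 0 = malignant (212)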

Data-based Feature Importances Strategies

Data-based importance strategies use the data directly to measure the relationship between each feature and the target. We typically judge whether a feature is good by two criteria: high relevance to the target and low redundancy with the other features.

There are several data-based importance strategies, including:

  1. Spearman's rank coefficient
  2. Pearson's rank coefficient
  3. Kendall Tau rank coefficient
  4. Principal component analysis (PCA)
  5. Minimal-redundancy-maximal-relevance (mRMR)

The top_rank function in the code wraps Spearman's rank coefficient, Pearson's correlation coefficient, the Kendall Tau rank coefficient, and principal component analysis (PCA); it applies the chosen method to the data and returns the top k most important features.

Users can also visualize the feature importances as a horizontal bar plot, ordered from most to least important; setting show_values=True prints the importance values on the plot.
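
For intuition, a minimal sketch of what such a plot helper might look like with matplotlib (a hypothetical reimplementation, not the repo's plot_feature_importances):

import matplotlib.pyplot as plt

def plot_importances_sketch(imp, fea, title, show_values=False):
    # Horizontal bars ordered so the most important feature sits at the top.
    order = sorted(range(len(imp)), key=lambda i: imp[i])
    fig, ax = plt.subplots(figsize=(6, 0.3 * len(fea) + 1))
    ax.barh([fea[i] for i in order], [imp[i] for i in order])
    if show_values:
        for y_pos, i in enumerate(order):
            ax.text(imp[i], y_pos, f" {imp[i]:.3f}", va='center')
    ax.set_title(title)
    plt.tight_layout()
    plt.show()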

Spearman & Pearson Rank Coefficient

Spearman's rank correlation evaluates the monotonic relationship between each feature and the target. A positive Spearman correlation indicates that the target tends to increase as X increases; a negative one indicates that the target tends to decrease as X increases.

Pearson's correlation evaluates the linear relationship between each feature and the target. The Spearman correlation between two variables equals the Pearson correlation computed on their rank values.

fea, imp = top_rank(data, 'diagnosis', method='spearman')
plot_feature_importances(imp, fea, "Spearman's rank correlation", show_values=True)
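
As an illustration of that relationship (not repo code, and assuming a data DataFrame with the repo's column names such as radius_mean and diagnosis), the Spearman coefficient can be reproduced by applying Pearson to rank-transformed values:

from scipy import stats

x = data['radius_mean']
y = data['diagnosis']

rho, _ = stats.spearmanr(x, y)                  # Spearman on the raw values
r_on_ranks, _ = stats.pearsonr(x.rank(), y.rank())  # Pearson on the ranks
print(rho, r_on_ranks)                          # the two values agree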

Kendall Tau Rank Coefficient

Kendall’s Tau is a non-parametric measure of association based on the concordant and discordant pairs between each feature and the target; the coefficient takes values between -1 and 1.

Kendall’s Tau is usually smaller in magnitude than Spearman’s correlation and is less sensitive to errors in the data.

fea, imp = top_rank(data, 'diagnosis', method='kendall')
plot_feature_importances(imp, fea, "Kendall Tau's rank correlation", show_values=True)
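
For comparison, the same coefficient is available in scipy (an illustrative snippet, not repo code, again assuming the repo's column names):

from scipy import stats

tau, p = stats.kendalltau(data['radius_mean'], data['diagnosis'])
print(tau, p)   # tau lies in [-1, 1]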

Principal component analysis (PCA)

Principal component analysis ranks feature importance using only the explanatory matrix X, without the target. PCA uses singular value decomposition to project the data into a lower-dimensional space spanned by eigenvectors, and each feature is scored by how much it contributes to the directions of greatest variance in that space.

fea, imp = top_rank(data, 'diagnosis', method='PCA')
plot_feature_importances(imp, fea, 'PCA', show_values=True)
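
One common way to turn PCA into per-feature scores is to weight the absolute component loadings by each component's explained variance ratio. The sketch below shows that idea (it assumes data holds only the numeric features plus diagnosis, and is not necessarily the exact scoring used by top_rank):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = data.drop(columns=['diagnosis'])
X_std = StandardScaler().fit_transform(X)      # PCA is scale-sensitive

pca = PCA().fit(X_std)
# Score each original feature by its |loading|, weighted by explained variance.
scores = np.abs(pca.components_).T @ pca.explained_variance_ratio_
ranking = sorted(zip(X.columns, scores), key=lambda t: -t[1])
print(ranking[:5])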

Minimal-redundancy-maximal-relevance (mRMR)

Minimal-redundancy-maximal-relevance (mRMR) is well suited to handling codependencies between features, because it measures relevance and redundancy at the same time:

  1. Relevance: the correlation between an individual feature and the target
  2. Redundancy: the correlation among the features themselves (between a candidate feature and those already selected)

An ideal feature has high relevance and low redundancy. Set info=True to see how the algorithm proceeds step by step.

imp, fea = mRMR(data, 'diagnosis', info=True)
plot_feature_importances(imp, fea, 'mRMR', show_values=True)
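
For intuition, here is a minimal greedy mRMR sketch that uses absolute Pearson correlation for both relevance and redundancy and assumes a numeric 0/1 target column. It is a simplification; the repo's mRMR may use a different correlation measure or scoring.

def mrmr_sketch(df, target, k=10):
    corr = df.corr().abs()                       # absolute pairwise correlations
    features = [c for c in df.columns if c != target]
    relevance = corr[target]

    selected = [relevance[features].idxmax()]    # start with the most relevant feature
    while len(selected) < k:
        remaining = [f for f in features if f not in selected]
        # mRMR score = relevance to target - mean redundancy with selected features
        scores = {f: relevance[f] - corr.loc[f, selected].mean() for f in remaining}
        selected.append(max(scores, key=scores.get))
    return selected

print(mrmr_sketch(data, 'diagnosis', k=10))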

Model-based Feature Importance Strategies

Model-based importance strategies use model performance to measure the relationship between features and the target. Here we train a reasonable model on 80% of the data and evaluate it on the remaining 20% to judge whether a feature is good.

train, val = train_val_split(data, 0.8)
x_train, y_train = split_target(train, 'diagnosis')
x_val, y_val = split_target(val, 'diagnosis')
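
train_val_split and split_target are small helpers from the repo; for readers following along without it, a rough, hypothetical equivalent might look like this (the actual implementations may differ):

def train_val_split(df, frac, seed=42):
    # Random row split; the seed is an assumption added for reproducibility.
    train = df.sample(frac=frac, random_state=seed)
    val = df.drop(index=train.index)
    return train, val

def split_target(df, target):
    # Separate the feature matrix from the target column.
    return df.drop(columns=[target]), df[target]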

There are several model-based importance strategies, including:

  1. Permutation importances
  2. Drop column importances

Users can call the individual functions to get the top k importance values and the corresponding column names.

Permutation Importances

The permutation importance method scores a feature by the change in validation loss before and after permuting that feature's column in the validation set. It does not require retraining the model and can be applied to any kind of machine learning algorithm.

Procedure:
a. Compute the baseline validation metric for a model trained on all features
b. Permute the feature being evaluated in the validation set
c. Compute the validation score on the permuted validation data
d. The importance score is the difference between the two validation scores

However, this method may create nonsensical records through permutation (feature combinations that never occur in real data). A minimal sketch of the procedure follows the usage example below.

imp, feas = permutation_importance(x_train, y_train, x_val, y_val)
plot_feature_importances(imp, feas, 'Permutation Importance')
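
The sketch below illustrates the procedure above, assuming a fitted scikit-learn style classifier with a score method; the model and metric are assumptions, not necessarily what the repo's permutation_importance uses. (scikit-learn also ships a ready-made sklearn.inspection.permutation_importance.)

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def permutation_importance_sketch(model, x_val, y_val, seed=0):
    rng = np.random.default_rng(seed)
    baseline = model.score(x_val, y_val)          # a. baseline validation metric
    importances = {}
    for col in x_val.columns:
        x_perm = x_val.copy()
        x_perm[col] = rng.permutation(x_perm[col].values)          # b. permute one column
        importances[col] = baseline - model.score(x_perm, y_val)   # c/d. drop in score
    return importances

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)
imp = permutation_importance_sketch(model, x_val, y_val)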

Drop Column Importances

The drop-column importance method scores a feature by the change in validation loss before and after dropping that feature from the training data. It requires retraining the model, but it can examine the importance of any feature or combination of features.

Procedure:
a. Compute the baseline validation metric for a model trained on all features
b. Drop the feature you want to evaluate from the dataset
c. Retrain the model
d. Compute the validation loss
e. The importance score is the difference between the two validation scores

However, this method retrains the model for every feature, which is expensive when the dataset is large, and codependent features often end up with an importance near 0. A sketch of the procedure follows the usage example below.

imp, feas = dropcol_importances(x_train, y_train, x_val, y_val)
plot_feature_importances(imp, feas, 'DropColumn Importance')
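
A minimal sketch of the drop-column procedure, again assuming a RandomForest classifier and accuracy as the validation metric (illustrative only; the repo's dropcol_importances may use a different model or loss):

from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

def dropcol_importance_sketch(x_train, y_train, x_val, y_val):
    base_model = RandomForestClassifier(n_estimators=100, random_state=0)
    baseline = clone(base_model).fit(x_train, y_train).score(x_val, y_val)
    importances = {}
    for col in x_train.columns:
        # Retrain without the column and compare validation scores.
        model = clone(base_model).fit(x_train.drop(columns=[col]), y_train)
        importances[col] = baseline - model.score(x_val.drop(columns=[col]), y_val)
    return importances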

Automatic Comparing Strategies for Feature Importance

Different methods emphasize different aspects of feature importance, so we provide a function to compare them all. With compare_Top_k, users only need to pass the whole dataset and the target column; it automatically compares model loss using the top k features selected by each method above (data-based and model-based). It also compares against feature importances from SHAP, which has become a standard package for feature selection.

The comparison includes the following methods.

Data-based:

  1. Spearman's rank coefficient
  2. Pearson's rank coefficient
  3. Kendall Tau rank coefficient
  4. Principal component analysis (PCA)
  5. Minimal-redundancy-maximal-relevance (mRMR)

Model-based:

  1. Permutation importances
  2. Drop column importances
  3. SHAP

compare_Top_k(data, 'diagnosis', 15)

Procedure:
a. Preprocess the data for the model-based and data-based methods
b. Compute the top k features with each method
c. Compute the validation loss using the top 1 to k features for each method
d. Display all losses on one plot (a sketch of this loop follows)
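
A rough sketch of that comparison loop. Everything here is an assumption for illustration: ranked_features is a hypothetical dict mapping each method name to its features ordered by importance, and RandomForest with log loss stands in for whatever model and loss compare_Top_k actually uses.

import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

def compare_topk_sketch(ranked_features, x_train, y_train, x_val, y_val, k=15):
    # ranked_features: {method name -> list of features ordered by importance}
    for method, feats in ranked_features.items():
        losses = []
        for i in range(1, k + 1):
            cols = feats[:i]
            model = RandomForestClassifier(n_estimators=100, random_state=0)
            model.fit(x_train[cols], y_train)
            losses.append(log_loss(y_val, model.predict_proba(x_val[cols])))
        plt.plot(range(1, k + 1), losses, label=method)
    plt.xlabel('number of top features')
    plt.ylabel('validation log loss')
    plt.legend()
    plt.show()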

Automatic Feature Selection

In this section, I introduce an automatic feature selection approach: pass in the whole dataset and the target column, and it returns the best model and the list of dropped features. You can choose which method computes the feature importances:

  1. Spearman's rank coefficient -> spearman
  2. Pearson's rank coefficient -> pearson
  3. Kendall Tau rank coefficient -> kendall
  4. Principal component analysis -> pca
  5. Minimal-redundancy-maximal-relevance -> mrmr
  6. Permutation importances -> permutation
  7. Drop column importances -> dropcol
  8. SHAP -> shap

The basic rule is to keep dropping the least important feature until the validation loss stops improving, then return the best model from the previous iteration along with the list of dropped features.

Procedure:
a. Use the chosen method to calculate the feature importances
b. Use all features to compute the baseline validation loss
c. Drop the least important feature
d. Retrain the model and recompute the validation loss
e. If the validation loss decreases, repeat steps c and d
f. Return the best model from the iteration before the loss stopped improving (see the sketch after the usage code below)

modes = ['spearman', 'pearson', 'kendall', 'pca', 'mrmr', 'permutation', 'dropcol', 'shap']

for mode in modes:
    best_model, feat_drop = auto_featSelection(data, 'diagnosis', mode=mode)
    print(str(mode) + ': ' + str([f for f in feat_drop]))
    print(best_model)
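
A condensed sketch of that loop, assuming a RandomForest model, log loss as the validation loss, and a feature list already ranked from least to most important by one of the methods above (the real auto_featSelection supports all eight modes and its own model and loss choices):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

def auto_select_sketch(x_train, y_train, x_val, y_val, ranked_low_to_high):
    # ranked_low_to_high: feature names ordered from least to most important.
    def val_loss(cols):
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(x_train[cols], y_train)
        return model, log_loss(y_val, model.predict_proba(x_val[cols]))

    cols = list(x_train.columns)
    best_model, best_loss = val_loss(cols)           # b. baseline with all features
    dropped = []
    for feat in ranked_low_to_high:                  # c. drop the least important
        candidate = [c for c in cols if c != feat]
        model, loss = val_loss(candidate)            # d. retrain and re-evaluate
        if loss >= best_loss:                        # e/f. stop when loss stops improving
            break
        best_model, best_loss, cols = model, loss, candidate
        dropped.append(feat)
    return best_model, dropped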

Variance for Feature Importances

Variance is also a necessary part of feature selection: it tells us whether the feature importance values are stable. We therefore include the variance information in the feature importance plot, which makes it easier to compare and select the important features.

We can estimate the variance of the feature importances by bootstrapping for the model-based methods and add it to the plot via show_var. My function supports:

  1. Permutation importances -> permutation
  2. Drop column importances -> dropcol
  3. SHAP -> shap

imp, feas = shap_importances(x_train, y_train, x_val, y_val)
var_error = feature_variance(data, 'diagnosis')
plot_feature_importances(imp, feas, 'SHAP Importance', show_var=var_error)
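
A minimal sketch of the bootstrapping idea behind such a variance estimate. It assumes importance_fn is any model-based importance function that takes (x_train, y_train, x_val, y_val) and returns a {feature: importance} mapping; the repo's feature_variance may differ in interface and detail.

import numpy as np

def bootstrap_importance_std(importance_fn, x_train, y_train, x_val, y_val,
                             n_boot=30, seed=0):
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_boot):
        # Resample the training rows with replacement and recompute importances.
        idx = rng.integers(0, len(x_train), len(x_train))
        imps = importance_fn(x_train.iloc[idx], y_train.iloc[idx], x_val, y_val)
        runs.append([imps[c] for c in x_train.columns])
    # Per-feature standard deviation across bootstrap runs.
    return dict(zip(x_train.columns, np.std(runs, axis=0)))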


Empirical p-values for Feature Importances

Unlike the variance, the p-value tells us more about the distribution of a feature's importance and directly indicates whether the feature is significant. We again use a bootstrapping approach to compute an empirical p-value for each feature relative to a baseline. We can then show a histogram of the feature's importance distribution and compare it against the baseline at a chosen significance level; a sketch of the idea appears at the end of this section.

This function also supports the model-based feature importance methods:

  1. Permutation importances -> permutation
  2. Drop column importances -> dropcol
  3. SHAP -> shap

p_values, baseline, imps, feas = feature_pvalue(data, 'diagnosis')
pvalue_hist(p_values, baseline, imps, feas, k=0, alpha=0.05)
pvalue_hist(p_values, baseline, imps, feas, k=16, alpha=0.05)

The first plot shows the importance distribution for a significant feature, radius_mean, with an empirical p-value of 0.03. The second plot shows the distribution for an insignificant feature, concavity_se, with an empirical p-value of 1.0.
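
One common way to obtain such empirical p-values is to shuffle the target many times to build a null distribution of importances and count how often the null importance reaches the real one. The sketch below shows that idea; the repo's feature_pvalue may differ in detail, and importance_fn is again assumed to return a {feature: importance} mapping.

import numpy as np
import pandas as pd

def empirical_pvalues_sketch(importance_fn, x_train, y_train, x_val, y_val,
                             n_shuffles=100, seed=0):
    rng = np.random.default_rng(seed)
    baseline = importance_fn(x_train, y_train, x_val, y_val)   # real importances
    null_runs = []
    for _ in range(n_shuffles):
        # Break the feature-target relationship by shuffling the target.
        y_null = pd.Series(rng.permutation(y_train.values), index=y_train.index)
        null_runs.append(importance_fn(x_train, y_null, x_val, y_val))
    pvals = {}
    for col in x_train.columns:
        null = np.array([run[col] for run in null_runs])
        # Fraction of null importances at least as large as the real one.
        pvals[col] = (np.sum(null >= baseline[col]) + 1) / (n_shuffles + 1)
    return pvals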
