featureselection package for R

Feature Selection with Machine Learning Models

Overview

If you have worked with a dataset that has a very large number of features, you know how difficult it can be to use them all. Some features may be unimportant or irrelevant, and keeping them can even contribute to optimization bias. One approach to this problem is to select a subset of the features for your model. Done wisely, feature selection reduces model complexity, shortens training time, and can improve the accuracy of your model. However, it is not a trivial task.

The featureselection package for R can help you with this task. It is similar to its companion package, the featureselection package for Python.

Features

In this package, four functions are included for feature selection:

forward_selection - Forward Selection for greedy feature selection. This iterative algorithm starts by evaluating each feature on its own to find the one that yields the model with the best accuracy. It then repeats the process, adding one feature at a time and each time choosing the single feature that gives the largest improvement in accuracy. The procedure stops when it is no longer possible to improve the model.
recursive_feature_elimination - Recursive Feature Elimination (RFE) for greedy feature selection. The algorithm starts with all features, identifies the worst-performing one, and removes it from the dataset. This process is repeated until the desired number of features is reached.
simulated_annealing - Perform simulated annealing to select features. The algorithm randomly chooses a set of features and measures model performance, then randomly makes a small change to the chosen features and checks whether the modified feature set performs better. If it does, the new set is kept. If not, the worse set may still be kept, based on an acceptance probability that decreases as the iterations continue and as the gap in performance grows. The process is repeated for a set number of iterations.
variance_thresholding - Select features based on their variances. A threshold, typically a low one, is set so that any feature with a variance below it is filtered out. Since this algorithm looks only at the features and not at the outputs, it can also be used for feature selection in unsupervised learning.

Existing Ecosystems

Some of the above features already exist within the R ecosystem but are provided for feature parity with the Python edition of this package.

Forward Selection
Recursive Feature Elimination
Variance Threshold (None)
Simulated Annealing (None)

Installation

Make sure you have the devtools package installed. You can install it as follows.

# Install devtools from CRAN
install.packages("devtools")

Then, install the development version of the featureselection package from GitHub.

devtools::install_github("UBC-MDS/feature-selection-r")

Dependencies

Usage

The Friedman dataset is used to generate data for some of the examples. The dataset contains some features that are generated by a white-noise process and are expected to be eliminated during feature selection.

NOTE: To run the examples below, you will need the tgp package. The examples also call dplyr::select, so dplyr must be installed as well. tgp can be installed from the R console with:

install.packages("tgp")

Load Dataset

data <- dplyr::select(tgp::friedman.1.data(), -Ytrue)
X <- dplyr::select(data, -Y)
y <- dplyr::select(data, Y)
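
If installing tgp is not an option, a comparable dataset can be generated in base R. The sketch below assumes the standard Friedman #1 formulation (ten uniform inputs, of which only the first five affect the response). It is not guaranteed to match the output of tgp::friedman.1.data() exactly, and make_friedman, X_sim and y_sim are hypothetical names introduced here for illustration.

```r
# Assumption: standard Friedman #1 benchmark, not necessarily identical
# to what tgp::friedman.1.data() produces.
make_friedman <- function(n = 200, seed = 1) {
  set.seed(seed)
  X <- as.data.frame(matrix(runif(n * 10), nrow = n))
  names(X) <- paste0("X", 1:10)
  signal <- 10 * sin(pi * X$X1 * X$X2) + 20 * (X$X3 - 0.5)^2 +
    10 * X$X4 + 5 * X$X5
  # X6..X10 never enter the signal, so they act as pure noise features.
  y <- data.frame(Y = signal + rnorm(n))
  list(X = X, y = y)
}

sim <- make_friedman()
X_sim <- sim$X  # data frame with columns X1..X10
y_sim <- sim$y  # data frame with a single column Y
```

Keeping the response in a one-column data frame named Y matches the shape the scorers below expect, since they fit lm(Y ~ ., data).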

`forward_selection`

#
# Create a 'scorer' that accepts a dataset
# and returns the Mean Squared Error.
#
custom_scorer <- function(data){
  model <- lm(Y ~ ., data)
  return(mean(model$residuals^2))
}

featureselection::forward_selection(custom_scorer, X, y, 3, 7)
#> [1] 4 1 2 5
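
To make the greedy procedure concrete, here is a minimal base-R sketch of a forward-selection loop. It is an illustration of the idea described under Features, not the package's implementation: greedy_forward, its tol argument, and the toy data are all assumptions. Because training MSE never gets worse when a feature is added, the sketch stops once the improvement falls below tol.

```r
# Hypothetical helper, not part of featureselection.
# `scorer` takes a data frame containing a Y column and returns a
# score where lower is better (e.g. MSE).
greedy_forward <- function(scorer, X, y, max_features = ncol(X), tol = 0.01) {
  selected <- integer(0)
  best_score <- Inf
  while (length(selected) < max_features) {
    candidates <- setdiff(seq_len(ncol(X)), selected)
    # Score every model that adds exactly one more feature.
    scores <- vapply(candidates, function(j) {
      scorer(cbind(X[, c(selected, j), drop = FALSE], y))
    }, numeric(1))
    # Stop when the best candidate no longer improves the score by `tol`.
    if (best_score - min(scores) < tol) break
    best_score <- min(scores)
    selected <- c(selected, candidates[which.min(scores)])
  }
  selected
}

mse_scorer <- function(data) {
  model <- lm(Y ~ ., data)
  mean(model$residuals^2)
}

# Toy data: only columns a and b carry signal; c and d are noise.
set.seed(1)
n <- 100
X_toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
y_toy <- data.frame(Y = 5 * X_toy$a + 2 * X_toy$b + rnorm(n, sd = 0.1))

chosen <- greedy_forward(mse_scorer, X_toy, y_toy)
```

On this toy data the loop is expected to pick the two signal columns and then stop, since adding a noise column barely changes the MSE.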

`recursive_feature_elimination`

#
# Create a custom 'scorer' that accepts a dataset and returns
# the name of the column with the lowest coefficient weight.
#
custom_scorer <- function(data){
  model <- lm(Y ~ ., data)
  names(which.min(model$coefficients[-1]))[[1]]
}

featureselection::recursive_feature_elimination(custom_scorer, X, y, 4)
#> [1] "X1" "X2" "X4" "X5" "Y"
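
The elimination loop itself can be sketched in a few lines of base R. This is an assumption-laden illustration rather than the package's code: eliminate_recursively is a hypothetical name, the scorer here ranks features by absolute coefficient size (the example above uses the raw coefficient), and the sketch returns only the feature names, without the Y column.

```r
# Hypothetical helper, not part of featureselection.
# `scorer` takes a data frame with a Y column and returns the name of
# the feature that should be dropped next.
eliminate_recursively <- function(scorer, X, y, n_features) {
  kept <- names(X)
  while (length(kept) > n_features) {
    worst <- scorer(cbind(X[, kept, drop = FALSE], y))
    kept <- setdiff(kept, worst)
  }
  kept
}

# Drop the feature whose coefficient is closest to zero.
smallest_coefficient <- function(data) {
  model <- lm(Y ~ ., data)
  names(which.min(abs(coef(model)[-1])))
}

# Toy data: only columns a and b carry signal; c and d are noise.
set.seed(1)
n <- 100
X_toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
y_toy <- data.frame(Y = 5 * X_toy$a + 2 * X_toy$b + rnorm(n, sd = 0.1))

kept <- eliminate_recursively(smallest_coefficient, X_toy, y_toy, 2)
```

Comparing coefficients this way is only meaningful when the features are on similar scales, which is the case for the toy data.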

`simulated_annealing`

#
# Create a 'scorer' that accepts a dataset
# and returns the Mean Squared Error.
#
custom_scorer <- function(data){
  model <- lm(Y ~ ., data)
  return(mean(model$residuals^2))
}

featureselection::simulated_annealing(custom_scorer, X, y)
#> [1]  1  2  3  4  5  7  9 10
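
The search described under Features can be sketched as follows. The package's actual acceptance rule and cooling schedule are not stated in this README, so the sketch assumes the textbook choices: an acceptance probability of exp(-(worsening) / temperature) and a geometrically decreasing temperature. anneal_features and its parameters are hypothetical.

```r
# Hypothetical helper, not part of featureselection.
# `scorer` takes a data frame with a Y column; lower scores are better.
anneal_features <- function(scorer, X, y, n_iter = 100,
                            start_temp = 1, cooling = 0.95) {
  p <- ncol(X)
  score_of <- function(mask) scorer(cbind(X[, mask, drop = FALSE], y))

  # Start from a random, non-empty subset of features.
  current <- runif(p) < 0.5
  if (!any(current)) current[sample.int(p, 1)] <- TRUE
  current_score <- score_of(current)
  best <- current
  best_score <- current_score

  for (i in seq_len(n_iter)) {
    # Small random change: toggle one feature in or out.
    candidate <- current
    flip <- sample.int(p, 1)
    candidate[flip] <- !candidate[flip]
    if (!any(candidate)) next

    candidate_score <- score_of(candidate)
    temperature <- start_temp * cooling^i  # assumed cooling schedule
    # Always accept improvements; accept worse sets with a probability
    # that shrinks as the temperature falls or the gap grows.
    accept <- candidate_score < current_score ||
      runif(1) < exp(-(candidate_score - current_score) / temperature)
    if (accept) {
      current <- candidate
      current_score <- candidate_score
    }
    if (current_score < best_score) {
      best <- current
      best_score <- current_score
    }
  }
  which(best)  # indices of the best feature set seen
}

mse_scorer <- function(data) {
  model <- lm(Y ~ ., data)
  mean(model$residuals^2)
}

set.seed(42)
n <- 100
X_toy <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
y_toy <- data.frame(Y = 5 * X_toy$a + 2 * X_toy$b + rnorm(n, sd = 0.1))

selected <- anneal_features(mse_scorer, X_toy, y_toy)
```

Because the search is random, different seeds can return different subsets. Note also that a training-set MSE never penalizes extra features, so a held-out or penalized score is a more informative scorer in practice.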

`variance_thresholding`

#
# sample data to test variance
#
data <- data.frame(x1=c(1, 2, 3, 4, 5),
                   x2=c(0, 0, 0, 0, 0),
                   x3=c(1, 1, 1, 1, 1))

featureselection::variance_thresholding(data)
#> [1] 1
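
The same filter can be written directly in base R, which shows what the function is checking. This is a sketch, not the package's source: high_variance_columns is a hypothetical name, and the default threshold of 0 is an assumption chosen to be consistent with the output above.

```r
# Hypothetical helper, not part of featureselection.
# Returns the 1-based indices of columns whose variance exceeds `threshold`.
high_variance_columns <- function(data, threshold = 0) {
  variances <- vapply(data, stats::var, numeric(1))
  unname(which(variances > threshold))
}

toy <- data.frame(x1 = c(1, 2, 3, 4, 5),
                  x2 = c(0, 0, 0, 0, 0),
                  x3 = c(1, 1, 1, 1, 1))

kept <- high_variance_columns(toy)
# kept is 1: x1 has variance 2.5, while x2 and x3 are constant.
```

Variance depends on the units of each column, so a single threshold is easiest to interpret when the features share a common scale.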

Documentation

The official documentation is hosted here: https://ubc-mds.github.io/feature-selection-r

Credits

This package was created with the assistance of the following packages: devtools, usethis, pkgdown
