Commit 67db990

Merge pull request #65 from UBC-MDS/more-docs

Add some final touches to vignette+docs

avinashkz committed Mar 18, 2018
2 parents e112eae + 3b27ae1
Showing 6 changed files with 146 additions and 125 deletions.
9 changes: 6 additions & 3 deletions R/example_data.R
@@ -1,8 +1,11 @@
-#' Generating test data with mtcars.
+#' Generating example data with mtcars.
#'
-#' @description generates test data using base R's mtcars dataset
+#' @description Generates test data using base R's mtcars dataset.
+#' The response variable `y` is horsepower (`hp`), while the remaining variables
+#' represent the predictive features `X`.
#'
-#' @param seed random seed to use. Defaults to 99.
+#' @param seed random seed to use.
+#' Defaults to 99.
#'
#' @return X_train, y_train, X_val, y_val (as a list of dataframes)
#'
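
A sketch of how the documented return value unpacks, assuming the documented function is the package's `mtcars_data()` (per `man/mtcars_data.Rd` in this commit):

```r
# Sketch: unpack the documented return value (a list of 4 dataframes)
data <- mtcars_data(seed = 99)  # seed defaults to 99, per @param above
X_train <- data[[1]]
y_train <- data[[2]]
X_val <- data[[3]]
y_val <- data[[4]]
```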
18 changes: 8 additions & 10 deletions README.md
@@ -4,7 +4,7 @@
[![Coverage status](https://codecov.io/gh/UBC-MDS/punisheR/branch/master/graph/badge.svg)](https://codecov.io/github/UBC-MDS/punisheR?branch=master)


-PunisheR is a package for feature and model selection in R. Specifically, this package implements tools for
+**punisheR** is a package for feature and model selection in R. Specifically, this package implements tools for
forward and backward model selection (see [here](https://en.wikipedia.org/wiki/Stepwise_regression)).
In order to measure model quality during the selection procedures, we have also implemented
the Akaike and Bayesian Information Criterion (see below), both of which *punish* complex models -- hence this package's
@@ -72,23 +72,21 @@ X_val <- data[[3]]
y_val <- data[[4]]
```

-### Forward Selection using r-squared
+### Forward selection

```r

forward(X_train, y_train, X_val, y_val, min_change=0.5,
n_features=NULL, criterion='r-squared', verbose=FALSE)

#> [1] 10

```
-When implementing forward selection on the demo data, it returns a list of features for the best model. Here it
-can be seen that the function correctly returns only 1 feature.
+When implementing forward selection on the demo data, it returns a list of features for the best model. In this example, we use r-squared to determine the "best" model. Here it
+can be seen that the function correctly returns only 1 feature.
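
If the returned value is a column index (as the `#> [1] 10` output above suggests), it can be mapped back to a feature name; a small sketch, assuming `forward()` returns integer indices into `X_train`:

```r
best <- forward(X_train, y_train, X_val, y_val, min_change=0.5,
                n_features=NULL, criterion='r-squared', verbose=FALSE)
colnames(X_train)[best]  # assumes the return value indexes columns of X_train
```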

-### Backward Selection using r-squared
+### Backward selection

```r

backward(X_train, y_train, X_val, y_val,
n_features=1, min_change=NULL, criterion='r-squared',
verbose=FALSE)
@@ -100,7 +98,7 @@ backward(X_train, y_train, X_val, y_val,
When implementing backward selection on the demo data, it returns a list of features for the best model.
Here it can be seen that the function correctly returns only 1 feature.

-### Criterions
+### Scoring a model with AIC, BIC, and r-squared

```r
model <- lm(y_train ~ mpg + cyl + disp, data = X_train)
@@ -113,7 +111,7 @@ bic(model)

```

-When scoring the two the model using AIC and BIC, we can see that the penalty when using `bic` is greater
+When scoring the model using AIC and BIC, we can see that the penalty when using `bic` is greater
than the penalty obtained using `aic`.
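
Since the two scores share the same likelihood term and differ only in their penalties (2k versus ln(n)*k), BIC should exceed AIC whenever n > e^2 (roughly 7.4 observations). A quick, illustrative check:

```r
# Illustrative: the scores differ only in their penalty terms, so this
# comparison should be TRUE for any training set with more than ~8 rows.
bic(model) > aic(model)
```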

```r
@@ -125,7 +123,7 @@ The value returned by the function `r_squared()` will be between 0 and 1.

## Vignette

-For a more comprehensive guide of PunisheR, you can read the vignette [here](vignettes/punisheR.md).
+For a more comprehensive guide to punisheR, you can read the vignette [here](vignettes/punisheR.md) or the HTML version [here](https://s3-us-west-2.amazonaws.com/punisherpkg/punisheR.html).



Binary file modified man/figures/logo.png
20 changes: 20 additions & 0 deletions man/mtcars_data.Rd

78 changes: 46 additions & 32 deletions vignettes/punisheR.Rmd
@@ -1,9 +1,7 @@
---
title: "punisheR"
title: "A complete guide to punisheR"
author: "Jill Cates, Tariq Hassan, Avinash Prabhakaran"
date: "`r Sys.Date()`"
output:
-    github_document : default
-    rmarkdown::html_vignette : default
vignette: >
%\VignetteIndexEntry{Vignette Title}
@@ -12,13 +10,17 @@ vignette: >
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

+```{r, include=FALSE}
+library(knitr)
+library(punisheR)
+```

## Introduction

[punisheR](https://github.com/UBC-MDS/punisheR) is a package for feature and model selection in R. Specifically, this package implements tools for forward and backward model selection. In order to measure model quality during the selection procedures, we have also implemented the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC).
@@ -36,14 +38,10 @@ Sources: https://en.wikipedia.org/wiki/Stepwise_regression

The package contains three metrics that evaluate model performance:

-- `aic()`: The [Akaike information criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion) (AIC) adds a penalty term which penalizes more complex models. Its formal definition is:
-$$-2\ln(L)+2*k $$
-where $k$ is the number of features and $L$ is the maximized value of the likelihood function.
+- `aic()`: The [Akaike information criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion) (AIC) adds a penalty term which penalizes more complex models. Its formal definition is: $-2\ln(L) + 2k$, where $k$ is the number of features and $L$ is the maximized value of the likelihood function.


-- `bic()`: The [Bayesian information criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion) adds a penality term which penalizes complex models to a greater extent than AIC. Its formal definition is:
-$$-2*\ln(L)+\ln(n)*k$$
-where $k$ is the number of features, $n$ is the number of observations, and $L$ is the maximized value of the likelihood function.
+- `bic()`: The [Bayesian information criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion) adds a penalty term which penalizes complex models to a greater extent than AIC. Its formal definition is: $-2\ln(L) + \ln(n)k$, where $k$ is the number of features, $n$ is the number of observations, and $L$ is the maximized value of the likelihood function. (A base-R sketch of both formulas follows this list.)

- `r_squared()`: The [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) is the proportion of the variance in the response variable that can be predicted from the explanatory variable.
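
As a sanity check, both formulas can be reproduced with base R alone. The sketch below uses `logLik()`'s parameter count as $k$; note that base R counts the error variance as a parameter, so this $k$ is the number of features plus two for an `lm` with an intercept:

```{r}
# Base-R sketch of the two formulas above (independent of punisheR)
fit <- lm(hp ~ mpg + cyl, data = mtcars)
L <- logLik(fit)    # maximized log-likelihood
k <- attr(L, "df")  # parameter count used by base R
n <- nobs(fit)

-2 * as.numeric(L) + 2 * k       # matches AIC(fit)
-2 * as.numeric(L) + log(n) * k  # matches BIC(fit)
```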

@@ -57,15 +55,14 @@ and [MASS](https://cran.r-project.org/web/packages/MASS/MASS.pdf) packages. The
[`ols_step_backward()`](https://www.rdocumentation.org/packages/olsrr/versions/0.4.0/topics/ols_step_backward) for forward and backward stepwise selection, respectively. Both of these use p-value as a metric for feature selection. The latter, MASS, contains [`StepAIC()`](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/stepAIC.html), which is complete with three modes: forward, backward or both. Other packages that provide subset selection for regression models are [leaps](https://cran.r-project.org/web/packages/leaps/leaps.pdf) and [bestglm](https://cran.r-project.org/web/packages/bestglm/bestglm.pdf).


-## Loading the demo data
-
-```{r}
-library(knitr)
-library(punisheR)
-```
+To demonstrate how punisheR's feature selection and criterion functions work, we will use our demo data `mtcars_data()`, which arranges [mtcars](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) into the correct format for our use cases.

+`mtcars_data()` returns a list of 4 dataframes in the following order: X_train, y_train, X_val, and y_val. Horsepower (`hp`) is the response variable (`y`), while the remaining variables of `mtcars` are the predictive features (`X`). The data is split into training data, which is used to *train* the model, and validation data, which *validates* (scores) it.

```{r}
-#Loading the demo mtcars data
+# Loading the demo mtcars data
data <- mtcars_data()
X_train <- data[[1]]
y_train <- data[[2]]
@@ -74,17 +71,26 @@ y_val <- data[[4]]
```


-## Forward Selection by specifying the number of features
+## Forward Selection

+There are two parameters that determine how features are selected in forward selection:
+
+1. `n_features` specifies the number of features. If you set `n_features` to 3, the forward selection function will select the 3 best features for your model.
+2. `min_change` specifies the minimum change in score required to proceed to the next iteration. The function stops when there are no features left that cause a change larger than the threshold `min_change`.
+
+In order for forward selection to work, only one of `n_features` and `min_change` can be active. The other must be set to `NULL`.
+
+Let's look at how `n_features` works within forward selection:
+
-###### Usage example with `aic` as criterion
+###### a) Usage example with `aic` as criterion

```{r}
forward(X_train, y_train, X_val, y_val, min_change=NULL,
n_features=2, criterion='aic', verbose=FALSE)
```
-When implementing forward selection on the mtcars dataset with `hp` as the explanatory variable , it returns a list of features that form the best model. In the above example, the desired number of features has been specified as 2 and the criterion being used is `aic`. The function returns a list of 2 features.
+When implementing forward selection on the mtcars dataset with `hp` as the response variable, it returns a list of features that form the best model. In the above example, the desired number of features has been specified as 2 and the criterion being used is `aic`. The function returns a list of 2 features.

-###### Usage example with `bic` as criterion
+###### b) Usage example with `bic` as criterion

```{r}
forward(X_train, y_train, X_val, y_val, min_change=NULL,
@@ -93,7 +99,7 @@ forward(X_train, y_train, X_val, y_val, min_change=NULL,

In the above example, the desired number of features has been specified as 3 and the criterion being used is `bic`. The function returns a list of 3 features.

-###### Usage example with `r-squared` as criterion
+###### c) Usage example with `r-squared` as criterion

```{r}
forward(X_train, y_train, X_val, y_val, min_change=NULL,
@@ -103,64 +109,72 @@ forward(X_train, y_train, X_val, y_val, min_change=NULL,
In the above example, the desired number of features has been specified as 4 and the criterion being used is `r-squared`. The function returns a list of 4 features.


-#### Forward Selection by specifying the smallest change in criterion
+Forward selection also works by specifying the smallest change in criterion, `min_change`:

```{r}
forward(X_train, y_train, X_val, y_val, min_change=0.5,
n_features=NULL, criterion='r-squared', verbose=FALSE)
```

-In the example above, `forward` selction returns a list of 6 features when a minimum change of 0.5 is required in `r-squared` score for an additional feature to be selected.
+In the example above, `forward` selection returns a list of 6 features when a minimum change of 0.5 is required in the `r-squared` score for an additional feature to be selected.

-**Note**: When using the criterion as `aic` or `bic`, the value for `min_change` should be carefully selected as `aic` and `bic` tends to have much larger values.
+**Note**: When using `aic` or `bic` as the criterion, the value for `min_change` should be carefully selected, as `aic` and `bic` tend to have much larger values than `r-squared`.
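
For instance, a threshold on the criterion's own scale might look like the sketch below; the value `10` is purely illustrative, not a package recommendation:

```{r}
# Illustrative only: min_change chosen on AIC's (larger) scale
forward(X_train, y_train, X_val, y_val, min_change=10,
        n_features=NULL, criterion='aic', verbose=FALSE)
```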

## Backward Selection

+Backward selection works in the same way as forward selection: you must configure `n_features` or `min_change`, as well as the `criterion` used to score the model.

-#### Backward Selection by specifying the number of features
+###### a) Usage example with `aic` as criterion

```{r}
backward(X_train, y_train, X_val, y_val,
n_features=7, min_change=NULL, criterion='aic',
verbose=FALSE)
```

+###### b) Usage example with `bic` as criterion

+```{r}
+backward(X_train, y_train, X_val, y_val,
+         n_features=7, min_change=NULL, criterion='bic',
+         verbose=FALSE)
+```

+###### c) Usage example with `r-squared` as criterion

```{r}
backward(X_train, y_train, X_val, y_val,
n_features=7, min_change=NULL, criterion='r-squared',
verbose=FALSE)
```

-Similarly, for backward selection, the number of features are specified as 7 and the examples using all the three criterion are provided above.
+With `n_features` configured to 7, each example above returns the 7 best features based on model score. You can see above that changing the criterion can result in a different output of "best" features.
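
One way to see this is to capture the selections and compare them directly; a sketch, assuming each call returns a vector of selected features:

```{r}
# Sketch: do aic and bic agree on the 7 "best" features here?
sel_aic <- backward(X_train, y_train, X_val, y_val,
                    n_features=7, min_change=NULL, criterion='aic',
                    verbose=FALSE)
sel_bic <- backward(X_train, y_train, X_val, y_val,
                    n_features=7, min_change=NULL, criterion='bic',
                    verbose=FALSE)
identical(sort(sel_aic), sort(sel_bic))
```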

-#### Backward Selection by specifying the smallest change in criterion
+In the example below, `backward` selection returns a list of 10 features when the `min_change` in the `r-squared` criterion is specified as 0.5.

```{r}
backward(X_train, y_train, X_val, y_val,
n_features=NULL, min_change=0.5, criterion='r-squared',
verbose=FALSE)
```

-In the example above, `backward` selection returns a list of 10 features when the minimum change in the `r-squared` criterion is specified as 0.5.

-#### AIC, BIC & $R^2$
+## AIC, BIC & $R^2$

+punisheR also provides three standalone functions to compute AIC, BIC, and $R^2$. For `aic()` and `bic()` you simply need to pass in the model (e.g., an `lm()` object). You can also pass in the validation data and response variable (`X_val`, `y_val`). By default, `X` and `y` are extracted from the model.

```{r}
model <- lm(y_train ~ mpg + cyl + disp, data = X_train)
aic(model)
```

```{r}
bic(model)
+aic(model, X_val, y_val)
+bic(model, X_val, y_val)
```

-When scoring the two the model using AIC and BIC, we can see that the penalty when using `bic` is greater than the penalty obtained using `aic`.
+When scoring the model using AIC and BIC, we can see that the penalty when using `bic` is greater than the penalty obtained using `aic`.

```{r}
r_squared(model, X_val, y_val)
