Commit 67db990

Merge pull request #65 from UBC-MDS/more-docs

Add some final touches to vignette+docs

avinashkz committed Mar 18, 2018
2 parents e112eae + 3b27ae1
Showing 6 changed files with 146 additions and 125 deletions.
9 changes: 6 additions & 3 deletions R/example_data.R
@@ -1,8 +1,11 @@
-#' Generating test data with mtcars.
+#' Generating example data with mtcars.
#'
-#' @description generates test data using base R's mtcars dataset
+#' @description Generates test data using base R's mtcars dataset.
+#' The response variable `y` is horsepower (`hp`), while the remaining variables
+#' represent the predictive features `X`.
#'
-#' @param seed random seed to use. Defaults to 99.
+#' @param seed random seed to use.
+#' Defaults to 99.
#'
#' @return X_train, y_train, X_val, y_val (as a list of dataframes)
#'
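
A sketch of how the documented return value unpacks, assuming the documented function is the package's `mtcars_data()` (per `man/mtcars_data.Rd` in this commit):

```r
# Sketch: unpack the documented return value (a list of 4 dataframes)
data <- mtcars_data(seed = 99)  # seed defaults to 99, per @param above
X_train <- data[[1]]
y_train <- data[[2]]
X_val <- data[[3]]
y_val <- data[[4]]
```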
18 changes: 8 additions & 10 deletions README.md
@@ -4,7 +4,7 @@
[![Coverage status](https://codecov.io/gh/UBC-MDS/punisheR/branch/master/graph/badge.svg)](https://codecov.io/github/UBC-MDS/punisheR?branch=master)


-PunisheR is a package for feature and model selection in R. Specifically, this package implements tools for
+**punisheR** is a package for feature and model selection in R. Specifically, this package implements tools for
forward and backward model selection (see [here](https://en.wikipedia.org/wiki/Stepwise_regression)).
In order to measure model quality during the selection procedures, we have also implemented
the Akaike and Bayesian Information Criterion (see below), both of which *punish* complex models -- hence this package's
@@ -72,23 +72,21 @@ X_val <- data[[3]]
y_val <- data[[4]]
```

-### Forward Selection using r-squared
+### Forward selection

```r

forward(X_train, y_train, X_val, y_val, min_change=0.5,
n_features=NULL, criterion='r-squared', verbose=FALSE)

#> [1] 10

```
-When implementing forward selection on the demo data, it returns a list of features for the best model. Here it
-can be seen that the function correctly returns only 1 feature.
+When implementing forward selection on the demo data, it returns a list of features for the best model. In this example, we use r-squared to determine the "best" model. Here it
+can be seen that the function correctly returns only 1 feature.
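
If the returned value is a column index (as the `#> [1] 10` output above suggests), it can be mapped back to a feature name; a small sketch, assuming `forward()` returns integer indices into `X_train`:

```r
best <- forward(X_train, y_train, X_val, y_val, min_change=0.5,
                n_features=NULL, criterion='r-squared', verbose=FALSE)
colnames(X_train)[best]  # assumes the return value indexes columns of X_train
```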

-### Backward Selection using r-squared
+### Backward selection

```r

backward(X_train, y_train, X_val, y_val,
n_features=1, min_change=NULL, criterion='r-squared',
verbose=FALSE)
@@ -100,7 +98,7 @@ backward(X_train, y_train, X_val, y_val,
When implementing backward selection on the demo data, it returns a list of features for the best model.
Here it can be seen that the function correctly returns only 1 feature.

-### Criterions
+### Scoring a model with AIC, BIC, and r-squared

```r
model <- lm(y_train ~ mpg + cyl + disp, data = X_train)
@@ -113,7 +111,7 @@ bic(model)

```

-When scoring the two the model using AIC and BIC, we can see that the penalty when using `bic` is greater
+When scoring the model using AIC and BIC, we can see that the penalty when using `bic` is greater
than the penalty obtained using `aic`.
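
Since the two scores share the same likelihood term and differ only in their penalties (2k versus ln(n)*k), BIC should exceed AIC whenever n > e^2 (roughly 7.4 observations). A quick, illustrative check:

```r
# Illustrative: the scores differ only in their penalty terms, so this
# comparison should be TRUE for any training set with more than ~8 rows.
bic(model) > aic(model)
```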

```r
@@ -125,7 +123,7 @@ The value returned by the function `r_squared()` will be between 0 and 1.

## Vignette

-For a more comprehensive guide of PunisheR, you can read the vignette [here](vignettes/punisheR.md).
+For a more comprehensive guide to punisheR, you can read the vignette [here](vignettes/punisheR.md) or the HTML version [here](https://s3-us-west-2.amazonaws.com/punisherpkg/punisheR.html).



Binary file modified man/figures/logo.png
20 changes: 20 additions & 0 deletions man/mtcars_data.Rd

78 changes: 46 additions & 32 deletions vignettes/punisheR.Rmd
@@ -1,9 +1,7 @@
---
title: "punisheR"
title: "A complete guide to punisheR"
author: "Jill Cates, Tariq Hassan, Avinash Prabhakaran"
date: "`r Sys.Date()`"
output:
-    github_document : default
-    rmarkdown::html_vignette : default
vignette: >
%\VignetteIndexEntry{Vignette Title}
@@ -12,13 +10,17 @@ vignette: >
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```

+```{r, include=FALSE}
+library(knitr)
+library(punisheR)
+```

## Introduction

[punisheR](https://github.com/UBC-MDS/punisheR) is a package for feature and model selection in R. Specifically, this package implements tools for forward and backward model selection. In order to measure model quality during the selection procedures, we have also implemented the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC).
@@ -36,14 +38,10 @@ Sources: https://en.wikipedia.org/wiki/Stepwise_regression

The package contains three metrics that evaluate model performance:

-- `aic()`: The [Akaike information criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion) (AIC) adds a penalty term which penalizes more complex models. Its formal definition is:
-$$-2\ln(L)+2*k $$
-where $k$ is the number of features and $L$ is the maximized value of the likelihood function.
+- `aic()`: The [Akaike information criterion](https://en.wikipedia.org/wiki/Akaike_information_criterion) (AIC) adds a penalty term which penalizes more complex models. Its formal definition is: $-2\ln(L) + 2k$, where $k$ is the number of features and $L$ is the maximized value of the likelihood function.


-- `bic()`: The [Bayesian information criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion) adds a penality term which penalizes complex models to a greater extent than AIC. Its formal definition is:
-$$-2*\ln(L)+\ln(n)*k$$
-where $k$ is the number of features, $n$ is the number of observations, and $L$ is the maximized value of the likelihood function.
+- `bic()`: The [Bayesian information criterion](https://en.wikipedia.org/wiki/Bayesian_information_criterion) adds a penalty term which penalizes complex models to a greater extent than AIC. Its formal definition is: $-2\ln(L) + \ln(n)k$, where $k$ is the number of features, $n$ is the number of observations, and $L$ is the maximized value of the likelihood function. (A base-R sketch of both formulas follows this list.)

- `r_squared()`: The [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination) is the proportion of the variance in the response variable that can be predicted from the explanatory variable.
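
As a sanity check, both formulas can be reproduced with base R alone. The sketch below uses `logLik()`'s parameter count as $k$; note that base R counts the error variance as a parameter, so this $k$ is the number of features plus two for an `lm` with an intercept:

```{r}
# Base-R sketch of the two formulas above (independent of punisheR)
fit <- lm(hp ~ mpg + cyl, data = mtcars)
L <- logLik(fit)    # maximized log-likelihood
k <- attr(L, "df")  # parameter count used by base R
n <- nobs(fit)

-2 * as.numeric(L) + 2 * k       # matches AIC(fit)
-2 * as.numeric(L) + log(n) * k  # matches BIC(fit)
```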

@@ -57,15 +55,14 @@ and [MASS](https://cran.r-project.org/web/packages/MASS/MASS.pdf) packages. The
[`ols_step_backward()`](https://www.rdocumentation.org/packages/olsrr/versions/0.4.0/topics/ols_step_backward) for forward and backward stepwise selection, respectively. Both of these use p-value as a metric for feature selection. The latter, MASS, contains [`StepAIC()`](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/stepAIC.html), which is complete with three modes: forward, backward or both. Other packages that provide subset selection for regression models are [leaps](https://cran.r-project.org/web/packages/leaps/leaps.pdf) and [bestglm](https://cran.r-project.org/web/packages/bestglm/bestglm.pdf).


-## Loading the demo data
-
-```{r}
-library(knitr)
-library(punisheR)
-```
+To demonstrate how punisheR's feature selection and criterion functions work, we will use our demo data `mtcars_data()`, which arranges [mtcars](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) into the correct format for our use cases.

+`mtcars_data()` returns a list of 4 dataframes in the following order: X_train, y_train, X_val, and y_val. Horsepower (`hp`) is the response variable (`y`), while the remaining variables of `mtcars` are the predictive features (`X`). The data is split into training data, which is used to *train* the model, and validation data, which *validates* (scores) it.

```{r}
-#Loading the demo mtcars data
+# Loading the demo mtcars data
data <- mtcars_data()
X_train <- data[[1]]
y_train <- data[[2]]
@@ -74,17 +71,26 @@ y_val <- data[[4]]
```


-## Forward Selection by specifying the number of features
+## Forward Selection

+There are two parameters that determine how features are selected in forward selection:
+
+1. `n_features` specifies the number of features. If you set `n_features` to 3, the forward selection function will select the 3 best features for your model.
+2. `min_change` specifies the minimum change in score required to proceed to the next iteration. The function stops when there are no features left that cause a change larger than the threshold `min_change`.
+
+In order for forward selection to work, only one of `n_features` and `min_change` can be active. The other must be set to `NULL`.
+
+Let's look at how `n_features` works within forward selection:
+
-###### Usage example with `aic` as criterion
+###### a) Usage example with `aic` as criterion

```{r}
forward(X_train, y_train, X_val, y_val, min_change=NULL,
n_features=2, criterion='aic', verbose=FALSE)
```
-When implementing forward selection on the mtcars dataset with `hp` as the explanatory variable , it returns a list of features that form the best model. In the above example, the desired number of features has been specified as 2 and the criterion being used is `aic`. The function returns a list of 2 features.
+When implementing forward selection on the mtcars dataset with `hp` as the response variable, it returns a list of features that form the best model. In the above example, the desired number of features has been specified as 2 and the criterion being used is `aic`. The function returns a list of 2 features.

-###### Usage example with `bic` as criterion
+###### b) Usage example with `bic` as criterion

```{r}
forward(X_train, y_train, X_val, y_val, min_change=NULL,
@@ -93,7 +99,7 @@ forward(X_train, y_train, X_val, y_val, min_change=NULL,

In the above example, the desired number of features has been specified as 3 and the criterion being used is `bic`. The function returns a list of 3 features.

-###### Usage example with `r-squared` as criterion
+###### c) Usage example with `r-squared` as criterion

```{r}
forward(X_train, y_train, X_val, y_val, min_change=NULL,
@@ -103,64 +109,72 @@ forward(X_train, y_train, X_val, y_val, min_change=NULL,
In the above example, the desired number of features has been specified as 4 and the criterion being used is `r-squared`. The function returns a list of 4 features.


-#### Forward Selection by specifying the smallest change in criterion
+Forward selection also works by specifying the smallest change in criterion, `min_change`:

```{r}
forward(X_train, y_train, X_val, y_val, min_change=0.5,
n_features=NULL, criterion='r-squared', verbose=FALSE)
```

-In the example above, `forward` selction returns a list of 6 features when a minimum change of 0.5 is required in `r-squared` score for an additional feature to be selected.
+In the example above, `forward` selection returns a list of 6 features when a minimum change of 0.5 is required in the `r-squared` score for an additional feature to be selected.

-**Note**: When using the criterion as `aic` or `bic`, the value for `min_change` should be carefully selected as `aic` and `bic` tends to have much larger values.
+**Note**: When using `aic` or `bic` as the criterion, the value for `min_change` should be carefully selected, as `aic` and `bic` tend to have much larger values than `r-squared`.
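
For instance, a threshold on the criterion's own scale might look like the sketch below; the value `10` is purely illustrative, not a package recommendation:

```{r}
# Illustrative only: min_change chosen on AIC's (larger) scale
forward(X_train, y_train, X_val, y_val, min_change=10,
        n_features=NULL, criterion='aic', verbose=FALSE)
```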

## Backward Selection

+Backward selection works in the same way as forward selection: you must configure `n_features` or `min_change`, as well as the `criterion` used to score the model.

-#### Backward Selection by specifying the number of features
+###### a) Usage example with `aic` as criterion

```{r}
backward(X_train, y_train, X_val, y_val,
n_features=7, min_change=NULL, criterion='aic',
verbose=FALSE)
```

+###### b) Usage example with `bic` as criterion

+```{r}
+backward(X_train, y_train, X_val, y_val,
+         n_features=7, min_change=NULL, criterion='bic',
+         verbose=FALSE)
+```

+###### c) Usage example with `r-squared` as criterion

```{r}
backward(X_train, y_train, X_val, y_val,
n_features=7, min_change=NULL, criterion='r-squared',
verbose=FALSE)
```

-Similarly, for backward selection, the number of features are specified as 7 and the examples using all the three criterion are provided above.
+With `n_features` configured to 7, each example above returns the 7 best features based on model score. You can see above that changing the criterion can result in a different output of "best" features.
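
One way to see this is to capture the selections and compare them directly; a sketch, assuming each call returns a vector of selected features:

```{r}
# Sketch: do aic and bic agree on the 7 "best" features here?
sel_aic <- backward(X_train, y_train, X_val, y_val,
                    n_features=7, min_change=NULL, criterion='aic',
                    verbose=FALSE)
sel_bic <- backward(X_train, y_train, X_val, y_val,
                    n_features=7, min_change=NULL, criterion='bic',
                    verbose=FALSE)
identical(sort(sel_aic), sort(sel_bic))
```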

-#### Backward Selection by specifying the smallest change in criterion
+In the example below, `backward` selection returns a list of 10 features when the `min_change` in the `r-squared` criterion is specified as 0.5.

```{r}
backward(X_train, y_train, X_val, y_val,
n_features=NULL, min_change=0.5, criterion='r-squared',
verbose=FALSE)
```

-In the example above, `backward` selection returns a list of 10 features when the minimum change in the `r-squared` criterion is specified as 0.5.

-#### AIC, BIC & $R^2$
+## AIC, BIC & $R^2$

+punisheR also provides three standalone functions to compute AIC, BIC, and $R^2$. For `aic()` and `bic()` you simply need to pass in the model (e.g., an `lm()` object). You can also pass in the validation data and response variable (`X_val`, `y_val`). By default, `X` and `y` are extracted from the model.

```{r}
model <- lm(y_train ~ mpg + cyl + disp, data = X_train)
aic(model)
```

```{r}
bic(model)
+aic(model, X_val, y_val)
+bic(model, X_val, y_val)
```

-When scoring the two the model using AIC and BIC, we can see that the penalty when using `bic` is greater than the penalty obtained using `aic`.
+When scoring the model using AIC and BIC, we can see that the penalty when using `bic` is greater than the penalty obtained using `aic`.

```{r}
r_squared(model, X_val, y_val)
