version 0.2.1

cran · Jun 21, 2019 · cdcdbcb · cdcdbcb
1 parent c6d9f25
commit cdcdbcb
Show file tree

Hide file tree

Showing 11 changed files with 76 additions and 73 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,24 +1,24 @@
 Package: subgroup.discovery
 Type: Package
 Title: Subgroup Discovery and Bump Hunting
-Version: 0.2.0
+Version: 0.2.1
 Authors@R: c(
-          person(given = "Jurian", family = "Baas", email = "jurian@jurianbaas.nl", role = c("aut", "cre", "cph")),
-          person(given = "Ad", family ="Feelders", email = "A.J.Feelders@uu.nl", role = c("ctb")))
+          person(given = "Jurian", family = "Baas", email = "j.baas@uu.nl", role = c("aut", "cre", "cph")),
+          person(given = "Ad", family ="Feelders", email = "a.j.feelders@uu.nl", role = c("ctb")))
 Description: Developed to assist in discovering interesting subgroups in high-dimensional data. The PRIM implementation is based on the 1998 paper "Bump hunting in high-dimensional data" by Jerome H. Friedman and Nicholas I. Fisher. <doi:10.1023/A:1008894516817> PRIM involves finding a set of "rules" which combined imply unusually large (or small) values of some other target variable. Specifically one tries to find a set of sub regions in which the target variable is substantially larger than overall mean. The objective of bump hunting in general is to find regions in the input (attribute/feature) space with relatively high (low) values for the target variable. The regions are described by simple rules of the type if: condition-1 and ... and condition-n then: estimated target value. Given the data (or a subset of the data), the goal is to produce a box B within which the target mean is as large as possible. There are many problems where finding such regions is of considerable practical interest.  Often these are problems where a decision maker can in a sense choose or select the values of the input variables so as to optimize the value of the target variable. In bump hunting it is customary to follow a so-called covering strategy. This means that the same box construction (rule induction) algorithm is applied sequentially to subsets of the data.
 Depends: R (>= 2.10)
 License: GPL-3
 Encoding: UTF-8
 LazyData: true
-RoxygenNote: 6.0.1
+RoxygenNote: 6.1.1
 URL: https://github.com/Jurian/subgroup.discovery
 BugReports: https://github.com/Jurian/subgroup.discovery/issues
-Date: 2017-07-15
+Date: 2019-06-21
 Suggests: testthat
 NeedsCompilation: no
-Packaged: 2017-08-02 09:04:19 UTC; juria
+Packaged: 2019-06-21 14:28:41 UTC; juria
 Author: Jurian Baas [aut, cre, cph],
   Ad Feelders [ctb]
-Maintainer: Jurian Baas <jurian@jurianbaas.nl>
+Maintainer: Jurian Baas <j.baas@uu.nl>
 Repository: CRAN
-Date/Publication: 2017-08-02 09:09:42 UTC
+Date/Publication: 2019-06-21 16:30:03 UTC
diff --git a/MD5 b/MD5
@@ -1,13 +1,13 @@
-be9e514890bcd9fac88d477dcb57e554 *DESCRIPTION
+5f1418ffd70412f68d038bb91b8a879d *DESCRIPTION
 b787c0195c018b76329c1ffa3e488892 *NAMESPACE
-151874cac62b01f6a93303ef18b4cfa9 *R/data.R
-144b8751d6224f1c61e5eceb71b26ccc *R/prim.R
-13a5c3f56173bffdb63540d03ec4e71b *README.md
+489f9dae3710c838cd7720511d924029 *R/data.R
+79fded2e7ac1f0a81840e58eb61819f2 *R/prim.R
+51f6daa6040fc8f052a516fc69b8b437 *README.md
 758212ab8f8348b6edf5c016cdca769e *data/ames.rda
 4ed2b76fab4f7bbc86c32968f5552817 *data/credit.rda
 b55b409ecc1557da12a68d9cf614ff8a *data/pima.rda
-37d2c3e5d8e6350dd5374780293d9195 *man/ames.Rd
-bb8d890c9476827e8ade9b2d7827c1e1 *man/credit.Rd
+04d26c7cecc8265ed12ac5b06551be03 *man/ames.Rd
+293a396ed7f0ad4c7834dce7c4b9d84c *man/credit.Rd
 c5a81d4f5163d3d5c35852f50d800ff0 *man/pima.Rd
 bef653f6038edc99b604a985aef5cbd7 *man/plot.prim.cover.Rd
 7ff886d509f8cc8fa552c095b6e286fc *man/plot.prim.diversify.Rd
@@ -16,13 +16,13 @@ bef653f6038edc99b604a985aef5cbd7 *man/plot.prim.cover.Rd
 38bb5856d32ed5856308eaa13d7772e2 *man/predict.prim.cover.Rd
 1e12179833917a38e55c003ae569c832 *man/prim.box.optimal.Rd
 11a945c9899cc9bae3eb0a79d0cbf282 *man/prim.candidates.find.Rd
-8c60387654b58bc1da8aa6057b314a4a *man/prim.cover.Rd
-9a7c01f7fd4a63775c4f69e77a7ca16d *man/prim.diversify.Rd
+da4e97c4c048f6c21c55b90750e659a7 *man/prim.cover.Rd
+f3d435bf8e2f17711bf72bdb5685ed26 *man/prim.diversify.Rd
 d1e2471b5bbe7f82dff6bc1d8d7e4219 *man/prim.diversify.compare.Rd
-caba11df12fe8584ca8c2eaefe15144d *man/prim.peel.Rd
+190d21145dc0020abf4a0dd17e0f6c34 *man/prim.peel.Rd
 56c0c729e9d5e24e9fef37666c9427bf *man/prim.rule.condense.Rd
 4c14a90f0b399b6e849ca6d0fbb77927 *man/prim.rule.match.Rd
-764dbe0b8be928a0206560aa55f5d767 *man/prim.rule.operations.Rd
+ab829fffe2e69619c04516d8312c5a37 *man/prim.rule.operations.Rd
 d03472196009ee6ff1ff590eda04f9d2 *man/prim.validate.Rd
 ba90dafaf0933039d0a4caa613f0930c *man/prim.validate.metrics.Rd
 20efa17cf413e73dddb91a770320a124 *man/quasi.convex.hull.Rd

diff --git a/R/data.R b/R/data.R
@@ -13,7 +13,6 @@
 #'   \item{gender}{factor with levels male, female}
 #'   \item{class}{class variable (0 or 1)}
 #' }
-#' @source \url{http://www.cs.uu.nl/docs/vakken/adm/bump.pdf}
 "credit"
 
 #' Pima Indians Diabetes Database.
@@ -165,5 +164,4 @@
 #'   \item{Sale.Condition}{(Nominal): Condition of sale}
 #'   \item{SalePrice}{(Continuous): Sale price}
 #' }
-#' @source \url{https://ww2.amstat.org/publications/jse/v19n3/decock/datadocumentation.txt}
 "ames"
diff --git a/R/prim.R b/R/prim.R
@@ -17,7 +17,7 @@
 #' @param quality.function Function to use for determining set quality, defaults to mean
 #' @param plot Plot intermediate results, defaults to false
 #' @param minimize Should the quality be minimized? Same as setting the quality function to function(x){-quality.function(x)}. Defaults to FALSE
-#' @param optimal.box During validation, choose the box with the highest quality or a simpler box, two standard errors from the optimum
+#' @param optimal.box During validation, choose the box with the highest quality or a simpler box, two standard errors from the optimum. Defaults to best.
 #' @return An S3 object of class prim.cover
 #' @author Jurian Baas
 #' @importFrom stats model.frame model.response complete.cases terms
@@ -60,7 +60,7 @@ prim.cover <- function (
 
     X <- stats::model.frame(formula(stats::terms(formula, data = data, simplify = TRUE)), data)
     y <- stats::model.response(X)
-    X <- X[,-1]
+    X <- X[,-1, drop = FALSE]
 
     if(is.null(y)) stop("Data has no response variable, aborting...")
 
@@ -98,11 +98,11 @@ prim.cover <- function (
   result$min.support <- min.support
   result$train.fraction <- train.fraction
   result$quality.function <- quality.function
+  result$quality.function.name <- base::deparse(base::substitute(quality.function))
   result$global.quality <- quality.function(y)
 
   if(!is.na(max.boxes)) result$max.boxes <- max.boxes
 
-
   covers <- list()
 
   y.sub.quality <- result$global.quality
@@ -120,10 +120,11 @@ prim.cover <- function (
 
     box.nr <- box.nr + 1
 
+    # Take a sample so we can validate the current cover
     train <- sample(1:cur.N, cur.N * train.fraction)
 
     p.peel <- prim.peel(
-      X = X[train,],
+      X = X[train, , drop = FALSE],
       y = y[train],
       N = result$N,
       peeling.quantile = peeling.quantile,
@@ -132,7 +133,7 @@ prim.cover <- function (
       quality.function = quality.function
     )
     p.peel$global.quality <- quality.function(y[train])
-    p.validate <- prim.validate(p.peel, X[-train,], y[-train], optimal.box)
+    p.validate <- prim.validate(p.peel, X[-train, , drop = FALSE], y[-train], optimal.box)
 
     idx <- prim.rule.match(p.validate, X)
 
@@ -141,14 +142,17 @@ prim.cover <- function (
       break
     }
 
+    # The number of observations in the current cover
     p.validate$cover.N <- nrow(X)
+    # The number of observations in the current cover that fall in the box
     p.validate$cover.box.N <- sum(idx)
+    # The overall quality of the current cover
     p.validate$cover.global.quality <- quality.function(y)
+    # The quality of the box inside the current cover
     p.validate$cover.box.quality <- quality.function(y[idx])
 
-    y.sub.quality <- p.validate$cover.box.quality
-
     # This box does not meet the minimal quality
+    y.sub.quality <- p.validate$cover.box.quality
     if(y.sub.quality < result$global.quality )  break
 
     if(plot) {
@@ -158,7 +162,8 @@ prim.cover <- function (
       graphics::par(mfrow = c(1,1))
     }
 
-    X <- X[!idx,]
+    # Remove the observations that fall outside of the current box
+    X <- X[!idx, , drop = FALSE]
     y <- y[!idx]
 
     covers <- c(covers, list(p.validate))
@@ -195,7 +200,7 @@ prim.cover <- function (
 #' @param plot Plot intermediate results, defaults to false. Note that intermediate plotting is unavailable when running in parallel
 #' @param parallel Compute each run in parallel, defaults to TRUE. This will use all but one core. Note that intermediate plotting is unavailable when running in parallel
 #' @param minimize Should the quality be minimized? Same as setting the quality function to function(x){-quality.function(x)}. Defaults to FALSE
-#' @param optimal.box During validation, choose the box with the highest quality or a simpler box, two standard errors from the optimum
+#' @param optimal.box During validation, choose the box with the highest quality or a simpler box, two standard errors from the optimum. Defaults to best.
 #' @return An S3 object of type prim.diversify
 #' @author Jurian Baas
 #' @importFrom stats model.frame model.response complete.cases terms
@@ -280,7 +285,7 @@ prim.diversify <- function (
   result$min.support = min.support
   result$train.fraction = train.fraction
   result$quality.function <- quality.function
-
+  result$quality.function.name <- base::deparse(base::substitute(quality.function))
   result$N <- nrow(X)
   result$global.quality <- quality.function(y)
 
@@ -290,7 +295,7 @@ prim.diversify <- function (
     train <- sample(1:nrow(X), nrow(X) * train.fraction)
 
     p.peel <- prim.peel(
-      X = X[train,],
+      X = X[train, , drop = FALSE],
       y = y[train],
       N = result$N,
       peeling.quantile = peeling.quantile,
@@ -299,7 +304,7 @@ prim.diversify <- function (
       quality.function = quality.function)
     p.peel$global.quality <- result$global.quality
 
-    p.validate <- prim.validate(p.peel, X[-train,], y[-train], optimal.box)
+    p.validate <- prim.validate(p.peel, X[-train, , drop = FALSE], y[-train], optimal.box)
 
     if(plot) {
       graphics::par(mfrow = c(1,2))
@@ -397,7 +402,7 @@ prim.peel <- function(X, y, N, peeling.quantile, min.support, max.peel, quality.
       cf$value <- paste0("'", cf$value, "'")
     }
 
-    X <- X[-cf$idx,]
+    X <- X[-cf$idx, , drop = FALSE]
     y <- y[-cf$idx]
 
     result$box.qualities <- c(result$box.qualities, quality.function(y))
@@ -475,7 +480,7 @@ prim.validate <- function(peel.result, X, y, optimal.box) {
     # Check if this rule has any observations in the test data
     if(sum(idx) > 0) {
 
-      X <- X[!idx,]
+      X <- X[!idx, , drop = FALSE]
       y <- y[!idx]
 
       result$box.qualities <- c(result$box.qualities, quality.function(y))
@@ -691,7 +696,7 @@ prim.box.optimal <- function(prim.validate) {
     # So we pick the optimal in the subset defined by the new best box
     best.box.idx <- which.max(prim.validate$box.qualities[1:cutoff.point])
   }
-
+  if(length(best.box.idx) == 0) stop("Could not find a best box, try adding more features")
   return(best.box.idx)
 }
 
@@ -801,7 +806,7 @@ prim.diversify.compare <- function(X, p.div) {
   if(class(p.div) != "prim.diversify")
     stop("Argument is not of class prim.diversify")
 
-  frontier <- sort(p.div$frontier)
+  frontier <- rev(p.div$frontier)
   nr.of.attempts <- length(frontier)
   idx <- combn(1:nr.of.attempts, 2)
   scores <- apply(idx, 2, function(i) {
@@ -926,8 +931,8 @@ plot.prim.peel <- function(x, ...) {
 
   print.names <- character(length = best.box.idx)
 
-  for(i in 1:(best.box.idx-1)) {
-    if(x$rule.names[i] == x$rule.names[i + 1]) {
+  for(i in 1:(best.box.idx)) {
+    if(i < length(x$rule.names) & x$rule.names[i] == x$rule.names[i + 1]) {
       print.names[i] <- ""
     } else {
       print.names[i] <- paste(x$rule.names[i], x$rule.operators[i], x$rule.values[i])
@@ -985,7 +990,7 @@ plot.prim.validate <- function(x, ...) {
 
   if(best.box.idx > 1)
     for(i in 1:(best.box.idx-1)) {
-      if(x$rule.names[i] == x$rule.names[i + 1]) {
+      if(i < length(x$rule.names) & x$rule.names[i] == x$rule.names[i + 1]) {
         print.names[i] <- ""
       } else {
         print.names[i] <- paste(x$rule.names[i], x$rule.operators[i], x$rule.values[i])
@@ -1184,21 +1189,23 @@ summary.prim.cover <- function(object, ..., round = TRUE, digits = 2) {
   cat("  ========== PRIM COVER RESULT =========", "\n")
   cat("  ======================================", "\n")
   cat("  |\n")
-  cat("  |  Peeling quantile:", object$peeling.quantile, "\n")
-  cat("  |  Min support:", object$min.support, "\n")
-  cat("  |  Train/test split:", object$train.fraction, "\n")
+  cat("  |  Peeling quantile:\t", object$peeling.quantile, "\n")
+  cat("  |  Min support:\t", object$min.support, "\n")
+  cat("  |  Train/test split:\t", object$train.fraction, "\n")
+  cat("  |  Quality function:\t", object$quality.function.name, "\n")
   cat("\n")
 
   for(i in 1:length(object$covers)) {
     x <- object$covers[[i]]
     cat("\n")
     cat("  ======================================", "\n")
     cat("  ============== COVER", i,"===============", "\n")
-    cat("  |  Cover set size: ", x$cover.N, "\n")
-    cat("  |  Cover set quality: ", round(x$cover.global.quality, digits), "\n")
+    cat("  |  Cover size:\t", x$cover.N, "\n")
+    cat("  |  Cover quality:\t", round(x$cover.global.quality, digits), "\n")
     cat("  |\n")
-    cat("  |  Box relative quality: ", round(x$cover.box.quality, digits), "(", round(x$cover.box.quality / x$cover.global.quality, digits), ") \n")
-    cat("  |  Box relative support: ", round(x$cover.box.N / x$cover.N, digits) , " (", x$cover.box.N, ") \n")
+    cat("  |  Box quality:\t", round(x$cover.box.quality, digits), "\n")
+    cat("  |  Box support:\t", round(x$cover.box.N / x$cover.N, digits), "\n")
+    cat("  |  Box size:\t\t", x$cover.box.N, "\n")
     cat("\n")
     cat("  ================ RULES ===============", "\n")
     cat("  | ", paste0(x$superrule, collapse = "\n  |  "))
@@ -1234,32 +1241,34 @@ summary.prim.diversify <- function(object, ..., round = TRUE, digits = 2) {
   cat("  ======== PRIM DIVERSIFY RESULT =======", "\n")
   cat("  ======================================", "\n")
   cat("  |\n")
-  cat("  |  Peeling quantile:", object$peeling.quantile, "\n")
-  cat("  |  Min support:", object$min.support, "\n")
-  cat("  |  Train/test split:", object$train.fraction, "\n")
+  cat("  |  Peeling quantile:\t", object$peeling.quantile, "\n")
+  cat("  |  Min support:\t", object$min.support, "\n")
+  cat("  |  Train/test split:\t", object$train.fraction, "\n")
+  cat("  |  Quality function:\t", object$quality.function.name, "\n")
   cat("  |\n")
-  cat("  |  Set size: ", object$N, "\n")
-  cat("  |  Set quality: ", round(object$global.quality, digits), "\n")
+  cat("  |  Set size:\t\t", object$N, "\n")
+  cat("  |  Set quality:\t", round(object$global.quality, digits), "\n")
   cat("\n")
   cat("  Scores:", "\n  | ")
   cat(paste0(
-    gsub(
+    base::gsub(
       "NA",
       "    ",
       apply(formatC(round(object$scoreMatrix, 2), format = "f", digits = 2), 1, paste, collapse = "\t"))
     , collapse = "\n  |  ")
     )
   cat("\n\n\n")
   cat("  Dominating attempts:","\n")
-  frontier <- sort(object$frontier)
-  for(i in seq_along(frontier)) {
+  frontier <- base::sort(object$frontier)
+  for(i in base::seq_along(frontier)) {
     x <- object$attempts[[frontier[i]]]
     cat("\n")
     cat("  ======================================", "\n")
-    cat("  ============= ATTEMPT", frontier[i],"==============", "\n")
-    cat("  |  Score:", round(object$scores[i], digits), "\n")
-    cat("  |  Box quality: ", round(x$final.box.quality, digits), "(", round(x$final.box.quality / object$global.quality, digits), ") \n")
-    cat("  |  Box support: ", round(x$final.box.N / object$N, digits) , " (", x$final.box.N, ") \n")
+    cat("  ============= ATTEMPT", frontier[i],"=============", "\n")
+    cat("  |  Score:\t\t", round(object$scores[i], digits), "\n")
+    cat("  |  Box quality:\t", round(x$final.box.quality, digits), "\n")
+    cat("  |  Box support:\t", round(x$final.box.N / object$N, digits), "\n")
+    cat("  |  Box size:\t\t", x$final.box.N, "\n")
     cat("\n")
     cat("  ================ RULES ===============", "\n")
     cat("  | ", paste0(x$superrule, collapse = "\n  |  "))

diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@ This package was developed to assist in discovering interesting subgroups in mul
 
 ## Description
 
-The PRIM implementation is based on the 1998 paper "Bump hunting in high-dimensional data" by Jerome H. Friedman and Nicholas I. Fisher. PRIM involves finding a set of "rules" which combined imply unusually large (or small) values of some other target variable. Specifially one tries to find a set of subregions in which the target variable is substantially larger than overall mean. 
+The PRIM implementation is based on the 1998 paper "Bump hunting in high-dimensional data" by Jerome H. Friedman and Nicholas I. Fisher. PRIM involves finding a set of "rules" which combined imply unusually large (or small) values of some other target variable. Specifically one tries to find a set of subregions in which the target variable is substantially larger than overall mean. 
 
 The objective of bump hunting in general is to find regions in the input (attribute/feature) space with relatively high (low) values for the target variable. The regions are described by simple rules of the type if: {condition-1 & ... & condition-n} then: estimated target value. Given the data (or a subset of the data), the goal is to produce a box B within which the target mean is as large as possible. There are many problems where finding such regions is of considerable practical interest.  
 

diff --git a/man/ames.Rd b/man/ames.Rd
diff --git a/man/credit.Rd b/man/credit.Rd
diff --git a/man/prim.cover.Rd b/man/prim.cover.Rd
diff --git a/man/prim.diversify.Rd b/man/prim.diversify.Rd