Skip to content

Commit

Permalink
version 0.2.1
Browse files Browse the repository at this point in the history
  • Loading branch information
Jurian Baas authored and cran-robot committed Jun 21, 2019
1 parent c6d9f25 commit cdcdbcb
Show file tree
Hide file tree
Showing 11 changed files with 76 additions and 73 deletions.
16 changes: 8 additions & 8 deletions DESCRIPTION
@@ -1,24 +1,24 @@
Package: subgroup.discovery
Type: Package
Title: Subgroup Discovery and Bump Hunting
Version: 0.2.0
Version: 0.2.1
Authors@R: c(
person(given = "Jurian", family = "Baas", email = "jurian@jurianbaas.nl", role = c("aut", "cre", "cph")),
person(given = "Ad", family ="Feelders", email = "A.J.Feelders@uu.nl", role = c("ctb")))
person(given = "Jurian", family = "Baas", email = "j.baas@uu.nl", role = c("aut", "cre", "cph")),
person(given = "Ad", family ="Feelders", email = "a.j.feelders@uu.nl", role = c("ctb")))
Description: Developed to assist in discovering interesting subgroups in high-dimensional data. The PRIM implementation is based on the 1998 paper "Bump hunting in high-dimensional data" by Jerome H. Friedman and Nicholas I. Fisher. <doi:10.1023/A:1008894516817> PRIM involves finding a set of "rules" which combined imply unusually large (or small) values of some other target variable. Specifically one tries to find a set of sub regions in which the target variable is substantially larger than overall mean. The objective of bump hunting in general is to find regions in the input (attribute/feature) space with relatively high (low) values for the target variable. The regions are described by simple rules of the type if: condition-1 and ... and condition-n then: estimated target value. Given the data (or a subset of the data), the goal is to produce a box B within which the target mean is as large as possible. There are many problems where finding such regions is of considerable practical interest. Often these are problems where a decision maker can in a sense choose or select the values of the input variables so as to optimize the value of the target variable. In bump hunting it is customary to follow a so-called covering strategy. This means that the same box construction (rule induction) algorithm is applied sequentially to subsets of the data.
Depends: R (>= 2.10)
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
RoxygenNote: 6.1.1
URL: https://github.com/Jurian/subgroup.discovery
BugReports: https://github.com/Jurian/subgroup.discovery/issues
Date: 2017-07-15
Date: 2019-06-21
Suggests: testthat
NeedsCompilation: no
Packaged: 2017-08-02 09:04:19 UTC; juria
Packaged: 2019-06-21 14:28:41 UTC; juria
Author: Jurian Baas [aut, cre, cph],
Ad Feelders [ctb]
Maintainer: Jurian Baas <jurian@jurianbaas.nl>
Maintainer: Jurian Baas <j.baas@uu.nl>
Repository: CRAN
Date/Publication: 2017-08-02 09:09:42 UTC
Date/Publication: 2019-06-21 16:30:03 UTC
20 changes: 10 additions & 10 deletions MD5
@@ -1,13 +1,13 @@
be9e514890bcd9fac88d477dcb57e554 *DESCRIPTION
5f1418ffd70412f68d038bb91b8a879d *DESCRIPTION
b787c0195c018b76329c1ffa3e488892 *NAMESPACE
151874cac62b01f6a93303ef18b4cfa9 *R/data.R
144b8751d6224f1c61e5eceb71b26ccc *R/prim.R
13a5c3f56173bffdb63540d03ec4e71b *README.md
489f9dae3710c838cd7720511d924029 *R/data.R
79fded2e7ac1f0a81840e58eb61819f2 *R/prim.R
51f6daa6040fc8f052a516fc69b8b437 *README.md
758212ab8f8348b6edf5c016cdca769e *data/ames.rda
4ed2b76fab4f7bbc86c32968f5552817 *data/credit.rda
b55b409ecc1557da12a68d9cf614ff8a *data/pima.rda
37d2c3e5d8e6350dd5374780293d9195 *man/ames.Rd
bb8d890c9476827e8ade9b2d7827c1e1 *man/credit.Rd
04d26c7cecc8265ed12ac5b06551be03 *man/ames.Rd
293a396ed7f0ad4c7834dce7c4b9d84c *man/credit.Rd
c5a81d4f5163d3d5c35852f50d800ff0 *man/pima.Rd
bef653f6038edc99b604a985aef5cbd7 *man/plot.prim.cover.Rd
7ff886d509f8cc8fa552c095b6e286fc *man/plot.prim.diversify.Rd
Expand All @@ -16,13 +16,13 @@ bef653f6038edc99b604a985aef5cbd7 *man/plot.prim.cover.Rd
38bb5856d32ed5856308eaa13d7772e2 *man/predict.prim.cover.Rd
1e12179833917a38e55c003ae569c832 *man/prim.box.optimal.Rd
11a945c9899cc9bae3eb0a79d0cbf282 *man/prim.candidates.find.Rd
8c60387654b58bc1da8aa6057b314a4a *man/prim.cover.Rd
9a7c01f7fd4a63775c4f69e77a7ca16d *man/prim.diversify.Rd
da4e97c4c048f6c21c55b90750e659a7 *man/prim.cover.Rd
f3d435bf8e2f17711bf72bdb5685ed26 *man/prim.diversify.Rd
d1e2471b5bbe7f82dff6bc1d8d7e4219 *man/prim.diversify.compare.Rd
caba11df12fe8584ca8c2eaefe15144d *man/prim.peel.Rd
190d21145dc0020abf4a0dd17e0f6c34 *man/prim.peel.Rd
56c0c729e9d5e24e9fef37666c9427bf *man/prim.rule.condense.Rd
4c14a90f0b399b6e849ca6d0fbb77927 *man/prim.rule.match.Rd
764dbe0b8be928a0206560aa55f5d767 *man/prim.rule.operations.Rd
ab829fffe2e69619c04516d8312c5a37 *man/prim.rule.operations.Rd
d03472196009ee6ff1ff590eda04f9d2 *man/prim.validate.Rd
ba90dafaf0933039d0a4caa613f0930c *man/prim.validate.metrics.Rd
20efa17cf413e73dddb91a770320a124 *man/quasi.convex.hull.Rd
Expand Down
2 changes: 0 additions & 2 deletions R/data.R
Expand Up @@ -13,7 +13,6 @@
#' \item{gender}{factor with levels male, female}
#' \item{class}{class variable (0 or 1)}
#' }
#' @source \url{http://www.cs.uu.nl/docs/vakken/adm/bump.pdf}
"credit"

#' Pima Indians Diabetes Database.
Expand Down Expand Up @@ -165,5 +164,4 @@
#' \item{Sale.Condition}{(Nominal): Condition of sale}
#' \item{SalePrice}{(Continuous): Sale price}
#' }
#' @source \url{https://ww2.amstat.org/publications/jse/v19n3/decock/datadocumentation.txt}
"ames"
85 changes: 47 additions & 38 deletions R/prim.R
Expand Up @@ -17,7 +17,7 @@
#' @param quality.function Function to use for determining set quality, defaults to mean
#' @param plot Plot intermediate results, defaults to false
#' @param minimize Should the quality be minimized? Same as setting the quality function to function(x){-quality.function(x)}. Defaults to FALSE
#' @param optimal.box During validation, choose the box with the highest quality or a simpler box, two standard errors from the optimum
#' @param optimal.box During validation, choose the box with the highest quality or a simpler box, two standard errors from the optimum. Defaults to best.
#' @return An S3 object of class prim.cover
#' @author Jurian Baas
#' @importFrom stats model.frame model.response complete.cases terms
Expand Down Expand Up @@ -60,7 +60,7 @@ prim.cover <- function (

X <- stats::model.frame(formula(stats::terms(formula, data = data, simplify = TRUE)), data)
y <- stats::model.response(X)
X <- X[,-1]
X <- X[,-1, drop = FALSE]

if(is.null(y)) stop("Data has no response variable, aborting...")

Expand Down Expand Up @@ -98,11 +98,11 @@ prim.cover <- function (
result$min.support <- min.support
result$train.fraction <- train.fraction
result$quality.function <- quality.function
result$quality.function.name <- base::deparse(base::substitute(quality.function))
result$global.quality <- quality.function(y)

if(!is.na(max.boxes)) result$max.boxes <- max.boxes


covers <- list()

y.sub.quality <- result$global.quality
Expand All @@ -120,10 +120,11 @@ prim.cover <- function (

box.nr <- box.nr + 1

# Take a sample so we can validate the current cover
train <- sample(1:cur.N, cur.N * train.fraction)

p.peel <- prim.peel(
X = X[train,],
X = X[train, , drop = FALSE],
y = y[train],
N = result$N,
peeling.quantile = peeling.quantile,
Expand All @@ -132,7 +133,7 @@ prim.cover <- function (
quality.function = quality.function
)
p.peel$global.quality <- quality.function(y[train])
p.validate <- prim.validate(p.peel, X[-train,], y[-train], optimal.box)
p.validate <- prim.validate(p.peel, X[-train, , drop = FALSE], y[-train], optimal.box)

idx <- prim.rule.match(p.validate, X)

Expand All @@ -141,14 +142,17 @@ prim.cover <- function (
break
}

# The number of observations in the current cover
p.validate$cover.N <- nrow(X)
# The number of observations in the current cover that fall in the box
p.validate$cover.box.N <- sum(idx)
# The overall quality of the current cover
p.validate$cover.global.quality <- quality.function(y)
# The quality of the box inside the current cover
p.validate$cover.box.quality <- quality.function(y[idx])

y.sub.quality <- p.validate$cover.box.quality

# This box does not meet the minimal quality
y.sub.quality <- p.validate$cover.box.quality
if(y.sub.quality < result$global.quality ) break

if(plot) {
Expand All @@ -158,7 +162,8 @@ prim.cover <- function (
graphics::par(mfrow = c(1,1))
}

X <- X[!idx,]
# Remove the observations that fall outside of the current box
X <- X[!idx, , drop = FALSE]
y <- y[!idx]

covers <- c(covers, list(p.validate))
Expand Down Expand Up @@ -195,7 +200,7 @@ prim.cover <- function (
#' @param plot Plot intermediate results, defaults to false. Note that intermediate plotting is unavailable when running in parallel
#' @param parallel Compute each run in parallel, defaults to TRUE. This will use all but one core. Note that intermediate plotting is unavailable when running in parallel
#' @param minimize Should the quality be minimized? Same as setting the quality function to function(x){-quality.function(x)}. Defaults to FALSE
#' @param optimal.box During validation, choose the box with the highest quality or a simpler box, two standard errors from the optimum
#' @param optimal.box During validation, choose the box with the highest quality or a simpler box, two standard errors from the optimum. Defaults to best.
#' @return An S3 object of type prim.diversify
#' @author Jurian Baas
#' @importFrom stats model.frame model.response complete.cases terms
Expand Down Expand Up @@ -280,7 +285,7 @@ prim.diversify <- function (
result$min.support = min.support
result$train.fraction = train.fraction
result$quality.function <- quality.function

result$quality.function.name <- base::deparse(base::substitute(quality.function))
result$N <- nrow(X)
result$global.quality <- quality.function(y)

Expand All @@ -290,7 +295,7 @@ prim.diversify <- function (
train <- sample(1:nrow(X), nrow(X) * train.fraction)

p.peel <- prim.peel(
X = X[train,],
X = X[train, , drop = FALSE],
y = y[train],
N = result$N,
peeling.quantile = peeling.quantile,
Expand All @@ -299,7 +304,7 @@ prim.diversify <- function (
quality.function = quality.function)
p.peel$global.quality <- result$global.quality

p.validate <- prim.validate(p.peel, X[-train,], y[-train], optimal.box)
p.validate <- prim.validate(p.peel, X[-train, , drop = FALSE], y[-train], optimal.box)

if(plot) {
graphics::par(mfrow = c(1,2))
Expand Down Expand Up @@ -397,7 +402,7 @@ prim.peel <- function(X, y, N, peeling.quantile, min.support, max.peel, quality.
cf$value <- paste0("'", cf$value, "'")
}

X <- X[-cf$idx,]
X <- X[-cf$idx, , drop = FALSE]
y <- y[-cf$idx]

result$box.qualities <- c(result$box.qualities, quality.function(y))
Expand Down Expand Up @@ -475,7 +480,7 @@ prim.validate <- function(peel.result, X, y, optimal.box) {
# Check if this rule has any observations in the test data
if(sum(idx) > 0) {

X <- X[!idx,]
X <- X[!idx, , drop = FALSE]
y <- y[!idx]

result$box.qualities <- c(result$box.qualities, quality.function(y))
Expand Down Expand Up @@ -691,7 +696,7 @@ prim.box.optimal <- function(prim.validate) {
# So we pick the optimal in the subset defined by the new best box
best.box.idx <- which.max(prim.validate$box.qualities[1:cutoff.point])
}

if(length(best.box.idx) == 0) stop("Could not find a best box, try adding more features")
return(best.box.idx)
}

Expand Down Expand Up @@ -801,7 +806,7 @@ prim.diversify.compare <- function(X, p.div) {
if(class(p.div) != "prim.diversify")
stop("Argument is not of class prim.diversify")

frontier <- sort(p.div$frontier)
frontier <- rev(p.div$frontier)
nr.of.attempts <- length(frontier)
idx <- combn(1:nr.of.attempts, 2)
scores <- apply(idx, 2, function(i) {
Expand Down Expand Up @@ -926,8 +931,8 @@ plot.prim.peel <- function(x, ...) {

print.names <- character(length = best.box.idx)

for(i in 1:(best.box.idx-1)) {
if(x$rule.names[i] == x$rule.names[i + 1]) {
for(i in 1:(best.box.idx)) {
if(i < length(x$rule.names) & x$rule.names[i] == x$rule.names[i + 1]) {
print.names[i] <- ""
} else {
print.names[i] <- paste(x$rule.names[i], x$rule.operators[i], x$rule.values[i])
Expand Down Expand Up @@ -985,7 +990,7 @@ plot.prim.validate <- function(x, ...) {

if(best.box.idx > 1)
for(i in 1:(best.box.idx-1)) {
if(x$rule.names[i] == x$rule.names[i + 1]) {
if(i < length(x$rule.names) & x$rule.names[i] == x$rule.names[i + 1]) {
print.names[i] <- ""
} else {
print.names[i] <- paste(x$rule.names[i], x$rule.operators[i], x$rule.values[i])
Expand Down Expand Up @@ -1184,21 +1189,23 @@ summary.prim.cover <- function(object, ..., round = TRUE, digits = 2) {
cat(" ========== PRIM COVER RESULT =========", "\n")
cat(" ======================================", "\n")
cat(" |\n")
cat(" | Peeling quantile:", object$peeling.quantile, "\n")
cat(" | Min support:", object$min.support, "\n")
cat(" | Train/test split:", object$train.fraction, "\n")
cat(" | Peeling quantile:\t", object$peeling.quantile, "\n")
cat(" | Min support:\t", object$min.support, "\n")
cat(" | Train/test split:\t", object$train.fraction, "\n")
cat(" | Quality function:\t", object$quality.function.name, "\n")
cat("\n")

for(i in 1:length(object$covers)) {
x <- object$covers[[i]]
cat("\n")
cat(" ======================================", "\n")
cat(" ============== COVER", i,"===============", "\n")
cat(" | Cover set size: ", x$cover.N, "\n")
cat(" | Cover set quality: ", round(x$cover.global.quality, digits), "\n")
cat(" | Cover size:\t", x$cover.N, "\n")
cat(" | Cover quality:\t", round(x$cover.global.quality, digits), "\n")
cat(" |\n")
cat(" | Box relative quality: ", round(x$cover.box.quality, digits), "(", round(x$cover.box.quality / x$cover.global.quality, digits), ") \n")
cat(" | Box relative support: ", round(x$cover.box.N / x$cover.N, digits) , " (", x$cover.box.N, ") \n")
cat(" | Box quality:\t", round(x$cover.box.quality, digits), "\n")
cat(" | Box support:\t", round(x$cover.box.N / x$cover.N, digits), "\n")
cat(" | Box size:\t\t", x$cover.box.N, "\n")
cat("\n")
cat(" ================ RULES ===============", "\n")
cat(" | ", paste0(x$superrule, collapse = "\n | "))
Expand Down Expand Up @@ -1234,32 +1241,34 @@ summary.prim.diversify <- function(object, ..., round = TRUE, digits = 2) {
cat(" ======== PRIM DIVERSIFY RESULT =======", "\n")
cat(" ======================================", "\n")
cat(" |\n")
cat(" | Peeling quantile:", object$peeling.quantile, "\n")
cat(" | Min support:", object$min.support, "\n")
cat(" | Train/test split:", object$train.fraction, "\n")
cat(" | Peeling quantile:\t", object$peeling.quantile, "\n")
cat(" | Min support:\t", object$min.support, "\n")
cat(" | Train/test split:\t", object$train.fraction, "\n")
cat(" | Quality function:\t", object$quality.function.name, "\n")
cat(" |\n")
cat(" | Set size: ", object$N, "\n")
cat(" | Set quality: ", round(object$global.quality, digits), "\n")
cat(" | Set size:\t\t", object$N, "\n")
cat(" | Set quality:\t", round(object$global.quality, digits), "\n")
cat("\n")
cat(" Scores:", "\n | ")
cat(paste0(
gsub(
base::gsub(
"NA",
" ",
apply(formatC(round(object$scoreMatrix, 2), format = "f", digits = 2), 1, paste, collapse = "\t"))
, collapse = "\n | ")
)
cat("\n\n\n")
cat(" Dominating attempts:","\n")
frontier <- sort(object$frontier)
for(i in seq_along(frontier)) {
frontier <- base::sort(object$frontier)
for(i in base::seq_along(frontier)) {
x <- object$attempts[[frontier[i]]]
cat("\n")
cat(" ======================================", "\n")
cat(" ============= ATTEMPT", frontier[i],"==============", "\n")
cat(" | Score:", round(object$scores[i], digits), "\n")
cat(" | Box quality: ", round(x$final.box.quality, digits), "(", round(x$final.box.quality / object$global.quality, digits), ") \n")
cat(" | Box support: ", round(x$final.box.N / object$N, digits) , " (", x$final.box.N, ") \n")
cat(" ============= ATTEMPT", frontier[i],"=============", "\n")
cat(" | Score:\t\t", round(object$scores[i], digits), "\n")
cat(" | Box quality:\t", round(x$final.box.quality, digits), "\n")
cat(" | Box support:\t", round(x$final.box.N / object$N, digits), "\n")
cat(" | Box size:\t\t", x$final.box.N, "\n")
cat("\n")
cat(" ================ RULES ===============", "\n")
cat(" | ", paste0(x$superrule, collapse = "\n | "))
Expand Down
2 changes: 1 addition & 1 deletion README.md
Expand Up @@ -6,7 +6,7 @@ This package was developed to assist in discovering interesting subgroups in mul

## Description

The PRIM implementation is based on the 1998 paper "Bump hunting in high-dimensional data" by Jerome H. Friedman and Nicholas I. Fisher. PRIM involves finding a set of "rules" which combined imply unusually large (or small) values of some other target variable. Specifially one tries to find a set of subregions in which the target variable is substantially larger than overall mean.
The PRIM implementation is based on the 1998 paper "Bump hunting in high-dimensional data" by Jerome H. Friedman and Nicholas I. Fisher. PRIM involves finding a set of "rules" which combined imply unusually large (or small) values of some other target variable. Specifically one tries to find a set of subregions in which the target variable is substantially larger than overall mean.

The objective of bump hunting in general is to find regions in the input (attribute/feature) space with relatively high (low) values for the target variable. The regions are described by simple rules of the type if: {condition-1 & ... & condition-n} then: estimated target value. Given the data (or a subset of the data), the goal is to produce a box B within which the target mean is as large as possible. There are many problems where finding such regions is of considerable practical interest.

Expand Down
3 changes: 0 additions & 3 deletions man/ames.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 0 additions & 3 deletions man/credit.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

10 changes: 5 additions & 5 deletions man/prim.cover.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion man/prim.diversify.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

0 comments on commit cdcdbcb

Please sign in to comment.