Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speed up buildScoreCache #58

Open
matteodelucchi opened this issue Dec 19, 2023 · 1 comment
Open

speed up buildScoreCache #58

matteodelucchi opened this issue Dec 19, 2023 · 1 comment
Assignees
Labels
good first issue Good for newcomers

Comments

@matteodelucchi
Copy link
Contributor

matteodelucchi commented Dec 19, 2023

Dear Matteo,

Included is a patch from the latest version to help scale building the score cache for the mle option. I'm using a "Sparse Candidate" type algorithm, so the number
of possible parents is normally quite constrained, in the region of 10s. This algorithm also needs to be able to check the scoring on adding
a single node, so I've had to make max.parents per node (I'm not sure why it was forbidden before). I've tested this on features running to
1000s and it seems to work quite well and replicates the previous results.

I also have code that scales the hill climbing algorithm to 1000s of variables, in R, but this is missing some of the functionality of the C
code, so I won't offer it yet.

Any questions, please get in touch!

Many thanks,

Rónán

On 15 Nov 2023, at 16:57, Delucchi Matteo [xxx] wrote:

Dear Ronan,

Thank you for your interest in our abn package and for taking the time to provide feedback.
We currently host the code on our institute’s GitLab server, which can be found at this link: https://git.math.uzh.ch/mdeluc/abn
I greatly appreciate your contribution towards improving the scalability of the package. I would happily review your patch and consider incorporating it for the next release.
Please don’t hesitate to reach out if you encounter any issues or have further questions.
Thank you again for your feedback!

Best regards,
Matteo

From: Ronan
Subject: abn R package

Dear Mr Delucchi,

I've been experimenting with the abn package and found that scaling up to large numbers of nodes was causing issues
with the code setting up the cache structure, specifically in buildScoreCache.mle where banned possibilities are filtered
out, was causing runtime to grow perhaps quadratically. I've implemented a fix that means the code can now scale to
larger examples and I'm wondering is there a way to incorporate this into the mainline of your package? I haven't seen a
github repository, but could send a patch etc.

Many thanks,

Ronan

diff --git a/R/build_score_cache_mle.R b/R/build_score_cache_mle.R
index 6e3e650..5b03b3f 100755
--- a/R/build_score_cache_mle.R
+++ b/R/build_score_cache_mle.R
@@ -377,6 +377,9 @@ buildScoreCache.mle <-
 
     ############################## Function to create the cache
 
+    if ( length(max.parents) == 1 ) {
+        max.parents <- rep(max.parents, nvars)
+    }
 
     if (!is.null(defn.res)) {
         max.parents <- max(apply(defn.res[["node.defn"]], 1, sum))
@@ -392,83 +395,64 @@ buildScoreCache.mle <-
             return(v)
         }
 
-        node.defn <- matrix(data = as.integer(0), nrow = 1L, ncol = nvars)
-        children <- 1
+        ## Generate all possible bit patterns for n variables, with a maximum of m 1s
+        generateBitPatterns = function(n, m) {
+          z <- rep(0,n)
+          do.call(rbind, lapply(0:m, function(i) t(apply(combn(1:n,i), 2, function(k) {z[k]=1;z}))))
+        }
 
-        for (j in 1:nvars) {
-            if (j != 1) {
-                node.defn <- rbind(node.defn, matrix(data = as.integer(0),
-                                                     nrow = 1L, ncol = nvars))
-                children <- cbind(children, j)
-            }
-            # node.defn <- rbind(node.defn,matrix(data = 0,nrow = 1,ncol = n))
+        # Function to generate all possible combinations of parents
+        filteredCombinations = function(x, m, bannedParents, retainedParents) {
+          # These are the parents that cannot change
+          fixedParents = bannedParents | retainedParents | (fun.return(x, length(x) + 2) + 1) %% 2
+          # These are the parents that can change
+          parentPossibleChoices = which(fixedParents == 0)
+          numPossibleChoices = length(parentPossibleChoices)
+          numRetainedParents = sum(retainedParents)
+
+          # Generate all possible combinations of parents, taking account of banned, retained and maximum number of parents
+          parentChoices = generateBitPatterns(numPossibleChoices, min(m-numRetainedParents, numPossibleChoices)) == 1
+          output = t(apply(parentChoices, 1, function(pc) {
+            combinedRow = 1L*(retainedParents | fun.return(parentPossibleChoices[pc], length(x) + 2))
+            combinedRow
+          }))
+          output
+        }
+
+        children <- matrix(nrow=1, ncol=0)
+        node.defn.list = list()
 
+        for (j in 1:nvars) {
           if(is.list(max.parents)){
             stop("ISSUE: `max.parents` as list is not yet implemented further down here. Try with a single numeric value as max.parents instead.")
             if(!is.null(which.nodes)){
               stop("ISSUE: `max.parents` as list in combination with `which.nodes` is not yet implemented further down here. Try with single numeric as max.parents instead.")
             }
-          } else if (is.numeric(max.parents) && length(max.parents)>1){
-            if (length(unique(max.parents)) == 1){
-              max.parents <- unique(max.parents)
-            } else {
-              stop("ISSUE: `max.parents` with node specific values that are not all the same, is not yet implemented further down here.")
-            }
-          }
-
-          if(max.parents == nvars){
-            max.parents <- max.parents-1
-            warning(paste("`max.par` == no. of variables. I set it to (no. of variables - 1)=", max.parents)) #NOTE: This might cause differences to method="bayes"!
           }
 
-            for (i in 1:(max.parents)) {
-                tmp <- t(combn(x = (nvars - 1), m = i, FUN = fun.return, n = nvars, simplify = TRUE))
-                tmp <- t(apply(X = tmp, MARGIN = 1, FUN = function(x) append(x = x, values = 0, after = j - 1)))
-
-                node.defn <- rbind(node.defn, tmp)
-
-                # children position
-                children <- cbind(children, t(rep(j, length(tmp[, 1]))))
-            }
+            # The parents that are banned and retained for node j
+            bannedParents = dag.banned[j, ]
+            retainedParents = dag.retained[j, ]
+            # All possible parents for node j, which is all nodes except j
+            parentChoice = c(seq.int(from=1, length.out=j-1), seq.int(from=j+1, length.out=nvars-j))
+            # How many parents we are keeping for node j
+            numRetainedParents = sum(retainedParents)
+            # The maximum number of parents for node j
+            m = max.parents[j]
+
+            # Generate all possible combinations of parents for node j
+            tmp <- filteredCombinations(x = parentChoice, m=m, bannedParents=bannedParents, retainedParents=retainedParents)
+            # We need a sparse matrix here to deal with large numbers of variables, otherwise memory usage if very high.
+            tmp2 = Matrix(tmp, sparse = TRUE)
+            node.defn.list[[length(node.defn.list) + 1]] <- tmp2
+            children <- cbind(children, t(rep(j, length(tmp2[, 1]))))
         }
 
-        # children <- rowSums(node.defn)
+        node.defn = do.call(rbind, node.defn.list)
         colnames(node.defn) <- colnames(data.df)
-        ## Coerce numeric matrix into integer matrix !!!
-        node.defn <- apply(node.defn, c(1, 2), function(x) {
-            (as.integer(x))
-        })
-
         children <- as.integer(children)
         # node.defn_ <- node.defn
 
-        ## DAG RETAIN/BANNED
-        for (i in 1:nvars) {
-            for (j in 1:nvars) {
-
-                ## DAG RETAIN
-                if (dag.retained[i, j] != 0) {
-                  tmp.indices <- which(children == i & node.defn[, j] == 0)
-
-                  if (length(tmp.indices) != 0) {
-                    node.defn <- node.defn[-tmp.indices, ]
-                    children <- children[-tmp.indices]
-                  }
-                }
-
-                ## DAG BANNED
-                if (dag.banned[i, j] != 0) {
-                  tmp.indices <- which(children == i & node.defn[, j] == 1)
-
-                  if (length(tmp.indices) != 0) {
-                    node.defn <- node.defn[-tmp.indices, ]
-                    children <- children[-tmp.indices]
-                  }
-                }
-
-            }
-        }
-
         mycache <- list(children = as.integer(children), node.defn = (node.defn))
 
         ###------------------------------###
@matteodelucchi
Copy link
Contributor Author

Before this is started, check out issue #55 , which might render this unnecessary.

@matteodelucchi matteodelucchi transferred this issue from another repository Apr 9, 2024
@matteodelucchi matteodelucchi transferred this issue from furrer-lab/shuttle-abn Apr 9, 2024
@matteodelucchi matteodelucchi added the good first issue Good for newcomers label Apr 9, 2024
@matteodelucchi matteodelucchi self-assigned this Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

1 participant