
Error: Character string limit #24

Closed

mustaszewski opened this issue May 8, 2018 · 3 comments

Comments

@mustaszewski

I am trying to tokenise, lemmatise and part-of-speech tag a large corpus of English texts. There are approximately 160,000 texts in the corpus, totalling approximately 46 million tokens, which means that on average the individual texts are relatively short (approx. 290 tokens). Following the example of the brussels_reviews dataset, the corpus is stored in a data table with the individual raw texts in one column (see attached screenshot of a corpus sample).
[screenshot: corpus_sample]

As usual, I try to parse the corpus by calling

library(udpipe)
library(data.table)

# Load the pre-trained English model and annotate (tagging only, no dependency parsing)
model <- udpipe_load_model(file = model_path)
anno.dt <- as.data.table(udpipe_annotate(model, x = corpus.dt$text,
                                         doc_id = corpus.dt$doc_id,
                                         tagger = "default",
                                         parser = "none"))

This works like a charm for a Polish corpus of approximately 13 million tokens. However, for English I get the following error message:

Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger, : R character strings are limited to 2^31-1 bytes

I am puzzled by this error, because the longest text in the corpus is only 32,458 characters long, as evidenced by the textlen column of the corpus data table, and the sum of characters across all texts is 274,025,244, which is well below the limit stated in the error message. As a further sanity check, I split the whole corpus on white space using corpus.dt[, .(word = unlist(strsplit(text, "[[:space:]]")), doc_nr = .GRP), by = .(doc_id)] in order to assess the length of the longest token in the corpus; this simple white-space-splitting procedure revealed that the longest token is only 114 characters long. I therefore have no idea what could have caused the error.
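
For reference, the length checks were roughly along the following lines (corpus.dt and its text column are the corpus data table described above, so these names are specific to my data):

# Per-text character counts and the corpus-wide total
corpus.dt[, textlen := nchar(text)]
max(corpus.dt$textlen)   # longest single text: 32,458 characters
sum(corpus.dt$textlen)   # all texts combined: 274,025,244 characters

# Longest whitespace-delimited token across the whole corpus
tokens <- corpus.dt[, .(word = unlist(strsplit(text, "[[:space:]]"))), by = .(doc_id)]
max(nchar(tokens$word))  # 114 characters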

Do you have any idea or suggestion as to what could have gone wrong? As mentioned before, everything worked perfectly fine with a smaller Polish corpus. I would appreciate any comments in this regard, because otherwise your UDPipe implementation does everything I need without any problems.

@jwijffels
Contributor

This is probably because, during tagging, the output is put into one string: the result of udpipe_annotate has an element called conllu, which is a length-1 character vector containing all the text in CoNLL-U format, as you can see in the code below. I think that string becomes too large (larger than 2^31-1 bytes).

library(udpipe)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")

# Download and load the Spanish model, then annotate a few reviews
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback[1:3])

# The result holds a single character string with all annotations in CoNLL-U format
cat(x$conllu, sep = "\n")
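
To get a feel for how large that single string becomes for a given set of texts (assuming the chunk is small enough to annotate at all), you can check its size in bytes:

# Size in bytes of the single CoNLL-U string produced for this chunk
nchar(x$conllu, type = "bytes")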

The solution is just to split the data into chunks and apply the annotation per chunk, something like this:

library(data.table)
ud_model <- udpipe_download_model(language = "spanish")

# Split the corpus into chunks of about 1000 documents and annotate each chunk separately
x <- split(comments, seq(1, nrow(comments), by = 1000))
x <- lapply(x,
       FUN = function(x, modelfile){
         ud_model <- udpipe_load_model(modelfile)
         as.data.frame(udpipe_annotate(ud_model, x = x$feedback, doc_id = x$id))
       }, modelfile = ud_model$file_model)

# Combine the per-chunk annotations back into one table
x <- rbindlist(x)

It seems to me that in that case, each chunk's final CoNLL-U result will stay below 2^31-1 bytes.
You can also use mclapply if you have several CPUs, which makes this run in parallel, which is an advantage. Can you try this?

@mustaszewski
Author

mustaszewski commented May 17, 2018

Thank you for replying; your suggested solution works. In addition, using mclapply considerably speeds up the annotation. Since I do not need all columns of the annotated output (e.g. no need for the sentence or the dependency relations), I have written a function that removes these columns, which helps to save memory. The function is passed as the FUN argument of mclapply:

library(udpipe)
library(data.table)
library(parallel)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "fr")
ud_model <- udpipe_download_model(language = "french-partut")
ud_model <- udpipe_load_model(ud_model$file_model)

# Annotate one chunk (tagging only, no dependency parsing)
annotate_splits <- function(x) {
  x <- as.data.table(udpipe_annotate(ud_model, x = x$feedback,
                                     doc_id = x$id, tagger = "default",
                                     parser = "none"))
  # Remove columns that are not required, to save memory
  x <- x[, c("sentence", "feats", "head_token_id", "dep_rel", "deps") := NULL]
  return(x)
}

# Split into chunks of about 100 documents and annotate them in parallel on 8 cores
corpus_splitted <- split(comments, seq(1, nrow(comments), by = 100))
annotation <- mclapply(corpus_splitted, FUN = function(x) annotate_splits(x), mc.cores = 8)
annotation <- rbindlist(annotation)

@jwijffels
Contributor

jwijffels commented May 17, 2018

Good to know. Feel free to close the issue.
FYI, the newest version of udpipe, which landed on CRAN yesterday, has a trace argument that lets you see how far along the annotation is, which might be useful if you have a lot of text to annotate.
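
For reference, a minimal sketch of what that could look like with the example data from above (assuming trace also accepts a positive integer giving how often progress is reported):

# Report progress while annotating; parser switched off as in the examples above
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id,
                     tagger = "default", parser = "none", trace = TRUE)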
