
Error: Character string limit #24

Closed

mustaszewski opened this issue May 8, 2018 · 3 comments

Comments

@mustaszewski

I am trying to tokenise, lemmatise and part-of-speech tag a large corpus of English texts. There are approximately 160,000 texts in the corpus, totalling approximately 46 million tokens, which means that on average the individual texts are relatively short (approx. 290 tokens). Following the example of the brussels_reviews dataset, the corpus is stored in a data table with the individual raw texts in one column (see attached screenshot of a corpus sample).
[screenshot: corpus_sample]

As usual, I try to parse the corpus by calling

library(udpipe)
library(data.table)

# Load the pre-trained English model and annotate (tagging only, no dependency parsing)
model <- udpipe_load_model(file = model_path)
anno.dt <- as.data.table(udpipe_annotate(model, x = corpus.dt$text,
                                         doc_id = corpus.dt$doc_id,
                                         tagger = "default",
                                         parser = "none"))

This works like a charm for a Polish corpus of approximately 13 million tokens. However, for English I get the following error message:

Error in udp_tokenise_tag_parse(object$model, x, doc_id, tokenizer, tagger, : R character strings are limited to 2^31-1 bytes

I am puzzled by this error, because the longest text in the corpus is only 32,458 characters long, as evidenced by the textlen column of the corpus data table, and the sum of characters across all texts is 274,025,244, which is well below the limit stated in the error message. As a further sanity check, I split the whole corpus on white space using corpus.dt[, .(word = unlist(strsplit(text, "[[:space:]]")), doc_nr = .GRP), by = .(doc_id)] in order to assess the length of the longest token in the corpus; this simple white-space-splitting procedure revealed that the longest token is only 114 characters long. I therefore have no idea what could have caused the error.
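
For reference, the length checks were roughly along the following lines (corpus.dt and its text column are the corpus data table described above, so these names are specific to my data):

# Per-text character counts and the corpus-wide total
corpus.dt[, textlen := nchar(text)]
max(corpus.dt$textlen)   # longest single text: 32,458 characters
sum(corpus.dt$textlen)   # all texts combined: 274,025,244 characters

# Longest whitespace-delimited token across the whole corpus
tokens <- corpus.dt[, .(word = unlist(strsplit(text, "[[:space:]]"))), by = .(doc_id)]
max(nchar(tokens$word))  # 114 characters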

Do you have any idea or suggestion as to what could have gone wrong? As mentioned before, everything worked perfectly fine with a smaller Polish corpus. I would appreciate any comments in this regard, because otherwise your UDPipe implementation does everything I need without any problems.

@jwijffels
Contributor

This is probably because, during tagging, the output is put into one string: the result of udpipe_annotate has an element called conllu, which is a length-1 character vector containing all the text in CoNLL-U format, as you can see in the code below. I think that string becomes too large (larger than 2^31-1 bytes).

library(udpipe)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")

# Download and load the Spanish model, then annotate a few reviews
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback[1:3])

# The result holds a single character string with all annotations in CoNLL-U format
cat(x$conllu, sep = "\n")
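
To get a feel for how large that single string becomes for a given set of texts (assuming the chunk is small enough to annotate at all), you can check its size in bytes:

# Size in bytes of the single CoNLL-U string produced for this chunk
nchar(x$conllu, type = "bytes")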

The solution is just to split the data into chunks and apply the annotation per chunk, something like this:

library(data.table)
ud_model <- udpipe_download_model(language = "spanish")

# Split the corpus into chunks of about 1000 documents and annotate each chunk separately
x <- split(comments, seq(1, nrow(comments), by = 1000))
x <- lapply(x,
       FUN = function(x, modelfile){
         ud_model <- udpipe_load_model(modelfile)
         as.data.frame(udpipe_annotate(ud_model, x = x$feedback, doc_id = x$id))
       }, modelfile = ud_model$file_model)

# Combine the per-chunk annotations back into one table
x <- rbindlist(x)

It seems to me that in that case, each chunk's final CoNLL-U result will stay below 2^31-1 bytes.
You can also use mclapply if you have several CPUs, which makes this run in parallel, which is an advantage. Can you try this?

@mustaszewski
Author

mustaszewski commented May 17, 2018

Thank you for replying; your suggested solution works. In addition, using mclapply considerably speeds up the annotation. Since I do not need all columns of the annotated output (e.g. no need for the sentence or the dependency relations), I have written a function that removes these columns, which helps to save memory. The function is passed as the FUN argument of mclapply:

library(udpipe)
library(data.table)
library(parallel)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "fr")
ud_model <- udpipe_download_model(language = "french-partut")
ud_model <- udpipe_load_model(ud_model$file_model)

# Annotate one chunk (tagging only, no dependency parsing)
annotate_splits <- function(x) {
  x <- as.data.table(udpipe_annotate(ud_model, x = x$feedback,
                                     doc_id = x$id, tagger = "default",
                                     parser = "none"))
  # Remove columns that are not required, to save memory
  x <- x[, c("sentence", "feats", "head_token_id", "dep_rel", "deps") := NULL]
  return(x)
}

# Split into chunks of about 100 documents and annotate them in parallel on 8 cores
corpus_splitted <- split(comments, seq(1, nrow(comments), by = 100))
annotation <- mclapply(corpus_splitted, FUN = function(x) annotate_splits(x), mc.cores = 8)
annotation <- rbindlist(annotation)

@jwijffels
Contributor

jwijffels commented May 17, 2018

Good to know. Feel free to close the issue.
FYI, the newest version of udpipe, which landed on CRAN yesterday, has a trace argument that lets you see how far along the annotation is, which might be useful if you have a lot of text to annotate.
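
For reference, a minimal sketch of what that could look like with the example data from above (assuming trace also accepts a positive integer giving how often progress is reported):

# Report progress while annotating; parser switched off as in the examples above
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id,
                     tagger = "default", parser = "none", trace = TRUE)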
