Error in [.data.table when using special characters #99

Closed
etiennebacher opened this issue Oct 17, 2021 · 2 comments

@etiennebacher

etiennebacher commented Oct 17, 2021

Hello,

I may have found a bug that was introduced in version 0.8.6 (the latest version on CRAN at the time of writing). Text containing special characters generates the following error:

library(udpipe)
library(tm)

# Text data
textData <- data.frame(
  doc_id = 1,
  text = "tradução"
)

# Download and load model
udModel <- udpipe_download_model(language  = "portuguese-gsd", 
                                 model_dir = getwd())

udModel <- udpipe_load_model('portuguese-gsd-ud-2.5-191206.udpipe')

# Make a corpus 
textCorp <- VCorpus(DataframeSource(textData))
text     <- lapply(textCorp, content)


text <- data.frame(doc_id = 1:nrow(textData), 
                   text   = unlist(text))

udpipe(text, object = udModel)
Error in `[.data.table`(out, , `:=`(term_id, 1L:.N), by = list(doc_id)) : 
  Supplied 2 items to be assigned to group 1 of size 0 in column 'term_id'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.
In addition: Warning message:
In strsplit(x$conllu, "\n", fixed = TRUE) : input string 1 is invalid UTF-8

The error is generated by the letters "çã" in the text (removing them makes the error disappear). Also, I think this error is generated by the following line in the source code:

txt <- strsplit(x$conllu, "\n", fixed = TRUE)[[1]]

Removing fixed = TRUE in the line above removes the error. In case it helps, fixed = TRUE was introduced in c7557b6.

Session info
- Session info ---------------------------------------------------------
 setting  value                       
 version  R version 4.1.0 (2021-05-18)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  French_France.1252          
 ctype    French_France.1252          
 tz       Europe/Paris                
 date     2021-10-18                  
- Packages -------------------------------------------------------------
 package     * version date       lib source
 cli           3.0.1   2021-07-17 [1] CRAN (R 4.1.1)
 data.table    1.14.2  2021-09-27 [1] standard (@1.14.2)
 lattice       0.20-45 2021-09-22 [1] CRAN (R 4.1.1)
 Matrix        1.3-4   2021-06-01 [1] CRAN (R 4.1.0)
 NLP         * 0.2-1   2020-10-14 [1] standard (@0.2-1)
 Rcpp          1.0.7   2021-07-07 [1] standard (@1.0.7)
 rstudioapi    0.13    2020-11-12 [1] standard (@0.13)
 sessioninfo   1.1.1   2018-11-05 [1] standard (@1.1.1)
 slam          0.1-48  2020-12-03 [1] standard (@0.1-48)
 tm          * 0.7-8   2020-11-18 [1] standard (@0.7-8)
 udpipe      * 0.8.6   2021-06-01 [1] standard (@0.8.6)
 withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
 xml2          1.3.2   2020-04-23 [1] CRAN (R 4.1.0)

[1] C:/Users/etienne/Documents/R/R-4.1.0/library

Best,

@jwijffels
Contributor

What happens if you put your text in UTF-8 encoding, as indicated in the help?

@etiennebacher
Author

Indeed, using text = enc2utf8("tradução") works. Thanks, and sorry for the inconvenience.
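For anyone landing here with the same error, here is a minimal sketch of the workaround using only base R. It assumes the text arrives in a Latin-1 style encoding, as it can on a Windows machine with a 1252 locale; iconv() is used only to simulate that input. The helper name to_utf8 is hypothetical and not part of udpipe.

```r
# Simulate text that is NOT UTF-8 (Latin-1 bytes), as can happen under a
# Windows locale such as French_France.1252. Assumption for illustration only.
raw_text <- iconv("tradu\u00e7\u00e3o", from = "UTF-8", to = "latin1")

# Hypothetical helper (not a udpipe function): convert to UTF-8 and fail
# early if any element is still not valid UTF-8.
to_utf8 <- function(x) {
  x <- enc2utf8(as.character(x))
  bad <- !validUTF8(x)
  if (any(bad)) {
    stop("Elements not valid UTF-8 at positions: ",
         paste(which(bad), collapse = ", "))
  }
  x
}

text_df <- data.frame(doc_id = 1, text = raw_text, stringsAsFactors = FALSE)

# Convert the text column before handing the data frame to udpipe().
text_df$text <- to_utf8(text_df$text)

# With a loaded model, the call from the report would then be:
# udpipe(text_df, object = udModel)
```

Checking with validUTF8() before annotating should surface the encoding problem directly, rather than as the less obvious `[.data.table` error above.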
