Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer #10

tinltan · 2021-10-26T03:42:22Z

Hello! When I first tried your ETM package in R using the Belgian parliament data, it worked. However, when I was testing it on my (small) data, after running the ETM() function and optimizer, I encountered this error:

Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer

Here is my code:

library(torch)            # provides torch_manual_seed() and optim_adam()
library(dplyr)            # provides filter()
library(topicmodels.etm)
library(doc2vec)
library(word2vec)
gcash_data <- read.csv("GCash_200_Reviews_PlayStore_RepeatScroll20_Wait5s_TimeOut60s_AJAx.csv")
names(gcash_data) <- c("UserName", "Date", "Likes", "Review", "Rating")
gcash_r5 <- filter(gcash_data, Rating == "5")
head(gcash_r5)
str(gcash_r5)

x      <- data.frame(doc_id           = gcash_r5$UserName, 
                     text             = gcash_r5$Review, 
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)

w2v        <- word2vec(x = x$text, dim = 25, type = "skip-gram", iter = 10, min_count = 5, threads = 2)
embeddings <- as.matrix(w2v)
predict(w2v, newdata = c("app", "convenient"), type = "nearest", top_n = 4)

library(udpipe)
dtm   <- strsplit.data.frame(x, group = "doc_id", term = "text", split = " ")
dtm   <- document_term_frequencies(dtm)
dtm   <- document_term_matrix(dtm)
dtm   <- dtm_remove_tfidf(dtm, prob = 0.50)

vocab        <- intersect(rownames(embeddings), colnames(dtm))
embeddings   <- dtm_conform(embeddings, rows = vocab)
dtm          <- dtm_conform(dtm,     columns = vocab)
dim(dtm)
dim(embeddings)

set.seed(1234)
torch_manual_seed(4321)
model     <- ETM(k = 5, dim = 100, embeddings = embeddings)
optimizer <- optim_adam(params = model$parameters, lr = 0.005, weight_decay = 0.0000012)
loss      <- model$fit(data = dtm, optimizer = optimizer, epoch = 20, batch_size = 5)

As the code shows, I set k = 5 topics in ETM() and batch_size = 5 in model$fit().

After running the last line above with "model$fit", the following error occurs:
Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer

Is this because I am trying to run a small dataset? How may I solve this?

Thank you in advance! :)


jwijffels · 2021-10-26T07:31:04Z

The code seq_len(nrow(x)) is only used when splitting the data into a train/test set in https://github.com/bnosac/ETM/blob/master/R/ETM.R#L398, namely at https://github.com/bnosac/ETM/blob/master/R/ETM.R#L475.
The error indicates your dtm argument has no data.

Did you check what you were passing to the function calls?
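
One way to act on this suggestion is to inspect both inputs just before calling model$fit(). The sketch below uses base R only; check_etm_inputs() is a hypothetical helper written for illustration (it is not part of topicmodels.etm), and the toy matrices stand in for the real dtm and embeddings.

```r
# Hypothetical helper (not part of topicmodels.etm): collect obvious problems
# with the inputs before they are passed to model$fit().
check_etm_inputs <- function(dtm, embeddings) {
  problems <- character(0)
  if (is.null(dim(dtm)) || nrow(dtm) == 0 || ncol(dtm) == 0) {
    problems <- c(problems, "dtm has no rows or no columns")
  }
  if (is.null(dim(embeddings)) || nrow(embeddings) == 0) {
    problems <- c(problems, "embeddings has no rows")
  }
  # Only compare vocabularies when both objects actually contain data
  if (length(problems) == 0 &&
      !identical(colnames(dtm), rownames(embeddings))) {
    problems <- c(problems, "colnames(dtm) and rownames(embeddings) differ")
  }
  if (anyNA(embeddings)) {
    problems <- c(problems, "embeddings contains NA values")
  }
  problems
}

# Toy data standing in for the real objects
emb <- matrix(seq(0.1, 0.6, by = 0.1), nrow = 3,
              dimnames = list(c("app", "easy", "pay"), NULL))
dtm_ok <- matrix(1:6, nrow = 2,
                 dimnames = list(c("d1", "d2"), c("app", "easy", "pay")))
dtm_empty <- dtm_ok[integer(0), , drop = FALSE]  # a dtm with zero documents

print(check_etm_inputs(dtm_ok, emb))     # nothing to report
print(check_etm_inputs(dtm_empty, emb))  # flags the empty dtm
```

If the returned vector is not empty, the messages show which object to look at before fitting.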

tinltan · 2021-10-26T09:01:37Z

> The code seq_len(nrow(x)) is only used when splitting the data into a train/test set in https://github.com/bnosac/ETM/blob/master/R/ETM.R#L398, namely at https://github.com/bnosac/ETM/blob/master/R/ETM.R#L475.
> The error indicates your dtm argument has no data.
>
> Did you check what you were passing to the function calls?

Thank you, I will work on your query and input above.

Earlier, though, I switched to a slightly larger dataset, which resulted in these dimensions:

dim(dtm)
[1] 190 31
dim(embeddings)
[1] 31 25

After running loss <- model$fit(data = dtm, optimizer = optimizer, epoch = 20, batch_size = 1000), the previous error no longer occurred, but I got this new error instead:

Error in Tensor_slice_put(tensor$ptr, environment(), value, mask = .d) :
rhs must be a torch_tensor or scalar value.

I will review the R code as well...

jwijffels · 2021-10-26T09:08:09Z

Check your input data for dtm and embeddings. Make sure there are no NA values in embeddings caused by a mismatch between the embedding matrix and the document-term matrix.
Think twice before applying this model to merely 190 text records, which is just not what this model is built for.
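
A rough sketch of how such a mismatch could be located, using base R only and toy matrices in place of the real objects. All names below (emb, dtm, keep, and so on) are illustrative assumptions, not objects created by the package.

```r
# Toy embedding matrix: the row for "cash" contains an NA
emb <- matrix(c(0.1, 0.2,
                NA,  0.4,
                0.5, 0.6),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("app", "cash", "pay"), NULL))

# Toy document-term matrix: "send" has no embedding at all
dtm <- matrix(c(1, 0, 2, 0,
                0, 1, 0, 0,
                0, 0, 0, 3),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("d1", "d2", "d3"),
                              c("app", "cash", "pay", "send")))

# Terms in the dtm that have no row in the embedding matrix
missing_terms <- setdiff(colnames(dtm), rownames(emb))

# Terms whose embedding row contains at least one NA
na_terms <- rownames(emb)[rowSums(is.na(emb)) > 0]

# Keep only terms present in both objects and free of NA values
keep      <- setdiff(intersect(colnames(dtm), rownames(emb)), na_terms)
emb_clean <- emb[keep, , drop = FALSE]
dtm_clean <- dtm[, keep, drop = FALSE]

# Drop documents that are left without any term after the filtering
dtm_clean <- dtm_clean[rowSums(dtm_clean) > 0, , drop = FALSE]

print(missing_terms)
print(na_terms)
print(dim(dtm_clean))
```

On a small corpus, filtering like this can remove most documents, which is worth checking with dim() before fitting.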

tinltan · 2021-10-26T12:45:43Z

> Check your input data for dtm and embeddings. Make sure there are no NA values in embeddings caused by a mismatch between the embedding matrix and the document-term matrix.
>
> Think twice before applying this model to merely 190 text records, which is just not what this model is built for.

I tried the algorithm on the 20 newsgroups dataset, and it worked smoothly! (Just had a ggrepel warning saying "10 unlabeled data points (too many overlaps). Consider increasing max.overlaps.")

I will look for a larger dataset than the one I'm using. I will also look into the NA values in embeddings for the previous dataset. These may be the sources of the original error I had been encountering.

I will also try the other suggested plots in pythonrepo. Thank you very much for your great help! 👍 👍 👍

jwijffels closed this as completed in cd03711 Nov 11, 2021
