Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer #10

Closed
tinltan opened this issue Oct 26, 2021 · 4 comments


tinltan commented Oct 26, 2021

Hello! When I first tried your ETM package in R using the Belgian parliament data, it worked. However, when I was testing it on my (small) data, after running the ETM() function and optimizer, I encountered this error:

Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer

Here is my code:

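library(torch)   # provides torch_manual_seed() and optim_adam() used below
library(dplyr)   # provides filter() as it is called below
library(topicmodels.etm)
library(doc2vec)
library(word2vec)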
gcash_data <- read.csv("GCash_200_Reviews_PlayStore_RepeatScroll20_Wait5s_TimeOut60s_AJAx.csv")
names(gcash_data) <- c("UserName", "Date", "Likes", "Review", "Rating")
gcash_r5 <- filter(gcash_data, Rating == "5")
head(gcash_r5)
str(gcash_r5)

x      <- data.frame(doc_id           = gcash_r5$UserName, 
                     text             = gcash_r5$Review, 
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)

w2v        <- word2vec(x = x$text, dim = 25, type = "skip-gram", iter = 10, min_count = 5, threads = 2)
embeddings <- as.matrix(w2v)
predict(w2v, newdata = c("app", "convenient"), type = "nearest", top_n = 4)

library(udpipe)
dtm   <- strsplit.data.frame(x, group = "doc_id", term = "text", split = " ")
dtm   <- document_term_frequencies(dtm)
dtm   <- document_term_matrix(dtm)
dtm   <- dtm_remove_tfidf(dtm, prob = 0.50)

vocab        <- intersect(rownames(embeddings), colnames(dtm))
embeddings   <- dtm_conform(embeddings, rows = vocab)
dtm          <- dtm_conform(dtm,     columns = vocab)
dim(dtm)
dim(embeddings)

set.seed(1234)
torch_manual_seed(4321)
model     <- ETM(k = 5, dim = 100, embeddings = embeddings)
optimizer <- optim_adam(params = model$parameters, lr = 0.005, weight_decay = 0.0000012)
loss      <- model$fit(data = dtm, optimizer = optimizer, epoch = 20, batch_size = 5)

As you can see in the code, I set k = 5 topics in ETM() and batch_size = 5 in model$fit().

After running the last line above with "model$fit", the following error occurs:
Error in seq_len(nrow(x)) : argument must be coercible to non-negative integer

Is this because I am trying to run a small dataset? How may I solve this?

Thank you in advance! :)

Contributor

jwijffels commented Oct 26, 2021

The call seq_len(nrow(x)) is only used when splitting the data into a train/test set in https://github.com/bnosac/ETM/blob/master/R/ETM.R#L398, namely at https://github.com/bnosac/ETM/blob/master/R/ETM.R#L475
The error indicates that your dtm argument has no data.

Did you check what you were passing to the function calls?
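
One way to do that check is sketched below in base R. This is only an illustration: check_etm_inputs is a hypothetical helper, not a function of topicmodels.etm, and the toy matrices stand in for the real dtm and embeddings. It assumes both objects carry term names (columns of dtm, rows of embeddings), as they do after dtm_conform in the code above.

```r
# Hypothetical helper, not part of topicmodels.etm: list obvious problems
# with the inputs before they reach ETM() and model$fit().
check_etm_inputs <- function(dtm, embeddings) {
  problems <- character(0)
  if (is.null(dim(dtm)) || nrow(dtm) == 0 || ncol(dtm) == 0) {
    problems <- c(problems, "dtm has no rows or no columns")
  }
  if (is.null(dim(embeddings)) || nrow(embeddings) == 0) {
    problems <- c(problems, "embeddings has no rows")
  }
  if (!identical(colnames(dtm), rownames(embeddings))) {
    problems <- c(problems, "dtm columns and embeddings rows differ")
  }
  problems
}

# Toy inputs with made-up values, standing in for the real objects.
terms     <- c("app", "easy", "fast")
emb       <- matrix((1:6) / 10, nrow = 3, dimnames = list(terms, NULL))
dtm_ok    <- matrix(1, nrow = 2, ncol = 3,
                    dimnames = list(c("doc1", "doc2"), terms))
dtm_empty <- dtm_ok[0, , drop = FALSE]  # a dtm in which no documents are left

print(check_etm_inputs(dtm_ok, emb))     # no problems expected
print(check_etm_inputs(dtm_empty, emb))  # flags the empty dtm
```

Running such a check right after the dim(dtm) and dim(embeddings) calls would show whether the filtering steps (min_count in word2vec, dtm_remove_tfidf, the vocabulary intersection) left anything to train on.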

Author

tinltan commented Oct 26, 2021

Thank you, I will work on your query and input above.

Earlier, though, I switched to a slightly larger dataset, which resulted in these dimensions:

dim(dtm)
[1] 190 31
dim(embeddings)
[1] 31 25

After running loss <- model$fit(data = dtm, optimizer = optimizer, epoch = 20, batch_size = 1000), the previous error did not appear, but I got this new error instead:

Error in Tensor_slice_put(tensor$ptr, environment(), value, mask = .d) :
rhs must be a torch_tensor or scalar value.

I will review the R code as well...

Contributor

jwijffels commented Oct 26, 2021

  • Check your input data for dtm and embeddings. Make sure there are no NA values in embeddings caused by a mismatch between the embedding matrix and the document-term matrix
  • Think twice before applying this model to merely 190 text records, which is just not what this model is built for
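
A minimal sketch of the NA check from the first point, in base R with made-up toy values standing in for the real objects (the term "cashin" and all numbers are assumptions for illustration). It finds terms whose embedding row contains NA, drops them from both matrices, and then drops documents that are left without any terms.

```r
# Toy stand-ins for embeddings and dtm; the term "cashin" has no usable embedding.
embeddings <- matrix(c(0.1, 0.2, NA,
                       0.4, 0.5, NA),
                     nrow = 3,
                     dimnames = list(c("app", "easy", "cashin"), NULL))
dtm <- matrix(c(1, 0,
                2, 1,
                0, 3),
              nrow = 2,
              dimnames = list(c("doc1", "doc2"), c("app", "easy", "cashin")))

# Terms whose embedding row contains at least one NA.
bad_terms <- rownames(embeddings)[rowSums(is.na(embeddings)) > 0]
print(bad_terms)

# Keep only terms with a complete embedding, in the same order in both objects.
keep       <- setdiff(rownames(embeddings), bad_terms)
embeddings <- embeddings[keep, , drop = FALSE]
dtm        <- dtm[, keep, drop = FALSE]

# Drop documents that no longer contain any term.
dtm <- dtm[rowSums(dtm) > 0, , drop = FALSE]

print(dim(dtm))
print(dim(embeddings))
```

The toy dtm here is a plain dense matrix. With the sparse matrix returned by document_term_matrix, Matrix::rowSums would likely be needed in place of the base rowSums call.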

Author

tinltan commented Oct 26, 2021

I tried the algorithm on the 20 newsgroups dataset, and it worked smoothly! (Just had a ggrepel warning saying "10 unlabeled data points (too many overlaps). Consider increasing max.overlaps.")

I will look for a larger dataset than the one I'm using. I will also look into the NA values in embeddings for the previous dataset. These may be the sources of the original error I had been encountering.

I will also try the other suggested plots in pythonrepo. Thank you very much for your great help! 👍 👍 👍
