5-gram crashes RStudio #111
What are you trying to do with the following lines?

```r
v_vectorizer <- vocab_vectorizer(vocab_parallel, grow_dtm = FALSE, skip_grams_window = 3L)
vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer)
```

In particular, I'm interested in the `grow_dtm = FALSE, skip_grams_window = 3L` arguments.
@dselivanov I got that from the documentation and thought it was necessary. All I want is to have 1-5 grams running in parallel, since it's a bit heavy when done serially. Does it make a difference if I remove those lines?
If you only need a document-term matrix, use:

```r
v_vectorizer <- vocab_vectorizer(vocab_parallel)
vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer)
```

Can you point out where the misleading docs are? I will definitely fix them... P.S. Not sure that 4-5 grams can be useful... At least I haven't seen any application/paper which used such long ngrams.
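For context, here is a minimal single-process sketch of that DTM-only pipeline, using only functions that appear in this thread (`itoken`, `create_vocabulary`, `vocab_vectorizer`, `create_dtm`) and the bundled `movie_review` data; argument names may differ between text2vec versions:

```r
library(text2vec)

# build a vocabulary (1-2 grams, purely as an illustration)
it <- itoken(movie_review$review, tolower, word_tokenizer)
vocab <- create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L))

# vectorize and create the document-term matrix;
# the iterator is re-created in case it was consumed while building the vocabulary
vectorizer <- vocab_vectorizer(vocab)
it <- itoken(movie_review$review, tolower, word_tokenizer)
dtm <- create_dtm(it, vectorizer)
dim(dtm)
```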
@dselivanov It's in the PDF files actually, not on the website. I think your package needs to support it; if it is crashing RStudio, there should be a memory leak somewhere, so look into it. About high-order n-grams, it's surprising that you haven't seen the section on higher-order n-grams in the famous Goodman paper: http://research.microsoft.com/en-us/um/redmond/groups/srg/papers/2001-joshuago-tr72.pdf He clearly talks about how adding higher n-gram orders brings more context to n-gram models and how it affects more sophisticated approaches like clustering and caching.
I don't want to go further in this discussion about ngram order - if you think 4-5 grams can be useful, go ahead. But I would say one can only benefit from higher-order ngrams with sufficient statistics (which means very large corpora, and the first paper clearly points that out). Regarding the crash: I can't reproduce it. Can you provide a minimal reproducible example? (I noticed you changed the gist.) Also, can you provide the output of
Yeah, well, you said clearly that you don't know of any papers on its usefulness:
and now you say you don't want to go further into the discussion. It's confusing! Regarding the gist changes: one change was when you said
@ambodi What is confusing here is your troll-like behaviour.
@dselivanov I'm sorry that you feel trolled, but it's not really trolling. When someone creates an issue saying that your library does not work with high-order n-grams and you dismiss their use, saying you haven't read any paper that makes them useful (which I guess is a weak argument), I guess that is also trolling. I just tried to show that they are extremely useful, and I guess you didn't like my comment because you take pride in what you know, and pride is not scientific, and after that you try to ignore what you said. If you are a scientist, accept the truth rather than yourself. Since you already created the 20-gram code and it's working just fine, can you tell me about the object size of
@ambodi If you continue going off topic I will block you. In particular, there is no need to get personal. I don't have the time or desire to teach you or argue with you. The only reason I'm still writing here is that I'm interested in finding the bug (if it really exists). You can try it with the `movie_review` data that ships with text2vec:

```r
library(text2vec)
library(SnowballC)
library(doParallel)
library(magrittr)  # for %>%, in case text2vec does not re-export the pipe

data <- movie_review$review

stem_tokenizer <- function(x, tokenizer = word_tokenizer) {
  x %>%
    tokenizer %>%
    lapply(wordStem, 'en')
}

N_WORKERS <- 4
registerDoParallel(N_WORKERS, cores = N_WORKERS)
splits <- split_into(data, N_WORKERS)
jobs <- lapply(splits, itoken, tolower, word_tokenizer)

stopwords <- c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours") %>%
  wordStem('en')

vocab_parallel <- create_vocabulary(jobs, ngram = c(ngram_min = 1L, ngram_max = 20L), stopwords = stopwords)
```
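A quick way to check how large the resulting vocabulary actually is (plain base R, nothing text2vec-specific), which is what the object-size question above was getting at:

```r
# report the in-memory size of the vocabulary object in human-readable units
print(object.size(vocab_parallel), units = "auto")
```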
And what version of text2vec are you using?
@dselivanov 0.3 from CRAN. There was a problem in the gist: it was overriding `data` by mistake. I reverted that line. 20-grams of my file would occupy way more than 2.7 GB. Can you run it again with the gist I revised now, without that line?
@dselivanov Yeah, your movie_review example works perfectly.
It will definitely go to swap memory (I have only 16 GB of RAM on my machine). As you can see, even on
Yeah, true! But can't you put in a check so that once it has reached the limit of the R process, it stops the computation with an error instead? BTW, if I want to solve this problem of creating 20-grams from 0.5 GB of plain text, what would you suggest as an approach? Running it on AWS, or what options do I have within R?
"reached the limit of r process" can be treated in different ways in different situations... So I leave it for user. My opinion - you are trying to solve your problem in a wrong way. What is the value of 20 ngrams if most of them (99.99...%) observed only once? Mb. it worth to only detect phrases based on PMI. |
This is good to know. @dselivanov I think setting a threshold for ngrams, as you suggest, is a nice idea. I decided to do that; however, I am not sure how to set the thresholds, really. Nice suggestions overall.
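One way to apply such thresholds is to prune the vocabulary before vectorizing. A sketch assuming text2vec's `prune_vocabulary` (argument names may vary across versions) and made-up cutoff values:

```r
library(text2vec)

# drop ngrams seen fewer than 5 times, and ngrams that occur in more than
# half of all documents (the exact cutoffs here are arbitrary)
vocab_pruned <- prune_vocabulary(vocab_parallel,
                                 term_count_min = 5L,
                                 doc_proportion_max = 0.5)

vectorizer <- vocab_vectorizer(vocab_pruned)
dtm <- create_dtm(jobs, vectorizer)
```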
I am trying to create 1-5 grams on 0.5 GB of data, but it crashes RStudio after trying for some time. I put the script here for reproducibility. I have 8 cores and run workers on 3 of them. I tried all cores, but it still does not work for some reason!
https://gist.github.com/ambodi/d8fc4fbd071c7235fa858d4146ec96c9