5-gram crashes RStudio #111

Closed
amir-rahnama opened this issue Jul 8, 2016 · 18 comments
@amir-rahnama

I am trying to create 1-5 grams on 0.5 GB of data, but it crashes RStudio after running for some time. I put the script here for reproducibility. I have 8 cores and run workers on 3 of them. I tried all cores, but that still does not work for some reason!

https://gist.github.com/ambodi/d8fc4fbd071c7235fa858d4146ec96c9

@dselivanov
Owner

What are you trying to do with the following lines?

v_vectorizer <- vocab_vectorizer(vocab_parallel, grow_dtm = FALSE, skip_grams_window = 3L)
vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer)

In particular, I'm interested in skip_grams_window = 3L combined with ngram = c(1, 5).
A vectorizer with skip_grams_window > 0 is needed for creating a term-cooccurrence matrix (create_tcm()), not a DTM...
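For example, such a vectorizer is meant for something like this (a minimal sketch, assuming the text2vec 0.3 API from your gist; the variable it below is a hypothetical single itoken iterator over the same corpus):

# skip_grams_window only matters for a co-occurrence matrix, e.g. as input for GloVe
tcm_vectorizer <- vocab_vectorizer(vocab_parallel, grow_dtm = FALSE, skip_grams_window = 3L)
tcm <- create_tcm(it, vectorizer = tcm_vectorizer)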
Anyway, I will take a look.

@amir-rahnama
Author

amir-rahnama commented Jul 8, 2016

@dselivanov I got that from the documentation and thought it was necessary. All I want is to build 1-5 grams in parallel, since it's a bit heavy when done serially. Does it make a difference if I remove those lines?

@dselivanov
Owner

If you only need document-term matrix, use:

v_vectorizer <- vocab_vectorizer(vocab_parallel)
vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer)

Can you point out where the docs are misleading? I will definitely fix them...

P.S. I'm not sure 4-5 grams are useful... At least I haven't seen any application/paper that uses such long n-grams.

@amir-rahnama
Author

@dselivanov It's in the PDF files actually, not on the website.

I think your package needs to support this, because if it's crashing RStudio there is probably a memory leak somewhere, so please look into it.

As for higher-order n-grams, it's surprising that you haven't seen the section on higher-order n-grams in the well-known Goodman paper: http://research.microsoft.com/en-us/um/redmond/groups/srg/papers/2001-joshuago-tr72.pdf

He clearly discusses how higher n-gram orders bring more context to n-gram models and how they affect more sophisticated approaches like clustering and caching.

@amir-rahnama
Author

This is also from a classic NLP textbook, Speech and Language Processing by Daniel Jurafsky & James H. Martin, which shows the benefit of higher-order n-grams:

[screenshot: textbook excerpt on higher-order n-grams, 2016-07-09]

@dselivanov
Owner

I don't want to go further into this discussion about n-gram order; if you think 4-5 grams can be useful, go ahead. But I would say one can benefit from higher-order n-grams only with sufficient statistics (which means very large corpora, and the first paper clearly points that out).

Regarding the crash: I can't reproduce it. Can you provide a minimal reproducible example? (I noticed you changed the gist.) Also, can you provide the output of sessionInfo()?

@amir-rahnama
Author

amir-rahnama commented Jul 10, 2016

Yeah, well, you clearly said that you didn't know of any papers on their usefulness:

P.S. I'm not sure 4-5 grams are useful... At least I haven't seen any application/paper that uses such long n-grams.

and now you say you don't want to go further into the discussion. It's confusing!

Regarding the gist changes: one change was removing skip_grams_window = 3L after you said it isn't necessary. The rest was a mistake, and I reverted it back to the original.

@dselivanov
Owner

dselivanov commented Jul 10, 2016

@ambodi What is confusing here is your troll-like behaviour.
I'm closing this issue as non-constructive; the gist works just fine on my computer. If for some reason you need 4-5 or even 20 n-grams, use a machine with enough RAM.
Feel free to reopen the issue when you are able to provide a reproducible example and sessionInfo().

@dselivanov dselivanov self-assigned this Jul 10, 2016
@amir-rahnama
Author

amir-rahnama commented Jul 11, 2016

@dselivanov I'm sorry that you feel trolled, but it isn't really trolling. When someone opens an issue because your library isn't working with higher-order n-grams and you dismiss their use, saying you haven't seen any paper that shows they are useful (which I think is a weak argument), I'd say that is also trolling. I just tried to show that they are extremely useful, and I guess you didn't like my comment because you take pride in what you know, and pride is not scientific; after that you try to ignore what you said. If you are a scientist, accept the truth rather than defend yourself.

As for sessionInfo(): I can't access it because, as I said, RStudio crashes. So when you ask me for sessionInfo(), I have to say that I can't access the session anymore, so how can I get you a sessionInfo() result?

Since you have already created the 20-gram vocabulary and the code is working just fine for you, can you tell me the object size of vocab_parallel, or save it to an .RData file so I can load it?

@dselivanov
Owner

dselivanov commented Jul 11, 2016

@ambodi If you continue going off topic I will block you. In particular, there is no need to get personal. I don't have the time or desire to teach you or argue with you. The only reason I'm still writing here is that I'm interested in finding the bug (if it really exists).

You can get sessionInfo() from a fresh new session, so I can see your platform, R version, text2vec version, parallel backend, etc.

library(text2vec)
library(SnowballC)
library(doParallel)
data <- movie_review$review

stem_tokenizer <- function(x, tokenizer = word_tokenizer) {
  x %>% 
    tokenizer %>% 
    lapply(wordStem, 'en')
}

N_WORKERS <- 4
registerDoParallel(N_WORKERS, cores=N_WORKERS)


splits <- split_into(data, N_WORKERS)
jobs <- lapply(splits, itoken, tolower, word_tokenizer)

stopwords <- c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours") %>%
  wordStem('en')

vocab_parallel <- create_vocabulary(jobs, ngram = c(ngram_min = 1L, ngram_max = 20L), stopwords = stopwords)

vocab_parallel consists of 20,311,886 terms and occupies 2.7 GB of RAM. Each worker at its peak consumes about 2.2 GB of RAM.
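For reference, the vocabulary size can be checked with base R's object.size() (a minimal check):

print(object.size(vocab_parallel), units = "Gb")   # reports roughly 2.7 Gb here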

@amir-rahnama
Author

amir-rahnama commented Jul 11, 2016

> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringi_1.1.1   tm_0.6-2        NLP_0.1-9       tau_0.0-18      SnowballC_0.5.1

loaded via a namespace (and not attached):
[1] parallel_3.2.4 tools_3.2.4    slam_0.1-34  

@dselivanov
Owner

And what version of text2vec do you use? 0.3 from CRAN?
Does my code chunk work for you?

@amir-rahnama
Author

@dselivanov 0.3 from CRAN

There was a problem in the gist: it was overriding data by mistake. I reverted that line. 20-grams of my file would occupy way more than 2.7 GB. Can you run it again with the gist I just revised, without that line?

@amir-rahnama
Author

@dselivanov Yeah, your movie_review example works perfectly.

@dselivanov
Owner

It will definitely go into swap memory (I have only 16 GB of RAM on my machine). As you can see, even on the movie_review dataset the 1-20 n-gram vocabulary consumes ~2.7 GB just for storing the strings, and about 10 GB during the run with 4 workers. Try to imagine/estimate how many combinations of 20-grams there can be in a 0.5 GB text file.
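A rough back-of-envelope in R (a sketch; the ~6 bytes per token is an assumed average, not a measured figure):

avg_token_bytes   <- 6
n_tokens          <- 0.5e9 / avg_token_bytes      # ~83 million tokens in 0.5 GB of text
n_ngram_positions <- sum(n_tokens - (1:20) + 1)   # all 1- to 20-gram instances, ~1.7e9
n_ngram_positions                                 # almost every long n-gram is unique, so the vocabulary is of this order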

@amir-rahnama
Author

Yeah, true! But can't you add a check so that once it reaches the limit of the R process, it stops the computation with an error instead?

BTW, if I want to solve this problem of creating 20-grams from 0.5 GB of plain text, what approach would you suggest? Running it on AWS, or what options do I have within R?

@dselivanov
Owner

"reached the limit of r process" can be treated in different ways in different situations... So I leave it for user.
Another limitation of fork-based parallelism - R can collect up to 2gb result from workers - if vocabulary result from each chunk will be more than 2gb, process will silently die. Keep this in mind.

My opinion - you are trying to solve your problem in a wrong way. What is the value of 20 ngrams if most of them (99.99...%) observed only once? Mb. it worth to only detect phrases based on PMI.
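To illustrate what I mean by PMI-based phrase detection, here is a minimal base-R sketch (not a text2vec API; the regex tokenizer and both cutoffs are arbitrary choices, and data is the movie_review$review vector from the snippet above):

# score adjacent word pairs by pointwise mutual information and keep high-scoring ones as phrases
tokens  <- unlist(strsplit(tolower(data), "[^a-z']+"))   # crude tokenization; bigrams will cross document boundaries
tokens  <- tokens[nzchar(tokens)]
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

n        <- length(tokens)
p_word   <- table(tokens)  / n
p_bigram <- table(bigrams) / (n - 1)

w1  <- sub(" .*", "", names(p_bigram))
w2  <- sub(".* ", "", names(p_bigram))
pmi <- log2(as.numeric(p_bigram) / (as.numeric(p_word[w1]) * as.numeric(p_word[w2])))
names(pmi) <- names(p_bigram)

# keep pairs that are strongly associated and seen often enough; both cutoffs are arbitrary
phrases <- names(pmi)[pmi > 5 & as.numeric(p_bigram) * (n - 1) >= 10]
head(sort(pmi[phrases], decreasing = TRUE))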

@amir-rahnama
Author

Another limitation of fork-based parallelism: R can collect at most a 2 GB result from each worker. If the vocabulary result from a chunk is more than 2 GB, the process will silently die. Keep this in mind.

This is good to know.

@dselivanov I think setting a threshold for n-grams, as you say, is a nice idea. I've decided to do that, though I'm not really sure how to set the thresholds. Nice suggestions overall.
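For reference, one way to apply such a count threshold seems to be text2vec's prune_vocabulary(); a minimal sketch with arbitrary cutoffs (I'd check the argument names against the installed version):

# drop n-grams seen fewer than 10 times and terms appearing in more than half of the documents;
# both cutoffs are arbitrary and should be tuned for the corpus
vocab_pruned <- prune_vocabulary(vocab_parallel,
                                 term_count_min     = 10,
                                 doc_proportion_max = 0.5)
vectorizer <- vocab_vectorizer(vocab_pruned)
dtm        <- create_dtm(jobs, vectorizer = vectorizer)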
