5-gram crashes RStudio #111

Closed
amir-rahnama opened this issue Jul 8, 2016 · 18 comments
@amir-rahnama

I am trying to create 1-5 grams on 0.5 GB of data, but it crashes RStudio after running for some time. I put the script here for reproducibility. I have 8 cores and run workers on 3 of them. I tried all cores, but that still does not work for some reason!

https://gist.github.com/ambodi/d8fc4fbd071c7235fa858d4146ec96c9

@dselivanov
Owner

What are you trying to do with the following lines?

v_vectorizer <- vocab_vectorizer(vocab_parallel, grow_dtm = FALSE, skip_grams_window = 3L)
vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer)

In particular, I'm interested in skip_grams_window = 3L combined with ngram = c(1, 5).
A vectorizer with skip_grams_window > 0 is needed for creating a term-cooccurrence matrix (create_tcm()), not a DTM...
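For example, such a vectorizer is meant for something like this (a minimal sketch, assuming the text2vec 0.3 API from your gist; the variable it below is a hypothetical single itoken iterator over the same corpus):

# skip_grams_window only matters for a co-occurrence matrix, e.g. as input for GloVe
tcm_vectorizer <- vocab_vectorizer(vocab_parallel, grow_dtm = FALSE, skip_grams_window = 3L)
tcm <- create_tcm(it, vectorizer = tcm_vectorizer)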
Anyway, I will take a look.

@amir-rahnama
Author

amir-rahnama commented Jul 8, 2016

@dselivanov I got that from the documentation and thought it was necessary. All I want is to build 1-5 grams in parallel, since it's a bit heavy when done serially. Does it make a difference if I remove those lines?

@dselivanov
Owner

If you only need document-term matrix, use:

v_vectorizer <- vocab_vectorizer(vocab_parallel)
vocab_dtm_parallel <- create_dtm(jobs, vectorizer = v_vectorizer)

Can you point out where the docs are misleading? I will definitely fix them...

P.S. I'm not sure 4-5 grams are useful... At least I haven't seen any application/paper that uses such long n-grams.

@amir-rahnama
Author

@dselivanov It's in the PDF files actually, not on the website.

I think your package needs to support this, because if it's crashing RStudio there is probably a memory leak somewhere, so please look into it.

As for higher-order n-grams, it's surprising that you haven't seen the section on higher-order n-grams in the well-known Goodman paper: http://research.microsoft.com/en-us/um/redmond/groups/srg/papers/2001-joshuago-tr72.pdf

He clearly discusses how higher n-gram orders bring more context to n-gram models and how they affect more sophisticated approaches like clustering and caching.

@amir-rahnama
Author

This is also from a classic NLP textbook, Speech and Language Processing by Daniel Jurafsky & James H. Martin, which shows the benefit of higher-order n-grams:

[screenshot: textbook excerpt on higher-order n-grams, 2016-07-09]

@dselivanov
Owner

I don't want to go further into this discussion about n-gram order; if you think 4-5 grams can be useful, go ahead. But I would say one can benefit from higher-order n-grams only with sufficient statistics (which means very large corpora, and the first paper clearly points that out).

Regarding the crash: I can't reproduce it. Can you provide a minimal reproducible example? (I noticed you changed the gist.) Also, can you provide the output of sessionInfo()?

@amir-rahnama
Author

amir-rahnama commented Jul 10, 2016

Yeah, well, you clearly said that you didn't know of any papers on their usefulness:

P.S. I'm not sure 4-5 grams are useful... At least I haven't seen any application/paper that uses such long n-grams.

and now you say you don't want to go further into the discussion. It's confusing!

Regarding the gist changes: one change was removing skip_grams_window = 3L after you said it isn't necessary. The rest was a mistake, and I reverted it back to the original.

@dselivanov
Owner

dselivanov commented Jul 10, 2016

@ambodi What is confusing here is your troll-like behaviour.
I'm closing this issue as non-constructive; the gist works just fine on my computer. If for some reason you need 4-5 or even 20 n-grams, use a machine with enough RAM.
Feel free to reopen the issue when you are able to provide a reproducible example and sessionInfo().

@dselivanov dselivanov self-assigned this Jul 10, 2016
@amir-rahnama
Author

amir-rahnama commented Jul 11, 2016

@dselivanov I'm sorry that you feel trolled, but it isn't really trolling. When someone opens an issue because your library isn't working with higher-order n-grams and you dismiss their use, saying you haven't seen any paper that shows they are useful (which I think is a weak argument), I'd say that is also trolling. I just tried to show that they are extremely useful, and I guess you didn't like my comment because you take pride in what you know, and pride is not scientific; after that you try to ignore what you said. If you are a scientist, accept the truth rather than defend yourself.

As for sessionInfo(): I can't access it because, as I said, RStudio crashes. So when you ask me for sessionInfo(), I have to say that I can't access the session anymore, so how can I get you a sessionInfo() result?

Since you have already created the 20-gram vocabulary and the code is working just fine for you, can you tell me the object size of vocab_parallel, or save it to an .RData file so I can load it?

@dselivanov
Owner

dselivanov commented Jul 11, 2016

@ambodi If you continue going off topic I will block you. In particular, there is no need to get personal. I don't have the time or desire to teach you or argue with you. The only reason I'm still writing here is that I'm interested in finding the bug (if it really exists).

You can get sessionInfo() from a fresh new session, so I can see your platform, R version, text2vec version, parallel backend, etc.

library(text2vec)
library(SnowballC)
library(doParallel)
data <- movie_review$review

stem_tokenizer <- function(x, tokenizer = word_tokenizer) {
  x %>% 
    tokenizer %>% 
    lapply(wordStem, 'en')
}

N_WORKERS <- 4
registerDoParallel(N_WORKERS, cores=N_WORKERS)


splits <- split_into(data, N_WORKERS)
jobs <- lapply(splits, itoken, tolower, word_tokenizer)

stopwords <- c("i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours") %>%
  wordStem('en')

vocab_parallel <- create_vocabulary(jobs, ngram = c(ngram_min = 1L, ngram_max = 20L), stopwords = stopwords)

vocab_parallel consists of 20,311,886 terms and occupies 2.7 GB of RAM. Each worker at its peak consumes about 2.2 GB of RAM.
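For reference, the vocabulary size can be checked with base R's object.size() (a minimal check):

print(object.size(vocab_parallel), units = "Gb")   # reports roughly 2.7 Gb here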

@amir-rahnama
Author

amir-rahnama commented Jul 11, 2016

> sessionInfo()
R version 3.2.4 (2016-03-10)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.5 (El Capitan)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringi_1.1.1   tm_0.6-2        NLP_0.1-9       tau_0.0-18      SnowballC_0.5.1

loaded via a namespace (and not attached):
[1] parallel_3.2.4 tools_3.2.4    slam_0.1-34  

@dselivanov
Owner

And what version of text2vec do you use? 0.3 from CRAN?
Does my code chunk work for you?

@amir-rahnama
Author

@dselivanov 0.3 from CRAN

There was a problem in the gist: it was overriding data by mistake. I reverted that line. 20-grams of my file would occupy way more than 2.7 GB. Can you run it again with the gist I just revised, without that line?

@amir-rahnama
Author

@dselivanov Yeah, your movie_review example works perfectly.

@dselivanov
Owner

It will definitely go into swap memory (I have only 16 GB of RAM on my machine). As you can see, even on the movie_review dataset the 1-20 n-gram vocabulary consumes ~2.7 GB just for storing the strings, and about 10 GB during the run with 4 workers. Try to imagine/estimate how many combinations of 20-grams there can be in a 0.5 GB text file.
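A rough back-of-envelope in R (a sketch; the ~6 bytes per token is an assumed average, not a measured figure):

avg_token_bytes   <- 6
n_tokens          <- 0.5e9 / avg_token_bytes      # ~83 million tokens in 0.5 GB of text
n_ngram_positions <- sum(n_tokens - (1:20) + 1)   # all 1- to 20-gram instances, ~1.7e9
n_ngram_positions                                 # almost every long n-gram is unique, so the vocabulary is of this order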

@amir-rahnama
Author

Yeah, true! But can't you add a check so that once it reaches the limit of the R process, it stops the computation with an error instead?

BTW, if I want to solve this problem of creating 20-grams from 0.5 GB of plain text, what approach would you suggest? Running it on AWS, or what options do I have within R?

@dselivanov
Owner

"reached the limit of r process" can be treated in different ways in different situations... So I leave it for user.
Another limitation of fork-based parallelism - R can collect up to 2gb result from workers - if vocabulary result from each chunk will be more than 2gb, process will silently die. Keep this in mind.

My opinion - you are trying to solve your problem in a wrong way. What is the value of 20 ngrams if most of them (99.99...%) observed only once? Mb. it worth to only detect phrases based on PMI.
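To illustrate what I mean by PMI-based phrase detection, here is a minimal base-R sketch (not a text2vec API; the regex tokenizer and both cutoffs are arbitrary choices, and data is the movie_review$review vector from the snippet above):

# score adjacent word pairs by pointwise mutual information and keep high-scoring ones as phrases
tokens  <- unlist(strsplit(tolower(data), "[^a-z']+"))   # crude tokenization; bigrams will cross document boundaries
tokens  <- tokens[nzchar(tokens)]
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

n        <- length(tokens)
p_word   <- table(tokens)  / n
p_bigram <- table(bigrams) / (n - 1)

w1  <- sub(" .*", "", names(p_bigram))
w2  <- sub(".* ", "", names(p_bigram))
pmi <- log2(as.numeric(p_bigram) / (as.numeric(p_word[w1]) * as.numeric(p_word[w2])))
names(pmi) <- names(p_bigram)

# keep pairs that are strongly associated and seen often enough; both cutoffs are arbitrary
phrases <- names(pmi)[pmi > 5 & as.numeric(p_bigram) * (n - 1) >= 10]
head(sort(pmi[phrases], decreasing = TRUE))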

@amir-rahnama
Author

Another limitation of fork-based parallelism: R can collect at most a 2 GB result from each worker. If the vocabulary result from a chunk is more than 2 GB, the process will silently die. Keep this in mind.

This is good to know.

@dselivanov I think setting a threshold for n-grams, as you say, is a nice idea. I've decided to do that, though I'm not really sure how to set the thresholds. Nice suggestions overall.
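For reference, one way to apply such a count threshold seems to be text2vec's prune_vocabulary(); a minimal sketch with arbitrary cutoffs (I'd check the argument names against the installed version):

# drop n-grams seen fewer than 10 times and terms appearing in more than half of the documents;
# both cutoffs are arbitrary and should be tuned for the corpus
vocab_pruned <- prune_vocabulary(vocab_parallel,
                                 term_count_min     = 10,
                                 doc_proportion_max = 0.5)
vectorizer <- vocab_vectorizer(vocab_pruned)
dtm        <- create_dtm(jobs, vectorizer = vectorizer)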
