VocabConstructor.buildJointVocabulary's application of minWordFrequency is taking too much time #2370
Comments
What's your corpus, and what's your vector size? Right now your issue is just JVM GC activity caused by insufficient RAM.
On Nov 25, 2016 at 9:47, "Jasneet Sabharwal" <notifications@github.com> wrote:
… The removal of vocab entries based on minWordFrequency is currently taking too much time. The vocab size before truncation was 45658973. The process ran for approximately 40 minutes and then the following OOME occurred:
```
Exception in thread "ContainerBackgroundProcessor[StandardEngine[Tomcat]]" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3236)
at sun.misc.Resource.getBytes(Resource.java:117)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:462)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:94)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at ch.qos.logback.classic.spi.LoggingEvent.<init>(LoggingEvent.java:119)
at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:419)
at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
at ch.qos.logback.classic.Logger.log(Logger.java:765)
at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
at java.util.logging.Logger.log(Logger.java:738)
at java.util.logging.Logger.doLog(Logger.java:765)
at java.util.logging.Logger.logp(Logger.java:1042)
at org.apache.juli.logging.DirectJDKLog.log(DirectJDKLog.java:181)
at org.apache.juli.logging.DirectJDKLog.error(DirectJDKLog.java:147)
at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1352)
at java.lang.Thread.run(Thread.java:745)
```
The server is an AWS EC2 m4.4xlarge. The Ngram tokenizer was used with minN as 1 and maxN as 3. The following was the last debug log from VocabConstructor:
```
2016-11-25 05:50:14,922 [pool-5-thread-1] DEBUG o.d.m.w.w.VocabConstructor - Vocab size before truncation: [45658973], NumWords: [152451150], sequences parsed: [58753], counter: [152451150]
```
The vector size is 200. I'm now running it with a 15 GB Xmx value and monitoring the process. The corpus used for this iteration consists of 58753 documents of varying sizes.
Right, please add a gist with your full w2v config.
The following was the w2v config:
Just to be clear, 45m words in vocab means the following:
- 45m x 24 bytes for each String object
- 45m x X average characters per string
- some more bytes for the VocabWord structure

If your average vocab entry is an ngram of 3 words, that's about 18 characters at 2 bytes each in the JVM's in-memory string representation. That's 36 more bytes at the very least, but probably way more.
Then, for training:
45m x 200 x 4 x 2 :)
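Plugging the thread's numbers into that arithmetic shows the scale of the problem. A minimal sketch; the per-entry byte counts are the rough estimates from the comment above, not measured values:

```java
public class VocabMemoryEstimate {
    public static void main(String[] args) {
        long vocabSize = 45_658_973L; // pre-truncation vocab size from the debug log
        int vectorSize = 200;         // vector size reported in the thread

        // Vocab strings on the JVM heap: ~24 bytes of String overhead plus
        // ~2 bytes per character, assuming ~18 characters for a 3-word ngram.
        long stringBytes = vocabSize * (24 + 18 * 2);

        // Training tables: syn0 and syn1Neg, one float (4 bytes) per
        // dimension per vocab entry; the "45m x 200 x 4 x 2" estimate above.
        long trainingBytes = vocabSize * vectorSize * 4L * 2L;

        System.out.printf("vocab strings:  ~%.1f GB%n", stringBytes / 1e9);
        System.out.printf("syn0 + syn1Neg: ~%.1f GB%n", trainingBytes / 1e9);
        // Prints roughly 2.7 GB for the strings alone (before any other
        // per-word structures) and about 73 GB for float weight tables,
        // consistent with the 72 GB / 144 GB figures later in the thread.
    }
}
```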
Agreed, but shouldn't the vector creation happen after the vocab truncation is done? The GC overhead error happened while the truncation based on minWordFrequency was in progress. The following log line was never executed:
Sure, syn0/syn1Neg is created after the vocab is built. I'm just saying that syn0/syn1Neg on their own will be 72 GB for floats, or 144 GB for doubles. And during vocab creation you go OOM on JVM heap due to the sheer number of strings stored for the vocab.
I'm not saying that everything is ideal in vocab construction; I'm just saying that you probably need a higher Xmx for now. But sure, I'll review the vocab creation process on huge corpora.
Indeed, a higher Xmx did the trick. When it comes to huge corpora, iterating over the data to build the vocab can be pretty expensive. One idea could be to checkpoint the built data structure as soon as the iteration over the data is complete (this could be optional). That way, if an OOM occurs due to lack of memory, it would save the time of iterating over the whole dataset again.
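A minimal sketch of that checkpoint idea, using a plain word-count map rather than dl4j's actual vocab classes; the CountCheckpoint class and its file format are purely hypothetical illustrations, not existing API:

```java
import java.io.*;
import java.util.*;

// Hypothetical checkpoint helper: persists raw word counts right after the
// corpus pass, so a crash during truncation doesn't force a full re-scan.
public class CountCheckpoint {

    public static void save(Map<String, Long> counts, File file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(counts.size());
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                out.writeUTF(e.getKey());   // ngram text
                out.writeLong(e.getValue()); // raw frequency
            }
        }
    }

    public static Map<String, Long> load(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int size = in.readInt();
            Map<String, Long> counts = new HashMap<>(size * 2);
            for (int i = 0; i < size; i++) {
                counts.put(in.readUTF(), in.readLong());
            }
            return counts;
        }
    }
}
```

If the process dies during truncation, the counts can be reloaded and the minWordFrequency filter re-applied without re-reading the whole corpus.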
Periodic truncation for huge datasets is already available; I added it a year ago, but unfortunately it assumes manual use.
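For readers looking for that mechanism, the sketch below shows roughly how it would be wired into vocab construction. The package path matches the o.d.m.w.w.VocabConstructor logger name in the debug log above, but the builder method names (enableScavenger in particular) are assumptions about the periodic-truncation API being referenced, not verified signatures:

```java
import org.deeplearning4j.models.sequencevectors.interfaces.SequenceIterator;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.wordstore.VocabCache;
import org.deeplearning4j.models.word2vec.wordstore.VocabConstructor;
import org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache;

public class VocabWithScavenger {
    // Sketch only: enableScavenger(...) and the other builder calls are
    // assumptions about the manual periodic-truncation feature above.
    static VocabCache<VocabWord> build(SequenceIterator<VocabWord> iterator) {
        VocabCache<VocabWord> cache = new AbstractCache.Builder<VocabWord>().build();
        VocabConstructor<VocabWord> constructor = new VocabConstructor.Builder<VocabWord>()
                .addSource(iterator, 5)      // minWordFrequency = 5 (example value)
                .setTargetVocabCache(cache)
                .enableScavenger(true)       // periodically drop low-frequency entries
                .build();
        constructor.buildJointVocabulary(false, true);
        return cache;
    }
}
```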
The issue was fixed long ago.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
The removal of vocab entries based on minWordFrequency is currently taking too much time. The vocab size before truncation was 45658973. The process ran for approximately 40 minutes and then the OOME quoted above occurred. The server is an AWS EC2 m4.4xlarge. The Ngram tokenizer was used with minN as 1 and maxN as 3. The last debug log from VocabConstructor is quoted above. The code was executed with the default Xmx value.
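For context, the tokenizer setup described above multiplies the vocabulary: every unigram, bigram, and trigram becomes a separate vocab candidate. A sketch of that configuration follows, assuming dl4j's NGramTokenizerFactory wrapping the default tokenizer; treat the exact class and constructor arguments as an assumption rather than confirmed API:

```java
import java.util.List;

import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.NGramTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class NgramSetup {
    public static void main(String[] args) {
        // minN = 1, maxN = 3, as in the report above
        TokenizerFactory tf =
                new NGramTokenizerFactory(new DefaultTokenizerFactory(), 1, 3);

        List<String> tokens = tf.create("the quick brown fox").getTokens();
        // Every 1-, 2-, and 3-gram is emitted as a separate token, so the
        // number of distinct vocab entries grows far faster than the number
        // of distinct words, hence ~45M entries before truncation.
        System.out.println(tokens);
    }
}
```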