
VocabConstructor.buildJointVocabulary's application of minWordFrequency is taking too much time. #2370

Closed
KonceptGeek opened this issue Nov 25, 2016 · 12 comments
Labels
Enhancement New features and other enhancements

Comments

KonceptGeek commented Nov 25, 2016

The removal of vocab entries below minWordFrequency is currently taking too much time. The vocab size before truncation was 45658973. The process ran for approximately 40 minutes, and then the following OOME occurred:

Exception in thread "ContainerBackgroundProcessor[StandardEngine[Tomcat]]" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at sun.misc.Resource.getBytes(Resource.java:117)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:462)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:94)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at ch.qos.logback.classic.spi.LoggingEvent.<init>(LoggingEvent.java:119)
        at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:419)
        at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
        at ch.qos.logback.classic.Logger.log(Logger.java:765)
        at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
        at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
        at java.util.logging.Logger.log(Logger.java:738)
        at java.util.logging.Logger.doLog(Logger.java:765)
        at java.util.logging.Logger.logp(Logger.java:1042)
        at org.apache.juli.logging.DirectJDKLog.log(DirectJDKLog.java:181)
        at org.apache.juli.logging.DirectJDKLog.error(DirectJDKLog.java:147)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1352)
        at java.lang.Thread.run(Thread.java:745)

The server is an AWS EC2 m4.4xlarge. The n-gram tokenizer was used with minN as 1 and maxN as 3. The following was the last debug log from VocabConstructor:

2016-11-25 05:50:14,922 [pool-5-thread-1] DEBUG o.d.m.w.w.VocabConstructor - Vocab size before truncation: [45658973],  NumWords: [152451150], sequences parsed: [58753], counter: [152451150]

The code was executed with the default Xmx value.

raver119 (Contributor) commented Nov 25, 2016 via email

@KonceptGeek (Author)

The vector size is 200. I'm now running it with a 15 GB Xmx value and monitoring the process. The corpus used for this iteration consists of 58753 documents of varying sizes.

raver119 (Contributor) commented Nov 25, 2016 via email

@KonceptGeek (Author)

The following was the Word2Vec config:

word2Vec = new Word2Vec.Builder()
                .minWordFrequency(5)
                .layerSize(200)
                .learningRate(0.025f)
                .minLearningRate(0.0001f)
                .windowSize(5)
                .sampling(0.001f)
                .seed(1)
                .workers(trainerThreads)
                .negativeSample(5)
                .iterations(5)
                .stopWords(new ArrayList<String>())
                .elementsLearningAlgorithm(new SkipGram<VocabWord>())
                .useHierarchicSoftmax(false)
                .modelUtils(new TreeModelUtils<>())
                .allowParallelTokenization(true)
                .tokenizerFactory(tokenizerFactory)
                .iterate(word2VecContentIterator)
                .build();

The word2VecContentIterator is an implementation of SentenceIterator, trainerThreads is set to 8, and the tokenizerFactory is:

TokenizerFactory tokenizerFactory = new NGramTokenizerFactory(new DefaultTokenizerFactory(), 1, 3);
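
For context on why the vocab can grow to ~45M entries with this setup: with minN=1 and maxN=3, a sentence of n words emits roughly 3n - 3 tokens (unigrams, bigrams, trigrams), so the candidate vocabulary inflates far faster than with plain tokenization. A minimal sketch using the same factory (imports follow the DL4J NLP package layout of that era; the sample sentence is just illustrative):

import java.util.List;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.NGramTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class NGramInflation {
    public static void main(String[] args) {
        // Same configuration as in the issue: unigrams through trigrams.
        TokenizerFactory factory =
                new NGramTokenizerFactory(new DefaultTokenizerFactory(), 1, 3);

        String sentence = "the quick brown fox jumps over the lazy dog";
        List<String> tokens = factory.create(sentence).getTokens();

        // 9 words -> roughly 9 unigrams + 8 bigrams + 7 trigrams = 24 candidates.
        System.out.println("tokens emitted: " + tokens.size());
    }
}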

raver119 (Contributor) commented Nov 25, 2016 via email

KonceptGeek (Author) commented Nov 25, 2016

Agreed, but shouldn't the vector creation happen after the vocab truncation is done? The OOME occurred while the truncation based on minWordFrequency was still in progress; the following log line was never reached:
log.debug("Vocab size after truncation: [" + tempHolder.numWords() + "], NumWords: [" + tempHolder.totalWordOccurrences()+ "], sequences parsed: [" + sequences+ "], counter: ["+counter+"]");

raver119 (Contributor) commented Nov 25, 2016

Sure, syn0/syn1Neg are created after the vocab is built. I'm just saying that syn0/syn1Neg on their own will be 72 GB for floats, or 144 GB for doubles.

And during vocab creation, you go OOM on the JVM side due to the number of strings held for the vocab.
The formula for string-only memory consumption is: bytes = 8 * (int)(((chars * 2) + 45) / 8).
So, if your average word is 18 symbols, the in-memory footprint will be 80 bytes for the average string alone, plus additional pointers to structures (like Huffman tree info placeholders) etc. That can easily add up to your 15 GB. Even the raw calculation for 45M strings at an average of 18 symbols comes to 3.6 GB for the strings alone, and if the average word length is higher, memory consumption will obviously be higher.
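
For concreteness, a quick back-of-the-envelope check of the numbers above, using the vocab size from the debug log and the string formula just given (plain Java, no DL4J needed):

public class MemoryEstimate {
    public static void main(String[] args) {
        long vocabSize = 45_658_973L; // vocab size from the debug log above
        int layerSize = 200;          // vector size from the config

        // syn0 + syn1Neg: two weight tables of vocabSize x layerSize each.
        long synFloat = 2L * vocabSize * layerSize * 4;  // ~73 GB (~72 GB with vocab rounded to 45M)
        long synDouble = 2L * vocabSize * layerSize * 8; // ~146 GB

        // Per-String footprint per the formula above: 8 * (int)(((chars * 2) + 45) / 8)
        int avgChars = 18;                                 // assumed average word length
        long perString = 8L * (((avgChars * 2) + 45) / 8); // = 80 bytes
        long stringTotal = vocabSize * perString;          // ~3.6 GB for the strings alone

        System.out.printf("syn tables (float):  %.1f GB%n", synFloat / 1e9);
        System.out.printf("syn tables (double): %.1f GB%n", synDouble / 1e9);
        System.out.printf("vocab strings:       %.1f GB%n", stringTotal / 1e9);
    }
}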

@raver119 raver119 added the Enhancement New features and other enhancements label Nov 25, 2016
@raver119 raver119 self-assigned this Nov 25, 2016
@raver119 (Contributor)

I'm not saying that everything is ideal in vocab construction; I'm just saying that you probably need a higher Xmx for now. But sure, I'll review the vocab creation process on huge corpora.

@KonceptGeek (Author)

Indeed, a higher Xmx did the trick. When it comes to huge corpora, iterating over the data to build the vocab can be pretty expensive. One idea could be an (optional) checkpoint that saves the built data structure as soon as the iteration over the data is complete. That way, if an OOME occurs later due to lack of memory, the time spent iterating over the whole dataset would not be lost.
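
A rough sketch of that checkpoint idea, assuming the writeVocabCache/readVocabCache helpers on WordVectorSerializer (present in later DL4J releases; exact signatures may differ in 0.7.x):

import java.io.File;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.wordstore.VocabCache;

public class VocabCheckpoint {
    // Persist the vocab once the corpus pass is done, so an OOME during
    // truncation or syn0 allocation doesn't force a full re-iteration.
    static void save(VocabCache<VocabWord> vocab, File checkpoint) throws Exception {
        WordVectorSerializer.writeVocabCache(vocab, checkpoint);
    }

    static VocabCache<VocabWord> load(File checkpoint) throws Exception {
        return WordVectorSerializer.readVocabCache(checkpoint);
    }
}

The restored cache could then be handed back to the Word2Vec.Builder (via its vocabCache setter, assuming your version exposes one), skipping the expensive corpus pass on retry.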

@raver119 (Contributor)

Periodic truncation for huge datasets is already available; I added it a year ago, but unfortunately it assumes manual use.
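
For anyone hitting this later, the manual route looks roughly like the sketch below. enableScavenger is presumably the periodic-truncation switch being referred to; the method names follow the VocabConstructor builder API as remembered, so verify against your DL4J version:

import org.deeplearning4j.models.sequencevectors.interfaces.SequenceIterator;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.wordstore.VocabConstructor;
import org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache;

public class ManualVocabBuild {
    static AbstractCache<VocabWord> build(SequenceIterator<VocabWord> iterator) {
        AbstractCache<VocabWord> cache = new AbstractCache.Builder<VocabWord>().build();

        VocabConstructor<VocabWord> constructor = new VocabConstructor.Builder<VocabWord>()
                .addSource(iterator, 5)     // minWordFrequency, as in the config above
                .setTargetVocabCache(cache)
                .enableScavenger(true)      // assumed name: periodically evicts rare entries
                .build();

        // Same entry point this issue is about.
        constructor.buildJointVocabulary(false, true);
        return cache;
    }
}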

@raver119 (Contributor)

The issue was fixed long ago.
