
VocabConstructor.buildJointVocabulary's application of minWordFrequency is taking too much time. #2370

Closed
KonceptGeek opened this issue Nov 25, 2016 · 12 comments
Labels
Enhancement New features and other enhancements

Comments

KonceptGeek commented Nov 25, 2016

The removal of vocab entries below minWordFrequency is currently taking too much time. The vocab size before truncation was 45658973. The process ran for approximately 40 minutes, and then the following OOME occurred:

Exception in thread "ContainerBackgroundProcessor[StandardEngine[Tomcat]]" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at sun.misc.Resource.getBytes(Resource.java:117)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:462)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:94)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at ch.qos.logback.classic.spi.LoggingEvent.<init>(LoggingEvent.java:119)
        at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:419)
        at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
        at ch.qos.logback.classic.Logger.log(Logger.java:765)
        at org.slf4j.bridge.SLF4JBridgeHandler.callLocationAwareLogger(SLF4JBridgeHandler.java:221)
        at org.slf4j.bridge.SLF4JBridgeHandler.publish(SLF4JBridgeHandler.java:303)
        at java.util.logging.Logger.log(Logger.java:738)
        at java.util.logging.Logger.doLog(Logger.java:765)
        at java.util.logging.Logger.logp(Logger.java:1042)
        at org.apache.juli.logging.DirectJDKLog.log(DirectJDKLog.java:181)
        at org.apache.juli.logging.DirectJDKLog.error(DirectJDKLog.java:147)
        at org.apache.catalina.core.ContainerBase$ContainerBackgroundProcessor.run(ContainerBase.java:1352)
        at java.lang.Thread.run(Thread.java:745)

The server is an AWS EC2 m4.4xlarge. The n-gram tokenizer was used with minN as 1 and maxN as 3. The following was the last debug log from VocabConstructor:

2016-11-25 05:50:14,922 [pool-5-thread-1] DEBUG o.d.m.w.w.VocabConstructor - Vocab size before truncation: [45658973],  NumWords: [152451150], sequences parsed: [58753], counter: [152451150]

The code was executed with the default Xmx value.

raver119 (Contributor) commented Nov 25, 2016 via email

@KonceptGeek (Author)

The vector size is 200. I'm now running it with a 15 GB Xmx value and monitoring the process. The corpus used for this iteration consists of 58753 documents of varying sizes.

raver119 (Contributor) commented Nov 25, 2016 via email

@KonceptGeek (Author)

The following was the Word2Vec config:

word2Vec = new Word2Vec.Builder()
                .minWordFrequency(5)
                .layerSize(200)
                .learningRate(0.025f)
                .minLearningRate(0.0001f)
                .windowSize(5)
                .sampling(0.001f)
                .seed(1)
                .workers(trainerThreads)
                .negativeSample(5)
                .iterations(5)
                .stopWords(new ArrayList<String>())
                .elementsLearningAlgorithm(new SkipGram<VocabWord>())
                .useHierarchicSoftmax(false)
                .modelUtils(new TreeModelUtils<>())
                .allowParallelTokenization(true)
                .tokenizerFactory(tokenizerFactory)
                .iterate(word2VecContentIterator)
                .build();

The word2VecContentIterator is an implementation of SentenceIterator, trainerThreads is set to 8, and the tokenizerFactory is:

TokenizerFactory tokenizerFactory = new NGramTokenizerFactory(new DefaultTokenizerFactory(), 1, 3);
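
For context on why the vocab can grow to ~45M entries with this setup: with minN=1 and maxN=3, a sentence of n words emits roughly 3n - 3 tokens (unigrams, bigrams, trigrams), so the candidate vocabulary inflates far faster than with plain tokenization. A minimal sketch using the same factory (imports follow the DL4J NLP package layout of that era; the sample sentence is just illustrative):

import java.util.List;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.NGramTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class NGramInflation {
    public static void main(String[] args) {
        // Same configuration as in the issue: unigrams through trigrams.
        TokenizerFactory factory =
                new NGramTokenizerFactory(new DefaultTokenizerFactory(), 1, 3);

        String sentence = "the quick brown fox jumps over the lazy dog";
        List<String> tokens = factory.create(sentence).getTokens();

        // 9 words -> roughly 9 unigrams + 8 bigrams + 7 trigrams = 24 candidates.
        System.out.println("tokens emitted: " + tokens.size());
    }
}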

raver119 (Contributor) commented Nov 25, 2016 via email

KonceptGeek (Author) commented Nov 25, 2016

Agreed, but shouldn't the vector creation happen after the vocab truncation is done? The OOME occurred while the truncation based on minWordFrequency was still in progress; the following log line was never reached:
log.debug("Vocab size after truncation: [" + tempHolder.numWords() + "], NumWords: [" + tempHolder.totalWordOccurrences()+ "], sequences parsed: [" + sequences+ "], counter: ["+counter+"]");

raver119 (Contributor) commented Nov 25, 2016

Sure, syn0/syn1Neg are created after the vocab is built. I'm just saying that syn0/syn1Neg on their own will be 72 GB for floats, or 144 GB for doubles.

And during vocab creation, you go OOM on the JVM side due to the number of strings held for the vocab.
The formula for string-only memory consumption is: bytes = 8 * (int)(((chars * 2) + 45) / 8).
So, if your average word is 18 symbols, the in-memory footprint will be 80 bytes for the average string alone, plus additional pointers to structures (like Huffman tree info placeholders) etc. That can easily add up to your 15 GB. Even the raw calculation for 45M strings at an average of 18 symbols comes to 3.6 GB for the strings alone, and if the average word length is higher, memory consumption will obviously be higher.
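
For concreteness, a quick back-of-the-envelope check of the numbers above, using the vocab size from the debug log and the string formula just given (plain Java, no DL4J needed):

public class MemoryEstimate {
    public static void main(String[] args) {
        long vocabSize = 45_658_973L; // vocab size from the debug log above
        int layerSize = 200;          // vector size from the config

        // syn0 + syn1Neg: two weight tables of vocabSize x layerSize each.
        long synFloat = 2L * vocabSize * layerSize * 4;  // ~73 GB (~72 GB with vocab rounded to 45M)
        long synDouble = 2L * vocabSize * layerSize * 8; // ~146 GB

        // Per-String footprint per the formula above: 8 * (int)(((chars * 2) + 45) / 8)
        int avgChars = 18;                                 // assumed average word length
        long perString = 8L * (((avgChars * 2) + 45) / 8); // = 80 bytes
        long stringTotal = vocabSize * perString;          // ~3.6 GB for the strings alone

        System.out.printf("syn tables (float):  %.1f GB%n", synFloat / 1e9);
        System.out.printf("syn tables (double): %.1f GB%n", synDouble / 1e9);
        System.out.printf("vocab strings:       %.1f GB%n", stringTotal / 1e9);
    }
}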

@raver119 raver119 added the Enhancement New features and other enhancements label Nov 25, 2016
@raver119 raver119 self-assigned this Nov 25, 2016
@raver119 (Contributor)

I'm not saying that everything is ideal in vocab construction; I'm just saying that you probably need a higher Xmx for now. But sure, I'll review the vocab creation process on huge corpora.

@KonceptGeek (Author)

Indeed, a higher Xmx did the trick. When it comes to huge corpora, iterating over the data to build the vocab can be pretty expensive. One idea could be an (optional) checkpoint that saves the built data structure as soon as the iteration over the data is complete. That way, if an OOME occurs later due to lack of memory, the time spent iterating over the whole dataset would not be lost.
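
A rough sketch of that checkpoint idea, assuming the writeVocabCache/readVocabCache helpers on WordVectorSerializer (present in later DL4J releases; exact signatures may differ in 0.7.x):

import java.io.File;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.wordstore.VocabCache;

public class VocabCheckpoint {
    // Persist the vocab once the corpus pass is done, so an OOME during
    // truncation or syn0 allocation doesn't force a full re-iteration.
    static void save(VocabCache<VocabWord> vocab, File checkpoint) throws Exception {
        WordVectorSerializer.writeVocabCache(vocab, checkpoint);
    }

    static VocabCache<VocabWord> load(File checkpoint) throws Exception {
        return WordVectorSerializer.readVocabCache(checkpoint);
    }
}

The restored cache could then be handed back to the Word2Vec.Builder (via its vocabCache setter, assuming your version exposes one), skipping the expensive corpus pass on retry.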

@raver119 (Contributor)

Periodic truncation for huge datasets is already available; I added it a year ago, but unfortunately it assumes manual use.
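
For anyone hitting this later, the manual route looks roughly like the sketch below. enableScavenger is presumably the periodic-truncation switch being referred to; the method names follow the VocabConstructor builder API as remembered, so verify against your DL4J version:

import org.deeplearning4j.models.sequencevectors.interfaces.SequenceIterator;
import org.deeplearning4j.models.word2vec.VocabWord;
import org.deeplearning4j.models.word2vec.wordstore.VocabConstructor;
import org.deeplearning4j.models.word2vec.wordstore.inmemory.AbstractCache;

public class ManualVocabBuild {
    static AbstractCache<VocabWord> build(SequenceIterator<VocabWord> iterator) {
        AbstractCache<VocabWord> cache = new AbstractCache.Builder<VocabWord>().build();

        VocabConstructor<VocabWord> constructor = new VocabConstructor.Builder<VocabWord>()
                .addSource(iterator, 5)     // minWordFrequency, as in the config above
                .setTargetVocabCache(cache)
                .enableScavenger(true)      // assumed name: periodically evicts rare entries
                .build();

        // Same entry point this issue is about.
        constructor.buildJointVocabulary(false, true);
        return cache;
    }
}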

@raver119 (Contributor)

The issue was fixed long ago.
