Support explicitly stored ngrams #61
Before going deeper into NGram vocabs in this repo, I'd like to pull in some more logic from
For instance, we should be able to use the
What do you think? @danieldk @NianhengWu |
Yes, I think this is a great idea. The |
This one's pretty much done, what's missing now is a release for `finalfusion-rust`, then we can replace the dependencies with proper versions. |
Maybe it would be good to do test runs and compute-accuracy for finalfrontier/finalfusion before the changes and then after (obviously, the n-grams model could only be done after), to see if there are no regressions. The last time we did big changes, this shook out an important bug. |
I'm currently training an ngram model on turing (started with
I can train a bucket-vocab model in parallel to catch possible regressions; I guess a skipgram model should suffice, since we didn't touch the training routine, only the vocab and config. The ngram models I'm training are all structgram models, since we've been using structgram for almost all experiments. |
~48% on compute-accuracy for a skipgram model with
That doesn't seem right. I'm now training on
@danieldk did we verify anything after adding support for no-subwords training? |
I checked Unfortunately I didn't change the name before restarting training and since |
So, accuracy is down ~9%, that must be a bug. I guess the next step is finding out whether it's finalfusion or finalfrontier changes causing this. |
Do you remember if we verified models before the |
Of |
My hunch is that it's related to the 0.6 release. We changed quite a lot directly related to the training. If the |
0.5.0 -> 0.6.1 is fairly small though: embedding norms storage, better default hyperparameters, dirgram model, and that's pretty much it. All the big changes were in 0.5.0, which we checked. |
I misremembered releasing the no-subwords training. I meant to say it's probably related to changing the vocabs and indexing. |
That's likely. One possibility is some incorrect offset in the embedding matrix (where the subword index is not added to the known vocab size). |
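For illustration, a minimal sketch of the offset in question (the names and numbers are made up, this is not the actual finalfrontier code): known words occupy the first rows of the embedding matrix, and the subword rows come after them, so every subword index has to be shifted by the vocab size.

/// Row of a subword (ngram) embedding in the combined embedding matrix.
/// Rows 0..vocab_len hold known words; subword embeddings follow them.
/// Forgetting to add `vocab_len` here would make subword lookups and
/// updates silently hit the word rows instead.
fn subword_row(vocab_len: usize, subword_idx: usize) -> usize {
    vocab_len + subword_idx
}

fn main() {
    let vocab_len = 50_000;
    // A hypothetical in-vocab word and two of its ngram indices.
    let word_row = 42;
    let ngram_rows: Vec<usize> = [7usize, 123]
        .iter()
        .map(|&i| subword_row(vocab_len, i))
        .collect();
    assert_eq!(ngram_rows, vec![50_007, 50_123]);
    println!("word row: {}, ngram rows: {:?}", word_row, ngram_rows);
}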
What version of |
I verified against So I think that rules out most of |
Shouldn't be about subword indices, adding this to

let mut word_n = 0;
for idx in &idx {
    // Sanity check: at most one of the returned indices may fall in the
    // known-word vocab range; every other index must be a subword index.
    if (*idx as usize) < self.vocab.len() {
        if word_n >= 1 {
            panic!();
        }
        word_n += 1;
    }
} |
Can we see the difference in performance in the loss for a relatively small training corpus? If so, you could |
Losses for 0.4, 0.5, 0.6 and master:
Starting with 0.6, something apparently changed: the loss is now much lower. Although I'm not sure whether that's really indicative; we changed something about the loss calculation at some point. |
Default ctx-size changed. So that explains my differences but not the regression. |
We changed some defaults:
So I guess you have to modify the context size. |
Heh, simultaneous. Maybe you could redo them with the same context size? |
Already done:
and
Loss is the same as on 0.6 and master now. I'll convert one of our public models with current finalfusion-utils and run it against compute-accuracy to see whether serialization is broken... |
$ finalfusion convert /zdv/sfb833-a3/public-data/finalfusion/german-skipgram-mincount-30-ctx-10-dims-300.fifu convert_test.fifu -f finalfusion -t finalfusion
$ finalfusion compute-accuracy ./convert_test.fifu ~/finalfrontier/analogies/de_trans_Google_analogies.txt --threads 10
[...]
Total: 10611/18552 correct, accuracy: 57.20, avg cos: 0.69
Looks fine... |
Cargo installed
$ finalfusion compute-accuracy ../finalfrontier/twe-bucket-vocab-test.fifu ~/finalfrontier/analogies/de_trans_Google_analogies.txt --threads 40
██████████████████████████████ 100% ETA: 00:00:00
capital-common-countries: 282/506 correct, accuracy: 55.73, avg cos: 0.74, skipped: 0
capital-world: 2678/4524 correct, accuracy: 59.20, avg cos: 0.73, skipped: 0
city-in-state: 716/2467 correct, accuracy: 29.02, avg cos: 0.68, skipped: 0
currency: 58/866 correct, accuracy: 6.70, avg cos: 0.60, skipped: 0
family: 275/506 correct, accuracy: 54.35, avg cos: 0.74, skipped: 0
gram2-opposite: 181/812 correct, accuracy: 22.29, avg cos: 0.65, skipped: 0
gram3-comparative: 901/1332 correct, accuracy: 67.64, avg cos: 0.66, skipped: 0
gram4-superlative: 310/1122 correct, accuracy: 27.63, avg cos: 0.56, skipped: 0
gram5-present-participle: 202/1056 correct, accuracy: 19.13, avg cos: 0.64, skipped: 0
gram6-nationality-adjective: 583/1599 correct, accuracy: 36.46, avg cos: 0.70, skipped: 0
gram7-past-tense: 734/1560 correct, accuracy: 47.05, avg cos: 0.66, skipped: 0
gram8-plural: 494/1332 correct, accuracy: 37.09, avg cos: 0.67, skipped: 0
gram9-plural-verbs: 601/870 correct, accuracy: 69.08, avg cos: 0.72, skipped: 0
Total: 8015/18552 correct, accuracy: 43.20, avg cos: 0.68
Skipped: 0/18552 (0%) |
So, does this mean that the bug was already in 0.6.1? |
At some point, yes, but we are obscuring losses a bit by averaging over all time. The initial updates will always be much larger, and their effect is probably less observable; the long tail we cannot observe as closely due to averaging. We did it this way because we mimicked word2vec, but we should really show some moving average. BTW, I'm not saying this is the culprit, but it's one of the few things I could think of that would explain why a single version produces both good and bad results between runs, and why the older embeddings do so much better. I think I used to train with 20 threads at most, sometimes even fewer (e.g. on shaw). |
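For reference, a minimal sketch of an exponentially weighted moving average that could replace the all-time average when reporting the loss; the smoothing factor is an arbitrary choice for illustration, not something we've decided on.

/// Exponentially weighted moving average of the training loss. Unlike an
/// all-time average, it forgets the large initial updates and tracks the
/// long tail of training more closely.
struct MovingLoss {
    alpha: f32,
    value: Option<f32>,
}

impl MovingLoss {
    fn new(alpha: f32) -> Self {
        MovingLoss { alpha, value: None }
    }

    fn update(&mut self, loss: f32) -> f32 {
        let new = match self.value {
            Some(prev) => self.alpha * loss + (1.0 - self.alpha) * prev,
            None => loss,
        };
        self.value = Some(new);
        new
    }
}

fn main() {
    let mut avg = MovingLoss::new(0.01);
    for &l in &[0.9, 0.5, 0.1, 0.05, 0.04] {
        println!("smoothed loss: {}", avg.update(l));
    }
}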
Yes, didn't disable them yet. |
I'll restart master training on hopper with 10 rather than 30 threads. Would be good if we didn't train other embeddings on the same machine. |
Losses for the 30-thread models on turing are super close to each other, at 0.00195 and 0.00203. What did you get on hopper? |
0.4.0: 0.00178
But again, let's not over-analyze an average loss, which is also updated with Hogwild ;). (Both were with 30 threads.) |
Sure, just thought it'd be good to verify we're not magnitudes apart. |
Threads could really be our answer. This is from tesniere with 20 threads on 0.5:
And turing on 0.5 with 30 threads:
I'll give master on tesniere and turing with 20 threads a go now. We should add |
It would not be very satisfying, but it would at least be a good explanation. I agree on adding
If this is indeed the culprit, we should also investigate switching to HogBatch. From a quick glance over the paper (I still need to read it in detail), this seems simple enough to implement -- it updates the 'global' matrix every n instances rather than per instance. We would need a performant sparse matrix implementation, but there is a lot of prior work there (and only the rows need to be sparse; within a row we could use dense vectors). |
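Roughly what I have in mind, as a sketch of the HogBatch idea only (buffer row updates locally and flush them every n instances), not of how it would actually be wired into finalfrontier; all names and the flush interval are placeholders.

use std::collections::HashMap;

/// Per-thread buffer of row updates. The buffer is sparse over rows (only
/// touched rows are stored), but each stored row is a dense vector.
struct UpdateBuffer {
    updates: HashMap<usize, Vec<f32>>,
    seen: usize,
    flush_every: usize,
}

impl UpdateBuffer {
    fn new(flush_every: usize) -> Self {
        UpdateBuffer { updates: HashMap::new(), seen: 0, flush_every }
    }

    // Accumulate a gradient for one row locally instead of writing it to the
    // shared matrix right away, as plain Hogwild would.
    fn add(&mut self, row: usize, delta: &[f32]) {
        let entry = self.updates.entry(row).or_insert_with(|| vec![0.0; delta.len()]);
        for (e, d) in entry.iter_mut().zip(delta) {
            *e += *d;
        }
    }

    // Call once per training instance; flushes the buffered rows into the
    // shared matrix every `flush_every` instances.
    fn tick(&mut self, global: &mut Vec<Vec<f32>>) {
        self.seen += 1;
        if self.seen % self.flush_every == 0 {
            for (row, delta) in self.updates.drain() {
                for (g, d) in global[row].iter_mut().zip(&delta) {
                    *g += *d;
                }
            }
        }
    }
}

fn main() {
    let mut global = vec![vec![0.0f32; 4]; 3];
    let mut buf = UpdateBuffer::new(2);
    buf.add(1, &[0.1, 0.2, 0.3, 0.4]);
    buf.tick(&mut global);
    buf.add(1, &[0.1, 0.2, 0.3, 0.4]);
    buf.tick(&mut global); // second instance: buffered updates are applied
    println!("{:?}", global[1]);
}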
I haven't really looked at more than the figures in the paper; if I find time, I'll also go through it. There was also an issue that suggested implementing some different optimization techniques: #42. Another small datapoint, not related to the bug but to turing: with identical hyperparameters, turing shows an ETA of +24h compared to tesniere. Are your results from hopper in yet? |
There's additional support for the threads theory: the skipgram model from a couple of months ago that wasn't broken was also trained on 20 threads. I found that by accident in my bash history. |
Nope, but I suspended the process overnight because I had to run something else (preparing a Dutch treebank) and I didn't want that to influence the result in any way. I continued training this morning, but the ETA is completely unreliable now. IIRC it would take a bit more than a day with 10 threads, and it's now at 26%. |
I continue to be surprised about this, since they have identical CPUs. Of course, there is the VM, but that should only make a few percent difference. I have reset the CPU affinities of the VMs. I also noticed that the
Edit: tesniere also uses |
The difference is still almost 20 hours; current ETAs are 17h on tesniere and 36h on turing. Turing is running an older Ubuntu release, 16.04, while tesniere is on 18.04, although that shouldn't explain such dramatic differences. |
You could
The Ubuntu versions should indeed not make a large difference. The same compiler is used on both machines, so it should produce roughly the same machine code, and our hot loops don't do syscalls anyway. |
Results from tesniere on master are in, 56.51%. Looks like everything was fine all along... |
Turing won't be done until the early morning. Anyway, I don't expect different results from turing; we've seen on 0.5, 0.6 and master that the number of threads can really do this. That probably means we should set an upper bound on the default number of threads. Something like |
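Something along these lines, as a sketch (this uses the num_cpus crate; the cap of 20 is a placeholder, not a value we've settled on):

// Assumes the num_cpus crate (num_cpus = "1") in Cargo.toml.

/// Default thread count: half the logical CPUs, but capped. With too many
/// Hogwild threads the updates get lossy enough to hurt accuracy, so we
/// shouldn't default to all available cores.
fn default_threads() -> usize {
    (num_cpus::get() / 2).clamp(1, 20)
}

fn main() {
    println!("default threads: {}", default_threads());
}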
10 threads on hopper on |
Yes. And we have to put a warning in the README. Most people will think 'I have 40 cores, let's use all of them'. |
Ok, seems like this issue could almost be closed. Maybe we should also run skipgram - bucket We should probably do all of them with the fixed |
Well, there's still #72 to fix; the finalfusion dependency needs to be updated to a release. |
That's true. But we currently don't add EOS markers anyway, right? So this would only affect a hypothetical |
Never mind, we do. |
They appear among the most frequent tokens in the vocab, usually the top index. |
I had a vague recollection that we once had a discussion about not using EOS. |
That was in #60, but we didn't remember this. I think we could patch this out, since we include punctuation in training, which essentially acts as EOS marker(s). |
Btw, following these changes, enabling training of actual fastText embeddings is only a |
- Entails changes to finalfusion-rust for serialization/usage (vocab.rs for all variants is quite unwieldy)
- In finalfusion-rust, parameterize SubwordVocab with Indexer; separate ngram and word indices in the index through enum (might get ugly, considering how many type parameters we already have for the trainer structs; see the sketch below)
- Update the finalfusion-rust dependency once it supports NGramVocabs
- Replace the finalfusion dependency with a release.
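To make the SubwordVocab/Indexer point concrete, here is a rough sketch of the shape this could take; the trait and type names are illustrative, not the actual finalfusion-rust API.

use std::collections::HashMap;

// Maps an ngram to an index into the subword part of the embedding matrix.
trait Indexer {
    fn index(&self, ngram: &str) -> Option<u64>;
}

// fastText-style bucket indexer: every ngram gets an index by hashing.
struct BucketIndexer {
    buckets_exp: u32,
}

impl Indexer for BucketIndexer {
    fn index(&self, ngram: &str) -> Option<u64> {
        // FNV-1a hash, reduced to 2^buckets_exp buckets.
        let mut hash: u64 = 0xcbf29ce484222325;
        for byte in ngram.bytes() {
            hash ^= u64::from(byte);
            hash = hash.wrapping_mul(0x100000001b3);
        }
        Some(hash % (1u64 << self.buckets_exp))
    }
}

// Explicitly stored ngrams: only ngrams collected during vocab construction
// have an index; unknown ngrams are dropped instead of being hashed.
struct NGramIndexer {
    ngrams: HashMap<String, u64>,
}

impl Indexer for NGramIndexer {
    fn index(&self, ngram: &str) -> Option<u64> {
        self.ngrams.get(ngram).copied()
    }
}

// A single subword vocab type, generic over the indexing scheme, instead of
// one vocab type per scheme.
struct SubwordVocab<I> {
    words: Vec<String>,
    indexer: I,
}

fn main() {
    let bucket_vocab = SubwordVocab {
        words: vec!["hallo".to_string()],
        indexer: BucketIndexer { buckets_exp: 21 },
    };
    let ngram_vocab = SubwordVocab {
        words: vec!["hallo".to_string()],
        indexer: NGramIndexer {
            ngrams: [("<ha".to_string(), 0)].into_iter().collect(),
        },
    };
    println!("{} known words", bucket_vocab.words.len());
    println!("bucket index of <ha: {:?}", bucket_vocab.indexer.index("<ha"));
    println!("ngram index of <ha:  {:?}", ngram_vocab.indexer.index("<ha"));
    println!("ngram index of xyz:  {:?}", ngram_vocab.indexer.index("xyz"));
}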