Release 0.6 #45
There's still the branch with abstracting the input vocab #13. I haven't looked into that since the last time we discussed that, so probably 2 months? I doubt that we'll figure that one out timely, unless you have a suggestion for that. The only other thing in the pipeline is the post-training filtering approach I thought about the other day. Storing frequencies for the tokens might be nice, too. But that'd entail adding a chunk to the finalfusion format.
I was thinking the same; for my project I need to train some models with norms.
On Fri, Jun 7, 2019, at 19:45, Sebastian Pütz wrote:
> There's still the branch with abstracting the input vocab #29. I haven't looked into that since the last time we discussed that, so probably 2 months? I doubt that we'll figure that one out timely, unless you have a suggestion for that.

I think we should postpone it for now. It will take some time to settle, even if we have an elegant solution.

> The only other thing in the pipeline is the post-training filtering approach I thought about the other day.
Ideally this would be a finalfusion-utils utility. At least I think it would be nice if it was decoupled from training, so that you can decide to use a different cut-off. But at that point we don't have access to the counts anymore. Still, I could see a ff-filter-vocab utility, which you can provide with a list of tokens to retain. Then you could implement the same functionality by using an external counts list and some UNIX-fu. The benefit of this approach would be that people could also filter the vocab on other criteria.
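The external-counts idea above can be sketched in a few lines. This is a minimal illustration, not part of finalfusion-utils: it assumes a hypothetical counts file with one `token<TAB>count` line per token, and produces the retain-list that a utility like the proposed ff-filter-vocab could consume.

```python
# Sketch of the "external counts list" filtering idea.
# Assumed (hypothetical) input format: one "token<TAB>count" line per token.

def tokens_above_cutoff(lines, cutoff):
    """Yield tokens whose corpus frequency is at least `cutoff`."""
    for line in lines:
        token, count = line.rsplit("\t", 1)
        if int(count) >= cutoff:
            yield token

counts = ["the\t1042", "finalfusion\t7", "hapax\t1"]
print(list(tokens_above_cutoff(counts, 5)))  # → ['the', 'finalfusion']
```

Because the cut-off is applied outside of training, the same counts list can be reused with different thresholds, or swapped for any other retain-list criterion.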
> Storing frequencies for the tokens might be nice, too. But that'd entail adding a chunk to the finalfusion format.

Yes and yes. I am not sure if we want more chunks ;).
Since we have the information, I'm not sure why we should discard it. If someone downloads one of our pretrained models, they might be interested in the distribution of tokens in the training corpus and how often specific tokens showed up. Right now they'd only know that a token showed up at least min_count times. I think having both would be nice: the ability to filter based on a word list and the ability to filter based on known frequencies. Then it'd be possible to restrict the vocab based on the number of occurrences in the corpus and based on external resources.
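To make the "extra chunk" idea concrete, here is a purely illustrative sketch of what a per-token counts chunk could look like. The chunk identifier (100) and the exact layout are made up for illustration; they are not part of the actual finalfusion format.

```python
import struct

COUNTS_CHUNK_ID = 100  # hypothetical identifier, not a real finalfusion chunk id

def write_counts_chunk(counts):
    """Serialize token counts as: u32 chunk id, u64 payload length,
    u64 number of tokens, then one u64 count per token (little-endian)."""
    payload = struct.pack("<Q", len(counts))
    payload += b"".join(struct.pack("<Q", c) for c in counts)
    return struct.pack("<IQ", COUNTS_CHUNK_ID, len(payload)) + payload

def read_counts_chunk(data):
    """Inverse of write_counts_chunk; returns the list of counts."""
    chunk_id, _length = struct.unpack_from("<IQ", data, 0)
    assert chunk_id == COUNTS_CHUNK_ID
    n, = struct.unpack_from("<Q", data, 12)
    return list(struct.unpack_from(f"<{n}Q", data, 20))

blob = write_counts_chunk([1042, 7, 1])
print(read_counts_chunk(blob))  # → [1042, 7, 1]
```

Counts are stored in vocab order, so no token strings need to be duplicated; a reader that doesn't know the chunk id can skip the chunk using the stored payload length.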
I think I figured out a way that's not too hacky. If we include @NianhengWu's implementation of the directional skipgram, there might be time to polish my code to a good level. I should be able to push out the changes some time tomorrow; it's already compiling, I just want to go through all the places where things changed before letting someone look at it.
I've added the dirgram model to finalfrontier and am testing it now. I recall we briefly discussed changing the default context window size from 5 to 10, since it performs better in most tests, but we never officially confirmed it. So are we changing it?
Created a PR for this: #48
Ok, it seems that most things are done. Maybe we should do a small test run (one epoch of every model). There are no changes that affect the model scores, but just to ensure that all the command-line handling changes are ok. I also didn't test (and forgot) whether the current version of finalfusion handles the changed file format.
Doing the test right now :)
Thanks for testing!
Also confirmed that finalfusion 0.5 is happy with the changed file format. |
Released in 684334f. |
I think the norms storage change is pretty important. The earlier we push 0.6.0 out, the better, since it reduces the number of embeddings in the wild that do not have norms.
That said, I think it would be nice to have Nicole's directional skipgram implementation in as well, since then we also have a nice user-visible feature.
Is there anything else that we want to add before branching 0.6?