Is there a runnable example with the given data? #1
Comments
Hello! I believe the argument I think you're one of the first people to try out this repo so please keep asking questions and I'll find some time to clean up the repo and add more documentation! |
Ok I'm making some updates now. Some changes:
|
Awesome, thanks @ddehueck ! |
@ceteri Ok, should be good to go now! You should see embeddings like: after around 100 epochs using the world order dataset. |
Also, I believe there may be a way to speed up the loss, as I did here: https://github.com/ddehueck/CrossWalk/blob/master/domains/sgns_loss.py
If you want to compare against an ML framework without a JIT compiler, I have an SGNS implementation in PyTorch here: https://github.com/ddehueck/skip-gram-negative-sampling |
Many thanks @ddehueck ! That ran fine, although the embeddings that I'm seeing reported are:
Looking at line
word through that iteration?
|
In the line right above the one you linked to: I think what is happening is you are using different data in So try changing the dataset queries. |
Thank you - Checking through the code, I may be encountering a problem on
This warning gets printed 4x through each epoch:
This is with a simple command line:
And the dataset is hard-coded to |
Hmm, not sure how much help I can be without seeing your setup. How is your data pipeline set up? |
I seem to be facing the same issue. I just cloned the repo and ran python3 train.py, and it seems my gradient is NaN and I get the exact same embeddings that @ceteri is getting. Were either of you able to find a fix for this? |
@arjun-mani Are you running the default sample that is in the codebase or have you added a custom dataset? |
@ddehueck Default sample. Seems like the problem was with the loss - maybe you never fixed the bug on remote? I'm getting -inf values for the loss. |
@arjun-mani Damn ok will take a closer look |
@ddehueck I just fixed the loss locally and am getting embeddings much closer to what you posted. So pretty sure that's the problem :) |
@arjun-mani Awesome! Would you mind creating a PR? |
Not at all, I'll do it today. |
@arjun-mani Appreciate it |
@arjun-mani Any chance you can create that PR? Sorry to bother you. |
Hey @ddehueck - I'm so so sorry for the delay, it's been a crazy few weeks. I've made a lot of changes personally to the codebase and this is a really small change, so hopefully it's helpful if I just attach the modified bce_loss function (in sgns_loss.py):
(The old line is commented). Hope this helps, lmk if you'd like me to clarify anything. |
Thank you @arjun-mani that works well. |
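For anyone else hitting the -inf loss and NaN gradients discussed above, here is a minimal sketch of one common cause and a usual remedy. It is not the bce_loss change referred to in this thread, and the thread does not confirm that this is what was wrong in the repo. It assumes the loss takes the log of a sigmoid probability directly, which saturates to log(0) for large scores. The names log_sigmoid, naive_sigmoid, and sgns_loss are hypothetical, and plain Python is used so the example does not depend on any ML framework.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)), computed as -softplus(-x)."""
    # softplus(z) = max(z, 0) + log1p(exp(-|z|)); exp() only ever sees a
    # non-positive argument, so it cannot overflow.
    z = -x
    return -(max(z, 0.0) + math.log1p(math.exp(-abs(z))))


def naive_sigmoid(x: float) -> float:
    """Textbook sigmoid; rounds to exactly 1.0 for large positive x."""
    return 1.0 / (1.0 + math.exp(-x))


def sgns_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Skip-gram negative-sampling loss for one (center, context) pair.

    loss = -log sigmoid(pos_score) - sum(log sigmoid(-s) for s in neg_scores)
    """
    loss = -log_sigmoid(pos_score)
    for s in neg_scores:
        loss -= log_sigmoid(-s)
    return loss


# With a large score on a negative sample, the naive probability
# 1 - sigmoid(s) rounds to exactly 0.0, so its log is undefined (-inf in
# most array frameworks). The stable form stays finite.
saturated = 1.0 - naive_sigmoid(40.0)
stable = sgns_loss(pos_score=2.0, neg_scores=[40.0, -3.0])
print(saturated, stable)
```

The design choice is to never materialise the probability itself: working with log(sigmoid(x)) as a single function keeps both the value and its gradient finite, which is what most frameworks' built-in log-sigmoid helpers do.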
This code looks really interesting.
However, is there a runnable example with the given data?
When running the current code using the example shown:
The hard-coded data location data/ does not appear to be included in this repo, and the named datasets are different from what's in datasets/
My apologies, I may have missed something on set up?