Article's score #1

Closed
PhySci opened this issue Jun 2, 2019 · 9 comments

PhySci commented Jun 2, 2019

Hi @bcol23
Thank you very much for the inspiring work and for publishing your code. I found it very interesting.
I played around with the repo and was a little discouraged: I cannot achieve the score you published in the article. Could you please provide a script to reproduce the published results for the RCV1 dataset?
I would also like to suggest a few improvements to the project's structure, if you don't mind.

bcol23 commented Jun 3, 2019

Thanks for your interest in our work.

A simple script may not be enough to fully cover the experimental procedure. Here are some instructions to reproduce our results:

  1. First, get the raw text dataset along with the label relations, and run the pre-processing to obtain X_train, X_test, y_train and y_test (use 10% of the training set as the validation set). The sample data can serve as an example.

  2. Pre-train the word embeddings with Poincaré GloVe, and use the WordNet hypernym relations from NLTK together with the Poincaré embeddings to capture the conceptual relations between words. Pre-train the label embeddings on the label relations via Poincaré embeddings. There is also a gensim implementation of Poincaré embeddings, described in "Train and use Poincaré embeddings"; we use their implementation to visualize the embeddings.

  3. Replace word_embed and label_embed accordingly in the code, then train the model with our hyperparameter settings. You can use a larger learning rate (e.g. 3e-3) for the first few epochs and then schedule it down (see the sketch after this list). Note that we use the GRU in both HyperIM and EuclideanIM to achieve our results.
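
For illustration, here is a minimal PyTorch sketch of step 3, assuming the pre-trained word/label embeddings are already available as tensors. The ToyClassifier below is a simplified Euclidean stand-in for HyperIM/EuclideanIM, not the actual model code, and the shapes, dummy data, and schedule are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained matrices (random here to keep the sketch self-contained);
# in practice these would be exported from Poincaré GloVe / gensim.
vocab_size, num_labels, dim = 5000, 103, 100
word_embed = torch.randn(vocab_size, dim)
label_embed = torch.randn(num_labels, dim)

class ToyClassifier(nn.Module):
    """Simplified stand-in for HyperIM/EuclideanIM: embed words, run a GRU,
    score each label by a dot product with its label embedding."""
    def __init__(self, word_embed, label_embed):
        super().__init__()
        self.word_embedding = nn.Embedding.from_pretrained(word_embed, freeze=False)
        self.label_embedding = nn.Embedding.from_pretrained(label_embed, freeze=False)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):
        h, _ = self.gru(self.word_embedding(token_ids))   # (batch, seq, dim)
        doc = h.mean(dim=1)                               # (batch, dim)
        return doc @ self.label_embedding.weight.t()      # (batch, num_labels)

model = ToyClassifier(word_embed, label_embed)

# Start with a larger learning rate (3e-3) and schedule it down after a few epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
criterion = nn.BCEWithLogitsLoss()

for epoch in range(20):
    tokens = torch.randint(0, vocab_size, (32, 50))            # dummy token batch
    targets = torch.randint(0, 2, (32, num_labels)).float()    # dummy multi-label targets
    optimizer.zero_grad()
    loss = criterion(model(tokens), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()   # learning rate drops by 10x every 5 epochs
```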

Your suggestions for improvements are very welcome.

bcol23 closed this as completed Jun 3, 2019
bcol23 reopened this Jun 3, 2019
bcol23 closed this as completed Jun 4, 2019
@ayushbits

Hey @bcol23 ,

I'm trying to create the label and word embeddings using Poincaré embeddings and Poincaré GloVe, respectively. I'm running into issues training the word embeddings with the Poincaré GloVe model.

I've created the vocab file (7 MB) and the co-occurrence file (7.5 GB) from GloVe's code for the RCV1 dataset.

When I train the word embeddings, the process is very slow and uses only a single CPU core (no GPUs):
./run_glove.sh --train --root .. -coocc_file ../poincare_glove2/GloVe/cooccurrence.bin --vocab_file ../poincare_glove2/GloVe/vocab.txt --epochs 50 --workers 20 --restrict_vocab 200000 --lr 0.01 --poincare 1 --bias --size 100 --dist_func cosh-dist-sq

  1. Could you provide the pre-trained word embeddings for the RCV1 corpus?
  2. Could you share how long it took to train the word embeddings?
  3. Could you provide the parameter settings for training the word and label embeddings?

Thanks

bcol23 commented Mar 19, 2021

I didn't keep copies of the pre-trained files and logs. We adopted the word-embedding setup detailed in the experiments section of the Poincaré GloVe paper. The default gensim setup worked quite well for the label embeddings.
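
As a reference point, here is a minimal sketch, assuming gensim's PoincareModel with near-default settings, of how the label embeddings could be pre-trained from (child, parent) label relations. The relation pairs, dimensionality, and epoch count below are illustrative, not the exact settings used in the paper.

```python
from gensim.models.poincare import PoincareModel

# Illustrative (child, parent) pairs from the RCV1 topic hierarchy; the real
# run would use the full set of label relations extracted during pre-processing.
label_relations = [
    ("C15", "CCAT"),
    ("C151", "C15"),
    ("C1511", "C151"),
    ("E12", "ECAT"),
]

# negative=2 only because this toy relation set has very few nodes; with the
# full hierarchy the gensim default (negative=10) would be used.
model = PoincareModel(label_relations, size=100, negative=2)
model.train(epochs=50)

# Each label's vector can then be looked up and assembled into label_embed.
print(model.kv["C15"])
```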

@ayushbits

Thanks @bcol23 for the response.
The label embeddings worked properly for me as well; I'm only facing issues with training the word embeddings. Could you confirm whether the command mentioned above is correct, and roughly how long training should take?
I'm completely stuck on this part.

bcol23 commented Mar 21, 2021

Perhaps it is necessary to check that the workers parameter is taking effect, since only a single CPU core is being used. The functionality of the vanilla Euclidean GloVe should also be checked.

@ayushbits

I changed the workers parameter to a very high value, but it had no impact.

  1. Could you confirm the file sizes of the vocab (7 MB) and co-occurrence (7.5 GB) files for the RCV1 dataset?
  2. Could you share the script/command you ran to train the word embedding model?

bcol23 commented Mar 25, 2021

I did not keep records of the exact file sizes. As described in the poincare_glove repo, a command similar to

./run_glove.sh --train --root path/to/root --coocc_file path/to/coocc/file --vocab_file path/to/vocab/file --epochs 50 --workers 20 --restrict_vocab 200000 --lr 0.01 --poincare 1 --bias --size 100 --mix --num_embs 50 --dist_func cosh-dist-sq

should work for a Cartesian product of Poincaré balls. The initialization trick should also be applied, by setting the --restrict_vocab and --init_pretrained parameters respectively. As I remember, the restricted vocabulary should be about 30% of the full vocabulary.

@ayushbits

Thanks @bcol23 for the reply.
Could you describe the initialization trick in more detail? If we are training the word embeddings for some corpus, say RCV1, how would initializing the embeddings be feasible?
Secondly, could you describe the computing resources used in the experiments and the time taken to train?

These details are not mentioned in the paper, so I'm asking here.

Thanks

bcol23 commented Apr 6, 2021

As described in Section 9 of the Poincaré GloVe paper, the initialization trick improves the embeddings by initializing them with pre-trained parameters: first train the model on the restricted vocabulary, then use that model as the initialization for training on the full vocabulary.
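
Conceptually, the trick amounts to something like the sketch below, which only illustrates the idea and is not the poincare_glove code itself (the tool handles this via the --restrict_vocab and --init_pretrained parameters); the vocabulary, vectors, and helper function are hypothetical.

```python
import numpy as np

def init_full_from_restricted(full_vocab, restricted_vectors, dim, scale=1e-3):
    """Build an initial embedding matrix for the full vocabulary.

    Words already trained in the restricted-vocabulary run keep their vectors;
    the remaining words start from small random values near the origin.
    """
    rng = np.random.default_rng(0)
    init = rng.normal(scale=scale, size=(len(full_vocab), dim))
    for i, word in enumerate(full_vocab):
        if word in restricted_vectors:
            init[i] = restricted_vectors[word]
    return init

# Hypothetical vectors from the first (restricted-vocabulary) training run.
restricted_vectors = {"market": np.full(100, 0.01), "stock": np.full(100, 0.02)}
full_vocab = ["market", "stock", "equity", "dividend"]
word_init = init_full_from_restricted(full_vocab, restricted_vectors, dim=100)
```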

The CPU is an Intel Xeon E5-2683, and training the embeddings with poincare_glove should finish within an hour.

Repository owner locked and limited conversation to collaborators Apr 6, 2021