Article's score #1

Closed
PhySci opened this issue Jun 2, 2019 · 9 comments

PhySci commented Jun 2, 2019

Hi @bcol23
Thank you very much for the inspiring work and for publishing your code. I found it very interesting.
I played around with the repo and was a little discouraged: I cannot achieve the score you published in the article. Could you please provide a script to reproduce the published results for the RCV1 dataset?
I would also like to suggest a few improvements to the project's structure, if you don't mind.

bcol23 commented Jun 3, 2019

Thanks for your interest in our work.

A simple script may not be enough to fully cover the experimental procedure. Here are some instructions to reproduce our results:

  1. First, get the raw text dataset along with the label relations, and run the pre-processing to obtain X_train, X_test, y_train and y_test (use 10% of the training set as the validation set). The sample data can serve as an example.

  2. Pre-train the word embeddings with Poincaré GloVe, and use the WordNet hypernym relations from NLTK together with the Poincaré embeddings to capture the conceptual relations between words. Pre-train the label embeddings on the label relations via Poincaré embeddings. There is also a gensim implementation of Poincaré embeddings, described in "Train and use Poincaré embeddings"; we use their implementation to visualize the embeddings.

  3. Replace word_embed and label_embed accordingly in the code, then train the model with our hyperparameter settings. You can use a larger learning rate (e.g. 3e-3) for the first few epochs and then schedule it down (see the sketch after this list). Note that we use the GRU in both HyperIM and EuclideanIM to achieve our results.
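
For illustration, here is a minimal PyTorch sketch of step 3, assuming the pre-trained word/label embeddings are already available as tensors. The ToyClassifier below is a simplified Euclidean stand-in for HyperIM/EuclideanIM, not the actual model code, and the shapes, dummy data, and schedule are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained matrices (random here to keep the sketch self-contained);
# in practice these would be exported from Poincaré GloVe / gensim.
vocab_size, num_labels, dim = 5000, 103, 100
word_embed = torch.randn(vocab_size, dim)
label_embed = torch.randn(num_labels, dim)

class ToyClassifier(nn.Module):
    """Simplified stand-in for HyperIM/EuclideanIM: embed words, run a GRU,
    score each label by a dot product with its label embedding."""
    def __init__(self, word_embed, label_embed):
        super().__init__()
        self.word_embedding = nn.Embedding.from_pretrained(word_embed, freeze=False)
        self.label_embedding = nn.Embedding.from_pretrained(label_embed, freeze=False)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, token_ids):
        h, _ = self.gru(self.word_embedding(token_ids))   # (batch, seq, dim)
        doc = h.mean(dim=1)                               # (batch, dim)
        return doc @ self.label_embedding.weight.t()      # (batch, num_labels)

model = ToyClassifier(word_embed, label_embed)

# Start with a larger learning rate (3e-3) and schedule it down after a few epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
criterion = nn.BCEWithLogitsLoss()

for epoch in range(20):
    tokens = torch.randint(0, vocab_size, (32, 50))            # dummy token batch
    targets = torch.randint(0, 2, (32, num_labels)).float()    # dummy multi-label targets
    optimizer.zero_grad()
    loss = criterion(model(tokens), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()   # learning rate drops by 10x every 5 epochs
```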

Your suggestions for improvements are very welcome.

bcol23 closed this as completed Jun 3, 2019
bcol23 reopened this Jun 3, 2019
bcol23 closed this as completed Jun 4, 2019
@ayushbits

Hey @bcol23 ,

I'm trying to create the label and word embeddings using Poincaré embeddings and Poincaré GloVe, respectively. I'm running into issues training the word embeddings with the Poincaré GloVe model.

I've created the vocab file (7 MB) and the co-occurrence file (7.5 GB) from GloVe's code for the RCV1 dataset.

When I train the word embeddings, the process is very slow and uses only a single CPU core (no GPUs):
./run_glove.sh --train --root .. -coocc_file ../poincare_glove2/GloVe/cooccurrence.bin --vocab_file ../poincare_glove2/GloVe/vocab.txt --epochs 50 --workers 20 --restrict_vocab 200000 --lr 0.01 --poincare 1 --bias --size 100 --dist_func cosh-dist-sq

  1. Could you provide the pre-trained word embeddings for the RCV1 corpus?
  2. Could you share how long it took to train the word embeddings?
  3. Could you provide the parameter settings for training the word and label embeddings?

Thanks

bcol23 commented Mar 19, 2021

I didn't keep copies of the pre-trained files and logs. We adopted the word-embedding setup detailed in the experiments section of the Poincaré GloVe paper. The default gensim setup worked quite well for the label embeddings.
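
As a reference point, here is a minimal sketch, assuming gensim's PoincareModel with near-default settings, of how the label embeddings could be pre-trained from (child, parent) label relations. The relation pairs, dimensionality, and epoch count below are illustrative, not the exact settings used in the paper.

```python
from gensim.models.poincare import PoincareModel

# Illustrative (child, parent) pairs from the RCV1 topic hierarchy; the real
# run would use the full set of label relations extracted during pre-processing.
label_relations = [
    ("C15", "CCAT"),
    ("C151", "C15"),
    ("C1511", "C151"),
    ("E12", "ECAT"),
]

# negative=2 only because this toy relation set has very few nodes; with the
# full hierarchy the gensim default (negative=10) would be used.
model = PoincareModel(label_relations, size=100, negative=2)
model.train(epochs=50)

# Each label's vector can then be looked up and assembled into label_embed.
print(model.kv["C15"])
```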

@ayushbits

Thanks @bcol23 for the response.
The label embeddings worked properly for me as well; I'm only facing issues with training the word embeddings. Could you confirm whether the command mentioned above is correct, and roughly how long training should take?
I'm completely stuck on this part.

bcol23 commented Mar 21, 2021

Perhaps it is necessary to check that the workers parameter is taking effect, since only a single CPU core is being used. The functionality of the vanilla Euclidean GloVe should also be checked.

@ayushbits

I changed the workers parameter to a very high value, but it had no impact.

  1. Could you confirm the file sizes of the vocab (7 MB) and co-occurrence (7.5 GB) files for the RCV1 dataset?
  2. Could you share the script/command you ran to train the word embedding model?

bcol23 commented Mar 25, 2021

I did not keep records of the exact file sizes. As described in the poincare_glove repo, a command similar to

./run_glove.sh --train --root path/to/root --coocc_file path/to/coocc/file --vocab_file path/to/vocab/file --epochs 50 --workers 20 --restrict_vocab 200000 --lr 0.01 --poincare 1 --bias --size 100 --mix --num_embs 50 --dist_func cosh-dist-sq

should work for a Cartesian product of Poincaré balls. The initialization trick should also be applied, by setting the --restrict_vocab and --init_pretrained parameters respectively. As I remember, the restricted vocabulary should be about 30% of the full vocabulary.

@ayushbits

Thanks @bcol23 for the reply.
Could you describe the initialization trick in more detail? If we are training the word embeddings for some corpus, say RCV1, how would initializing the embeddings be feasible?
Secondly, could you describe the computing resources used in the experiments and the time taken to train?

These details are not mentioned in the paper, so I'm asking here.

Thanks

bcol23 commented Apr 6, 2021

As described in Section 9 of the Poincaré GloVe paper, the initialization trick improves the embeddings by initializing them with pre-trained parameters: first train the model on the restricted vocabulary, then use that model as the initialization for training on the full vocabulary.
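
Conceptually, the trick amounts to something like the sketch below, which only illustrates the idea and is not the poincare_glove code itself (the tool handles this via the --restrict_vocab and --init_pretrained parameters); the vocabulary, vectors, and helper function are hypothetical.

```python
import numpy as np

def init_full_from_restricted(full_vocab, restricted_vectors, dim, scale=1e-3):
    """Build an initial embedding matrix for the full vocabulary.

    Words already trained in the restricted-vocabulary run keep their vectors;
    the remaining words start from small random values near the origin.
    """
    rng = np.random.default_rng(0)
    init = rng.normal(scale=scale, size=(len(full_vocab), dim))
    for i, word in enumerate(full_vocab):
        if word in restricted_vectors:
            init[i] = restricted_vectors[word]
    return init

# Hypothetical vectors from the first (restricted-vocabulary) training run.
restricted_vectors = {"market": np.full(100, 0.01), "stock": np.full(100, 0.02)}
full_vocab = ["market", "stock", "equity", "dividend"]
word_init = init_full_from_restricted(full_vocab, restricted_vectors, dim=100)
```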

The CPU is an Intel Xeon E5-2683, and training the embeddings with poincare_glove should finish within an hour.

Repository owner locked and limited conversation to collaborators Apr 6, 2021