This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Workaround mxnet nd.topk regression #153

Merged · 2 commits merged into dmlc:master on Jun 15, 2018

Conversation

@leezu leezu (Contributor) commented Jun 14, 2018

apache/mxnet#11271

Description

The behavior of nd.topk in the presence of nan values changed, causing wrong results in the pretrained word embeddings notebook, because norm_vecs_by_row(x) produces nan values for all-zero word vectors. This PR adds a small epsilon to norm_vecs_by_row(x) so that nan values are avoided.
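
A minimal sketch of the kind of fix described, assuming the notebook's helper row-normalizes the embedding matrix; the epsilon value and exact formulation here are illustrative:

from mxnet import nd

def norm_vecs_by_row(x, eps=1e-10):
    # L2-normalize each row; the small epsilon keeps all-zero rows
    # (e.g. special-token vectors) from producing nan via division by zero.
    return x / nd.sqrt(nd.sum(x * x, axis=1) + eps).reshape((-1, 1))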

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Fix embedding.ipynb for mxnet v1.3

@leezu leezu requested a review from astonzhang June 14, 2018 05:07
@leezu leezu requested a review from szha as a code owner June 14, 2018 05:07
@mli mli (Member) commented Jun 14, 2018

Job PR-153/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-153/1/index.html

"\n",
"def get_knn(vocab, k, word):\n",
" word_vec = vocab.embedding[word].reshape((-1, 1))\n",
" vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
" dot_prod = nd.dot(vocab_vecs, word_vec)\n",
" indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+5, ret_typ='indices')\n",
Member

We need to keep k+5 to avoid retrieving special tokens.

Contributor Author

The reason special tokens were retrieved is that their word vectors were nan, and the topk operator ranked nan highest. With the changes here the special tokens will no longer be among the topk elements, so instead of k + 5 we only need k + 1 to exclude the input token.

Contributor Author

With mxnet 1.3, however, the topk operator gets confused by the nan values and returns seemingly random words instead of the top-k words whenever nan is present in the input.
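
A small reproduction sketch of the situation described, assuming a score vector that contains nan; the exact indices returned depend on the MXNet version, which is the regression being worked around:

from mxnet import nd

scores = nd.array([0.9, float('nan'), 0.1, 0.7])
# Before the regression, nan ranked highest and landed in the top-k;
# with mxnet 1.3 the presence of nan can scramble the returned indices.
print(nd.topk(scores, k=2, ret_typ='indices'))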

Member

OK. Do you know why nan is returned for special tokens? Is it because they are initialized as zero vectors, so the denominator of the cosine similarity is 0?

Contributor Author

Yes, they are initialized as 0, so there was a division by 0 in norm_vecs_by_row.
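
For concreteness, a tiny sketch (not from the PR) of how a zero-initialized vector turns into nan under plain row normalization without the epsilon:

from mxnet import nd

zero_vecs = nd.zeros((1, 3))                              # e.g. a special-token vector
norms = nd.sqrt(nd.sum(zero_vecs * zero_vecs, axis=1))    # norm is 0
print(zero_vecs / norms.reshape((-1, 1)))                 # 0 / 0 -> nan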

" indices = [int(i.asscalar()) for i in indices]\n",
" # Remove unknown and input tokens.\n",
" return vocab.to_tokens(indices[5:])"
Member

Same as above.

@astonzhang (Member)

Besides, please update get_top_k_by_analogy as follows:

def get_top_k_by_analogy(token_embedding, k, word1, word2, word3):
    word_vecs = token_embedding.get_vecs_by_tokens([word1, word2, word3])
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2]).reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(token_embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_diff)
    indices = nd.topk(dot_prod.reshape((len(token_embedding), )), k=k,
                      ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    return token_embedding.to_tokens(indices)

Otherwise, LGTM.

@leezu leezu (Contributor Author) left a comment

Started review by accident

"\n",
"def get_knn(vocab, k, word):\n",
" word_vec = vocab.embedding[word].reshape((-1, 1))\n",
" vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
" dot_prod = nd.dot(vocab_vecs, word_vec)\n",
" indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+5, ret_typ='indices')\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are initialized as 0 and consequently there was a division by 0 in norm_vecs_by_row

@leezu leezu (Contributor Author) commented Jun 14, 2018

The get_top_k_by_analogy in your last comment doesn't work due to missing functionality in TokenEmbedding. Please confirm that you are fine with ed0b9e5.

@astonzhang (Member)

LGTM

@mli mli (Member) commented Jun 15, 2018

Job PR-153/5 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-153/5/index.html

@szha szha merged commit 2ef479f into dmlc:master Jun 15, 2018
@leezu leezu deleted the workaround11271 branch June 27, 2018 19:00
paperplanet pushed a commit to paperplanet/gluon-nlp that referenced this pull request Jun 9, 2019
* Workaround mxnet nd.topk regression

apache/mxnet#11271

* Simplify get_top_k_by_analogy