Conversation
Job PR-153/1 is complete.
"\n", | ||
"def get_knn(vocab, k, word):\n", | ||
" word_vec = vocab.embedding[word].reshape((-1, 1))\n", | ||
" vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n", | ||
" dot_prod = nd.dot(vocab_vecs, word_vec)\n", | ||
" indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+5, ret_typ='indices')\n", |
We need to keep k+5 to avoid retrieving special tokens.
The reason why special tokens were retrieved is that their word vectors were nan and the topk operator considered nan as the highest ranking. With the changes here the special tokens will not be among the top-k elements; consequently, instead of k + 5 we only need k + 1 to exclude the input token.
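For illustration, a minimal sketch of the k + 1 variant discussed here (assuming the notebook's norm_vecs_by_row no longer yields nan for the zero-vector special tokens; the names mirror the notebook but this is not the exact patch):

from mxnet import nd

def get_knn(vocab, k, word):
    word_vec = vocab.embedding[word].reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_vec)
    # k + 1: the query word is always its own nearest neighbour, and the
    # special tokens no longer rank on top once their vectors are not nan.
    indices = nd.topk(dot_prod.reshape((len(vocab),)), k=k + 1, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    # Drop only the input token itself.
    return vocab.to_tokens(indices[1:])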
With mxnet 1.3, however, the topk operator gets confused by the nan values and returns some 'random' words instead of the top-k words if nan is present in the input.
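A minimal sketch of the regression (the exact misbehaviour depends on the mxnet version; see apache/mxnet#11271):

from mxnet import nd

scores = nd.array([0.2, float('nan'), 0.9, 0.5])
# With the regression present, the nan entry can be ranked on top or the
# ordering can come back scrambled instead of the true top-2 indices (2, 3).
print(nd.topk(scores, k=2, ret_typ='indices'))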
OK. Do you know why nan is returned for the special tokens? Is it because they are initialized as zero vectors, so the denominator of the cosine similarity is 0?
Yes, they are initialized as 0, and consequently there was a division by 0 in norm_vecs_by_row.
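A sketch of the epsilon fix being discussed (the epsilon value here is illustrative; the exact constant is in this PR's diff):

from mxnet import nd

def norm_vecs_by_row(x, eps=1e-10):
    # eps keeps all-zero rows (the special tokens) from dividing by zero
    # and turning into nan during normalization.
    return x / nd.sqrt(nd.sum(x * x, axis=1) + eps).reshape((-1, 1))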
" indices = [int(i.asscalar()) for i in indices]\n", | ||
" # Remove unknown and input tokens.\n", | ||
" return vocab.to_tokens(indices[5:])" |
Same as above. Besides, please update.
Otherwise, LGTM.
Started review by accident
"\n", | ||
"def get_knn(vocab, k, word):\n", | ||
" word_vec = vocab.embedding[word].reshape((-1, 1))\n", | ||
" vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n", | ||
" dot_prod = nd.dot(vocab_vecs, word_vec)\n", | ||
" indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+5, ret_typ='indices')\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are initialized as 0 and consequently there was a division by 0 in norm_vecs_by_row
The get_top_k_by_analogy in your last comment doesn't work due to some missing functionality in TokenEmbedding. Please confirm that you are fine with ed0b9e5.
LGTM
Job PR-153/5 is complete.
* Workaround mxnet nd.topk regression apache/mxnet#11271
* Simplify get_top_k_by_analogy
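For context, a rough sketch of what a simplified get_top_k_by_analogy could look like, reusing the fixed norm_vecs_by_row above (a hypothetical reconstruction, not necessarily the code merged in ed0b9e5):

from mxnet import nd

def get_top_k_by_analogy(vocab, k, word1, word2, word3):
    # Solve "word1 is to word2 as word3 is to ?" via vector arithmetic.
    word_vecs = vocab.embedding[word1, word2, word3]
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2]).reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_diff)
    indices = nd.topk(dot_prod.reshape((len(vocab),)), k=k, ret_typ='indices')
    return vocab.to_tokens([int(i.asscalar()) for i in indices])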
apache/mxnet#11271
Description
The behavior of nd.topk under the presence of nan values changed, causing wrong results in the pretrained word embeddings notebook, due to norm_vecs_by_row(x) inducing nan values for 0 word vectors. This PR adds a small epsilon to norm_vecs_by_row(x) so that nan values are avoided.