Out of vocabulary word vectors #25
Hi @Henry-E, have you read the paper? This is a question that's easier to answer by looking at the paper than at the code. For the pointer-generator model, OOV words in the source text are represented by the word vector for the UNK token (because, as they're OOV, there is no word vector to use). In the paper and the code, we refer to e.g. "50k + number of unknown words in the article" as the extended vocabulary because that's the set of words that can be produced by the decoder. This doesn't mean that all the words in the extended vocabulary have word vectors, though; only the words in the original vocabulary have word vectors. Hope this answers your question.
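To make the distinction concrete, here's a minimal sketch of the preprocessing idea described above (variable and function names are illustrative, not taken from this repo): each source word gets one id used for the embedding lookup, where every OOV word collapses to UNK, and one id in the extended vocabulary, where each distinct OOV word in the article gets a temporary slot so the copy mechanism can point to it.

```python
# Illustrative sketch only; ids and names are assumptions, not the repo's code.
VOCAB_SIZE = 50_000
UNK_ID = 0  # assumed position of the UNK token in the fixed vocabulary

def article_to_ids(article_words, word2id):
    enc_ids = []       # ids fed to the embedding lookup (OOVs -> UNK_ID)
    enc_ids_ext = []   # ids used by the copy distribution (OOVs -> 50000, 50001, ...)
    article_oovs = []  # this article's OOV words, in order of first appearance
    for w in article_words:
        if w in word2id:
            enc_ids.append(word2id[w])
            enc_ids_ext.append(word2id[w])
        else:
            enc_ids.append(UNK_ID)  # no word vector exists for this word
            if w not in article_oovs:
                article_oovs.append(w)
            enc_ids_ext.append(VOCAB_SIZE + article_oovs.index(w))
    return enc_ids, enc_ids_ext, article_oovs
```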
Yep, that sounds good. I was just trying to figure out how the probability distribution over the extended vocabulary is obtained in practice. For some reason I thought that, in order to attend properly, the attention mechanism needed to be able to distinguish between unique OOV words.
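For anyone with the same question, here's a hedged NumPy sketch of how that distribution can be formed, following the shape of the paper's final distribution: the generation probability p_gen weights the fixed-vocabulary softmax, and 1 − p_gen weights the attention-based copy distribution, which is scattered onto extended-vocab ids. The function and argument names are mine, not the repo's.

```python
import numpy as np

def final_distribution(p_vocab, attention, enc_ids_ext, p_gen, num_article_oovs):
    """Combine generation and copy probabilities over the extended vocabulary.

    p_vocab:          softmax over the fixed vocabulary, shape (vocab_size,)
    attention:        attention weights over source positions, shape (src_len,)
    enc_ids_ext:      extended-vocab id of each source token, shape (src_len,)
    p_gen:            scalar in [0, 1], probability of generating vs. copying
    num_article_oovs: number of distinct OOV words in this article
    """
    p_vocab = np.asarray(p_vocab, dtype=float)
    attention = np.asarray(attention, dtype=float)
    enc_ids_ext = np.asarray(enc_ids_ext, dtype=int)

    p_final = np.zeros(len(p_vocab) + num_article_oovs)
    p_final[:len(p_vocab)] = p_gen * p_vocab          # "generate" portion
    # "copy" portion: scatter attention mass onto extended-vocab ids;
    # repeated source tokens accumulate their attention weights.
    np.add.at(p_final, enc_ids_ext, (1.0 - p_gen) * attention)
    return p_final
```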
Ah ok, and the final bit I was also confused about, for anyone who comes across this later, was how the loss is calculated. I had thought that maybe the copy mechanism was only applied at test time, but it appears that the loss is calculated over the extended vocabulary.
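Continuing the sketch above, the training loss can then be the average negative log-likelihood of the target words under that extended-vocabulary distribution; an OOV target is addressed by its temporary extended id, so its probability comes entirely from the copy part. Again an illustrative NumPy sketch, not the repo's implementation.

```python
import numpy as np

def sequence_loss(step_distributions, target_ids_ext, eps=1e-10):
    """Average negative log-likelihood of the target words.

    step_distributions: one extended-vocab distribution per decoder step
                        (e.g. outputs of final_distribution above)
    target_ids_ext:     target ids in the extended vocabulary; an OOV target
                        has an id >= vocab_size
    """
    losses = [-np.log(dist[t] + eps)  # eps guards against log(0)
              for dist, t in zip(step_distributions, target_ids_ext)]
    return float(np.mean(losses))
```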
I'm interested in implementing something similar in a different framework, and I was wondering how out-of-vocabulary word vectors are handled. From the code I can see that for each example there's a vocab size of 50k + the number of unknown words in the article. I'm not super familiar with TensorFlow, and I was wondering what the out-of-vocabulary word vectors look like. Is the vocab size really 50k + the maximum number of unknown words in an article? Or are the word vectors for the unknowns set to zero / randomized for every example?
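On the original question of what the OOV word vectors look like: per the answer above, the embedding table only needs rows for the fixed 50k vocabulary; there are no vectors for OOV words, and any extended id is mapped back to UNK before the lookup. A tiny illustrative sketch (sizes and names are assumptions, not the repo's code):

```python
import numpy as np

EMB_DIM, VOCAB_SIZE, UNK_ID = 128, 50_000, 0      # illustrative values
embedding = np.random.randn(VOCAB_SIZE, EMB_DIM)  # rows exist only for in-vocab words

def embed_source(enc_ids_ext):
    """Embed source tokens; any extended id (>= VOCAB_SIZE) falls back to UNK."""
    ids = [i if i < VOCAB_SIZE else UNK_ID for i in enc_ids_ext]
    return embedding[ids]  # shape (src_len, EMB_DIM)
```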