allow user to define unknown token symbol #10461
Conversation
python/mxnet/rnn/io.py
Outdated
@@ -58,6 +62,8 @@ def encode_sentences(sentences, vocab=None, invalid_label=-1, invalid_key='\n',
    if vocab is None:
        vocab = {invalid_key: invalid_label}
        new_vocab = True
    elif unknown_token:
        new_vocab = True
why?
Hi, there are situations where users have their own dictionary, say dict = {'a': 1, 'b': 2, 'c': 3}. 'a', 'b' and 'c' are the frequent tokens the user cares about. All the remaining rare tokens are treated as a single unknown token (say the user defines it as 'UNK'), so the function returns an encoded list such as [[1, 2, 3], [2, 3, 0]], and the key-value pair 'UNK': 0 is added to the dictionary.
The previous version raises an error in this case, because by default it assumes that the user provides a complete vocabulary.
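To make the use case above concrete, here is a minimal standalone sketch of the described behaviour. The helper name `encode_with_unk` and its `unknown_label` parameter are hypothetical and not part of `mxnet.rnn.io`; the input sentences are assumed so that they reproduce the example output from the comment.

```python
def encode_with_unk(sentences, vocab, unknown_token="UNK", unknown_label=0):
    """Encode tokens with a user vocabulary; unseen tokens share one label.

    Hypothetical helper for illustration only, not the MXNet API.
    """
    vocab = dict(vocab)  # work on a copy so the caller's dict is untouched
    encoded = []
    for sentence in sentences:
        coded = []
        for token in sentence:
            if token not in vocab:
                # Register the unknown symbol once, the first time it is needed.
                vocab.setdefault(unknown_token, unknown_label)
                token = unknown_token
            coded.append(vocab[token])
        encoded.append(coded)
    return encoded, vocab


user_vocab = {'a': 1, 'b': 2, 'c': 3}
encoded, new_vocab = encode_with_unk([['a', 'b', 'c'], ['b', 'c', 'd']], user_vocab)
print(encoded)           # [[1, 2, 3], [2, 3, 0]]
print(new_vocab['UNK'])  # 0
```

The rare token 'd' is not in the user's dictionary, so it is encoded with the label of 'UNK', and 'UNK' appears in the returned vocabulary.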
You should change the assertion to ignore cases where unknown_token is given, instead of setting new_vocab to True.
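One way to read this suggestion, as a rough sketch rather than the code that was actually merged: `new_vocab` stays False whenever the caller supplies a vocabulary, and the assertion alone decides whether an unseen word is acceptable. The function name `encode_sentences_sketch` is hypothetical, the loop body is assumed from the signature shown in the diff, and the sketch does not guard against the new index colliding with a label already used in a user-supplied vocabulary.

```python
def encode_sentences_sketch(sentences, vocab=None, invalid_label=-1,
                            invalid_key='\n', start_label=0, unknown_token=None):
    """Illustrative sketch only; not the merged MXNet implementation."""
    idx = start_label
    if vocab is None:
        vocab = {invalid_key: invalid_label}
        new_vocab = True
    else:
        new_vocab = False  # unchanged even when unknown_token is given

    res = []
    for sent in sentences:
        coded = []
        for word in sent:
            if word not in vocab:
                # Relaxed assertion: an unseen word is allowed either when the
                # vocabulary is being built or when an unknown symbol is set.
                assert new_vocab or unknown_token, "Unknown token %s" % word
                if unknown_token:
                    word = unknown_token
                if word not in vocab:
                    if idx == invalid_label:
                        idx += 1  # never hand out the invalid label
                    vocab[word] = idx
                    idx += 1
            coded.append(vocab[word])
        res.append(coded)
    return res, vocab


res, vocab = encode_sentences_sketch(
    [['a', 'b', 'c'], ['b', 'c', 'd']],
    vocab={'a': 1, 'b': 2, 'c': 3},
    unknown_token='UNK',
)
print(res)  # [[1, 2, 3], [2, 3, 0]]
```

With a supplied vocabulary and no `unknown_token`, the same call fails the assertion on 'd', which preserves the earlier strict behaviour for callers who do not opt in.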
Ping
Sorry for the late reply (I was in exams). I have just fixed it according to your suggestion.
@ShootingSpace thanks for adding a test
test case added
Description
Adds a new feature for issue #10068. It allows an unknown token to be added to the vocab when the user provides a vocabulary and specifies a symbol (e.g. 'UNK'). It also changes the default behaviour to ignore unknown tokens, instead of raising an error as the current code does.