MultiEmbed #18
Conversation
```python
def _get_examples():
    nlp = spacy.blank("en")
```
`nlp` should be passed in so you're using the same vocab for the examples as the model.
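Something like this, where the sample text and the `Example` construction are just illustrative:

```python
from spacy.training import Example

def _get_examples(nlp):
    # Reuse the pipeline's nlp so the examples share the model's vocab.
    doc = nlp.make_doc("This is an example.")
    return [Example.from_dict(doc, {"words": [t.text for t in doc]})]
```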
```python
from typing import Optional, Callable, List, Dict, Any
from typing import Union, Sequence, Tuple
```
Can you move the `from typing` imports to the first line(s) of the file?
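E.g. consolidated into a single line at the top of the module:

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union
```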
```python
OutT = Ints2d


@thinc.registry.layers("remap_ids.v2")
```
Hm, it feels strange to add to the Thinc registry from within `spacy-experimental`. I suppose we need this for config files, though. Then maybe it should be `spacy_experimental.remap_ids` for now?
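For example (the registered name and the placeholder signature here are hypothetical, just to show the namespacing):

```python
import thinc

# Hypothetical: register under a spacy_experimental-prefixed name instead of
# claiming the generic "remap_ids.v2" slot in the shared Thinc registry.
@thinc.registry.layers("spacy_experimental.remap_ids.v1")
def remap_ids(mapping_table):
    ...
```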
This whole PR should just wait until the next thinc release.
Thinc 8.1.1 is now available, Ákos :-)
Embedding component that is the deterministic version of `MultiHashEmbed`, i.e. each token gets mapped to an index unless it is not in the vocabulary, in which case it gets mapped to a learned unknown vector.
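For illustration, the deterministic lookup behaves roughly like this (made-up table, not the actual implementation):

```python
# Made-up mapping table: known tokens -> rows of the embedding matrix.
table = {"the": 0, "cat": 1, "sat": 2}
unk_index = len(table)  # extra row holding the learned unknown vector

# Each token always maps to the same index; OOV tokens share the unknown row.
indices = [table.get(token, unk_index) for token in ["the", "dog", "sat"]]
assert indices == [0, 3, 2]
```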
The mechanism to initialize `MultiEmbed` is a bit strange. The `Model` gets created first with dummy `Embed` layers. Then, when `init` gets called, `MultiEmbed` expects `model.attrs["tables"]` to already be set, which provides the mapping from token attributes to indices. During initialization the dummy `Embed` layers get replaced by ones that adjust their sizes to the number of symbols in the `tables`.
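A rough sketch of that flow (the constructor arguments, table contents, and `train_docs` are made up for illustration):

```python
# 1. The Model is created first, with dummy Embed layers.
model = MultiEmbed(width=96)  # assumed constructor arguments

# 2. The mapping tables must be set on attrs before init runs.
model.attrs["tables"] = {
    "ORTH": {"the": 0, "cat": 1},   # made-up attribute -> index maps
    "SHAPE": {"xxx": 0, "Xxx": 1},
}

# 3. During initialization each dummy Embed is replaced by one sized
#    to the number of symbols in its table.
model.initialize(X=train_docs)
```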
A helper callback is provided in `set_attr.py` that should be placed in the `initialize.before_init` section of the `config`. It can be used to set the `tables` for `MultiEmbed`.
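In the config that would look something like the following; the registered callback name here is hypothetical, the real one is whatever `set_attr.py` registers:

```ini
[initialize.before_init]
# Hypothetical registered name for the helper in set_attr.py
@callbacks = "spacy-experimental.set_attr.v1"
# ...plus whatever arguments the callback needs to build the tables
```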
Currently `token_map.py` is a script that has the structure of the usual `spacy init` scripts.