Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiEmbed #18

Closed
wants to merge 10 commits into from
Closed

MultiEmbed #18

wants to merge 10 commits into from

Conversation

kadarakos
Copy link
Contributor

@kadarakos kadarakos commented Aug 5, 2022

Embedding component that is the deterministic version of MultiHashEmbed i.e.: each token gets mapped to an index unless they are not in the vocabulary in which case they get mapped to a learned unknown vector.

The mechanism to initialize MultiEmbed is a bit strange. The Model gets created first with dummy Embed layers. Then when init gets called MultiEmbed expects the model.attrs["tables"] to be already set, which provides the mapping from token attributes to indices. During initialization the dummy Embed layers get replaced by ones that adjust their sizes to the number of symbols in the tables.

A helper callback is provided in set_attr.py that should be placed in the initialize.before_init section in the config. It can be used to set the tables for MultiEmbed.

Currently the token_map.py is a script that has the structure of the usual spacy init scrips.



def _get_examples():
nlp = spacy.blank("en")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nlp should be passed in so you're using the same vocab for the examples as the model.

Comment on lines +7 to +8
from typing import Optional, Callable, List, Dict, Any
from typing import Union, Sequence, Tuple
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you move the from typing imports to the first line(s) of the file?

OutT = Ints2d


@thinc.registry.layers("remap_ids.v2")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it feels strange to add to the Thinc registry from within spacy-experimental. I suppose we need this for config files though. Then maybe it should be spacy_experimental.remap_ids for now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole PR should just wait until the next thinc release.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinc 8.1.1 is now available, Ákos :-)

@kadarakos kadarakos closed this Dec 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants