MultiEmbed #18
Conversation
```python
def _get_examples():
    nlp = spacy.blank("en")
```
`nlp` should be passed in so you're using the same vocab for the examples as the model.
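Something like this, where the sample text and the `Example` construction are just illustrative:

```python
from spacy.training import Example

def _get_examples(nlp):
    # Reuse the pipeline's nlp so the examples share the model's vocab.
    doc = nlp.make_doc("This is an example.")
    return [Example.from_dict(doc, {"words": [t.text for t in doc]})]
```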
```python
from typing import Optional, Callable, List, Dict, Any
from typing import Union, Sequence, Tuple
```
Can you move the `from typing` imports to the first line(s) of the file?
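E.g. consolidated into a single line at the top of the module:

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple, Union
```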
```python
OutT = Ints2d


@thinc.registry.layers("remap_ids.v2")
```
Hm, it feels strange to add to the Thinc registry from within `spacy-experimental`. I suppose we need this for config files, though. Then maybe it should be `spacy_experimental.remap_ids` for now?
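For example (the registered name and the placeholder signature here are hypothetical, just to show the namespacing):

```python
import thinc

# Hypothetical: register under a spacy_experimental-prefixed name instead of
# claiming the generic "remap_ids.v2" slot in the shared Thinc registry.
@thinc.registry.layers("spacy_experimental.remap_ids.v1")
def remap_ids(mapping_table):
    ...
```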
This whole PR should just wait until the next thinc release.
Thinc 8.1.1 is now available, Ákos :-)
Embedding component that is the deterministic version of `MultiHashEmbed`, i.e. each token gets mapped to an index unless it is not in the vocabulary, in which case it gets mapped to a learned unknown vector.
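For illustration, the deterministic lookup behaves roughly like this (made-up table, not the actual implementation):

```python
# Made-up mapping table: known tokens -> rows of the embedding matrix.
table = {"the": 0, "cat": 1, "sat": 2}
unk_index = len(table)  # extra row holding the learned unknown vector

# Each token always maps to the same index; OOV tokens share the unknown row.
indices = [table.get(token, unk_index) for token in ["the", "dog", "sat"]]
assert indices == [0, 3, 2]
```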
The mechanism to initialize `MultiEmbed` is a bit strange. The `Model` gets created first with dummy `Embed` layers. Then, when `init` gets called, `MultiEmbed` expects `model.attrs["tables"]` to already be set, which provides the mapping from token attributes to indices. During initialization the dummy `Embed` layers get replaced by ones that adjust their sizes to the number of symbols in the `tables`.
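A rough sketch of that flow (the constructor arguments, table contents, and `train_docs` are made up for illustration):

```python
# 1. The Model is created first, with dummy Embed layers.
model = MultiEmbed(width=96)  # assumed constructor arguments

# 2. The mapping tables must be set on attrs before init runs.
model.attrs["tables"] = {
    "ORTH": {"the": 0, "cat": 1},   # made-up attribute -> index maps
    "SHAPE": {"xxx": 0, "Xxx": 1},
}

# 3. During initialization each dummy Embed is replaced by one sized
#    to the number of symbols in its table.
model.initialize(X=train_docs)
```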
A helper callback is provided in `set_attr.py` that should be placed in the `initialize.before_init` section of the `config`. It can be used to set the `tables` for `MultiEmbed`.
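In the config that would look something like the following; the registered callback name here is hypothetical, the real one is whatever `set_attr.py` registers:

```ini
[initialize.before_init]
# Hypothetical registered name for the helper in set_attr.py
@callbacks = "spacy-experimental.set_attr.v1"
# ...plus whatever arguments the callback needs to build the tables
```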
Currently `token_map.py` is a script that has the structure of the usual `spacy init` scripts.