Add PrefixSuffix multiseq mappers to prepend/append tokens #12

MaksymDel · 2022-08-02T05:26:28Z

Hi! With the introduction of Smashed, munging datasets of long documents is going to be a lot more fun )

This draft PR is simply to explore the idea below. It does the following:

1. It adds a new CustomTokensSequencePaddingMapper, with corresponding classes for type_ids and attention_mask, that allows wrapping sentences with custom ids or strings.
2. It abstracts SequencePaddingMapper so that it does the general job of adding prefix/suffix tokens depending on the sentence number.

In (1), we might want to prepend strings because Text2Text models like T5 expect inputs to have Task prefixes.
And we might want to bound sentences with custom special token_ids (e.g., with tokenizer added special tokens) to indicate the type of sentences in the dataset column.

Because of (2), subclasses now do not have to implement the transform function and only define what prefix/suffix tokens to add.
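
A minimal sketch of the pattern described in (2), not the code in this PR: all class and method names below (`PrefixSuffixMapperSketch`, `BosSepMapperSketch`, `CustomTokensMapperSketch`, `prefix`, `suffix`) are hypothetical and are not the Smashed API. The base class owns `transform`, and each subclass only declares which tokens to add around the sequence at a given position.

```python
from typing import Dict, List, Sequence


class PrefixSuffixMapperSketch:
    """Hypothetical base class: implements the wrapping logic once."""

    def __init__(self, field: str) -> None:
        self.field = field

    def prefix(self, index: int) -> List[int]:
        # Tokens to put before the sequence at position `index`.
        return []

    def suffix(self, index: int) -> List[int]:
        # Tokens to put after the sequence at position `index`.
        return []

    def transform(self, sample: Dict[str, List[List[int]]]) -> Dict[str, List[List[int]]]:
        # Wrap every sequence in the field; other fields pass through unchanged.
        wrapped = [
            self.prefix(i) + list(seq) + self.suffix(i)
            for i, seq in enumerate(sample[self.field])
        ]
        return {**sample, self.field: wrapped}


class BosSepMapperSketch(PrefixSuffixMapperSketch):
    """Hypothetical subclass: BOS before the first sequence, SEP after each one."""

    def __init__(self, field: str, bos_id: int, sep_id: int) -> None:
        super().__init__(field)
        self.bos_id = bos_id
        self.sep_id = sep_id

    def prefix(self, index: int) -> List[int]:
        return [self.bos_id] if index == 0 else []

    def suffix(self, index: int) -> List[int]:
        return [self.sep_id]


class CustomTokensMapperSketch(PrefixSuffixMapperSketch):
    """Hypothetical subclass: the same user-supplied ids around every sequence."""

    def __init__(self, field: str, prefix_ids: Sequence[int], suffix_ids: Sequence[int]) -> None:
        super().__init__(field)
        self.prefix_ids = list(prefix_ids)
        self.suffix_ids = list(suffix_ids)

    def prefix(self, index: int) -> List[int]:
        return list(self.prefix_ids)

    def suffix(self, index: int) -> List[int]:
        return list(self.suffix_ids)


sample = {"input_ids": [[7, 8], [9]]}
bos_sep = BosSepMapperSketch("input_ids", bos_id=101, sep_id=102).transform(sample)
custom = CustomTokensMapperSketch("input_ids", [50], [51]).transform(sample)
```

Neither subclass touches `transform`; the cost is the extra level of inheritance noted below.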

On the pro side, it reduces code duplication (especially considering new CustomPadding classes) and unifies the classes.
But on the con side, we now have one more level of inheritance...

If some variation of this proposal fits, I can add docs and tests.

@soldni

soldni · 2022-08-03T00:32:01Z

Hi Maksym!

Thank you for this pull request. I fully support the reasoning behind adding these mappers, but I would prefer to avoid mapper arguments that are callables, as they can create issues with libraries that use function signatures (I have another PR to remove that from MakeFieldMapper, actually).

In general, I would rather have more specialized mappers that do one thing well than fewer, more generic mappers that are highly configurable. It does make for more legible code, and it is easier to maintain.

MaksymDel · 2022-08-05T01:40:33Z

Hi Luca!

I removed callable arguments and added tests.

I think this is ready for review.

MaksymDel · 2022-10-05T15:15:28Z

Hi @soldni!

Does the current state of this PR look good, or is it better to redirect the efforts elsewhere at the moment?

MaksymDel added 3 commits August 2, 2022 06:19

Abstract away SequencePaddingMapper

773b4a8

and add `SequencePaddingMapper`s for arbitrary tokens.

Allow string prefixes/suffixes

b4ee9b7

doc

8900a10

Undo rebase

33bfe31

MaksymDel force-pushed the custom_tokens_mapper branch from a5472b2 to 33bfe31 Compare August 4, 2022 23:12

MaksymDel added 2 commits August 5, 2022 02:14

Merge branch 'allenai:main' into custom_tokens_mapper

ea6163a

get the file ready

09c99d0

soldni marked this pull request as ready for review August 5, 2022 00:40

soldni marked this pull request as draft August 5, 2022 00:40

Add SuffixPrefix multiseq mappers and tests

63f63e6

MaksymDel marked this pull request as ready for review August 5, 2022 01:36

MaksymDel changed the title ~~Add Mappers for arbitrary tokens and abstract away SequencePaddingMapper;~~ Add PrefixSuffix multiseq mappers to prepend/append tokens Aug 5, 2022
