Add PrefixSuffix multiseq mappers to prepend/append tokens #12
base: main
Hi Maksym! Thank you for this pull request. I fully support the reasoning behind adding these mappers, but I would prefer to avoid Mapper arguments that are callables, as they can create issues with libraries that use function signatures (I have another PR to remove that from [...]).

In general, I would rather have more specialized mappers that do one thing well than fewer, more generic mappers that are highly configurable. That makes for more legible code, and it is easier to maintain.
Hi Luca! I removed callable arguments and added tests. I think this is ready for review.

Hi @soldni! Does the current state of this PR look good, or is it better to redirect the efforts elsewhere at the moment?
Hi! With the introduction of Smashed, munging datasets of long documents is going to be a lot more fun :)
This draft PR is simply to explore the idea below. It does the following:

1. Adds `CustomTokensSequencePaddingMapper`, with corresponding classes for `type_ids` and `attention_mask`, that allow wrapping the sentences with custom ids or strings.
2. Abstracts away `SequencePaddingMapper` to do the general job of adding prefix/suffix tokens depending on the sentence number.

In (1), we might want to prepend strings because Text2Text models like T5 expect inputs to have task prefixes. We might also want to bound sentences with custom special token ids (e.g., special tokens added to the tokenizer) to indicate the type of sentences in the dataset column.
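To make the idea concrete, here is a minimal sketch of wrapping each sequence in a multi-sequence field with prefix/suffix ids that depend on the sentence number. It is plain Python and does not use the Smashed API; the function name `wrap_sequences`, its parameters, and the ids `101`/`102` are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def wrap_sequences(
    sequences: Sequence[Sequence[int]],
    first_prefix: Sequence[int] = (),
    other_prefix: Sequence[int] = (),
    suffix: Sequence[int] = (),
) -> List[List[int]]:
    """Wrap every sequence with prefix/suffix ids.

    The prefix depends on the sentence number: the first sequence gets
    `first_prefix`, all later ones get `other_prefix`. (Hypothetical helper,
    not part of any library.)
    """
    wrapped = []
    for position, seq in enumerate(sequences):
        prefix = first_prefix if position == 0 else other_prefix
        wrapped.append([*prefix, *seq, *suffix])
    return wrapped


# Hypothetical ids: 101 marks the start of the document, 102 closes a sentence.
sentences = [[7, 8], [9], [10, 11, 12]]
print(wrap_sequences(sentences, first_prefix=[101], suffix=[102]))
# [[101, 7, 8, 102], [9, 102], [10, 11, 12, 102]]
```

The same shape would apply to string prefixes (for example a T5-style task prefix) if the function operated on lists of strings before tokenization.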
Because of (2), subclasses no longer have to implement the `transform` function; they only define what prefix/suffix tokens to add.

On the pro side, it reduces code duplication (especially considering the new CustomPadding classes) and unifies the classes. On the con side, we now have one more level of inheritance...
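The trade-off can be illustrated with a small template-method sketch: a base class owns `transform`, and subclasses only say which tokens go before and after each sequence. This is an assumption about the general shape, not the actual Smashed class hierarchy; `BasePrefixSuffixMapper`, `CustomIdsMapper`, and `MaskMapper` are hypothetical names.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BasePrefixSuffixMapper(ABC):
    """Hypothetical base class: implements transform once for all subclasses."""

    def __init__(self, field: str) -> None:
        self.field = field

    @abstractmethod
    def prefix(self, position: int, total: int) -> List[int]:
        """Tokens to prepend to the sequence at `position` (0-based)."""

    @abstractmethod
    def suffix(self, position: int, total: int) -> List[int]:
        """Tokens to append to the sequence at `position` (0-based)."""

    def transform(self, sample: Dict[str, List[List[int]]]) -> Dict[str, List[List[int]]]:
        sequences = sample[self.field]
        total = len(sequences)
        wrapped = [
            self.prefix(i, total) + list(seq) + self.suffix(i, total)
            for i, seq in enumerate(sequences)
        ]
        # Return a new dict so the input sample is left untouched.
        return {**sample, self.field: wrapped}


class CustomIdsMapper(BasePrefixSuffixMapper):
    """Adds a start id before the first sequence and an end id after each one."""

    def __init__(self, field: str, start_id: int, end_id: int) -> None:
        super().__init__(field)
        self.start_id = start_id
        self.end_id = end_id

    def prefix(self, position: int, total: int) -> List[int]:
        return [self.start_id] if position == 0 else []

    def suffix(self, position: int, total: int) -> List[int]:
        return [self.end_id]


class MaskMapper(CustomIdsMapper):
    """Companion for an attention mask: every added position is attended (1)."""

    def __init__(self, field: str) -> None:
        super().__init__(field, start_id=1, end_id=1)


sample = {"input_ids": [[7, 8], [9]], "attention_mask": [[1, 1], [1]]}
sample = CustomIdsMapper("input_ids", start_id=101, end_id=102).transform(sample)
sample = MaskMapper("attention_mask").transform(sample)
print(sample)
```

The extra inheritance level is visible here: `MaskMapper` is two steps below the abstract base, but each subclass stays a few lines long.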
If some variation of this proposal fits, I can add docs and tests.
@soldni