Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Where do classes get added as special tokens? #10

Closed
NielsRogge opened this issue Aug 1, 2022 · 6 comments
Closed

Where do classes get added as special tokens? #10

NielsRogge opened this issue Aug 1, 2022 · 6 comments

Comments

@NielsRogge
Copy link

NielsRogge commented Aug 1, 2022

Hi,

I've implemented Donut as a fork of HuggingFace Transformers, and soon I'll add it to the library. The model is implemented as an instance of VisionEncoderDecoderModel, which allows to combine any vision Transformer encoder (like ViT, Swin) with any text Transformer as decoder (like BERT, GPT-2, etc.). As Donut exactly did that, it was straightforward to implement it that way.

Here's a notebook that shows inference with it.

I do have 2 questions though:

  • I prepared a toy dataset of RVL-CDIP, in order to illustrate how to fine-tune the model on document image classification. However, I wonder where the different classes get added to the special tokens of the tokenizer + decoder. The toy dataset can be loaded as follows:
from datasets import load_dataset

dataset = load_dataset("nielsr/rvl_cdip_10_examples_per_class_donut")

when using this dataset when creating an instance of DonutDataset, it seems only "<s_class>", "</s_class>" and "<s_rvlcdip>" are added as special tokens. But looking at this file, it seems that one also defines special tokens for each class. Looking at the code, it seems only keys are added, not values of the dictionaries.

  • I've uploaded all weights to the hub, currently they are all hosted under my own name (nielsr). I wonder whether we can transfer them to the naver-clova-ix organization. Of course, the names are already taken for the PyPi package of this repository, so either we can use branches within the Github repos, to specify a specific revision, either we can give priority to either HuggingFace Transformers/this PyPi package for the names.

Let me know what you think!

Kind regards,

Niels
ML Engineer @ HuggingFace

@gwkrsrch
Copy link
Collaborator

gwkrsrch commented Aug 4, 2022

Hi, @NielsRogge

For (1),
7e45119 will resolve the issue. Thank you for pointing the issue out :)
As you found, such categorical special tokens (from values of the dictionaries) are not automatically added to the vocabulary.
Those should be manually added if needed. This will help improve performance a bit, but it is optional.

For (2),
the authors are pleased with the huggingface update. As you suggested, we would like to transfer them to hf.co/naver-clova-ix/*.
I tagged the original files with the official branch. As far as I understand, this update will enable the model repositories to support both huggingface and donut-python.
donut-python will use revision= "official" instead, to help the huggingface update.

I found 5 donut repositories at hf.co/nielsr/*. (Note that there are 8 donut model repositories in hf.co/naver-clova-ix/*)
If you make PRs to the main branches, I will accept them (or is it better to update the files on my own?).

Kind regards,
Geewook Kim

@NielsRogge
Copy link
Author

Hi,

Thanks for updating that :)

Regarding uploading the checkpoints, I can open up PRs on your repos. I'll open a PR on the Transformers repository today to add the model to the library. Will update you.

@gwkrsrch
Copy link
Collaborator

gwkrsrch commented Aug 6, 2022

Great! I will wait for the updates :)
Please let me know when the PRs to the model repos are ready.

@NielsRogge
Copy link
Author

Hi,

I'll open PRs on the repos today. When these are merged, I can also merge the Donut PR linked above.

Kind regards,

Niels

@NielsRogge
Copy link
Author

NielsRogge commented Aug 12, 2022

Hi @gwkrsrch,

I've opened PRs on all 8 repos. Feel free to review and merge them :)

@gwkrsrch
Copy link
Collaborator

Hi @NielsRogge,

I have just checked that all model repositories got the PR and merged it successfully. And I see the huggingface update is now available. Awesome :)

I will close this issue since the related updates are completed and merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants