
Preprocessing of context to fit max_length #176

Open
Geethi2020 opened this issue May 21, 2021 · 1 comment

@Geethi2020

Hi, would you please help me understand how the preprocessing is done for the CovidQA corpus? I ask because the contexts in the CovidQA dataset seem to be much larger than the maximum length set in the code (which is 300+, while BERT's max_length is 512 tokens). How is the data processed to fit into that limit? I couldn't find the code for that in the repo. Please advise. Thank you.

@Timoeller
Contributor

The data is in the standard SQuAD format, so you can use processors like the ones in Hugging Face transformers or our FARM framework to convert it; see https://github.com/deepset-ai/FARM/blob/master/farm/data_handler/processor.py#L1889
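
For illustration, here is a minimal sketch of reading a SQuAD-style file in plain Python. The filename below is an assumption (point it at wherever the CovidQA JSON lives); the nesting itself is the standard SQuAD layout (`data` → `paragraphs` → `qas`):

```python
import json

# Filename is an assumption; adjust to your local copy of the dataset.
with open("COVID-QA.json") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]   # often far longer than 512 tokens
        for qa in paragraph["qas"]:
            question = qa["question"]
            answers = qa["answers"]      # [{"text": ..., "answer_start": ...}, ...]
```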

For dealing with long contexts you use sliding (moving) windows: compute the possible answers per window and combine the answers afterwards. Hope that helps! If you have detailed questions about the processing, please ask in FARM or Hugging Face transformers directly.
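
As a rough sketch of the sliding-window idea, Hugging Face fast tokenizers can split a long context into overlapping windows via `stride` and `return_overflowing_tokens`. The model name, `max_length`, and `stride` values below are example choices, not the exact settings used for CovidQA:

```python
from transformers import AutoTokenizer

# Example model; any BERT-style checkpoint with a fast tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

question = "What is the incubation period of the virus?"
context = "..."  # a context far longer than the model's 512-token limit

windows = tokenizer(
    question,
    context,
    max_length=384,                 # tokens per window, <= BERT's 512 limit
    stride=128,                     # overlap between consecutive windows
    truncation="only_second",       # split only the context, never the question
    return_overflowing_tokens=True,
    padding="max_length",
)

# One feature per window: run the QA model on each window and keep the
# highest-scoring answer span across all of them.
print(f"{len(windows['input_ids'])} windows created")
```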
