forked from facebookresearch/pytext
handle long documents in squad qa datasource and models (facebookresearch#975)

Summary:
Pull Request resolved: facebookresearch#975

The BERT encoder has a maximum sequence length, which means long paragraphs can get cut off when training or evaluating QA models. In the original SQuAD dataset it is very rare for a paragraph and answer span to be truncated, but this can be more common in other datasets. This diff implements chunking for long paragraphs such that:
- each chunk has a fixed size in terms of character length
- a minimum overlap can be specified as a fraction of the chunk size

In the metric reporter, we aggregate by sample id and return the highest-scoring span. To share chunking logic between SquadDataSource and SquadTSVDataSource, I merged them into one class, with the added benefit that we can now train on a .json file and test on a .tsv file, and vice versa.

Differential Revision: D17350243
fbshipit-source-id: fb7cfd2b40972168e24dba597bd8d0a812472f3f
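The chunking and aggregation described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual PyText implementation from the diff; the function names (`chunk_paragraph`, `best_span_per_sample`) and default parameter values are hypothetical.

```python
def chunk_paragraph(text, chunk_size=1000, min_overlap_frac=0.1):
    """Split text into fixed-size character chunks.

    Consecutive chunks overlap by min_overlap_frac * chunk_size
    characters, so an answer span near a chunk boundary is still
    fully contained in at least one chunk. Returns a list of
    (start_offset, chunk_text) pairs; offsets let answer-span
    character positions be mapped back to the original paragraph.
    """
    if len(text) <= chunk_size:
        return [(0, text)]
    overlap = int(chunk_size * min_overlap_frac)
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break
        start += stride
    return chunks


def best_span_per_sample(predictions):
    """Aggregate chunk-level predictions back to sample level.

    Each prediction is a (sample_id, span, score) triple; for every
    sample id we keep only the highest-scoring span, mirroring the
    metric-reporter aggregation described in the commit message.
    """
    best = {}
    for sample_id, span, score in predictions:
        if sample_id not in best or score > best[sample_id][1]:
            best[sample_id] = (span, score)
    return best
```

With `chunk_size=1000` and `min_overlap_frac=0.1`, a 2500-character paragraph yields chunks starting at offsets 0, 900, and 1800, each pair of neighbors sharing 100 characters.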
1 parent 7c69b97 · commit e375d1b · 3 changed files with 206 additions and 59 deletions.