This repository has been archived by the owner on Nov 22, 2022. It is now read-only.

handle long documents in squad qa datasource and models #975

Closed
wants to merge 1 commit

Conversation


@borguz commented Sep 12, 2019

Summary:
The BERT encoder has a maximum sequence length, which means long paragraphs can get cut off when training or evaluating QA models. In the original SQuAD dataset it is rare for a paragraph and its answer span to be truncated; in other datasets, however, truncation can be much more common.

This diff implements chunking for long paragraphs such that:

  • each chunk is fixed size in terms of character length
  • a minimum overlap can be specified as a fraction of chunk size
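For illustration, the chunking scheme described above might be sketched as follows. The function name, signature, and return format here are my own, not the actual PyText implementation:

```python
import math

def chunk_paragraph(text, chunk_size=1000, min_overlap_frac=0.1):
    """Split `text` into chunks of at most `chunk_size` characters,
    with at least `min_overlap_frac * chunk_size` characters shared
    between consecutive chunks. Returns (start_offset, chunk) pairs
    so predicted spans can be mapped back to the full paragraph."""
    if len(text) <= chunk_size:
        return [(0, text)]
    # Step between chunk starts; ceil keeps the overlap at or above the minimum.
    stride = chunk_size - math.ceil(min_overlap_frac * chunk_size)
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append((start, text[start : start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the paragraph
    return chunks
```

Keeping the character offsets around is what makes it possible to map each chunk-level answer span back to a position in the original paragraph.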

In metric reporter, we aggregate by sample id and return the highest scoring span.
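The aggregation step could be sketched like this (a hypothetical helper, not the actual metric reporter code): for each sample id, keep only the highest-scoring predicted span across that sample's chunks.

```python
def best_span_per_sample(predictions):
    """predictions: iterable of (sample_id, span_text, score) tuples,
    one per chunk. Returns {sample_id: (span_text, score)} keeping
    only the highest-scoring span for each sample."""
    best = {}
    for sample_id, span, score in predictions:
        if sample_id not in best or score > best[sample_id][1]:
            best[sample_id] = (span, score)
    return best
```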

To share the chunking logic between SquadDataSource and SquadTSVDataSource, I merged them into one class, with the added benefit that we can now train on a .json file and test on a .tsv file, and vice versa.

Differential Revision: D17350243

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Sep 12, 2019
borguz added a commit to borguz/pytext-1 that referenced this pull request Sep 18, 2019
borguz added a commit to borguz/pytext-1 that referenced this pull request Sep 20, 2019
borguz added a commit to borguz/pytext-1 that referenced this pull request Sep 24, 2019
Reviewed By: hikushalhere

borguz added a commit to borguz/pytext-1 that referenced this pull request Sep 24, 2019
@facebook-github-bot

This pull request has been merged in 7a98332.
