This repository has been archived by the owner on Nov 22, 2022. It is now read-only.
handle long documents in squad qa datasource and models #975
Closed
Conversation
Force-pushed from 9fe5bb9 to 168b4bc.
borguz added a commit to borguz/pytext-1 that referenced this pull request on Sep 18, 2019 (Differential Revision: D17350243, fbshipit-source-id: f86c345778739053cf2a543f54a0039f4384f32b).
borguz added a commit to borguz/pytext-1 that referenced this pull request on Sep 20, 2019 (Differential Revision: D17350243, fbshipit-source-id: fb7cfd2b40972168e24dba597bd8d0a812472f3f).
Force-pushed from 168b4bc to e375d1b.
borguz added a commit to borguz/pytext-1 that referenced this pull request on Sep 24, 2019 (Reviewed By: hikushalhere, Differential Revision: D17350243, fbshipit-source-id: 1bdfade87ce456bee95c23e8ed54affbb5d8a5d1).
Force-pushed from e375d1b to 5e8488d.
borguz added a commit to borguz/pytext-1 that referenced this pull request on Sep 24, 2019 (Reviewed By: hikushalhere, Differential Revision: D17350243, fbshipit-source-id: 1bb95de30d5d2eb5aff392e591afadce0da782be).
Force-pushed from 5e8488d to a2ec012.
A further commit referenced this pull request (Reviewed By: hikushalhere, Differential Revision: D17350243, fbshipit-source-id: 432a2088895c195e2b0df696e3ce0e0532fee710).
Force-pushed from a2ec012 to ca2a899.
This pull request has been merged in 7a98332.
Summary:
The BERT encoder has a maximum sequence length, which means long paragraphs can get cut off when training or evaluating QA models. In the original SQuAD dataset it is very rare for a paragraph and its answer span to be truncated, but this can be more common in other datasets.
This diff implements chunking for long paragraphs such that:
- each chunk is a fixed size in terms of character length
- a minimum overlap can be specified as a fraction of the chunk size
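The chunking scheme above can be sketched roughly as follows. This is a minimal illustration, not the actual PyText implementation; the function name `chunk_paragraph` and its defaults are hypothetical.

```python
def chunk_paragraph(text, chunk_size=1000, overlap_fraction=0.1):
    """Split text into fixed-size character chunks whose consecutive
    chunks overlap by at least overlap_fraction * chunk_size characters.

    Returns a list of (start_offset, chunk_text) pairs; the start offset
    is needed to map answer-span character positions back to the
    original paragraph.
    """
    # Advance by chunk_size minus the required overlap each step.
    stride = max(1, int(chunk_size * (1 - overlap_fraction)))
    chunks = []
    start = 0
    while True:
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break
        start += stride
    return chunks
```

An answer span whose character offsets fall inside a chunk would be re-indexed relative to that chunk's start offset; spans that straddle a chunk boundary are what the overlap is meant to protect against.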
In the metric reporter, we aggregate by sample id and return the highest-scoring span.
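Since each chunk of a long paragraph yields its own candidate span, the aggregation step reduces them back to one prediction per example. A minimal sketch of that reduction (the tuple layout and function name here are assumptions, not the metric reporter's real interface):

```python
def best_span_per_sample(predictions):
    """Reduce per-chunk predictions to one answer per sample.

    predictions: iterable of (sample_id, span_text, score) tuples,
    possibly several per sample_id because every chunk produces a
    candidate span. Returns {sample_id: (span_text, score)} keeping
    only the highest-scoring span for each sample.
    """
    best = {}
    for sample_id, span, score in predictions:
        if sample_id not in best or score > best[sample_id][1]:
            best[sample_id] = (span, score)
    return best
```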
To share chunking logic between SquadDataSource and SquadTSVDataSource, I merged them into one class, with the added benefit that we can now train on a .json file and test on a .tsv file and vice versa.
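The merged data source idea amounts to dispatching on file extension and normalizing both formats to the same record shape. A hedged sketch under assumed schemas (SQuAD-style JSON, and a tab-separated question/context/answer layout; the field names and `read_examples` helper are illustrative, not PyText's API):

```python
import csv
import json

def read_examples(path):
    """Yield (question, context, answers) records from either a
    SQuAD-style .json file or a .tsv file, chosen by extension."""
    if path.endswith(".json"):
        with open(path) as f:
            data = json.load(f)
        # SQuAD JSON nests paragraphs under articles, QA pairs under paragraphs.
        for article in data["data"]:
            for para in article["paragraphs"]:
                for qa in para["qas"]:
                    answers = [a["text"] for a in qa["answers"]]
                    yield qa["question"], para["context"], answers
    else:
        # Assume tab-separated columns: question, context, answer.
        with open(path) as f:
            for row in csv.reader(f, delimiter="\t"):
                question, context, answer = row[:3]
                yield question, context, [answer]
```

With a single entry point like this, the chunking described above can be applied uniformly after reading, regardless of which file format the split came from.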
Differential Revision: D17350243