This repository has been archived by the owner on Nov 22, 2022. It is now read-only.

handle long documents in squad qa datasource and models #975

Closed
wants to merge 1 commit

Conversation


@borguz commented Sep 12, 2019

Summary:
The BERT encoder has a maximum sequence length, which means long paragraphs can get cut off when training or evaluating QA models. In the original SQuAD dataset it is rare for a paragraph and its answer span to be truncated; in other datasets, however, truncation can be much more common.

This diff implements chunking for long paragraphs such that:

  • each chunk is fixed size in terms of character length
  • a minimum overlap can be specified as a fraction of chunk size
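For illustration, the chunking scheme described above might be sketched as follows. The function name, signature, and return format here are my own, not the actual PyText implementation:

```python
import math

def chunk_paragraph(text, chunk_size=1000, min_overlap_frac=0.1):
    """Split `text` into chunks of at most `chunk_size` characters,
    with at least `min_overlap_frac * chunk_size` characters shared
    between consecutive chunks. Returns (start_offset, chunk) pairs
    so predicted spans can be mapped back to the full paragraph."""
    if len(text) <= chunk_size:
        return [(0, text)]
    # Step between chunk starts; ceil keeps the overlap at or above the minimum.
    stride = chunk_size - math.ceil(min_overlap_frac * chunk_size)
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append((start, text[start : start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the paragraph
    return chunks
```

Keeping the character offsets around is what makes it possible to map each chunk-level answer span back to a position in the original paragraph.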

In metric reporter, we aggregate by sample id and return the highest scoring span.
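The aggregation step could be sketched like this (a hypothetical helper, not the actual metric reporter code): for each sample id, keep only the highest-scoring predicted span across that sample's chunks.

```python
def best_span_per_sample(predictions):
    """predictions: iterable of (sample_id, span_text, score) tuples,
    one per chunk. Returns {sample_id: (span_text, score)} keeping
    only the highest-scoring span for each sample."""
    best = {}
    for sample_id, span, score in predictions:
        if sample_id not in best or score > best[sample_id][1]:
            best[sample_id] = (span, score)
    return best
```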

To share the chunking logic between SquadDataSource and SquadTSVDataSource, I merged them into one class, with the added benefit that we can now train on a .json file and test on a .tsv file, and vice versa.

Differential Revision: D17350243

@facebook-github-bot facebook-github-bot added the CLA Signed Do not delete this pull request or issue due to inactivity. label Sep 12, 2019
borguz added a commit to borguz/pytext-1 that referenced this pull request Sep 18, 2019
borguz added a commit to borguz/pytext-1 that referenced this pull request Sep 20, 2019
borguz added a commit to borguz/pytext-1 that referenced this pull request Sep 24, 2019
Reviewed By: hikushalhere

borguz added a commit to borguz/pytext-1 that referenced this pull request Sep 24, 2019
@facebook-github-bot

This pull request has been merged in 7a98332.
