This repository has been archived by the owner on Nov 22, 2022. It is now read-only.

Improve CSV support in TSVDataSource #777

Closed


Titousensei
Contributor

Summary:
The TSVDataSource documentation for the delimiter param says: Change to ","
for csv. However, CSV files often contain quoted fields, which TSVDataSource
does not support.

We cannot blindly treat every field that starts with a quote as a quoted
field, because that would drastically change the behavior: an unclosed quoted
field would merge with the following rows, swallowing \n characters until a
closing quote is found. Some data sets might contain fields with unclosed
quotes and rely on the current behavior.

This diff adds a parameter to the TSVDataSource config that lets users
specify whether fields should be parsed as quoted fields. The default is
False, which preserves the current behavior.
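
To illustrate the trade-off described above, here is a minimal sketch using Python's standard csv module. It is not the code from this diff: the helper read_rows and its quoted flag are hypothetical names chosen for the example, and the real TSVDataSource parameter may be named and implemented differently.

```python
import csv
import io


def read_rows(text, delimiter=",", quoted=False):
    """Parse delimited text into a list of rows.

    `quoted` is a hypothetical flag for this sketch: when False, quote
    characters are treated as ordinary data; when True, they delimit fields.
    """
    quoting = csv.QUOTE_MINIMAL if quoted else csv.QUOTE_NONE
    # newline="" keeps line endings untouched so the csv reader sees them as-is.
    stream = io.StringIO(text, newline="")
    return list(csv.reader(stream, delimiter=delimiter, quoting=quoting))


# A well-formed quoted field that contains the delimiter.
well_formed = 'pos,"hello, world"\n'
unquoted_rows = read_rows(well_formed, quoted=False)  # the comma splits the field
quoted_rows = read_rows(well_formed, quoted=True)     # the field stays whole

# A field with an opening quote that is never closed.
unclosed = 'pos,"unclosed\nneg,fine\n'
rows_legacy = read_rows(unclosed, quoted=False)  # two rows, quote kept as data
rows_merged = read_rows(unclosed, quoted=True)   # second line swallowed into the first row
```

With quoting disabled, the unclosed quote is harmless and each line remains its own row. With quoting enabled, the same input collapses into a single row, which is why the new behavior is opt-in.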

Differential Revision: D16232774

@facebook-github-bot added the CLA Signed label Jul 12, 2019
Titousensei added a commit to Titousensei/pytext that referenced this pull request Jul 13, 2019
fbshipit-source-id: 3286c6cceb04ec182a155595a38961a20b2c1c04
@Titousensei
Contributor Author

Resolves issue #747

Titousensei added a commit to Titousensei/pytext that referenced this pull request Jul 19, 2019
fbshipit-source-id: 8a152feaf22f25fbef6892906c55704452624815
Titousensei added a commit to Titousensei/pytext that referenced this pull request Jul 22, 2019
fbshipit-source-id: 652f91e1462a010185934083f4dbcf82b99e8428
Titousensei added a commit to Titousensei/pytext that referenced this pull request Jul 23, 2019
fbshipit-source-id: 0110293a19f1179cee70b53060123e685a7af988
fbshipit-source-id: 5bd625f95b8795d7d5cd07774d41b099b4b3766e
@facebook-github-bot
Contributor

This pull request has been merged in 7ae0b2a.

@Titousensei deleted the export-D16232774 branch July 31, 2019 21:47