Support for CSV Encoding (UTF-16 and Latin-1) #14560

Merged: 16 commits, Nov 5, 2024

Contributor

@pdet pdet commented Oct 25, 2024

This PR adds support for reading various encodings (in addition to UTF-8) in the CSV Reader.

The mechanism should be relatively easy to extend to other encodings, as the more challenging part is handling the end-of-CSV buffers.
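To illustrate why buffer boundaries are the harder part, here is a minimal Python sketch. It is not the DuckDB C++ code from this PR, and it assumes that the difficulty with "end-of-CSV buffers" is that a file read in fixed-size chunks can split a multi-byte character (for example a UTF-16 code unit or surrogate pair) across two chunks. The function name `transcode_to_utf8` is hypothetical; `codecs.getincrementaldecoder` is the standard-library call that carries the incomplete tail over to the next chunk.

```python
import codecs


def transcode_to_utf8(chunks, encoding):
    """Convert byte chunks in `encoding` to UTF-8, one output piece per chunk.

    The incremental decoder keeps any trailing bytes that do not yet form a
    complete character and prepends them to the next chunk.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pieces = []
    for chunk in chunks:
        pieces.append(decoder.decode(chunk).encode("utf-8"))
    # Flush: raises UnicodeDecodeError if the input ended mid-character.
    pieces.append(decoder.decode(b"", final=True).encode("utf-8"))
    return pieces
```

Decoding each chunk independently with `bytes.decode` would fail or corrupt data whenever a chunk ends in the middle of a character, which is the case the carry-over handles.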

I'm aware of at least two discussions that have requested this feature, specifically #9783 and #9436.

From these requests, I believe we are still missing support for Shift-JIS, though implementing the mapping doesn’t seem very complicated.
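To give a sense of what a mapping involves, here is a Python sketch of the simplest case, Latin-1 to UTF-8. It is an illustration only, not the code in this PR, and the function name `latin1_to_utf8` is hypothetical. Latin-1 bytes map one-to-one onto code points U+0000 to U+00FF, so bytes below 0x80 are copied and the rest expand to two UTF-8 bytes. Shift-JIS is a multi-byte, table-driven encoding, so its mapping would need lookup tables rather than this arithmetic.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Map Latin-1 (ISO-8859-1) bytes to their UTF-8 encoding."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            # ASCII range: identical in Latin-1 and UTF-8.
            out.append(b)
        else:
            # U+0080..U+00FF: two-byte UTF-8 form 110000xx 10xxxxxx.
            out.append(0xC0 | (b >> 6))
            out.append(0x80 | (b & 0x3F))
    return bytes(out)
```

Because every Latin-1 byte is a complete character, this mapping never has to deal with a character split across buffers, unlike UTF-16.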

I think we should centralize encoding requests in one place and tag it as a 'PRs welcome' issue, as implementing new encodings should be relatively easy (aside from the decoding method itself).

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 25, 2024 17:18
@pdet pdet marked this pull request as ready for review October 25, 2024 17:18
Collaborator

@Mytherin Mytherin left a comment

Thanks for the PR! LGTM - two questions:

  • Could we move the decoding logic to a separate class, EncodingFunction, that is stored in an EncodingFunctionSet in the DBConfig, similar to the CompressionFunction? We can then do the actual decoding (and encoding?) using a few callbacks. This would allow extensions to add new encodings. To make this work, we would also need to use strings rather than an enum for encoding names.
  • Could we add auto-detection for the different encodings as well? This can also be done in a future PR if it is a lot of work.
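The two suggestions above can be sketched roughly as follows. This is a conceptual Python sketch, not DuckDB's C++ implementation: the class names mirror the ones proposed in the review, but every field and method (`decode`, `register`, `lookup`, `default_encodings`, `sniff_bom`) is an assumption made for illustration. The auto-detection shown is plain byte-order-mark sniffing, which is only one possible approach and was not specified in this thread.

```python
import codecs
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class EncodingFunction:
    """Hypothetical stand-in: an encoding name plus its decode callback."""
    name: str
    decode: Callable[[bytes], str]


class EncodingFunctionSet:
    """Hypothetical registry keyed by case-insensitive encoding name."""

    def __init__(self) -> None:
        self._functions: Dict[str, EncodingFunction] = {}

    @staticmethod
    def _key(name: str) -> str:
        return name.strip().lower()

    def register(self, function: EncodingFunction) -> None:
        # String keys (not an enum) let callers add encodings at runtime.
        self._functions[self._key(function.name)] = function

    def lookup(self, name: str) -> EncodingFunction:
        try:
            return self._functions[self._key(name)]
        except KeyError:
            raise ValueError(f"unknown encoding: {name!r}") from None


def default_encodings() -> EncodingFunctionSet:
    """Registry holding the encodings named in this PR."""
    registry = EncodingFunctionSet()
    registry.register(EncodingFunction("utf-8", lambda b: b.decode("utf-8-sig")))
    registry.register(EncodingFunction("utf-16", lambda b: b.decode("utf-16")))
    registry.register(EncodingFunction("latin-1", lambda b: b.decode("latin-1")))
    return registry


def sniff_bom(prefix: bytes) -> Optional[str]:
    """Guess an encoding from a byte-order mark; None if there is no BOM.

    Caveat: a UTF-32-LE BOM also begins with the UTF-16-LE BOM bytes, so a
    fuller detector would need to check for that first.
    """
    if prefix.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if prefix.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None
```

With string names, something playing the role of an extension could add an encoding without touching the core, for example `registry.register(EncodingFunction("shift-jis", lambda b: b.decode("shift_jis")))`. Files without a BOM, which includes Latin-1, would need a different detection strategy than the one sketched here.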

@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 30, 2024 14:00
@pdet pdet marked this pull request as ready for review October 30, 2024 14:02
@duckdb-draftbot duckdb-draftbot marked this pull request as draft October 31, 2024 10:52
@pdet pdet marked this pull request as ready for review October 31, 2024 11:14
@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 1, 2024 10:05
@pdet pdet marked this pull request as ready for review November 1, 2024 10:08
Collaborator

@Mytherin Mytherin left a comment

Thanks for the changes! Looks great, some more comments (mostly nits):

@duckdb-draftbot duckdb-draftbot marked this pull request as draft November 4, 2024 13:31
@pdet pdet marked this pull request as ready for review November 4, 2024 13:34
@Mytherin Mytherin changed the base branch from feature to main November 5, 2024 10:23
@Mytherin Mytherin merged commit 1317872 into duckdb:main Nov 5, 2024
42 checks passed
Collaborator

Mytherin commented Nov 5, 2024

Thanks!

@pdet pdet deleted the csv_encoding branch November 27, 2024 12:31
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 20, 2024
Support for CSV Encoding (UTF-16 and Latin-1) (duckdb/duckdb#14560)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 20, 2024
Support for CSV Encoding (UTF-16 and Latin-1) (duckdb/duckdb#14560)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
@szarnyasg szarnyasg added the Needs Documentation Use for issues or PRs that require changes in the documentation label Jan 22, 2025