CSV Reader - 4 byte delimiters #14670

Merged: 30 commits merged into duckdb:main from multi_byte_delimiter on Nov 29, 2024

@pdet (Contributor) commented Nov 1, 2024

This PR introduces support for delimiters of up to 4 bytes in the CSV Reader. I chose 4 bytes primarily to allow 🦆 to be used as a delimiter :-)
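As a quick check of why 4 bytes is enough for 🦆, the snippet below encodes the character as UTF-8 and prints its length and bytes. It assumes the delimiter is handled as UTF-8 bytes; the variable names are illustrative only.

```python
# Assumption: the delimiter is compared byte by byte in its UTF-8 encoding.
duck = "🦆"  # U+1F986 DUCK
encoded = duck.encode("utf-8")

# A code point above U+FFFF needs four bytes in UTF-8.
print(len(encoded), encoded.hex(" "))  # prints: 4 f0 9f a6 86
```

Four bytes is also the longest UTF-8 encoding of any single code point, so a 4-byte limit covers any one-character delimiter.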

The parsing is implemented by adding intermediate states to the state machine that handle the additional delimiter bytes.
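To illustrate the idea of intermediate states, here is a minimal Python sketch of a streaming splitter, not the code from this PR. The state is the number of delimiter bytes matched so far, so a 4-byte delimiter has intermediate states 1 to 3. The names `DelimiterSplitter`, `build_fallback`, and `split_chunks` are hypothetical, and the KMP-style fallback table used to recover from a partial match is an assumption of this sketch rather than a description of how DuckDB handles it.

```python
def build_fallback(delim: bytes) -> list[int]:
    """KMP prefix table: state to fall back to after a mismatch."""
    fallback = [0] * len(delim)
    k = 0
    for i in range(1, len(delim)):
        while k and delim[i] != delim[k]:
            k = fallback[k - 1]
        if delim[i] == delim[k]:
            k += 1
        fallback[i] = k
    return fallback


class DelimiterSplitter:
    """Hypothetical splitter; state = delimiter bytes matched so far."""

    def __init__(self, delim: bytes) -> None:
        if not 1 <= len(delim) <= 4:
            raise ValueError("delimiter must be 1 to 4 bytes")
        self.delim = delim
        self.fallback = build_fallback(delim)
        self.state = 0            # survives across feed() calls
        self.field = bytearray()  # bytes seen since the last delimiter

    def feed(self, chunk: bytes) -> list[bytes]:
        """Consume one buffer and return the fields completed in it."""
        done = []
        for byte in chunk:
            self.field.append(byte)
            # On a mismatch, fall back to the longest prefix still matching.
            while self.state and byte != self.delim[self.state]:
                self.state = self.fallback[self.state - 1]
            if byte == self.delim[self.state]:
                self.state += 1
            if self.state == len(self.delim):
                # The last len(delim) bytes of the field are the delimiter.
                done.append(bytes(self.field[:-len(self.delim)]))
                self.field.clear()
                self.state = 0
        return done

    def finish(self) -> bytes:
        """Return the trailing field after the last buffer."""
        last = bytes(self.field)
        self.field.clear()
        self.state = 0
        return last


def split_chunks(delim: bytes, chunks: list[bytes]) -> list[bytes]:
    splitter = DelimiterSplitter(delim)
    fields = []
    for chunk in chunks:
        fields.extend(splitter.feed(chunk))
    fields.append(splitter.finish())
    return fields


# Repeated pattern: "ABAB" must not swallow the real "ABAC" that follows.
print(split_chunks(b"ABAC", [b"ABAB", b"ACcd"]))  # [b'AB', b'cd']
# Delimiter split across two buffers, in the middle of the duck.
data = "a🦆b".encode("utf-8")
print(split_chunks("🦆".encode("utf-8"), [data[:3], data[3:]]))  # [b'a', b'b']
```

Because the match state is kept between `feed` calls, a delimiter that starts in one buffer and ends in the next is still recognized, and the fallback table covers delimiters whose prefix repeats, such as ABAC.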

The PR includes several tests to verify that repeated character patterns (e.g., ABAC) are correctly parsed, and that multi-byte delimiters are properly handled in cases where values span multiple buffers.

I'm happy to add more tests if anyone has additional ideas for interesting edge cases to stress-test.

@duckdb-draftbot marked this pull request as draft November 4, 2024 13:54
@pdet marked this pull request as ready for review November 4, 2024 13:57
@duckdb-draftbot marked this pull request as draft November 5, 2024 11:03
@pdet marked this pull request as ready for review November 5, 2024 11:09
@Mytherin (Collaborator) left a comment

Thanks for the PR! Looks great - some comments below

@duckdb-draftbot marked this pull request as draft November 6, 2024 15:57
@Mytherin changed the base branch from feature to main November 6, 2024 16:25
@Mytherin marked this pull request as ready for review November 7, 2024 14:39
@duckdb-draftbot marked this pull request as draft November 7, 2024 15:16
@Mytherin marked this pull request as ready for review November 7, 2024 16:35
@duckdb-draftbot marked this pull request as draft November 25, 2024 14:17
@pdet marked this pull request as ready for review November 25, 2024 16:33
@duckdb-draftbot marked this pull request as draft November 26, 2024 10:09
@pdet marked this pull request as ready for review November 26, 2024 10:10
@duckdb-draftbot marked this pull request as draft November 27, 2024 10:03
@pdet marked this pull request as ready for review November 27, 2024 12:00
@duckdb-draftbot marked this pull request as draft November 27, 2024 15:27
@pdet marked this pull request as ready for review November 27, 2024 15:27
@duckdb-draftbot marked this pull request as draft November 28, 2024 09:33
@pdet marked this pull request as ready for review November 28, 2024 09:35
@Mytherin merged commit ef43a0d into duckdb:main Nov 29, 2024
40 of 41 checks passed
@Mytherin (Collaborator) commented

Thanks!

krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
CSV Reader - 4 byte delimiters (duckdb/duckdb#14670)
Force download doesn't require to do a head request (duckdb/duckdb#14979)
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
CSV Reader - 4 byte delimiters (duckdb/duckdb#14670)
Force download doesn't require to do a head request (duckdb/duckdb#14979)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Dec 27, 2024
CSV Reader - 4 byte delimiters (duckdb/duckdb#14670)
Force download doesn't require to do a head request (duckdb/duckdb#14979)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>
@pdet deleted the multi_byte_delimiter branch January 12, 2025 16:29
@szarnyasg added the Needs Documentation label Jan 22, 2025
Labels: Needs Documentation (use for issues or PRs that require changes in the documentation)
3 participants