Allow JSON reader to sample across multiple files #8891

Merged · 4 commits · Sep 18, 2023

lnkuiper
Contributor

Following the discussion in #8865, I've changed schema sampling in the JSON reader.

Until now, the JSON reader has behaved like our other readers (CSV/Parquet) when reading multiple files, e.g., with a glob: sample_size rows of the first file are sampled to determine the schema, and the remaining files are assumed to have the same schema as the first one. With union_by_name=true, we instead sample sample_size rows from each input file to determine the schema.

Having many small JSON files is very common, more so than with CSV/Parquet. In that case our default sampling strategy works poorly: if the first file does not contain all the fields, we throw an error at some point during parsing. Setting union_by_name=true works around this, but it has performance implications and requires users to look at the documentation to figure out how to combine schemas.
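To make the failure mode concrete, here is a minimal Python sketch that models schema detection as the set of field names seen in the sampled rows. It is a toy model under stated assumptions, not DuckDB's implementation: the function names and the in-memory "files" are hypothetical, and real schema detection also infers types.

```python
from typing import Any

Row = dict[str, Any]


def schema_from_first_file(files: list[list[Row]], sample_size: int) -> set[str]:
    """Toy model of first-file-only sampling (hypothetical helper)."""
    sampled = files[0][:sample_size]
    return {key for row in sampled for key in row}


def schema_union_by_name(files: list[list[Row]], sample_size: int) -> set[str]:
    """Toy model of union_by_name=true: sample every file and combine the fields."""
    schema: set[str] = set()
    for rows in files:
        for row in rows[:sample_size]:
            schema.update(row)
    return schema


# Two small made-up files; only the second one contains the field "b".
files = [
    [{"a": 1}],
    [{"a": 2, "b": "x"}],
]

first_only = schema_from_first_file(files, sample_size=20480)
unioned = schema_union_by_name(files, sample_size=20480)

# Fields present in the data but absent from the first-file schema.
all_fields = {key for rows in files for row in rows for key in row}
missing = all_fields - first_only
```

In this model the first-file schema never sees "b", which corresponds to the parse error described above, while the union-by-name variant finds it at the cost of touching every file.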

In general, we'd like to be able to parse most inputs simply by running SELECT * FROM '*.json'. Therefore, when union_by_name=false (the default), schema sampling now always covers the first sample_size rows, even if this requires sampling from multiple files. For example, with sample_size=6 and two JSON files of 5 rows each, we sample all 5 rows of the first file and the first row of the second file to determine the schema. The default sample_size is 20480.
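The sample_size=6 example can be traced with a short Python sketch. This is only an illustration of the row-selection rule described above, assuming files are visited in order; sample_across_files and the in-memory data are hypothetical and do not reflect the reader's actual code.

```python
from typing import Any

Row = dict[str, Any]


def sample_across_files(files: list[list[Row]], sample_size: int) -> list[tuple[int, int]]:
    """Return (file_index, row_index) pairs for the first sample_size rows,
    continuing into later files when earlier ones run out of rows."""
    sampled: list[tuple[int, int]] = []
    for file_index, rows in enumerate(files):
        for row_index in range(len(rows)):
            if len(sampled) == sample_size:
                return sampled
            sampled.append((file_index, row_index))
    return sampled


# Two made-up files with 5 rows each; only the second has the field "extra".
files = [
    [{"id": i} for i in range(5)],
    [{"id": 5 + i, "extra": True} for i in range(5)],
]

picked = sample_across_files(files, sample_size=6)

# Number of sampled rows taken from each file.
rows_per_file = [sum(1 for f, _ in picked if f == n) for n in range(len(files))]

# Field names visible in the sample (a stand-in for the detected schema).
schema = {key for f, r in picked for key in files[f][r]}
```

Because the sample spills over into the second file, the field "extra" is seen without enabling union_by_name. If the total row count is below sample_size, the function simply returns every row.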

I hope this will make querying JSON more pleasant for our users.

@github-actions github-actions bot marked this pull request as draft September 12, 2023 13:30
@lnkuiper lnkuiper marked this pull request as ready for review September 12, 2023 13:31
@lnkuiper
Contributor Author

I also found that reading piped JSON in a recursive CTE didn't work. That is fixed in this PR as well.

@github-actions github-actions bot marked this pull request as draft September 18, 2023 07:08
@lnkuiper lnkuiper marked this pull request as ready for review September 18, 2023 07:08
@Mytherin Mytherin merged commit 6e4f454 into duckdb:main Sep 18, 2023
50 checks passed
@Mytherin
Collaborator

Thanks! LGTM

@lnkuiper lnkuiper deleted the json_multi_file_reader branch November 24, 2023 13:37

2 participants