Allow JSON reader to sample across multiple files #8891
Merged
Following the discussion in #8865, I've changed schema sampling in the JSON reader.
The JSON reader follows the functionality of our other readers (CSV/Parquet) when it comes to reading multiple files, e.g., with a glob: the first `sample_size` rows of the first file are sampled to determine the schema, and the rest of the files are assumed to have the same schema as the first one. However, if `union_by_name=true`, then we sample `sample_size` rows from each input file to determine the schema.

More so than with CSV/Parquet, having many small JSON files is very common. If this is the case, our default sampling strategy is quite poor: if the first file does not contain all the fields, we will throw an error at some point during parsing. We can set `union_by_name=true`, but this has performance implications and requires users to look at the documentation to figure out how to combine schemas.

In general, we'd like to be able to parse most inputs simply by doing `SELECT * FROM '*.json';`. Therefore, I've changed our schema sampling to always sample the first `sample_size` rows when `union_by_name=false` (the default), even if this requires sampling from multiple files. For example, if we have `sample_size=6` and we're querying two JSON files with 5 rows each, we will sample the first 5 rows of the first file and the first row of the second file to determine the schema. The default `sample_size` is 20480.

I hope this will make querying JSON more pleasant for our users.
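To illustrate the cross-file sampling, here is a minimal Python sketch of the idea (not DuckDB's actual implementation; the helper names and the newline-delimited-JSON representation are my own assumptions):

```python
import json
from typing import Iterable


def sample_rows(files: dict[str, list[str]], sample_size: int) -> list[dict]:
    """Collect up to sample_size rows, continuing into later files when
    earlier ones run out. `files` maps file name -> NDJSON lines."""
    samples: list[dict] = []
    for lines in files.values():
        for line in lines:
            if len(samples) == sample_size:
                return samples
            samples.append(json.loads(line))
    return samples


def infer_schema(rows: Iterable[dict]) -> set[str]:
    """Toy "schema": the union of all field names seen in the sample."""
    fields: set[str] = set()
    for row in rows:
        fields.update(row)
    return fields


# Two 5-row files with sample_size=6: 5 rows come from the first file and
# 1 from the second, so field "b" (present only in file2) is still found.
file1 = ['{"a": %d}' % i for i in range(5)]
file2 = ['{"a": %d, "b": %d}' % (i, i) for i in range(5)]
rows = sample_rows({"f1.json": file1, "f2.json": file2}, sample_size=6)
print(len(rows), sorted(infer_schema(rows)))  # 6 ['a', 'b']
```

With the old strategy, only `file1` would have been sampled and queries touching `b` would fail during parsing; spilling the sample into the second file discovers the extra field without needing `union_by_name=true`.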