Allow JSON reader to sample across multiple files #8891
Merged
Following the discussion in #8865, I've changed schema sampling in the JSON reader.
The JSON reader follows the functionality of our other readers (CSV/Parquet) when it comes to reading multiple files, e.g., with a glob: the first `sample_size` rows of the first file are sampled to determine the schema, and the rest of the files are assumed to have the same schema as the first one. However, if `union_by_name=true`, then we sample `sample_size` rows from each input file to determine the schema.

More so than with CSV/Parquet, having many small JSON files is very common. If this is the case, our default sampling strategy is quite poor: if the first file does not contain all the fields, we will throw an error at some point during parsing. We can set `union_by_name=true`, but this has performance implications and requires users to look at the documentation to figure out how to combine schemas.

In general, we'd like to be able to parse most inputs simply by doing `SELECT * FROM '*.json';`. Therefore, I've changed our schema sampling to always sample the first `sample_size` rows when `union_by_name=false` (the default), even if this requires sampling from multiple files. For example, if we have `sample_size=6` and we're querying two JSON files with 5 rows each, we will sample the first 5 rows of the first file and the first row of the second file to determine the schema. The default `sample_size` is 20480.

I hope this will make querying JSON more pleasant for our users.
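To illustrate the cross-file sampling, here is a minimal Python sketch of the idea (not DuckDB's actual implementation; the helper names and the newline-delimited-JSON representation are my own assumptions):

```python
import json
from typing import Iterable


def sample_rows(files: dict[str, list[str]], sample_size: int) -> list[dict]:
    """Collect up to sample_size rows, continuing into later files when
    earlier ones run out. `files` maps file name -> NDJSON lines."""
    samples: list[dict] = []
    for lines in files.values():
        for line in lines:
            if len(samples) == sample_size:
                return samples
            samples.append(json.loads(line))
    return samples


def infer_schema(rows: Iterable[dict]) -> set[str]:
    """Toy "schema": the union of all field names seen in the sample."""
    fields: set[str] = set()
    for row in rows:
        fields.update(row)
    return fields


# Two 5-row files with sample_size=6: 5 rows come from the first file and
# 1 from the second, so field "b" (present only in file2) is still found.
file1 = ['{"a": %d}' % i for i in range(5)]
file2 = ['{"a": %d, "b": %d}' % (i, i) for i in range(5)]
rows = sample_rows({"f1.json": file1, "f2.json": file2}, sample_size=6)
print(len(rows), sorted(infer_schema(rows)))  # 6 ['a', 'b']
```

With the old strategy, only `file1` would have been sampled and queries touching `b` would fail during parsing; spilling the sample into the second file discovers the extra field without needing `union_by_name=true`.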