allows nested types in BigQuery via autodetect_schema #1591
Conversation
```py
table = conn.execute(f"SELECT * FROM read_json_auto('{json_file_path}')").fetch_arrow_table()
```

Note that **duckdb** and **pyarrow** methods will generate [nested types](#loading-nested-types) for nested data, which are only partially by `dlt`.
there is a word missing here, probably "supported"
@rudolfix the PR looks good, but I have a question: are you sure that this schema autodetection really creates the schema in exactly the same way in BigQuery as our explicit code would (with the exception of the complex fields, of course)? At least that is what I understand the docs to say. I would at least add a test that checks all data types, or something like that.
Or maybe it would be possible to have only the complex fields be autodetected and run the other migrations the way we always do, but I am actually not sure this makes sense.
@sh-rp the schema is identical in the case of parquet (because the parquet schema is the source of truth for both BigQuery and dlt). The exception is JSON documents, which are inferred differently, so our schema does not match 100% in types and column order. What is unfortunate: I can't create some columns with dlt and some (complex) columns with BigQuery, because it complains that the schema is different. I don't think the problem is too great. Anyway, this is a trick for power users until #1592 is done.
@rudolfix can we do this with Redshift SUPER type? Loading from jsonl. |
Description
This PR allows users to offload schema inference and evolution (type inference from data, creating and altering tables) to BigQuery.
As a consequence, nested data types may be created (if present in arrow tables/parquet, or inferred from JSON).
Besides being useful in itself, it is a PoC of how nested type support may look in `dlt`. `max_table_nesting` still works: the user has a choice of how far `dlt` should decompose JSON into tables and when to switch to nested types.