allows nested types in BigQuery via autodetect_schema #1591

Merged (6 commits) on Jul 16, 2024

@rudolfix rudolfix commented Jul 14, 2024

Description

This PR lets users offload schema inference and evolution (inferring types from data, creating and altering tables) to BigQuery.
As a consequence, nested data types may be created (if present in arrow tables/parquet or inferred from json).
Besides being useful in itself, it is a PoC of how nested type support may look in dlt.

  1. in case of arrow tables / parquet files containing nested fields, such fields are created and migrated/evolved in the destination
  2. in case of json documents, nested fields are inferred and stored as nested types. max_table_nesting still works - the user chooses how far dlt should decompose json into tables and when to switch to nested types
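
To make the second point concrete, here is a minimal sketch of the trade-off that max_table_nesting controls. It is not dlt's implementation: the `decompose` helper, the `root` table name, and the `parent__child` naming are assumptions made for illustration, and parent/child link columns are left out. Lists of objects within the nesting budget become child tables; anything deeper stays in the row as a nested value, which is what the destination would then store as a nested type.

```python
from typing import Any, Optional


def decompose(
    doc: dict[str, Any],
    max_nesting: int,
    table: str = "root",
    level: int = 0,
    out: Optional[dict[str, list[dict[str, Any]]]] = None,
) -> dict[str, list[dict[str, Any]]]:
    """Hypothetical helper: split `doc` into tables down to `max_nesting`.

    Values nested deeper than `max_nesting` are kept in the row unchanged.
    """
    out = {} if out is None else out
    row: dict[str, Any] = {}
    for key, value in doc.items():
        is_child_list = (
            isinstance(value, list)
            and bool(value)
            and all(isinstance(item, dict) for item in value)
        )
        if is_child_list and level < max_nesting:
            # Within the nesting budget: each item becomes a row in a child table.
            for item in value:
                decompose(item, max_nesting, f"{table}__{key}", level + 1, out)
        else:
            # Scalars, and anything past the budget, stay in the row as-is.
            row[key] = value
    out.setdefault(table, []).append(row)
    return out
```

With a budget of 0 the whole document stays in one row with nested values; raising the budget moves one more level of lists into separate tables.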

rudolfix force-pushed the feat/allows-autodetect-schema-bigquery branch from e744b4b to b5df237 on July 15, 2024 08:58
```python
table = conn.execute(f"SELECT * FROM read_json_auto('{json_file_path}')").fetch_arrow_table()
```

Note that **duckdb** and **pyarrow** methods will generate [nested types](#loading-nested-types) for nested data, which are only partially by `dlt`.

there is a word missing here, probably "supported"

sh-rp commented Jul 15, 2024

@rudolfix the PR looks good, but I have a question. Are you sure that this schema auto detection really creates the schema the exact same way in bigquery as our explicit code would (with the exception of the complex fields of course)? At least this is what I understand it says in the docs. I think I would at least add a test that checks all data types or something like that.

sh-rp commented Jul 15, 2024

Or maybe it would be possible to only have the complex fields be autodetected and run the other migrations the way we always do, but I am actually not sure this makes sense.

rudolfix commented

@sh-rp the schema is identical in case of parquet (because parquet schema is source of truth both for bigquery and dlt) - with the exception of complex fields which here are structs and in our case JSON

JSON documents are inferred differently and our schema does not match 100% regarding the types and column order. What is unfortunate - I can't create some columns with dlt and some (complex) with BigQuery because it complains that the schema is different. I do not think the problem is too great. The biggest difference so far is _dlt_load_id that is inferred as float

anyway this is a trick for power users until #1592 is done
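
A check like the one asked for in this thread could start from a small schema diff. The sketch below is only an illustration: `diff_schemas` is a hypothetical helper, schemas are modelled as plain ordered `{column: type}` mappings, and the type names in the usage note are made up apart from the float `_dlt_load_id` mentioned above. It reports type mismatches, missing and unexpected columns, and whether the shared columns appear in a different order.

```python
def diff_schemas(expected: dict[str, str], actual: dict[str, str]) -> dict[str, object]:
    """Hypothetical helper: compare two ordered {column: type} mappings."""
    # Columns present on both sides, in the order of each mapping.
    shared_expected_order = [col for col in expected if col in actual]
    shared_actual_order = [col for col in actual if col in expected]
    return {
        "type_mismatches": {
            col: (expected[col], actual[col])
            for col in shared_expected_order
            if expected[col] != actual[col]
        },
        "missing": [col for col in expected if col not in actual],
        "unexpected": [col for col in actual if col not in expected],
        "order_differs": shared_expected_order != shared_actual_order,
    }
```

For example, `diff_schemas({"id": "INT64", "_dlt_load_id": "STRING"}, {"_dlt_load_id": "FLOAT64", "id": "INT64"})` flags `_dlt_load_id` as a type mismatch and reports that the column order differs.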

rudolfix merged commit 401c62d into devel on Jul 16, 2024
1 check passed
rudolfix deleted the feat/allows-autodetect-schema-bigquery branch on July 16, 2024 08:15
Deninc commented Sep 8, 2024

@rudolfix can we do this with Redshift SUPER type? Loading from jsonl.
