allows nested types in BigQuery via autodetect_schema #1591

rudolfix · 2024-07-14T19:44:58Z

Description

This PR allows users to offload schema inference and evolution (type inference from data, create and alter tables) to BigQuery.
As a consequence, nested data types may be created (if present in arrow tables/parquet or inferred from JSON).
Besides being useful in itself, it is a PoC of how nested type support may look in dlt.

In the case of arrow tables / parquet files containing nested fields, such fields are created and migrated/evolved in the destination.
In the case of JSON documents, nested fields are inferred and stored as nested types. max_table_nesting still works: the user chooses how far dlt should decompose JSON into tables and when to switch to nested types.
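
To illustrate the max_table_nesting trade-off described above, here is a minimal sketch in plain Python. It is not dlt's normalizer: the `decompose` helper, the `orders` table name, and the `parent__child` naming used here are assumptions made for the example. Lists of objects become child tables until the nesting limit is reached; anything deeper stays in the row as a nested value, which is what a destination with nested type support could then store directly.

```python
from typing import Any, Dict, List, Optional

Tables = Dict[str, List[Dict[str, Any]]]


def decompose(
    doc: Dict[str, Any],
    table: str,
    max_nesting: int,
    level: int = 0,
    tables: Optional[Tables] = None,
) -> Tables:
    """Split lists of objects into child tables, at most `max_nesting` levels deep.

    Hypothetical helper for illustration only; it does not reproduce dlt's behavior.
    """
    tables = {} if tables is None else tables
    row: Dict[str, Any] = {}
    for key, value in doc.items():
        is_object_list = (
            isinstance(value, list)
            and bool(value)
            and all(isinstance(item, dict) for item in value)
        )
        if is_object_list and level < max_nesting:
            # Still within the limit: move the list into its own child table.
            for item in value:
                decompose(item, f"{table}__{key}", max_nesting, level + 1, tables)
        else:
            # Scalars, and anything past the limit, stay in the row as-is.
            row[key] = value
    tables.setdefault(table, []).append(row)
    return tables


doc = {"id": 1, "items": [{"sku": "a", "tags": [{"name": "x"}]}]}
flat = decompose(doc, "orders", max_nesting=0)   # one table, fully nested row
split = decompose(doc, "orders", max_nesting=1)  # items split out, tags kept nested
```

With `max_nesting=0` the whole document lands in a single `orders` row; with `max_nesting=1` the `items` list becomes an `orders__items` table while `tags` remains a nested value inside each item row.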


netlify · 2024-07-14T19:45:16Z

✅ Deploy Preview for dlt-hub-docs ready!

Name	Link
🔨 Latest commit	`d7e0138`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/66962c2e74f6cd0008e98c2a
😎 Deploy Preview	https://deploy-preview-1591--dlt-hub-docs.netlify.app


sh-rp · 2024-07-15T13:32:29Z

docs/website/docs/dlt-ecosystem/verified-sources/arrow-pandas.md

+table = conn.execute(f"SELECT * FROM read_json_auto('{json_file_path}')").fetch_arrow_table()
+```
+
+Note that **duckdb** and **pyarrow** methods will generate [nested types](#loading-nested-types) for nested data, which are only partially by `dlt`.


there is a word missing here, probably "supported"

sh-rp · 2024-07-15T13:45:10Z

@rudolfix the PR looks good, but I have a question. Are you sure that this schema auto detection really creates the schema the exact same way in bigquery as our explicit code would (with the exception of the complex fields of course)? At least this is what I understand it says in the docs. I think I would at least add a test that checks all data types or something like that.

sh-rp · 2024-07-15T13:47:18Z

Or maybe it would be possible to only have the complex fields be autodetected and run the other migrations the way we always do, but I am actually not sure this makes sense.

rudolfix · 2024-07-15T19:51:31Z

@sh-rp the schema is identical in case of parquet (because parquet schema is source of truth both for bigquery and dlt) - with the exception of complex fields which here are structs and in our case JSON

JSON documents are inferred differently and our schema does not match 100% regarding the types and column order. What is unfortunate - I can't create some columns with dlt and some (complex) with BigQuery because it complains that the schema is different. I do not think the problem is too great. The biggest difference so far is _dlt_load_id that is inferred as float

Anyway, this is a trick for power users until #1592 is done.
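
As a rough illustration of how a value-based detector can end up typing `_dlt_load_id` as float: the sketch below assumes the load id is a string that looks like a decimal number (the sample value is made up) and that the detector tries to parse strings as numbers. Neither assumption is confirmed in this thread, and the `infer_column_type` helper is hypothetical; it is not BigQuery's actual autodetection logic.

```python
from typing import Any, Iterable


def _infer_from_text(text: str) -> str:
    """Classify a string by trying numeric parses first (assumed detector behavior)."""
    try:
        int(text)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(text)
        return "FLOAT"
    except ValueError:
        return "STRING"


def infer_column_type(values: Iterable[Any]) -> str:
    """Infer one column type from sample values; hypothetical, for illustration only."""
    seen = set()
    for value in values:
        if isinstance(value, bool):  # check bool before int: bool is a subclass of int
            seen.add("BOOLEAN")
        elif isinstance(value, int):
            seen.add("INTEGER")
        elif isinstance(value, float):
            seen.add("FLOAT")
        elif isinstance(value, str):
            seen.add(_infer_from_text(value))
        else:
            seen.add("STRING")
    if seen == {"INTEGER"}:
        return "INTEGER"
    if seen and seen <= {"INTEGER", "FLOAT"}:
        return "FLOAT"  # mixed numeric samples widen to float
    if seen == {"BOOLEAN"}:
        return "BOOLEAN"
    return "STRING"


# Made-up load id in a decimal-looking format (an assumption, not taken from the PR).
load_id_type = infer_column_type(["1721000000.123456"])
```

An explicitly declared schema would keep such a column as text, which is one way the two schemas can diverge for JSON input.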

Deninc · 2024-09-08T15:11:25Z

@rudolfix can we do this with Redshift SUPER type? Loading from jsonl.

rudolfix added 4 commits July 14, 2024 21:38

allows bigquery to manage schema inference, table creation and migration

5c24ef0

allows to specify resources with bigquery managed schemas in bigquery…

28ac270

…_adapter

adds tests

bbc6e10

updates bigquery and arrow docs

65247c4

rudolfix mentioned this pull request Jul 14, 2024

add "Normalize" as a boolean argument to pipeline.run #1556

Closed

fixes test deps

b5df237

rudolfix force-pushed the feat/allows-autodetect-schema-bigquery branch from e744b4b to b5df237 Compare July 15, 2024 08:58

sh-rp reviewed Jul 15, 2024

View reviewed changes

fixes typo

d7e0138

rudolfix merged commit 401c62d into devel Jul 16, 2024
1 check passed

rudolfix deleted the feat/allows-autodetect-schema-bigquery branch July 16, 2024 08:15
