SQLGlot powered Schema discovery for ReadableRelations ("basic lineage") #2502

@sh-rp

Description

TLDR

For any class that implements SupportsReadableRelation we want to be able to compute the column schema on request. We already have a column_schema property that kind of works in some cases, e.g. when we just select a subset of columns or apply a limit, but it does not work for more complex SQL queries / Ibis expressions. This is needed to be able to create and update tables for transformations before the transformations are executed.

Since we have the full input schema for each dataset, we should be able to compute the resulting columns_schema by running some form of lineage on the SQL query and looking up the columns in the input schema. A simple version of this exists in the dlt+ repo; maybe that is the way to go, maybe a completely different approach is better.

We need this regardless of whether we replace our IbisRelations with Tables from an IbisBackend.

Expected behavior

  • For every SupportsReadableRelation we want a compute_columns_schema method, for now without any arguments, though some may be added later to control exactly how it works. When called, the current query should be analyzed and the resulting columns returned as TTableSchemaColumns. The result of this computation may be cached.
  • For now, we only keep the data_type and nullable as well as the precision, scale and timezone hints of each column; we still have to figure out how all the other hints behave when used in transformations, some, like primary_key, may not make sense anymore. We can make one exception: in the case where the expression is based on only one table and all or a subset of its columns is selected, we could return the full hints for all columns.
  • It should be possible to join tables from two separate datasets and have this work, forwarding all the relevant hints for all columns.
  • If a new column appears that is not derived from an origin column (such as COUNT(*) or a static value), sqlglot will probably give us a type there too, and we can map these in a standard way to our dlt data_types. The same is true for casts. This is one of the big parts still to figure out.
  • If for some reason the data_type of a column cannot be determined, this method should fail for now. We should be able to figure them all out, though.
  • @rudolfix has the idea of attaching our computed schema to arrow tables and dataframes in some userspace field they provide. This means that for all methods that return or iterate over these two table formats, we need to attach the computed schema, probably serialized as a string. We will then consume it, if present, in the extraction stage and merge it into the schema. This is to be investigated :)
  • You probably do not need to act on this yet, but keep in mind that at a later stage we will also forward other hints, such as pii or similar, and we may implement some kind of lineage that returns which column came from which origin column. The code in dlt+ has this functionality sketched out already, too.

Open questions

  • Should we rather return a TTableSchema instead of TTableSchemaColumns so we may also forward table level hints in some cases?

Additional Reading

Metadata

Assignees

Labels

No labels

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests