SQLGlot powered Schema discovery for ReadableRelations ("basic lineage") #2502

@sh-rp

Description

TLDR

For any class that implements SupportsReadableRelation we want to be able to compute the column schema on request. We already have a column_schema property that kind of works in some cases, e.g. when we just select a subset of columns or apply a limit, but it does not work for more complex SQL queries / Ibis expressions. This is needed to be able to create and update tables for transformations before the transformations are executed.

Since we have the full input schema for each dataset, we should be able to compute the resulting columns_schema by running some form of lineage on the SQL query and looking up the columns in the input schema. A simple version of this exists in the dlt+ repo; maybe that is the way to go, maybe a completely different approach is better.

We need this regardless of whether we replace our IbisRelations with Tables from an IbisBackend.

Expected behavior

  • For every SupportsReadableRelation we want a compute_columns_schema method, for now without any arguments, though some may be added later to control exactly how it works. When called, the current query should be analyzed and the resulting columns returned as TTableSchemaColumns. The result of this computation may be cached.
  • For now, we only keep the data_type and nullable as well as the precision, scale and timezone hints of each column; we still have to figure out how all the other hints behave when used in transformations, some, like primary_key, may not make sense anymore. We can make one exception: in the case where the expression is based on only one table and all or a subset of its columns is selected, we could return the full hints for all columns.
  • It should be possible to join tables from two separate datasets and have this work, forwarding all the relevant hints for all columns.
  • If a new column appears that is not derived from an origin column (such as COUNT(*) or a static value), sqlglot will probably give us a type there too, and we can map these in a standard way to our dlt data_types. The same is true for casts. This is one of the big parts still to figure out.
  • If for some reason the data_type of a column cannot be determined, this method should fail for now. We should be able to figure them all out, though.
  • @rudolfix has the idea of attaching our computed schema to arrow tables and dataframes in some userspace field they provide. This means that for all methods that return or iterate over these two table formats, we need to attach the computed schema, probably serialized as a string. We will then consume it, if present, in the extraction stage and merge it into the schema. This is to be investigated :)
  • You probably do not need to act on this yet, but keep in mind that at a later stage we will also forward other hints, such as pii or similar, and we may implement some kind of lineage that returns which column came from which origin column. The code in dlt+ has this functionality sketched out already, too.

Open questions

  • Should we rather return a TTableSchema instead of TTableSchemaColumns so we may also forward table level hints in some cases?

Additional Reading

Metadata

Assignees

Labels

No labels

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests