Skip to content

AWS Athena Information Schema Error with Large Table Lists in dlt Pipeline #2529

@amirdataops

Description

@amirdataops

dlt version

1.8.0

Describe the problem

We encountered an issue in our dlt pipeline when querying AWS Athena's INFORMATION_SCHEMA.COLUMNS for multiple tables. The error we faced was:

Error listing table columns for catalog awsdatacatalog: Detected Iceberg type table without metadata location. Please make sure an Iceberg-enabled compute engine such as Athena or EMR Spark is used to create the table, or the table is created by using the Iceberg open source AWS library iceberg-aws.

This error happens due to the query executed against INFORMATION_SCHEMA:

sql:
SELECT table_name, column_name, data_type, is_nullable
    FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = '{my_table_schema_name}' AND table_name IN ({127 elements}) 
ORDER BY table_name, ordinal_position

The query works without issues when the number of elements in the IN clause is less than or equal to 100. However, with larger lists (e.g., 127 elements), the query takes around 20 minutes to complete, leading to pipeline failure.

To mitigate this, we adjusted the INFO_TABLES_QUERY_THRESHOLD parameter by forking the dlt library and setting the value from 1000 to 40, thus avoiding the fallback mechanism defined in the library's code:

python
if (
    len(name_lookup) > self.INFO_TABLES_QUERY_THRESHOLD
    or len(",".join(folded_table_names)) > self.capabilities.max_query_length / 2
):
    logger.info(
        "Fallback to query all columns from INFORMATION_SCHEMA due to limited query length"
        " or table threshold"
    )
    folded_table_names = []

This workaround resolved the error.

Expected behavior

The query against INFORMATION_SCHEMA.COLUMNS should operate efficiently regardless of the number of table names passed in the IN clause. The pipeline should successfully retrieve metadata without fallback or excessive execution delays.

Steps to reproduce

  1. Use AWS Athena as the SQL backend for the dlt pipeline.
  2. Attempt a query similar to the following:
sql
SELECT table_name, column_name, data_type, is_nullable
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = '{my_table_schema_name}' AND table_name IN ({127 elements}) 
ORDER BY table_name, ordinal_position

  1. Observe the error or extremely delayed query execution if the number of elements in the IN clause exceeds 100.
  2. Note that reducing the INFO_TABLES_QUERY_THRESHOLD parameter mitigates the issue.

Operating system

Linux

Runtime environment

Kubernetes

Python version

3.9

dlt data source

No response

dlt destination

No response

Other deployment details

No response

Additional information

No response

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions