-
Notifications
You must be signed in to change notification settings - Fork 418
Description
dlt version
1.8.0
Describe the problem
We encountered an issue in our dlt pipeline when querying AWS Athena's INFORMATION_SCHEMA.COLUMNS for multiple tables. The error we faced was:
Error listing table columns for catalog awsdatacatalog: Detected Iceberg type table without metadata location. Please make sure an Iceberg-enabled compute engine such as Athena or EMR Spark is used to create the table, or the table is created by using the Iceberg open source AWS library iceberg-aws.
This error happens due to the query executed against INFORMATION_SCHEMA:
sql:
SELECT table_name, column_name, data_type, is_nullable
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = '{my_table_schema_name}' AND table_name IN ({127 elements})
ORDER BY table_name, ordinal_position
The query works without issues when the number of elements in the IN clause is less than or equal to 100. However, with larger lists (e.g., 127 elements), the query takes around 20 minutes to complete, leading to pipeline failure.
To mitigate this, we adjusted the INFO_TABLES_QUERY_THRESHOLD parameter by forking the dlt library and setting the value from 1000 to 40, thus avoiding the fallback mechanism defined in the library's code:
python
if (
len(name_lookup) > self.INFO_TABLES_QUERY_THRESHOLD
or len(",".join(folded_table_names)) > self.capabilities.max_query_length / 2
):
logger.info(
"Fallback to query all columns from INFORMATION_SCHEMA due to limited query length"
" or table threshold"
)
folded_table_names = []
This workaround resolved the error.
Expected behavior
The query against INFORMATION_SCHEMA.COLUMNS should operate efficiently regardless of the number of table names passed in the IN clause. The pipeline should successfully retrieve metadata without fallback or excessive execution delays.
Steps to reproduce
- Use AWS Athena as the SQL backend for the dlt pipeline.
- Attempt a query similar to the following:
sql
SELECT table_name, column_name, data_type, is_nullable
FROM INFORMATION_SCHEMA.COLUMNS
WHERE table_schema = '{my_table_schema_name}' AND table_name IN ({127 elements})
ORDER BY table_name, ordinal_position
- Observe the error or extremely delayed query execution if the number of elements in the IN clause exceeds 100.
- Note that reducing the INFO_TABLES_QUERY_THRESHOLD parameter mitigates the issue.
Operating system
Linux
Runtime environment
Kubernetes
Python version
3.9
dlt data source
No response
dlt destination
No response
Other deployment details
No response
Additional information
No response
Metadata
Metadata
Assignees
Labels
Type
Projects
Status