Cannot schedule a DAG on a DatasetAlias when using a clean Airflow docker image (for CI) #44775

@Dev-iL

Description

Apache Airflow version

2.10.4

If "Other Airflow 2 version" selected, which one?

Observed also on 2.10.2

What happened?

One of the tests in our CI creates a DagBag. After adding a DAG that is scheduled on a DatasetAlias, the bag can no longer be created. Confusingly, this issue does not occur in other settings/environments (e.g. dag.test, local pytest, a non-CI Airflow run in Docker).
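
For context, the failing test is essentially equivalent to the sketch below (the dag_folder path and the assertion are illustrative, not the exact test):

from airflow.models import DagBag


def test_dagbag_imports_cleanly():
    # Parse only our own DAGs; any file that fails to import should end up in import_errors.
    bag = DagBag(dag_folder="dags/", include_examples=False)
    assert not bag.import_errors, f"DAG import errors: {bag.import_errors}"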

Relevant stack trace
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: dataset_alias
[SQL: SELECT dataset_alias.id, dataset_alias.name 
FROM dataset_alias 
WHERE dataset_alias.name = ?
 LIMIT ? OFFSET ?]
[parameters: ('bar', 1, 0)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)
[2024-12-08T16:08:22.227+0000] {variable.py:357} ERROR - Unable to retrieve variable from secrets backend (MetastoreBackend). Checking subsequent secrets backend.
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: no such table: variable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/variable.py", line 353, in get_variable_from_secrets
    var_val = secrets_backend.get_variable(key=key)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 97, in wrapper
    return func(*args, session=session, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/secrets/metastore.py", line 66, in get_variable
    return MetastoreBackend._fetch_variable(key=key, session=session)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/api_internal/internal_api_call.py", line 139, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 94, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/secrets/metastore.py", line 84, in _fetch_variable
    var_value = session.scalar(select(Variable).where(Variable.key == key).limit(1))
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1747, in scalar
    return self.execute(
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1717, in execute
    result = conn._execute_20(statement, params or {}, execution_options)
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1710, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 334, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1577, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1953, in _execute_context
    self._handle_dbapi_exception(
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2134, in _handle_dbapi_exception
    util.raise_(
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) no such table: variable
[SQL: SELECT variable.val, variable.id, variable."key", variable.description, variable.is_encrypted 
FROM variable 
WHERE variable."key" = ?
 LIMIT ? OFFSET ?]
[parameters: ('latest_processed_instrument_file_path', 1, 0)]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

What you think should happen instead?

DagBag creation should not fail during a one-shot parse pass over the dags folder, regardless of the order in which the DAG files are parsed.

How to reproduce

  1. Pull a recent Docker image, e.g. apache/airflow:2.10.4-python3.11.
  2. Spin up a container and open a shell inside it.
  3. Add the following DAG to the dags folder:
from pendulum import datetime

from airflow import DAG
from airflow.datasets import DatasetAlias

with DAG(
    dag_id="foo",
    start_date=datetime(2000, 1, 1),
    schedule=[
        DatasetAlias("bar"),
    ],
    catchup=False,
):
    pass
  4. Open a Python shell and run:
from airflow.models import DagBag
DagBag(include_examples=False)
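
The failure may surface either as the OperationalError raised above or as an entry recorded on the bag; a quick way to check (continuing the same shell):

# With an uninitialized metadata DB, this is expected to show the
# "no such table: dataset_alias" error for the DAG file above.
bag = DagBag(include_examples=False)
print(bag.import_errors)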

Operating System

Rocky Linux 9.3

Versions of Apache Airflow Providers

(Irrelevant)

Deployment

Other Docker-based deployment

Deployment details

Dockerfile-ci:

FROM apache/airflow:2.10.2-python3.9

# Install system packages
USER root
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential vim strace iproute2 git \
       pkg-config libxml2-dev libxmlsec1-dev libxmlsec1-openssl \
    && apt-get autoremove -yqq --purge \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

USER airflow

ci script:

  script:
    - git config --global --add safe.directory $PWD
    - pip install uv pre-commit-uv --upgrade
    - uv pip install -e .[dev] --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.10.2/constraints-3.9.txt"
    - uv tool install pre-commit --force --with pre-commit-uv --force-reinstall
    - export PIP_USER=false && pre-commit install --install-hooks
    - pre-commit run --all-files --show-diff-on-failure

Anything else?

Presumably, this issue isn't present in a "running" Airflow instance where at least one DAG outputs to a DatasetAlias: that causes the necessary tables to be created, so the second parse of the alias-scheduled DAG succeeds.
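
For illustration only, a producer DAG of the kind described above might look roughly like this (the alias name matches the repro DAG; the outlet_events usage follows the Airflow 2.10 dataset-alias documentation, but this specific DAG is hypothetical):

from pendulum import datetime

from airflow import DAG
from airflow.datasets import Dataset, DatasetAlias
from airflow.decorators import task

with DAG(
    dag_id="baz",  # hypothetical producer DAG
    start_date=datetime(2000, 1, 1),
    schedule=None,
    catchup=False,
):

    @task(outlets=[DatasetAlias("bar")])
    def produce(*, outlet_events):
        # At runtime, attach a dataset event to the alias; in a running
        # deployment this is what links the alias to a concrete dataset.
        outlet_events[DatasetAlias("bar")].add(Dataset("s3://bucket/produced"))

    produce()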

I believe this happens because the Docker image doesn't ship with an initialized SQLite database, and the problem only surfaces once the DagBag is built.

As a workaround, I tried adding an airflow db migrate step to the image build. While this did help with the manual test described above, the CI kept failing. Adding airflow db migrate to the CI script itself did not help either.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
