Amazon: Add Athena Spark operator and sensor support by Andisha2004 · Pull Request #66576 · apache/airflow

Andisha2004 · 2026-05-07T23:05:03Z

What this PR does

This PR adds Athena Spark calculation support to the Amazon provider.

Specifically, it:

extends AthenaHook with Athena Spark calculation helpers
adds AthenaSparkOperator for submitting and waiting on Athena Spark calculations
adds AthenaSparkSensor for monitoring an existing calculation execution
adds unit tests for the new hook/operator/sensor behavior
adds provider documentation for Athena Spark usage

Why this is needed

The Amazon provider already supports Athena query execution, but it does not currently expose first-class support for Athena Spark calculation APIs. This PR fills that gap by adding provider-native support for Athena Spark job submission and monitoring.

Implementation notes

AthenaHook wraps Athena Spark APIs including:
- start_calculation_execution
- get_calculation_execution
- stop_calculation_execution
AthenaSparkOperator submits a calculation, polls until a terminal state, and returns execution metadata
AthenaSparkSensor polls an existing calculation execution ID until completion or failure
the implementation is intentionally scoped to provider-layer functionality only
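For readers unfamiliar with the calculation APIs, the three calls listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `SparkCalculationHelpers` and its method names are hypothetical, and `client` is assumed to behave like `boto3.client("athena")`, whose `get_calculation_execution` response is assumed to carry the state under `Status.State`.

```python
from typing import Any, Optional


class SparkCalculationHelpers:
    """Hypothetical stand-in for the helpers added to AthenaHook (names are assumptions)."""

    def __init__(self, client: Any) -> None:
        # `client` is expected to behave like boto3.client("athena").
        self.client = client

    def start_spark_calculation(self, session_id: str, code_block: str) -> str:
        # Submit the code block to an existing Athena Spark session.
        response = self.client.start_calculation_execution(
            SessionId=session_id, CodeBlock=code_block
        )
        return response["CalculationExecutionId"]

    def get_spark_calculation_state(self, calculation_execution_id: str) -> Optional[str]:
        response = self.client.get_calculation_execution(
            CalculationExecutionId=calculation_execution_id
        )
        # Assumption: the state lives under Status.State for this call.
        return (response.get("Status") or {}).get("State")

    def stop_spark_calculation(self, calculation_execution_id: str) -> Optional[str]:
        response = self.client.stop_calculation_execution(
            CalculationExecutionId=calculation_execution_id
        )
        return response.get("State")
```

Passing the client in explicitly keeps the sketch testable with a stub; the real hook would obtain it through `get_conn()`.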

Tests

I added or updated unit tests for:

providers/amazon/tests/unit/amazon/aws/hooks/test_athena.py
providers/amazon/tests/unit/amazon/aws/operators/test_athena_spark.py
providers/amazon/tests/unit/amazon/aws/sensors/test_athena_spark.py

Backward compatibility

This PR adds new functionality and does not change the existing Athena query operator or sensor behavior.

Was generative AI tooling used to co-author this PR?

Yes (please specify the tool below)

Generated-by: OpenAI Codex following the guidelines

boring-cyborg · 2026-05-07T23:05:12Z

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
In case of a new feature, add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example Dag that shows how users should use it.
Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
Be sure to read the Airflow Coding style.
Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

Andisha2004 · 2026-05-07T23:09:16Z

I ran the relevant provider unit tests and Python lint/format checks locally. I used --no-verify for the local commit because the blacken-docs hook environment on this machine was resolving against Python 3.9 even after reinstalling prek, which appears to be a local tooling issue rather than a code issue in this PR.

SameerMesiah97

I only managed to get through half of the diff but there were quite a few issues worth addressing. I have left some comments but I would urge you to review the entire diff to ensure that the code is clean and that everything is in line with project standards.

Edit: I have just gone through the rest of the diff. The tests are better than the actual implementation but they need some refinement too. They will naturally have to be adjusted if you implement my feedback on the initial part of the diff.

SameerMesiah97 · 2026-05-08T08:57:02Z

@@ -0,0 +1 @@
+Add Athena Spark operator and sensor support to the Amazon provider.


Provider changes do not require newsfragments.

SameerMesiah97 · 2026-05-08T09:11:16Z

+        AthenaSparkSensor(
+            task_id="wait_for_spark_calculation",
+            calculation_execution_id="calc-exec-123",
+        )


So there are a few issues with this file:

You are inlining DAG code using these new operators/sensors. Standard practice in the provider docs is usually to add tasks using the new operators/sensors to the example DAG(s) and reference them via exampleinclude blocks instead.

Where is the Prerequisite Tasks section?

The structure of the Operators/Sensors sections looks inconsistent with the existing Amazon provider docs. For example, if you look at athena_sql.rst, there is a top-level Operators section followed by use-case-oriented subsections such as Execute a SQL query, along with explanatory text describing when/how the operators should be used. I think the same structure should be followed here for both operators and sensors.

The page is also missing some contextual guidance for users. For example:

whether an Athena Spark session must already exist

which AWS connection/authentication is expected

when to use AthenaSparkOperator vs AthenaSparkSensor

SameerMesiah97 · 2026-05-08T09:13:19Z

        self.log.info("Stopping Query with executionId - %s", query_execution_id)
        return self.get_conn().stop_query_execution(QueryExecutionId=query_execution_id)
+
+    # --- Athena Spark (Calculations) API ---


Remove this.

SameerMesiah97 · 2026-05-08T09:19:53Z

+            calculation_execution_id=calculation_execution_id, use_cache=use_cache
+        )
+        try:
+            return response["CalculationExecution"]["Status"].get("StateChangeReason")


Why is it ["CalculationExecution"]["Status"]["StateChangeReason"] here but response["Status"]["State"] above?

SameerMesiah97 · 2026-05-08T09:22:14Z

+        "COMPLETED",
+        "FAILED",
+        "CANCELED",
+    )


Better to do this to avoid drift:

SPARK_TERMINAL_STATES = SPARK_SUCCESS_STATES + SPARK_FAILURE_STATES

Also, I would move these constants above the hook constructor near the existing hook constants.
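A small sketch of the derived constant, to make the suggestion concrete. The class name is a placeholder, and the split into success and failure states is an assumption based on the three terminal states shown in the diff.

```python
class AthenaHookSketch:
    """Placeholder class; only the constant layout is the point here."""

    # Assumed split of the three terminal states listed in the diff.
    SPARK_SUCCESS_STATES = ("COMPLETED",)
    SPARK_FAILURE_STATES = ("FAILED", "CANCELED")
    # Derived rather than retyped, so adding a state in one place updates both.
    SPARK_TERMINAL_STATES = SPARK_SUCCESS_STATES + SPARK_FAILURE_STATES
```

With this layout, a new failure state only has to be added to `SPARK_FAILURE_STATES` for the terminal check to pick it up.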

SameerMesiah97 · 2026-05-08T09:24:28Z

+        :param description: Optional description of the calculation.
+        :param calculation_configuration: Contains configuration information for the calculation.
+        :param client_request_token: Optional idempotency token.
+        :return: CalculationExecutionId


Shouldn't this be:

:return: str

SameerMesiah97 · 2026-05-08T09:34:04Z

+            execution_info.get("OutputLocation")
+            or (execution_info.get("CalculationExecution") or {}).get("OutputLocation")
+            or (execution_info.get("ResultConfiguration") or {}).get("OutputLocation")
+        )


It looks like you are compensating for inconsistent response structures coming from the hook. If you need to normalize, why not push it down to the hook itself?
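Something along these lines in the hook would let the operator read one shape only. This is a rough sketch: `normalize_calculation_info` and the keys of the returned dict are hypothetical, and whether a `CalculationExecution` envelope ever actually appears is an open question carried over from the fallbacks in the diff.

```python
from typing import Any


def normalize_calculation_info(response: dict[str, Any]) -> dict[str, Any]:
    """Return one flat shape regardless of whether the payload is wrapped."""
    # Unwrap the optional "CalculationExecution" envelope the diff guards against.
    payload = response.get("CalculationExecution") or response
    status = payload.get("Status") or {}
    return {
        "state": status.get("State"),
        "state_change_reason": status.get("StateChangeReason"),
        "result": payload.get("Result") or {},
    }
```

Callers can then use `info["state"]` and `info["state_change_reason"]` without knowing which response shape the API call produced.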

SameerMesiah97 · 2026-05-08T09:35:22Z

+            if state in AthenaHook.SPARK_TERMINAL_STATES:
+                return state
+
+        raise AirflowException(


We need to avoid introducing AirflowException in the provider layer. I would suggest using RuntimeError. Please check the rest of the diff and implement the same feedback.

SameerMesiah97 · 2026-05-08T09:39:20Z

+        """Poll calculation status until a terminal state or timeout."""
+        for attempt in range(1, self.max_polling_attempts + 1):
+            if attempt > 1:
+                time.sleep(self.poll_interval)


Instead of checking for the iteration count, maybe it would be cleaner to move time.sleep to the end of the loop?
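Roughly what I have in mind, as a standalone sketch rather than a drop-in replacement. The function name, the `get_state` callable, and the injectable `sleep` parameter are all assumptions made for illustration; the state names come from the diff.

```python
import time
from typing import Callable, Optional

TERMINAL_STATES = ("COMPLETED", "FAILED", "CANCELED")


def poll_until_terminal(
    get_state: Callable[[], Optional[str]],
    max_attempts: int,
    poll_interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal state is seen or the attempts run out."""
    for _ in range(max_attempts):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        # Sleeping at the end removes the need for an attempt-number check.
        sleep(poll_interval)
    raise RuntimeError(
        f"Calculation did not reach a terminal state after {max_attempts} attempts"
    )
```

The trade-off is one extra sleep after the final attempt before the error is raised. If that matters, the loop can break out before sleeping on the last iteration.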

SameerMesiah97 · 2026-05-08T18:39:19Z

+
+    def execute(self, context: Context) -> dict[str, Any]:
+        """Submit the Spark calculation, poll until terminal state, then return metadata."""
+        del context


Why is del context needed here?

SameerMesiah97 · 2026-05-08T18:45:38Z

+        )
+
+        if initial_state and initial_state in AthenaHook.SPARK_TERMINAL_STATES:
+            return self._handle_terminal_state(calculation_execution_id, initial_state)


Why is this needed when you have lines 128-129 to poll for and then handle the terminal state? I would remove it unless there is a strong justification that I cannot see.

SameerMesiah97 · 2026-05-08T18:51:22Z

+                f"Reason: {reason or 'No reason provided.'}"
+            )
+
+        if state != "COMPLETED":


Why are you using a string here for state checks when you have a constant called SPARK_SUCCESS_STATES? It would be better to do:

if state not in AthenaHook.SPARK_SUCCESS_STATES:

SameerMesiah97 · 2026-05-08T18:52:45Z

+                self.log.warning(
+                    "Failed to stop calculation %s: %s",
+                    self._calculation_execution_id,
+                    e,


This would be better, as it preserves debugging information:

except Exception:
    self.log.warning(
        "Failed to stop calculation %s",
        self._calculation_execution_id,
        exc_info=True,
    )
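To show what `exc_info=True` buys you, here is a small standalone sketch using only the standard `logging` module. The logger name and the `stop_quietly` helper are hypothetical; the point is that the log record keeps the full exception and traceback instead of only its string form.

```python
import logging
from typing import Callable

log = logging.getLogger("athena_spark_sketch")


def stop_quietly(stop_calculation: Callable[[str], object], calculation_execution_id: str) -> None:
    """Best-effort stop: log the failure with its traceback instead of raising."""
    try:
        stop_calculation(calculation_execution_id)
    except Exception:
        # exc_info=True attaches the active exception (type, value, traceback)
        # to the record, so handlers can render the full stack trace.
        log.warning(
            "Failed to stop calculation %s",
            calculation_execution_id,
            exc_info=True,
        )
```

Formatting the exception into the message with `%s` keeps only its text, which is usually not enough to tell where inside the client call the failure happened.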

SameerMesiah97 · 2026-05-08T18:54:08Z

+        if state in hook.SPARK_FAILURE_STATES:
+            raise AirflowException(f"Calculation {self.calculation_execution_id} failed with state: {state}")
+
+        return state == "COMPLETED"


Use the constant AthenaHook.SPARK_SUCCESS_STATES.

SameerMesiah97 · 2026-05-08T18:54:32Z

+
+    def poke(self, context: Context) -> bool:
+        """Check the current status of the Spark calculation."""
+        del context


No need for this.

SameerMesiah97 · 2026-05-08T18:56:32Z

+- Mock the boto3 client via AthenaHook.get_conn() so no real AWS calls are made.
+- Cover success, failure (exceptions), and bad/edge-case input for each hook method.
+- Use botocore.exceptions.ClientError for API failure scenarios.
+"""


Why is this docstring here?

SameerMesiah97 · 2026-05-08T19:02:28Z

+            calculation_execution_id=MOCK_DATA["calculation_execution_id"]
+        )
+
+        assert state == "RUNNING"


What about when the API returns a None or empty state? Have you covered that?
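For example, a test along these lines would pin down the behaviour. It is only a sketch: `check_calculation_state` is a hypothetical stand-in for the hook method, and returning `None` for a missing or empty state is an assumed contract, not something the diff guarantees.

```python
from typing import Any, Optional
from unittest import mock


def check_calculation_state(client: Any, calculation_execution_id: str) -> Optional[str]:
    """Hypothetical hook helper: return the state, or None when it is missing or empty."""
    response = client.get_calculation_execution(
        CalculationExecutionId=calculation_execution_id
    )
    return (response.get("Status") or {}).get("State") or None


def test_missing_or_empty_state_returns_none() -> None:
    # Each payload models a response with no usable state.
    for payload in ({}, {"Status": None}, {"Status": {}}, {"Status": {"State": ""}}):
        client = mock.Mock()
        client.get_calculation_execution.return_value = payload
        assert check_calculation_state(client, "calc-exec-123") is None
        client.get_calculation_execution.assert_called_once_with(
            CalculationExecutionId="calc-exec-123"
        )
```

In the real test module this would more naturally be a `pytest.mark.parametrize` case patched through `AthenaHook.get_conn`, like the neighbouring tests.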

SameerMesiah97 · 2026-05-08T19:08:32Z

+        mock_conn.return_value.start_calculation_execution.assert_called_with(**expected_call_params)
+        assert result == MOCK_DATA["calculation_execution_id"]
+
+    @mock.patch.object(AthenaHook, "get_conn")


Since there is ambiguity regarding the shape of the response object received for get_calculation_info, shouldn't there also be a test validating that? Also, since there is an option to use caching, I would suggest that you add a test or set of tests covering that behaviour.

SameerMesiah97 · 2026-05-08T19:14:42Z

+    def test_poke_running_returns_false(self, mock_check: mock.Mock, sensor: AthenaSparkSensor):
+        result = sensor.poke({})
+        assert result is False
+        mock_check.assert_called_once_with(CALC_ID)


These tests should be fine once you move response normalization down to the hook. There is one small gap that would still remain: there does not appear to be coverage for the Unexpected terminal state branch in _handle_terminal_state().

o-nikolas

Great review @SameerMesiah97 I did not have much to add on top. @Andisha2004 please try to clean this one up.

o-nikolas · 2026-05-12T22:20:01Z

+    :param max_polling_attempts: Maximum number of polling attempts before timing out.
+        To limit total task time, use execution_timeout on the task as well.


The majority of the Amazon provider uses max_attempts, please use that throughout this PR.

Co-authored-by: Cursor <cursoragent@cursor.com>

potiuk · 2026-05-18T09:58:51Z

@Andisha2004 — There are 20 unresolved review thread(s) on this PR from @SameerMesiah97, @o-nikolas, and you have engaged with each one (post-review commits and/or in-thread replies). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation?

If yes, reply here (a short "yes / ready" is fine) and an Apache Airflow maintainer will pick the PR up from the review queue on the next sweep.

If you are still working on a thread, please reply with what is outstanding so the threads stay unresolved on purpose.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Amazon: Add Athena Spark operator and sensor support

1838370

Andisha2004 requested a review from o-nikolas as a code owner May 7, 2026 23:05

boring-cyborg Bot added area:providers kind:documentation provider:amazon AWS/Amazon - related issues labels May 7, 2026

Rename Athena Spark newsfragment for PR number

c38a3d6

SameerMesiah97 reviewed May 8, 2026

o-nikolas reviewed May 12, 2026

Remove newsfragment; provider changes do not require one.

c34fb59

Co-authored-by: Cursor <cursoragent@cursor.com>
