GlueJobOperator with local script location fails on consecutive runs #38959

Closed
moritzsanne opened this issue Apr 12, 2024 · 1 comment · Fixed by #38960
Labels: area:providers, kind:bug, needs-triage

Comments

@moritzsanne
Contributor

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon 8.19.0

Apache Airflow version

2.8.3

Operating System

Amazon Linux 2; Kernel Version: 5.10.209-198.812.amzn2.x86_64

Deployment

Official Apache Airflow Helm Chart

Deployment details

We deploy airflow on EKS using the official Helm chart.

What happened

We are deploying a Glue Job using the GlueJobOperator with the following configuration:

    from pathlib import Path

    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

    GlueJobOperator(
        job_name="weather_data_prepared_local_file",
        script_location=str(Path(__file__).resolve().parent / "scripts/weather_data_prepared.py"),
        s3_bucket="aws-glue-temporary-bucket",
        task_id="WeatherGlueJob",
        iam_role_name="eks.data.airflow.glue.executor",
        create_job_kwargs={"GlueVersion": "4.0", "NumberOfWorkers": 2, "WorkerType": "G.1X"},
        update_config=True,
        aws_conn_id="datalake",
        dag=dag,
    )

This works fine for the first run of our DAG, and the script file is uploaded to artifacts/glue-scripts/weather_data_prepared.py.
However, when we trigger the DAG a second time, the task fails because the key already exists:

    [2024-04-12T08:13:33.615+0000] {glue.py:173} INFO - Initializing AWS Glue Job: weather_data_prepared
    [2024-04-12T08:13:33.659+0000] {base.py:83} INFO - Using connection ID 'datalake' for task execution
    [2024-04-12T08:13:34.265+0000] {taskinstance.py:2731} ERROR - Task failed with exception
    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 44
        result = _execute_callable(context=context, **execute_callable_kwargs)
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 41
        return execute_callable(context=context, **execute_callable_kwargs)
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/operators/g
        glue_job_run = self.glue_job_hook.initialize_job(self.script_args, self.run_job_kwargs)
      File "/usr/local/lib/python3.8/functools.py", line 967, in __get__
        val = self.func(instance)
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/operators/g
        s3_hook.load_file(
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py
        return func(*bound_args.args, **bound_args.kwargs)
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py
        return func(*bound_args.args, **bound_args.kwargs)
      File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/s3.py
        raise ValueError(f"The key {key} already exists.")
    ValueError: The key artifacts/glue-scripts/weather_data_prepared.py already exists.
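
For context, the traceback shows the upload going through S3Hook.load_file, whose replace parameter defaults to False, so the second upload raises instead of overwriting. A minimal sketch of that behavior (the local path is illustrative; bucket and key match the log above):

    from airflow.providers.amazon.aws.hooks.s3 import S3Hook

    s3_hook = S3Hook(aws_conn_id="datalake")

    # First run: the key does not exist yet, so the upload succeeds.
    s3_hook.load_file(
        filename="dags/scripts/weather_data_prepared.py",
        key="artifacts/glue-scripts/weather_data_prepared.py",
        bucket_name="aws-glue-temporary-bucket",
    )

    # Second run: replace defaults to False, so load_file raises
    # ValueError("The key ... already exists.") instead of overwriting.
    s3_hook.load_file(
        filename="dags/scripts/weather_data_prepared.py",
        key="artifacts/glue-scripts/weather_data_prepared.py",
        bucket_name="aws-glue-temporary-bucket",
    )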

What you think should happen instead

We think the file on S3 should be overwritten on subsequent DAG executions, so that consecutive runs of GlueJobOperator tasks using local script locations do not fail.
This would allow us to keep our script files under version control and roll them out through CI/CD pipelines.
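
A minimal sketch of the kind of change we have in mind, assuming the operator keeps uploading via S3Hook.load_file (variable names here are illustrative, not the provider's actual code):

    # Sketch only: pass replace=True so re-runs overwrite the script.
    s3_hook.load_file(
        filename=self.script_location,
        key=script_key,                 # hypothetical name for the derived S3 key
        bucket_name=self.s3_bucket,
        replace=True,                   # overwrite instead of raising ValueError
    )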

How to reproduce

1. Create a DAG with the GlueJobOperator, referencing a local script file and an S3 bucket (a minimal example follows these steps).
2. Run the DAG twice; the second run fails with the ValueError shown above.
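
A minimal, self-contained DAG for reproduction (bucket, role, and DAG/job names are placeholders; adjust them to your environment):

    from datetime import datetime
    from pathlib import Path

    from airflow import DAG
    from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

    with DAG(dag_id="glue_local_script_repro", start_date=datetime(2024, 1, 1), schedule=None) as dag:
        GlueJobOperator(
            task_id="glue_job",
            job_name="glue_local_script_repro",
            # Local script next to the DAG file; the operator uploads it to S3.
            script_location=str(Path(__file__).resolve().parent / "scripts/job.py"),
            s3_bucket="my-glue-temporary-bucket",
            iam_role_name="my-glue-execution-role",
            update_config=True,
        )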

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

@moritzsanne added the area:providers, kind:bug, and needs-triage labels on Apr 12, 2024

boring-cyborg bot commented Apr 12, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
