Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated logic to allow AWS Batch Hook get_job_description retries to be more effective #37552

Closed
wants to merge 1 commit into from

Conversation

evgenyslab
Copy link
Contributor


Update logic in Amazon/aws providers for batch_client hook's parse_job_description function. This function is ONLY called within the get_job_description function that allows for retries; however, there is only handling of botocore.exceptions.ClientError error in this function.

This means, that if AWS simply fails to provide the job description - if there is a delay in AWS to register job information, this function will immediate fail instead of trying. This may happen under certain conditions, especially when using the batch operators for submitting jobs which immediately use returned job_id to query for the job information - but again, if AWS fails to register the job description in time - this will cause an unnecessary failure in Airflow side of things.

The proposed change allows parse_job_description to return None instead of raising if zero jobs are returned by AWS. Similarly, if more than one matching job is returned, the function will provide the description of the first job.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Feb 19, 2024
@evgenyslab evgenyslab changed the title updated logic to allow retries to be effective Updated logic to allow AWS Batch Hook get_job_description retries to be more effective Feb 19, 2024
Copy link
Collaborator

@dirrao dirrao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the test case for the change? test cases are failing. Can you take look at them?

Copy link
Contributor

@Taragolis Taragolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I aggree with @dirrao the test cases should be fixed, and new tests should be added.

There are 3 different places in Amazon Provider which are rely on current behaviour, and unknown number of usage into th user code.

job = self.hook.get_job_description(self.job_id)

job_desc = self.hook.get_job_description(self.job_id)

job_description = self.hook.get_job_description(self.job_id)

At least this places should be covered by the new behaiviour. We do not want to to break something when we tried to improve/fix something other

Comment on lines +403 to +404
else:
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seem like this continue redundant

Copy link

github-actions bot commented Apr 6, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:amazon-aws AWS/Amazon - related issues stale Stale PRs per the .github/workflows/stale.yml policy file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants