Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix code checking job names in sagemaker #29245

Merged
merged 8 commits into from
Feb 27, 2023
Merged

Conversation

vandonr-amz
Copy link
Contributor

@vandonr-amz vandonr-amz commented Jan 30, 2023

I was looking at this code because we got a throttling error in a system test.
The error was coming from the list_ operation on the job names.
This boto call has a cap at 100 results per call, and the function wanted all the jobs, so we were doing repeated calls with the NextToken set.
Looking online, it seems boto starts to throw throttling errors around 70/80 chained calls like this with the default config (i.e. around 7-8000 jobs).
The problem is that sagemaker jobs cannot be "cleaned", they stay there forever, so listing all of them is an increasingly longer operation, doomed to fail.

Looking further, it turned out that we didn't really needed to list all the jobs. It was used for 2 things: checking if the job name already existed (which can be done in a single describe call), and get a number to append to the job name if needed.
To make this a lot cheaper and not O(nb jobs), I propose that we use a random number instead of a monotonically increasing sequence to rename jobs in case of collision.

@vandonr-amz
Copy link
Contributor Author

vandonr-amz commented Jan 30, 2023

need to fix transform tests, I forgot that when pushing :|

EDIT: done

@@ -679,11 +697,11 @@ def __init__(
self.check_interval = check_interval
self.max_ingestion_time = max_ingestion_time
self.check_if_job_exists = check_if_job_exists
if action_if_job_exists in ("increment", "fail"):
if action_if_job_exists in {"random", "increment", "fail"}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand well, increment value is deprecated but if you set it, the behavior is the same as random? I am wondering if we should not keep the current behavior if increment is set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes your understanding is correct.
We can be extra cautious and keep the existing behavior, as it's a change, but it's not a "breaking" change.
I think the "increment" behavior is a bit dangerous because it's just problems waiting to happen if you use it, so I'd rather remove it entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine with that :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These will cause the jobs to have different names from now onward, ya? I think @ferruzzi modified this code somewhat recently and we tried to keep the naming the same for backwards compatibility (folks may have other logic that depends on the names being in the format they have been historically).
How to handle these low level breaking changes has been a hot topic lately in Airflow, I still don't think we've landed on a spot everyone agrees with. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well there is the "every change is a breaking change" philosophy of course, and maybe I can imagine a workflow where changing this would be breaking, but it'd be pretty far fetched.
I can always keep the existing behavior, but I'd put big deprecation warnings around it, because let's say you run 10 jobs a day, you'd be looking at throttling exceptions after 2-3 years, which is not such a long time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, the "shape" of the naming is still the same, if a regex had been designed for the old code, the new code would pass it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was my recent poke at this code from back in November: #27634

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that since we're adding in a deprecation warning that should be sufficient as users will have proper logs as to why they are getting unexpected naming.

@ferruzzi
Copy link
Contributor

I don't think I agree that using a random suffix is an improvement.

The sequence was already imperfect, because we counted all jobs, not just the ones with the same base name, so it'd go like:

We're only counting jobs whose name matches here

@vandonr-amz
Copy link
Contributor Author

We're only counting jobs whose name matches here

Oh yes my bad, I read the code too fast.

Having a random number rather than beautiful sequence is not better in itself of course, but I think the price to pay for this little nice thing is too high.

Instead of an O(1) operation, we end up doing potentially hundreds of network calls, with added wait time to account for throttling errors, every time we create a job. All this just so that we can eventually have a nice number in a list that is ordered by creation date by default anyway.
We also run the risk of collisions if two jobs with the same name are created at the same time.

@ferruzzi
Copy link
Contributor

I guess I see what you are saying. With a reasonably large range on the random suffix the collision rate will be low. My initial thought was how many times our proposed name might collide with a random suffix, but that's not really a big deal. I think I still like the incremental suffix, but I can see your point... I won't stand in the way if others think the tradeoff is worth it. 👍

Also, you've been doing a lot of little efficiency tweaks; it's definitely nice to see them. 👍

@vandonr-amz
Copy link
Contributor Author

sorry for the force-push, there was a clandestine commit that made its way into this PR :/

if fail_if_exists:
raise AirflowException(f"A SageMaker job with name {job_name} already exists.")
else:
job_name = f"{proposed_name}-{random.randint(0, 999999999):09}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we're using a randomized name here? Perhaps something more programmatic could be more reproducible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It used to not be random, but that required to enumerate all jobs created since the beginning of times.
Do you have a suggestion for a not random name ?
I also wanted to keep the - format.

The best alternative I can see is to use the current date, but we still need to resolve intra-day (or intra-hour or whatever) conflicts.
Listing all jobs for the day and adding a sequence number after that would be OK because it'd keep the number of jobs to list under control, but this behavior would also add a lot of complexity to this code, for limited benefits I think.

Also, what benefits do you see in having a predictable job name ? The job name is returned by the operator, so following steps should use that returned value if they need the job name.
If having a specific job name is important, the user can always setup the behavior to "fail" and come up with a programmatic naming scheme that will fit their usage, which will be easier than coming up with something that could fit all possible uses here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimberman are you satisfied by that explanation/reasoning? This one has been open a while and I'd like to get it merged soon 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding reproduction, would it be a good idea to use a time-based token, say time.time(), here? It would still be somewhat “random” but more predictable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably replace this with a millisecond timestamp, I guess that'd be enough to prevent collisions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking in again @dimberman, your request for changes is blocking this PR from being merged. Can you have another look and see if your concerns are sufficiently addressed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@o-nikolas I like @uranusjr 's suggestion of using a millisecond timestamp. It's reproduceable and easy to avoid collisions.

@@ -679,11 +697,11 @@ def __init__(
self.check_interval = check_interval
self.max_ingestion_time = max_ingestion_time
self.check_if_job_exists = check_if_job_exists
if action_if_job_exists in ("increment", "fail"):
if action_if_job_exists in {"random", "increment", "fail"}:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that since we're adding in a deprecation warning that should be sufficient as users will have proper logs as to why they are getting unexpected naming.

@vandonr-amz
Copy link
Contributor Author

hello @dimberman we just had a new system test failure because of this issue (getting throttled on too many listing requests). Any chance you can re-assess your review ? It'd be nice to get this merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:amazon-aws AWS/Amazon - related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants