fix code checking job names in sagemaker #29245

vandonr-amz · 2023-01-30T17:58:56Z

I was looking at this code because we got a throttling error in a system test.
The error was coming from the list_ operation on the job names.
This boto call has a cap at 100 results per call, and the function wanted all the jobs, so we were doing repeated calls with the NextToken set.
Looking online, it seems boto starts to throw throttling errors around 70/80 chained calls like this with the default config (i.e. around 7-8000 jobs).
The problem is that sagemaker jobs cannot be "cleaned", they stay there forever, so listing all of them is an increasingly longer operation, doomed to fail.

Looking further, it turned out that we didn't really needed to list all the jobs. It was used for 2 things: checking if the job name already existed (which can be done in a single describe call), and get a number to append to the job name if needed.
To make this a lot cheaper and not O(nb jobs), I propose that we use a random number instead of a monotonically increasing sequence to rename jobs in case of collision.

vandonr-amz · 2023-01-30T18:16:25Z

need to fix transform tests, I forgot that when pushing :|

EDIT: done

vincbeck · 2023-01-30T18:20:53Z

airflow/providers/amazon/aws/operators/sagemaker.py

@@ -679,11 +697,11 @@ def __init__(
        self.check_interval = check_interval
        self.max_ingestion_time = max_ingestion_time
        self.check_if_job_exists = check_if_job_exists
-        if action_if_job_exists in ("increment", "fail"):
+        if action_if_job_exists in {"random", "increment", "fail"}:


If I understand well, increment value is deprecated but if you set it, the behavior is the same as random? I am wondering if we should not keep the current behavior if increment is set

yes your understanding is correct.
We can be extra cautious and keep the existing behavior, as it's a change, but it's not a "breaking" change.
I think the "increment" behavior is a bit dangerous because it's just problems waiting to happen if you use it, so I'd rather remove it entirely.

Fine with that :)

These will cause the jobs to have different names from now onward, ya? I think @ferruzzi modified this code somewhat recently and we tried to keep the naming the same for backwards compatibility (folks may have other logic that depends on the names being in the format they have been historically).
How to handle these low level breaking changes has been a hot topic lately in Airflow, I still don't think we've landed on a spot everyone agrees with. Thoughts?

well there is the "every change is a breaking change" philosophy of course, and maybe I can imagine a workflow where changing this would be breaking, but it'd be pretty far fetched.
I can always keep the existing behavior, but I'd put big deprecation warnings around it, because let's say you run 10 jobs a day, you'd be looking at throttling exceptions after 2-3 years, which is not such a long time.

also, the "shape" of the naming is still the same, if a regex had been designed for the old code, the new code would pass it.

This was my recent poke at this code from back in November: #27634

I think that since we're adding in a deprecation warning that should be sufficient as users will have proper logs as to why they are getting unexpected naming.

airflow/providers/amazon/aws/operators/sagemaker.py

ferruzzi · 2023-01-31T19:24:44Z

I don't think I agree that using a random suffix is an improvement.

The sequence was already imperfect, because we counted all jobs, not just the ones with the same base name, so it'd go like:

We're only counting jobs whose name matches here

vandonr-amz · 2023-01-31T19:34:24Z

We're only counting jobs whose name matches here

Oh yes my bad, I read the code too fast.

Having a random number rather than beautiful sequence is not better in itself of course, but I think the price to pay for this little nice thing is too high.

Instead of an O(1) operation, we end up doing potentially hundreds of network calls, with added wait time to account for throttling errors, every time we create a job. All this just so that we can eventually have a nice number in a list that is ordered by creation date by default anyway.
We also run the risk of collisions if two jobs with the same name are created at the same time.

ferruzzi · 2023-01-31T19:53:36Z

I guess I see what you are saying. With a reasonably large range on the random suffix the collision rate will be low. My initial thought was how many times our proposed name might collide with a random suffix, but that's not really a big deal. I think I still like the incremental suffix, but I can see your point... I won't stand in the way if others think the tradeoff is worth it. 👍

Also, you've been doing a lot of little efficiency tweaks; it's definitely nice to see them. 👍

vandonr-amz · 2023-02-02T23:21:31Z

sorry for the force-push, there was a clandestine commit that made its way into this PR :/

dimberman · 2023-02-13T04:49:06Z

airflow/providers/amazon/aws/operators/sagemaker.py

+            if fail_if_exists:
+                raise AirflowException(f"A SageMaker job with name {job_name} already exists.")
+            else:
+                job_name = f"{proposed_name}-{random.randint(0, 999999999):09}"


Is there a reason we're using a randomized name here? Perhaps something more programmatic could be more reproducible?

It used to not be random, but that required to enumerate all jobs created since the beginning of times.
Do you have a suggestion for a not random name ?
I also wanted to keep the - format.

The best alternative I can see is to use the current date, but we still need to resolve intra-day (or intra-hour or whatever) conflicts.
Listing all jobs for the day and adding a sequence number after that would be OK because it'd keep the number of jobs to list under control, but this behavior would also add a lot of complexity to this code, for limited benefits I think.

Also, what benefits do you see in having a predictable job name ? The job name is returned by the operator, so following steps should use that returned value if they need the job name.
If having a specific job name is important, the user can always setup the behavior to "fail" and come up with a programmatic naming scheme that will fit their usage, which will be easier than coming up with something that could fit all possible uses here.

@dimberman are you satisfied by that explanation/reasoning? This one has been open a while and I'd like to get it merged soon 👍

Regarding reproduction, would it be a good idea to use a time-based token, say time.time(), here? It would still be somewhat “random” but more predictable.

I can probably replace this with a millisecond timestamp, I guess that'd be enough to prevent collisions

Checking in again @dimberman, your request for changes is blocking this PR from being merged. Can you have another look and see if your concerns are sufficiently addressed?

@o-nikolas I like @uranusjr 's suggestion of using a millisecond timestamp. It's reproduceable and easy to avoid collisions.

dimberman · 2023-02-13T04:53:15Z

airflow/providers/amazon/aws/operators/sagemaker.py

@@ -679,11 +697,11 @@ def __init__(
        self.check_interval = check_interval
        self.max_ingestion_time = max_ingestion_time
        self.check_if_job_exists = check_if_job_exists
-        if action_if_job_exists in ("increment", "fail"):
+        if action_if_job_exists in {"random", "increment", "fail"}:


I think that since we're adding in a deprecation warning that should be sufficient as users will have proper logs as to why they are getting unexpected naming.

vandonr-amz · 2023-02-21T20:18:48Z

hello @dimberman we just had a new system test failure because of this issue (getting throttled on too many listing requests). Any chance you can re-assess your review ? It'd be nice to get this merged.

fix code checking job names in sagemaker

89d1ee0

vandonr-amz requested review from eladkal and o-nikolas as code owners January 30, 2023 17:58

boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Jan 30, 2023

vincbeck reviewed Jan 30, 2023

View reviewed changes

fix transform code/tests

6bab0b1

vincbeck approved these changes Jan 30, 2023

View reviewed changes

eladkal reviewed Jan 31, 2023

View reviewed changes

airflow/providers/amazon/aws/operators/sagemaker.py Show resolved Hide resolved

vandonr-amz added 2 commits February 2, 2023 15:20

add deprecation warnings + some checks

4951d52

Merge remote-tracking branch 'origin/main' into vandonr/tests

d1d6f22

vandonr-amz force-pushed the vandonr/tests branch from ca9a7db to d1d6f22 Compare February 2, 2023 23:20

dimberman requested changes Feb 13, 2023

View reviewed changes

Merge branch 'main' into vandonr/tests

d3bb83e

potiuk requested a review from dimberman February 20, 2023 09:27

vandonr-amz added 3 commits February 22, 2023 13:46

replace random with timestamp for job name deduplication

4168002

Merge remote-tracking branch 'origin/main' into vandonr/tests

bf4e368

rename random mode to timestamp mode

c70cfc6

dimberman approved these changes Feb 27, 2023

View reviewed changes

dimberman merged commit df4abcb into apache:main Feb 27, 2023

eladkal mentioned this pull request Mar 3, 2023

Status of testing Providers that were prepared on March 03, 2023 #29901

Closed

40 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix code checking job names in sagemaker #29245

fix code checking job names in sagemaker #29245

vandonr-amz commented Jan 30, 2023 •

edited

vandonr-amz commented Jan 30, 2023 •

edited

vincbeck Jan 30, 2023

vandonr-amz Jan 30, 2023

vincbeck Jan 30, 2023

o-nikolas Jan 31, 2023

vandonr-amz Jan 31, 2023

vandonr-amz Jan 31, 2023

ferruzzi Jan 31, 2023

dimberman Feb 13, 2023

ferruzzi commented Jan 31, 2023

vandonr-amz commented Jan 31, 2023

ferruzzi commented Jan 31, 2023

vandonr-amz commented Feb 2, 2023

dimberman Feb 13, 2023

vandonr-amz Feb 13, 2023

o-nikolas Feb 14, 2023

uranusjr Feb 22, 2023

vandonr-amz Feb 22, 2023

o-nikolas Feb 27, 2023

dimberman Feb 27, 2023

dimberman Feb 13, 2023

vandonr-amz commented Feb 21, 2023

fix code checking job names in sagemaker #29245

fix code checking job names in sagemaker #29245

Conversation

vandonr-amz commented Jan 30, 2023 • edited

vandonr-amz commented Jan 30, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ferruzzi commented Jan 31, 2023

vandonr-amz commented Jan 31, 2023

ferruzzi commented Jan 31, 2023

vandonr-amz commented Feb 2, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vandonr-amz commented Feb 21, 2023

vandonr-amz commented Jan 30, 2023 •

edited

vandonr-amz commented Jan 30, 2023 •

edited