
MLCOMPUTE-1203 | Configure Spark driver pod memory and cores based on Spark args #3892

Merged (10 commits) on Jun 17, 2024

Conversation

CaptainSame (Contributor)

No description provided.

nemacysts (Member) commented Jun 10, 2024

@CaptainSame is this for the new soaconfigs config style? if so, we probably need to figure out how to reconcile this with the usual way of specifying resources

CaptainSame (Contributor Author)

> @CaptainSame is this for the new soaconfigs config style? if so, we probably need to figure out how to reconcile this with the usual way of specifying resources

Yes. There was a case where cpus and mem were not specified in the soa config but spark.driver.memory and spark.driver.cores were, so the Spark driver container was started with the default resources, which were quite low (see the sketch below).

Do you think it makes sense to:

1. override the cpus and mem values with the spark.driver.* values if both are specified, OR
2. take the maximum of both sets of values if both are specified, OR
3. ask users to keep both values equal and just read cpus and mem from soaconfigs, as in the status quo?
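
For illustration, a minimal sketch of the situation described above, written as a Python dict rather than the actual yelpsoa-configs YAML; the option values are only placeholders:

    # Hypothetical tron action config (values invented for illustration):
    # cpus/mem are omitted, but driver resources are set via spark_args.
    action_config = {
        "executor": "spark",
        "spark_args": {
            "spark.driver.cores": "2",
            "spark.driver.memory": "8g",
        },
        # "cpus" and "mem" are not set, so the driver pod fell back to the
        # (quite low) paasta defaults instead of the requested 2 cores / 8g.
    }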

chi-yelp (Contributor)

In paasta spark-run, people can use Spark configs (spark.driver.cores and spark.driver.memory) to specify the driver resources they need. I think it's still important to support doing it this way. Also, I'm a little worried that if we allow using the paasta config's cpus and mem for Spark jobs, people might get confused about whether they apply to the Spark driver or to the executors.

Comment on lines 265 to 269
try:
    memory_bytes = float(mem)
except ValueError:
    print(f"Unable to parse memory value {mem}")
memory_unit = memory_bytes / MEM_MULTIPLIER[unit]
Member

it's probably worth doing some validation in paasta validate / the schema to ensure that this is correctly specified: right now, if there's an issue with the value provided, this will return 0

(also, we probably want to verify that the provided unit is valid too - otherwise the division line will throw an exception when indexing into MEM_MULTIPLIER)

Member

(oh, i see - we hardcode the unit when we call this)

Contributor Author

Changed it to return a default 2 GB memory
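
As a rough sketch (not the exact paasta_tools code, which may differ in details), the parse-with-default behaviour described above could look like this, reusing the MEM_MULTIPLIER table from the snippet under review:

    # Sketch of get_spark_memory_in_unit with a 2 GB fallback; the real
    # implementation lives in paasta_tools/spark_tools.py.
    MEM_MULTIPLIER = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

    def get_spark_memory_in_unit(mem: str, unit: str) -> float:
        """Convert a JVM memory string like '2g' into the requested unit."""
        try:
            if mem[-1] in MEM_MULTIPLIER:
                memory_bytes = float(mem[:-1]) * MEM_MULTIPLIER[mem[-1]]
            else:
                memory_bytes = float(mem)
        except (ValueError, IndexError):
            print(f"Unable to parse memory value {mem}, defaulting to 2g")
            memory_bytes = 2 * MEM_MULTIPLIER["g"]
        return memory_bytes / MEM_MULTIPLIER[unit]

    print(get_spark_memory_in_unit("2g", "m"))  # 2048.0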

Comment on lines 287 to 296
def get_cpus(self) -> float:
    # set Spark driver pod CPU if it is specified by Spark arguments
    cpus = 0.0
    if (
        self.action_spark_config
        and "spark.driver.cores" in self.action_spark_config
    ):
        cpus = float(self.action_spark_config["spark.driver.cores"])
    # use the soa config otherwise
    return cpus or super().get_cpus()
Member

Suggested change:

def get_cpus(self) -> float:
    if (
        self.get_executor() == "spark"
        and "spark.driver.cores" in self.action_spark_config
    ):
        return self.action_spark_config["spark.driver.cores"]
    # NOTE: we fallback to this if there's no spark.driver.cores config to
    # use the paasta default
    return super().get_cpus()

Member

that said, maybe we want a different default CPU for spark drivers? the paasta default for both cpu and memory might be too small for drivers :)

Member

(also, imo we should add spark.driver.cores to the spark_args schema and enforce that it's a float there so that we don't need to remember to cast it in the python code :))
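
A rough sketch of what that schema constraint could look like, using the jsonschema library directly; the actual paasta schema files and property layout may differ:

    # Hypothetical jsonschema fragment enforcing that spark.driver.cores is a
    # number, so the Python code no longer needs to cast it.
    from jsonschema import ValidationError, validate

    spark_args_schema = {
        "type": "object",
        "properties": {
            "spark.driver.cores": {"type": "number"},
        },
    }

    try:
        validate({"spark.driver.cores": "two"}, spark_args_schema)
    except ValidationError as e:
        print(f"Invalid spark_args: {e.message}")  # 'two' is not of type 'number'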

Contributor Author

I would rather enforce the default for spark.driver.cores and spark.driver.memory in service_configuration_lib than use another default value here.
Even if we enforce the float in the schema, it passes through operations in service_configuration_lib and everything is returned as a string.

Member

we need this to return the resources that are actually used tho (either user-specified or the defaults that are used) - this can call service_configuration_lib code if necessary :)

re: s_c_l: i think it's also worth slowly refactoring that code - it's sorta painful to work with when everything in that library is stringly-typed :)

Comment on lines 298 to 310
def get_mem(self) -> float:
    # set Spark driver pod memory if it is specified by Spark arguments
    mem_mb = 0.0
    if (
        self.action_spark_config
        and "spark.driver.memory" in self.action_spark_config
    ):
        # need to set mem in MB based on tron schema
        mem_mb = spark_tools.get_spark_memory_in_unit(
            self.action_spark_config["spark.driver.memory"], "m"
        )
    # use the soa config otherwise
    return mem_mb or super().get_mem()
Member

Suggested change:

def get_mem(self) -> float:
    if (
        self.get_executor() == "spark"
        and "spark.driver.memory" in self.action_spark_config
    ):
        return spark_tools.get_spark_memory_in_unit(
            self.action_spark_config["spark.driver.memory"], "m"
        )
    # NOTE: we fallback to this if there's no spark.driver.memory config to
    # use the paasta default
    return super().get_mem()

Member

(we probably also want to do the same thing i mentioned above re: making the schema ensure that spark.driver.memory is a number :))

Member

and i think we might need to make sure this returns an int? i don't think we allow for fractional mem values

Contributor Author

spark.driver.memory doesn't have to be a number. We allow the values allowed by Spark, which are JVM memory strings. I saw that the signature of the original get_mem method returns a float, so I kept it the same here.
Should I change it to return int in both places?

Member

re: spark.driver.memory: i don't think that would work with the code as-is since we're hardcoding that the unit is "m"

that said, we can still add some schema/paasta validate validation so that we can ensure that we're not getting junk data from users

re: changing the signature: we can probably leave this for another time - i see that flink is using floats for things like mem: 1.5Gi

Member

oh, re: unit="m" - i re-read the code with fresh eyes today and realized i'd been misreading it, the code is converting the value to MB :p

Member

that said, all the pods i'm spot-checking only have integer PAASTA_RESOURCE_MEM values and the flink pods that are setting fractional mem values don't add that env var - so casting to an int here is probably a good idea to keep the data looking the same (mypy shouldn't complain about returning an int from a function typed to return a float from my testing)
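
As a tiny illustration of that cast (driver_mem_mb is a made-up helper name, reusing the get_spark_memory_in_unit sketch from earlier in this thread):

    # Truncate to whole MB so PAASTA_RESOURCE_MEM stays an integer, even for
    # values like "1.5g"; callers typed to return float can still return this.
    def driver_mem_mb(spark_config: dict) -> int:
        mem_str = spark_config.get("spark.driver.memory", "2g")
        return int(get_spark_memory_in_unit(mem_str, "m"))

    print(driver_mem_mb({"spark.driver.memory": "1.5g"}))  # 1536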

(Resolved, outdated review threads on paasta_tools/spark_tools.py and paasta_tools/tron_tools.py)
error_msgs.append(
    f"{self.get_job_name()}.{self.get_action_name()} is a Spark job. `mem` config is not allowed. "
    f"Please specify the driver memory using `spark.driver.memory`."
)
chi-yelp (Contributor) commented Jun 14, 2024

We can add checks in the yelpsoa_configs pre-commit hook (after this), so users can notice the problem earlier, before the next run.

Member

it's probably worth adding this to paasta validate now - it'd be essentially dropping this same code in there :)

Member

oh actually, this is already being called by paasta validate :)

Comment on lines 262 to 264
if mem:
    if mem[-1] in MEM_MULTIPLIER:
        memory_bytes = float(mem[:-1]) * MEM_MULTIPLIER[mem[-1]]
Member

if we don't add validation for the user-provided value here now, then this needs to be wrapped in a try-except as well (since otherwise this will crash if a user puts in "lolk" or 1B as a spark.driver.memory value)
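
A small sketch of what such up-front validation might look like; is_valid_spark_memory is a made-up helper, and the accepted suffixes are an assumption based on the MEM_MULTIPLIER keys above:

    import re

    # Made-up validator: accept JVM-style memory strings ("512m", "2g", ...)
    # or a bare number, and reject junk like "lolk" or "1B" before parsing.
    _SPARK_MEM_RE = re.compile(r"^\d+(\.\d+)?[kmgt]?$")

    def is_valid_spark_memory(mem: str) -> bool:
        return bool(_SPARK_MEM_RE.match(mem))

    assert is_valid_spark_memory("2g")
    assert not is_valid_spark_memory("lolk")
    assert not is_valid_spark_memory("1B")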
