Member

@dongjoon-hyun dongjoon-hyun commented Oct 10, 2025

What changes were proposed in this pull request?

This PR aims to support multiple files in SparkApplication's pyFiles field.

Why are the changes needed?

Currently, pyFiles is mapped to the main resource directly because it assumes a single Python file.

primaryResource = new PythonMainAppResource(applicationSpec.getPyFiles());

However, pyFiles is supposed to be a comma-separated string. If users provide multiple files, it causes a failure like the following.

python3: can't open file '/opt/spark/examples/src/main/python/pi.py,local:///opt/spark/examples/src/main/python/sort.py': [Errno 2]

This PR proposes a mitigation: handle the first file in pyFiles as the primary resource and the remaining files as the actual pyFiles. Note that the previous logic keeps working without any change; the new logic applies only when mainClass is additionally set to org.apache.spark.deploy.PythonRunner.

BEFORE

spec:
  pyFiles: "local:///opt/spark/examples/src/main/python/pi.py"

AFTER

spec:
  mainClass: "org.apache.spark.deploy.PythonRunner"
  pyFiles: "local:///opt/spark/examples/src/main/python/pi.py,local:///opt/spark/examples/src/main/python/lib.py"
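The split described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the class name PyFilesSplit, its Result holder, and the choice to trim whitespace and skip empty entries are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the operator code base. */
final class PyFilesSplit {

  /** Holds the first pyFiles entry and the comma-joined remainder. */
  static final class Result {
    final String primary;
    final String rest;

    Result(String primary, String rest) {
      this.primary = primary;
      this.rest = rest;
    }
  }

  /** Splits a comma-separated pyFiles value into (first file, remaining files). */
  static Result split(String pyFiles) {
    List<String> files = new ArrayList<>();
    for (String part : pyFiles.split(",")) {
      String trimmed = part.trim(); // assumption: tolerate spaces around commas
      if (!trimmed.isEmpty()) {
        files.add(trimmed);
      }
    }
    if (files.isEmpty()) {
      throw new IllegalArgumentException("pyFiles must name at least one file");
    }
    // The first entry plays the role of the primary resource; the rest stay as pyFiles.
    String rest = String.join(",", files.subList(1, files.size()));
    return new Result(files.get(0), rest);
  }

  public static void main(String[] args) {
    Result r = split("local:///opt/spark/examples/src/main/python/pi.py,"
        + "local:///opt/spark/examples/src/main/python/lib.py");
    System.out.println("primary resource: " + r.primary);
    System.out.println("remaining pyFiles: " + r.rest);
  }
}
```

With a single-file value, rest is an empty string, which corresponds to the BEFORE case where there is nothing left to pass as additional Python files.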

Does this PR introduce any user-facing change?

No behavior change, because the new logic applies only when mainClass is set to org.apache.spark.deploy.PythonRunner.
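One way to see why existing specs are unaffected is to look at the order in which the branches are tried. The sketch below mirrors only the three conditions quoted in this conversation (jars, then the PythonRunner main class, then non-empty pyFiles); the class name BranchOrderSketch, the Branch enum, and the OTHER fallback are hypothetical and stand in for whatever the real code does in each branch.

```java
/** Hypothetical illustration of the branch order; not the operator's actual code. */
final class BranchOrderSketch {

  static final String PYTHON_RUNNER = "org.apache.spark.deploy.PythonRunner";

  /** Labels for the branches; the names are assumptions made for this example. */
  enum Branch { JARS, PYTHON_RUNNER_PYFILES, SINGLE_PYFILE, OTHER }

  static Branch select(String jars, String mainClass, String pyFiles) {
    if (isNotEmpty(jars)) {
      return Branch.JARS;                  // checked first, as in the quoted snippet
    } else if (PYTHON_RUNNER.equals(mainClass)) {
      return Branch.PYTHON_RUNNER_PYFILES; // new branch: first file is primary, rest are pyFiles
    } else if (isNotEmpty(pyFiles)) {
      return Branch.SINGLE_PYFILE;         // previous logic: whole string is the primary resource
    }
    return Branch.OTHER;                   // anything else is out of scope for this sketch
  }

  private static boolean isNotEmpty(String s) {
    return s != null && !s.isEmpty();
  }

  public static void main(String[] args) {
    // A spec without mainClass still reaches the previous single-file branch.
    System.out.println(select(null, null, "local:///opt/spark/examples/src/main/python/pi.py"));
    // Setting mainClass to PythonRunner opts in to the new handling.
    System.out.println(select(null, PYTHON_RUNNER,
        "local:///opt/spark/examples/src/main/python/pi.py,"
            + "local:///opt/spark/examples/src/main/python/lib.py"));
  }
}
```

Because the new branch is keyed on mainClass alone, a spec that sets only pyFiles continues to fall through to the old condition.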

How was this patch tested?

Pass the CIs with the newly added test case.

Was this patch authored or co-authored using generative AI tooling?

No.

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-53869] Support multiple files in pyFiles" to "[SPARK-53869] Support multiple files in pyFiles field" on Oct 10, 2025
Member Author

cc @jiangzho , @peter-toth , @viirya

if (StringUtils.isNotEmpty(applicationSpec.getJars())) {
  primaryResource = new JavaMainAppResource(Option.apply(applicationSpec.getJars()));
  effectiveSparkConf.setIfMissing("spark.jars", applicationSpec.getJars());
} else if ("org.apache.spark.deploy.PythonRunner".equals(applicationSpec.getMainClass())) {
Member

I wonder why in the branch below, it doesn't check if the main class is PythonRunner before creating PythonMainAppResource?

Member Author

Yes, previously, it didn't care about mainClass because it only checked the following.

} else if (StringUtils.isNotEmpty(applicationSpec.getPyFiles())) {

Member Author

Thank you, @viirya . Let me merge this and improve PySpark examples.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-53869 branch October 10, 2025 18:53
Contributor

@peter-toth peter-toth left a comment

Late LGTM.
