
[SPARK-21945][YARN][PYTHON] Make --py-files work with PySpark shell in Yarn client mode #21267

Closed · wants to merge 3 commits · Changes from 1 commit
20 changes: 17 additions & 3 deletions python/pyspark/context.py
@@ -211,9 +211,23 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
Contributor:

Is this copy necessary? Couldn't you just add path to sys.path (instead of adding filepath) and that would solve the problem?

Member Author:

That's the initial approach I tried. The thing is, a .py file in the configuration needs its parent directory on sys.path (not the .py file itself), and adding that directory would also pick up any other .py files that happen to be in it.
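
A minimal sketch of the two options under discussion (directories and file names here are made up for illustration):

```python
import os
import shutil
import sys
import tempfile

# Made-up stand-ins: a py-file listed in 'spark.submit.pyFiles' and the
# SparkFiles root directory (which PySpark already puts on sys.path).
deps = tempfile.mkdtemp()
root = tempfile.mkdtemp()
path = os.path.join(deps, "helper.py")
with open(path, "w") as f:
    f.write("x = 1\n")

# Option 1: add the parent directory. 'import helper' works, but every
# other .py file that happens to live in `deps` becomes importable too.
sys.path.insert(1, os.path.dirname(path))

# Option 2 (what the patch does): copy just the one file into the root
# that is already on sys.path, so nothing else from `deps` leaks in.
shutil.copyfile(path, os.path.join(root, os.path.basename(path)))
sys.path.insert(1, root)

import helper
print(helper.x)  # 1
```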

Member:

For file types in PACKAGE_EXTENSIONS, do we need to copy?

Member Author:

I don't think so, but that's already being done in the other cluster/client modes: there the copies are made via addFile, whereas in this case specifically the file isn't copied. I think it's better to copy consistently.

Member:

Are 'spark.submit.pyFiles' files only missing on the driver side? I mean, if they are not added by SparkContext.addFile, shouldn't they also be missing on the executors?

Member Author:

Yup, they're only missing on the driver side in this mode specifically. YARN doesn't add them since spark.files is not set, if I understood correctly. They are handled specially in the submit case, but the shell case seems to have been missed.

I described this a bit in the PR description too.

In the case of YARN client and cluster modes with spark-submit, these files are handled manually. In particular, #6360 added most of the logic; there the Python path is set manually via, for example, deploy.PythonRunner. We don't use spark.files here.
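
As a rough, hypothetical illustration of the difference (this is not actual Spark code): in the submit case, the launcher effectively puts each py-file on the driver's Python path before user code runs, along these lines:

```python
import os

# Loose, hypothetical sketch of what the submit path arranges for the
# driver (cf. deploy.PythonRunner); the values are made up. In YARN
# client mode with the PySpark shell, an equivalent step was missing,
# which is what this PR works around on the driver side.
py_files = "/home/user/deps.zip,/home/user/helper.py"
os.environ["PYTHONPATH"] = os.pathsep.join(
    [p for p in py_files.split(",") if p]
    + [os.environ.get("PYTHONPATH", "")])
```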

+                    if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
Member:

Am I missing anything? It looks like PACKAGE_EXTENSIONS = ('.zip', '.egg', '.jar'), so .py doesn't seem to be in that?

Member Author:

The root directory is added to the path above; a .py file just needs its parent directory on the path, not the file itself.
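
To make the distinction concrete, a small sketch (the paths are made up):

```python
import sys

# Archives in PACKAGE_EXTENSIONS ('.zip', '.egg', '.jar') can be put on
# sys.path directly: zipimport resolves Python modules inside .zip/.egg.
sys.path.insert(1, "/tmp/spark-files/deps.zip")

# A plain .py file cannot be a sys.path entry itself; its directory has
# to be on the path. Since the SparkFiles root directory is already
# there, copying helper.py into it is enough for 'import helper' to work.
sys.path.insert(1, "/tmp/spark-files")
```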

Member:

Oh, I see.

+                        self._python_includes.append(filename)
+                        sys.path.insert(1, filepath)
+                except Exception as e:
+                    from pyspark import util
+                    warnings.warn(
Member Author:

The log was also tested manually:

.../python/pyspark/context.py:230: RuntimeWarning: Python file [/home/spark/tmp.py] specified in 'spark.submit.pyFiles' failed to be added in the Python path, excluding this in the Python path.
  : ...

Member Author:

BTW, this should now be safer in any case, since we no longer put non-existent files on the path and we print out warnings instead.

Member Author:

Likewise, I checked the warning manually:

.../pyspark/context.py:229: RuntimeWarning: Failed to add file [/home/spark/tmp.py] specified in 'spark.submit.pyFiles' to Python path:

...
  /usr/lib64/python27.zip
  /usr/lib64/python2.7
... 

"Python file [%s] specified in 'spark.submit.pyFiles' failed "
Contributor:

Simplify this message?

"Failed to add file [%s] speficied in 'spark.submit.pyFiles' to Python path:\n %s"

"to be added in the Python path, excluding this in the Python path."
"\n :%s" % (path, util._exception_message(e)),
RuntimeWarning)

         # Create a temporary directory inside spark.local.dir:
         local_dir = self._jvm.org.apache.spark.util.Utils.getLocalDir(self._jsc.sc().conf())