[SPARK-21945][YARN][PYTHON] Make --py-files work with PySpark shell in Yarn client mode #21267
Changes from 2 commits
```diff
@@ -211,9 +211,22 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize,
         for path in self._conf.get("spark.submit.pyFiles", "").split(","):
             if path != "":
                 (dirname, filename) = os.path.split(path)
-                if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
-                    self._python_includes.append(filename)
-                    sys.path.insert(1, os.path.join(SparkFiles.getRootDirectory(), filename))
+                try:
+                    filepath = os.path.join(SparkFiles.getRootDirectory(), filename)
+                    if not os.path.exists(filepath):
+                        # In case of YARN with shell mode, 'spark.submit.pyFiles' files are
+                        # not added via SparkContext.addFile. Here we check if the file exists,
+                        # try to copy and then add it to the path. See SPARK-21945.
+                        shutil.copyfile(path, filepath)
+                    if filename[-4:].lower() in self.PACKAGE_EXTENSIONS:
+                        self._python_includes.append(filename)
+                    sys.path.insert(1, filepath)
+                except Exception:
+                    warnings.warn(
+                        "Failed to add file [%s] specified in 'spark.submit.pyFiles' to "
+                        "Python path:\n  %s" % (path, "\n  ".join(sys.path)),
+                        RuntimeWarning)

         # Create a temporary directory inside spark.local.dir:
         local_dir = self._jvm.org.apache.spark.util.Utils.getLocalDir(self._jsc.sc().conf())
```

Review thread on `sys.path.insert(1, filepath)`:

> Am I missing anything? Looks like …

> The root is added into the path above. A `.py` file needs its parent directory on the path, not the `.py` file itself.

> Oh, I see.

Review comments on the warning:

> The log was also tested manually.

> BTW, this should now be safer in any case, since we no longer put non-existent files on the path and we print warnings instead.

> Likewise, I checked the warning manually.
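The inline exchange above turns on a detail of Python's import system: entries on `sys.path` must be directories (or zip archives), so a bare `.py` file path on the path does nothing. A minimal standalone sketch, using a hypothetical module name `mylib` (not part of the patch):

```python
import importlib
import os
import sys
import tempfile

# Write a throwaway module into a fresh directory.
root = tempfile.mkdtemp()
module_path = os.path.join(root, "mylib.py")  # hypothetical module
with open(module_path, "w") as f:
    f.write("VALUE = 42\n")

# Putting the .py file itself on sys.path does NOT make it importable.
sys.path.insert(1, module_path)
try:
    importlib.import_module("mylib")
    found_via_file = True
except ImportError:
    found_via_file = False
sys.path.remove(module_path)

# Putting its parent directory on sys.path does.
sys.path.insert(1, root)
mylib = importlib.import_module("mylib")
print(found_via_file, mylib.VALUE)  # False 42
```

This is why the patch copies each configured file into `SparkFiles.getRootDirectory()`, a directory that is already on the path, rather than relying on the file's own location.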
Review thread on the copy:

> Is this copy necessary? Couldn't you just add `path` to `sys.path` (instead of adding `filepath`), and wouldn't that solve the problem?

> That's the initial approach I tried. The thing is, a `.py` file in the configuration needs its parent directory on `sys.path`, not the `.py` file itself, and adding the parent directory would also pick up any other `.py` files that happen to be in that directory.
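The reply above can be demonstrated directly: putting a file's parent directory on `sys.path` exposes its siblings too, whereas copying just the requested file into a dedicated root does not. A sketch with hypothetical module names `wanted` and `sibling`:

```python
import importlib
import os
import shutil
import sys
import tempfile

# A source directory containing the requested file plus an unrelated sibling.
src = tempfile.mkdtemp()
for name in ("wanted.py", "sibling.py"):
    with open(os.path.join(src, name), "w") as f:
        f.write("OK = True\n")

# A dedicated root, standing in for SparkFiles.getRootDirectory(),
# into which only the requested file is copied.
root = tempfile.mkdtemp()
shutil.copyfile(os.path.join(src, "wanted.py"), os.path.join(root, "wanted.py"))

sys.path.insert(1, root)
wanted = importlib.import_module("wanted")  # importable: it was copied
try:
    importlib.import_module("sibling")      # not importable: it was not copied
    sibling_leaked = True
except ImportError:
    sibling_leaked = False
print(wanted.OK, sibling_leaked)  # True False
```

Copying keeps the set of importable modules exactly the set the user configured in `spark.submit.pyFiles`.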
Review thread on `PACKAGE_EXTENSIONS`:

> For file types in `PACKAGE_EXTENSIONS`, do we need to copy?

> I don't think so, but that's already being done in the other cluster / client modes: the copies are made via `addFile` there, and only this case skips the copy. I think we had better copy consistently.
Review thread on where the files are missing:

> Are 'spark.submit.pyFiles' files only missing on the driver side? I mean, if they are not added by `SparkContext.addFile`, shouldn't they also be missing on the executors?

> Yup, they are only missing on the driver side in this mode specifically. YARN doesn't add them since `spark.files` is not set, if I understood correctly. They are specially handled in the spark-submit case, but the shell case seems to be missing. I described this a bit in the PR description too.
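The patch's `except` branch degrades to a warning rather than failing shell startup when a configured file cannot be copied. A minimal sketch of that fallback pattern (not Spark's actual code; the helper name and paths are hypothetical):

```python
import os
import shutil
import tempfile
import warnings

def add_py_file(src_path, root, py_path):
    # Mirror the patch's structure: try to copy into the shared root and
    # register the file, and emit a RuntimeWarning on any failure.
    try:
        filepath = os.path.join(root, os.path.basename(src_path))
        if not os.path.exists(filepath):
            shutil.copyfile(src_path, filepath)  # raises if src_path is missing
        py_path.insert(1, filepath)
    except Exception:
        warnings.warn(
            "Failed to add file [%s] specified in 'spark.submit.pyFiles' to "
            "Python path:\n  %s" % (src_path, "\n  ".join(py_path)),
            RuntimeWarning)

py_path = []
root = tempfile.mkdtemp()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    add_py_file("/no/such/file.py", root, py_path)  # nonexistent source

print(len(caught), issubclass(caught[0].category, RuntimeWarning), py_path)
# 1 True []
```

The nonexistent file produces one `RuntimeWarning` and, importantly, is not inserted into the path, which is the "safer in any case" behavior the author describes above.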