[SPARK-26658][PySpark]: Call pickle.dump with protocol version 3 for Python 3 to fix the serialization issue with large objects #23577
When a PySpark job running on Python 3 tries to serialize large objects, it throws a pickle error when serializing a global variable object and an overflow error when broadcasting. Refer to the corresponding JIRA https://issues.apache.org/jira/browse/SPARK-26658 for more details.
What changes were proposed in this pull request?
Fixed the issue by updating the pickle.dump calls in the code to use protocol version 3 for Python 3.
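For illustration, a minimal sketch of the idea (not the exact Spark patch; the variable and function names here are hypothetical):

```python
import pickle
import sys

# Pick the pickle protocol from the interpreter version instead of
# hard-coding protocol 2, so Python 3 uses protocol 3
# (hypothetical sketch, not the actual Spark code).
pickle_protocol = 3 if sys.version_info[0] >= 3 else 2

def dump_to_file(value, f):
    # Serialize with the version-appropriate protocol; on Python 3 this
    # avoids the protocol-2 errors hit when pickling large objects.
    pickle.dump(value, f, pickle_protocol)
```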
How was this patch tested?
Ran manual tests before and after the fix.
Steps To Reproduce:
To reproduce the above issue, I am using the word2vec model trained on the Google News dataset downloaded from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
Use Python 3.x with the gensim module installed (or ship the module as a zip file using --py-files).
Launch pyspark with the following command:
bin/pyspark --master yarn --py-files additionalPythonModules.zip --conf spark.driver.memory=16g --conf spark.executor.memory=16g --conf spark.driver.memoryOverhead=16g --conf spark.executor.memoryOverhead=16g --conf spark.executor.pyspark.memory=16g
Run the following commands. For the sake of reproducing the issue, I have pasted only the relevant parts of the code here:
SparkSession available as 'spark'.
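The full pasted snippet is not reproduced here; a minimal sketch of the kind of reproduction described (assuming gensim's KeyedVectors API and an illustrative model file name) would be:

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News word2vec model (file name illustrative).
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

# Broadcasting the large model object is what triggered the overflow error;
# referencing it as a global from a mapped function triggered the pickle
# error described in the JIRA.
bc_model = spark.sparkContext.broadcast(model)
```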