[SPARK-19872][PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition #17282
Conversation
cc @davies, could you see if it makes sense?
Test build #74462 has finished for PR 17282 at commit
The root cause is … The correct fix looks like: …
@viirya you are right. I overlooked it. Thanks for correcting this.
Force-pushed from 30688db to 925cd2e
Test build #74592 has finished for PR 17282 at commit
LGTM
@viirya, thank you for taking a look.
lgtm, merging into master and 2.1 branch.
[SPARK-19872][PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition

## What changes were proposed in this pull request?

This PR proposes to use the correct deserializer, `BatchedSerializer`, for RDD construction for coalesce/repartition when the shuffle is enabled. Currently, it is passing `UTF8Deserializer` as is, not the `BatchedSerializer` from the copied one.

With the file `text.txt` below:

```
a
b

d
e
f
g
h
i
j
k
l

```

- Before

```python
>>> sc.textFile('text.txt').repartition(1).collect()
```

```
UTF8Deserializer(True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/rdd.py", line 811, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
    yield self.loads(stream)
  File ".../spark/python/pyspark/serializers.py", line 544, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
```

- After

```python
>>> sc.textFile('text.txt').repartition(1).collect()
```

```
[u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
```

## How was this patch tested?

Unit test in `python/pyspark/tests.py`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17282 from HyukjinKwon/SPARK-19872.

(cherry picked from commit 7387126)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
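For reference, here is a minimal sketch of the change described above, written as a standalone helper rather than the actual `RDD.coalesce` patch; `_reserialize` and `_jrdd_deserializer` are existing PySpark internals, and the exact merged diff may differ. The point is that, when shuffle is enabled, the RDD is reserialized with a `BatchedSerializer`, and the resulting RDD must be built with that copy's deserializer instead of the original one (for example `UTF8Deserializer` for `textFile`-based RDDs).

```python
# Sketch only: mirrors the idea of the fix, not necessarily the exact patch.
from pyspark.rdd import RDD
from pyspark.serializers import BatchedSerializer, PickleSerializer

def coalesce_sketch(rdd, numPartitions, shuffle=False):
    if shuffle:
        # Use a small batch size so elements distribute evenly across partitions.
        batchSize = min(10, getattr(rdd.ctx, "_batchSize", 0) or 1024)
        ser = BatchedSerializer(PickleSerializer(), batchSize)
        selfCopy = rdd._reserialize(ser)
        # The fix: take the deserializer from the reserialized copy,
        # not from the original RDD (which may be a UTF8Deserializer).
        jrdd_deserializer = selfCopy._jrdd_deserializer
        jrdd = selfCopy._jrdd.coalesce(numPartitions, shuffle)
    else:
        jrdd_deserializer = rdd._jrdd_deserializer
        jrdd = rdd._jrdd.coalesce(numPartitions, shuffle)
    return RDD(jrdd, rdd.ctx, jrdd_deserializer)
```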
My code is `sc.binaryFiles('hdfs://localhost:9000/user/majdouline/Training').repartition(90).collect()` and I got this error: … I changed rdd.py and serializers.py (applying the 2.1.0 version to 2.0.2), but I got the same error.
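As a hedged debugging aid (not something from this thread): one way to see which deserializer a repartitioned RDD ended up with is to inspect the private `_jrdd_deserializer` attribute, assuming an existing SparkContext `sc` and the HDFS path from the comment above.

```python
# Inspection only: _jrdd_deserializer is a private attribute, useful for debugging.
rdd = sc.binaryFiles('hdfs://localhost:9000/user/majdouline/Training').repartition(90)
print(rdd._jrdd_deserializer)
# Without the SPARK-19872 fix this still shows the original (non-batched) deserializer,
# which cannot decode the pickled shuffle output and fails later in collect().
```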
What changes were proposed in this pull request?

This PR proposes to use the correct deserializer, `BatchedSerializer`, for RDD construction for coalesce/repartition when the shuffle is enabled. Currently, it is passing `UTF8Deserializer` as is, not the `BatchedSerializer` from the copied one; the before/after behavior with the file `text.txt` is shown in the commit message above.

How was this patch tested?

Unit test in `python/pyspark/tests.py`.
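To make the testing note above concrete, here is an illustrative regression check in the spirit of the description; it is not the exact test added to `python/pyspark/tests.py`, and it assumes a local SparkContext created just for the demo.

```python
# Illustrative only: repartitioning a textFile-backed RDD exercises the shuffle
# path where UTF8Deserializer used to be passed through incorrectly.
import tempfile
from pyspark import SparkContext

sc = SparkContext("local[2]", "spark-19872-sketch")  # assumed local test context
try:
    with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
        f.write("a\nb\n\nd\n")
        path = f.name
    lines = sc.textFile(path)
    # Before the fix this raised UnicodeDecodeError in collect(); after it,
    # the lines (including the empty one) round-trip through repartition.
    assert sorted(lines.repartition(1).collect()) == sorted([u"a", u"b", u"", u"d"])
finally:
    sc.stop()
```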