
[SPARK-19872][PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition #17282

Closed

Conversation

HyukjinKwon (Member) commented Mar 13, 2017

What changes were proposed in this pull request?

This PR proposes to use the correct deserializer, `BatchedSerializer`, for RDD construction in coalesce/repartition when the shuffle is enabled. Currently, the original `UTF8Deserializer` is passed as-is instead of the `BatchedSerializer` from the reserialized copy.

with the file, `text.txt` below:

```
a
b

d
e
f
g
h
i
j
k
l

```

- Before

```python
>>> sc.textFile('text.txt').repartition(1).collect()
```

```
UTF8Deserializer(True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/rdd.py", line 811, in collect
    return list(_load_from_socket(port, self._jrdd_deserializer))
  File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
    yield self.loads(stream)
  File ".../spark/python/pyspark/serializers.py", line 544, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
```

- After

```python
>>> sc.textFile('text.txt').repartition(1).collect()
```

```
[u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
```
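The error above can be reproduced with the standard library alone, which shows why the wrong deserializer fails in exactly this way: `BatchedSerializer(PickleSerializer())` writes pickled batches, and pickle protocol 2 frames begin with the byte `0x80`, which is never a valid UTF-8 start byte. A minimal stdlib-only sketch (not PySpark code itself):

```python
import pickle

# A batch of strings roughly as BatchedSerializer(PickleSerializer())
# would write it: one pickled list per batch.
batch = pickle.dumps(["a", "b", ""], protocol=2)

# Pickle protocol 2 frames start with the opcode byte 0x80 -- the very
# byte the traceback above complains about.
print(batch[:1])  # b'\x80'

# Reading the same bytes back with the wrong deserializer (plain UTF-8
# decoding, as UTF8Deserializer does) fails exactly as in the report.
try:
    batch.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # 'invalid start byte'
```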

How was this patch tested?

Unit test in `python/pyspark/tests.py`.

HyukjinKwon (Member, Author)

cc @davies, could you see if it makes sense?

@HyukjinKwon HyukjinKwon changed the title [SPARK-19872][PYTHON] Only reseralize BatchedSerializers when repartitioning [SPARK-19872][PYTHON] Only reseralize with BatchedSerializer when repartitioning for skewed partition handling Mar 13, 2017
SparkQA commented Mar 13, 2017

Test build #74462 has finished for PR 17282 at commit 30688db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya (Member) commented Mar 15, 2017

The root cause is that `coalesce` uses the wrong `jrdd_deserializer` when constructing the new RDD.

The correct fix looks like:

```diff
diff --git a/python/pyspark/rdd.py b/python/pyspark/rdd.py
index a5e6e2b..291c1ca 100644
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@@ -2072,10 +2072,12 @@ class RDD(object):
             batchSize = min(10, self.ctx._batchSize or 1024)
             ser = BatchedSerializer(PickleSerializer(), batchSize)
             selfCopy = self._reserialize(ser)
+            jrdd_deserializer = selfCopy._jrdd_deserializer
             jrdd = selfCopy._jrdd.coalesce(numPartitions, shuffle)
         else:
+            jrdd_deserializer = self._jrdd_deserializer
             jrdd = self._jrdd.coalesce(numPartitions, shuffle)
-        return RDD(jrdd, self.ctx, self._jrdd_deserializer)
+        return RDD(jrdd, self.ctx, jrdd_deserializer)
```
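The shape of this fix can be illustrated with a toy model. This is plain Python, not PySpark; every name below is illustrative, not Spark's API. The point it demonstrates: the RDD returned from the shuffle path must carry the deserializer that matches the bytes it actually wraps.

```python
import pickle

def utf8_decode(blob):
    """Stand-in for UTF8Deserializer: decode bytes as UTF-8 lines."""
    return blob.decode("utf-8").splitlines()

def batched_decode(blob):
    """Stand-in for BatchedSerializer's read side: unpickle a batch."""
    return pickle.loads(blob)

class ToyRDD:
    """Toy stand-in (not PySpark) for an RDD over serialized bytes."""

    def __init__(self, blob, deserializer):
        self.blob = blob                  # the Java-side bytes in real PySpark
        self.deserializer = deserializer  # how those bytes are read back

    def coalesce(self, numPartitions, shuffle=False):
        if shuffle:
            # The shuffle path reserializes the data in batched-pickle form,
            # so the resulting RDD must carry the matching deserializer.
            # Passing along self.deserializer here was the pre-fix bug.
            reserialized = pickle.dumps(self.deserializer(self.blob))
            return ToyRDD(reserialized, batched_decode)
        return ToyRDD(self.blob, self.deserializer)

    def collect(self):
        return self.deserializer(self.blob)

rdd = ToyRDD("a\nb\nc".encode("utf-8"), utf8_decode)
print(rdd.coalesce(1, shuffle=True).collect())  # ['a', 'b', 'c']
```

With the bug, the shuffled RDD would keep `utf8_decode` while holding pickled bytes, and `collect` would raise the same `UnicodeDecodeError` shown in the report.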

HyukjinKwon (Member, Author)

@viirya, you are right; I overlooked that. Thanks for correcting this.

@HyukjinKwon HyukjinKwon changed the title [SPARK-19872][PYTHON] Only reseralize with BatchedSerializer when repartitioning for skewed partition handling [SPARK-19872][PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition Mar 15, 2017
SparkQA commented Mar 15, 2017

Test build #74592 has finished for PR 17282 at commit 925cd2e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

viirya (Member) commented Mar 15, 2017

LGTM

HyukjinKwon (Member, Author)

@viirya, thank you for taking a look.

davies (Contributor) commented Mar 15, 2017

lgtm, merging into master, and 2.1 branch.

asfgit pushed a commit that referenced this pull request Mar 15, 2017
…ion for coalesce/repartition

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17282 from HyukjinKwon/SPARK-19872.

(cherry picked from commit 7387126)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
@asfgit asfgit closed this in 7387126 Mar 15, 2017
HyukjinKwon (Member, Author)

Thank you both @davies and @viirya

@HyukjinKwon HyukjinKwon deleted the SPARK-19872 branch January 2, 2018 03:43
Majdouline-Meddad commented Feb 9, 2018

My code is:

```python
sc.binaryFiles('hdfs://localhost:9000/user/majdouline/Training').repartition(90).collect()
```

and I got this error:

```
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/pyspark/worker.py", line 174, in main
    process()
  File "/usr/local/spark/python/pyspark/worker.py", line 169, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/local/spark/python/pyspark/serializers.py", line 328, in _load_stream_without_unbatching
    for (key_batch, val_batch) in zip(key_batch_stream, val_batch_stream):
  File "/usr/local/spark/python/pyspark/serializers.py", line 529, in load_stream
    yield self.loads(stream)
  File "/usr/local/spark/python/pyspark/serializers.py", line 524, in loads
    return s.decode("utf-8") if self.use_unicode else s
  File "/grid/hadoop/yarn/local/usercache/rsrpsinr/appcache/application_1506405147397_0015/container_1506405147397_0015_01_000005/Python2SparkDl/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
```

I changed rdd.py and serializers.py (version 2.1.0 to 2.0.2), but I got the same error. Can you please help me fix this?
