[SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action #18277

chaoslawful · 2017-06-12T10:20:32Z

What changes were proposed in this pull request?

The pipe action converts objects to strings in a way that depends on the default encoding setting of the Python environment.

This patch fixes the problem. A detailed description is in the JIRA issue:

https://issues.apache.org/jira/browse/SPARK-20947

How was this patch tested?

Run the following statement in the pyspark shell; with this patch applied, it does NOT raise an exception:

sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()
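For readers without a Spark shell at hand, the round trip that pipe('cat') performs can be approximated in plain Python 3. This is only a sketch under stated assumptions, not PySpark's implementation: pipe_lines is a hypothetical helper, and it assumes a cat executable is on the PATH.

```python
import shlex
import subprocess


def pipe_lines(command, items):
    """Hypothetical helper: send one line per item to `command`, return its output lines."""
    # Encode explicitly as UTF-8 instead of relying on any interpreter default.
    payload = "".join(str(item).rstrip("\n") + "\n" for item in items).encode("utf-8")
    completed = subprocess.run(
        shlex.split(command), input=payload, stdout=subprocess.PIPE, check=True
    )
    # splitlines() drops the line endings; decode each line back to text.
    return [line.decode("utf-8") for line in completed.stdout.splitlines()]


result = pipe_lines("cat", [u"\u6d4b\u8bd5", "1"])
print(result == [u"\u6d4b\u8bd5", "1"])
```

The point of the sketch is that the text-to-bytes step is explicit; the failure reported in this PR comes from that step happening implicitly with the interpreter's default encoding.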

HyukjinKwon · 2017-06-13T06:17:45Z

python/pyspark/rdd.py

@@ -751,7 +751,7 @@ def func(iterator):

            def pipe_objs(out):
                for obj in iterator:
-                    s = str(obj).rstrip('\n') + '\n'
+                    s = unicode(obj).rstrip('\n') + '\n'


I think we need a small test to validate this.

viirya · 2017-06-14T15:17:33Z

When you try this on an RDD of arrays of unicode strings, the result in Python 2 looks a bit weird.

Using Python version 2.7.12 (default, Jul  1 2016 15:12:24)
SparkSession available as 'spark'.
>>> data = [u'\u6d4b\u8bd5', '1']
>>> rdd = sc.parallelize(data)
>>> result = rdd.pipe('cat').collect()
>>> result
[u'\u6d4b\u8bd5', u'1']
>>> data = [[u'\u6d4b\u8bd5', '1'], ['1', '2']]
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
[[u'\u6d4b\u8bd5', '1'], ['1', '2']]
>>> result = rdd.pipe('cat').collect()
>>> result
[u"[u'\\u6d4b\\u8bd5', '1']", u"['1', '2']"]     # looks weird and different to Python3.
>>> 

Using Python version 3.5.2 (default, Nov 17 2016 17:05:23)
SparkSession available as 'spark'.
>>> data = [u'\u6d4b\u8bd5', '1']
>>> rdd = sc.parallelize(data)
>>> result = rdd.pipe('cat').collect()
>>> result
['\u6d4b\u8bd5', '1']
>>> data = [[u'\u6d4b\u8bd5', '1'], ['1', '2']]
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
[['\u6d4b\u8bd5', '1'], ['1', '2']]
>>> result = rdd.pipe('cat').collect()
>>> result
["['\u6d4b\u8bd5', '1']", "['1', '2']"]
>>>

chaoslawful · 2017-06-15T11:17:55Z

Well, the difference comes from repr() behaving differently by default in Python 2 and Python 3. The previous code does no better than the patched code in this case, and it additionally fails when processing unicode strings.
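The repr() difference can be illustrated from Python 3 alone. The sketch below assumes that the built-in ascii() is a fair stand-in for Python 2's repr() escaping; Python 2 would additionally print a u prefix on unicode elements, which ascii() does not reproduce.

```python
data = [u"\u6d4b\u8bd5", "1"]

# Python 3: str() of a list uses repr() on each element, and repr()
# keeps printable non-ASCII characters as they are.
py3_text = str(data)

# ascii() escapes non-ASCII characters, similar to what Python 2's repr() did.
escaped_text = ascii(data)

print(escaped_text)
print(py3_text == escaped_text)
```

The two strings differ, which matches the difference between the Python 2 and Python 3 sessions above for the nested-list case.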

On the other hand, the pipe() action by definition involves an implicit serialization from any type to bytes, so IMHO the application itself should handle serialization and deserialization consistently before and after pipe() IF it wants the same behavior in every environment.
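One way an application could take control of that serialization is to turn each record into a JSON line before piping and parse it afterwards. This is a minimal sketch, not something proposed in this PR; to_line and from_line are hypothetical names, and the Spark call in the comment is only indicative.

```python
import json


def to_line(record):
    # json.dumps escapes non-ASCII characters by default (ensure_ascii=True),
    # so the resulting line is pure ASCII regardless of the environment.
    return json.dumps(record)


def from_line(line):
    return json.loads(line)


records = [[u"\u6d4b\u8bd5", "1"], ["1", "2"]]
lines = [to_line(record) for record in records]

# With Spark this would be roughly: rdd.map(to_line).pipe('cat').map(from_line)
restored = [from_line(line) for line in lines]

print(lines)
print(restored == records)
```

Because every line is ASCII, the piped bytes do not depend on the interpreter version or its default encoding.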

sasameti · 2017-10-15T18:24:09Z

how do I apply the patch?

holdenk · 2017-11-18T14:58:48Z

What do you think, @HyukjinKwon? I think this is probably a reasonable fix, but we might break the code of some people who have been depending on the bug.

HyukjinKwon · 2017-11-19T06:23:12Z

It seems okay at first glance. Let me take a closer look soon.

HyukjinKwon · 2018-01-01T04:10:40Z

ok to test

SparkQA · 2018-01-01T04:45:47Z

Test build #85574 has finished for PR 18277 at commit 8c88595.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2018-01-19T03:57:28Z

Jenkins OK to test.

holdenk

Tentatively LGTM provided no other objections. I think this is OK but we should create a JIRA for us to document this in the release notes as a possible breaking change (it improves correctness but some folks may have written code that depends on the old behaviour). What do you think @HyukjinKwon & @viirya?

viirya · 2018-01-19T06:03:16Z

retest this please.

viirya · 2018-01-19T06:14:29Z

This change looks reasonable to me for now, but I'm also concerned about the behavior change. A note in the release notes should be good, or maybe we need a note in a migration guide section of the RDD Programming Guide.

SparkQA · 2018-01-19T06:40:42Z

Test build #86375 has finished for PR 18277 at commit 8c88595.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-01-19T09:21:38Z

I wanted to clarify for myself what we will change here, because it's quite confusing to me.

In Python 3, basestring = unicode = str is declared earlier in the file, so this won't change anything. I think this is not our concern.

In Python 2,

Before:

str(obj).encode("utf8")

When obj is unicode:

str(obj): encoded to bytes by the system default encoding (ascii)
.encode("utf-8"): decoded to unicode by the system default encoding (ascii) and then encoded to bytes as UTF-8

When obj is str:

str(obj): bytes as-is
.encode("utf-8"): decoded to unicode by the system default encoding (ascii) and then encoded to bytes as UTF-8

When obj is another type:

str(obj): calls __str__()
.encode("utf-8"): decoded to unicode by the system default encoding (ascii) and then encoded to bytes as UTF-8

After:

unicode(obj).encode("utf8")

When obj is unicode:

unicode(obj): unicode as-is
.encode("utf-8"): encoded to bytes as UTF-8

When obj is str:

unicode(obj): decoded to unicode by the system default encoding (ascii)
.encode("utf-8"): encoded to bytes as UTF-8

When obj is another type:

unicode(obj): calls __unicode__(). It falls back to __str__() if __unicode__() is not defined.
.encode("utf-8"): encoded to bytes as UTF-8

(Just to be clear, UTF-8 is ASCII-compatible.)
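The unicode case of the breakdown above can be modeled in Python 3 by spelling out the implicit conversions. This is a sketch that assumes the Python 2 system default encoding is ascii, as stated above; before_patch and after_patch are hypothetical names used only for illustration.

```python
def before_patch(text):
    """Model of Python 2 str(obj).encode("utf-8") when obj is unicode."""
    as_bytes = text.encode("ascii")  # str(obj): implicit ASCII encode
    # .encode("utf-8") on bytes: implicit ASCII decode, then UTF-8 encode
    return as_bytes.decode("ascii").encode("utf-8")


def after_patch(text):
    """Model of Python 2 unicode(obj).encode("utf-8") when obj is unicode."""
    return text.encode("utf-8")


for sample in [u"abc", u"\u6d4b\u8bd5"]:
    try:
        before = before_patch(sample)
    except UnicodeEncodeError:
        before = "UnicodeEncodeError"
    print(ascii(sample), before, after_patch(sample))
```

For ASCII-only input both paths produce the same bytes; only non-ASCII input distinguishes them, which is why the change is a fix for the unicode case rather than a general change in output.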

HyukjinKwon · 2018-01-19T09:23:37Z

So, after this change, we get rid of the system-default round trip in the "When obj is unicode" and "When obj is other types" cases.

In the "When obj is other types" case, we might have a behaviour change if __unicode__() is defined differently from __str__(), but I believe that's quite rare.

So, LGTM, but I'd like @holdenk and @viirya to double-check whether I missed anything.

HyukjinKwon · 2018-01-19T10:52:09Z

cc @ueshin too. I think we were both involved in several PRs related to encoding/decoding.

HyukjinKwon · 2018-01-20T13:31:51Z

python/pyspark/rdd.py

@@ -751,7 +751,7 @@ def func(iterator):

            def pipe_objs(out):
                for obj in iterator:
-                    s = str(obj).rstrip('\n') + '\n'
+                    s = unicode(obj).rstrip('\n') + '\n'


@chaoslawful, if you are active, we could change '\n' to u'\n' to reduce confusion and avoid relying on the implicit conversion between str and unicode.

HyukjinKwon · 2018-01-20T13:32:24Z

I'll merge this in a few days if there are no more comments.

HyukjinKwon · 2018-01-22T00:59:24Z

I'll merge this into master only, considering the concerns in #18277 (review) and #18277 (comment). Adding a note or backporting to branch-2.3 could be fine too; I don't feel strongly about it.

HyukjinKwon · 2018-01-22T01:00:47Z

retest this please

SparkQA · 2018-01-22T01:35:19Z

Test build #86450 has finished for PR 18277 at commit 8c88595.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-01-22T01:43:50Z

Merged to master.

[SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action

1b13e34

HyukjinKwon reviewed Jun 13, 2017

王晓哲 added 2 commits June 14, 2017 20:36

[SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action

47355e6

Merge branch 'fix_pipe_encoding_error' of github.com:chaoslawful/spar…

8c88595

…k into fix_pipe_encoding_error

holdenk approved these changes Jan 19, 2018

HyukjinKwon reviewed Jan 20, 2018

asfgit closed this in 602c6d8 Jan 22, 2018
