[SPARK-20947][PYTHON] Fix encoding/decoding error in pipe action #18277

Closed

Conversation

chaoslawful

What changes were proposed in this pull request?

The pipe action converts objects into strings in a way that depends on the default encoding setting of the Python environment.

This patch fixes the problem. A detailed description is available here:

https://issues.apache.org/jira/browse/SPARK-20947

How was this patch tested?

Run the following statement in pyspark-shell. It will NOT raise an exception if this patch is applied:

sc.parallelize([u'\u6d4b\u8bd5']).pipe('cat').collect()

@@ -751,7 +751,7 @@ def func(iterator):

             def pipe_objs(out):
                 for obj in iterator:
-                    s = str(obj).rstrip('\n') + '\n'
+                    s = unicode(obj).rstrip('\n') + '\n'
Member

I think we need a small test to validate this.
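A small test along those lines could be sketched without a Spark cluster. The snippet below is only an approximation of what pipe() does per partition, written for Python 3 (where str is already the text type that unicode was in Python 2). The names pipe_through, CAT_COMMAND and test_pipe_unicode are hypothetical and not part of PySpark; CAT_COMMAND is a portable stand-in for `cat`.

```python
import subprocess
import sys
import threading

# Portable stand-in for `cat`: copy stdin to stdout byte for byte.
CAT_COMMAND = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]


def pipe_through(command, items):
    """Send each item through `command` as one UTF-8 line; return the output lines."""
    proc = subprocess.Popen(command, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

    def feed(out):
        for obj in items:
            # Convert to text first, then encode explicitly as UTF-8.
            line = str(obj).rstrip("\n") + "\n"
            out.write(line.encode("utf-8"))
        out.close()

    # Write from a separate thread so reading stdout cannot deadlock.
    writer = threading.Thread(target=feed, args=(proc.stdin,))
    writer.start()
    result = [raw.rstrip(b"\n").decode("utf-8") for raw in proc.stdout]
    writer.join()
    proc.stdout.close()
    proc.wait()
    return result


def test_pipe_unicode():
    # Non-ASCII text must survive the round trip unchanged.
    assert pipe_through(CAT_COMMAND, ["\u6d4b\u8bd5", 1]) == ["\u6d4b\u8bd5", "1"]
```

An actual regression test in the PySpark suite would call sc.parallelize(...).pipe('cat').collect() instead of this local helper.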

@viirya
Member

viirya commented Jun 14, 2017

When you try this on an RDD of arrays of unicode strings, the result in Python 2 looks a bit weird.

Using Python version 2.7.12 (default, Jul  1 2016 15:12:24)
SparkSession available as 'spark'.
>>> data = [u'\u6d4b\u8bd5', '1']
>>> rdd = sc.parallelize(data)
>>> result = rdd.pipe('cat').collect()
>>> result   
[u'\u6d4b\u8bd5', u'1']
>>> data = [[u'\u6d4b\u8bd5', '1'], ['1', '2']]                                                                    
>>> rdd = sc.parallelize(data)                                                                                    
>>> rdd.collect()
[[u'\u6d4b\u8bd5', '1'], ['1', '2']] 
>>> result = rdd.pipe('cat').collect()                                                                             
>>> result
[u"[u'\\u6d4b\\u8bd5', '1']", u"['1', '2']"]     # looks weird and different to Python3.
>>> 

Using Python version 3.5.2 (default, Nov 17 2016 17:05:23)
SparkSession available as 'spark'.
>>> data = [u'\u6d4b\u8bd5', '1']
>>> rdd = sc.parallelize(data)
>>> result = rdd.pipe('cat').collect()
>>> result
['\u6d4b\u8bd5', '1']
>>> data = [[u'\u6d4b\u8bd5', '1'], ['1', '2']]
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
[['\u6d4b\u8bd5', '1'], ['1', '2']]
>>> result = rdd.pipe('cat').collect()
>>> result
["['\u6d4b\u8bd5', '1']", "['1', '2']"]
>>>

@chaoslawful
Author

Well, the difference comes from the divergent default behavior of repr() in Python 2 and Python 3. The previous code handles this case no better than the patched code, and it additionally causes trouble when processing unicode strings.
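The repr() difference can be reproduced without Spark. A minimal sketch, assuming Python 3: the built-in ascii() escapes non-ASCII characters, which approximates what Python 2's repr() produced (apart from the u prefix), while str() of a list keeps printable non-ASCII characters as they are.

```python
row = ["\u6d4b\u8bd5", "1"]

# Python 3: str() of a list applies repr() to each element and keeps
# printable non-ASCII characters unescaped.
py3_text = str(row)

# ascii() escapes non-ASCII characters, similar to Python 2's repr()
# (but without the u'' prefix on each string).
py2_like_text = ascii(row)

print(py2_like_text)          # ['\u6d4b\u8bd5', '1']
print(py3_text == py2_like_text)  # False
```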

On the other hand, the pipe() action by definition involves implicit serialization from any type to bytes, so IMHO the application itself should take care of serializing/deserializing data consistently before/after the pipe() action IF it wants the same behavior in every environment.
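One way an application might do that is to serialize each record to a line of JSON before piping and parse it afterwards. This is only a sketch of the idea; encode_for_pipe and decode_from_pipe are hypothetical helper names, not PySpark APIs.

```python
import json


def encode_for_pipe(record):
    """Serialize one record to a single line of JSON text."""
    # ensure_ascii=True (the default) keeps the line pure ASCII, so the
    # text does not depend on any locale or default-encoding setting.
    return json.dumps(record)


def decode_from_pipe(line):
    """Inverse of encode_for_pipe."""
    return json.loads(line)


rows = [["\u6d4b\u8bd5", "1"], ["1", "2"]]
lines = [encode_for_pipe(row) for row in rows]
restored = [decode_from_pipe(line) for line in lines]

print(lines[0])          # ["\u6d4b\u8bd5", "1"]
print(restored == rows)  # True
```

With an RDD this would look roughly like rdd.map(encode_for_pipe).pipe('cat').map(decode_from_pipe); that chain is given for orientation only and was not run here.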

@sasameti

How do I apply the patch?

@holdenk
Contributor

holdenk commented Nov 18, 2017

What do you think, @HyukjinKwon? I think this is probably a reasonable fix, but we might break the code of some people who have been depending on the bug.

@HyukjinKwon
Member

It seems okay at first glance. Let me take a closer look soon if I can.

@HyukjinKwon
Member

ok to test

@SparkQA

SparkQA commented Jan 1, 2018

Test build #85574 has finished for PR 18277 at commit 8c88595.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jan 19, 2018

Jenkins OK to test.

Contributor

@holdenk holdenk left a comment

Tentatively LGTM provided no other objections. I think this is OK but we should create a JIRA for us to document this in the release notes as a possible breaking change (it improves correctness but some folks may have written code that depends on the old behaviour). What do you think @HyukjinKwon & @viirya?

@viirya
Member

viirya commented Jan 19, 2018

retest this please.

@viirya
Member

viirya commented Jan 19, 2018

This change looks reasonable to me for now, but I'm also concerned about the behavior change. A note in the release notes should be good, or maybe we need a note in a migration guide in the RDD Programming Guide.

@SparkQA

SparkQA commented Jan 19, 2018

Test build #86375 has finished for PR 18277 at commit 8c88595.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Jan 19, 2018

I wanted to clarify for myself what we will change here, because it's quite confusing to me.

In Python 3, basestring = unicode = str is declared above, so this won't change anything. I think this is not our concern.

In Python 2,

Before:

str(obj).encode("utf8")

When obj is unicode:

  1. str(obj): encoded to bytes by the system default (ascii)

  2. .encode("utf-8"): decoded to unicode by the system default (ascii) and then encoded to bytes as UTF-8

When obj is str:

  1. str(obj): bytes as they are

  2. .encode("utf-8"): decoded to unicode by the system default (ascii) and then encoded to bytes as UTF-8

When obj is other types:

  1. str(obj): calls __str__()

  2. .encode("utf-8"): decoded to unicode by the system default (ascii) and then encoded to bytes as UTF-8

After:

unicode(obj).encode("utf8")

When obj is unicode:

  1. unicode(obj): unicode as it is

  2. .encode("utf-8"): encoded to bytes as UTF-8

When obj is str:

  1. unicode(obj): decoded to unicode by the system default (ascii)

  2. .encode("utf-8"): encoded to bytes as UTF-8

When obj is other types:

  1. unicode(obj): calls __unicode__(). It falls back to __str__() if __unicode__() is not defined.

  2. .encode("utf-8"): encoded to bytes as UTF-8

(Just to note, UTF-8 is ASCII compatible.)
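The "obj is unicode" case of this before/after comparison can be emulated in Python 3 by making the implicit Python 2 conversions explicit. This is an illustrative sketch, not code from the patch: old_path, new_path and fails_with_unicode_error are hypothetical names, and the ascii codec stands in for the Python 2 system default.

```python
def old_path(text):
    """Emulate Python 2 `str(obj).encode('utf-8')` when obj is unicode."""
    # Step 1: str(obj) implicitly encodes with the default codec (ascii).
    raw = text.encode("ascii")
    # Step 2: .encode('utf-8') on a byte string first decodes with ascii.
    return raw.decode("ascii").encode("utf-8")


def new_path(text):
    """Emulate Python 2 `unicode(obj).encode('utf-8')` when obj is unicode."""
    # unicode(obj) returns the text as it is; only the UTF-8 encode remains.
    return text.encode("utf-8")


def fails_with_unicode_error(func, text):
    """Return True if func(text) raises a UnicodeError, False otherwise."""
    try:
        func(text)
    except UnicodeError:
        return True
    return False


print(old_path("abc") == new_path("abc"))                     # True
print(fails_with_unicode_error(old_path, "\u6d4b\u8bd5"))     # True
print(fails_with_unicode_error(new_path, "\u6d4b\u8bd5"))     # False
```

For pure ASCII input both paths produce the same bytes, which is why the bug only shows up with non-ASCII data.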

@HyukjinKwon
Member

So, after this change, we will get rid of the system default round trip in the "When obj is unicode" and "When obj is other types" cases.

In the "When obj is other types" case, we might have a behaviour change if __unicode__() is defined differently from __str__(), but I believe that's quite rare.
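To make that rare case concrete, here is a sketch that runs on Python 3 by emulating the lookup order of Python 2's unicode(obj). The helper py2_unicode and the class Label are hypothetical and exist only for illustration; Python 3 itself never calls __unicode__().

```python
def py2_unicode(obj):
    """Emulate the lookup order of Python 2's unicode(obj) (illustration only)."""
    method = getattr(type(obj), "__unicode__", None)
    if method is not None:
        return method(obj)
    # No __unicode__() defined: fall back to __str__().
    return str(obj)


class Label:
    """A type whose __unicode__() deliberately differs from its __str__()."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return "Label(%s)" % self.name

    def __unicode__(self):
        return self.name


label = Label("a")
print(str(label))          # Label(a)  <- what the old code piped
print(py2_unicode(label))  # a         <- what the patched code would pipe
print(py2_unicode(3))      # 3         <- types without __unicode__() are unaffected
```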

So, LGTM but I want a double check from you @holdenk and @viirya if I missed anything.

@HyukjinKwon
Member

cc @ueshin too. I think we were both involved in several PRs related to encoding / decoding.

@@ -751,7 +751,7 @@ def func(iterator):

             def pipe_objs(out):
                 for obj in iterator:
-                    s = str(obj).rstrip('\n') + '\n'
+                    s = unicode(obj).rstrip('\n') + '\n'
Member

@chaoslawful, if you are active, we could change '\n' to u'\n' to reduce confusion and avoid relying on the implicit conversion between str and unicode.

@HyukjinKwon
Member

Let me merge this one in a few days if there are no more comments.

@HyukjinKwon
Member

HyukjinKwon commented Jan 22, 2018

Let me merge this one only into master considering the concerns - #18277 (review) and #18277 (comment). Adding a note / backporting to branch-2.3 could be fine. I don't feel strongly about it.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jan 22, 2018

Test build #86450 has finished for PR 18277 at commit 8c88595.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 602c6d8 Jan 22, 2018