
[SPARK-3047] [PySpark] add an option to use str in textFileRDD #1951

Closed · wants to merge 4 commits

Conversation
Conversation

@davies (Contributor) commented Aug 14, 2014

str is much more efficient than unicode (in both CPU and memory), so it's better to use str in textFileRDD. To keep compatibility, unicode is still used by default (maybe this can change in the future).

use_unicode=True:

daviesliu@dm:~/work/spark$ time python wc.py
(u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real 2m8.298s
user 0m0.185s
sys 0m0.064s

use_unicode=False:

daviesliu@dm:~/work/spark$ time python wc.py
('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real 1m26.402s
user 0m0.182s
sys 0m0.062s

That's roughly a 33% reduction in running time!
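
For reference, a minimal sketch of how the new option is used (the file path and app name here are illustrative, not from this PR):

from pyspark import SparkContext

sc = SparkContext("local", "wc")

# Default: each line is decoded into a unicode object
# (backward compatible, but slower and larger in memory).
lines_unicode = sc.textFile("README.md")

# With use_unicode=False, each line stays a utf-8 encoded str,
# which is faster to produce and smaller in memory.
lines_str = sc.textFile("README.md", use_unicode=False)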

str is much more efficient than unicode
@SparkQA commented Aug 14, 2014

QA tests have started for PR 1951. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18564/consoleFull

@SparkQA commented Aug 14, 2014

QA results for PR 1951:
  • This patch PASSES unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18564/consoleFull

@JoshRosen (Contributor):
I think there's one more use of UTF8Deserializer, in worker.py, that might need to be updated to reflect the new default.

def loads(self, stream):
    length = read_int(stream)
    return stream.read(length).decode('utf8')
Contributor (inline review comment):
I don't know how well we've stuck to this convention in the existing code, but my original intention was that loads() loaded a single record and load_stream() loaded a stream of records. If you wanted, we could conditionally define loads based on whether use_unicode is set, which would allow the serializer to deserialize either an individual element or a stream.
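
A minimal sketch of that idea, assuming PySpark's usual length-prefixed framing (the read_int helper and the exact class shape are assumptions for illustration, not the merged code):

import struct

def read_int(stream):
    # Read a big-endian 4-byte length prefix.
    data = stream.read(4)
    if not data:
        raise EOFError
    return struct.unpack("!i", data)[0]

class UTF8Deserializer(object):
    def __init__(self, use_unicode=True):
        self.use_unicode = use_unicode

    def loads(self, stream):
        # Deserialize one length-prefixed record; decode to unicode
        # only when requested.
        length = read_int(stream)
        s = stream.read(length)
        return s.decode("utf-8") if self.use_unicode else s

    def load_stream(self, stream):
        # Deserialize records until the stream is exhausted.
        try:
            while True:
                yield self.loads(stream)
        except EOFError:
            return

With this, the same instance can handle both an individual element (loads) and a stream of records (load_stream).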

@JoshRosen (Contributor):
This is a nice performance optimization. Should we document this somewhere? My concern is that users will never find out about it.

@davies (Contributor, Author) commented Aug 19, 2014

@JoshRosen The uses in worker.py are safe either way (str or unicode), so I did not change them.

@SparkQA commented Aug 19, 2014

QA tests have started for PR 1951 at commit 85246e5.

  • This patch merges cleanly.

@SparkQA commented Aug 19, 2014

QA tests have finished for PR 1951 at commit 85246e5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 20, 2014

QA tests have started for PR 1951 at commit a286f2f.

  • This patch merges cleanly.

@SparkQA commented Aug 20, 2014

QA tests have finished for PR 1951 at commit a286f2f.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""
Read a text file from HDFS, a local file system (available on all
nodes), or any Hadoop-supported file system URI, and return it as an
RDD of Strings.

If use_unicode is False, the strings will be kept as `str` (encoded
as `utf-8`), which is faster and smaller than unicode. (Added in
Spark 1.1)
Contributor (inline review comment):
Since this didn't make it into 1.1, maybe we should change this to 1.2 (or just drop the version information completely).

@JoshRosen (Contributor):
Aside from the minor comment about version numbers, this looks good to me. I can see how this could lead to large performance wins for certain jobs (especially when parsing, say, numeric data that's stored in a CSV format).
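
For example, a hypothetical numeric-CSV job (reusing the sc from the sketch above; data.csv is an assumption): float() parses str and unicode equally well, so skipping the utf-8 decode saves work on every line:

# Parse numeric CSV rows without paying for unicode decoding.
rows = sc.textFile("data.csv", use_unicode=False) \
         .map(lambda line: [float(x) for x in line.split(",")])
print(rows.take(1))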

@SparkQA commented Sep 7, 2014

QA tests have started for PR 1951 at commit 8352d57.

  • This patch merges cleanly.

@SparkQA commented Sep 7, 2014

QA tests have finished for PR 1951 at commit 8352d57.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor):
Jenkins, retest this please.

@SparkQA commented Sep 7, 2014

QA tests have started for PR 1951 at commit 8352d57.

  • This patch merges cleanly.

@SparkQA commented Sep 7, 2014

QA tests have finished for PR 1951 at commit 8352d57.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 10, 2014

QA tests have started for PR 1951 at commit 8352d57.

  • This patch merges cleanly.

@SparkQA commented Sep 10, 2014

QA tests have finished for PR 1951 at commit 8352d57.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor):
This looks good to me, so I'm going to merge it into master. Thanks!

@asfgit asfgit closed this in 1ef656e Sep 11, 2014
@davies davies deleted the unicode branch September 15, 2014 22:19