
[SPARK-3047] [PySpark] add an option to use str in textFileRDD #1951

Closed · wants to merge 4 commits

Conversation
Conversation

@davies (Contributor) commented Aug 14, 2014

str is much more efficient than unicode (in both CPU and memory), so it's better to use str in textFileRDD. To keep compatibility, unicode is still used by default (maybe this can change in the future).

use_unicode=True:

daviesliu@dm:~/work/spark$ time python wc.py
(u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real 2m8.298s
user 0m0.185s
sys 0m0.064s

use_unicode=False:

daviesliu@dm:~/work/spark$ time python wc.py
('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real 1m26.402s
user 0m0.182s
sys 0m0.062s

That's roughly a 33% reduction in running time!
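
For reference, a minimal sketch of how the new option is used (the file path and app name here are illustrative, not from this PR):

from pyspark import SparkContext

sc = SparkContext("local", "wc")

# Default: each line is decoded into a unicode object
# (backward compatible, but slower and larger in memory).
lines_unicode = sc.textFile("README.md")

# With use_unicode=False, each line stays a utf-8 encoded str,
# which is faster to produce and smaller in memory.
lines_str = sc.textFile("README.md", use_unicode=False)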

str is much more efficient than unicode
@SparkQA commented Aug 14, 2014

QA tests have started for PR 1951. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18564/consoleFull

@SparkQA commented Aug 14, 2014

QA results for PR 1951:
  • This patch PASSES unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18564/consoleFull

@JoshRosen (Contributor):
I think there's one more use of UTF8Deserializer, in worker.py, that might need to be updated to reflect the new default.

def loads(self, stream):
    length = read_int(stream)
    return stream.read(length).decode('utf8')
Contributor (inline review comment):
I don't know how well we've stuck to this convention in the existing code, but my original intention was that loads() loaded a single record and load_stream() loaded a stream of records. If you wanted, we could conditionally define loads based on whether use_unicode is set, which would allow the serializer to deserialize either an individual element or a stream.
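
A minimal sketch of that idea, assuming PySpark's usual length-prefixed framing (the read_int helper and the exact class shape are assumptions for illustration, not the merged code):

import struct

def read_int(stream):
    # Read a big-endian 4-byte length prefix.
    data = stream.read(4)
    if not data:
        raise EOFError
    return struct.unpack("!i", data)[0]

class UTF8Deserializer(object):
    def __init__(self, use_unicode=True):
        self.use_unicode = use_unicode

    def loads(self, stream):
        # Deserialize one length-prefixed record; decode to unicode
        # only when requested.
        length = read_int(stream)
        s = stream.read(length)
        return s.decode("utf-8") if self.use_unicode else s

    def load_stream(self, stream):
        # Deserialize records until the stream is exhausted.
        try:
            while True:
                yield self.loads(stream)
        except EOFError:
            return

With this, the same instance can handle both an individual element (loads) and a stream of records (load_stream).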

@JoshRosen (Contributor):
This is a nice performance optimization. Should we document this somewhere? My concern is that users will never find out about it.

@davies (Contributor, Author) commented Aug 19, 2014

@JoshRosen The uses in worker.py are safe either way (str or unicode), so I did not change them.

@SparkQA commented Aug 19, 2014

QA tests have started for PR 1951 at commit 85246e5.

  • This patch merges cleanly.

@SparkQA commented Aug 19, 2014

QA tests have finished for PR 1951 at commit 85246e5.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 20, 2014

QA tests have started for PR 1951 at commit a286f2f.

  • This patch merges cleanly.

@SparkQA commented Aug 20, 2014

QA tests have finished for PR 1951 at commit a286f2f.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""
Read a text file from HDFS, a local file system (available on all
nodes), or any Hadoop-supported file system URI, and return it as an
RDD of Strings.

If use_unicode is False, the strings will be kept as `str` (encoded
as `utf-8`), which is faster and smaller than unicode. (Added in
Spark 1.1)
Contributor (inline review comment):
Since this didn't make it into 1.1, maybe we should change this to 1.2 (or just drop the version information completely).

@JoshRosen (Contributor):
Aside from the minor comment about version numbers, this looks good to me. I can see how this could lead to large performance wins for certain jobs (especially when parsing, say, numeric data that's stored in a CSV format).
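
For example, a hypothetical numeric-CSV job (reusing the sc from the sketch above; data.csv is an assumption): float() parses str and unicode equally well, so skipping the utf-8 decode saves work on every line:

# Parse numeric CSV rows without paying for unicode decoding.
rows = sc.textFile("data.csv", use_unicode=False) \
         .map(lambda line: [float(x) for x in line.split(",")])
print(rows.take(1))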

@SparkQA commented Sep 7, 2014

QA tests have started for PR 1951 at commit 8352d57.

  • This patch merges cleanly.

@SparkQA commented Sep 7, 2014

QA tests have finished for PR 1951 at commit 8352d57.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor):
Jenkins, retest this please.

@SparkQA commented Sep 7, 2014

QA tests have started for PR 1951 at commit 8352d57.

  • This patch merges cleanly.

@SparkQA commented Sep 7, 2014

QA tests have finished for PR 1951 at commit 8352d57.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 10, 2014

QA tests have started for PR 1951 at commit 8352d57.

  • This patch merges cleanly.

@SparkQA commented Sep 10, 2014

QA tests have finished for PR 1951 at commit 8352d57.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen (Contributor):
This looks good to me, so I'm going to merge it into master. Thanks!

@asfgit asfgit closed this in 1ef656e Sep 11, 2014
@davies davies deleted the unicode branch September 15, 2014 22:19