[MINOR][DOCS] Mention lack of RDD order preservation after deserialization #28465
wetneb wants to merge 3 commits into apache:master
Conversation
docs/rdd-programming-guide.md (outdated)

  * If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

- * All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`.
+ * All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. When multiple files are read, the order of elements in the resulting RDD is not guaranteed, as files can be read in any order. Within a partition, element order is respected.
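For illustration, here is a minimal Scala sketch of the behavior the added sentence describes; the directory path is hypothetical, and `glom` is used only to inspect partition contents.

```scala
// Read several files at once; the path is a hypothetical example.
val rdd = sc.textFile("/my/directory/*.txt")

// The order in which files contribute partitions follows the order the
// filesystem returns them, which is not guaranteed. Within each partition,
// lines keep the order they have in the underlying file.
rdd.glom().collect().foreach { partition =>
  partition.take(3).foreach(println) // first lines of each partition
}
```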
Well, I think this isn't only the case for reading. The natural order can only be preserved in certain contexts. You can still keep the natural order by setting very high values for `spark.sql.files.openCostInBytes` and `spark.sql.files.maxPartitionBytes`.
Spark does not guarantee the natural order in general. Actually, I think we should have a separate section or page to document this publicly.
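A hedged sketch of the workaround described above; the values are illustrative only, and the path is hypothetical.

```scala
// Raising these two SQL options far enough makes each input file land in a
// single partition, which tends to keep the listing order for file-based
// SQL reads. Illustrative values, not production recommendations.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128g")
spark.conf.set("spark.sql.files.openCostInBytes", "128g")

val df = spark.read.text("/my/directory/*.txt")
```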
In the RDD case you mentioned, #4204, I think the Hadoop file system uses a lexicographical order when it lists files. So, sure, it will keep the order in most cases, but that is not fully guaranteed: the internal listing order is inherited from Hadoop's handling.
This isn't specific to `textFile` either. The SQL case is different, as I described above. It might be best to have a separate page to document this.
At the moment the order is not preserved when reading from a local file system, but it is preserved when reading from HDFS. It is simple to fix the issue for the local file system (as #4204 demonstrates), but unfortunately that hasn't been merged.
Are you sure `spark.sql.files.openCostInBytes` and `spark.sql.files.maxPartitionBytes` have any influence on this? I don't see how they would influence the order in which files are read (especially because this happens entirely at the RDD level).
I can try making a separate page about this, detailing which operations of the API preserve the order.
Another approach would be to mention, for each method of RDD and SparkContext (and Dataset, SparkSession), whether it preserves the order or not. I would be interested in preservation of partitioning too; it could be documented in the same way.
Perhaps there could even be annotations on methods which preserve these aspects, which would potentially let users implement automated checks for calls to methods that do not preserve them (a rough sketch of this idea follows below).
The problem with writing up a separate page/section about this in the docs is that it is likely to go out of sync with the API.
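Purely as an illustration of the annotation idea above, here is a hypothetical Scala sketch; nothing like `PreservesOrder` exists in Spark's API, and `ExampleRDD` is a toy interface, not Spark's `RDD`.

```scala
import scala.annotation.StaticAnnotation

// Hypothetical marker annotations; these do not exist in Spark.
class PreservesOrder extends StaticAnnotation
class PreservesPartitioning extends StaticAnnotation

// A toy interface standing in for RDD, to show where the markers would go.
abstract class ExampleRDD[T] {
  @PreservesOrder
  def map[U](f: T => U): ExampleRDD[U] // per-partition order is kept

  @PreservesOrder
  def filter(p: T => Boolean): ExampleRDD[T]

  // No annotation: a shuffle destroys any previous ordering.
  def repartition(numPartitions: Int): ExampleRDD[T]
}
```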
Typically the local file system is not used in production, so it might not be a big deal at this moment.

> Are you sure `spark.sql.files.openCostInBytes` and `spark.sql.files.maxPartitionBytes` have any influence on this?

These affect the SQL case: SQL APIs such as `spark.read.csv()` also do not guarantee the natural order, and it can become nondeterministic in the middle of an operation such as a shuffle.
So the cause is different but the result is similar: a nondeterministic order.
This is why I am thinking we should rather have a separate page to comprehensively elaborate on this. It does not have to list every API, because this is more about how Spark works than about how each API works.
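To make the SQL case concrete, here is a minimal sketch (the path is hypothetical); without an explicit sort, row order is not part of the result's contract.

```scala
val df = spark.read.csv("/my/directory/*.csv")

// After a shuffle (repartition, join, aggregation, ...), any order the rows
// happened to have is lost; only an explicit orderBy gives a guarantee.
val shuffled = df.repartition(8)
shuffled.show() // row order here is nondeterministic across runs
```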
I think the sentence is OK, though it's really the partition order that isn't guaranteed, because there is no inherent ordering across files. It kind of implies that there is some natural ordering, and lexicographic order is probably what many filesystems will use, but that was never a contract from the filesystem either.
Thanks @srowen. Should I rephrase my sentence to just say that partition order is not guaranteed? I don't think I am in a position to write up an entire section about where in Spark one should expect order to be preserved, except by listing out all operations in the API, which is probably not very useful…
Something simple like "When multiple files are read, the order of the partitions depends on the order the files are returned from the filesystem. It may or may not, for example, follow the lexicographic ordering of the files by path. Within a partition, elements are ordered according to their order in an underlying file" ?
Thanks a lot, I added your suggestion with a minor tweak: "the underlying file" at the end. As I understand it, the partition determines the file at this stage, right?
One file could result in several partitions, but either construction is fine.
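A small sketch of that remark, with a hypothetical path: a single large file can be split into several partitions by requesting a minimum partition count.

```scala
val one = sc.textFile("/my/directory/one-big-file.txt", minPartitions = 4)
println(one.getNumPartitions) // typically >= 4 for a large enough file
```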
ok to test
Test build #122511 has finished for PR 28465 at commit
I am okay too.
Merged to master |
What changes were proposed in this pull request?
This changes the docs to make it clearer that order preservation is not guaranteed when saving an RDD to disk and reading it back (SPARK-5300).
I added two sentences about this in the RDD Programming Guide.
The issue was discussed on the dev mailing list:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html
Why are the changes needed?
Because RDDs are order-aware collections, it is natural to expect that if I save one with
`saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain an RDD in the same order. This is unfortunately not the case at the moment, and there is no agreed-upon way to fix this in Spark itself (see PR #4204, which attempted to fix this). Users should be aware of this.
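A minimal Scala sketch of the round trip this PR documents (paths are hypothetical):

```scala
// Save an RDD and read it back. The reloaded RDD contains the same elements,
// but not necessarily in the same order: the saved part-files may be listed
// back in any order by the filesystem.
val original = sc.parallelize(1 to 1000)
original.saveAsTextFile("/tmp/numbers")

val reloaded = sc.textFile("/tmp/numbers").map(_.toInt)
// Equal as multisets, but element order is not guaranteed to match:
assert(original.collect().sorted.sameElements(reloaded.collect().sorted))
```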
Does this PR introduce any user-facing change?
Yes, two new sentences in the documentation.
How was this patch tested?
By checking that the documentation looks good.