
[MINOR][DOCS] Mention lack of RDD order preservation after deserialization #28465

Closed

wetneb wants to merge 3 commits into apache:master from wetneb:SPARK-5300-docs
Conversation

@wetneb (Contributor) commented May 6, 2020

What changes were proposed in this pull request?

This changes the docs to make it clearer that order preservation is not guaranteed when saving an RDD to disk and reading it back (SPARK-5300).

I added two sentences about this in the RDD Programming Guide.

The issue was discussed on the dev mailing list:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html

Why are the changes needed?

Because RDDs are order-aware collections, it is natural to expect that if I use `saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain an RDD in the same order.

This is unfortunately not the case at the moment and there is no agreed upon way to fix this in Spark itself (see PR #4204 which attempted to fix this). Users should be aware of this.
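To illustrate the behavior described above, here is a minimal, Spark-free sketch in plain Python (the part-file names are hypothetical, mimicking Spark's output layout): `saveAsTextFile` writes one part file per partition, and a later read concatenates part files in whatever order the file system lists them, which is not guaranteed to match the write order.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as out_dir:
    # Simulate an RDD with 3 partitions saved to disk: one part file
    # per partition, as saveAsTextFile would produce.
    partitions = {"part-00000": ["a", "b"],
                  "part-00001": ["c", "d"],
                  "part-00002": ["e", "f"]}
    for name, lines in partitions.items():
        with open(os.path.join(out_dir, name), "w") as f:
            f.write("\n".join(lines))

    # os.listdir makes no ordering promise: reading in raw listing order
    # can reorder whole partitions relative to the original RDD.
    listing_order = [ln for name in os.listdir(out_dir)
                     for ln in open(os.path.join(out_dir, name)).read().split("\n")]

    # Sorting the file names restores the original order in this sketch,
    # but Spark itself does not do this for local-filesystem reads
    # (that is what #4204 proposed).
    sorted_order = [ln for name in sorted(os.listdir(out_dir))
                    for ln in open(os.path.join(out_dir, name)).read().split("\n")]

print(sorted_order)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

The elements of `listing_order` are always the same multiset as `sorted_order`, but their order depends entirely on how the file system happens to list the directory.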

Does this PR introduce any user-facing change?

Yes, two new sentences in the documentation.

How was this patch tested?

By checking that the documentation looks good.

* If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

Before:

* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`.

After:

* All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. When multiple files are read, the order of elements in the resulting RDD is not guaranteed, as files can be read in any order. Within a partition, element order is respected.
Member commented:
Well, I think this isn't only the case for reading. The natural order can only be preserved in certain contexts. You can still keep the natural order by setting a very high value for `spark.sql.files.openCostInBytes` and `spark.sql.files.maxPartitionBytes`.

Spark does not guarantee the natural order in general. Actually, I think we should have a separate section or page to publicly document this.
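For reference, a sketch of how the two settings mentioned in the comment above might be pinned in `spark-defaults.conf`. The values are placeholders for "very high" (Long.MaxValue), not recommended production settings, and whether this actually preserves order is exactly what is questioned later in this thread:

```
# Illustrative only: "very high" values so each file maps to its own split.
spark.sql.files.openCostInBytes     9223372036854775807
spark.sql.files.maxPartitionBytes   9223372036854775807
```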

Member commented:
In the RDD case you mentioned (#4204), I think the Hadoop file system uses lexicographic order when it lists files. So, sure, it will keep the order in most cases, but it is not fully guaranteed; the internal listing order is inherited from Hadoop's handling.

This isn't specific to `textFile` either. The SQL case is different, as I described above. It might be best to have a separate page to document this.

@wetneb (author) commented:
At the moment the order is not preserved when reading from a local file system, but it is preserved via HDFS. It is simple to fix the issue for the local file system (as #4204 demonstrates) but unfortunately that hasn't been merged.

Are you sure `spark.sql.files.openCostInBytes` and `spark.sql.files.maxPartitionBytes` have any influence on this? I don't see how they would influence the order in which files are read, especially because this happens entirely at the RDD level.

I can try making a separate page about this, detailing which operation of the API preserves the order.

@wetneb (author) commented:

Another approach would be to mention, for each method of RDD and SparkContext (and Dataset, SparkSession), whether it preserves the order or not. I would be interested in preservation of partitioning too; it could be documented in the same way.

Perhaps there could even be annotations on methods which preserve these aspects (which would potentially let users implement automated checks for calls to methods which do not preserve these things?).

The problem with writing up a separate page/section about this in the docs is that it is likely to go out of sync with the API.

@HyukjinKwon (Member) commented May 7, 2020:
Typically the local file system is not used in production, so it might not be a big deal at the moment.

> Are you sure `spark.sql.files.openCostInBytes` and `spark.sql.files.maxPartitionBytes` have any influence on this?

This affects the SQL case: SQL APIs such as `spark.read.csv()` do not guarantee the natural order either, and the order can become nondeterministic in the middle of an operation such as a shuffle. So the cause is different, but the result is similar: a nondeterministic order.

This is why I am thinking we should rather have a separate page to comprehensively elaborate on this. It might not have to list every API, because it's more specific to how Spark works than to how each API works.

Member commented:
I think the sentence is OK, though it's really partition order that isn't guaranteed, because there is no inherent ordering across files. It kind of implies that there is some natural ordering; lexicographic is probably what many file systems will use, but that wasn't a contract from the FS either.

@wetneb (author) commented:
Thanks @srowen. Should I rephrase my sentence to just say that partition order is not guaranteed? I don't think I am in a position to write up an entire section about where in Spark one should expect order to be preserved, except by listing out all operations in the API, which is probably not very useful…

Member commented:
Something simple like "When multiple files are read, the order of the partitions depends on the order the files are returned from the filesystem. It may or may not, for example, follow the lexicographic ordering of the files by path. Within a partition, elements are ordered according to their order in an underlying file" ?
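The point that lexicographic ordering is not a contract can be seen even outside Spark. A small Python sketch (hypothetical file names mimicking Spark part files) using `glob`, whose documentation explicitly says results are returned in arbitrary order, so a deterministic order must be imposed by the caller:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create files deliberately out of lexicographic order.
    for name in ["part-00002.txt", "part-00000.txt", "part-00001.txt"]:
        open(os.path.join(d, name), "w").close()

    # glob.glob returns paths in whatever order the OS listing yields;
    # the wildcard expansion itself guarantees nothing about order.
    arbitrary = [os.path.basename(p) for p in glob.glob(os.path.join(d, "*.txt"))]

    # Sorting the paths yourself is the only way to get a deterministic
    # (lexicographic) order.
    deterministic = sorted(arbitrary)

print(deterministic)  # ['part-00000.txt', 'part-00001.txt', 'part-00002.txt']
```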

@wetneb (author) commented:
Thanks a lot, I added your suggestion with a minor tweak: "the underlying file" at the end. As I understand it, the partition determines the file at this stage, right?

Member commented:
One file could result in several partitions, but either construction is fine.

@HyukjinKwon (Member) commented:
ok to test

@srowen (Member) left a comment:

I'm OK with that.

SparkQA commented May 11, 2020

Test build #122511 has finished for PR 28465 at commit da7f285.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:
I am okay too.

@srowen srowen changed the title [SPARK-5300][CORE][DOCS] Mention lack of RDD order preservation after deserialization [MINOR][DOCS] Mention lack of RDD order preservation after deserialization May 12, 2020
@srowen commented May 12, 2020:

Merged to master

@srowen srowen closed this in 59d9099 May 12, 2020
@wetneb wetneb deleted the SPARK-5300-docs branch July 4, 2020 15:20
