[MINOR][DOCS] Mention lack of RDD order preservation after deserialization #28465
wetneb wants to merge 3 commits into apache:master
Conversation
docs/rdd-programming-guide.md (outdated)

  * If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

- * All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`.
+ * All of Spark's file-based input methods, including `textFile`, support running on directories, compressed files, and wildcards as well. For example, you can use `textFile("/my/directory")`, `textFile("/my/directory/*.txt")`, and `textFile("/my/directory/*.gz")`. When multiple files are read, the order of elements in the resulting RDD is not guaranteed, as files can be read in any order. Within a partition, element order is respected.
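For illustration, here is a minimal Scala sketch of the behavior the added sentence describes; the directory path is hypothetical, and `glom` is used only to inspect partition contents.

```scala
// Read several files at once; the path is a hypothetical example.
val rdd = sc.textFile("/my/directory/*.txt")

// The order in which files contribute partitions follows the order the
// filesystem returns them, which is not guaranteed. Within each partition,
// lines keep the order they have in the underlying file.
rdd.glom().collect().foreach { partition =>
  partition.take(3).foreach(println) // first lines of each partition
}
```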
Well, I think this isn't only the case for reading. The natural order can only be preserved in certain contexts. You can still keep the natural order by setting very high values for `spark.sql.files.openCostInBytes` and `spark.sql.files.maxPartitionBytes`.
Spark does not guarantee the natural order in general. Actually, I think we should have a separate section or page to document this publicly.
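A hedged sketch of the workaround described above; the values are illustrative only, and the path is hypothetical.

```scala
// Raising these two SQL options far enough makes each input file land in a
// single partition, which tends to keep the listing order for file-based
// SQL reads. Illustrative values, not production recommendations.
spark.conf.set("spark.sql.files.maxPartitionBytes", "128g")
spark.conf.set("spark.sql.files.openCostInBytes", "128g")

val df = spark.read.text("/my/directory/*.txt")
```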
In the RDD case you mentioned, #4204, I think the Hadoop file system uses a lexicographical order when it lists files. So, sure, it will keep the order in most cases, but that is not fully guaranteed: the internal listing order is inherited from Hadoop's handling.
This isn't specific to `textFile` either. The SQL case is different, as I described above. It might be best to have a separate page to document this.
At the moment the order is not preserved when reading from a local file system, but it is preserved when reading from HDFS. It is simple to fix the issue for the local file system (as #4204 demonstrates), but unfortunately that hasn't been merged.
Are you sure `spark.sql.files.openCostInBytes` and `spark.sql.files.maxPartitionBytes` have any influence on this? I don't see how they would influence the order in which files are read (especially because this happens entirely at the RDD level).
I can try making a separate page about this, detailing which operations of the API preserve the order.
Another approach would be to mention, for each method of RDD and SparkContext (and Dataset, SparkSession), whether it preserves the order or not. I would be interested in preservation of partitioning too; it could be documented in the same way.
Perhaps there could even be annotations on methods which preserve these aspects, which would potentially let users implement automated checks for calls to methods that do not preserve them (a rough sketch of this idea follows below).
The problem with writing up a separate page/section about this in the docs is that it is likely to go out of sync with the API.
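Purely as an illustration of the annotation idea above, here is a hypothetical Scala sketch; nothing like `PreservesOrder` exists in Spark's API, and `ExampleRDD` is a toy interface, not Spark's `RDD`.

```scala
import scala.annotation.StaticAnnotation

// Hypothetical marker annotations; these do not exist in Spark.
class PreservesOrder extends StaticAnnotation
class PreservesPartitioning extends StaticAnnotation

// A toy interface standing in for RDD, to show where the markers would go.
abstract class ExampleRDD[T] {
  @PreservesOrder
  def map[U](f: T => U): ExampleRDD[U] // per-partition order is kept

  @PreservesOrder
  def filter(p: T => Boolean): ExampleRDD[T]

  // No annotation: a shuffle destroys any previous ordering.
  def repartition(numPartitions: Int): ExampleRDD[T]
}
```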
Typically the local file system is not used in production, so it might not be a big deal at this moment.

> Are you sure `spark.sql.files.openCostInBytes` and `spark.sql.files.maxPartitionBytes` have any influence on this?

These affect the SQL case: SQL APIs such as `spark.read.csv()` also do not guarantee the natural order, and it can become nondeterministic in the middle of an operation such as a shuffle.
So the cause is different but the result is similar: a nondeterministic order.
This is why I am thinking we should rather have a separate page to comprehensively elaborate on this. It does not have to list every API, because this is more about how Spark works than about how each API works.
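To make the SQL case concrete, here is a minimal sketch (the path is hypothetical); without an explicit sort, row order is not part of the result's contract.

```scala
val df = spark.read.csv("/my/directory/*.csv")

// After a shuffle (repartition, join, aggregation, ...), any order the rows
// happened to have is lost; only an explicit orderBy gives a guarantee.
val shuffled = df.repartition(8)
shuffled.show() // row order here is nondeterministic across runs
```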
I think the sentence is OK, though it's really the partition order that isn't guaranteed, because there is no inherent ordering across files. It kind of implies that there is some natural ordering, and lexicographic order is probably what many filesystems will use, but that was never a contract from the filesystem either.
Thanks @srowen. Should I rephrase my sentence to just say that partition order is not guaranteed? I don't think I am in a position to write up an entire section about where in Spark one should expect order to be preserved, except by listing out all operations in the API, which is probably not very useful…
Something simple like "When multiple files are read, the order of the partitions depends on the order the files are returned from the filesystem. It may or may not, for example, follow the lexicographic ordering of the files by path. Within a partition, elements are ordered according to their order in an underlying file" ?
Thanks a lot, I added your suggestion with a minor tweak: "the underlying file" at the end. As I understand it, the partition determines the file at this stage, right?
One file could result in several partitions, but either construction is fine.
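A small sketch of that remark, with a hypothetical path: a single large file can be split into several partitions by requesting a minimum partition count.

```scala
val one = sc.textFile("/my/directory/one-big-file.txt", minPartitions = 4)
println(one.getNumPartitions) // typically >= 4 for a large enough file
```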
ok to test
Test build #122511 has finished for PR 28465 at commit
I am okay too.
Merged to master |
What changes were proposed in this pull request?
This changes the docs to make it clearer that order preservation is not guaranteed when saving an RDD to disk and reading it back (SPARK-5300).
I added two sentences about this in the RDD Programming Guide.
The issue was discussed on the dev mailing list:
http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html
Why are the changes needed?
Because RDDs are order-aware collections, it is natural to expect that if I save one with
`saveAsTextFile` and then load the resulting file with `sparkContext.textFile`, I obtain an RDD in the same order. This is unfortunately not the case at the moment, and there is no agreed-upon way to fix this in Spark itself (see PR #4204, which attempted to fix this). Users should be aware of this.
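A minimal Scala sketch of the round trip this PR documents (paths are hypothetical):

```scala
// Save an RDD and read it back. The reloaded RDD contains the same elements,
// but not necessarily in the same order: the saved part-files may be listed
// back in any order by the filesystem.
val original = sc.parallelize(1 to 1000)
original.saveAsTextFile("/tmp/numbers")

val reloaded = sc.textFile("/tmp/numbers").map(_.toInt)
// Equal as multisets, but element order is not guaranteed to match:
assert(original.collect().sorted.sameElements(reloaded.collect().sorted))
```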
Does this PR introduce any user-facing change?
Yes, two new sentences in the documentation.
How was this patch tested?
By checking that the documentation looks good.