[SPARK-18934][SQL] Writing to dynamic partitions does not preserve sort order if spills occur #16347
Conversation
Can one of the admins verify this patch?
Thanks for submitting the ticket. In general I don't think the sortWithinPartitions property can carry over to writing out data, because one partition actually corresponds to more than one file. Can your use case be satisfied by adding an explicit sortBy?
Thanks for the comment. I was trying to implement the following Hive QL in Spark SQL/API:

```sql
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.mapred.mode=nonstrict;

insert overwrite table target_table
partition (day)
select * from source_table
distribute by day sort by id;
```

In Hive, However, if I run the same query or its equivalent Spark code –
@junegunn I ran into the same issue using partitionBy; I missed it completely during my testing. Do you have any workarounds for stock Spark 2.x? I'm at a loss: I currently can't apply the patch to the cluster, and bucketing seems mostly unworkable on S3, though I could do bucketing and then move the output files.
@chpritchard-expedia The patch here fixes the problem. I don't think it's possible to work around the issue by using the Spark API in some different way, because we can't completely avoid memory spills at the writers. Hive doesn't have the problem, so maybe you can consider running the same statement on Hive if this is not something Spark wants to address. Anyway, for anyone who's interested, I could confirm that for a sorted ORC table built with this patch, point/range lookups on the sorted column can be several times faster. Also, the final size of the table turned out to be significantly smaller in this case (60% of the improperly sorted table) due to the temporal locality in our data.
Maybe we should make DataFrameWriter.sortBy work here.
@rxin - sortBy is somewhat tied in with bucketing, which is also a little difficult to work with. First, bucketing often relies on a column being present, whereas in Hive (and with repartition), I may use a formula to split the data into appropriate buckets that are evenly distributed. Overall, it is not well supported throughout the ecosystem. Even with all of that, Spark doesn't particularly support semantics to say that a data set is already sorted. In Hive, I've had to do a lot of PARTITIONED BY (datefield, bucket) CLUSTERED BY (key) SORTED BY (key) INTO 1 BUCKETS. That gets us stable, totally sorted files, for GUIDs. In the case of Spark, this issue of partitionBy destroying sorting is a painful bug. I'm now using some very large data sets, searching for keys, and instead of returning data in a few seconds (thanks to predicate pushdown with Parquet), it has to scan the files in their entirety.
What I was suggesting was to allow sort by without bucketing.
@rxin - Oh, yes, that'd be fantastic; partitionBy.sortBy is just about all I need to survive in this crazy world. In the meantime, I think there ought to be a big warning label on partitionBy in the Spark docs.
Force-pushed the branch from bfeccd8 to 3e2bb08.
Rebased to current master. The patch is simpler thanks to the refactoring made in SPARK-18243. Anyway, I can understand your rationale for wanting an explicit API on the writer side, but then make sure that the sort specification from
Is there anyone on the Spark team taking this up? This bug is painful; it's saddled me with a hundred TB of data stacked up, and I'm really trying to avoid more manual work. "INSERT OVERWRITE TABLE ... DISTRIBUTE BY ... SORT BY" is how I live my life these days.
is this still a problem after #16898 ?
Force-pushed the branch from cd9c7f3 to 3e2bb08.
@cloud-fan Unfortunately, yes.

```scala
sc.parallelize(1 to 10000000).toDS.withColumn("part", 'value.mod(2))
  .repartition(1, 'part).sortWithinPartitions("value")
  .write.mode("overwrite").format("orc").partitionBy("part")
  .saveAsTable("test_sort_within")
```

For the above case, However, we now have a workaround: prepend the partition columns to
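The intuition behind this workaround can be illustrated without Spark. Below is a hedged, Spark-free sketch in plain Python (the row data is made up): sorting rows by the tuple (partition column, value column) leaves them clustered by partition key, so the write side's ordering requirement on the partition key is already satisfied, and the value order within each partition survives.

```python
# Spark-free sketch of the workaround: sort by (partition col, value col).
# The rows end up clustered by the partition key, so a writer that only
# requires ordering by the partition key needs no further sort, and within
# each partition the values are already in ascending order.
from itertools import groupby

rows = [(1, 30), (0, 10), (1, 5), (0, 42), (1, 7)]  # (part, value)

# Analogue of sortWithinPartitions("part", "value") on a plain list:
# tuples sort lexicographically, so part first, then value.
sorted_rows = sorted(rows)

# The required write-side ordering (by part) is a prefix of the actual one.
clustered = [p for p, _ in sorted_rows] == sorted(p for p, _ in rows)

# And within each partition, values are already in order.
within_ok = True
for _, grp in groupby(sorted_rows, key=lambda r: r[0]):
    vals = [v for _, v in grp]
    within_ok &= vals == sorted(vals)

print(clustered, within_ok)  # -> True True
```

This is only an illustration of the ordering argument, not of Spark's writer internals.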
@junegunn I think it's not a problem. Currently the
@cloud-fan It's not a problem in the context of the DataFrame API. But when it comes to Spark SQL, it makes Spark SQL incompatible with equivalent HiveQL in a subtle way. And at the least we may need to revisit the documentation that gives the wrong impression that
is this still needed after #16898 ?
See my answer above.
@junegunn Can you check the query plan of hive for
In Spark SQL, the query plan looks like
The sort happens before the insertion, which explains why Spark doesn't preserve the sort order. If in Hive the sort is included in the insertion, we should follow that. In the meanwhile, the behavior of DataFrameWriter looks reasonable (we need to fix the document of
gentle ping @junegunn on ^.
Hive makes sure that the output file is properly sorted by the column specified in
The later stage simply moves the files to the corresponding directories. Since the patch no longer merges cleanly and I think I have made my point, I'm closing this.
What changes were proposed in this pull request?
Make the dynamic partition writer perform a stable sort by the partition key, so that the sort order within a partition, specified via sortWithinPartitions or SORT BY, is preserved even when spills occur.

How was this patch tested?
Manually tested with the following code snippet and orcdump.
It was not straightforward to come up with a unit test as the problem is only reproducible if spill occurs due to memory constraint. I'd appreciate any suggestions or pointers.
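For reference, the stable-sort idea at the heart of the proposed change can be sketched without Spark. This is a plain-Python illustration with made-up row data: Python's built-in sorted() is guaranteed stable, which is exactly the property the patch asks of the writer's sort, namely that re-sorting by the partition key alone must not disturb the value order the rows already have.

```python
from itertools import groupby

# Rows arrive already ordered by value, as after sortWithinPartitions / SORT BY.
presorted = [(1, 1), (0, 2), (1, 3), (0, 4), (1, 5)]  # (part, value)

# A *stable* sort by partition key keeps the relative order of rows that tie
# on the key, so within each partition the value order survives the re-sort.
by_part = sorted(presorted, key=lambda r: r[0])

preserved = True
for _, grp in groupby(by_part, key=lambda r: r[0]):
    vals = [v for _, v in grp]
    preserved &= vals == sorted(vals)

print(preserved)  # -> True
```

With an unstable sort, rows tying on the partition key could be reordered arbitrarily, which is the failure mode this PR addresses when spills force an external sort.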