
[SPARK-30185][SQL] Implement Dataset.tail API #26809

Closed

Conversation

@HyukjinKwon (Member) commented Dec 9, 2019

What changes were proposed in this pull request?

This PR proposes a tail API.

For example:

scala> spark.range(10).head(5)
res1: Array[Long] = Array(0, 1, 2, 3, 4)
scala> spark.range(10).tail(5)
res2: Array[Long] = Array(5, 6, 7, 8, 9)

Implementation details will be similar to head, but reversed:

  1. Run a job against the last partition and collect rows. If this is enough, return as is.
  2. If this is not enough, calculate the number of additional partitions to select based on spark.sql.limit.scaleUpFactor.
  3. Run more jobs against that many more partitions (in reverse order compared to head).
  4. Go to step 2.

Note that we don't guarantee the natural order of a DataFrame in general - there are cases when it's deterministic and cases when it's not. We should probably document this caveat separately.
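
To make the loop above concrete, here is a minimal driver-side sketch. tailSketch, sc, rdd, and the default scaleUpFactor are illustrative stand-ins, not the actual code in this PR (the real plan also tails each partition locally before collecting):

    import scala.reflect.ClassTag
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def tailSketch[T: ClassTag](
        sc: SparkContext, rdd: RDD[T], n: Int, scaleUpFactor: Int = 4): Array[T] = {
      val totalParts = rdd.partitions.length
      val buf = scala.collection.mutable.ArrayBuffer.empty[T]
      var partsScanned = 0   // trailing partitions scanned so far
      var numPartsToTry = 1  // start with just the last partition
      while (buf.size < n && partsScanned < totalParts) {
        // Scan partitions from the end, mirroring head()'s front-to-back scan.
        val upper = totalParts - partsScanned
        val lower = math.max(upper - numPartsToTry, 0)
        val res = sc.runJob(rdd, (it: Iterator[T]) => it.toArray, lower until upper)
        // These partitions precede everything collected so far in natural order.
        buf.prependAll(res.flatten)
        partsScanned += upper - lower
        numPartsToTry *= scaleUpFactor  // step 2: scale up for the next round
      }
      buf.takeRight(n).toArray
    }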

Why are the changes needed?

Many other systems support taking data from the end, for instance, pandas [1] and Python [2][3]. The Scala collections API also has head and tail.

On the other hand, in Spark, we only provide a way to take data from the start
(e.g., DataFrame.head).

This has been requested multiple times in the Spark user mailing list [4], StackOverflow [5][6], JIRA [7], and third-party projects such as Koalas [8]. In addition, this missing API seems to be explicitly mentioned in comparisons with other systems [9] from time to time.

It seems we're missing a non-trivial use case in Spark, which motivated me to propose this API.

[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail
[2] https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line
[3] https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python
[4] http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html
[5] https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index
[6] https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe
[7] https://issues.apache.org/jira/browse/SPARK-26433
[8] databricks/koalas#343
[9] https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2

Does this PR introduce any user-facing change?

No (this adds a new API).

How was this patch tested?

Unit tests were added, and the change was manually tested.

override lazy val metrics = readMetrics ++ writeMetrics

protected override def doExecute(): RDD[InternalRow] = {
  val locallyLimited = child.execute().mapPartitionsInternal { iter =>
    // Locally limit each partition to its last `limit` rows: slide a window
    // of size `limit` over the iterator and keep the final window.
    val slidingIter = iter.sliding(limit)
@HyukjinKwon (Member, Author) commented on this diff, Dec 9, 2019

About this sliding Scala API: I manually tested it against some logic I wrote myself (e.g., a finite queue looped over once via while). There was no notable performance difference, so I decided to use sliding since it does what I want.
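
For reference, a minimal sketch of the queue-based alternative described above; the name lastRows is illustrative, and this is not code from the PR:

    import java.util.ArrayDeque
    import scala.collection.JavaConverters._

    // Keep at most `limit` rows in a bounded queue while looping over the
    // iterator once; what survives is the tail of the partition.
    def lastRows[T](iter: Iterator[T], limit: Int): Iterator[T] = {
      if (limit <= 0) return Iterator.empty
      val queue = new ArrayDeque[T](limit)
      while (iter.hasNext) {
        if (queue.size == limit) queue.removeFirst() // evict the oldest row
        queue.addLast(iter.next())
      }
      queue.iterator().asScala
    }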

@HyukjinKwon HyukjinKwon requested review from viirya, cloud-fan and gengliangwang and removed request for viirya and cloud-fan December 9, 2019 10:38
@SparkQA commented Dec 9, 2019

Test build #115032 has finished for PR 26809 at commit 9436dfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Tail(limitExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class CollectTailExec(limit: Int, child: SparkPlan) extends LimitExec

@srowen (Member) commented Dec 9, 2019

I can see the use case for skipping a header, but this doesn't help if you still want an RDD/DataFrame as the result, because you collect an Array. It also only really works if there is an ordering defined.

How much is this different from sorting in reverse and head()? In comparison, this looks like it has to traverse the whole data set?
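
(For reference, the alternative being compared would look like the following; an illustrative spark-shell session, where the final reverse restores ascending order:)

scala> spark.range(10).sort(org.apache.spark.sql.functions.col("id").desc).head(5).reverse
res0: Array[Long] = Array(5, 6, 7, 8, 9)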

@HeartSaVioR (Contributor) commented

I felt the same as @srowen: once a shuffle is involved, without ordering there should be no meaningful difference from head(), as we don't guarantee ordering anyway; and with ordering, the semantics would be the same as a reverse-order sort plus head().

It would be great if we could clarify the benefits over the semantically equivalent approach; otherwise this might just be syntactic sugar, though I'd be OK with that given how many requests there are in the description of this PR.

@HyukjinKwon (Member, Author) commented Dec 10, 2019

How much is this different from sorting in reverse and head()? In comparison, this looks like it has to traverse the whole data set?

At least it can drop the records at the executor side, and it won't require a sort; so we can do this as an almost map-only operation.

once a shuffle is involved, without ordering there should be no meaningful difference from head(), as we don't guarantee ordering anyway; and with ordering, the semantics would be the same as a reverse-order sort plus head().

Yes, I think this is a good point. With ordering, it can just be a different way of doing the same thing. Without ordering, it's designed to follow the natural order, which is not guaranteed in many cases in Spark.

One clear use case might be reading from an external datasource. If I am not wrong, when we use a Hadoop RDD (which most external datasources use), it respects the natural order, so the spark.read.format("xml").load().tail(5) case will work.
Another case is a local collection, where, if I am not wrong, the natural order is preserved.
I am sure there are more such cases which I should identify.

FWIW, Spark used to (unofficially) respect the natural order, but IIRC it broke after we started consolidating small partitions into a big partition. This can be configured so that our file-based sources keep the natural order too, if I am not wrong; of course, I don't think that's officially supported.
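
(For illustration, the local-collection case mentioned above would look like this with the API proposed in this PR:)

scala> import spark.implicits._
scala> Seq("a", "b", "c", "d", "e").toDS().tail(2)
res0: Array[String] = Array(d, e)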

@HyukjinKwon (Member, Author) commented
Oh, BTW, we also have last and first in functions, which I think are friends of head and tail.
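
(Note that last and first are aggregate functions; without an explicit ordering their results are non-deterministic, unlike head/tail which follow the natural order. An illustrative use:)

scala> import org.apache.spark.sql.functions.{first, last}
scala> spark.range(10).agg(first("id"), last("id")).collect()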

@cloud-fan (Contributor) commented
I think it makes sense to support a top-level tail that doesn't need a shuffle, like limit (see CollectLimitExec.executeCollect). But I agree with @HeartSaVioR that once this is in the middle of a plan and a shuffle is needed, tail and limit have no difference.

@SparkQA commented Dec 12, 2019

Test build #115230 has finished for PR 26809 at commit 9b58d4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author) commented
Just for clarification: the take(..) code path isn't affected at all.

@SparkQA commented Dec 24, 2019

Test build #115676 has finished for PR 26809 at commit 6d40611.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author) commented
retest this please

@SparkQA commented Dec 24, 2019

Test build #115678 has finished for PR 26809 at commit 8ad431d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 24, 2019

Test build #115696 has finished for PR 26809 at commit 8ad431d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR (Contributor) commented
retest this, please

@SparkQA commented Dec 24, 2019

Test build #115712 has finished for PR 26809 at commit 8ad431d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 26, 2019

Test build #115783 has finished for PR 26809 at commit 40d0740.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 27, 2019

Test build #115868 has finished for PR 26809 at commit 4999808.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Tail(limitExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class CollectTailExec(limit: Int, child: SparkPlan) extends LimitExec

@HyukjinKwon (Member, Author) commented
retest this please

@SparkQA commented Dec 30, 2019

Test build #115930 has finished for PR 26809 at commit 4999808.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Tail(limitExpr: Expression, child: LogicalPlan) extends OrderPreservingUnaryNode
  • case class CollectTailExec(limit: Int, child: SparkPlan) extends LimitExec

@SparkQA commented Dec 30, 2019

Test build #115932 has finished for PR 26809 at commit 738d3e1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author) commented
retest this please

@SparkQA commented Dec 30, 2019

Test build #115943 has finished for PR 26809 at commit 738d3e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member, Author) commented
Merged to master.

dongjoon-hyun pushed a commit that referenced this pull request Jan 18, 2020
### What changes were proposed in this pull request?

#26809 added the `Dataset.tail` API. It would be good to have it in the PySpark API as well.

### Why are the changes needed?

To support consistent APIs.

### Does this PR introduce any user-facing change?

No. It adds a new API.

### How was this patch tested?

Manually tested, and a doctest was added.

Closes #27251 from HyukjinKwon/SPARK-30539.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@HyukjinKwon HyukjinKwon deleted the wip-tail branch March 3, 2020 01:17