[SPARK-30185][SQL] Implement Dataset.tail API #26809
Conversation
```scala
override lazy val metrics = readMetrics ++ writeMetrics

protected override def doExecute(): RDD[InternalRow] = {
  val locallyLimited = child.execute().mapPartitionsInternal { iter =>
    val slidingIter = iter.sliding(limit)
    // ...
```
About this `sliding` Scala API: I manually tested it against some logic I wrote myself (e.g., keeping a finite queue and looping once via `while`). There was no notable performance difference, so I decided to use `sliding` since it does what I want.
Test build #115032 has finished for PR 26809 at commit
I can see the use case for skipping a header, but this doesn't help if you still want an RDD/DataFrame with the result, because you collect an Array. It also only really works if there is an ordering defined. How much is this different from sorting in reverse and head()? In comparison, this looks like it has to traverse the whole data set.
I felt the same as @srowen; once a shuffle is involved, without ordering there should be no outstanding difference from head(), as we don't guarantee ordering anyway, and with ordering the semantics would be the same as a sort in reverse order + head(). It would be great if we could clarify the benefits over that equivalent formulation; otherwise it might just be syntactic sugar, though I'd be OK with it even so, given there are so many requests in the PR description.
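To make the equivalence argument above concrete, here is a small plain-Scala sketch (no Spark; the data is made up for illustration): once an ordering is defined, taking the tail equals sorting in reverse, taking the head, and reversing the result back.

```scala
// With an ordering defined, tail(n) == reverse-sort + head(n), reversed back.
val data = Seq(3, 1, 4, 1, 5, 9, 2, 6)

val tailOfSorted  = data.sorted.takeRight(3)                           // "tail" under the ordering
val viaReverseHead = data.sorted(Ordering[Int].reverse).take(3).reverse // reverse-sort + head

// Both yield the three largest values in ascending order.
```

The open question in the thread is whether tail() buys anything beyond this when no ordering is defined.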
At least it can drop the records on the executor side, and it won't require a sort, so we can do this as an almost map-only op.
Yes, I think this is a good point. With ordering, it can just be a different way of doing the same thing. Without ordering, it's designed to follow the natural order, which is not guaranteed in many cases in Spark. One clear use case might be reading from an external datasource. If I'm not mistaken, when we use Hadoop RDD (which most external datasources use), it respects the natural order. So, FWIW, Spark used to (unofficially) respect the natural order, but that broke after we started to consolidate small partitions into a big partition, IIRC. This can be configured to keep the natural order for our file-based sources too, if I'm not mistaken; of course, I don't think it's officially supported.
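The executor-side point above can be sketched with local collections (this is a hedged illustration, not the PR's implementation; the partition layout is made up): each "executor" keeps at most the last `limit` rows of its partition, and the "driver" then takes the tail of the concatenated partial results, with no sort or shuffle needed.

```scala
// Map-only tail sketch: per-partition local tail, then a driver-side tail.
def simulatedTail(partitions: Seq[Seq[Int]], limit: Int): Seq[Int] = {
  // "Executor side": each partition drops everything but its last `limit` rows.
  val locallyLimited = partitions.map(_.takeRight(limit))
  // "Driver side": concatenate in partition order and keep the global tail.
  locallyLimited.flatten.takeRight(limit)
}
```

Only up to `limit` rows per partition ever leave the executors, which is the cost advantage over a full sort.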
Oh, BTW, we also have
I think it makes sense to support a top-level tail which doesn't need a shuffle, like limit (see
Force-pushed from 9436dfb to 9b58d4c.
Test build #115230 has finished for PR 26809 at commit
Just for clarification,
Force-pushed from 9b58d4c to 6d40611.
Test build #115676 has finished for PR 26809 at commit
retest this please
Test build #115678 has finished for PR 26809 at commit
Test build #115696 has finished for PR 26809 at commit
retest this, please
Test build #115712 has finished for PR 26809 at commit
Test build #115783 has finished for PR 26809 at commit
Force-pushed from 40d0740 to 4999808.
Test build #115868 has finished for PR 26809 at commit
retest this please
Test build #115930 has finished for PR 26809 at commit
Test build #115932 has finished for PR 26809 at commit
retest this please
Test build #115943 has finished for PR 26809 at commit
Merged to master.
### What changes were proposed in this pull request?
#26809 added `Dataset.tail` API. It should be good to have it in PySpark API as well.
### Why are the changes needed?
To support consistent APIs.
### Does this PR introduce any user-facing change?
No. It adds a new API.
### How was this patch tested?
Manually tested and doctest was added.
Closes #27251 from HyukjinKwon/SPARK-30539.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
This PR proposes a `tail` API. Implementation details will be similar to `head`, but reversed; for instance, it respects `spark.sql.limit.scaleUpFactor` as well.
Note that we don't guarantee the natural order of a DataFrame in general; there are cases when it's deterministic and cases when it's not. We should probably document this caveat separately.
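On a local collection, the intended semantics can be sketched as the mirror of `head` (a hedged illustration with made-up data, not the Spark implementation):

```scala
// tail(n) returns the last n rows, mirroring head(n)/take(n) from the front,
// assuming the order of the data is deterministic.
val rows = (0 until 10).toSeq

val headN = rows.take(3)       // first three rows
val tailN = rows.takeRight(3)  // last three rows
```

In Spark itself the call would look like `df.tail(3)` on a `Dataset`, subject to the ordering caveat above.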
Why are the changes needed?
Many other systems support a way to take data from the end, for instance, pandas[1] and
Python[2][3]. Scala collections APIs also have `head` and `tail`.
On the other hand, in Spark, we only provide a way to take data from the start
(e.g., DataFrame.head).
This has been requested multiple times here and there: in the Spark user mailing list[4], on Stack Overflow[5][6], in JIRA[7], and in other third-party projects such as
Koalas[8]. In addition, this missing API seems to be explicitly mentioned in comparisons to other systems[9] from time to time.
It seems we're missing a non-trivial use case in Spark, and this motivated me to propose this API.
[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html?highlight=tail#pandas.DataFrame.tail
[2] https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line
[3] https://stackoverflow.com/questions/646644/how-to-get-last-items-of-a-list-in-python
[4] http://apache-spark-user-list.1001560.n3.nabble.com/RDD-tail-td4217.html
[5] https://stackoverflow.com/questions/39544796/how-to-select-last-row-and-also-how-to-access-pyspark-dataframe-by-index
[6] https://stackoverflow.com/questions/45406762/how-to-get-the-last-row-from-dataframe
[7] https://issues.apache.org/jira/browse/SPARK-26433
[8] databricks/koalas#343
[9] https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2
Does this PR introduce any user-facing change?
No (it adds a new API).
How was this patch tested?
Unit tests were added and the change was manually tested.