[SPARK-31903][SQL][PYSPARK][2.4] Fix toPandas with Arrow enabled to show metrics in Query UI. #28740

@ueshin ueshin commented Jun 5, 2020

What changes were proposed in this pull request?

This is a backport of #28730.

In Dataset.collectAsArrowToPython, the code block passed to serveToStream runs in a separate thread, so withAction finishes as soon as it starts that thread. As a result, it doesn't collect the metrics of the actual action, and the Query UI shows the plan graph without metrics.

We should call serveToStream first, then call withAction inside it.

The affected function is:

  • DataFrame.toPandas() in PySpark
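
The ordering problem described above can be illustrated with a small, Spark-free sketch. The names `with_action`, `serve_to_stream`, and `collect` below are hypothetical Python stand-ins, not the actual Scala implementation: `with_action` only times its body, and `serve_to_stream` only runs a callable on a new thread. Under those assumptions, wrapping the thread start records almost no time, while wrapping the work inside the thread records the whole action.

```python
import threading
import time


def with_action(name, metrics, body):
    """Hypothetical stand-in for withAction: records how long `body` takes."""
    start = time.perf_counter()
    result = body()
    metrics[name] = time.perf_counter() - start
    return result


def serve_to_stream(write):
    """Hypothetical stand-in for serveToStream: runs `write` on a separate thread."""
    thread = threading.Thread(target=write)
    thread.start()
    return thread


def collect():
    """Pretend action that takes a measurable amount of time."""
    time.sleep(0.2)


metrics = {}

# Before the fix: the tracked body only starts the thread, so tracking ends
# almost immediately and misses the real work.
thread = with_action("before", metrics, lambda: serve_to_stream(collect))
thread.join()

# After the fix: the thread body itself is tracked, so the whole action is measured.
thread = serve_to_stream(lambda: with_action("after", metrics, collect))
thread.join()

print(metrics)
```

The "before" entry should be close to zero, while the "after" entry should be roughly the duration of the simulated action, which mirrors the missing and incorrect values seen in the Query UI.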

Why are the changes needed?

When toPandas is called, the Query UI usually shows each plan node's metrics:

>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def

[Screenshot: Screen Shot 2020-06-05 at 10 58 30 AM]

But if Arrow execution is enabled, it shows only the plan nodes, and the duration is incorrect:

>>> spark.conf.set('spark.sql.execution.arrow.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def

[Screenshot: Screen Shot 2020-06-05 at 10 58 42 AM]

Does this PR introduce any user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

How was this patch tested?

I checked it manually in my local environment.

[Screenshot: Screen Shot 2020-06-05 at 11 29 48 AM]

@gengliangwang gengliangwang left a comment

Thanks for the backport!

SparkQA commented Jun 5, 2020

Test build #123579 has finished for PR 28740 at commit 753a3b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

Merged to branch-2.4.

HyukjinKwon pushed a commit that referenced this pull request Jun 6, 2020
…how metrics in Query UI

Closes #28740 from ueshin/issues/SPARK-31903/2.4/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon closed this Jun 6, 2020
@HyukjinKwon

cc @holdenk FYI
