[SPARK-2017] [SPARK-2016] Web UI responsiveness with large numbers of tasks/partitions #1682

Closed
carlosfuertes wants to merge 25 commits


carlosfuertes

This PR addresses SPARK-2017 and SPARK-2016 by serving the data for the tables in the Spark web UI as JSON, and using javascript in the browser to build the tables from an ajax request for that JSON.

The main addition exposes the JSON data at the following paths:

/stages/stage/tasks/json/?id=nnn&attempt=mmm
/storage/json
/storage/rdd/workers/json?id=nnn
/storage/rdd/blocks/json?id=nnn
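
As an illustration of how a client might address these paths, here is a minimal Python sketch that only builds the URLs. The base address `http://localhost:4040` and the helper name `ui_json_url` are assumptions made for the example; the paths and query parameters are the ones listed above.

```python
from urllib.parse import urlencode

# Assumed base address of a locally running Spark UI; adjust as needed.
BASE_URL = "http://localhost:4040"


def ui_json_url(path, **params):
    """Build a URL for one of the JSON paths proposed in this PR (hypothetical helper)."""
    query = urlencode(params)
    return f"{BASE_URL}{path}?{query}" if query else f"{BASE_URL}{path}"


# Example values for id/attempt are placeholders.
tasks_url = ui_json_url("/stages/stage/tasks/json/", id=3, attempt=0)
storage_url = ui_json_url("/storage/json")
blocks_url = ui_json_url("/storage/rdd/blocks/json", id=0)
```

A client or scraper could then fetch any of these URLs with an ordinary HTTP GET and parse the body as JSON.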

I also add a new configuration property "spark.ui.jsRenderingEnabled" (true by default) that controls whether js is used to do the rendering, for backward compatibility with people who cannot use js.

@AmplabJenkins

Can one of the admins verify this patch?

@carlosfuertes
Author

I identified the bottleneck in rendering the page that contains the data tables: it was related to the css styling used (bootstrap with some nth-child(odd) rules). I added a custom style instead, and now all rendering is much faster. See JIRA SPARK-2017 for more details.

@carlosfuertes
Author

I added a configuration property "spark.ui.jsRenderingEnabled" that controls whether the tables are rendered using Javascript or not. It is enabled by default. This ensures that people who cannot or do not want to run Javascript for the rendering can still use the web UI as before.

@JoshRosen
Contributor

It seems like having two different table rendering techniques, server-side HTML and client-side Javascript, could become a maintenance / complexity burden. Do you think it's important for the UI to work without Javascript? It could be important for tools that scrape the web UI, but it would be better if those tools consumed JSON data instead.

Personally, I'd be a fan of delivering a stable JSON API and rewriting the web UI as a Javascript application that consumes data from those endpoints, but I'm open to other opinions.

/cc @pwendell @mateiz @rxin

@JoshRosen
Contributor

Also, do you mind editing the title of this PR so that it's tracked correctly by our review tools?

Something like [SPARK-2017] [SPARK-2016] Web UI responsiveness with large numbers of tasks/partitions would be great.

@SparkQA

SparkQA commented Sep 5, 2014

Can one of the admins verify this patch?

@ash211
Contributor

ash211 commented Sep 6, 2014

FWIW I run Spark in AWS and my ops team requires that all web interfaces exposed through the proxy into the enclave have both SSL and user auth. Spark doesn't support those, so for now I'm making heavy use of the webui via the links command line browser.

All that to say, yes there are users accessing the interface via no-Javascript browsers. I can understand though if you make the web interface use JS and expose a REST API since that's a more generally attractive setup for checking status on a cluster.

@JoshRosen
Contributor

@ash211 This weekend, I'm actually working on writing a design document for web UI improvements in Spark 1.2. SSL encryption, authentication, and ACLs are all features that I'm planning to put on the roadmap.

Do you have SSH access to your EC2 machines? One option is to use an SSH proxy to view the full web UI in your browser. Once you've set up the proxy, you can use a browser plugin like FoxyProxy to seamlessly proxy requests for the UI.

@carlosfuertes changed the title from "Spark 2017" to "[SPARK-2017] [SPARK-2016] Web UI responsiveness with large numbers of tasks/partitions" on Sep 7, 2014
@carlosfuertes
Author

Hi,

I have updated the title of the pull request and made sure it is mergeable
after the latest master updates. Lastly, I have dropped the custom
table css since, as I explained in
https://issues.apache.org/jira/browse/SPARK-2017, it is not really the
bottleneck, and dropping it may simplify things at first (that's a later tweak).

The reason I added the configuration property "spark.ui.jsRenderingEnabled" and
retained server-side html rendering in this PR is to ensure that folks who rely on
not having to use javascript can still operate. They would just need to launch
spark with "spark.ui.jsRenderingEnabled" set to false.

Later, anybody who relies on not using javascript should access the info
using the JSON interface. But until people are using the JSON interface, we
should still keep the current minimal html form.

Right now I see this Pull Request as a working proof of concept of what the
JSON interface and javascript can look like. There are still some points to
discuss and agree on:

  1. What is the JSON format that we want?
    The current JSON is very verbose, and therefore inefficient: it sends
    every row as key/value pairs, so the field names are repeated
    unnecessarily on every row.

[ {
"Index" : 0,
"ID" : 0,
"Attempt" : {
"value" : "0",
"sorttable_customkey" : "0"
},
"Status" : "SUCCESS",
"Locality Level" : "PROCESS_LOCAL",
"Executor" : "localhost",
"Launch Time" : "2014/09/07 15:25:24",
"Duration" : {
"value" : "0.8 s",
"sorttable_customkey" : "780"
},
"GC Time" : "",
"Errors" : ""
}, ...

We could use a much more compact format in which the first line holds the
names of the fields and every following line is just an array of the values
of the fields (no repetition of keys).

Also we need to include tests for the JSON interface.

  2. If we want to handle and render really big tables, I think we should
    include pagination and update the web UI to use it.

In the JSON interface, we should include a parameter that tells the server
how many rows to return, for example
"/storage/rdd/blocks/json/?id=0&nrows=1000". If you want to get
everything, say so explicitly, for example
"/storage/rdd/blocks/json/?id=0&nrows=all"

  3. The current method of sorting the output in the browser, using the
    "sorttable" js package, does not work for large tables. It is too slow.

When we request the data, the server should do the sorting.

That is, the JSON api should accept a parameter telling the server
which column to sort by, for example
"/storage/rdd/blocks/json/?id=0&sortby=0"
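
To make points 2 and 3 concrete, here is a rough Python sketch of how a server could apply the proposed "nrows" and "sortby" parameters to the rows before serializing them. The function names `sort_key` and `select_rows` are hypothetical, the sample rows are made up, and the sketch assumes that "sortby" is a column index and that "sorttable_customkey" always holds a numeric string, as in the sample output above. Columns that mix types (for example empty strings and numbers) would need extra handling.

```python
def sort_key(cell):
    """Use "sorttable_customkey" when a cell carries one, else the cell itself.

    Assumption: the custom key is always a numeric string such as "780".
    """
    if isinstance(cell, dict):
        return float(cell["sorttable_customkey"])
    return cell


def select_rows(rows, columns, sortby=None, nrows="all"):
    """Apply the proposed sortby/nrows query parameters on the server side.

    rows    : list of dicts, shaped like the current JSON output
    columns : ordered column names, so that sortby can be a column index
    sortby  : column index as a string (e.g. "0"), or None for no sorting
    nrows   : "all", or a row count as a string (e.g. "1000")
    """
    if sortby is not None:
        name = columns[int(sortby)]
        rows = sorted(rows, key=lambda row: sort_key(row[name]))
    if nrows != "all":
        rows = rows[: int(nrows)]
    return rows


# Made-up sample data in the shape of the task table.
columns = ["Index", "Duration"]
rows = [
    {"Index": 0, "Duration": {"value": "0.8 s", "sorttable_customkey": "780"}},
    {"Index": 1, "Duration": {"value": "95 ms", "sorttable_customkey": "95"}},
    {"Index": 2, "Duration": {"value": "2 s", "sorttable_customkey": "2000"}},
]

# Equivalent of "...?sortby=1&nrows=2": the two shortest tasks.
fastest_two = select_rows(rows, columns, sortby="1", nrows="2")
```

Sorting before truncating means a request for the first N rows always returns the top N rows in the requested order, which is what a paginated UI would need.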

Let me know what you think about the points above.


@carlosfuertes
Author

@JoshRosen You mentioned above that you are working on a design document for web UI improvements. Are you going to post it anywhere? I would like to take all the above and help make it happen.

@JoshRosen
Contributor

@carlosfuertes I'm working on a draft this weekend; I'll make a post to the developers mailing list sometime next week once I've gotten the basics fleshed out.

Is the duplication of the keys a significant problem in practice? If we configure the API's HTTP server to use gzip compression, I think the API responses still shouldn't be too big. If we do choose to remove this redundancy, would it make more sense to return the data in CSV or TSV format?

You've raised some good questions about sorting, pagination, etc. I think we should address these in a consistent way throughout the API. Are there some existing JSON APIs that you think made good decisions for these features?

@carlosfuertes
Author

@JoshRosen I can imagine that avoiding duplication of keys can easily save 50% or more. If we are aiming at big data sizes, that can matter a lot. In the JIRA post I explain how I was already seeing 15 MB data sizes with just 50,000 jobs. Gzip can make sense, but if you start with something that is already 50% smaller, that's even better.

In any case, this is very simple to test and benchmark (for example with this PR), and at the end of the day it is an optimization.

To be very concrete, something like

[ {"key1": row1_value1, "key2": row1_value2},
{"key1": row2_value1, "key2": row2_value2},
{"key1": row3_value1, "key2": row3_value2} ]

versus using

{ "meta": { "keys": [ "key1", "key2" ] },
"data": [ [row1_value1, row1_value2],
[row2_value1, row2_value2],
[row3_value1, row3_value2] ] }
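
As a way to test and compare the two layouts above, here is a small Python sketch that converts between them and measures the encoded size with and without gzip. The helper names (`to_compact`, `from_compact`, `encoded_size`) and the synthetic rows are assumptions made for illustration; real savings depend on the actual data, so the numbers it prints should not be read as a measurement of this PR.

```python
import gzip
import json


def to_compact(rows):
    """Turn a list of same-keyed dicts into {"meta": {"keys": [...]}, "data": [[...], ...]}."""
    keys = list(rows[0]) if rows else []
    return {"meta": {"keys": keys}, "data": [[row[k] for k in keys] for row in rows]}


def from_compact(payload):
    """Rebuild the list-of-dicts form, as a client might do before rendering."""
    keys = payload["meta"]["keys"]
    return [dict(zip(keys, values)) for values in payload["data"]]


def encoded_size(obj, compress=False):
    """Size in bytes of the JSON encoding, optionally gzip-compressed."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return len(gzip.compress(raw)) if compress else len(raw)


# Synthetic rows shaped loosely like the task table; the values are made up.
rows = [
    {"Index": i, "ID": i, "Status": "SUCCESS", "Locality Level": "PROCESS_LOCAL",
     "Executor": "localhost", "GC Time": "", "Errors": ""}
    for i in range(1000)
]
compact = to_compact(rows)

# Print raw and gzip-compressed sizes for each layout.
for label, obj in (("verbose", rows), ("compact", compact)):
    print(label, encoded_size(obj), encoded_size(obj, compress=True))
```

Running the same comparison on real task data would show whether dropping the repeated keys still matters once gzip is enabled.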

I would stick to using JSON and not csv or tsv, since JSON interacts extremely well with javascript (it was designed to) and with pretty much anything else, and you have the flexibility to add other meta information and parse it.

I have seen some HTTP APIs that incorporate pagination and so forth (ex. steam web page) but I do not have a particular one in mind that is freely available in the wild... I'm thinking that talking about a "JSON API" is a bit confusing and it would be better to refer to the HTTP API (which returns JSON for some calls). I think it may make sense in the doc that you are working on to delineate the full (RESTful) HTTP API for the WebUI rather than just the JSON part, so that the global picture and design are clear.

@JoshRosen
Contributor

I've opened SPARK-3644 as a forum for discussing the design of a REST API; sorry for the delay (got busy with other work / bug fixing).
