[SPARK-34659][UI] Forbid using keyword "proxy" or "history" in reverse proxy URL #36176
Conversation
Diff under review:

    val proxyUrl = _conf.get(UI_REVERSE_PROXY_URL.key, "").stripSuffix("/") +
      "/proxy/" + _applicationId
    System.setProperty("spark.ui.proxyBase", proxyUrl)

changed to:

    val proxyUrl = _conf.get(UI_REVERSE_PROXY_URL).getOrElse("").stripSuffix("/")
We need to get the ConfigEntry to trigger the checkValue method.
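The distinction the reviewer is pointing at can be sketched with a simplified stand-in for Spark's config machinery (the class and method names below are illustrative, not Spark's actual internals):

```scala
// Simplified stand-in illustrating why reading a config through its entry
// (conf.get(entry)) runs the checkValue validation, while reading by raw
// key string (conf.get(entry.key, default)) bypasses it.
object ConfigEntryDemo {
  final case class ConfigEntry(key: String, check: String => Unit)

  class SparkConfLike(settings: Map[String, String]) {
    // Raw-key lookup: no validation is applied.
    def get(key: String, default: String): String =
      settings.getOrElse(key, default)

    // Entry-based lookup: validation runs on the stored value.
    def get(entry: ConfigEntry): Option[String] = {
      val v = settings.get(entry.key)
      v.foreach(entry.check)
      v
    }
  }

  // Hypothetical entry mirroring this PR's intent: reject reserved segments.
  val reverseProxyUrl: ConfigEntry = ConfigEntry(
    "spark.ui.reverseProxyUrl",
    url => require(
      !url.split("/").exists(p => p == "proxy" || p == "history"),
      s"Cannot use the keyword 'proxy' or 'history' in reverse proxy URL: $url"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConfLike(Map("spark.ui.reverseProxyUrl" -> "/test/proxy/prefix"))
    // Bypasses validation entirely:
    println(conf.get("spark.ui.reverseProxyUrl", ""))
    // Triggers checkValue and throws IllegalArgumentException:
    val failed =
      try { conf.get(reverseProxyUrl); false }
      catch { case _: IllegalArgumentException => true }
    assert(failed)
  }
}
```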
+1, LGTM. Thank you, @gengliangwang and @cloud-fan .
The commit passed all tests here in @gengliangwang's repo.
Merged to master for Apache Spark 3.4.0.
@dongjoon-hyun @cloud-fan Wait a second! This is exactly the wrong solution, and it means that Spark can't be proxied via Jupyter. Why was the original solution unceremoniously thrown out after 4 months of waiting? We could have at least discussed it further before deciding to do that.
@dongjoon-hyun It's extremely disappointing to be desperately waiting for a feature, get no response for months, have to plead with people to even take a look at it, and then have someone swoop in at the last moment and undo all the work without even a discussion about it. Since this has been moved from 3.3 (the original ask) to 3.4, surely there's enough time to at least speak with people before doing this.
Yes, we can add it back properly as you said. This is for Apache Spark 3.4. BTW, are you sure about the following? Are you using JupyterLab or the classic Notebook?
Or, alternately, if you have specific things to change in #31774, can you let me know what they are so they can be fixed and merged? I'm more than willing to do it if that's what it takes to get it over the line, but I need more guidance than "tricky" to be able to do that.
--
It's dark in this basement.
@PerilousApricot Note that #31774 doesn't solve all the problems. If you read #36174, you will find that we need to handle the proxy/history keyword in the method.
I also agree with @gengliangwang. From my side, I also asked the same question (like @gengliangwang) to the JupyterLab folks, but it seems that the limitation doesn't come from their code. I mean, JupyterLab doesn't enforce
I think the first confusion is that it's *JupyterHub*, please see #31774 (comment). I was asked to make a reproducer that didn't involve Jupyter, which I did. Do I understand correctly that you now would like me to make a second reproducer to show how JupyterHub works?
IIUC the JupyterLab URL has to be ".../proxy/ui_port"? If yes, can the proxy URL of JupyterLab be configured? Is it possible to adjust JupyterLab to make it work with Spark instead?
No, it's not configurable; if it were, we wouldn't be here :) This whole saga is two people pointing at each other saying the other one should fix it, and the users are stuck in the middle. If you will point me to your reservations then I'll fix the PR, but please at least give me a chance.
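For context, a small illustration of the collision being discussed (the path shape below is how jupyter-server-proxy typically exposes a proxied port under JupyterHub; the user name and port are made-up values, not from this thread):

```scala
// Illustrative sketch: a Spark UI reached through jupyter-server-proxy
// inherits a reverse-proxy prefix that itself contains the path segment
// "proxy", which collides with Spark's own reserved /proxy/<appId> route.
object JupyterPrefixDemo {
  // Hypothetical path jupyter-server-proxy would use for a UI on the given port.
  def proxiedPrefix(user: String, uiPort: Int): String =
    s"/user/$user/proxy/$uiPort"

  def main(args: Array[String]): Unit = {
    val prefix = proxiedPrefix("alice", 4040)
    // The prefix contains the reserved segment "proxy",
    // so a keyword check on the reverse proxy URL would reject it.
    assert(prefix.split("/").contains("proxy"))
    println(prefix)
  }
}
```

This is why the prefix is not configurable away on the Jupyter side: the `/proxy/<port>` segment is the routing mechanism itself.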
@gengliangwang I see #36174, but there was only about half an hour between when you wrote that and when the next commit turning off the functionality completely happened, so I didn't quite have a chance to update the PR in the interim (I was also asleep). I'll fix the things you mention there and submit the correct fix for you to review.
@PerilousApricot My major interest is in Spark SQL. I only work on the UI in my spare time, after finishing my internal work and open-source Spark SQL work. Besides, the Spark UI is working fine in standalone mode. On the dev list thread of the Spark 3.3 release, SPARK-34659 is mentioned, so I created this PR to prevent confusing behaviors. Code changes like #36174 are risky (or we can say they don't resolve all the issues), considering we are going to release 3.3. Feel free to take over #36174, and please make sure you verify it works with the reverse proxy on a standalone cluster containing master/worker. And you can ping @jasonli-db for a review, who is focusing on Spark UI at Databricks.
@gengliangwang I'm not sure I understand -- are you saying that someone else should've reviewed the PR in the first place? The intent (and what I was trying to do since December) was to get this into the 3.3 release. I don't use Databricks, so is @jasonli-db still the person with the correct knowledge, or should I speak to someone else instead? Looking at their GH profile, I don't see any activity on the Spark repo. I honestly don't know at all how Spark's development model works, but I'm obviously doing something wrong here. Please let me know what I can do to satisfy your concerns, since it appears that at least for now you've taken ownership of this.
With my Apache hat on, I'd say there is no component owner in this community. A PR can be merged as long as there is one committer who supports it and no objections from others. In your case, it seems that there is not much interest from committers because the fix is so complicated. As a Spark committer, I feel the same as @gengliangwang and @dongjoon-hyun. The fact is that "proxy" and "history" are already reserved keywords in the Spark UI. This PR just fails earlier and provides a better error message. We can still put more effort into making them not reserved keywords, but I'm not very confident about it, and I'd like to see other solutions, like making JupyterLab use a word other than "proxy" in the Spark UI.
@cloud-fan With your two hats on, would you at least concede that it would make sense to see the proposed solution before it's called "too complicated"? It seems a bit like putting the cart before the horse to reject something that hasn't been written.
I did take a look at #36174, and I see code like (snippet omitted).

To clarify, merging this PR does not mean rejecting the fixing PR. As I said before, this PR actually just improves the error message. You are welcome to continue the fix and find a committer to review it. Unfortunately, I'm not a Spark UI expert and I'm not confident enough to merge the fix. If you can explain the entire Spark UI stack and how your fix works, I believe it can make it easier for other committers to review the PR and give them more confidence to merge it.
Of course, the PR needs to be updated anyway for the cases that @gengliangwang found; I was speaking not just to your comment, but to the others as well. To phrase it differently: "I'm going to update the PR, let's judge the complexity based on that." Your critique about indexOf is well taken, and if that's the concern from @gengliangwang and @dongjoon-hyun, then I'm glad to factor the fix differently. I've repeatedly tried to emphasize that I'm willing and able to change the code to fit your criteria, but, importantly, it's helpful to know what those criteria are. Now that I have your note about indexOf and @gengliangwang's comment in

With respect to...

...then I misunderstood the situation, but to me, what I saw, a new PR disabling the functionality without commenting on the previous PR or giving an opportunity to fix the issues beforehand, seemed like the door was being slammed on the previous PR. I'll update #31774 with your comments and get back to you. Thanks!
Yes, it's been an issue for years now, see jupyterhub/jupyter-server-proxy#57. It affects being able to implement functionality like more-functional Jupyter extensions that can query/display the status of jobs inline, and especially since Dask has a very nice JupyterLab extension, it would be good to get this blocker out of the way so that the equivalent can be developed for Spark.

Can we at least have the initial #31774 PR for 3.3 so *something* works while we wait for a real solution after all this time? Then we can code the total fix against the 3.4 branch.
What changes were proposed in this pull request?
When the reverse proxy URL contains "proxy" or "history", the application ID in the UI is wrongly parsed.
For example, if we set spark.ui.reverseProxyUrl to "/test/proxy/prefix" or "/test/history/prefix", the application ID is parsed as "prefix" and the related API calls fail in the stages/executors pages (the original description showed the failing request versus the expected one as screenshots, omitted here).
There is more context in #31774.
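The mis-parsing described above can be shown with a small standalone sketch (the URL and application ID below are made-up illustrative values; the real extraction logic lives in Spark's UI code, which this does not reproduce):

```scala
// Standalone illustration of why a reverse proxy URL containing the keyword
// "proxy" breaks application-ID extraction: a parser that grabs the path
// segment after the FIRST "proxy/" occurrence picks up the user's prefix
// segment instead of the real application ID.
object ProxyParseDemo {
  def appIdAfterFirstProxy(url: String): String = {
    val idx = url.indexOf("proxy/")
    url.substring(idx + "proxy/".length).takeWhile(_ != '/')
  }

  def main(args: Array[String]): Unit = {
    // Reverse proxy URL set to "/test/proxy/prefix": the first "proxy/"
    // belongs to the user's prefix, not to Spark's own /proxy/<appId> route.
    val broken = appIdAfterFirstProxy("/test/proxy/prefix/proxy/app-20220413-0001/stages")
    println(broken) // prints "prefix", not the application ID

    // Without the keyword in the prefix, the same logic works as intended.
    val ok = appIdAfterFirstProxy("/proxy/app-20220413-0001/stages")
    println(ok)
  }
}
```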
We could fix this entirely, as in #36174, but doing so is risky and complicated.
Why are the changes needed?
Avoid users setting keywords in reverse proxy URL and getting wrong UI results.
Does this PR introduce any user-facing change?
No
How was this patch tested?
A new unit test.
Also doc preview: (screenshot omitted)