Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Router bug with full request URL #3088

Closed
mscdex opened this issue Sep 24, 2016 · 21 comments
Closed

Router bug with full request URL #3088

mscdex opened this issue Sep 24, 2016 · 21 comments

Comments

@mscdex
Copy link

mscdex commented Sep 24, 2016

It appears that Express ignores the other properties of the parsed request URL when matching defined routes against an incoming request.

This is problematic because "full" request URLs for example will match any defined route handler that contains the path part of the "full" URL.

Example:

GET http://google.com HTTP/1.1
Connection: close

will cause Express to match a handler like this:

app.get('/', function(req, res) {
  res.end('Hello world!');
});
@dougwilson
Copy link
Contributor

Yes, this is as it was designed to behave.

@dougwilson
Copy link
Contributor

The routing is only path name routing. If you are looking to better handle those types of URLs, you'd want to implement a top-level router that looks at the full URL and passes it to the appropriate router for the path part.

I hope this answers your... statement?

@dougwilson
Copy link
Contributor

Oh, haha,I just noticed the issue title says "bug" :) I wouldn't classify this as a bug, more of a misunderstanding of how the routing works. We actually have tests that ensure we handle those full URLs like we are doing. There is a lot of discussion in the connect repo on the history, as that code originated from connect.

The routing syntax is basically just method name + path name. Any other type of routing you need to layer on top of express as needed, for example the suggestion above if you are looking for that type of routing.

Anyone can even make a new router implementation and plug it into express (just a function that takes req, res, next) which can do different things :)

I'm not sure if this is well documented on our site, dispute it existing for multiple major versions of express. If not, we can open an issue to get improved docs around this :) !

@mscdex
Copy link
Author

mscdex commented Sep 24, 2016

It seems like something this unexpected (IMHO) should at least be noted in documentation somewhere. I could not find any such notes.

This could be a real issue since there are bots (and/or malicious users looking for open proxies) that send full URLs and those requests inadvertently are responded to with a status code of 200 for Express servers with matching paths.

@dougwilson
Copy link
Contributor

should at least be noted in documentation somewhere. I could not find any such notes.

I agree. I'm not sure if it is, but if not, please feel free to open an issue there or even a pull request :) ! We try hard on the docs and it's hard to remember everything, so we honestly missing things, I apologize.

This could be a real issue since there are bots (and/or malicious users looking for open proxies) that send full URLs and those requests inadvertently are responded to with a status code of 200 for Express servers with matching paths.

If you are suggesting a security issue, please follow our security disclosure guidelines to discuss; this is not the appropriate forum, I'm sure you understand.

@mscdex
Copy link
Author

mscdex commented Sep 25, 2016

If you are suggesting a security issue, please follow our security disclosure guidelines to discuss; this is not the appropriate forum, I'm sure you understand.

Not suggesting a security issue per se, mostly just the unnecessary transfer of data. Any Express applications should already be protecting access to resources as usual.

@dougwilson
Copy link
Contributor

@mscdex, gotcha, sorry for the misunderstanding on my part. So I did just jot down some thoughts on the documentation end (because at the very least, Express 4.x and previous versions will have this behavior that needs to be documented) at the issue expressjs/expressjs.com#716 for tracking.

I can see the unnecessary transfer of data part, but I mean, since the router itself is only a method + pathname router, without completely changing how the path syntax is, I'm not sure how it can take into account the scheme + host parts. I suppose we could just never route the request at all if those parts are present, but then it would be very difficult for people to even use the router at all to route those types of requests if that was their use-case. Express itself is "unopinionated" in this case, just routing based on the information Express was given: the method + pathname and simply ignoring all other parts (the scheme, host, query, etc.) and leaving that up to further user-land code currently.

I was taking a look to determine if there is any traceable conversation from when this was added, but only ended up on the commit itself that added it: senchalabs/connect@7560a5a

As far as I can tell, the purpose is as I have been living with so far: that the router literally only routes on what you have provided it (method + path) and ignores all the rest (scheme, host, query, etc.), leaving it up to the user-land to handle.

Let me know if I'm totally off-base here or missing something :) I know this issue started off reporting a bug, which turned out to be "not a bug; functions as designed", but if you want to move to a conversation of questioning the design itself, that's defiantly OK, and I can re-open the issue (and probably change the title and initial post to make sense for the discussion starter).

@mscdex
Copy link
Author

mscdex commented Sep 25, 2016

Maybe I'm in the minority here since the issue hadn't been raised previously, but IMHO I would expect a route like app.get('/', ...) to only match requests like GET / HTTP/1.1 and not a FQDN of any kind. I think that those users that intentionally want to handle FQDNs should be the ones needing to explicitly add a generic middleware to handle those requests (and not the other way around).

@blakeembrey
Copy link
Member

I also never understood the design decision for why FQDNs were supported in the request path either. Manually crafting these requests with a couple of domains (Cloudflare, Heroku, etc) results in a 400. In others (Google), it's a 404. Neither rewrite it to match by the path. I also would think of this as a bug, since there's no way to catch these requests when using Express and therefore no way to reject these types of requests.

Does anyone have a reference to why we'd want to support it in a way that we ignore the original URL? It seems that the ideal recommendation would be that we push people to support this themselves by rewriting the path (if they want to support an FQDN path). I'm even having trouble finding what/why requests like this occur - does anyone have a reference to that either?

@dougwilson
Copy link
Contributor

since there's no way to catch these requests when using Express and therefore no way to reject these types of requests.

Yes there is:

app.use(function (req, res, next) {
  if (url.parse(req.url).host) return res.sendStatus(400);
  next();
});

Does anyone have a reference to why we'd want to support it in a way that we ignore the original URL?

I'm not sure. I looked through the Express and connect issues and found back in 2013 when it wasn't working right a bunch of issues and PRs were made to fix the support, but nothing since. Connect itself supports it since... forever.

It seems that the ideal recommendation would be that we push people to support this themselves by rewriting the path (if they want to support an FQDN path).

I don't agree. That goes against the values of Express, namely being "unopinionated".

I'm even having trouble finding what/why requests like this occur - does anyone have a reference to that either?

I would like to understand that as well.

So from what I can tell so far, I see the following here: (1) Express/Connect has always acted like this (2) over time, there has not been a single objection to this feature until now (3) there have been issue reports and PRs to fix this feature, indicating people actually use it.

@blakeembrey perhaps this should be a topic for the next Express.js TC meeting on 10/5, if there is no good resolution before then? So far I'm a big fat 👎 on changing this behavior, and see this issue as more of a lack of documentation issue. Express's router only routes on method + pathname; it does not route on anything else. That is just how it is designed. I'm sure if we started sending back 400 Bad Request if the req.url had a query string that would be unexpected, since Express does not route based on query string, and to me, doing that for the host is just the same.

Telling people they need to rewrite the URL if they want to support FQDN is non-sense, because they will have to re-implement the entire way that app.use works by chopping down the pathname from the prefix. It would also make it impossible to use any middleware that wants to write back the URL, since that information will be lost from cutting it out.

Perhaps what everyone is surprised on is the fact that Node.js HTTP server even lets these types of URLs in in the first place? Perhaps the Node.js HTTP server should make that an opt-in feature? Express simply routes whatever it gets; Express is not the HTTP server, only a slim routing + sugar layer on top of the Node.js core server.

@dougwilson
Copy link
Contributor

FWIW, I sent the following to www.google.com and got a 200 OK:

GET http://www.google.com/ HTTP/1.1
Host: www.google.com

I got a 200 OK when sending the following to www.microsoft.com:

GET http://www.microsoft.com/ HTTP/1.1
Host: www.microsoft.com

Even sending the following to www.microsoft.com results in a 200 OK with the Microsoft home page as the response:

GET http://www.google.com/ HTTP/1.1
Host: www.microsoft.com

@mscdex
Copy link
Author

mscdex commented Sep 25, 2016

Yes there is:

app.use(function (req, res, next) {
if (url.parse(req.url).host) return res.sendStatus(400);
next();
});

That is one solution, but unfortunately it will result in a significant performance hit since it would be executed for every request.

Perhaps a better workaround might be something like:

app.use(function(req, res, next) {
  if (req.originalUrl.charCodeAt(0) !== 47/*'/'*/)
    return res.sendStatus(400);
  next();
});

Perhaps what everyone is surprised on is the fact that Node.js HTTP server even lets these types of URLs in in the first place?

I'm not surprised, node's http module is a low-level thing (like many other built-in modules) and there are situations that require FQDNs, especially http proxies with the CONNECT method.

I think we can all agree that this is at least a documentation issue, but IMHO the behavior is still unexpected. What is wrong with changing the behavior in the next major version of Express? Does Express not follow semver or ?

@dougwilson
Copy link
Contributor

What is wrong with changing the behavior in the next major version of Express? Does Express not follow semver or ?

It does, but I outlined the reasons above, which boil down to a history of issue of people reporting bug to get this working better vs this being the very first issue asking to not do it. I'm not going to just put those people out in the cold without fully understanding why we should do that, providing a proposal for what Express's behavior should be, and providing documentation & migration paths for all the people currently using this feature. I even provided examples of major websites that are showing the same behavior of the default Express.

@dougwilson
Copy link
Contributor

dougwilson commented Sep 25, 2016

That is one solution, but unfortunately it will result in a significant performance hit since it would be executed for every request.

Yea, sorry, that was just me being quick :) If you actually use the parseurl module instead of url.parse, you shouldn't get the hit, because it will just load the cached parse response instead of re-parsing.

but IMHO the behavior is still unexpected

I would like to understand why. So far it is starting to seem like perhaps we are talking past each other instead of discussing with each other? I know you are expressing that opinion, and I spent a lot of my own time trying to write down all my reasons above, but it doesn't seem like you have any responses to them, rather just re-stating your initial statement... unless I'm missing something?

@dougwilson
Copy link
Contributor

Even the following request to www.apache.org:80 works exactly like Express would handle it:

GET http://www.google.com/licenses/ HTTP/1.1
Host: www.apache.org

It simply acts as if the request was for GET /licenses/

@mscdex
Copy link
Author

mscdex commented Sep 25, 2016

There are plenty of counter-examples though, including the ones that @blakeembrey pointed out. For example:

Sending a request to www.google.com:80 with:

GET http://www.apache.org/ HTTP/1.1
Host: www.google.com
Connection: close

results in a 404 response. But sending this request:

GET http://www.google.com/ HTTP/1.1
Host: www.google.com
Connection: close

results in a 200 response (because they are obviously checking FQDNs but IMHO that is not really necessary).

I would like to understand why.

Because it's obviously/literally not what the client is requesting (especially if the FQDN does not match the vhost)? 'http://www.google.com/' !== '/'

@dougwilson
Copy link
Contributor

I know there are plenty of counter examples. But if the argument is there are servers that don't allow it, then the other side can argue that there are servers that do do it.

@dougwilson
Copy link
Contributor

@mscdex you sill haven't responded to most of all my statements and concerns. It's really hard for me to continue this conversation like this... it's not a conversation, just you saying the same thing over and over.

@expressjs expressjs locked and limited conversation to collaborators Sep 25, 2016
@dougwilson
Copy link
Contributor

I know this is locked, but I have been spending my Sunday afternoon researching this issue, and the best I can find an answer is this is from the HTTP/1.1 RFC itself, RFC 7230, section 5.3.2:

To allow for transition to the absolute-form for all requests in some
future version of HTTP, a server MUST accept the absolute-form in
requests, even though HTTP/1.1 clients will only send them in
requests to proxies.

This quote describes what Express is currently doing as a MUST behavior for origin servers, and rejecting those types of URLs at the Express layer for all cases is a violation of that MUST, imo.

@blakeembrey
Copy link
Member

@dougwilson It might be worth discussing briefly. Although it's a violation of the must, I'm not sure the interpretation is that we should be ignoring all hosts/scheme/port/etc in the request? I guess it's also up to interpretation of "unopinionated", since this does sound like an opinion. Perhaps it's best as not a feature of the router, but as a feature of Express.js that can be configured (much like the X-Powered-By header or case sensitivity). At least then people could route on these non-absolute paths. It'd be really good to hear from the people that are using it, just because I don't understand how/why it's used currently.

@dougwilson
Copy link
Contributor

Although it's a violation of the must, I'm not sure the interpretation is that we should be ignoring all hosts/scheme/port/etc in the request?

Correct, the RFC does not way to ignore it, but also doesn't way what to do with it, only that the server MUST respond to it. I can see two solutions here (1) just ignore it as we do today or (2) provide some kind of mechanism to somehow understand what the allowed hosts are in order to only route on them. We could require everyone who ever writes an Express app to explicitly declare their expected hosts to even use Express I guess... or something else? What are your thoughts or proposal? Should we put this item on the next TC meeting as a proposed to do above?

I guess it's also up to interpretation of "unopinionated", since this does sound like an opinion.

I mean, when I say that, I mean that Express doesn't care what you want to do with those types of URLs, because it's so easy to add a middleware at the top of your app if you wanted to reject them. If we always rejected them, how can people not reject them? Simply rewriting the req.url to remove that from the beginning won't work, because then they need a way to intercept all middleware to put it back on the beginning so the middleware can properly understand the URL (for redirects, as an example). We could go and implement some framework to do all that, if you're willing to make a proposal :)

The biggest issue, and I've brought this up on multiple TC meetings so far, is that we don't have a written explanation of what things like "unopinionated" mean. We talked about it and agreeded on what we think it should be, and there was plenty of time for all TC members to attend or even watch the recordings. I am simply expressing what we agreed upon here. If you're interested in attending the next TC meeting, we can always bring up this topic as well for discussion again, if that helps.

Perhaps it's best as not a feature of the router, but as a feature of Express.js that can be configured (much like the X-Powered-By header or case sensitivity).

I mean, I'm not sure if I'm saying the right words here or some kind of disconnect. The way the router works is that is only routes on the information given to it: the method + pathname. Everything else in the url is ignored during routing, for example, the query string is ignored, just as all other non-specified aspects of the request. If you have a thought on how we should add that into the routing, please, absolutely open an issue or pull request detailing the specifics on the proposal so we can understand and evaluate it :)

At least then people could route on these non-absolute paths.

I assume this is a typo? I'm not actually sure what you meant here, if you can clarify :)

It'd be really good to hear from the people that are using it, just because I don't understand how/why it's used currently.

If you know of some way to reach out to the users of Express (especially since this is an automatic behavior that they may not even know they are relying on), I would love to hear it :) There are many times I would like to get this kind of feedback, but it seems like an impossible audience to reach for the most part. Technically every user of Express is "using it", since it is how Express behaves by default, but probably understanding what happens if we wanted to turn this off is going to be a large project.

Perhaps what we can o is build some kind of telemetry service into Express itself, and ask people to opt-in to collect telemetry on their Express apps which would give us some way to answer these types of questions using analysis back on our end. That would help a lot in these types of things when we think about how to go about changing fundamental Express behavior. I would love to hear ideas around this. Currently I try to use issues as an indicator, and based on issues, we have multiple issues through out the Express and Connect repos reporting and fixing bugs in this behavior to make it work better vs 1 issue (this one) proposing that maybe it should not work that way (and since the issue was filed as a bug rather than a proposal that it should work differently, it's hard to say it was a true proposal vs a surprise reaction against an honest mistake in our documentation).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants