Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Clean up URI parsing: second attempt, in the name of PHP 5.2.4 #1326

Closed
wants to merge 12 commits into
from

Conversation

Projects
None yet
2 participants
Contributor

sourcejedi commented May 4, 2012

[Successor to https://github.com/EllisLab/CodeIgniter/pull/1316]

Plug a few gaps in the URI parsing.

After the first attempt, I noticed that readme.rst says CodeIgniter now requires PHP 5.2.4. Excellent! This means PATH_INFO is now reliable https://bugs.php.net/bug.php?id=31892, which allows some more cleanup (and hence removes the need to document the bug and its workarounds).

Thanks for timw4mail's feedback so far, which inspired me to remove the code which falls back to $_GET.

sourcejedi added some commits May 4, 2012

Remove redundant attempt to parse URL from $_GET
PHP generates $_GET from QUERY_STRING, which we already try.
There's no suggestion that PHP would ever mangle QUERY_STRING,
so there's no reason to try both.
Accept absolute URIs in REQUEST_URI (strict HTTP/1.1 compliance)
"To allow for transition to absoluteURIs in all requests in future
versions of HTTP, all HTTP/1.1 servers MUST accept the absoluteURI
form in requests, even though HTTP/1.1 clients will only generate
them in requests to proxies." - RFC2616

(and at least on my Apache system, REQUEST_URI is taken straight
 from the HTTP request line)
REQUEST_URI and QUERY_STRING are url-encoded - so decode them
The CGI specification (and testing with PHP) says that QUERY_STRING
is url-encoded.  (PATH_INFO is decoded though).  REQUEST_URI must be
url-encoded otherwise it wouldn't work, and this is confirmed by
testing (Apache).

"the origin server MUST decode the Request-URI in order to
 properly interpret the request" - RFC 2616

"The QUERY_STRING variable contains a URL-encoded search or parameter
 string" - CGI/1.1
uri_protocol QUERY_STRING: support extra parameters like REQUEST_URI
index.php?location?key=value

REQUEST_URI is effectively the default at the moment,
so let's make QUERY_STRING consistent.
Remove obsolete use of getenv()
There's no need to use getenv() for QUERY_STRING (or CONTENT_TYPE).
The only reason to use getenv() was to bypass a bug in PHP's handling
of PATH_INFO.

PHP doesn't mangle the other two.  If they're accessible through
getenv(), they will also be present in $_SERVER.  Every SAPI that
defines getenv() also copies the entire environment into $_SERVER.
(See e.g. php_import_environment_variables() and the CGI SAPI).

And the PATH_INFO bug is fixed in PHP 5.2.4 - which is the minimum
required for this version of CodeIgniter.
<https://bugs.php.net/bug.php?id=31892>
URI_test: reset $_SERVER for each test
QUERY_STRING may be set during URI parsing,
so make sure we clear it before running the next test.
PHP 5.2.4 fixes PATH_INFO, so we can start using it (in preference to…
… REQUEST_URI)

CodeIgniter now requires PHP 5.2.4, which fixes the PATH_INFO bug
<https://bugs.php.net/bug.php?id=31892>.

When uri_protocol is AUTO (the default), we should prefer PATH_INFO.
(The REQUEST_URI code is a bit of a hack.  It can't handle arbitrary
URL rewriting e.g. /blog/.* -> /blog.php/$1).

This means noone will need to use ORIG_PATH_INFO any more
(and it won't work if you try to use it).

Also accept any non-empty value of PATH_INFO (or QUERY_STRING).
This makes for easier to predict behaviour and simpler code.
/index.php/0 is not /index.php (inappropriate use of empty())
Don't use empty() to test for an empty string.

    php > $a = '0';
    php > echo empty($a);
    1
/index.phpabout is not /index.php/about
Be more precise in parsing REQUEST_URI
Un-break extra parameters which have url-encoded characters
Fix "REQUEST_URI and QUERY_STRING are url-encoded".
It's correct to decode the path.  But it's not correct
to decode the parameter string before it is parsed.
We want to match the normal behaviour, which looks like this:

</cgi-bin/form.php?a=1&b=2>
	$_GET == array('a'=>1, 'b=>2')

</cgi-bin/form.php?a=1%26b%3D2>
	$_GET == array('a' => '1&b=2')
Fix fallback to REQUEST_URI in default configuration
This is my fault.  At some point I misunderstood what isset() does.
Knowing that isset('') = TRUE, I re-reviewed all my changes to URI
parsing and found this issue.

QUERY_STRING could legitimately be "set" to the empty string.
In fact, CGI requires that QUERY_STRING is always set,
even if it would be empty.

So we need to test whether QUERY_STRING is the empty string
before deciding whether to use it (and if it *is* empty,
we will fall back to parsing REQUEST_URI).
Remove some trailing whitespace
Changes to URI parsing introduced trailing whitespace.
(Git described these as "whitespace errors").
Contributor

sourcejedi commented Jul 31, 2012

Any chance of getting the second commit applied, in particular? I guess this is all a bit dull. But we're breaking the fundamental standard of the Web, established for maybe a whole decade. It's not hard to fix, the code is easier to understand, and it's shorter as well :).

Contributor

narfbg commented Oct 26, 2012

@sourcejedi Could you explain what's wrong with the current code that you're replacing in the second commit (sourcejedi/CodeIgniter@68830a3)?

I'm all for implementing any of those improvements as long as they make sense, but I don't see anything wrong with that particular one. The rest I'll review one by one and sourcejedi/CodeIgniter@8f471d9 is even already implemented.

Contributor

sourcejedi commented Oct 27, 2012

Thanks for looking at this. I agree it's hard to understand the diff. Hmm, I guess it might have been worth spelling out what was meant by an absolute URI :).

In case you're unclear what I'm trying to do, here's the test that the current code fails. A request line using the host-relative URL works as expected...

$ telnet brick.scree.dyndns.org 80 Trying 10.0.0.1... Connected to brick.scree.dyndns.org (10.0.0.1). Escape character is '^]'. GET /Bonfire/admin/ HTTP/1.1 Host: brick.scree.dyndns.org

HTTP/1.1 302 Found
...

...but the absolute URL doesn't:

$ telnet brick.scree.dyndns.org 80 Trying 10.0.0.1... Connected to brick.scree.dyndns.org (10.0.0.1). Escape character is '^]'. GET http://brick.scree.dyndns.org/Bonfire/admin/ HTTP/1.1 Host: brick.scree.dyndns.org

HTTP/1.1 404 Not Found
...

It looks like the problem with the old code is that it applies parse_url() way too late. So, for example, it starts off directly comparing REQUEST_URI against SCRIPT_NAME. REQUEST_URI may start with http://, if that's how the request waas made. But SCRIPT_NAME will never contain http://.

narfbg added a commit that referenced this pull request Oct 31, 2012

CI_URI::_detect_uri() to accept absolute URIs
(thanks to @sourcejedi, PR #1326)

For HTTP/1.1 compliance, RFC2616 specifies that both relative
and absolute URI formats must be accepted:

- http://localhost/path/ (absolute)
- /path/ (relative)
Contributor

narfbg commented Oct 31, 2012

@sourcejedi Thanks, that is indeed a better explanation! The above referenced commit contains your changes from sourcejedi/CodeIgniter@8f471d9 with some small improvements.

I'll get on the rest of them right away.

Contributor

sourcejedi commented Oct 31, 2012

Great.

-    $uri = parse_url('pseudo://hostname/'.$uri, PHP_URL_PATH);

Wow. I had the good fortune not to see that line before :). Relieved to see your diff has a '-' in the leftmost column here.

narfbg added a commit that referenced this pull request Oct 31, 2012

Multiple improvements to the URI class
(thanks to @sourcejedi, PR #1326 for most of the ideas)

 - Renamed _detect_uri() and _parse_cli_args() to _parse_request_uri() and _parse_argv() respectively.
 - Added _parse_query_string() which allows us to detect the URI path from QUERY_STRING much like it is done in _parse_request_uri().

(the above changes also allow for a simpler logic in the case where the *uri_protocol* setting is not set to 'AUTO')

 - Updated application/config/config.php with a better list of the *uri_protocol* options.
 - Added _reset_query_string() to aid in re-processing  from the QUERY_STRING (utilized in _parse_request_uri() and _parse_query_string()).
Contributor

narfbg commented Oct 31, 2012

Used sourcejedi/CodeIgniter@3f2e98b as well.

Can you tell what this here is for?

sourcejedi/CodeIgniter@a259a32#L0R214

I have my guesses, but regardless ... why do we encode something after it's been decoded? parse_str() doesn't like it?

Contributor

sourcejedi commented Oct 31, 2012

Um. QUERY_STRING is defined as being encoded (by the CGI spec). Vanilla PHP code and coders will assume it is encoded. And the other code paths - i.e. when we're not using REQUEST_URI - don't change the original, encoded QUERY_STRING, do they?

Contributor

narfbg commented Oct 31, 2012

I'll handle URL-decoding in a separate commit ... and just answered the above question to myself.

narfbg added a commit that referenced this pull request Oct 31, 2012

Fix issues #388 & #705
(thanks to @sourcejedi, PR #1326 for pointing inconsistencies with RFC2616

narfbg added a commit that referenced this pull request Oct 31, 2012

Changed URI auto-detection to try PATH_INFO first
(thanks to @sourcejedi, PR #1326)

Up until PHP 5.2.4 (which is our new lowest requirement),
there was a bug related to PATH_INFO which made REQUEST_URI
a more reliable choice. This is now no longer the case,
see https://bugs.php.net/bug.php?id=31892 for more details.

Also removed ORIG_PATH_INFO from the suggested alternatives
for uri_protocol in application/config/config.php as it will
not exist in most of PHP's recent versions and is pointless
when you can use PATH_INFO anyway.
Contributor

narfbg commented Oct 31, 2012

... and that was the last commit is all that we need from the rest.

Thanks again @sourcejedi for these contributions! It surely takes a good amount of time to look into all of this.

@narfbg narfbg closed this Oct 31, 2012

Contributor

sourcejedi commented Nov 1, 2012

Fair enough - at least on the encoding, I made bit of a hash of it.

As well as losing some consistency about use of getenv(), I think the new commits omit the fix "/index.phpabout is not /index.php/about". I could rebase that and resend if you like.

nonchip pushed a commit to nonchip/CodeIgniter that referenced this pull request Jun 29, 2013

CI_URI::_detect_uri() to accept absolute URIs
(thanks to @sourcejedi, PR #1326)

For HTTP/1.1 compliance, RFC2616 specifies that both relative
and absolute URI formats must be accepted:

- http://localhost/path/ (absolute)
- /path/ (relative)

nonchip pushed a commit to nonchip/CodeIgniter that referenced this pull request Jun 29, 2013

Multiple improvements to the URI class
(thanks to @sourcejedi, PR #1326 for most of the ideas)

 - Renamed _detect_uri() and _parse_cli_args() to _parse_request_uri() and _parse_argv() respectively.
 - Added _parse_query_string() which allows us to detect the URI path from QUERY_STRING much like it is done in _parse_request_uri().

(the above changes also allow for a simpler logic in the case where the *uri_protocol* setting is not set to 'AUTO')

 - Updated application/config/config.php with a better list of the *uri_protocol* options.
 - Added _reset_query_string() to aid in re-processing  from the QUERY_STRING (utilized in _parse_request_uri() and _parse_query_string()).

nonchip pushed a commit to nonchip/CodeIgniter that referenced this pull request Jun 29, 2013

Fix issues #388 & #705
(thanks to @sourcejedi, PR #1326 for pointing inconsistencies with RFC2616

nonchip pushed a commit to nonchip/CodeIgniter that referenced this pull request Jun 29, 2013

Changed URI auto-detection to try PATH_INFO first
(thanks to @sourcejedi, PR #1326)

Up until PHP 5.2.4 (which is our new lowest requirement),
there was a bug related to PATH_INFO which made REQUEST_URI
a more reliable choice. This is now no longer the case,
see https://bugs.php.net/bug.php?id=31892 for more details.

Also removed ORIG_PATH_INFO from the suggested alternatives
for uri_protocol in application/config/config.php as it will
not exist in most of PHP's recent versions and is pointless
when you can use PATH_INFO anyway.

narfbg added a commit that referenced this pull request Feb 18, 2015

Fix #3593
Revert "fixes" for #167, #388, #705 (also #1326) as it turns out
URL-decoding isn't compliant with the CGI/1.1 specification.

RFC 3875: http://www.faqs.org/rfcs/rfc3875.html
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment