Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and improve URI implementation #6323

Merged
merged 13 commits into from Apr 5, 2019

Conversation

@straight-shoota
Copy link
Member

commented Jul 3, 2018

This PR applies a ton of mostly small improvements to URI and URI::Parser.

  • The type of URI#path is changed to String. It is no longer nilable. The semantic is: every URI has a path, but it may be empty. This avoids ambiguity between empty path and nil path.
  • The macro URI::Parser.step is removed. It was previously used to spec internals of URI::Parser but it no longer works because the struct can't be inherited. It doesn't make much sense to spec the internal API anyway, having unit tests for URI.parse is totally fine. Consequently, I have removed the specs for URI::Parser and integrated them into the specs for URI.parse.
  • Most notable is probably the removal of the property URI#opaque. It is simply not necessary to have an extra ivar for this. It is replaced by URI#path which contains both a path in the hierarchical part as well as a non-hierarchical, opaque path. URI#opaque? answers if the URI is considered opaque.
  • The spec suite has been reorganized and I added a bunch of new examples, borrows from gloang and akka.
  • During that process I discovered and fixed a few issues in the parser and formatter:
    • #to_s no longer hides the port if it is the default port. A default port can now be stripped using #normalize.
    • #parse downcases the scheme to make it case-insensitive. This is recommended by RFC 3986.
    • #parse decodes escape sequences in host, and #to_s applies them.
    • #parse correctly parses an empty port. Previously, scheme://host:/ would set port to 0, now it's nil.
    • #parse parses a string starting with three slashes as empty host. Previously, host was nil.
    • #to_s inserts a slash if path does not start with one and follows an authority. Paths starting with double slash // are prefixed by /. if it does not follow an authority. Otherwise the first path component would be interpreted as authority.

@straight-shoota straight-shoota force-pushed the straight-shoota:jm/fix/uri-refactor branch from 1afc004 to 165ae8d Jul 3, 2018

describe ".parse" do
# scheme
assert_uri("http:", scheme: "http")
it { URI.parse("HttP:").should eq(URI.new(scheme: "http")) }

This comment has been minimized.

Copy link
@yxhuvud

yxhuvud Jul 3, 2018

Contributor

you are missing a spec of what the result of #scheme on a missing scheme is. Some tests below lacks scheme but none tests what #scheme is in that case.

This comment has been minimized.

Copy link
@straight-shoota

straight-shoota Jul 3, 2018

Author Member

I'm not sure what you mean. There are many specs for no-scheme URIs. For example assert_uri("/foo", path: "/foo"). It ensures that the URI parsed from "/foo" has a path of /foo and all other properties at the default value. This means scheme should be nil.

@yxhuvud

This comment has been minimized.

Copy link
Contributor

commented Jul 3, 2018

The test failures make me wonder if eq should handle default ports.

@straight-shoota

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2018

Nah, the specs are only failing because of the previous incorrect behavior of ommitting the port if it was specified but equal to the default.

Comparing two URIs should be strict. If a port is specified, even if it's the default, there are chances this is on purpose. http://example.com and http://example.com:80 are not identical. You can easily use #normalize to remove default ports.

@asterite

This comment has been minimized.

Copy link
Member

commented Jul 3, 2018

$ irb
irb(main):001:0> require "uri"
=> true
irb(main):002:0> u1 = URI.parse("http://foo.com:80")
=> #<URI::HTTP http://foo.com>
irb(main):003:0> u2 = URI.parse("http://foo.com")
=> #<URI::HTTP http://foo.com>
irb(main):004:0> u1 == u2
=> true
@asterite

This comment has been minimized.

Copy link
Member

commented Jul 3, 2018

I'm not saying we should copy Ruby... but specifying a default port when it's the default one, and keeping it around... I don't know... there's no difference, really.

@straight-shoota

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2018

@asterite Yeah, Ruby does that. But it also uses a scheme-specific type URI::HTTP. Maybe that can serve as an excuse...

Besides, just a few counter examples:

u, err := url.Parse("http://foo.org:80")
fmt.Println(u.Port()) // => 80
System.out.println(new URI("http://foo.com:80").getPort()); // => 80
urlparse('http://foo.com:80').port # => 80
@straight-shoota

This comment has been minimized.

Copy link
Member Author

commented Jul 3, 2018

there's no difference, really.

For well-known schemes, probably not. But when you're using a custom scheme, it might not be given that the next consumer of a URI has the same default port configured.

URI should be a simple data container and serialize to and from string representation without any further processing. If you want to get rid of unnecessary default ports, use #normalize.

URI.parse(string).to_s == string should hold for any well-formed URI.

@asterite

This comment has been minimized.

Copy link
Member

commented Jul 3, 2018

Sounds good

@straight-shoota straight-shoota force-pushed the straight-shoota:jm/fix/uri-refactor branch from 0a1513e to dd2d8bc Aug 15, 2018

Show resolved Hide resolved src/uri.cr

@straight-shoota straight-shoota force-pushed the straight-shoota:jm/fix/uri-refactor branch from 6cd19ee to ddda611 Aug 19, 2018

@RX14

RX14 approved these changes Aug 20, 2018

@straight-shoota

This comment has been minimized.

Copy link
Member Author

commented Oct 5, 2018

🏓 A second review would be great!

@RX14 RX14 requested a review from bcardiff Oct 6, 2018

Show resolved Hide resolved src/uri.cr
@straight-shoota

This comment has been minimized.

Copy link
Member Author

commented Nov 27, 2018

@bcardiff 🏓

@straight-shoota

This comment has been minimized.

Copy link
Member Author

commented Jan 7, 2019

I'd really like to get this merged for 0.28.0. It's so annoying to have to work around URI bugs I fixed six months ago.

@straight-shoota straight-shoota added this to the 0.28.0 milestone Jan 7, 2019

@bararchy

This comment has been minimized.

Copy link
Contributor

commented Jan 7, 2019

I'm with @straight-shoota please let's get this merged

Show resolved Hide resolved spec/std/uri_spec.cr
it { URI.new("http", "www.example.com", 80, "/hello", "a=1").to_s.should eq("http://www.example.com/hello?a=1") }
it { URI.new("mailto", opaque: "foo@example.com").to_s.should eq("mailto:foo@example.com") }

it "removes default port" do

This comment has been minimized.

Copy link
@Sija

Sija Jan 7, 2019

Contributor

Why not normalize ports from default?

This comment has been minimized.

Copy link
@straight-shoota

straight-shoota Jan 7, 2019

Author Member

There are no generic default ports in the URI specification. It depends solely on configuration which port is considered "default".

When no port is specified, the resolver is entrusted to choose whichever the port it seems fit. But when a port is explicitly specified, the resolver is asked to use that specific port, regardless of it's configuration for default ports.

Thus removing a port (even if it is considered default) mutates the URI. It looses information.
A well-formed URI should not be automatically stripped of anything unless explicitly asked (for example using #normalize).

@straight-shoota straight-shoota force-pushed the straight-shoota:jm/fix/uri-refactor branch 2 times, most recently from 8f1be0b to a88776b Jan 7, 2019

@asterite

This comment has been minimized.

Copy link
Member

commented Jan 8, 2019

I think it makes sense to keep the port if the URI has it set explicitly, even if it's the default port for that scheme.

Show resolved Hide resolved src/uri.cr

@straight-shoota straight-shoota force-pushed the straight-shoota:jm/fix/uri-refactor branch from a88776b to 19be7a1 Feb 10, 2019

@straight-shoota

This comment has been minimized.

Copy link
Member Author

commented Feb 11, 2019

Rebased and ready to ship.

@wontruefree

This comment has been minimized.

Copy link
Contributor

commented Feb 11, 2019

@straight-shoota should all the parse_ methods in the src/uri/uri_parser.cr file be private?

@wontruefree

This comment has been minimized.

Copy link
Contributor

commented Feb 12, 2019

I just realized that might have been out of scope for this refactor but thanks for moving the parse methods to private.

Show resolved Hide resolved src/uri.cr
@straight-shoota

This comment has been minimized.

Copy link
Member Author

commented Feb 12, 2019

@wontruefree Yeah I thought so at first. But then, what the heck... it's a simple thing and fits to the set of changes in this PR.

@bcardiff
Copy link
Member

left a comment

We need a rebase and fix some conflict and we can merge this.

straight-shoota added some commits Jul 3, 2018

Refactor URI::Parser removing obsolete step macro
This macro was previously used to spec internals of URI::Parser, but it
no longer works because the struct can't be inherited. It doesn't make
much sense to spec the internal API anyway. Having unit tests for
URI.parse is sufficient.
The specs for URI::Parser integrated into the specs for URI.parse.
Refactor URI#path to be non-nilable
Every URI has a path, but it may be empty. Making the ivar non-nilable
avoids any ambiguity whether an empty path is represented by an empty
string (`""`) or `nil`.
Fix URI#to_s to always print port when set
Previously, `#to_s` would omit the port if it matches the default port,
even if it is explicitly specified. This ommitance changes the URI and
must be avoided, unless explicitly asked to normalize.

Changes `#normalize` to remove the port if matches the schema's
default port.
Change URI.parse to downcase scheme
URI schemes are case-insensitive, this ensures that. This approach
is recommended by RFC 3986.
Fix host with escape sequences in URI.parse/URI#to_s
URI.prase decodes escape sequences in host, URI#to_s applies them where
necessary.
This is already some implicity normalization but results in the same
URI, although in a differen string representation.
Fix URI.parse with empty port
An empty port was previously interpreted as port `0`. Now it's `nil`.
Fix URI.parse triple slash path
An URI starting with three slashes (`///`) has an empty authority but
it's still a hierarchical URI. `host` must be the empty string,
not `nil`.
Fix URI#to_s path starting with `/`
`#to_s` inserts a slash if `path` does not start with one and follows a
non-nil authority.
Paths starting with double slash (`"//"`) are prefixed by ("/.`) if it
does not follow an authority. Otherwise the URI would be considered
relative to the schema and the first path component interpreted as authority.
Remove URI#opaque
It is not necessary to have an extra ivar for an opaque path, URI#path
is sufficient and can hold the path of both opaque and hiearchicyl URIs.

Adds URI#opaque? to tell whether the URI is considered opaque.

@straight-shoota straight-shoota force-pushed the straight-shoota:jm/fix/uri-refactor branch from 05e186b to b9acfe7 Mar 27, 2019

@bcardiff bcardiff merged commit b2a1357 into crystal-lang:master Apr 5, 2019

5 checks passed

ci/circleci: check_format Your tests passed on CircleCI!
Details
ci/circleci: test_darwin Your tests passed on CircleCI!
Details
ci/circleci: test_linux Your tests passed on CircleCI!
Details
ci/circleci: test_linux32 Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@straight-shoota straight-shoota deleted the straight-shoota:jm/fix/uri-refactor branch Apr 5, 2019

@yorci yorci referenced this pull request Apr 19, 2019

Merged

Crystal 0.28.0 Adaptation #62

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.