Avoid breaking pages that use the URL fragment for routing/state #15

Closed
dead-claudia opened this issue Feb 20, 2019 · 34 comments

Comments

@dead-claudia

dead-claudia commented Feb 20, 2019

Edit: s/client instruction/processing instruction/g


Currently, URIs treat %% as invalid. Should we maybe extend accepted URIs to support this use case using that token, as effectively "processing instructions" (like "scroll to text") rather than necessarily seeing it as a fragment? These "processing instructions" would not be exposed to users, at least initially, and it wouldn't be included in what's sent to servers.

I'm thinking maybe view it as this:

  • https://example.com#!/route%%q=some%20text - Search for some text
  • https://example.com#!/route%%q=some%20text%%n=2 - Search for second occurrence of some text
  • I could see other potential additions like delaying the search (for JS), selecting or filtering by CSS selector, etc.
  • Of course, order doesn't matter. %%q=some%20text%%n=2 and %%n=2%%q=some%20text are equivalent.
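To make the bullet points above concrete, here is a minimal sketch of how a client could split such a URL into the part a router sees and the "processing instructions". The function and type names (`parseProcessingInstructions`, `ParsedInstructionUrl`) are hypothetical and not part of any proposal; the sketch only assumes the `%%key=value` shape shown in the examples.

```typescript
// Hypothetical parser for the "%%" processing-instruction syntax sketched above.
// Nothing here is a real browser API; all names are illustrative.
interface ParsedInstructionUrl {
  base: string;                          // everything before the first "%%"
  instructions: Record<string, string>;  // decoded key/value pairs
}

function parseProcessingInstructions(href: string): ParsedInstructionUrl {
  const idx = href.indexOf("%%");
  const base = idx === -1 ? href : href.slice(0, idx);
  const tail = idx === -1 ? "" : href.slice(idx + 2);
  const instructions: Record<string, string> = {};
  for (const part of tail.split("%%")) {
    if (part === "") continue;
    const eq = part.indexOf("=");
    const key = eq === -1 ? part : part.slice(0, eq);
    const value = eq === -1 ? "" : part.slice(eq + 1);
    instructions[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return { base, instructions };
}

const first = parseProcessingInstructions("https://example.com#!/route%%q=some%20text%%n=2");
const second = parseProcessingInstructions("https://example.com#!/route%%n=2%%q=some%20text");
console.log(first.base);              // https://example.com#!/route
console.log(first.instructions["q"]); // some text
console.log(second.instructions["n"]); // 2
```

Because the instructions are stored by key, the two orderings produce the same set of key/value pairs, which is one way to read "order doesn't matter".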

Alternatively, you could wrap each "processing instruction" in brackets like [q=some%20text] as suggested in #13, but I feel a double percent sign is probably a little easier to explain and use. (I see potential use in both technical and non-technical circles, so accessibility to non-technical people is a concern of mine.)

This would resolve and/or address numerous existing issues already filed:

@bokand
Collaborator

bokand commented Feb 26, 2019

I actually think that the use case we're proposing is very conforming to the intended and specified purpose of fragment identifiers so I'd rather not invent a new syntax for it. The motivation here being to avoid breaking SPA routing? My guess would be that if an SPA router is broken by appending '&targetText' it'll also be broken by '%%...'.

There's also the post in #7 detailing similar breakage as a result of '##'.

I could be convinced if it turns out extending fragid to support TextQuoteSelector is non-web-compatible but that seems like the most promising way forward to me.

@dead-claudia dead-claudia changed the title Should this extend the URI syntax for its part? Should this extend the URI syntax and be in terms of "client instructions"? Feb 26, 2019
@dead-claudia dead-claudia changed the title Should this extend the URI syntax and be in terms of "client instructions"? Should this extend the URI syntax and be in terms of "processing instructions"? Feb 26, 2019
@dead-claudia
Author

@bokand

I actually think that the use case we're proposing is very conforming to the intended and specified purpose of fragment identifiers so I'd rather not invent a new syntax for it.

In this suggestion of mine, I'm also trying to recast it as "processing instructions". I did a poor job of making this clear in the title (fixed), but I was specifically trying to make it more programmatic and less declarative. It's not a simple selector, but a processing instruction.

The motivation here being to avoid breaking SPA routing?

I know the context of the issue might make it seem that way, but SPA router compatibility is actually not a primary driving factor for this - it's a reserved area of extensibility that browsers can iterate on and users can use without risk of breaking websites. I specifically debunked most of the SPA issues and other related hash (ab)uses in #7 as long as they drop some simple code to adapt. I also noted another motivating factor in a parenthetical in the original comment (bold emphasis not in original):

[...] (I see potential use in both technical and non-technical circles, so accessibility to non-technical people is a concern of mine.)

My guess would be that if an SPA router is broken by appending '&targetText' it'll also be broken by '%%...'.

Routers are already broken when you use %% anywhere in an eventually-parsed URL, because of the simple fact it's invalid. This proposal would only change that by just slicing that bit off and exposing a likely-valid URL to the router instead.

In general, I view the existing exceptions thrown in the cases of %% and ## (especially %%) to be expected, and in this case, I'm not sure throwing exceptions in browsers lacking support should necessarily block implementing the proposal, since supporting browsers should just trim off the otherwise invalid bits, requiring no code changes on the part of the apps themselves. JS routers could initially just UA-sniff or trim everything from the first occurrence of %% to the end of the URL during any transition period with partial support or if IE also needs to be supported. And a sham polyfill could just define var location = ... that strips double-percent signs from location.href on read.
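The "trim everything from the first occurrence of %%" workaround could look roughly like the following. This is only a sketch of the transitional router-side trimming described above; `stripInstructionSuffix` is a made-up helper name, and a real router would call something like it before parsing the URL.

```typescript
// Transitional workaround sketch: drop everything from the first "%%" onward
// before handing the URL to a router. The helper name is hypothetical.
function stripInstructionSuffix(href: string): string {
  const idx = href.indexOf("%%");
  return idx === -1 ? href : href.slice(0, idx);
}

const routerInput = stripInstructionSuffix("https://example.com#!/route%%q=some%20text");
console.log(routerInput); // https://example.com#!/route
```

Note that an ordinary escape such as `%20` is untouched, since only a doubled percent sign marks the start of the trimmed section.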

@Maxim-Mazurok

I liked that part:

supporting browsers should just trim off the otherwise invalid bits, requiring no code changes on the part of the apps themselves

@bokand
Collaborator

bokand commented Feb 28, 2019

I think having the text-fragment available to the app is necessary if it's to work in apps that load content after the initial page load. For example, there would be no way to scroll to text that's on the second page of an infinite scroller. If the fragment is available, the page can parse it and perform the necessary load.

e.g. Wikipedia does this on mobile where the sections are all collapsed by default and the page opens the section specified in the fragment.

@dead-claudia
Author

@bokand Fair. So what about stripping it only from the hash/query/pathname, but always keeping it in location.href? Would that work better?

I will throw out there that any extension has potential to interfere with apps, whether it be #targetText=... confusing hash-based SPA routers and related or %%q=... in legacy browsers being mistakenly added to the pathname/query/hash.

@Maxim-Mazurok

Maxim-Mazurok commented Feb 28, 2019

Yep, I'm also concerned about that "potential". I mean, we make a very great assumption here, that there's no SPA that uses #targetText=... in its own routing. And I highly doubt that it's true. And I can't be wrong, because if there's no such SPA right now in the visible web (that we can check), there might be private websites on an intranet, or just hidden behind authentication and we never will be able to verify that. And even if there's really no SPA that such changes will break, I promise that I will write mine, just for a proof-of-concept :)
We can, obviously, check only major frameworks/websites that use SPA. But still, we'll have to check every major version of frameworks. And it'll never be enough to be completely safe because you never know about custom SPA routing implementations.

I'm thinking about some way to let website owners opt-in to text highlight feature. Probably, meta-tag.
Also, browsers should turn on the text highlight feature for any website that doesn't use any scripting language. And if JS is blocked by user-agent. In that case, there can't be any operational SPA on that page.

The reason why it will be popular is that it will be really good for SEO. So, think of it like the keywords meta tag, for example. It's not mandatory to have keywords, but it'll make search engines' lives easier. And enabling a "text-highlight" meta tag will make users' lives easier and will decrease the bounce rate on your website from search. And if it breaks your SPA, that's your fault.

@dead-claudia
Author

@Maxim-Mazurok

I'm thinking about some way to let website owners opt-in to text highlight feature.

That kind of defeats the whole point of the entire proposal of letting you link to arbitrary sections in content you might not necessarily control. It'd be nice (and mildly preferable) to allow sites to customize that jump to content, like what Wikipedia would do to enable jumping to content in a collapsed subsection on mobile.

@Maxim-Mazurok

@isiahmeadows I agree with you, it is not a great solution, that's why I haven't proposed it, just shared my thoughts. It's just that website owners probably should have control over the browser behavior in one way or another. We either should give the ability to enable this highlight behavior, or disable it. It depends on how you look at it. If you think that all websites should by default support it - then you might want to have <meta name="text-highlight" value="disable"> to disable default behavior. Or, if we don't want to risk breaking any SPAs, then we might ask website owners to add <meta name="text-highlight" value="enable"> when they want to declare that they want this feature enabled. This approach is far from perfect, but it might be very popular to the extent that all major websites (especially blogs/wikis) will support it, so a lot of internet users and websites will benefit from it while keeping SPAs safe. I hope that it makes sense, but I personally don't like this solution much. There should be a better way of keeping everyone happy :)

@bokand
Collaborator

bokand commented May 28, 2019

Looping back here, sorry for the long delay

@bokand Fair. So what about stripping it only from the hash/query/pathname, but always keeping it in location.href? Would that work better?

I think I'm coming around to something like %%. I've now seen a few examples of pages that break with hash fragments. e.g.: https://www.webmd.com/skin-problems-and-treatments/lice-treatment

Specifying any hash on this page causes it to load a blank article. I'll look into how feasible this is.

@bokand bokand changed the title Should this extend the URI syntax and be in terms of "processing instructions"? Avoid breaking pages that use the URL fragment for routing/state May 28, 2019
@bokand
Collaborator

bokand commented May 28, 2019

@bokand Fair. So what about stripping it only from the hash/query/pathname, but always keeping it in location.href? Would that work better?

Actually, I think a better solution to the point I brought up might be to implement something like what I mentioned in #2 - provide an explicit API that the browser can fill in using the %% text. Exposing the raw text of %% runs the risk of apps co-opting %% as they have the hash fragment and we're back in the same situation. I'd rather not be debating adding @@ in a few years 😛

@BigBlueHat

I'm personally more concerned about breaking the URL than breaking a few poorly written (or narrowly designed at least) SPAs. Introducing %% as a new magic delimiter clogs up lots of existing plumbing...

For instance using the example posted in #5

$ curl http://example.com%%targetText

Fails with "Could not resolve host: example.com%%targetText"

Granted, that might have been a typo. 😃 So...

$ curl http://example.com/%%targetText

This one fails with a 404 because it hits the server (as it should).

decodeURI() also chokes with URIError: malformed URI sequence due to the double %% being an invalid escape sequence.
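The decodeURI() failure is easy to reproduce. The snippet below is a small check, not part of the original comment; `failsToDecode` is a helper name invented for the illustration.

```typescript
// Returns true when decodeURI rejects the input with a URIError.
// "%%" is not a valid percent-escape, so decoding a URL containing it throws.
function failsToDecode(input: string): boolean {
  try {
    decodeURI(input);
    return false;
  } catch (err) {
    return err instanceof URIError;
  }
}

console.log(failsToDecode("http://example.com/%%targetText")); // true
console.log(failsToDecode("http://example.com/#targetText"));  // false
```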

Routing libraries can be updated more easily than browsers and certainly easier than ~30 years of URL usage and design.

SPAs will have to deal with new "clutter" in any client-side URL "magic," so perhaps it's best to sit alongside them rather than attempt to avoid each other entirely?

@bokand
Collaborator

bokand commented May 28, 2019

That's a good point, thanks, and I think that rules out any invalid sequence.

I wonder if we could use, e.g. ## instead? IIUC, that should be parsed as just a fragment. Existing tools wouldn't break, though they'd pass through the ## section which might still break an SPA like this but that's no worse than the status quo and tools that do support targetText (or future features) wouldn't interact with the page at all. This is all train-of-thought though, definitely needs more thought and investigation.

I also am wary of touching URIs since that feels like a much bigger problem. But we don't have a good sense of how large the SPA problem is and pages can break with trivial fragments. It does also feel like having a section of the URI reserved for the UA (to prevent apps from using it) would be a useful thing for this and future features.

@BigBlueHat

BigBlueHat commented May 28, 2019

It does also feel like having a section of the URI reserved for the UA (to prevent apps from using it) would be a useful thing for this and future features.

This is essentially what the # itself has always provided, but now ~25 years later, we have two clients (browser and web app) wrangling for a single URI space.

Consequently, perhaps the XPointer style fragments (proposed in #18) could be removed (or put into the location somewhere other than location.hash) from the URI by the "first" client (i.e. the browser) before it's sent to JS. That would avoid the WebMD issue altogether, but still provide sensible fallback patterns for framework and JS developers in the meantime, or while using other browsers, or when using these URL's in a space other than the browser.

That idea probably needs its own proposal, though... 😃

@bokand
Collaborator

bokand commented Jun 27, 2019

I was toying with the idea of proposing having ## perform some magic on URL objects such that, given a URL https://example.com/#hash##inst:

url.toString() or url.href would output https://example.com/#hash##inst but
url.hash would return #hash. url.hash = 'changed' would set the url to https://example.com/#changed##inst. Ditto for window.location.

Effectively, we would change the interpretation of hash (in both reading and writing) to be end-delimited by either the end of the string or the ## token.
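The reading and writing behavior described above can be sketched with plain string functions. This is an illustration under assumptions, not how any browser implements `URL`: the helper names (`splitHref`, `joinHref`, `readHash`, `writeHash`) are hypothetical, and the sketch assumes the first `##` in the string starts the reserved section.

```typescript
// Sketch of a hash that is end-delimited by "##". All names are hypothetical.
interface SplitHref {
  base: string;             // everything before the first "#"
  fragment: string;         // fragment without the leading "#" and without the "##" part
  directive: string | null; // text after "##", or null when there is none
}

function splitHref(href: string): SplitHref {
  const dirIdx = href.indexOf("##");
  const main = dirIdx === -1 ? href : href.slice(0, dirIdx);
  const directive = dirIdx === -1 ? null : href.slice(dirIdx + 2);
  const hashIdx = main.indexOf("#");
  return {
    base: hashIdx === -1 ? main : main.slice(0, hashIdx),
    fragment: hashIdx === -1 ? "" : main.slice(hashIdx + 1),
    directive: directive,
  };
}

function joinHref(parts: SplitHref): string {
  const hash = parts.fragment === "" ? "" : "#" + parts.fragment;
  const suffix = parts.directive === null ? "" : "##" + parts.directive;
  return parts.base + hash + suffix;
}

// Analogue of reading url.hash: the "##" section is hidden.
function readHash(href: string): string {
  const parts = splitHref(href);
  return parts.fragment === "" ? "" : "#" + parts.fragment;
}

// Analogue of assigning url.hash: the "##" section is preserved.
function writeHash(href: string, value: string): string {
  const parts = splitHref(href);
  parts.fragment = value.charAt(0) === "#" ? value.slice(1) : value;
  return joinHref(parts);
}

console.log(readHash("https://example.com/#hash##inst"));             // #hash
console.log(writeHash("https://example.com/#hash##inst", "changed")); // https://example.com/#changed##inst
```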

However, this assumes apps don't do their own hash parsing from the full URL and adds a bunch of icky magic to URL handling. A far simpler solution would be to specify that after processing a targetText directive, the UA would change the URL to strip it out. e.g.:

On navigating to https://example.com/#hash##targetText=test, the UA would process the targetText and change the document's URL to https://example.com/#hash. We'd want to avoid modifying the session history in this case. I think we'd also want to avoid firing hashchange as well (moot at this point as we only process targetText on full navigations for now; this would happen before author JS has a chance to run).
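As a rough model of that stripping step, the function below separates what the UA would consume from what page script would later observe as the document URL. It is a sketch only: `stripDirectiveOnNavigation` and `NavigationResult` are invented names, and the session-history and hashchange aspects mentioned above are not modeled.

```typescript
// Model of "process the directive, then expose the stripped URL to the page".
// Names are hypothetical; this is not a browser API.
interface NavigationResult {
  documentUrl: string;       // what page script would observe
  directive: string | null;  // what the UA consumed, e.g. "targetText=test"
}

function stripDirectiveOnNavigation(href: string): NavigationResult {
  const idx = href.indexOf("##");
  if (idx === -1) {
    return { documentUrl: href, directive: null };
  }
  return { documentUrl: href.slice(0, idx), directive: href.slice(idx + 2) };
}

const result = stripDirectiveOnNavigation("https://example.com/#hash##targetText=test");
console.log(result.documentUrl); // https://example.com/#hash
console.log(result.directive);   // targetText=test
```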

The major drawback is the ## part of the hash wouldn't appear in the URL bar once navigated. This would make it a little harder to share the exact link by copy-paste. Though, I think this would be fixable in the UI-layer of a UA. A page refresh would also reload at the page top but I think that's probably ok?

WDYT? Is there anything here I'm missing?

@Maxim-Mazurok

Yes, I like this idea. It won't affect any existing apps if I understand correctly.

@iherman

iherman commented Jun 27, 2019

I am not sure I understand the proposal right, @bokand, but the way I read your proposal is to have a URL string that has several '#' characters in it. However, the URL spec disallows this: a valid URL can have one '#' character to separate the 'url-fragment-string', which consists of code points that do not include the '#' characters.

There are a number of URL libraries for different languages that parse URL strings that, I presume, rely on this and that raise errors if there are several '#'-s in the URL string. They may all go wrong, and updating all those (as well as updating the URL spec) seems to be a major uphill battle...

@dead-claudia
Author

dead-claudia commented Jun 27, 2019

@bokand Most client-side routers IIUC do do full parsing of the URL, by necessity. This is especially necessary when query string parameters get involved.

I do feel this compromise wouldn't break existing routers and wouldn't require updating URL specs:

  • Add a special ## suffix (or any invalid suffix) that's split off and saved separately when setting the location.href.
  • Add a new location.target (or some better-named property) that contains that split-off chunk, minus the ##, as a parsed, read-only URLSearchParams instance, for routers to potentially parse and make use of later, in case they need to delay the normal search jump.
  • The browser can just display location.href + "##" + location.target.toString() instead of just location.href, so the user can still see the modified href. Apps wouldn't see it directly, but users would.
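The three bullets above can be modeled as follows. This is a sketch under assumptions: `location.target` does not exist, so `SplitLocation`, `splitTarget`, and `displayUrl` are stand-in names, and `##` is used as the suffix marker only because the bullets do.

```typescript
// Stand-in for the suggested split between location.href and a
// hypothetical read-only location.target. None of these names are real APIs.
interface SplitLocation {
  href: string;            // what apps would read from location.href
  target: URLSearchParams; // the split-off chunk, parsed, without the "##"
}

function splitTarget(fullHref: string): SplitLocation {
  const idx = fullHref.indexOf("##");
  const href = idx === -1 ? fullHref : fullHref.slice(0, idx);
  const raw = idx === -1 ? "" : fullHref.slice(idx + 2);
  return { href: href, target: new URLSearchParams(raw) };
}

// What the browser UI could show: href + "##" + target.toString().
function displayUrl(loc: SplitLocation): string {
  const suffix = loc.target.toString();
  return suffix === "" ? loc.href : loc.href + "##" + suffix;
}

const parsed = splitTarget("https://example.com/#!/route##q=some%20text&n=2");
console.log(parsed.href);            // https://example.com/#!/route
console.log(parsed.target.get("q")); // some text
console.log(parsed.target.get("n")); // 2
```

One consequence of round-tripping through `URLSearchParams` is that the displayed string may be re-serialized (for example, spaces come back as `+`), so it is not guaranteed to match the original suffix byte for byte.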

Edit: any suffix string should work.

@dead-claudia
Author

dead-claudia commented Jun 27, 2019

@iherman location.href doesn't check for (or care about) URL validity. I know this because I maintain a router that treats ?foo=1&bar=2&baz=3&qux=4 equivalently to ?foo=1&bar=2#baz=3&qux=4 and #foo=1&bar=2&baz=3&qux=4. Edit: And you can do this even when the router prefix is #!. It does actually work this way and I wouldn't be surprised if people rely on it.
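For readers unfamiliar with that kind of lenient parsing, here is one way a router could end up treating those three forms the same. It is purely illustrative and is not the code of the router mentioned above; `collectParams` is a hypothetical helper that treats `?`, `#`, and `&` all as separators.

```typescript
// Illustrative lenient parameter collection: "?", "#" and "&" all separate
// key=value pairs, so query and hash parameters end up in one bag.
function collectParams(urlTail: string): Record<string, string> {
  const params: Record<string, string> = {};
  for (const pair of urlTail.split(/[?#&]/)) {
    if (pair === "") continue;
    const eq = pair.indexOf("=");
    const key = eq === -1 ? pair : pair.slice(0, eq);
    const value = eq === -1 ? "" : pair.slice(eq + 1);
    params[decodeURIComponent(key)] = decodeURIComponent(value);
  }
  return params;
}

const queryOnly = collectParams("?foo=1&bar=2&baz=3&qux=4");
const mixed = collectParams("?foo=1&bar=2#baz=3&qux=4");
const hashOnly = collectParams("#foo=1&bar=2&baz=3&qux=4");
console.log(JSON.stringify(queryOnly) === JSON.stringify(mixed));    // true
console.log(JSON.stringify(queryOnly) === JSON.stringify(hashOnly)); // true
```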

@iherman

iherman commented Jun 27, 2019

Well... are we sure about the URL libraries in other environments like Python, Java, Rust, you-name-it? Would we break any code if I did that?

@bokand
Collaborator

bokand commented Jun 27, 2019

Yeah, I noticed just after sending that '#' isn't a valid code point. I agree we wouldn't want to introduce an invalid format since that could break existing libraries - https://indieweb.org/fragmention ran into this exact problem using ##.

I think the core of the idea of stripping the directive is valid though, so long as we could find some valid and web compatible delimiter. That'd require some data gathering which will take time but we can do.

@dead-claudia
Author

@iherman The concern is client-side, not server-side. Server-side routers never see the hash anyways – browsers never send it to them – and if they do encounter one erroneously, most just ignore it or reject the request as malformed, assuming it doesn't itself get dropped somewhere in the middle to save bandwidth.

@bokand
Collaborator

bokand commented Jun 27, 2019

I think it'd be bad to break client-side as well. The client might not see our special fragment in its own document, but links on the page would, e.g. <a href="example.com##targetText">.

@dead-claudia
Author

@bokand I wasn't disagreeing with you, just stating we aren't at high risk of breaking very many servers, especially servers that aren't interpreting <a href> links (the most common case by far).

@iherman

iherman commented Jun 27, 2019

@isiahmeadows we are talking about creating new kinds of URL-s, which may be used as identifiers regardless of whether they are used client-side or server-side. If one creates an annotation that is stored in an annotation server or database, those URL-s would be out in the wild, subject to processing by other tools.

@domenic

domenic commented Jun 28, 2019

The major drawback is the ## part of the hash wouldn't appear in the URL bar once navigated. This would make it a little harder to share the exact link by copy-paste. Though, I think this would be fixable in the UI-layer of a UA.

Right, I don't think the URL bar is constrained in this way. It could contain the ##, even if the JavaScript-exposed URL was missing it.

@tilgovi

tilgovi commented Jun 30, 2019

There's a great appeal to this being an invalid URL. It means these new URLs might break some software, but it also means these URLs will not collide with existing ones.

@bokand
Collaborator

bokand commented Jul 4, 2019

@kevinmarks - https://indieweb.org/fragmention says:

The first draft of this used double hash anchors ## and escaped the spaces in the fragment with + signs. Experimentation shows that this causes problems with URL parsing in some cases as double-hash is an invalid URL...

Could you elaborate on your experience here? Are there specific tools you found broke down? How did they break?

@kevinmarks

I'd have to dig through issues, but we found that some libraries would throw an exception or truncate the url, particularly when it was in a plain text string - irc clients were one example. I think there was one that exited on the parse error.
There were other tools that had url extracting regex to highlight it that didn't work in some ways.
Others would escape the 2nd # so it became #%23 which then confused downstream parsers.

@kevinmarks

Part of the value of urls is that they can pass through intermediate text and still be useful, so I disagree with @tilgovi that breaking them on purpose is a good idea.

@nickburris
Collaborator

As @bokand mentioned on Chromium bug 961440, we added metrics for URL fragments that contain an additional #, and it's actually surprisingly high at 0.3% of page loads.

We're trying the double-hash syntax on Chromium in M77 (feature still behind a flag/canary-dev experiment/origin trial), while still supporting the original syntax. To summarize, our current idea with the double-hash is that we append ##targetText=example to the existing fragment, if any (e.g. #pagestate##targetText=example) and strip it from the fragment after processing, so the page only sees #pagestate (or an empty hash that it can then use for state like WebMD mentioned above) and behaves normally. I'll update the explainer with our current ideas on this as well.

@kevinmarks

FWIW, twitter's latest release has broken the old hashbang links like http://twitter.com/#!/kevinrose/status/89578599098744832

nickburris added a commit to nickburris/ScrollToTextFragment that referenced this issue Aug 7, 2019
Add a section on alternative syntax per issue WICG#15.
@bokand
Collaborator

bokand commented Sep 23, 2019

Just to update, I think there's enough risk with ## and cautionary feedback that we'd be better off finding a delimiter that doesn't fail validation on current URL parsers.

I've done some digging over a sample of all URLs seen by Google crawlers over the last 5 years with some candidate delimiters. We're going to update the proposal to use :~: as the delimiter, which didn't appear in any URLs. This should strike a balance between web compat and URL compat.

Example:

https://example.org#fragment:~:text=foo,bar

Of course, we still want it to be part of the fragment for non-implementing UAs so in the absence of an element-id fragment we must still include a #:

https://example.org#:~:text=foo,bar
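For illustration, splitting the two example URLs above on the `:~:` delimiter could be sketched like this. The helper and type names (`splitFragmentDirective`, `SplitFragment`) are assumptions made for the example, and the directive is returned as raw text since its internal grammar is not spelled out in this thread.

```typescript
// Sketch: separate the ordinary fragment from the part after ":~:".
// Names are hypothetical; the directive is left unparsed.
const FRAGMENT_DIRECTIVE_DELIMITER = ":~:";

interface SplitFragment {
  fragment: string;          // element-id style fragment, without "#"
  directive: string | null;  // raw text after ":~:", or null when absent
}

function splitFragmentDirective(href: string): SplitFragment {
  const hashIdx = href.indexOf("#");
  if (hashIdx === -1) {
    return { fragment: "", directive: null };
  }
  const rawFragment = href.slice(hashIdx + 1);
  const delimIdx = rawFragment.indexOf(FRAGMENT_DIRECTIVE_DELIMITER);
  if (delimIdx === -1) {
    return { fragment: rawFragment, directive: null };
  }
  return {
    fragment: rawFragment.slice(0, delimIdx),
    directive: rawFragment.slice(delimIdx + FRAGMENT_DIRECTIVE_DELIMITER.length),
  };
}

const withId = splitFragmentDirective("https://example.org#fragment:~:text=foo,bar");
const withoutId = splitFragmentDirective("https://example.org#:~:text=foo,bar");
console.log(withId.fragment);     // fragment
console.log(withId.directive);    // text=foo,bar
console.log(withoutId.fragment);  // (empty string)
console.log(withoutId.directive); // text=foo,bar
```

Since `:~:` sits inside the fragment, a non-implementing UA or an existing URL parser would simply see one longer fragment rather than an invalid URL.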

@tilgovi

tilgovi commented Sep 23, 2019

Related to my comment in #25, I actually really like that this does not start with #.

@bokand
Collaborator

bokand commented Oct 9, 2019

I think this issue has been sufficiently addressed in our introduction of the fragment directive and the :~: delimiter.
