Escape '#' except when fragment identifier #631

Merged: jeremypw merged 9 commits into master from fix-escape-links on Mar 2, 2023

@jeremypw (Collaborator) commented Oct 17, 2021

Fixes #625
Rather than allowing all '#' characters in URLs, this only allows those after the final '/', which might be fragment identifiers.

Open to suggestions for a more elegant method of doing this.

This PR fixes the URL given in the issue report. Suggestions for other corner cases to test welcome.

It would be good to add a test framework so that CI can test this (and maybe other functions), but that is left for another PR.
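
For readers following along, the rule described above can be sketched in a few lines of Python. This is only an illustration, not the code in this PR: the function name `escape_hash_except_after_last_slash` is hypothetical, and it assumes that "escaping" means percent-encoding '#' as `%23`.

```python
def escape_hash_except_after_last_slash(url: str) -> str:
    """Percent-encode every '#' that appears before the final '/'.

    Hypothetical helper for illustration only. A '#' after the final '/'
    is left alone because it might start a fragment identifier.
    """
    # rpartition splits at the LAST '/'; if there is no '/', head and sep
    # are empty and the whole string is treated as the tail.
    head, sep, tail = url.rpartition("/")
    return head.replace("#", "%23") + sep + tail


if __name__ == "__main__":
    print(escape_hash_except_after_last_slash("https://example.com/a#b/c#frag"))
```

A stricter version would probably also escape any second '#' inside the tail, since a URL has at most one fragment; that is left out here to keep the sketch close to the description.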

@cassidyjames (Contributor) left a comment:

URLs can end without a trailing slash and still have a # fragment; for example, the link https://github.com/nodesource/distributions/blob/master/README.md#debian-and-ubuntu-based-distributions is perfectly valid but would not be escaped properly here.
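
As a quick check of the example URL, Python's standard `urllib.parse.urlsplit` can show where the fragment sits. This is a sketch for illustration only; the helper name `describe` is hypothetical and is not part of this project.

```python
from urllib.parse import urlsplit

URL = (
    "https://github.com/nodesource/distributions/blob/master/"
    "README.md#debian-and-ubuntu-based-distributions"
)


def describe(url: str) -> dict:
    """Return the path and fragment of a URL (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "path": parts.path,
        "fragment": parts.fragment,
        # True only when the path itself ends with '/'
        "path_has_trailing_slash": parts.path.endswith("/"),
    }


if __name__ == "__main__":
    for key, value in describe(URL).items():
        print(f"{key}: {value}")
```

The path ends in a file name rather than a slash, and the fragment follows it directly, so any rule for keeping '#' has to handle that layout.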

@jeremypw (Collaborator, Author) commented:

@cassidyjames Hmm, OK thanks - I'll have to think again 🤔

@jeremypw jeremypw marked this pull request as draft January 21, 2022 10:55
@jeremypw (Collaborator, Author) commented:

These rules from w3.org may help:

2.6.4 URL manipulation and creation
To fragment-escape a string input, a user agent must run the following steps:

Let input be the string to be escaped.

Let position point at the first character of input.

Let output be an empty string.

Loop: If position is past the end of input, then jump to the step labeled end.

If the character in input pointed to by position is in the range U+0000 to U+0020 or is one of the following characters:

U+0022 QUOTATION MARK character (")
U+0023 NUMBER SIGN character (#)
U+0025 PERCENT SIGN character (%)
U+003C LESS-THAN SIGN character (<)
U+003E GREATER-THAN SIGN character (>)
U+005B LEFT SQUARE BRACKET character ([)
U+005C REVERSE SOLIDUS character (\)
U+005D RIGHT SQUARE BRACKET character (])
U+005E CIRCUMFLEX ACCENT character (^)
U+007B LEFT CURLY BRACKET character ({)
U+007C VERTICAL LINE character (|)
U+007D RIGHT CURLY BRACKET character (})
...then append the percent-encoded form of the character to output. [RFC3986]

Otherwise, append the character itself to output.

This escapes any ASCII characters that are not valid in the URI <fragment> production without being escaped.

Advance position to the next character in input.

Return to the step labeled loop.

End: Return output.
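
The quoted steps translate fairly directly into code. Below is a minimal Python sketch, not the implementation merged in this PR; the names `ESCAPED_CHARS` and `fragment_escape` are made up for illustration. Note that the steps escape '#' itself, so the input is assumed to be the fragment text only (the part after the '#' that introduces it), not a whole URL.

```python
# Characters listed in the quoted steps, in addition to U+0000..U+0020.
ESCAPED_CHARS = frozenset('"#%<>[\\]^{|}')


def fragment_escape(text: str) -> str:
    """Percent-encode the characters named in the quoted steps.

    Hypothetical helper: `text` is assumed to be a fragment, not a full URL.
    Characters outside the listed set (including non-ASCII) pass through
    unchanged, as in the quoted steps.
    """
    output = []
    for ch in text:  # "position" advances one character at a time
        if ord(ch) <= 0x20 or ch in ESCAPED_CHARS:
            # Every character matched here is ASCII, so one %XX triplet suffices.
            output.append(f"%{ord(ch):02X}")
        else:
            output.append(ch)
    return "".join(output)


if __name__ == "__main__":
    print(fragment_escape("debian and ubuntu#based"))
```

For a full URL, one option would be to split at the first '#' and apply this only to what follows, leaving the rest of the URL to whatever escaping it already receives.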

@jeremypw jeremypw marked this pull request as ready for review February 22, 2023 12:29
@jeremypw jeremypw requested a review from a team February 22, 2023 12:33
@jeremypw jeremypw merged commit 4ee7da6 into master Mar 2, 2023
@jeremypw jeremypw deleted the fix-escape-links branch March 2, 2023 12:15
Linked issue that this pull request may close: Some characters are incorrectly escaped in links in Terminal