Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strip extraneous characters at end of URLs #5715

Merged
merged 17 commits into from
May 19, 2020
Merged

Strip extraneous characters at end of URLs #5715

merged 17 commits into from
May 19, 2020

Conversation

cwdavies
Copy link
Contributor

@cwdavies cwdavies commented May 7, 2020

Update handle_404_error to remove the last character from the URL if that character is in the extraneous_char_list

Additions

  • test_handle_404_error_strip_extraneous_chars

Removals

  • URLs that end in %20) have these two characters removed

Testing

  1. tox -e unittest-current cfgov.tests.test_urls.HandleErrorTestCase

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows the standards laid out in the CFPB development guidelines
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Reviewers requested with the Reviewers tool ➡️

* URLs that end in %20) have these two characters removed
* Add test_handle_404_error_strip_extraneous_chars
cfgov/cfgov/urls.py Outdated Show resolved Hide resolved
@willbarton
Copy link
Member

willbarton commented May 8, 2020

Do you think we could do this with a regular expression matching any of those characters multiple times at the end of a string and then using re.sub() on the request.path? That would generalize this to more than just the last 1-2 characters of the string.

Something like this (pseudocode, I haven't tried it):

extraneous_char_re = re.compile(r'[!#$%&()*+,-.:;<=>?@\[\]^_`{|}~]+$')
request.path = re.sub(extraneous_char_re, '', request.path)

@Scotchester
Copy link
Contributor

Do you think we could do this with a regular expression matching any of those characters multiple times at the end of a string and then using re.sub() on the request.path? That would generalize this to more than just the last 1-2 characters of the string.

Something like this (pseudocode, I haven't tried it):

extraneous_char_re = re.compile(r'[!#$%&()*+,-.:;<=>?@\[\]^_`{|}~]+$')
request.path = re.sub(extraneous_char_re, '', request.path)

I had a similar thought, Will. Or if we wanted to avoid regex, we could while through basic string comparisons of the final character until that final character is not one of the extraneous characters.

I was testing something to that effect yesterday afternoon, but ran into some weirdness with how browsers convert spaces to %20 that I couldn't resolve before the end of the day.

@willbarton
Copy link
Member

I haven't tested, but I'd assume that re.sub is more performant than a while loop over the path.

@Scotchester
Copy link
Contributor

I haven't tested, but I'd assume that re.sub is more performant than a while loop over the path.

Probably true.

cfgov/cfgov/urls.py Outdated Show resolved Hide resolved
@cwdavies cwdavies requested a review from higs4281 May 11, 2020 15:27
cwdavies and others added 2 commits May 11, 2020 11:29
New logic goes as follows:

1. Lowercase the path.
2. Check for and remove extraneous characters at the end of the path.
   - List of extraneous characters now includes curly quotes, fancy 
dashes, and ellipses.
3. If the path has changed, try resolving the path.
   1. If it resolves, redirect to it.
   2. If it doesn't, return a 404 for the original path.
@cwdavies cwdavies requested a review from willbarton May 11, 2020 16:40
Copy link
Contributor Author

@cwdavies cwdavies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refactoring of handle_404_error looks great as it redirects to a lowercase path and strips off extraneous characters from the end. A unit test should be added that has a mixed case URL with one or two characters from extraneous_char_re at the end of the URL.

cfgov/cfgov/urls.py Outdated Show resolved Hide resolved
cfgov/cfgov/urls.py Outdated Show resolved Hide resolved
cfgov/cfgov/urls.py Outdated Show resolved Hide resolved
cfgov/cfgov/urls.py Outdated Show resolved Hide resolved
Scott Cranfill and others added 2 commits May 11, 2020 13:28
Co-authored-by: Andy Chosak <andy.chosak@cfpb.gov>
Discovered that `resolve` will always return a successful result because 
any URLs that don't match a standard pattern match the Wagtail fallback 
pattern. Didn't catch this before because a bug in the slash-appending 
logic was causing a slash to always be appended, which was "correctly" 
failing to resolve some URLs.

@chosak and I agreed that it wasn't worth the complexity to test for 
both Django and Wagtail URLs (which would involve getting the current 
site and testing with one of its class methods), so falling back to just 
doing a redirect if any transformation of the URL occurred.
@Scotchester
Copy link
Contributor

@cwdavies and @willbarton Retreating to just doing a single redirect if the path changed at all. Reasons are detailed in the 8b495e0 commit message. Let me know if you have questions or concerns.

@willbarton
Copy link
Member

@Scotchester nope, that sounds reasonable.

@Scotchester
Copy link
Contributor

(Note: Still need to write more tests before merging.)

@Scotchester
Copy link
Contributor

New commit pushed with what I think are adequate tests. Ready for final review, @cwdavies @chosak @willbarton @higs4281.

@cwdavies cwdavies requested a review from chosak May 18, 2020 20:14
Copy link
Contributor Author

@cwdavies cwdavies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Scotchester has provided really good unit tests that check one, two, and multiple extraneous characters at end of URL in addition to mixed case for the new handle_404_error feature.

@Scotchester Scotchester dismissed higs4281’s stale review May 19, 2020 14:01

Requested change was made, and subsequently refactored away

@Scotchester Scotchester merged commit 1470f78 into master May 19, 2020
@Scotchester Scotchester deleted the handler404 branch May 19, 2020 14:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants