Strip extraneous characters at end of URLs #5715

cwdavies · 2020-05-07T19:35:43Z

Update handle_404_error to remove the last character from the URL if that character is in the extraneous_char_list

Additions

test_handle_404_error_strip_extraneous_chars

Removals

URLs that end in %20) have these two characters removed

Testing

tox -e unittest-current cfgov.tests.test_urls.HandleErrorTestCase

Checklist

PR has an informative and human-readable title
Changes are limited to a single goal (no scope creep)
Code can be automatically merged (no conflicts)
Code follows the standards laid out in the CFPB development guidelines
Passes all existing automated tests
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Reviewers requested with the Reviewers tool ➡️

* URLs that end in %20) have these two characters removed
* Add test_handle_404_error_strip_extraneous_chars

cfgov/cfgov/urls.py

willbarton · 2020-05-08T13:33:42Z

Do you think we could do this with a regular expression matching any of those characters multiple times at the end of a string and then using re.sub() on the request.path? That would generalize this to more than just the last 1-2 characters of the string.

Something like this (pseudocode, I haven't tried it):

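import re
# Match one or more trailing punctuation characters; the hyphen is escaped so it is a literal, not a range.
extraneous_char_re = re.compile(r'[!#$%&()*+,\-.:;<=>?@\[\]^_`{|}~]+$')
request.path = extraneous_char_re.sub('', request.path)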

Scotchester · 2020-05-08T14:47:34Z

I had a similar thought, Will. Or if we wanted to avoid regex, we could use a while loop that compares the final character against the extraneous characters and strips it until the final character is no longer one of them.

I was testing something to that effect yesterday afternoon, but ran into some weirdness with how browsers convert spaces to %20 that I couldn't resolve before the end of the day.
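A minimal sketch of the non-regex alternative described above, for illustration only. `EXTRANEOUS_CHARS` and `strip_trailing_chars` are hypothetical names, and the character set is an assumption rather than the list used in this PR. The path is percent-decoded first so that a trailing `%20` is treated as a space.

```python
from urllib.parse import unquote

# Hypothetical character set for illustration; not the list used in the PR.
EXTRANEOUS_CHARS = "!#$%&()*+,-.:;<=>?@[]^_`{|}~ "


def strip_trailing_chars(path):
    """Drop trailing extraneous characters one at a time."""
    # Decode percent-escapes first so a trailing "%20" is seen as a space.
    path = unquote(path)
    while path and path[-1] in EXTRANEOUS_CHARS:
        path = path[:-1]
    return path
```

The loop is equivalent to `unquote(path).rstrip(EXTRANEOUS_CHARS)`; it is written out here only to mirror the while-loop idea.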

willbarton · 2020-05-08T14:49:58Z

I haven't tested, but I'd assume that re.sub is more performant than a while loop over the path.

Scotchester · 2020-05-08T15:22:28Z

Probably true.
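The performance assumption could be checked with something like the sketch below. The function names, character set, and sample path are all assumptions made for illustration, and no timing result is claimed here.

```python
import re
import timeit

# Hypothetical character set; the real list in urls.py may differ.
CHARS = "!#$%&()*+,-.:;<=>?@[]^_`{|}~"
TRAILING_RE = re.compile("[" + re.escape(CHARS) + "]+$")


def strip_with_regex(path):
    """Remove trailing extraneous characters with a compiled pattern."""
    return TRAILING_RE.sub("", path)


def strip_with_loop(path):
    """Remove trailing extraneous characters one at a time."""
    while path and path[-1] in CHARS:
        path = path[:-1]
    return path


if __name__ == "__main__":
    sample = "/consumer-tools/mortgages/" + ")." * 5
    for func in (strip_with_regex, strip_with_loop):
        seconds = timeit.timeit(lambda: func(sample), number=100_000)
        print(func.__name__, seconds)
```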

cfgov/cfgov/urls.py

New logic goes as follows:

1. Lowercase the path.
2. Check for and remove extraneous characters at the end of the path.
   - List of extraneous characters now includes curly quotes, fancy dashes, and ellipses.
3. If the path has changed, try resolving the path.
   1. If it resolves, redirect to it.
   2. If it doesn't, return a 404 for the original path.
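The steps above can be sketched roughly as follows. This is not the code in urls.py: `handle_not_found`, the `resolves` callable (a stand-in for a URL resolver), and the exact pattern are assumptions made for illustration, and the function returns a tuple instead of an HTTP response so that it runs without Django.

```python
import re

# Assumed pattern; the real extraneous_char_re may list different characters.
EXTRANEOUS_CHAR_RE = re.compile(
    r"[!#$%&()*+,\-.:;<=>?@\[\]^_`{|}~\s"  # ASCII punctuation and whitespace
    "\u2018\u2019\u201c\u201d"  # curly single and double quotes
    "\u2013\u2014\u2026"  # en dash, em dash, ellipsis
    r"]+$"
)


def handle_not_found(path, resolves):
    """Return ("redirect", new_path) or ("404", original_path)."""
    # Steps 1 and 2: lowercase, then strip trailing extraneous characters.
    cleaned = EXTRANEOUS_CHAR_RE.sub("", path.lower())
    # Step 3: only try to resolve when the path actually changed.
    if cleaned != path and resolves(cleaned):
        return ("redirect", cleaned)
    return ("404", path)
```

For example, with `known = {"/about-us/"}`, calling `handle_not_found("/About-Us/ )", known.__contains__)` yields a redirect to `/about-us/`, while an unchanged path falls through to the 404.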

cwdavies

This refactoring of handle_404_error looks great as it redirects to a lowercase path and strips off extraneous characters from the end. A unit test should be added that has a mixed case URL with one or two characters from extraneous_char_re at the end of the URL.
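A test along those lines might look like the sketch below. It exercises a hypothetical `clean_path` helper rather than the real handle_404_error view, and the pattern is an assumed stand-in for extraneous_char_re.

```python
import re
import unittest

# Assumed stand-in for the pattern in urls.py; the real character list may differ.
EXTRANEOUS_CHAR_RE = re.compile(r"[!#$%&()*+,\-.:;<=>?@\[\]^_`{|}~\s]+$")


def clean_path(path):
    """Hypothetical helper: lowercase the path and strip trailing characters."""
    return EXTRANEOUS_CHAR_RE.sub("", path.lower())


class CleanPathTestCase(unittest.TestCase):
    def test_mixed_case_with_one_extraneous_char(self):
        self.assertEqual(clean_path("/Consumer-Tools/)"), "/consumer-tools/")

    def test_mixed_case_with_two_extraneous_chars(self):
        self.assertEqual(clean_path("/Consumer-Tools/ )"), "/consumer-tools/")

    def test_clean_path_is_unchanged(self):
        self.assertEqual(clean_path("/consumer-tools/"), "/consumer-tools/")
```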

cfgov/cfgov/urls.py

Co-authored-by: Andy Chosak <andy.chosak@cfpb.gov>

Discovered that `resolve` will always return a successful result because any URLs that don't match a standard pattern match the Wagtail fallback pattern. Didn't catch this before because a bug in the slash-appending logic was causing a slash to always be appended, which was "correctly" failing to resolve some URLs. @chosak and I agreed that it wasn't worth the complexity to test for both Django and Wagtail URLs (which would involve getting the current site and testing with one of its class methods), so falling back to just doing a redirect if any transformation of the URL occurred.
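To illustrate why a resolve check stops being informative once a catch-all pattern exists, here is a small framework-free sketch. `URL_PATTERNS`, `resolve_name`, and `redirect_target` are hypothetical and only mimic the situation described in this commit message; they are not Django or Wagtail APIs.

```python
import re

# Hypothetical URL patterns; the last entry mimics a catch-all fallback route.
URL_PATTERNS = [
    ("admin", re.compile(r"^/admin/$")),
    ("page-fallback", re.compile(r"^/(.*)$")),
]


def resolve_name(path):
    """Return the name of the first matching pattern, or None."""
    for name, pattern in URL_PATTERNS:
        if pattern.match(path):
            return name
    return None


def redirect_target(path, cleaned_path):
    """Simplified rule: redirect whenever cleaning changed the path."""
    return cleaned_path if cleaned_path != path else None
```

Because every path starting with a slash matches the fallback entry, `resolve_name` never returns `None` for such paths, which is why the simplified rule only compares the original and cleaned paths.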

Scotchester · 2020-05-11T19:56:16Z

@cwdavies and @willbarton Retreating to just doing a single redirect if the path changed at all. Reasons are detailed in the 8b495e0 commit message. Let me know if you have questions or concerns.

willbarton · 2020-05-12T13:04:26Z

@Scotchester nope, that sounds reasonable.

Scotchester · 2020-05-12T13:32:11Z

(Note: Still need to write more tests before merging.)

Scotchester · 2020-05-15T21:46:10Z

New commit pushed with what I think are adequate tests. Ready for final review, @cwdavies @chosak @willbarton @higs4281.

cwdavies

@Scotchester has provided really good unit tests for the new handle_404_error feature. They check one, two, and multiple extraneous characters at the end of the URL, as well as mixed-case URLs.

Requested change was made, and subsequently refactored away

Handle extraneous characters at end of URLs

12fa0d9

cwdavies requested review from willbarton, higs4281 and Scotchester May 7, 2020 19:35

higs4281 previously requested changes May 7, 2020

cwdavies and others added 3 commits May 8, 2020 12:03

Add regular expression support to handle_404_error

60d4c84

Merge branch 'master' into handler404

9025a08

Added space to extraneous_char_re in handle_404_error

42cdeeb

willbarton reviewed May 11, 2020

Added stripped_path in handle_404_error

7d9d907

cwdavies requested a review from higs4281 May 11, 2020 15:27

cwdavies and others added 2 commits May 11, 2020 11:29

Merge branch 'master' into handler404

3fb0aa7

cwdavies requested a review from willbarton May 11, 2020 16:40

cwdavies commented May 11, 2020

chosak reviewed May 11, 2020

Scott Cranfill and others added 2 commits May 11, 2020 13:28

Apply suggestions from code review

f842273

Co-authored-by: Andy Chosak <andy.chosak@cfpb.gov>

cwdavies added 5 commits May 12, 2020 11:20

Merge branch 'master' into handler404

c644cec

Merge branch 'master' into handler404

134e092

Merge branch 'master' into handler404

784bf1f

Merge branch 'master' into handler404

9ecaac4

Merge branch 'master' into handler404

625f184

cwdavies and others added 2 commits May 15, 2020 11:16

Merge branch 'master' into handler404

d14f751

Add more tests for handle_404_error

ef9658e

Merge branch 'master' into handler404

01b793b

cwdavies requested a review from chosak May 18, 2020 20:14

cwdavies commented May 18, 2020

willbarton approved these changes May 19, 2020

Scotchester merged commit 1470f78 into master May 19, 2020

Scotchester deleted the handler404 branch May 19, 2020 14:17
