
optimize the awx.main.redact SCM URL sanitizer regex #6254

Merged

Conversation

ryanpetrello (Contributor)

No description provided.

@softwarefactory-project-zuul (Contributor)

Build failed.

ryanpetrello changed the title from "WIP: place awx.main.redact with something simpler (and more performant)" to "replace awx.main.redact with something simpler (and more performant)" Mar 11, 2020
ryanpetrello changed the title from "replace awx.main.redact with something simpler (and more performant)" to "optimize the awx.main.redact SCM URL sanitizer regex" Mar 11, 2020

ryanpetrello commented Mar 11, 2020

@chrismeyersfsu @AlanCoding this is very, very slow if the event_data is really large and contains no actual URLs:

>>> import time
>>> from awx.main.redact import UriCleaner
>>> def _x():
...     t = time.time()
...     UriCleaner.remove_sensitive('x' * 150000)
...     print(time.time() - t)
...
>>> _x()
105.74454760551453

We very recently started doing this filtering in the callback receiver to address a bug (though as far as I can tell, this has always been slow for large JSON blobs, and we've always paid this high cost in various API endpoints):

#5812

@softwarefactory-project-zuul (Contributor)

Build succeeded.

@@ -8,7 +8,7 @@

 class UriCleaner(object):
     REPLACE_STR = REPLACE_STR
-    SENSITIVE_URI_PATTERN = re.compile(r'(\w+:(\/?\/?)[^\s]+)', re.MULTILINE) # NOQA
+    SENSITIVE_URI_PATTERN = re.compile(r'((http|https|ssh):(\/?\/?)[^\s]+)', re.MULTILINE) # NOQA
@ryanpetrello (Contributor, Author) Mar 11, 2020

\w+ is too greedy for really large strings that don't contain URLs; if we're parsing basic auth out of URLs, and we're talking about project updates, there are only so many protocols we care about in practice.

Even worse is a really long string that happens to have a <word>: followed by lots of characters that aren't spaces.
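
To illustrate that second failure mode (a sketch, not code from this PR; OLD_PATTERN and the sample string are invented here):

import re

# The pre-patch pattern from awx.main.redact:
OLD_PATTERN = re.compile(r'(\w+:(\/?\/?)[^\s]+)', re.MULTILINE)

# Not a URL, but "<word>:" followed by non-space characters still matches,
# so the entire run would be treated as a sensitive URI:
blob = 'warning:' + 'y' * 40
print(OLD_PATTERN.search(blob).group(0))  # 'warning:yyyy...' -- the whole run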

Member

Can this also be git:?

@ryanpetrello (Contributor, Author)

https://www.git-scm.com/docs/git-clone#_git_urls_a_id_urls_a

From what I can tell, the only transports that actually allow username/pass as part of the netloc are http/s and ssh.
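
A quick sketch of the behavior at this point in the review (hostnames and credentials invented for illustration): the http/https/ssh alternation catches credential-style URLs but lets git:// through:

import re

NEW_PATTERN = re.compile(r'((http|https|ssh):(\/?\/?)[^\s]+)', re.MULTILINE)

print(NEW_PATTERN.search('https://user:pass@example.com/repo.git'))  # matches
print(NEW_PATTERN.search('ssh://user@example.com/repo.git'))         # matches
print(NEW_PATTERN.search('git://example.com/repo.git'))              # None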

@jakemcdermott (Contributor) Mar 11, 2020

Would accidentally using a username and password in a git netloc be a common enough mistake? If so, would not including it in this match pattern prevent it from being redacted from error logs, etc.?

@ryanpetrello (Contributor, Author)

Yea, that's a fair point - I guess somebody could put in something that doesn't work, like:

git://user:pass@host

I'll add it.
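
Sketched, that adjustment would look something like the following (hypothetical; the pattern is revised once more below after ghjm's suggestion):

import re

# Hypothetical intermediate form with git: added to the alternation:
SENSITIVE_URI_PATTERN = re.compile(r'((http|https|ssh|git):(\/?\/?)[^\s]+)', re.MULTILINE)  # NOQA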

@AlanCoding (Member)

How fast is _x() with this patch?

@ryanpetrello (Contributor, Author)

@AlanCoding the old pattern gets much slower the larger the target string is; with this patch the search time is negligible.

Before:

>>> timeit.timeit("import re; re.compile('(\w+:(\/?\/?)[^\s]+)', re.MULTILINE).search('x'*150000)", number=1)
94.14911386510357

After:

>>> timeit.timeit("import re; re.compile('((http|https|ssh):(\/?\/?)[^\s]+)', re.MULTILINE).search('x'*150000)", number=1)
0.0009962848853319883

def test_large_string_performance():
    length = 100000
    redacted = UriCleaner.remove_sensitive('x' * length)
    assert len(redacted) == length
Member

How does this really test anything? If it unreasonably takes 100 seconds, doesn't that just mean that the test runs for that long?

@ryanpetrello (Contributor, Author)

Fixed.
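
The fix itself isn't visible in this hunk; one plausible shape for it, purely a sketch with an assumed one-second budget rather than the merged code, is a time-bounded assertion:

import time

from awx.main.redact import UriCleaner


def test_large_string_performance():
    length = 100000
    start = time.time()
    redacted = UriCleaner.remove_sensitive('x' * length)
    # Fail fast rather than letting a regressed regex burn minutes:
    assert time.time() - start < 1.0
    assert len(redacted) == length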

@softwarefactory-project-zuul (Contributor)

Build succeeded.


ghjm commented Mar 11, 2020

I would suggest using (\w{1,20}:(\/?\/?)[^\s]+).

The reason the performance is bad is that \w+ is not limited in length, so it has to consider all possible lengths up to and including the length of the whole string. (What if the string was a million x's followed by ://foo - it would match.)

If you limit the length that you think a scheme could possibly be (and I'm just arbitrarily picking 20 here), you eliminate the performance problem while maintaining generality with regard to schemes.

>>> timeit.timeit("import re; re.compile('(\w{1,20}:(\/?\/?)[^\s]+)', re.MULTILINE).search('x'*150000)", number=1)
0.014931650017388165

@ryanpetrello (Contributor, Author)

@ghjm that's a great point, and I like it much better. Thanks for the input - I'll adjust this PR.
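
Sketched, the adjusted pattern would read as follows (the merged commit 0fd9153 is the authoritative version):

import re

# ghjm's bounded scheme length: the engine tries at most 20 candidate
# lengths per position instead of one per remaining character, and the
# pattern stays scheme-agnostic.
SENSITIVE_URI_PATTERN = re.compile(r'(\w{1,20}:(\/?\/?)[^\s]+)', re.MULTILINE)  # NOQA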

Commit: \w+ is too greedy for large strings that don't contain URLs
@softwarefactory-project-zuul (Contributor)

Build succeeded.

@softwarefactory-project-zuul (Contributor)

Build succeeded (gate pipeline).

@softwarefactory-project-zuul bot merged commit 0fd9153 into ansible:devel Mar 11, 2020