Inverse regex #1

Closed
jayvdb opened this issue Feb 13, 2020 · 5 comments

jayvdb commented Feb 13, 2020

It appears that your inverse regex from https://www.mail-archive.com/python-list@python.org/msg125198.html isn't in this repo, or anywhere else. However, it is mentioned in https://stackoverflow.com/questions/17518554/how-to-reverse-a-regex-in-python/49389042 and there are a few unattributed copies on GitHub.

IMO it is a very useful solution, and warrants being its own Python library on PyPI for easy adoption.

https://github.com/pyparsing/pyparsing/blob/master/examples/invRegex.py and https://pypi.org/project/er/ exist, and a few other data-fuzzing tools are available, but few of them efficiently generate every permutation of the regex. The old code is py2 only, but the changes needed to support py3 are quite small.
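
To make "full permutations of the regex" concrete, here is a minimal sketch of the idea. It is not the code from the mailing-list post. It assumes CPython's internal regex parser (`sre_parse`, or the private `re._parser` on Python 3.11+), whose parse-tree layout is an implementation detail and may change. The names `expand`, `expand_token`, and `all_matches` are hypothetical, and only literals, simple character sets, alternation, bounded repeats, and groups are handled.

```python
import itertools

try:  # Python 3.11+: the parser lives under the re package (private API)
    from re import _constants as sre_constants, _parser as sre_parse
except ImportError:  # older versions expose top-level modules instead
    import sre_constants
    import sre_parse


def expand(tokens):
    """Yield every string matched by a sequence of parsed (op, arg) tokens."""
    choices = [list(expand_token(op, arg)) for op, arg in tokens]
    for parts in itertools.product(*choices):
        yield "".join(parts)


def expand_token(op, arg):
    """Yield every string matched by a single parsed token."""
    if op == sre_constants.LITERAL:
        yield chr(arg)
    elif op == sre_constants.IN:
        for item_op, item_arg in arg:
            if item_op == sre_constants.LITERAL:
                yield chr(item_arg)
            elif item_op == sre_constants.RANGE:
                low, high = item_arg
                for code in range(low, high + 1):
                    yield chr(code)
            else:
                # Negated sets, \d, \w, etc. are left out of this sketch.
                raise NotImplementedError(str(item_op))
    elif op == sre_constants.BRANCH:
        _, alternatives = arg
        for alternative in alternatives:
            yield from expand(alternative)
    elif op == sre_constants.MAX_REPEAT:
        low, high, item = arg
        if high == sre_constants.MAXREPEAT:
            raise ValueError("unbounded repeat has infinitely many matches")
        options = list(expand(item))
        for count in range(low, high + 1):
            for parts in itertools.product(options, repeat=count):
                yield "".join(parts)
    elif op == sre_constants.SUBPATTERN:
        # The nested pattern is the last element of the argument tuple.
        yield from expand(arg[-1])
    else:
        raise NotImplementedError(str(op))


def all_matches(pattern):
    """Return every string that fully matches the (finite) pattern."""
    return list(expand(sre_parse.parse(pattern)))
```

For example, `all_matches("a[bc]d?")` returns `["ab", "abd", "ac", "acd"]`. Tools that produce a single match would instead pick one option at each step rather than taking the full product.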

Author

jayvdb commented Feb 13, 2020

My current patch for py3 is:

--- regex_inverter.py   2020-02-13 17:25:40.000000000 +0700
+++ regex_inverter.py  2020-02-13 22:46:59.652203153 +0700
@@ -5,10 +5,11 @@
 import sre_parse
 import string
 
+# Note string.ascii_letters is not the same as \w which is unicode by default
 category_chars = {
     CATEGORY_DIGIT: string.digits,
     CATEGORY_SPACE: string.whitespace,
-    CATEGORY_WORD: string.digits + string.letters + '_'
+    CATEGORY_WORD: string.digits + string.ascii_letters + '_',
 }
 
 
@@ -26,7 +27,8 @@
     return string.printable
 
 
-def handle_branch((tok, val)):
+def handle_branch(val):
+    tok, val = val
     all_opts = []
     for toks in val:
         opts = permute_toks(toks)
@@ -49,10 +51,11 @@
     return [chr(val)]
 
 
-def handle_max_repeat((min, max, val)):
+def handle_max_repeat(val):
     """
     Handle a repeat token such as {x,y} or ?.
     """
+    (min, max, val) = val
     subtok, subval = val[0]
 
     if max > 5000:
@@ -76,7 +79,7 @@
 
 
 def handle_subpattern(val):
-    return list(permute_toks(val[1]))
+    return list(permute_toks(val[3]))
 
 
 def handle_tok(tok, val):
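
Two of the changes in the patch follow from general Python 3 removals: tuple parameters in function signatures (removed by PEP 3113) and `string.letters`. A minimal sketch of both, using the hypothetical names `describe_repeat` and `WORD_CHARS`, which are not part of the patched file:

```python
import string


def describe_repeat(val):
    # Python 2 allowed: def describe_repeat((low, high, item)):
    # Python 3 requires unpacking inside the function body instead.
    low, high, item = val
    return "{0} repeated {1}..{2} times".format(item, low, high)


# string.letters no longer exists in Python 3; string.ascii_letters is the
# ASCII-only replacement, so this covers fewer characters than a Unicode \w.
WORD_CHARS = string.digits + string.ascii_letters + "_"
```

The `val[1]` to `val[3]` change in `handle_subpattern` is different in kind: it tracks the layout of the parser's internal subpattern tuple, which is an implementation detail of `sre_parse`.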

Author

jayvdb commented Feb 14, 2020

I see sre-yield also tries to do this, while input-generator, randre, xeger, and rstr each produce a single match.

Owner

bjourne commented Feb 15, 2020

Thank you. I'll take a look at this shortly

Owner

bjourne commented Mar 23, 2020

Thank you for the suggestions! I've reimplemented the code and added it to my repository. I don't have time to maintain it as a Python package, but if you or anyone else needs it, feel free to copy the code.

bjourne closed this as completed Mar 23, 2020

Author

jayvdb commented Mar 23, 2020

Thanks @bjourne
