Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for quantifiers and the shuffle operator in regular expressions #109

Closed
EduardoGoulart1 opened this issue Dec 2, 2022 · 18 comments

Comments

@EduardoGoulart1
Copy link
Contributor

EduardoGoulart1 commented Dec 2, 2022

Is there a specific reason why quantifiers for regular expressions are not supported? (so things like a{2,10}) If there is no specific reason, would you be open to a PR that implements it?

Also, what do you think of adding a special regex shuffle operator? This would be very handy to express languages like "all permutations of a,b,c". My proposal would be to use the ~ character and implement a separate parser for that

@EduardoGoulart1 EduardoGoulart1 changed the title Support for quantifiers Support for quantifiers and the shuffle operator in regular expressions Dec 2, 2022
@caleb531
Copy link
Owner

caleb531 commented Dec 2, 2022

@EduardoGoulart1 I don't recall a specific reason, but I am certainly open to any PRs you would be willing to submit!

@eliotwrobson would quantifiers (like Eduardo is describing) fit within the theoretical regular expressions this library implements? I only ask because it seems like Perl/PCRE regex and the automata-based regex syntaxes are not the same, otherwise we would be using the stdlib's re module in more places.

@EduardoGoulart1
Copy link
Contributor Author

EduardoGoulart1 commented Dec 2, 2022

@caleb531 the quantifiers would be only syntactic sugar to avoid concatenating the same expression repeatedly. So a^{n,m} would get translated to something like a..a_n(a_n+1)?...(a_m)?

The shuffle operator might require some rework on the parser. We would need to construct the NFAs and compute their products explicitly.

@eliotwrobson
Copy link
Collaborator

So I think quantifiers do fit (as they correspond to just concatenating a certain number of copies of an NFA with itself). I actually thought about adding this in before, but it causes some issues with the way that the default alphabet is set for regexes. If no objections to changing this though, I think it's very doable.

I also think that the shuffle operator is easy to do. I think this can actually be done with the same logic as the NFA shuffle_product function. I'm not sure about the ~ symbol, since this is usually used to denote compliment. In the literature, the Cyrillic sha ш is typically used (see https://link.springer.com/chapter/10.1007/978-3-031-19685-0_3#Sec2).

@EduardoGoulart1
Copy link
Contributor Author

@eliotwrobson but we still want the parser to be based on ASCII characters, right? I can look up for what other characters would better suit it

@eliotwrobson
Copy link
Collaborator

eliotwrobson commented Dec 2, 2022

@EduardoGoulart1 yes, sticking to ascii characters is best I think. Perhaps the @ symbol? I'm staring at my keyboard and unfortunately nothing seems to resemble the sha 😅 though we should definitely stick to something that's easy for users to input and doesn't require typing in anything special.

With respect to implementation, the shuffle product is pretty easy to integrate with the existing parser. However, the quantifiers requires some changes to what gets passed in to factory functions for the Token object. The token object will need to be passed the whole match object, not just the string that was matched on. I think this is straightforward to implement but requires changing the way some of the tests are written (which is why I didn't do it before).

@EduardoGoulart1
Copy link
Contributor Author

EduardoGoulart1 commented Dec 2, 2022

@eliotwrobson ok thank you, I'll give it a try. Just to understand the design choice: You opted for writing the Lexer following the structure of the tdparser. Was that just to simplify the implementation? I see pyparsing is added a dependency but not used

@eliotwrobson
Copy link
Collaborator

eliotwrobson commented Dec 2, 2022

@EduardoGoulart1 This was mainly as a guide to myself while writing. The parser here is actually a port of one in another project, and there we aren't allowed to add external dependencies, so everything had to be written from scratch. I didn't realize that pyparsing was a dependency, that may have been used as part of the old parser. Let me know if you have any issues or want some input.

EDIT: @caleb531 do you know why pyparsing is a dependency here? It doesn't look like it's getting used anywhere in the library right now.

@caleb531
Copy link
Owner

caleb531 commented Dec 2, 2022

@eliotwrobson pyparsing is a dependency of pandoc, which I used to compile my Markdown README to RST for PyPI (see c669875). I removed pandoc a while back since PyPI now has native Markdown support, but I guess I forgot to remove pyparsing as well.

@EduardoGoulart1 Looking forward to your PR! Please remember to add the appropriate unit tests and documentation (under docs/regular-expressions.md) as well.

@caleb531
Copy link
Owner

caleb531 commented Dec 7, 2022

@eliotwrobson @EduardoGoulart1 What's left to do before this issue can be closed? The regex shuffle operator is now in, don't quantifiers still need to be implemented?

@eliotwrobson
Copy link
Collaborator

@caleb531 Yeah, still missing quantifiers. Discussed a bit above, but that requires a bit of a refactor to the regex engine. The token class now needs to accept a match object instead of just the text that was matched (so that items from specific match groups can be read). This refactor isn't that bad in terms of the code changes that need to be made, but there are a lot of tests that need to be modified.

@EduardoGoulart1
Copy link
Contributor Author

@eliotwrobson I'd like to take over that one if it's not too urgent to close the ticket. It would be a good opportunity to get into the code a bit more...

@caleb531
Copy link
Owner

caleb531 commented Dec 7, 2022

@EduardoGoulart1 Nope, it's not urgent at all. Just wanted to confirm what was still outstanding. 🙂

@eliotwrobson
Copy link
Collaborator

@EduardoGoulart1 go for it! As I mentioned #115, this will be a good step towards that refactor (and I'm a bit occupied next week). Let me know if you want any input.

@EduardoGoulart1
Copy link
Contributor Author

EduardoGoulart1 commented Dec 20, 2022

Closed because turns out I'll be a bit busy the next days/weeks

@EduardoGoulart1 EduardoGoulart1 closed this as not planned Won't fix, can't repro, duplicate, stale Dec 20, 2022
@eliotwrobson
Copy link
Collaborator

@EduardoGoulart1 @caleb531 I can actually pick this up. I was going to ask about the status since I want to finish out some work for my side-projects by the end of the year. Should be good to reopen.

@caleb531 caleb531 reopened this Dec 20, 2022
@caleb531
Copy link
Owner

@eliotwrobson Very well! Issue reopened. 🙂

@EduardoGoulart1
Copy link
Contributor Author

@caleb531 we close this as both were implemented, right?

@caleb531
Copy link
Owner

caleb531 commented Apr 4, 2023

@EduardoGoulart1 Thank you for the nudge—yes, both quantifiers and the shuffle operator are merged, so we can close this issue now. cc @eliotwrobson

@caleb531 caleb531 closed this as completed Apr 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants