Rewrite the parser #52

Closed
aureliojargas opened this issue Nov 15, 2019 · 3 comments

@aureliojargas
Owner

aureliojargas commented Nov 15, 2019

(Note: this ticket and the comments are all written after the fact. I'm just writing it here to document my "new parser" quest...)

For the sedsed magic to work, it first needs to read and parse a sed script.

Many years ago I wrote a "homemade" sed script parser for sedsed. With no previous experience in writing a good parser, my idea of a "simple" parser was to always split the sed script on newlines and ; to detect the commands.

It worked for simple commands such as 5d; s/foo/bar/; 10q, but any command with a literal ; or newline was a challenge. For example, s/foo;/bar;/g would be broken into three pieces, and then the parser had to rejoin those pieces until it found a valid command. Very hacky and buggy.
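To make the failure mode concrete, here is a minimal sketch of that kind of context-free splitting. It is not the original sedsed code; the function name naive_split is hypothetical and only illustrates why a literal ; inside a command breaks the approach.

```python
import re


def naive_split(script):
    """Split a sed script on every newline and semicolon.

    Hypothetical helper: it ignores context, so a ';' inside a
    regex or replacement is treated as a command separator too.
    """
    pieces = re.split(r"[;\n]", script)
    # Drop surrounding whitespace and empty fragments.
    return [piece.strip() for piece in pieces if piece.strip()]


# Simple commands split cleanly into one piece per command.
simple = naive_split("5d; s/foo/bar/; 10q")
print(simple)  # ['5d', 's/foo/bar/', '10q']

# A single s command with literal semicolons falls apart.
broken = naive_split("s/foo;/bar;/g")
print(broken)  # ['s/foo', '/bar', '/g']
```

None of the three fragments in the second result is a valid command on its own, which is why the pieces had to be glued back together by trial and error.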

Most of the reported bugs in the issue tracker are related to the parser not handling corner cases, or failing to detect invalid code. I don't think patching it is sustainable. I need a real robust parser.

@aureliojargas
Owner Author

When thinking about the new parser, I decided not to reinvent the wheel a second time. I'm no expert in parsers, and I would have ended up creating a second buggy one. So the decision was to reuse something already available.

I got very excited when I found https://github.com/GillesArcas/PythonSed, an implementation of GNU sed in Python. I did not need a full sed, but I could use just its parser to fulfill sedsed's needs.

I forked the project and worked on adapting it for some weeks, because I needed to preserve the original sed script code (comments, blank lines, s flags in the order they were given, original regexes with no processing, etc.). I ended up writing and testing new code, but it proved hard to transform a parser (which only cares about identifying tokens and data) into a code "preserver".

@aureliojargas
Owner Author

aureliojargas commented Nov 16, 2019

My second attempt was looking into the original C code for GNU sed, to check if I could understand its parser and maybe get some ideas from it. While reading the code, I got the idea to convert it to Python. Why not? It is a decades-old battle-tested parser, and the code seemed to be simple to understand and adapt.

That was the start of a solid month of work (and fun!) every night before sleeping, to get an initial working version. You can see the loooong list of commits in my dev branch (the first one is from 24 Jun).

I kept working on it in the following months, fixing bugs and adding the extra features sedsed required. Only after the parser itself was passing an extensive test suite, and I was confident it was working as expected, did I start adapting sedsed to use it.

My original idea was to insert the new parser code into the existing sedsed.py script, since I like it being a "single file app". But that was a challenge, since I had decided that the new parser would strictly follow the original C code, with all its globals and algorithms mostly unchanged. They were two different worlds, so the next decision was to have a separate sedparse.py file only for the parser.

Later, I felt that this parser and its test suite deserved a dedicated repository. Maybe other projects could make use of it? So https://github.com/aureliojargas/sedparse was born and I kept working on it in isolation, as a standalone project.

After 5 months of work in total, counting from the beginning of this "I need a new parser" quest, I had a first official 0.1.0 release of sedparse (also available on PyPI).

@aureliojargas
Owner Author

The old sedsed parser has now been removed and sedparse is the official parser. See 961624b and 82c5d19.

Note: Those commits are dated at the end of August, but those dates are from the first iteration that I kept amending for ~2 months, until days ago when I finally pushed them here.
