Rewrite the parser #52

aureliojargas · 2019-11-15T22:37:34Z

(Note: this ticket and the comments are all written after the fact. I'm just writing it here to document my "new parser" quest...)

For the sedsed magic to work, it first needs to read and parse a sed script.

Many years ago I wrote a "home made" sed script parser for sedsed. With no previous experience in writing a good parser, my idea of a "simple" parser was to always split the sed script on newlines and ; to detect the commands.

It worked for simple commands such as 5d; s/foo/bar/; 10q, but any command with a literal ; or newline was a challenge. For example, s/foo;/bar;/g would be broken into three pieces, and then the parser rejoined those pieces until it found a valid command. Very hacky and buggy.
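To illustrate the problem, here is a minimal sketch of that kind of context-blind splitting. It is not the old sedsed code; the function name `naive_split` is made up for this example.

```python
import re


def naive_split(script):
    """Split a sed script on every ';' and newline, ignoring context.

    Hypothetical helper for illustration only: it has no idea whether
    a ';' is a command separator or part of a regex or replacement.
    """
    pieces = re.split(r"[;\n]", script)
    return [p.strip() for p in pieces if p.strip()]


simple = "5d; s/foo/bar/; 10q"
tricky = "s/foo;/bar;/g"

print(naive_split(simple))  # ['5d', 's/foo/bar/', '10q']
print(naive_split(tricky))  # ['s/foo', '/bar', '/g']
```

The second result is the single s command torn into three fragments, none of which is a valid command on its own. That is why the pieces had to be glued back together afterwards.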

Most of the reported bugs in the issue tracker are related to the parser not handling corner cases, or failing to detect invalid code. I don't think patching it is sustainable. I need a real robust parser.

aureliojargas · 2019-11-15T23:20:34Z

When thinking about the new parser, I decided not to reinvent the wheel a second time. I'm no expert in parsers, and I would end up creating a second buggy one. So the decision was to reuse something already available.

I got very excited when I found https://github.com/GillesArcas/PythonSed, an implementation of GNU sed in Python. I did not need a full sed, but I could use just its parser to fulfill sedsed needs.

I forked the project and worked on adapting it for some weeks, because I needed to preserve the original sed script code (comments, blank lines, s flags in the order they were given, original regexes with no processing, etc.). I ended up writing new code and tests, but it proved hard to transform a parser (which only cares about identifying tokens and data) into a code "preserver".

aureliojargas · 2019-11-16T00:03:11Z

My second attempt was looking into the original C code for GNU sed, to check if I could understand its parser and maybe get some ideas from it. While reading the code, I got the idea to convert it to Python. Why not? It is a decades-old battle-tested parser, and the code seemed to be simple to understand and adapt.

That was the start of a solid month of work (and fun!) every night before sleeping, to get an initial working version. You can see the loooong list of commits in my dev branch (the first one is from 24 Jun).

I kept working on it in the following months, fixing bugs and adding the extra features sedsed required. Only after the parser itself was passing an extensive test suite, and I was confident it was working as expected, did I start adapting sedsed to use it.

My original idea was to insert the new parser code into the existing sedsed.py script, since I like it being a "single file app". But that was a challenge, since I had decided that the new parser would strictly follow the original C code, with all its globals and algorithms mostly unchanged. They were two different worlds, so the next decision was to have a separate sedparse.py file only for the parser.

Later, I felt that this parser and its test suite deserved a dedicated repository. Maybe other projects could make use of it? So https://github.com/aureliojargas/sedparse was born and I kept working on it in isolation, as a standalone project.

After 5 months of work in total, counting from the beginning of this "I need a new parser" quest, I had a first official 0.1.0 release of sedparse (also available on PyPI).

aureliojargas · 2019-11-16T00:21:02Z

The old sedsed parser has now been removed and sedparse is the official parser. See 961624b and 82c5d19.

Note: Those commits are dated at the end of August, but those dates come from the first iteration, which I kept amending for ~2 months until I finally pushed them here a few days ago.

aureliojargas modified the milestone: Modernize the code Nov 15, 2019

aureliojargas closed this as completed Nov 16, 2019

aureliojargas added this to the v2.0 milestone Nov 19, 2019
