parsing: escaping/separating functions contain duplicate code #160

Closed
sils opened this issue Jan 12, 2015 · 30 comments

Comments

@sils
Member

sils commented Jan 12, 2015

No description provided.

@sils
Member Author

sils commented Feb 23, 2015

It would be cool to have some EscapeHelper that searches for or splits by a pattern, but only if the pattern is not escaped in some way. It should also be able to unescape things.

@sils
Member Author

sils commented Feb 24, 2015

@sils
Member Author

sils commented Feb 24, 2015

Anyone experienced with regexes? So what we need is:

  • A pattern to split things by a given string (more than one char possibly!)
    • This pattern shall also take an optional argument to limit the number of splits so e.g. one could only split by the first occurrence
  • A pattern to match things which are between an (unescaped) begin string and end string
    • Option for allowing/disallowing newlines within the match
    • Option for allowing n matches
  • Unescaping things (trivial)

Note that all those patterns shall not match if the match is escaped. Note that escapes can be escaped too (thus the match isn't escaped anymore).

This has to be put into a class and should be tested thoroughly, especially if regexes are involved. Full branch coverage does not suffice here!

Or can we get this via a small library or so?

Help is wanted here!
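For the "unescaping" item in the list above, a minimal sketch could look like the following. It assumes that a backslash is the only escape character and that unescaping simply means dropping the backslash in front of any character; the function name unescape is hypothetical and not taken from coala.

```python
import re


def unescape(text):
    """Drop the escaping backslash in front of every escaped character.

    Assumption: backslash is the only escape character. An escaped
    backslash ("\\\\" in source form) becomes a single backslash, and a
    lone trailing backslash is left untouched.
    """
    # re.DOTALL lets "." also match an escaped newline.
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)
```

Because the regex consumes the backslash and the following character as a pair, an escaped escape is handled without any extra bookkeeping.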

@Makman2
Member

Makman2 commented Feb 24, 2015

I'm experimenting with regexes now...

@sils
Member Author

sils commented Feb 24, 2015

Thank god in heaven you're a genius just for trying.

@Makman2
Member

Makman2 commented Feb 24, 2015

Okay for the first case: is \n a natural separator or not?

@sils
Member Author

sils commented Feb 24, 2015

Reading it again, it's probably best to fully ignore newlines. If the user wants to break at newlines, he can do it beforehand.

@sils
Member Author

sils commented Feb 24, 2015

Except escaped newlines... damn...

@sils
Member Author

sils commented Feb 24, 2015

Let's just ignore newlines for now.

@Makman2
Member

Makman2 commented Feb 24, 2015

Next question: if a string to split contains two separators directly after each other, like this:
this string; is splitted; up with some;; separators.
Shall the empty entry be ignored or should it count?

@Makman2
Member

Makman2 commented Feb 24, 2015

Okay, assume that it counts:
(.*?)<your-separator>|(.+)
You need to specify the g (global) and s (dotall) flags; in Python, use the function re.findall or better re.finditer together with re.DOTALL.

Okay, I found something better: Python's re module supports splitting the way you want, including limiting the number of splits: re.split
It's really flexible, especially with the separators (you can pass more than one string, since the separator is itself a regex!)
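A short sketch of what re.split does with the example string from the question above. It only uses the standard library; the variable names are made up for illustration, and escapes are not handled at all here.

```python
import re

text = "this string; is splitted; up with some;; separators."

# Every separator counts, so the doubled ";;" produces an empty entry.
parts = re.split(";", text)

# maxsplit limits the number of splits; the remainder stays in one piece.
first_only = re.split(";", text, maxsplit=1)

# The separator is a regex, so several separators can be combined.
multi = re.split(r";|,", "a;b,c")

print(parts)
print(first_only)
print(multi)
```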

@sils
Member Author

sils commented Feb 24, 2015

Shall the empty entry be ignored or should it count?

It's an empty entry, so it should count.

@sils
Member Author

sils commented Feb 24, 2015

Okay, I found something better: Python's re module supports splitting the way you want, including limiting the number of splits: re.split
It's really flexible, especially with the separators (you can pass more than one string, since the separator is itself a regex!)

Yes, what I'm asking for is the same functionality, with the addition that I don't need to care about escapes, because that's horrible. :) So we can use split in some tests to compare.

@Makman2
Member

Makman2 commented Feb 24, 2015

For the second one split could be an option, but the begin string and end string shall be at the beginning and end, am I right?
So for this one just split two times with only one match allowed:

import re

# result[1] is the text after the first begin string; cut it at the first end string.
result = re.split(begin_string, the_text_to_split, 1)
desired_match = re.split(end_string, result[1], 1)[0]

Repeat this until all strings surrounded by the begin and end string are found.
Or regex it (I don't know if it's faster, but I think so, and it's simpler to use, with no code overhead):

<begin-string>(.*?)<end-string>

Again, use the g and s flags (to find all matches use, as above, re.findall() or better re.finditer() together with the flag re.DOTALL, so newlines are also matched).
The number of matches can't easily be limited inside the regex (I tried): the discrete quantifiers {n}, {m,n} and so on don't record every repetition of the preceding expression. That means
the regex (a){3} applied to aaaaa just puts a single a in the match list, rather than one entry per repetition.
To do so, I think it's easiest to rematch in code: invoke re.search() as many times as you want, each time fed with the remainder of the string after the previous match.

To decide whether to match \n or not: Just enable/disable the \s flag (re.DOTALL).
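As a rough sketch of the approach described above, the number of matches can also be limited by slicing the iterator returned by re.finditer instead of calling re.search() repeatedly. The function name search_between and its parameters are hypothetical; begin and end are treated as literal strings here (an assumption), and escapes are not handled yet.

```python
import itertools
import re


def search_between(begin, end, text, max_matches=0, allow_newlines=True):
    """Return the strings found between begin and end (not escape-aware).

    max_matches <= 0 means "no limit". With allow_newlines=False the
    match may not span lines, because "." then stops at a newline.
    """
    flags = re.DOTALL if allow_newlines else 0
    pattern = re.escape(begin) + "(.*?)" + re.escape(end)
    matches = re.finditer(pattern, text, flags)
    if max_matches > 0:
        # finditer is lazy, so islice stops searching after max_matches.
        matches = itertools.islice(matches, max_matches)
    return [match.group(1) for match in matches]
```

For example, search_between("<", ">", "<a> <b> <c>", max_matches=2) only reports the first two enclosed strings.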

Also note you can use the python regex extensions which are quite nice. Especially (?P<name>) captures the expression and places the match inside a dictionary with the name name.
For the regexes in python the documentation is quite detailed, but not easy for regex newbies. If you know what you want it's good: https://docs.python.org/2/library/re.html#search-vs-match

To test the regexes, I used https://www.regex101.com/#python. Very nice website 👍

Coming to the last one: what exactly do you mean by escaping things?

@sils
Member Author

sils commented Feb 25, 2015

Example: I want to split this string: a \; test \\; string by ;. So the expected results would be a \; test \\ and string. Beyond that, I'm not sure; I'm too tired to fully understand everything you said. I also don't know these flags very well...

@sils
Member Author

sils commented Feb 25, 2015

and I want to do something like unescaped_split("a \\; test \\\\; string", ";")
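One possible way to get the behaviour asked for here, sketched without regexes: walk through the string, copy every backslash together with the character it escapes, and split only at separators that are reached outside such a pair. This is an illustration under the assumption that backslash is the only escape character; the max_split parameter is hypothetical, and this is not the implementation that later landed in coala.

```python
def unescaped_split(text, separator, max_split=0):
    """Split text at every separator that is not escaped by a backslash.

    Escape sequences are kept verbatim in the result. max_split <= 0
    means "no limit". Surrounding whitespace is not stripped.
    """
    if not separator:
        raise ValueError("separator must not be empty")

    parts = []
    current = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            # A backslash escapes the next character: copy both, so an
            # escaped backslash cannot escape what follows it.
            current.append(text[i:i + 2])
            i += 2
        elif text.startswith(separator, i) and (
                max_split <= 0 or len(parts) < max_split):
            parts.append("".join(current))
            current = []
            i += len(separator)
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts
```

With the call from the comment above, the first ; is skipped because it is escaped, while the second one splits because the backslash in front of it is itself escaped.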

@Makman2
Member

Makman2 commented Feb 25, 2015

re.split knows nothing about escapes, so it treats \ like any other character. If you want an escape-aware split, you need to preprocess the string with a regex or re.sub()/re.subn().

@sils
Member Author

sils commented Feb 25, 2015

I think you know what we need, don't you? The conversation is getting a bit long; can we chat somewhere in case you need more info? E.g. https://gitter.im/coala-analyzer/coala

@Makman2
Member

Makman2 commented Feb 26, 2015

0.2.1 Prototype:
https://gist.github.com/Makman2/28a71dd80b8ad29d282d

To try it out execute it in python3 interactive mode and call the functions with test-patterns and -strings.

@sils
Member Author

sils commented Feb 26, 2015

Just a naive comment: can't we easily reuse the split function for searching in between? I don't know how this affects performance, but it should be less code at least.

Search in between developer question: that's an easy one. Think of this string: \" a text " some \" text " another ". With " as the delimiter you will want to have some \" text " as a match. So neither the begin nor the end delimiter may be escaped.

If you can have a parameter to not spit out empty splits, that's great. It shouldn't be much code, so yeah: the more it can do, the better. As I hinted a bit, we'll replace several subfunctions within our parsing modules with usages of this helper and we'll write other ones. In addition, I want to get this into the bearlib so bears can use it for parsing source code files, and every bit of flexibility is great because they have to do less.

Nice release history :D You wouldn't mind to come up with such nice names for coala releases too would you?

@Makman2
Member

Makman2 commented Feb 26, 2015

It's possible, but I think regexes are faster. If you use splits, you would first split by the begin-sequence and then split each result by the end-sequence. That's time and memory consuming (since you have an intermediate list after the first split, and if you have duplicated end sequences you get multiple splits; a regex would just ignore the end-sequence if one is already found).
To be honest: I'm sure about the memory consumption, but not about the time; I need to test that^^

*2
The thing is when the search_in_between begin and end sequences are longer than one char, for example: START the string to match \END now the real END
Shall it match the string to match \ or the string to match \END now the real?

*3
Yes a parameter is possible, no problem :)

*4
Release name:

  • Eucalyptus (or something similar)
  • Nothin' more in my mind...

@Makman2
Member

Makman2 commented Feb 26, 2015

Updated gist (https://gist.github.com/Makman2/28a71dd80b8ad29d282d).
Supports optional filtering of empty matches.

@sils
Member Author

sils commented Feb 26, 2015

*2
The thing is when the search_in_between begin and end sequences are longer than one char, for example: START the string to match \END now the real END
Shall it match the string to match \ or the string to match \END now the real?

The latter obviously. More chars don't really make any difference from the outside, do they?
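A hedged sketch of an escape-aware search that behaves this way for both the single-character " delimiter and the multi-character START/END example from this thread. The regex is one possible construction and not necessarily what the prototype gist uses; the function signature is an assumption, begin and end are treated as literal strings, and backslash is assumed to be the only escape character.

```python
import itertools
import re


def search_in_between(begin, end, text, max_matches=0):
    """Return the strings enclosed by an unescaped begin and end sequence.

    max_matches <= 0 means "no limit". Escape sequences inside a match
    are returned verbatim.
    """
    pattern = re.compile(
        # begin must be preceded by an even number of backslashes
        # (zero included), otherwise it is escaped.
        r"(?<!\\)(?:\\\\)*" + re.escape(begin)
        # The content is consumed as escape pairs or single
        # non-backslash characters, so an escaped end sequence can
        # never terminate the match.
        + r"((?:\\.|[^\\])*?)"
        + re.escape(end),
        re.DOTALL)
    matches = pattern.finditer(text)
    if max_matches > 0:
        matches = itertools.islice(matches, max_matches)
    return [match.group(1) for match in matches]
```

Because the content is read pair-wise, the length of the begin and end sequences makes no difference: \END is consumed as an escaped E followed by plain characters, and only the later unescaped END closes the match.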

@Makman2
Member

Makman2 commented Feb 28, 2015

Update of the prototype to 0.2.4 "The first testament"
Tests now included for the split() functions, for search_in_between() will come in next one^^
--> regex_proto

@sils1297 Good idea to use the stackoverflow test strings, I needed to adjust the escaping regexes a bit, sometimes they consumed a bit too much :)

Makman2 pushed a commit that referenced this issue Mar 17, 2015
Searches for a regex pattern in a specified string.
What makes this function more advanced is that it supports the
max_matches parameter that limits the number of matches. No function
inside the 're' namespace does that.

Partially fixes #160
Makman2 pushed a commit that referenced this issue Mar 17, 2015
Add the split() function that splits a string with a provided
pattern while ignoring escapes. For that purpose add the
StringProcessing module.

Includes a full test set for this function.

Partially fixes #160
Makman2 pushed a commit that referenced this issue Mar 17, 2015
Splits a given string by a pattern respecting escapes.

Includes a full test set for this function.

Partially fixes #160
Makman2 pushed a commit that referenced this issue Mar 17, 2015
Searches for a string enclosed from a begin- and end-sequence. Ignores
escapes.

This commit includes a full test set for this function.

Partially fixes #160
Makman2 pushed a commit that referenced this issue Mar 17, 2015
Searches for a string enclosed between a begin- and end-sequence.
Handles escaped sequences.

This commit includes a full test set.

Partially fixes #160

2 participants