parsing: escaping/separating functions contain duplicate code #160

sils · 2015-01-12T11:39:09Z

No description provided.

sils · 2015-02-23T12:02:33Z

It would be cool to have some EscapeHelper that searches or splits by a pattern, but only where the pattern is not escaped in some way. It should also be able to unescape things.

sils · 2015-02-24T16:50:07Z

http://stackoverflow.com/a/5455705

sils · 2015-02-24T17:09:59Z

Does anyone have experience with regexes? What we need is:

- A pattern to split things by a given string (possibly more than one char!)
  - This pattern shall also take an optional argument to limit the number of splits, so that e.g. one could split by the first occurrence only
- A pattern to match things which are between an (unescaped) begin string and end string
  - Option for allowing/disallowing newlines within the match
  - Option for allowing n matches
- Unescaping things (trivial)

Note that all those patterns shall not match if the match is escaped. Note that escapes can be escaped too (thus the match isn't escaped anymore).
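The two rules above (a match may be escaped, and the escape itself may be escaped) can be sketched in a few lines. This is only an illustration: the helper names `unescape` and `is_escaped` are hypothetical, and the backslash is assumed to be the only escape character.

```python
import re


def unescape(text):
    # Replace each backslash-escaped character with the character itself.
    # re.DOTALL lets an escaped newline be unescaped as well.
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)


def is_escaped(text, index):
    # Count the backslashes directly before text[index]. An odd count means
    # the character is escaped; an even count means the backslashes escape
    # each other and the character is not escaped anymore.
    count = 0
    while index - count - 1 >= 0 and text[index - count - 1] == "\\":
        count += 1
    return count % 2 == 1
```

With these, the `;` in `a \; b` counts as escaped, while the `;` in `a \\; b` does not.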

This has to be put into a class and should be tested thoroughly, especially if regexes are involved. Full branch coverage does not suffice here!

Or can we get this via a small library or so?

Help is wanted here!

Makman2 · 2015-02-24T19:07:13Z

I'm experimenting with regexes now...

sils · 2015-02-24T19:08:44Z

Thank god in heaven you're a genius just for trying.

Makman2 · 2015-02-24T19:14:44Z

Okay for the first case: is \n a natural separator or not?

sils · 2015-02-24T19:16:16Z

Reading it again, it's probably best to fully ignore newlines. If the user wants to break at newlines, he can do that beforehand.

sils · 2015-02-24T19:16:29Z

Except escaped newlines... damn...

sils · 2015-02-24T20:40:51Z

Let's just ignore newlines for now.

Makman2 · 2015-02-24T21:16:19Z

Next question: if a string to split contains two separators directly after each other, like this:
this string; is split; up with some;; separators.
Shall the empty entry be ignored or should it count?

Makman2 · 2015-02-24T21:32:27Z

Okay, assume that it counts:
(.*?)<your-separator>|(.+)
This needs the global (g) and dot-all (s) flags; in Python, use re.findall or, better, re.finditer together with re.DOTALL.

Okay, I found something better: Python's re module supports splitting the way you want, including limiting the number of splits: re.split
It's really flexible, especially with the separators (you can pass more than one string, since the separator is itself a regex!)
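A short sketch of what `re.split` offers out of the box, using the example string from the question above. The variable names are only for illustration; note that none of this handles escapes yet.

```python
import re

text = "this string; is split; up with some;; separators."

# Every occurrence is used; the doubled separator yields an empty entry.
all_parts = re.split(";", text)

# maxsplit=1 splits at the first occurrence only.
first_only = re.split(";", text, maxsplit=1)

# The separator is itself a regex, so several separators can be alternated.
mixed = re.split(r";|,", "a;b,c")
```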

sils · 2015-02-24T21:43:14Z

> Shall the empty entry be ignored or should it count?

It's an empty entry so it should count.

sils · 2015-02-24T21:44:38Z

> Okay, I found something better: Python's re module supports splitting the way you want, including limiting the number of splits: re.split
> It's really flexible, especially with the separators (you can pass more than one string, since the separator is itself a regex!)

Yes, what I'm asking is to provide the same functionality, with the addition that I don't need to care about escapes because that's horrible. :) So we can use split for some tests to compare.

Makman2 · 2015-02-24T22:13:21Z

For the second one, split could be an option, but the begin string and end string shall be at the beginning and end of the match, am I right?
So for this one just split two times, allowing only one split each time:

import re
# Keep the text after the first begin string, then cut it at the first end string.
result = re.split(begin_string, the_text_to_split, 1)
desired_match = re.split(end_string, result[1], 1)[0]

Repeat this until all strings surrounded by the begin and end string are found.
Or regex it (I don't know if it's faster, but I think so, and it's simpler to use, with no code overhead):

<begin-string>(.*?)<end-string>

Again use the g and s flags (to find all matches, use re.findall() or, better, re.finditer() as above, together with the flag re.DOTALL so that newlines are also matched).
The number of matches can't easily be limited inside the regex (I tried): the discrete quantifiers {n}, {m,n} and so on don't record every repetition of the preceding expression. That means the regex (a){3} applied to aaaaa puts just a single a in the match list, not three of them, because a repeated group only keeps its last repetition.
The easiest way is to rematch in code: invoke re.search() as many times as you want, each time feeding it the remainder of the string after the previous match.

To decide whether to match \n or not: just enable/disable the s flag (re.DOTALL).
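As a rough sketch of the approach described above, the match limit can live in code rather than in the regex. The function name `find_between` and its parameters are hypothetical and not taken from the prototype; `itertools.islice` stands in for the repeated `re.search()` calls, and escapes are not handled here.

```python
import re
from itertools import islice


def find_between(begin, end, text, max_matches=None, allow_newlines=True):
    # re.DOTALL makes "." match newlines too; without it a match
    # cannot span more than one line.
    flags = re.DOTALL if allow_newlines else 0
    # Non-greedy group: stop at the first end string after each begin string.
    pattern = re.escape(begin) + "(.*?)" + re.escape(end)
    matches = re.finditer(pattern, text, flags)
    # islice stops after max_matches results (None means no limit).
    return [match.group(1) for match in islice(matches, max_matches)]
```

Because `re.finditer` is lazy, the scan stops as soon as the requested number of matches has been collected.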

Also note that you can use the Python regex extensions, which are quite nice. In particular, (?P<name>...) captures the expression and makes the match available under the name name, e.g. in the dictionary returned by groupdict().
The Python documentation for regexes is quite detailed, but not easy for regex newbies. If you know what you want, it's good: https://docs.python.org/2/library/re.html#search-vs-match
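A tiny illustration of named groups; the pattern and the input string are made up for this example.

```python
import re

# Two named groups: the part before "=" and the part after it.
match = re.search(r"(?P<key>\w+)\s*=\s*(?P<value>\w+)", "indent = 4")

# groupdict() maps each group name to the text it captured.
groups = match.groupdict()
```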

To test the regexes, I used https://www.regex101.com/#python. Very nice website 👍

Coming to the last one: what exactly do you mean by escaping things?

sils · 2015-02-25T06:16:21Z

Example: I want to split this string: a \; test \\; string by ;. So the expected results would be a \; test \\ and string. Otherwise I'm not sure; I'm too tired to fully understand everything you said. I also don't know these flags very well...

sils · 2015-02-25T06:17:36Z

and I want to do something like unescaped_split("a \\; test \\\\; string", ";")
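One possible way to get this behaviour, as a minimal sketch only: scan the string with a regex that matches either an escape pair or the separator, and split only at real separators. The name `split_unescaped`, the argument order and the `max_split` parameter are assumptions made for this example; the separator is treated as a literal string and the backslash as the only escape character.

```python
import re


def split_unescaped(text, separator, max_split=0):
    # Match either an escape pair (backslash plus any character) or the
    # separator. Escape pairs are tried first, so "\;" is skipped as a whole
    # while "\\;" is read as an escaped backslash followed by a separator.
    token = re.compile(r"\\.|(" + re.escape(separator) + ")", re.DOTALL)
    parts = []
    start = 0
    for match in token.finditer(text):
        if match.group(1) is None:
            continue  # an escape pair, not a separator
        parts.append(text[start:match.start()])
        start = match.end()
        if max_split and len(parts) == max_split:
            break
    parts.append(text[start:])
    return parts
```

Escapes are left in the returned parts, matching the expected results given above; unescaping would be a separate step.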

Makman2 · 2015-02-25T15:38:47Z

re.split knows nothing about escapes, so it treats \ like any other character. If you want a split that respects escapes, you need to preprocess the string with a regex or re.sub()/re.subn().

sils · 2015-02-25T15:46:27Z

I think you know what we need, don't you? The conversation is getting a bit long; can we chat somewhere in case you need more info? E.g. https://gitter.im/coala-analyzer/coala

Makman2 · 2015-02-26T00:59:12Z

0.2.1 Prototype:
https://gist.github.com/Makman2/28a71dd80b8ad29d282d

To try it out, execute it in python3 interactive mode and call the functions with test patterns and strings.

sils · 2015-02-26T06:03:46Z

Just a naive comment: can't we easily reuse the split function for searching in between? I don't know how this affects performance, but it should be less code at least.

Search in between developer question: that's an easy one. Think of this string: \" a text " some \" text " another ". With " as the delimiter you will want to have some \" text " as a match. So neither the begin nor the end delimiter may be escaped.

If you can have a parameter to not spit out empty splits, that's great; it shouldn't be much code, so yeah, the more it can do, the better. As I hinted, we'll replace several subfunctions within our parsing modules with usages of this helper and we'll write other ones. In addition, I want to get this into the bearlib so bears can use it for parsing source code files, and every bit of flexibility is great because they have to do less.

Nice release history :D You wouldn't mind coming up with such nice names for coala releases too, would you?

Makman2 · 2015-02-26T15:44:03Z

It's possible, but I think regexes are faster. If you use splits, you would first split by the begin sequence and then split each result by the end sequence. That's time and memory consuming (since you have an intermediate list after the first split, and if you have duplicated end sequences you get multiple splits; a regex would just ignore further end sequences once one is found).
To be honest: I'm sure about the memory consumption, but not about the time; I need to test that^^

*2
The thing is when the search_in_between begin and end sequences are longer than one char, for example: START the string to match \END now the real END
Shall it match the string to match \ or the string to match \END now the real?

*3
Yes a parameter is possible, no problem :)

*4
Release name:

Eucalyptus (or something similar)
Nothin' more in my mind...

Makman2 · 2015-02-26T16:13:19Z

Updated gist (https://gist.github.com/Makman2/28a71dd80b8ad29d282d).
Supports optional filtering of empty matches.

sils · 2015-02-26T17:41:38Z

> *2
> The thing is when the search_in_between begin and end sequences are longer than one char, for example: START the string to match \END now the real END
> Shall it match the string to match \ or the string to match \END now the real?

The latter obviously. More chars don't really make any difference from the outside, do they?
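Both examples from this thread (the quoted string and the START/END one) can be covered by one scanning loop. The following is a sketch under stated assumptions, not the prototype's code: the name `find_between_unescaped` is hypothetical, the delimiters are non-empty literal strings, and the backslash is the only escape character.

```python
import re


def find_between_unescaped(begin, end, text):
    # Each scanner matches either an escape pair (backslash plus one
    # character) or the delimiter. Escape pairs are tried first, so an
    # escaped delimiter is skipped, and "\\" does not escape what follows.
    begin_token = re.compile(r"\\.|(" + re.escape(begin) + ")", re.DOTALL)
    end_token = re.compile(r"\\.|(" + re.escape(end) + ")", re.DOTALL)

    results = []
    position = 0
    content_start = None  # None while looking for a begin delimiter
    while True:
        token = begin_token if content_start is None else end_token
        match = token.search(text, position)
        if match is None:
            break
        position = match.end()
        if match.group(1) is None:
            continue  # escape pair, keep scanning
        if content_start is None:
            content_start = position  # content starts after the begin delimiter
        else:
            results.append(text[content_start:match.start()])
            content_start = None
    return results
```

Switching scanners depending on the state keeps the case where begin and end are the same string (such as `"`) unambiguous, and the length of the delimiters makes no difference to the loop.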

Makman2 · 2015-02-28T00:16:06Z

Update of the prototype to 0.2.4 "The first testament"
Tests are now included for the split() functions; tests for search_in_between() will come in the next one^^
--> regex_proto

@sils1297 Good idea to use the stackoverflow test strings. I needed to adjust the escaping regexes a bit; sometimes they consumed a bit too much :)

Searches for a regex pattern in a specified string. What makes this function more advanced is that it supports the max_matches parameter that limits the number of matches. No function inside the 're' namespace does that. Partially fixes #160

Add the split() function that splits a string with a provided pattern while ignoring escapes. For that purpose add the StringProcessing module. Includes a full test set for this function. Partially fixes #160

Splits a given string by a pattern respecting escapes. Includes a full test set for this function. Partially fixes #160

Searches for a string enclosed from a begin- and end-sequence. Ignores escapes. This commit includes a full test set for this function. Partially fixes #160

Searches for a string enclosed between a begin- and end-sequence. Handles escaped sequences. This commit includes a full test set. Partially fixes #160

sils added enhancement type/codestyle labels Jan 22, 2015

sils added this to the 0.2 alpha milestone Feb 23, 2015

sils added the importance/high label Feb 24, 2015

sils mentioned this issue Feb 24, 2015

StringConverter: Support dictionary conversion #325

Closed

sils assigned Makman2 Feb 26, 2015

Makman2 pushed a commit that referenced this issue Mar 17, 2015

parsing: Add unescaped_split() function

c5db370

Splits a given string by a pattern respecting escapes. Includes a full test set for this function. Partially fixes #160

Makman2 pushed a commit that referenced this issue Mar 17, 2015

parsing: Add search_in_between function

d34beb3

Searches for a string enclosed from a begin- and end-sequence. Ignores escapes. This commit includes a full test set for this function. Partially fixes #160

Makman2 pushed a commit that referenced this issue Mar 17, 2015

parsing: Add unescaped_search_in_between function

efcda40

Searches for a string enclosed between a begin- and end-sequence. Handles escaped sequences. This commit includes a full test set. Partially fixes #160

Makman2 pushed a commit that referenced this issue Mar 17, 2015

parsing: Add unescaped_split() function

8a8b768

Splits a given string by a pattern respecting escapes. Includes a full test set for this function. Partially fixes #160

Makman2 pushed a commit that referenced this issue Mar 17, 2015

parsing: Add search_in_between function

bc712f9

Searches for a string enclosed from a begin- and end-sequence. Ignores escapes. This commit includes a full test set for this function. Partially fixes #160

Makman2 pushed a commit that referenced this issue Mar 17, 2015

parsing: Add unescaped_search_in_between function

d92e143

Searches for a string enclosed between a begin- and end-sequence. Handles escaped sequences. This commit includes a full test set. Partially fixes #160

Makman2 closed this as completed in aa49874 Mar 18, 2015
