parsing: escaping/separating functions contain duplicate code #160
It would be cool to have an EscapeHelper that searches for or splits by a pattern, but only where the pattern is not escaped in some way. It should also be able to unescape things.
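The unescaping part could look roughly like this. It is only a minimal sketch, assuming a backslash is the escape character; `unescape` is a hypothetical name for illustration, not an existing coala function.

```python
import re


def unescape(text):
    # Hypothetical helper: drop each escaping backslash and keep the
    # character it protects. A lone trailing backslash is left as it is.
    return re.sub(r"\\(.)", r"\1", text, flags=re.DOTALL)


print(unescape(r"a\,b\\c"))  # prints a,b\c
```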
Does anyone have experience with regexes? What we need is:
Note that none of these patterns shall match if the match is escaped. Note that escapes can be escaped too (in which case the match is no longer escaped). This has to be put into a class and should be tested rigorously, especially if regexes are involved. Full branch coverage does not suffice here! Or can we get this from a small library? Help is wanted here!
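One way to express "not escaped, but escapes can be escaped too" as a regex is to require an even number of backslashes directly in front of the match. The following is a sketch under assumptions: backslash is the escape character, the pattern itself does not match backslashes, and `find_unescaped` is a hypothetical name.

```python
import re


def find_unescaped(pattern, text):
    # (?<!\\) makes sure we start at the beginning of a backslash run,
    # (?:\\\\)* then consumes pairs of backslashes (escaped escapes).
    # Only an even count (including zero) leaves the match unescaped.
    regex = re.compile(r"(?<!\\)(?:\\\\)*(?P<hit>" + pattern + ")")
    return [match.start("hit") for match in regex.finditer(text)]


# Index 1 is a plain comma, index 4 is escaped, index 8 follows an
# escaped backslash and therefore counts again.
print(find_unescaped(",", r"a,b\,c\\,d"))  # prints [1, 8]
```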
I'm experimenting with regexes now...
Thank god in heaven you're a genius just for trying. 2015-02-24 20:07 GMT+01:00 Makman2 notifications@github.com:
Okay, for the first case: is \n a natural separator or not?
Reading it again, it's probably best to ignore newlines entirely. If the user wants to break at newlines, he can do that beforehand.
Except escaped newlines... damn...
Let's just ignore newlines for now.
Next question: if a string to split contains two separators directly after each other, like this:
Okay, assume that it counts. I found something better: the Python regex module supports splitting the way you want, including limiting the number of splits:
It's an empty entry, so it should count.
Yes, what I'm asking for is the same functionality, with the addition that I don't need to care about escapes, because that's horrible. :) So we can use split for some tests to compare.
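For reference, this is how the standard `re.split` behaves in the cases discussed here. The example strings are made up for illustration; the last call shows why plain splitting is not enough once escapes are involved.

```python
import re

# Two separators in a row produce an empty entry.
adjacent = re.split(",", "a,,b")

# maxsplit limits the number of splits; the rest stays in one piece.
limited = re.split(",", "a,b,c", maxsplit=1)

# re.split knows nothing about escapes: the escaped comma is split too.
naive = re.split(",", r"a\,b")

print(adjacent)  # prints ['a', '', 'b']
print(limited)   # prints ['a', 'b,c']
print(naive)     # prints ['a\\', 'b']
```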
For the second one, split could be an option, but the begin string and end string have to be at the beginning and end, am I right?
result = re.split(begin_string, the_text_to_split, 1)
desired_match = re.split(end_string, result[1], 1)[0]
Repeat this until all strings surrounded by the begin and end string are found.
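The split-based idea can be sketched as a loop. This assumes the begin and end sequences are non-empty literal strings (hence `re.escape`) and ignores escaping completely; `search_in_between_split` is a hypothetical name.

```python
import re


def search_in_between_split(begin, end, text):
    results = []
    rest = text
    while True:
        # Cut off everything up to and including the begin sequence.
        parts = re.split(re.escape(begin), rest, maxsplit=1)
        if len(parts) < 2:
            break  # no further begin sequence
        # The wanted string is whatever precedes the next end sequence.
        inner = re.split(re.escape(end), parts[1], maxsplit=1)
        if len(inner) < 2:
            break  # begin sequence without a closing end sequence
        results.append(inner[0])
        rest = inner[1]
    return results


print(search_in_between_split("(", ")", "f(a) g(b) h"))  # prints ['a', 'b']
```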
Again use To decide whether to match \n or not: Just enable/disable the Also note that you can use the Python regex extensions, which are quite nice. Especially To test the regexes, I used https://www.regex101.com/#python. Very nice website 👍 Coming to the last one: what exactly do you mean by escaping things?
For example, I want to split this string:
and I want to do something like
I think you know what we need, don't you? The conversation is getting a bit long, can we chat somewhere in case you need more info? E.g. https://gitter.im/coala-analyzer/coala
0.2.1 prototype: to try it out, run it in python3 interactive mode and call the functions with test patterns and test strings.
Just a naive comment: can't we easily reuse the split function for searching in between? I don't know how this affects performance, but it should be less code at least. Search in between developer question: that's an easy one. Think of this string: If you can have a parameter to not spit out empty splits, that's great; it shouldn't be much code, so yeah, the more it can do, the better. As I hinted, we'll replace several subfunctions within our parsing modules with usages of this helper, and we'll write other ones. In addition, I want to get this into the bearlib so bears can use it for parsing source code files, and every bit of flexibility is great because they have to do less. Nice release history :D You wouldn't mind coming up with such nice names for coala releases too, would you?
It's possible, but I think regexes are faster. If you use splits, you would first split for the begin-sequence and then split each result with the end-sequence. That's time and memory consuming, since you have an intermediate list after the first split. And if you have duplicated end sequences, you get multiple splits, whereas a regex would just ignore the end-sequence if one is already found.
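The regex variant argued for here can be sketched with a single non-greedy group. Assumptions: begin and end are literal strings, escapes are not handled yet, and `search_in_between_regex` is a hypothetical name.

```python
import re


def search_in_between_regex(begin, end, text):
    # (.*?) stops at the first end sequence; a duplicated end sequence
    # right after it is simply skipped while looking for the next begin.
    pattern = re.escape(begin) + r"(.*?)" + re.escape(end)
    return re.findall(pattern, text, flags=re.DOTALL)


print(search_in_between_regex("(", ")", "f(a) g(b)) h"))  # prints ['a', 'b']
```

No intermediate list is built here, which is the memory argument made in the comment.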
Updated gist (https://gist.github.com/Makman2/28a71dd80b8ad29d282d).
The latter obviously. More chars don't really make any difference from the outside, do they?
Update of the prototype to 0.2.4 "The first testament" @sils1297 Good idea to use the Stack Overflow test strings, I needed to adjust the escaping regexes a bit, sometimes they consumed a bit too much :)
Searches for a regex pattern in a specified string. What makes this function more advanced is that it supports the max_matches parameter that limits the number of matches. No function inside the 're' namespace does that. Partially fixes #160
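A match limit like this can be approximated with `re.finditer` and `itertools.islice`. This is a sketch of the idea, not the committed implementation; the name `limited_search` and the convention that `max_matches=0` means "no limit" are assumptions.

```python
import re
from itertools import islice


def limited_search(pattern, string, max_matches=0, flags=0):
    # finditer is lazy, so matches beyond the limit are never computed.
    matches = re.finditer(pattern, string, flags)
    if max_matches > 0:
        matches = islice(matches, max_matches)
    return list(matches)


found = limited_search(r"\d+", "1 22 333 4444", max_matches=2)
print([match.group() for match in found])  # prints ['1', '22']
```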
Add the split() function that splits a string with a provided pattern while ignoring escapes. For that purpose add the StringProcessing module. Includes a full test set for this function. Partially fixes #160
Splits a given string by a pattern respecting escapes. Includes a full test set for this function. Partially fixes #160
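To illustrate what "respecting escapes" means for splitting, here is a small character scanner. It is a simplified sketch, not the StringProcessing code: it takes a literal separator instead of a regex pattern, assumes a single-character escape, and the names `split_unescaped` and `remove_empty` are hypothetical.

```python
def split_unescaped(text, separator, escape="\\", remove_empty=False):
    if not separator:
        raise ValueError("separator must not be empty")
    parts = []
    current = []
    i = 0
    while i < len(text):
        if text[i] == escape and i + 1 < len(text):
            # An escape protects the next character, whatever it is,
            # so an escaped escape no longer protects what follows it.
            current.append(text[i:i + 2])
            i += 2
        elif text.startswith(separator, i):
            parts.append("".join(current))
            current = []
            i += len(separator)
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    if remove_empty:
        parts = [part for part in parts if part]
    return parts


print(split_unescaped(r"a,b\,c,d", ","))  # prints ['a', 'b\\,c', 'd']
```

The escapes are kept in the output on purpose, so unescaping stays a separate step.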
Searches for a string enclosed from a begin- and end-sequence. Ignores escapes. This commit includes a full test set for this function. Partially fixes #160
Searches for a string enclosed between a begin- and end-sequence. Handles escaped sequences. This commit includes a full test set. Partially fixes #160
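Handling escaped sequences while searching in between could look like the following regex. Again a sketch under assumptions: literal begin and end sequences, backslash as the escape character, and a hypothetical function name.

```python
import re


def search_in_between_escaped(begin, end, text):
    pattern = (
        r"(?<!\\)(?:\\\\)*"      # begin must not be escaped itself
        + re.escape(begin)
        + r"((?:\\.|[^\\])*?)"   # escaped pairs or ordinary characters
        + re.escape(end)
    )
    return re.findall(pattern, text, flags=re.DOTALL)


# The escaped quotes inside the first string do not end it.
print(search_in_between_escaped('"', '"', r'say "he said \"hi\"" now "x"'))
```

Because a backslash can only be consumed together with the character after it, an escaped end sequence can never close the match.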