Speed-up for nested block pattern matching #17

Closed
GoogleCodeExporter opened this issue Jun 30, 2015 · 2 comments

Comments

@GoogleCodeExporter

Before change:

input string length: 475
4000 iterations in 3814 ms (0.9535 ms per iteration)
input string length: 2356
1000 iterations in 4215 ms (4.215 ms per iteration)
input string length: 27737
100 iterations in 5908 ms (59.08 ms per iteration)
input string length: 11075
1 iteration in 25 ms
input string length: 88607
1 iteration in 278 ms
input string length: 354431
1 iteration in 2386 ms

After:

input string length: 475
4000 iterations in 3756 ms (0.939 ms per iteration)
input string length: 2356
1000 iterations in 4196 ms (4.196 ms per iteration)
input string length: 27737
100 iterations in 4753 ms (47.53 ms per iteration)
input string length: 11075
1 iteration in 23 ms
input string length: 88607
1 iteration in 190 ms
input string length: 354431
1 iteration in 1027 ms

with all unit tests passing.

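So a moderate speed-up: negligible on the small inputs, about 20% at 27,737 characters, and more than 2x at 354,431 characters.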
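For anyone who wants to reproduce this kind of measurement, here is a rough sketch of a timing harness. It is not the harness used for the numbers above (the issue does not say how they were taken). The class name `BlockRegexTiming`, the generated input, and the simplified single-tag patterns are all assumptions made for illustration, so absolute timings will differ.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class BlockRegexTiming
{
    // Simplified stand-ins for the patterns in this issue (one hypothetical tag, no comments).
    public const string Before = @"^<(div)\b(.*\n)*?</\1>[ \t]*(?=\n+|\Z)";
    public const string After = @"^<(div)\b(?>.*\n)*?</\1>[ \t]*(?=\n+|\Z)";

    // Builds a synthetic document: `blocks` HTML blocks, each followed by a paragraph.
    public static string MakeInput(int blocks)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < blocks; i++)
        {
            sb.Append("<div>\nsome text\nmore text\n</div>\n\nA paragraph.\n\n");
        }
        return sb.ToString();
    }

    // Returns the elapsed milliseconds for `iterations` full scans of `input`.
    public static long TimeMs(string pattern, string input, int iterations)
    {
        Regex regex = new Regex(pattern, RegexOptions.Multiline | RegexOptions.Compiled);
        int found = regex.Matches(input).Count; // warm-up; Count forces the lazy collection
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            found = regex.Matches(input).Count;
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string input = MakeInput(500);
        int iterations = 100;
        foreach (string pattern in new[] { Before, After })
        {
            long ms = TimeMs(pattern, input, iterations);
            Console.WriteLine("input string length: " + input.Length);
            Console.WriteLine(iterations + " iterations in " + ms + " ms ("
                + ((double)ms / iterations) + " ms per iteration)");
        }
    }
}
```

Timing only the regex, as this sketch does, isolates the pattern change; timing a whole conversion run would also include every other processing step.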

Change to:

        private static Regex _blocksNested = new Regex(string.Format(@"
                (                       # save in $1
                    ^                   # start of line (with /m)
                    <({0})              # start tag = $2
                    \b                  # word break
                    (?>.*\n)*?          # any number of lines, minimally matching
                    </\2>               # the matching end tag
                    [ \t]*              # trailing spaces/tabs
                    (?=\n+|\Z)          # followed by a newline or end of document
                )", _blockTags1), RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);

        private static string _blockTags2 = "p|div|h[1-6]|blockquote|pre|table|dl|ol|ul|script|noscript|form|fieldset|iframe|math";
        private static Regex _blocksNestedLiberal = new Regex(string.Format(@"
               (                        # save in $1
                    ^                   # start of line (with /m)
                    <({0})              # start tag = $2
                    \b                  # word break
                    (?>.*\n)*?          # any number of lines, minimally matching
                    .*</\2>             # the matching end tag
                    [ \t]*              # trailing spaces/tabs
                    (?=\n+|\Z)          # followed by a newline or end of document
                )", _blockTags2), RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);

The important part is:

(?>.*\n)*?

instead of:

(.*\n)*?
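To make the difference concrete, here is a minimal sketch (not part of the patch itself). It builds the same pattern twice, once with each line group, and compares the results on a small input. The tag list `p|div`, the class name `AtomicGroupDemo`, and the sample text are assumptions for illustration only. Since `.` does not match `\n` without `RegexOptions.Singleline`, `.*\n` can match a given line in only one way, so making the group atomic should not change what is matched; it only tells the engine not to retry shorter `.*` lengths inside the group and not to record a capture for it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicGroupDemo
{
    // Hypothetical tag list for illustration only; the real _blockTags1 is not shown in this issue.
    private const string Tags = "p|div";

    // {0} = tag alternation, {1} = the per-line group being compared.
    private const string Template = @"
        (                   # save in $1
            ^               # start of line (with /m)
            <({0})          # start tag = $2
            \b              # word break
            {1}*?           # any number of lines, minimally matching
            </\2>           # the matching end tag
            [ \t]*          # trailing spaces/tabs
            (?=\n+|\Z)      # followed by a newline or end of document
        )";

    public static Regex Build(string lineGroup)
    {
        return new Regex(string.Format(Template, Tags, lineGroup),
            RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
    }

    public static void Main()
    {
        string input = "<div>\nline one\nline two\n</div>\n\ntext after\n";

        Match capturing = Build(@"(.*\n)").Match(input);
        Match atomic = Build(@"(?>.*\n)").Match(input);

        // Both variants are expected to select the same block.
        Console.WriteLine(capturing.Value == atomic.Value);
        // The atomic variant defines one fewer capture group.
        Console.WriteLine(capturing.Groups.Count + " vs " + atomic.Groups.Count);
    }
}
```

Note that the backreference `\2` is unaffected by the change, because the group that is no longer captured comes after the tag group.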



Original issue reported on code.google.com by wcshie...@gmail.com on 4 Jan 2010 at 10:22

@GoogleCodeExporter

Beware, this area is slated to change entirely in 1.07/1.08 -- the last two failing tests have to do with the horribly broken HTML block parser.

Original comment by wump...@gmail.com on 5 Jan 2010 at 1:05

@GoogleCodeExporter

Thanks for the contribution -- unfortunately it is now obsolete, given the new HTML block parser in r74.

Original comment by wump...@gmail.com on 6 Jan 2010 at 10:56

  • Changed state: WontFix
