New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make ds9 region file parser fast #48

Open
cdeil opened this Issue Oct 30, 2014 · 13 comments

Comments

Projects
None yet
5 participants
@cdeil
Member

cdeil commented Oct 30, 2014

The docs contain this note: "pyregion is rather slow, likely due to a inefficient parser."
( see http://pyregion.readthedocs.org/en/latest/index.html#documentation )

It's really slow ... parsing 1000 simple circle regions

with open('ds9.reg', 'w') as f:
    for _ in range(1000):
        f.write('circle(0,0,1)\n')

takes 7 seconds:

%time pyregion.open('ds9.reg')
CPU times: user 7.59 s, sys: 20 ms, total: 7.61 s

Astropy bundles ply, so if someone takes this on it might be worth considering using ply (currently pyparsing is used).

I'm using pyregion in a data analysis / image plotting pipeline and region file parsing is actually the bottleneck, but it's acceptable and I'm not very familiar with parsing and region files, so I don't plan to take this on ... just wanted to make a GH issue so this known issue is not forgotten.

@astrofrog

This comment has been minimized.

Show comment
Hide comment
@astrofrog

astrofrog Oct 31, 2014

Member

@cdeil - can you do some basic profiling (%prun) to see what the bottleneck is?

Member

astrofrog commented Oct 31, 2014

@cdeil - can you do some basic profiling (%prun) to see what the bottleneck is?

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil Nov 1, 2014

Member

All the time is spent in pyparsing.

RegionParser.parseLine calls self.parser.parseString(l) where self.parser is a pyparsing object created in RegionParser.__init__ (see here).

I have no idea how to optimise that ... this should probably be rewritten with ply (because it's bundled in astropy, i.e. the extra pyparsing dependency is avoided) and only then one should start profiling ... but it's some effort and requires someone that knows (or has time to learn) how ply works.

Member

cdeil commented Nov 1, 2014

All the time is spent in pyparsing.

RegionParser.parseLine calls self.parser.parseString(l) where self.parser is a pyparsing object created in RegionParser.__init__ (see here).

I have no idea how to optimise that ... this should probably be rewritten with ply (because it's bundled in astropy, i.e. the extra pyparsing dependency is avoided) and only then one should start profiling ... but it's some effort and requires someone that knows (or has time to learn) how ply works.

@astrofrog

This comment has been minimized.

Show comment
Hide comment
@astrofrog

astrofrog Nov 1, 2014

Member

Thanks for checking!

Member

astrofrog commented Nov 1, 2014

Thanks for checking!

@leejjoon

This comment has been minimized.

Show comment
Hide comment
@leejjoon

leejjoon Nov 2, 2014

Contributor

The slowness is likely due to my lack of knowledge instead pyparsing is inherently slow.

My recollection is that I wanted to keep passing the relevant part of the original string during the parsing, which somehow complicated the code. But not sure if this is the main reason for slow performance.

I note that matplotlib depends on pyparsing. So the pyparsing may not be an extra dependency.

Contributor

leejjoon commented Nov 2, 2014

The slowness is likely due to my lack of knowledge instead pyparsing is inherently slow.

My recollection is that I wanted to keep passing the relevant part of the original string during the parsing, which somehow complicated the code. But not sure if this is the main reason for slow performance.

I note that matplotlib depends on pyparsing. So the pyparsing may not be an extra dependency.

@cdeil cdeil added this to the wishlist milestone Aug 1, 2016

@cdeil cdeil added the enhancement label Aug 1, 2016

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil May 6, 2017

Member

This speed issue in the DS9 parser in pyregion is still of interest.
Next week @joleroi will give a presentation on regions (but also mentioning pyregion) at PyAstro17 and has updated a comparison: astropy/regions#123

Long-term it would be nice if we could have one easy to understand parser that's complete and reasonably fast (speed is not the most important thing, it doesn't have to be super fast).

So pyregion uses http://pyparsing.wikispaces.com/ and one could try to understand the parser here and why it's slow (profile and see which part of the parser is slow and if it can be restructured to be faster).

In https://github.com/astropy/regions/blob/master/regions/io/read_ds9.py there's a handwritten parser. I don't know how complete it is or could be made with some work.

To me, from a quick look, the parser in pyregion and regions looks about the same complexity. In either case I think I would have to read and think for an hour or two to grok them. (cc @keflavich and I think @astrofrog and @joleroi as authors)

If someone becomes really interested in this topic and wants to prototype and benchmark parsers, other options worth looking at are http://www.dabeaz.com/ply/ and https://github.com/erikrose/parsimonious .
I don't have time now or in the coming weeks, but I might in the summer if no-one figures out and finishes up this DS9 parsing part.

Member

cdeil commented May 6, 2017

This speed issue in the DS9 parser in pyregion is still of interest.
Next week @joleroi will give a presentation on regions (but also mentioning pyregion) at PyAstro17 and has updated a comparison: astropy/regions#123

Long-term it would be nice if we could have one easy to understand parser that's complete and reasonably fast (speed is not the most important thing, it doesn't have to be super fast).

So pyregion uses http://pyparsing.wikispaces.com/ and one could try to understand the parser here and why it's slow (profile and see which part of the parser is slow and if it can be restructured to be faster).

In https://github.com/astropy/regions/blob/master/regions/io/read_ds9.py there's a handwritten parser. I don't know how complete it is or could be made with some work.

To me, from a quick look, the parser in pyregion and regions looks about the same complexity. In either case I think I would have to read and think for an hour or two to grok them. (cc @keflavich and I think @astrofrog and @joleroi as authors)

If someone becomes really interested in this topic and wants to prototype and benchmark parsers, other options worth looking at are http://www.dabeaz.com/ply/ and https://github.com/erikrose/parsimonious .
I don't have time now or in the coming weeks, but I might in the summer if no-one figures out and finishes up this DS9 parsing part.

@cdeil cdeil referenced this issue May 6, 2017

Open

Correctness and speed comparison to pyregion #45

0 of 2 tasks complete
@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil May 6, 2017

Member

One more thing I forgot to mention: the nice thing about https://github.com/erikrose/parsimonious is that (from the looks of it, I haven't tried) one just writes down the grammar of the text to be parsed in BNF form. Independent of whether parsimonious is used or not (probably not) in the end, having the ds9 grammar written down would be nice, e.g. in the docstring of the parser .py file, or the developer docs section.

I don't presume the BNF for DS9 region files exists, but it could probably be extracted from the pyparsing parser code in pyregion, which looks a bit similar.

Member

cdeil commented May 6, 2017

One more thing I forgot to mention: the nice thing about https://github.com/erikrose/parsimonious is that (from the looks of it, I haven't tried) one just writes down the grammar of the text to be parsed in BNF form. Independent of whether parsimonious is used or not (probably not) in the end, having the ds9 grammar written down would be nice, e.g. in the docstring of the parser .py file, or the developer docs section.

I don't presume the BNF for DS9 region files exists, but it could probably be extracted from the pyparsing parser code in pyregion, which looks a bit similar.

@keflavich

This comment has been minimized.

Show comment
Hide comment
@keflavich

keflavich May 6, 2017

Contributor

We'll talk this week, but my recollection from last year was either no one understood how to write a grammar for ds9, or we tried and found it incompatible.

With a functional parser like the one we've started on, we can guarantee 100% coverage given enough programmer time. I don't know if the same is true for a grammar-based approach; it's not obvious to me whether ds9 regions follow a single self-consistent grammar.

Probably the existing code could be made more clear by breaking it down into individual region types; right now, there are extra if statements around for the cases that have indefinite numbers of parameters (e.g., polygons, annuli). I won't object to a BNF-based approach if it can be demonstrated, but I have to see it to believe it.

Contributor

keflavich commented May 6, 2017

We'll talk this week, but my recollection from last year was either no one understood how to write a grammar for ds9, or we tried and found it incompatible.

With a functional parser like the one we've started on, we can guarantee 100% coverage given enough programmer time. I don't know if the same is true for a grammar-based approach; it's not obvious to me whether ds9 regions follow a single self-consistent grammar.

Probably the existing code could be made more clear by breaking it down into individual region types; right now, there are extra if statements around for the cases that have indefinite numbers of parameters (e.g., polygons, annuli). I won't object to a BNF-based approach if it can be demonstrated, but I have to see it to believe it.

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil May 7, 2017

Member

how to write a grammar for ds9,

What about the pyparsing based parser in pyregion? Is there a problem apart from the speed issue?
Can you point out a (ideally small) example that can't be parsed at all or isn't parsed correctly?
If yes, maybe make a new issue in the pyregion tracker (to keep this one focused on the speed problem)?

Member

cdeil commented May 7, 2017

how to write a grammar for ds9,

What about the pyparsing based parser in pyregion? Is there a problem apart from the speed issue?
Can you point out a (ideally small) example that can't be parsed at all or isn't parsed correctly?
If yes, maybe make a new issue in the pyregion tracker (to keep this one focused on the speed problem)?

@joleroi

This comment has been minimized.

Show comment
Hide comment
@joleroi

joleroi May 8, 2017

This is the table with the current status

With a functional parser like the one we've started on, we can guarantee 100% coverage given enough programmer time.

I agree, we should work until the table linked above shows 100% for coverage everywhere for the regions package. Speed is already much better than pyregion, so we probably don't need to worry about that (as @cdeil said). We discussed this already a year ago, and couldn't come up with a complete DS9 grammar.

Can you point out a (ideally small) example that can't be parsed at all or isn't parsed correctly?

See the script to generate the table, there are many warnings like this

UserWarning: Failed to parse : box(18:13:27.558,+31:22:30.76,13.32",6.66",19.98",9.99",45.2623) # font="helvetica 10 bold roman" text={Box Annulus}
  warnings.warn("Failed to parse : " + l)

Also, most of the files that cannot be parsed with the regions package, show only a 75% coverage with pyregion. So I would be reluctant to try going down the grammar avenue again.

joleroi commented May 8, 2017

This is the table with the current status

With a functional parser like the one we've started on, we can guarantee 100% coverage given enough programmer time.

I agree, we should work until the table linked above shows 100% for coverage everywhere for the regions package. Speed is already much better than pyregion, so we probably don't need to worry about that (as @cdeil said). We discussed this already a year ago, and couldn't come up with a complete DS9 grammar.

Can you point out a (ideally small) example that can't be parsed at all or isn't parsed correctly?

See the script to generate the table, there are many warnings like this

UserWarning: Failed to parse : box(18:13:27.558,+31:22:30.76,13.32",6.66",19.98",9.99",45.2623) # font="helvetica 10 bold roman" text={Box Annulus}
  warnings.warn("Failed to parse : " + l)

Also, most of the files that cannot be parsed with the regions package, show only a 75% coverage with pyregion. So I would be reluctant to try going down the grammar avenue again.

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil May 8, 2017

Member

I didn't know that pyregion is not a complete DS9 region parser and I think wasn't part of the DS9 grammar / parse discussions last year (or forgot).

OK, fine with me to go with a hand-written parser if you think it's a good (or even the only possible) solution.

The pyregions docs are a bit all over the place.

  • On the front page there's a note "pyregion is rather slow, likely due to a inefficient parser. Any contribution will be welcome." and another one pointing to the regions package without saying anything about the relation of the two.
  • At the top of http://pyregion.readthedocs.io/en/latest/getting_started.html there's a statement containing "It reads most of the region files created by ds9.". I think it would be good to give an example of such a case and say what happens in the docs.

The regions docs also don't really say anything about the relation or regions and pyregion.
I think it would be good to add an About section to both, briefly describing what they are with a few bullet points.
Given the discussion here I think it's roughly: pyregion is a legacy package, will likely remain as-is. (Would still be nice to increase test coverage a bit, improve docs a bit and make one more release).
regions is supposed to be a full replacement and contain a full DS9 region parser. I think the "Shape" and "ShapeList" abstractions from pyregion might be good ones and should become part of the regions.io.ds9 API as an intermediate layer?

@joleroi or @keflavich - Can you do such a docs update this week or should I put it on my TODO list?

Note that @olebole as Debian Astro lead asked me a week ago if he should include pyregion and / or regions in Debian. My answer was that "both" would be good. I think there's many pyregion users and it would be good to have it also, at least for the next 5 years, then maybe it can be dropped at some point (like pywcs and pyfits were recently now that there's Astropy).
For regions I don't think we can make it into Astropy 2.0. It's far from finished / polished / user-tested. It'll be under development for another year and we don't want to commit to stable API like putting as astropy.regions implies, no?

Member

cdeil commented May 8, 2017

I didn't know that pyregion is not a complete DS9 region parser and I think wasn't part of the DS9 grammar / parse discussions last year (or forgot).

OK, fine with me to go with a hand-written parser if you think it's a good (or even the only possible) solution.

The pyregions docs are a bit all over the place.

  • On the front page there's a note "pyregion is rather slow, likely due to a inefficient parser. Any contribution will be welcome." and another one pointing to the regions package without saying anything about the relation of the two.
  • At the top of http://pyregion.readthedocs.io/en/latest/getting_started.html there's a statement containing "It reads most of the region files created by ds9.". I think it would be good to give an example of such a case and say what happens in the docs.

The regions docs also don't really say anything about the relation or regions and pyregion.
I think it would be good to add an About section to both, briefly describing what they are with a few bullet points.
Given the discussion here I think it's roughly: pyregion is a legacy package, will likely remain as-is. (Would still be nice to increase test coverage a bit, improve docs a bit and make one more release).
regions is supposed to be a full replacement and contain a full DS9 region parser. I think the "Shape" and "ShapeList" abstractions from pyregion might be good ones and should become part of the regions.io.ds9 API as an intermediate layer?

@joleroi or @keflavich - Can you do such a docs update this week or should I put it on my TODO list?

Note that @olebole as Debian Astro lead asked me a week ago if he should include pyregion and / or regions in Debian. My answer was that "both" would be good. I think there's many pyregion users and it would be good to have it also, at least for the next 5 years, then maybe it can be dropped at some point (like pywcs and pyfits were recently now that there's Astropy).
For regions I don't think we can make it into Astropy 2.0. It's far from finished / polished / user-tested. It'll be under development for another year and we don't want to commit to stable API like putting as astropy.regions implies, no?

@keflavich

This comment has been minimized.

Show comment
Hide comment
@keflavich

keflavich May 8, 2017

Contributor

Agreed that both pyregion and region need to coexist for at least a while.

Could you make the docs thing into a separate issue? Maybe we'll be able to address it this week, but it should be an Issue.

Contributor

keflavich commented May 8, 2017

Agreed that both pyregion and region need to coexist for at least a while.

Could you make the docs thing into a separate issue? Maybe we'll be able to address it this week, but it should be an Issue.

@cdeil

This comment has been minimized.

Show comment
Hide comment
@cdeil

cdeil May 8, 2017

Member

@keflavich - OK. I made a separate issue suggesting the docs improvements here: astropy/regions#124

Member

cdeil commented May 8, 2017

@keflavich - OK. I made a separate issue suggesting the docs improvements here: astropy/regions#124

@leejjoon

This comment has been minimized.

Show comment
Hide comment
@leejjoon

leejjoon May 8, 2017

Contributor
  1. coverage of pyregion

I have made a pull request (#111) that should improve the coverage. As far as I can tell, it now can parse all the regions in the test. The current version of the regions_pyregion_comparison.py reports coverage around 82%, but this is because the script overestimates the number of defined regions in the file.

  1. speed of pyregion

I think that the reason that pyregion is slow is likely not due to the inherent performance of the pyparsing, but more likely due to the inefficient grammar I created (and things inbetween). By no means I am an expert on this grammar thing and also the pyparsing.

Contributor

leejjoon commented May 8, 2017

  1. coverage of pyregion

I have made a pull request (#111) that should improve the coverage. As far as I can tell, it now can parse all the regions in the test. The current version of the regions_pyregion_comparison.py reports coverage around 82%, but this is because the script overestimates the number of defined regions in the file.

  1. speed of pyregion

I think that the reason that pyregion is slow is likely not due to the inherent performance of the pyparsing, but more likely due to the inefficient grammar I created (and things inbetween). By no means I am an expert on this grammar thing and also the pyparsing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment