Optimize fits.Header parsing #8428

saimn · 2019-02-15T23:16:00Z

Related to #5593, this implements one of the ideas mentioned there: adding a fromcards class method to speed up the creation of a Header instance from a list of cards.
The other main change is a reorganization of keyword parsing (_parse_keyword) that speeds up the most common case.
There is also what is probably a small bugfix: the return value of _check_if_rvkc_image was not used. This was probably harmless; I guess the check was done again later.

The gain is significant for files with many HDUs (~12-13% on the full file, about 9-11% on the smaller slices). The timings below just compute the number of HDUs for a file with >300 extensions (the second command uses the first 100 extensions, the third only 10).

Before:

~/lib/astropy/fits-header
❯ python -m timeit -n 1 -r 5 -s "from astropy.io import fits" "print(len(fits.open('./scipost_white.fits')))"
336
336
336
336
336
1 loop, best of 5: 6.16 sec per loop

~/lib/astropy/fits-header 34s
❯ python -m timeit -n 1 -r 5 -s "from astropy.io import fits" "print(len(fits.open('./scipost_white.fits')[:100]))"
100
100
100
100
100
1 loop, best of 5: 1.85 sec per loop

~/lib/astropy/fits-header
❯ python -m timeit -n 5 -r 5 -s "from astropy.io import fits" "print(len(fits.open('./scipost_white.fits')[:10]))"
10
...
10
5 loops, best of 5: 167 msec per loop

After:

~/lib/astropy/fits-header
❯ python -m timeit -n 1 -r 5 -s "from astropy.io import fits" "print(len(fits.open('./scipost_white.fits')))"
336
336
336
336
336
1 loop, best of 5: 5.38 sec per loop

~/lib/astropy/fits-header 28s
❯ python -m timeit -n 1 -r 5 -s "from astropy.io import fits" "print(len(fits.open('./scipost_white.fits')[:100]))"
100
100
100
100
100
1 loop, best of 5: 1.64 sec per loop

~/lib/astropy/fits-header
❯ python -m timeit -n 5 -r 5 -s "from astropy.io import fits" "print(len(fits.open('./scipost_white.fits')[:10]))"
10
...
10
5 loops, best of 5: 152 msec per loop

I'm also working on other ideas from #5593, but this will take more time, so let's go with this for now!

codecov · 2019-02-15T23:36:06Z

Codecov Report

Merging #8428 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8428      +/-   ##
==========================================
+ Coverage   86.77%   86.78%   +<.01%     
==========================================
  Files         387      387              
  Lines       58109    58125      +16     
  Branches     1060     1060              
==========================================
+ Hits        50426    50442      +16     
  Misses       7068     7068              
  Partials      615      615

Impacted Files	Coverage Δ
astropy/io/fits/header.py	`96.17% <100%> (+0.06%)`	⬆️
astropy/io/fits/card.py	`86.48% <100%> (+0.06%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8d97a4f...26c61be.

MSeifert04

A few questions, but on the whole this seems like a great addition (given the timings).

Did you run a profiler against the benchmark? If so I'm very interested to see where the time is spent here. But I can also do that later (probably after the weekend).

MSeifert04 · 2019-02-16T08:44:54Z

astropy/io/fits/card.py

@@ -24,6 +24,7 @@
 KEYWORD_LENGTH = 8  # The max length for FITS-standard keywords

 VALUE_INDICATOR = '= '  # The standard FITS value indicator
+VALUE_INDICATOR_LEN = 2


Wouldn't it be more semantically correct to replace the 2 with len(VALUE_INDICATOR)?

Yes, but I hard coded the value to avoid a tiny overhead at import time.

Is it worth it?

The overhead is probably negligible, I can change it if you prefer. But the value is hardcoded and defined just above, so using len seems overkill.

Maybe overkill, but I only commented because it took me a few seconds to figure out what this value is (probably more time than saved by hardcoding it 😄).

Thanks for changing it 👍

MSeifert04 · 2019-02-16T08:48:49Z

astropy/io/fits/header.py

@@ -93,7 +93,9 @@ def __init__(self, cards=[], copy=False):

            .. versionadded:: 1.3
        """
-        self.clear()
+        self._cards = []


Hm, why this change? Isn't this just duplicating code?

I don't think this is a good change. It's identical to clear and the method clear seems much more obvious to me (even without looking at the method) than these 3 re-assignments.

I'd prefer the previous call self.clear() here.

MSeifert04 · 2019-02-16T08:50:05Z

astropy/io/fits/header.py

+        header = cls()
+        for idx, card in enumerate(cards):
+            header._cards.append(card)
+            keyword = Card.normalize_keyword(card.keyword)


This seems to be copied from the append method, couldn't this be factored into a separate method?

If you can find a good name yes ;)

_add_card?

The method could not contain the _cards.append since .append can also insert the card at any position ... so it would just take care of the indices.

Could you elaborate? I don't understand the previous comment.
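
To illustrate the point about index bookkeeping, here is a minimal sketch built on a stand-in class rather than the real astropy Header. The names MiniHeader, _update_indices and from_cards are hypothetical; the sketch only assumes, as in the diff above, that cards live in a _cards list and that a keyword-to-positions index is kept next to it. The helper records positions only, because where the card ends up in _cards differs between callers.

```python
from collections import defaultdict


class MiniHeader:
    """Stand-in for a header: a card list plus a keyword -> positions index."""

    def __init__(self):
        self._cards = []  # (keyword, value) tuples in file order
        self._keyword_indices = defaultdict(list)

    def _update_indices(self, keyword, idx):
        # Hypothetical shared helper: records the position only, so each
        # caller stays free to decide where the card goes in self._cards.
        self._keyword_indices[keyword.upper()].append(idx)

    def append(self, keyword, value):
        # Simplest case: the card goes at the end of the list.
        self._cards.append((keyword, value))
        self._update_indices(keyword, len(self._cards) - 1)

    @classmethod
    def from_cards(cls, cards):
        # Bulk construction: one pass, no per-card positional checks.
        header = cls()
        for idx, (keyword, value) in enumerate(cards):
            header._cards.append((keyword, value))
            header._update_indices(keyword, idx)
        return header


header = MiniHeader.from_cards(
    [("SIMPLE", True), ("HISTORY", "a"), ("HISTORY", "b")]
)
header.append("history", "c")
print(dict(header._keyword_indices))
```

An insertion in the middle of the list would additionally have to shift the stored positions of every later card, which is why that case cannot simply reuse the same helper.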

MSeifert04 · 2019-02-16T08:53:32Z

astropy/io/fits/header.py

@@ -938,13 +953,14 @@ def keys(self):
        instance has the same behavior.
        """

-        return self.__iter__()


Why was this changed? It seems unlikely this will affect performance, does it?

About this and .clear above, I included a commit to change a few non-obvious lines of code:
04f2c36
I think that these changes help a lot in the understanding of the flow in this complex class. For instance in __init__ I prefer to duplicate three lines instead of having to search each time where some variable is defined.

But it makes it harder to understand the semantics. Searching for definitions is an IDE thing while understanding the semantics and invariants is a Developer thing. And fits is complicated enough we shouldn't make it harder.

This is just me thinking. If that makes it easier for you to maintain the module go ahead (you're a lot more active than me so it's your judgement call). :)

I don't think that the change makes things harder ;). If you look at the items, keys and values definitions, they are also much more consistent and easier to understand now.

Okay sorry, maybe this was the worst comment for my argument. 😅

I don't mind the changes to keys and values (much).

MSeifert04 · 2019-02-16T08:56:22Z

Can you share the FITS file, or the code to produce one like it?

saimn · 2019-02-17T20:54:41Z

I cannot share this specific file. I could build a similar one, but the gain should be noticeable for any file with many keywords (and not specifically for files with many HDUs - I updated the issue title -, that will be a follow-up PR). So it would be useful if you can test on other files ;)
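
For anyone without a comparable file, a rough way to generate one is sketched below. It writes header-only FITS HDUs by hand with the standard library, so the keyword names (KEY0000, KEY0001, and so on), the counts, and the file name are arbitrary placeholders. The result is not meant to resemble the original file beyond having many HDUs and many keywords.

```python
import os
import tempfile

BLOCK = 2880  # FITS files are made of 2880-byte blocks
CARD = 80     # each header card is 80 ASCII characters


def card(keyword, value):
    """Format one fixed-format card (logical, integer or short string)."""
    if isinstance(value, bool):  # bool first: it is a subclass of int
        field = ("T" if value else "F").rjust(20)
    elif isinstance(value, int):
        field = str(value).rjust(20)
    else:
        field = ("'" + str(value).ljust(8) + "'").ljust(20)
    return f"{keyword:<8}= {field}".ljust(CARD)


def header_block(cards):
    """Join cards, add END, and pad with spaces to whole 2880-byte blocks."""
    text = "".join(cards) + "END".ljust(CARD)
    padded_len = -(-len(text) // BLOCK) * BLOCK  # round up to a block
    return text.ljust(padded_len).encode("ascii")


def write_test_file(path, n_ext=300, n_keys=200):
    """Write a primary HDU plus n_ext image extensions, all without data."""
    extra = [card(f"KEY{i:04d}", i) for i in range(n_keys)]
    primary = [card("SIMPLE", True), card("BITPIX", 8), card("NAXIS", 0),
               card("EXTEND", True)]
    with open(path, "wb") as f:
        f.write(header_block(primary + extra))
        for n in range(1, n_ext + 1):
            ext = [card("XTENSION", "IMAGE"), card("BITPIX", 8),
                   card("NAXIS", 0), card("PCOUNT", 0), card("GCOUNT", 1),
                   card("EXTVER", n)]
            f.write(header_block(ext + extra))


path = os.path.join(tempfile.mkdtemp(), "many_hdus.fits")
write_test_file(path)
print(os.path.getsize(path))
```

The resulting path can then be substituted for the file name in the timeit commands from the description.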

The profiler result shows that a lot of time is spent in Header.append (hence the new _fromcards) and in keyword parsing.
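
A profile of this kind can be collected with the standard library alone; the sketch below is one possible way, not necessarily how the result above was produced. The helper names profile_call and count_hdus and the file name in the comment are placeholders, and count_hdus needs astropy installed.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def count_hdus(path):
    # Same workload as the timeit commands; requires astropy.
    from astropy.io import fits
    with fits.open(path) as hdul:
        return len(hdul)


# Stand-in workload so the sketch runs without a FITS file at hand;
# use profile_call(count_hdus, "some_file.fits") to profile header parsing.
result, report = profile_call(sorted, [3, 1, 2])
print(report)
```

Sorting by cumulative time puts the callers that dominate the run at the top of the report.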

MSeifert04 · 2019-02-18T17:21:38Z

Thanks for the profiling!

May seem like a stupid idea but could you put the non-performance changes into a different PR, because I really like the performance improvements and I don't want the PR to suffer from my nitpicking comments about the non-performance-related changes? Or is that too much trouble?

saimn · 2019-02-18T23:10:27Z

Ok, I reverted the change for the clear call. I still find it weird to have to call a clear method as the first line of __init__, but this is not a big deal. I have another branch with a dozen commits based on this one, so I would prefer not having to split things now, and the cosmetic changes are rather limited.

mhvk

Looks good! The only thing I'm not too fond of is the introduction of VALUE_INDICATOR_LEN - it is a very small speed improvement at the cost of a bit of readability. But not a big deal at all.

pllim · 2019-02-19T15:35:16Z

CircleCI failure is #8431 and unrelated. Since there are 2 approvals, I am merging this. Thanks!

pllim · 2019-02-19T15:43:44Z

By the way, do we need a benchmark for this over at https://github.com/astropy/astropy-benchmarks ?

saimn · 2019-02-19T22:01:11Z

Thanks for the reviews.

@mhvk - I don't like it either, but the overhead of a function call is not negligible when it is done for thousands of keywords. It would be great if Python could optimize this itself.
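
The trade-off can be measured with a small timeit comparison such as the sketch below. The two split functions and the sample card are made up for illustration and are much simpler than the real keyword parsing; only VALUE_INDICATOR and VALUE_INDICATOR_LEN come from the diff.

```python
import timeit

VALUE_INDICATOR = "= "
VALUE_INDICATOR_LEN = len(VALUE_INDICATOR)  # computed once at import

CARD_IMAGE = "NAXIS   =                    2 / number of axes"


def split_with_len_call(image):
    # Calls len() again for every card.
    pos = image.find(VALUE_INDICATOR)
    return image[:pos].strip(), image[pos + len(VALUE_INDICATOR):].strip()


def split_with_constant(image):
    # Reuses the module-level constant instead.
    pos = image.find(VALUE_INDICATOR)
    return image[:pos].strip(), image[pos + VALUE_INDICATOR_LEN:].strip()


for func in (split_with_len_call, split_with_constant):
    seconds = timeit.timeit(lambda: func(CARD_IMAGE), number=200_000)
    print(f"{func.__name__}: {seconds:.3f} s")
```

Whether the difference is measurable depends on the interpreter version and on how much other work is done per card.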

@pllim - Yes, it would be great. I will see if I can add some benchmarks.

saimn added 4 commits February 15, 2019 23:56

Add a faster method to create a Header from a list of Card

8982e7d

Simplify code, making things more explicit

04f2c36

Reorganize Card._parse_keyword to avoid unnecessary operations

8b84a1f

Fix use of _check_if_rvkc_image

4b64774

saimn added io.fits Performance labels Feb 15, 2019

saimn added this to the v3.2 milestone Feb 15, 2019

saimn requested review from drdavella and MSeifert04 February 15, 2019 23:16

saimn changed the title ~~Optimize fits.Header for files with many HDUs~~ Optimize fits.Header parsing for files with many HDUs Feb 15, 2019

MSeifert04 reviewed Feb 16, 2019

saimn changed the title ~~Optimize fits.Header parsing for files with many HDUs~~ Optimize fits.Header parsing Feb 17, 2019

saimn added 2 commits February 17, 2019 21:56

Add changelog entry [skip ci]

04965c3

Use len for VALUE_INDICATOR_LEN

5766e77

Revert change in __init__

26c61be

MSeifert04 approved these changes Feb 19, 2019

mhvk approved these changes Feb 19, 2019

pllim merged commit c73640b into astropy:master Feb 19, 2019

saimn deleted the fits-perf branch February 19, 2019 21:57

saimn mentioned this pull request Feb 20, 2019

Add benchmarks for FITS Header parsing astropy/astropy-benchmarks#76

Closed

saimn mentioned this pull request Mar 16, 2019

Improve FITS header performance #8502

Merged

saimn mentioned this pull request Mar 25, 2019

Add fits benchmarks astropy/astropy-benchmarks#80

Merged

saimn mentioned this pull request May 24, 2019

Add lazy parsing of FITS headers #5593

Closed
