New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up writing of FITS tables with string columns #6920
Conversation
…which avoids auto-converting byte string columns to unicode. This provides a significant speedup when writing FITS tables with strings.
…cant (~40x) speedup compared to the original implementation.
…rite, not by default when using table_to_hdu
Hi there @astrofrog 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃. Everything looks good from my point of view! 👍 If there are any issues with this message, please report them here. |
@astrofrog - no worry about credit! I did edit your version a little, as I realized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, mostly for the last version from @mhvk
astropy/io/fits/fitsrec.py
Outdated
if dt.kind not in 'SU': | ||
raise TypeError("This function can only be used on string arrays") | ||
bpc = 1 if dt.kind == 'S' else 4 | ||
dt_int = "{0}{1}u{2}".format(dt.itemsize // bpc, dt.byteorder, dt.bpc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use bpc
instead of dt.bpc
.
astropy/io/fits/fitsrec.py
Outdated
# the right length. Trailing spaces (which are represented as 32) are then | ||
# converted to null characters (represented as zeros). To avoid creating | ||
# large temporary mask arrays, we loop over chunks (attempting to do that | ||
# on a 1-D version of the array; large memomry may still be needed in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: memomry
astropy/io/fits/fitsrec.py
Outdated
# arrays, although the chunks will now be larger. | ||
if b.ndim > 1: | ||
try: | ||
b.shape = -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it took me some time to understand the previous version, but now I'm lost: if you flatten the array how do you know the beginning/end of each string ? and the chunking below may split a string ? After some testing it does not seem to work, though it works without this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dud, that's stupid of me!!
I canceled Travis because it would fail with |
Apart from the change in |
@mhvk - could you fix @saimn - what do you think about my question regarding the following example:
(see towards the end of the PR description) |
OK, I tried to correct my mistakes. Still worked from the on-line editor, so no guarantees... |
@mhvk - thanks! I wonder whether we might want to have a centralized place in Astropy to keep Numpy helper functions like this (or in other words functions that try and get around Numpy performance or functionality issues)? |
astropy/io/fits/fitsrec.py
Outdated
# equal the number of characters in each string. | ||
bpc = 1 if dt.kind == 'S' else 4 | ||
dt_int = "{0}{1}u{2}".format(dt.itemsize // bpc, dt.byteorder, bpc) | ||
b = np.array(array, copy=False).view(dt_int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, why are we doing the np.array
here? If it is not a proper array, this will copy and nothing will happen. Maybe replace with b = array.view(dt_int, np.ndarray)
(it is useful, I think, so remove any subclass here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because array is a chararray
and things don't work otherwise... (view doesn't work with a chararray
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically:
In [12]: array = np.array([b'a ', b'b ', b'c ']).view(np.chararray)
In [13]: array.view('2|u1')
Out[13]: ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/miniconda3/envs/dev/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
691 type_pprinters=self.type_printers,
692 deferred_pprinters=self.deferred_printers)
--> 693 printer.pretty(obj)
694 printer.flush()
695 return stream.getvalue()
~/miniconda3/envs/dev/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
378 if callable(meth):
379 return meth(obj, self, cycle)
--> 380 return _default_pprint(obj, self, cycle)
381 finally:
382 self.end_group()
~/miniconda3/envs/dev/lib/python3.6/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
493 if _safe_getattr(klass, '__repr__', None) is not object.__repr__:
494 # A user-provided repr. Find newlines and replace them with p.break_()
--> 495 _repr_pprint(obj, p, cycle)
496 return
497 p.begin_group(1, '<')
~/miniconda3/envs/dev/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
691 """A pprint that just redirects to the normal repr function."""
692 # Find newlines and replace them with p.break_()
--> 693 output = repr(obj)
694 for idx,output_line in enumerate(output.splitlines()):
695 if idx:
~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/numeric.py in array_repr(arr, max_line_width, precision, suppress_small)
1896 if arr.size > 0 or arr.shape == (0,):
1897 lst = array2string(arr, max_line_width, precision, suppress_small,
-> 1898 ', ', "array(")
1899 else: # show zero-length shape unless it is (0,)
1900 lst = "[], shape=%s" % (repr(arr.shape),)
~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/arrayprint.py in array2string(a, max_line_width, precision, suppress_small, separator, prefix, style, formatter)
461 else:
462 lst = _array2string(a, max_line_width, precision, suppress_small,
--> 463 separator, prefix, formatter=formatter)
464 return lst
465
~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/arrayprint.py in _array2string(a, max_line_width, precision, suppress_small, separator, prefix, formatter)
334 lst = _formatArray(a, format_function, len(a.shape), max_line_width,
335 next_line_prefix, separator,
--> 336 _summaryEdgeItems, summary_insert)[:-1]
337 return lst
338
~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/arrayprint.py in _formatArray(a, format_function, rank, max_line_len, next_line_prefix, separator, edge_items, summary_insert)
529 if leading_items or i != trailing_items:
530 s += next_line_prefix
--> 531 s += _formatArray(a[-i], format_function, rank-1, max_line_len,
532 " " + next_line_prefix, separator, edge_items,
533 summary_insert)
~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/defchararray.py in __getitem__(self, obj)
1850
1851 def __getitem__(self, obj):
-> 1852 val = ndarray.__getitem__(self, obj)
1853
1854 if isinstance(val, character):
~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/defchararray.py in __array_finalize__(self, obj)
1847 # The b is a special case because it is used for reconstructing.
1848 if not _globalvar and self.dtype.char not in 'SUbc':
-> 1849 raise ValueError("Can only create a chararray from string data.")
1850
1851 def __getitem__(self, obj):
ValueError: Can only create a chararray from string data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but then we still should just do a .view(dt_int, np.ndarray)
- that way we get a failure rather than a copy for the wrong input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ps. The view does work if you also pass in the class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, I didn't know about that trick!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem to work for chararray unfortunately. However I changed it so that I only do the explicit call to np.array if the array is a chararray, which should avoid accidental copies I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's weird; I had checked:
In [8]: np.char.array(['abc', 'bcd']).view('<3u4', np.ndarray)
Out[8]:
array([[ 97, 98, 99],
[ 98, 99, 100]], dtype=uint32)
Might it be version-dependent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I think I got it to work now, not sure what was happening before :-/
@astrofrog - ideally we upstream improvements. Though this particular one is so specific to stripping spaces that I think it is fine to keep it in astropy. It may indeed make sense to move it to |
@mhvk - I agree about upstreaming improvements, but sometimes we want the improvement now versus waiting for the fix to be in the oldest supported version of Numpy, so I meant there could be a place where things like that live (and if they never make it upstream, that's fine too) |
Related to
I wonder whether we might want to consider getting rid of chararrays in io.fits. As indicated in the numpy docs, chararrays are not really recommended anymore and are pretty ancient, so it might be confusing to still return them to users. We also now only support Numpy >= 1.10 which the above comment alludes to. I'm not entirely sure what we would risk breaking in practice for users. |
There used to be |
Yes, I think a new |
@astropy - I think it would be good to get rid of |
What about having it private, |
@astrofrog - I started a long answer but now I doubt, the array is converted to unicode when you create the |
About chararray deprecation, there is already an issue: #3862 |
Ok, I understand a bit better now ;), the data is indeed converted when creating the BinTableHDU... |
…e np.ndarray is for chararrays)
@saimn - ok thanks, this is helpful. I think the reason the strings get converted to unicode even if you pass a bytes array in is because of the following blocks of code: # Make the ndarrays in the Column objects of the ColDefs
# object of the HDU reference the same ndarray as the HDU's
# FITS_rec object.
for idx, col in enumerate(self.columns):
col.array = self.data.field(idx) and # Now replace the original column array references with the new
# fields
# This is required to prevent the issue reported in
# https://github.com/spacetelescope/PyFITS/issues/99
for idx in range(len(columns)):
columns._arrays[idx] = data.field(idx) Specifically, doing One thing that isn't clear to me is whether the following current behavior is buggy (this is before this PR): In [1]: import numpy as np
...: from astropy.io.fits import BinTableHDU
...: x = np.repeat(b'a', 10_000_000)
...: array = np.array(x, dtype=[('col', 'S1')])
...: hdu = BinTableHDU.from_columns(array)
...:
In [2]: hdu.data
Out[2]:
FITS_rec([('a',), ('a',), ('a',), ..., ('a',), ('a',), ('a',)],
dtype=(numpy.record, [('col', 'S1')]))
In [3]: hdu.data['col']
Out[3]:
chararray(['a', 'a', 'a', ..., 'a', 'a', 'a'],
dtype='<U1') Specifically I would expect that since I explicitly initialized the BinTableHDU with a byte/string array, there should be no automatic conversion to unicode (in fact the At the end of the day, I think you are right: we should never be storing unicode arrays inside the FITS code, and should only ever decode to unicode on output. I do think we need to decide about the behavior of the last example above, that is if we do initialize something with a bytes array explicitly, should it ever be turned into unicode automatically. |
…r write_table_fits since one would always want to use this when writing (as unicode arrays can't be written to FITS)
I think that having these kinds of improvements when using plain io.fits is going to require a lot more work, as made clear from the discussion above. I think ultimately we should never convert a whole array of bytes to unicode and probably would want to do that lazily when elements are accessed (as a kind of unicode sandwich). In any case, I think the PR as it is now already provides improvements when using One last thing I was wondering - I think FITS doesn't allow vector columns for strings, correct? So maybe we can simplify |
@@ -236,7 +236,7 @@ def write_table_fits(input, output, overwrite=False): | |||
Whether to overwrite any existing file without warning. | |||
""" | |||
|
|||
table_hdu = table_to_hdu(input) | |||
table_hdu = table_to_hdu(input, character_as_bytes=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, this is deliberate - there are no situations I can think of under which we should expose character_as_bytes
as a keyword argument in write_table_fits
(unlike for reading)
Yes, I came to the same blocks of code. I think that it should be possible to add the
Same here 🙀 , i'm always lost with all the cross-references between the BinTableHDU, FITSRec, Coldefs and Column classes !
I don't know, is there a reason why muti-dimensionnal arrays of strings should be forbidden ? Otherwise, this looks good and is a clear improvement! |
@saimn - thanks for approving! @taldcroft, can we go ahead with this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't dug through existing tests to check for coverage, but the points that I marked need to be tested somewhere. If you can point to existing tests that's fine. Otherwise looks great!
astropy/io/fits/fitsrec.py
Outdated
# Note: the code will work if this fails; the chunks will just be larger. | ||
if b.ndim > 2: | ||
try: | ||
b.shape = -1, b.shape[-1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need test of this case.
astropy/io/fits/fitsrec.py
Outdated
try: | ||
b.shape = -1, b.shape[-1] | ||
except AttributeError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need test of this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
astropy/io/fits/fitsrec.py
Outdated
# mask which will tell whether we're in a sequence of trailing spaces. | ||
mask = np.ones(c.shape[:-1], dtype=bool) | ||
# loop over the characters in the strings, in reverse order. | ||
for i in range(-1, -c.shape[-1], -1): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need explicit test of leading / trailing / embedded spaces, e.g. c = np.array([[' a b', 'bbbb'], [' c ', 'd']])
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
astropy/io/fits/fitsrec.py
Outdated
c = b[j:j + bufsize] | ||
# mask which will tell whether we're in a sequence of trailing spaces. | ||
mask = np.ones(c.shape[:-1], dtype=bool) | ||
# loop over the characters in the strings, in reverse order. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This little algorithm is so clever that I would suggest expanding the comment. In particular the way that the backward loop works on the i-th character within every string at once and effectively terminates within each string by setting the mask to zero. Save any future generations the trouble I had in figuring this little gem out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good to add comments to my little trick... Which really mostly shows I'm too lazy to code in C...
astropy/io/fits/fitsrec.py
Outdated
raise TypeError("This function can only be used on string arrays") | ||
# View the array as appropriate integers. The last dimension will | ||
# equal the number of characters in each string. | ||
bpc = 1 if dt.kind == 'S' else 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs testing ('S' type and 'U' type, with the embedded spaces case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
As a more general (and naive) comment about FITS and our implementation, I guess I can infer from the existing code that stripping trailing spaces in input data is considered the correct and desired behavior? I note that leading spaces are preserved, and null-terminated (shorter) strings are handled correctly, but trailing spaces are just dropped. |
@taldcroft - thanks for the review! I'll add some tests. I'm going to assume the reason for the behavior is something to do with the fact that this is what Numpy does:
and one might want this kind of array to round-trip? I'm not sure though. |
@astrofrog - the point is that in numpy, leading and trailing spaces are always preserved, but when passing through FITS, any trailing spaces are lost (while leading spaces are preserved).
But now I just happened on #2608, and it seems from the comment by @mdboom that there is a real-world motivation for right stripping. Older FITS files may use space-padding instead of terminating strings. So I rescind my question, AND it looks like we can close #2608 since io.fits now rstrips? |
Ha, I guess the answer is always "older FITS files" |
@taldcroft - I've implemented your changes, including a new test specifically for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎉
@taldcroft : @pllim's comment there suggests the opposite. I have not verified but I think that Thanks for approval, so let's merge this 🎉 |
This is a follow-up to #6821 and speeds up specifically writing tables with string columns.
At the moment, if a table contains a string column, that string column is first decoded by io.fits to a unicode array (if not already the case) before getting re-encoded to a byte array. This means that if the original column was a byte string column (which one could get after reading in a FITS file with Table.read), there is an unnecessary conversion to unicode and back. Furthermore, there was previously a Python loop over all string elements to remove trailing whitespace. This PR uses the
character_as_bytes
functionality developed in #6821 to avoid the conversion to unicode in the case a byte string array is passed, and also includes a significantly sped up_rstrip_replace
provided by @mhvk in #6906.@mhvk - I feel bad including your code without assigning the commit to you, so if you want the credit, feel free to edit this PR and change the existing commit to be under your user!
Test code:
The table initialization takes 1.5s, and the writing of the table itself takes 22.1s before this PR and 2.1s after, giving a speedup of a factor of 10x, and the peak memory usage goes down by a factor of ~2.5x:
(the 'speed up
_rstrip_replace
' curve also includescharacter_as_bytes
)If I set
t['strings']
to a unicode column instead of bytes column:Things are also faster with this PR:
(the 'speed up
_rstrip_replace
' curve also includescharacter_as_bytes
)In both cases writing a table still seems to double the memory used by the table, but I'm not sure if we can avoid that. One use case I'd like to test is whether if I have a larger-than-memory table I can read it in with memory mapping, modify a cell, and write it out again without using much memory.
Now we might want to consider whether to put some of the optimizations lower down in io.fits - for instance if doing
BinTableHDU.from_columns
with a byte array, it will be converted to unicode internally then back to bytes as mentioned above. Check this out:This takes almost a minute and 2.2Gb of memory:
The output for the dtype is
(numpy.record, [('col', 'S1')])
. If I call it with:it takes 2 seconds and 350Mb of memory:
And produces the exact same output. So maybe
BinTableHDU.from_columns
shouldn't even have acharacter_as_bytes
option and it should always default toTrue
. But I need to think about the implications about this some more and would value input from FITS experts (e.g. @saimn @MSeifert04).