
Speed up writing of FITS tables with string columns #6920

Merged
merged 12 commits on Dec 18, 2017

Conversation

astrofrog
Member

@astrofrog astrofrog commented Dec 1, 2017

This is a follow-up to #6821 and speeds up specifically writing tables with string columns.

At the moment, if a table contains a string column, that string column is first decoded by io.fits to a unicode array (if it is not one already) before being re-encoded to a byte array. This means that if the original column was a byte string column (which one could get after reading in a FITS file with Table.read), there is an unnecessary conversion to unicode and back. Furthermore, there was previously a Python loop over all string elements to remove trailing whitespace. This PR uses the character_as_bytes functionality developed in #6821 to avoid the conversion to unicode when a byte string array is passed in, and also includes a significantly sped-up _rstrip_replace provided by @mhvk in #6906.

@mhvk - I feel bad including your code without assigning the commit to you, so if you want the credit, feel free to edit this PR and change the existing commit to be under your user!

Test code:

import numpy as np
from astropy.table import Table

N = 10_000_000

t = Table()
t['floats1'] = np.random.random(N)
t['ints1'] = np.random.randint(0, 100, N)
t['strings'] = b'some strings'

t.write('test_write.fits', overwrite=True)

The table initialization takes 1.5 s, and writing the table itself takes 22.1 s before this PR and 2.1 s after, a speedup of roughly 10x; peak memory usage also goes down by a factor of ~2.5:

[benchmark plot: perf]

(the 'speed up _rstrip_replace' curve also includes character_as_bytes)

If I set t['strings'] to a unicode column instead of bytes column:

t['strings'] = 'some strings'

Things are also faster with this PR:

[benchmark plot: perf_u]

(the 'speed up _rstrip_replace' curve also includes character_as_bytes)

In both cases, writing a table still seems to double the memory used by the table, but I'm not sure whether we can avoid that. One use case I'd like to test is whether, if I have a larger-than-memory table, I can read it in with memory mapping, modify a cell, and write it out again without using much memory.

Now we might want to consider whether to put some of the optimizations lower down in io.fits - for instance if doing BinTableHDU.from_columns with a byte array, it will be converted to unicode internally then back to bytes as mentioned above. Check this out:

import numpy as np
from astropy.io.fits import BinTableHDU
x = np.repeat(b'a', 100_000_000)
array = np.array(x, dtype=[('col', 'S1')])
hdu = BinTableHDU.from_columns(array)
print(hdu.data.dtype)

This takes almost a minute and 2.2 GB of memory:

[memory profile: hdu_false]

The output for the dtype is (numpy.record, [('col', 'S1')]). If I call it with:

hdu = BinTableHDU.from_columns(array, character_as_bytes=True)

it takes 2 seconds and 350 MB of memory:

[memory profile: hdu_true]

And it produces exactly the same output. So maybe BinTableHDU.from_columns shouldn't even have a character_as_bytes option and it should always default to True. But I need to think about the implications of this some more and would value input from FITS experts (e.g. @saimn @MSeifert04).

…which avoids auto-converting byte string columns to unicode. This provides a significant speedup when writing FITS tables with strings.
…cant (~40x) speedup compared to the original implementation.
…rite, not by default when using table_to_hdu
@astropy-bot

astropy-bot bot commented Dec 1, 2017

Hi there @astrofrog 👋 - thanks for the pull request! I'm just a friendly 🤖 that checks for issues related to the changelog and making sure that this pull request is milestoned and labeled correctly. This is mainly intended for the maintainers, so if you are not a maintainer you can ignore this, and a maintainer will let you know if any action is required on your part 😃.

Everything looks good from my point of view! 👍

If there are any issues with this message, please report them here.

@mhvk
Contributor

mhvk commented Dec 1, 2017

@astrofrog - no worry about credit! I did edit your version a little, as I realized ravel() could go wrong in some cases: if the array is not contiguous in memory, it will make a copy. So, now I try to do the same by setting the shape, and if that fails I just let the array be (i.e., it will use somewhat more memory, but I think one really needs to try to hit that case; a proper FITS column should never have the problem).
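
(For reference, a minimal numpy sketch, not from the PR, of the distinction described here: ravel() silently copies a non-contiguous array, so in-place edits would land in the copy, whereas assigning to .shape refuses and raises, letting the code fall back safely.)

import numpy as np

a = np.arange(12).reshape(3, 4)[:, :2]       # non-contiguous view
print(np.shares_memory(a, a.ravel()))        # False: ravel() made a copy
try:
    a.shape = -1                             # in-place reshape is refused...
except AttributeError as exc:
    print('shape assignment raised:', exc)   # ...so the code can fall back safely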

Contributor

@saimn saimn left a comment

A few comments, mostly for the last version from @mhvk

if dt.kind not in 'SU':
    raise TypeError("This function can only be used on string arrays")
bpc = 1 if dt.kind == 'S' else 4
dt_int = "{0}{1}u{2}".format(dt.itemsize // bpc, dt.byteorder, dt.bpc)
Contributor

Should use bpc instead of dt.bpc.

# the right length. Trailing spaces (which are represented as 32) are then
# converted to null characters (represented as zeros). To avoid creating
# large temporary mask arrays, we loop over chunks (attempting to do that
# on a 1-D version of the array; large memomry may still be needed in the
Contributor

typo: memomry

# arrays, although the chunks will now be larger.
if b.ndim > 1:
    try:
        b.shape = -1
Contributor

Hmm, it took me some time to understand the previous version, but now I'm lost: if you flatten the array, how do you know the beginning/end of each string? And the chunking below may split a string? After some testing it does not seem to work, though it works without this.

Contributor

Duh, that's stupid of me!!

@saimn
Contributor

saimn commented Dec 1, 2017

I canceled Travis because it would fail with dt.bpc.

@saimn
Contributor

saimn commented Dec 1, 2017

Apart from the change in rstrip_replace this is really straightforward, just avoiding useless bytes/unicode conversion, wow! For rstrip_replace, the new implementation is less readable for sure, but given the gain we can live with it ;).

@astrofrog
Member Author

@mhvk - could you fix _rstrip_replace directly in this branch? (thanks for the previous commit!)

@saimn - what do you think about my question regarding the following example:

import numpy as np
from astropy.io.fits import BinTableHDU
x = np.repeat(b'a', 100_000_000)
array = np.array(x, dtype=[('col', 'S1')])
hdu = BinTableHDU.from_columns(array)
print(hdu.data.dtype)

(see towards the end of the PR description)

@mhvk
Contributor

mhvk commented Dec 1, 2017

OK, I tried to correct my mistakes. Still worked from the on-line editor, so no guarantees...

@astrofrog
Member Author

@mhvk - thanks! I wonder whether we might want to have a centralized place in Astropy to keep Numpy helper functions like this (in other words, functions that try to get around Numpy performance or functionality issues)?

# equal the number of characters in each string.
bpc = 1 if dt.kind == 'S' else 4
dt_int = "{0}{1}u{2}".format(dt.itemsize // bpc, dt.byteorder, bpc)
b = np.array(array, copy=False).view(dt_int)
Contributor

Actually, why are we doing the np.array here? If it is not a proper ndarray, this will make a copy and the in-place modification will have no effect. Maybe replace with b = array.view(dt_int, np.ndarray) (it is useful, I think, to remove any subclass here).

Member Author
@astrofrog astrofrog Dec 1, 2017

Because array is a chararray and things don't work otherwise... (view doesn't work with a chararray)

Member Author
@astrofrog astrofrog Dec 1, 2017

Specifically:

In [12]: array = np.array([b'a ', b'b ', b'c ']).view(np.chararray)

In [13]: array.view('2|u1')
Out[13]: ---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/miniconda3/envs/dev/lib/python3.6/site-packages/IPython/core/formatters.py in __call__(self, obj)
    691                 type_pprinters=self.type_printers,
    692                 deferred_pprinters=self.deferred_printers)
--> 693             printer.pretty(obj)
    694             printer.flush()
    695             return stream.getvalue()

~/miniconda3/envs/dev/lib/python3.6/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    378                             if callable(meth):
    379                                 return meth(obj, self, cycle)
--> 380             return _default_pprint(obj, self, cycle)
    381         finally:
    382             self.end_group()

~/miniconda3/envs/dev/lib/python3.6/site-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    493     if _safe_getattr(klass, '__repr__', None) is not object.__repr__:
    494         # A user-provided repr. Find newlines and replace them with p.break_()
--> 495         _repr_pprint(obj, p, cycle)
    496         return
    497     p.begin_group(1, '<')

~/miniconda3/envs/dev/lib/python3.6/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    691     """A pprint that just redirects to the normal repr function."""
    692     # Find newlines and replace them with p.break_()
--> 693     output = repr(obj)
    694     for idx,output_line in enumerate(output.splitlines()):
    695         if idx:

~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/numeric.py in array_repr(arr, max_line_width, precision, suppress_small)
   1896     if arr.size > 0 or arr.shape == (0,):
   1897         lst = array2string(arr, max_line_width, precision, suppress_small,
-> 1898                            ', ', "array(")
   1899     else:  # show zero-length shape unless it is (0,)
   1900         lst = "[], shape=%s" % (repr(arr.shape),)

~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/arrayprint.py in array2string(a, max_line_width, precision, suppress_small, separator, prefix, style, formatter)
    461     else:
    462         lst = _array2string(a, max_line_width, precision, suppress_small,
--> 463                             separator, prefix, formatter=formatter)
    464     return lst
    465 

~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/arrayprint.py in _array2string(a, max_line_width, precision, suppress_small, separator, prefix, formatter)
    334     lst = _formatArray(a, format_function, len(a.shape), max_line_width,
    335                        next_line_prefix, separator,
--> 336                        _summaryEdgeItems, summary_insert)[:-1]
    337     return lst
    338 

~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/arrayprint.py in _formatArray(a, format_function, rank, max_line_len, next_line_prefix, separator, edge_items, summary_insert)
    529             if leading_items or i != trailing_items:
    530                 s += next_line_prefix
--> 531             s += _formatArray(a[-i], format_function, rank-1, max_line_len,
    532                               " " + next_line_prefix, separator, edge_items,
    533                               summary_insert)

~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/defchararray.py in __getitem__(self, obj)
   1850 
   1851     def __getitem__(self, obj):
-> 1852         val = ndarray.__getitem__(self, obj)
   1853 
   1854         if isinstance(val, character):

~/miniconda3/envs/dev/lib/python3.6/site-packages/numpy/core/defchararray.py in __array_finalize__(self, obj)
   1847         # The b is a special case because it is used for reconstructing.
   1848         if not _globalvar and self.dtype.char not in 'SUbc':
-> 1849             raise ValueError("Can only create a chararray from string data.")
   1850 
   1851     def __getitem__(self, obj):

ValueError: Can only create a chararray from string data.

Contributor

OK, but then we still should just do a .view(dt_int, np.ndarray) - that way we get a failure rather than a copy for the wrong input.

Contributor

ps. The view does work if you also pass in the class.

Member Author

Ah I see, I didn't know about that trick!

Member Author

It doesn't seem to work for chararray unfortunately. However I changed it so that I only do the explicit call to np.array if the array is a chararray, which should avoid accidental copies I think.

Contributor

That's weird; I had checked:

In [8]: np.char.array(['abc', 'bcd']).view('<3u4', np.ndarray)
Out[8]: 
array([[ 97,  98,  99],
       [ 98,  99, 100]], dtype=uint32)

Might it be version-dependent?

Member Author

Ah I think I got it to work now, not sure what was happening before :-/

@mhvk
Contributor

mhvk commented Dec 1, 2017

@astrofrog - ideally we upstream improvements. Though this particular one is so specific to stripping spaces that I think it is fine to keep it in astropy. It may indeed make sense to move it to utils...

@astrofrog
Member Author

@mhvk - I agree about upstreaming improvements, but sometimes we want the improvement now versus waiting for the fix to be in the oldest supported version of Numpy, so I meant there could be a place where things like that live (and if they never make it upstream, that's fine too)

@astrofrog
Member Author

Related to chararray, I found this function:

def _get_recarray_field(array, key):
    """
    Compatibility function for using the recarray base class's field method.
    This incorporates the legacy functionality of returning string arrays as
    Numeric-style chararray objects.
    """

    # Numpy >= 1.10.dev recarray no longer returns chararrays for strings
    # This is currently needed for backwards-compatibility and for
    # automatic truncation of trailing whitespace
    field = np.recarray.field(array, key)
    if (field.dtype.char in ('S', 'U') and
            not isinstance(field, chararray.chararray)):
        field = field.view(chararray.chararray)
    return field

I wonder whether we might want to consider getting rid of chararrays in io.fits. As indicated in the numpy docs, chararrays are not really recommended anymore and are pretty ancient, so it might be confusing to still return them to users. We also now only support Numpy >= 1.10 which the above comment alludes to. I'm not entirely sure what we would risk breaking in practice for users.

@bsipocz
Member

bsipocz commented Dec 1, 2017

@mhvk - thanks! I wonder whether we might want to have a centralized place in Astropy to keep Numpy helper functions like this (or in other words functions that try and get around Numpy performance or functionality issues)?

There used to be numpycompat.py in utils/compat, and there is also still a numpy library there. I wonder whether these could live there? They would just need some documentation about the relevant versions if migrated upstream, so we know when it's OK to remove them.

@mhvk
Contributor

mhvk commented Dec 1, 2017

Yes, utils.compat.numpy is close, but at the time I created it we agreed we would only use it for things that were in the process of being upstreamed, i.e., the compatibility was meant to indicate that these things overrode numpy functions.

I think a new utils.numpy might be the way forward for little routines such as this. Probably could start with some stuff from utils.misc (NumpyRNGContext, ShapedLikeNDArray, check_broadcast, IncompatibleShapeError, dtype_bytes_or_chars).

@mhvk
Contributor

mhvk commented Dec 1, 2017

@astrofrog - I think it would be good to get rid of chararray - do raise a separate issue for it!

@bsipocz
Member

bsipocz commented Dec 1, 2017

What about having it private, utils._numpy, so even if they make their way upstream, we can just move them around to compat, without the need of having to think about deprecation, etc for downstream users?

@saimn
Contributor

saimn commented Dec 1, 2017

@astrofrog - I started a long answer but now I'm having doubts: is the array converted to unicode when you create the BinTableHDU with from_columns? I thought that it stores the byte array and converts to unicode only when the column is accessed, but I need to look at it in more detail.

@saimn
Contributor

saimn commented Dec 1, 2017

About chararray deprecation, there is already an issue: #3862

@saimn
Contributor

saimn commented Dec 2, 2017

Ok, I understand a bit better now ;) - the data is indeed converted when creating the BinTableHDU...
But the issue with BinTableHDU.from_columns(array, character_as_bytes=True) is that it completely disables the bytes-to-unicode conversion, so if you print the column (or use it) you get bytes.
I think we need to find a way to disable the conversion for inputs (to be able to store bytes internally, as Table does) but still have the conversion when the data is used.

@astrofrog
Member Author

astrofrog commented Dec 2, 2017

@saimn - ok thanks, this is helpful. I think the reason the strings get converted to unicode even if you pass a bytes array in is because of the following blocks of code:

                # Make the ndarrays in the Column objects of the ColDefs
                # object of the HDU reference the same ndarray as the HDU's
                # FITS_rec object.
                for idx, col in enumerate(self.columns):
                    col.array = self.data.field(idx)

and

        # Now replace the original column array references with the new
        # fields
        # This is required to prevent the issue reported in
        # https://github.com/spacetelescope/PyFITS/issues/99
        for idx in range(len(columns)):
            columns._arrays[idx] = data.field(idx)

Specifically, doing data.field(idx) causes _convert_other to be called (see inside the definition of field), which in turn contains the clause about decoding the array. To be honest, I'm a bit lost with all the different references to columns, including what the purpose of columns._arrays is. Currently the comments above are actually wrong if the data is converted - that is, columns._arrays and col.array are NOT references to the underlying data, since they might be converted.

One thing that isn't clear to me is whether the following current behavior is buggy (this is before this PR):

In [1]: import numpy as np
   ...: from astropy.io.fits import BinTableHDU
   ...: x = np.repeat(b'a', 10_000_000)
   ...: array = np.array(x, dtype=[('col', 'S1')])
   ...: hdu = BinTableHDU.from_columns(array)
   ...: 

In [2]: hdu.data
Out[2]: 
FITS_rec([('a',), ('a',), ('a',), ..., ('a',), ('a',), ('a',)], 
      dtype=(numpy.record, [('col', 'S1')]))

In [3]: hdu.data['col']
Out[3]: 
chararray(['a', 'a', 'a', ..., 'a', 'a', 'a'], 
      dtype='<U1')

Specifically I would expect that since I explicitly initialized the BinTableHDU with a byte/string array, there should be no automatic conversion to unicode (in fact the Out[2] output is confusing as it indicates the dtype is S1 but the output itself is unicode).

At the end of the day, I think you are right: we should never be storing unicode arrays inside the FITS code, and should only ever decode to unicode on output. I do think we need to decide about the behavior of the last example above, that is if we do initialize something with a bytes array explicitly, should it ever be turned into unicode automatically.

…r write_table_fits since one would always want to use this when writing (as unicode arrays can't be written to FITS)
@astrofrog
Member Author

I think that getting these kinds of improvements when using plain io.fits is going to require a lot more work, as the discussion above makes clear. I think ultimately we should never convert a whole array of bytes to unicode, and would probably want to do that lazily when elements are accessed (as a kind of unicode sandwich).

In any case, I think the PR as it is now already provides improvements when using Table.write and doesn't have any downside or backward-incompatibility as far as I can see. As such I'd like to propose that it gets a final review, then I can open an issue to track some of the items raised in the discussion above.

One last thing I was wondering - I think FITS doesn't allow vector columns for strings, correct? So maybe we can simplify _rstrip_inplace a little by assuming arrays are always 1D?
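
(Purely as an illustration of the lazy-decoding "unicode sandwich" idea mentioned above, and not io.fits API, a hypothetical wrapper could keep the column as bytes internally and decode only when elements are accessed.)

import numpy as np

class LazyStrColumn:
    """Hypothetical sketch: store bytes, decode to str only on access."""

    def __init__(self, byte_array):
        self._data = byte_array                 # stays 'S' dtype, never copied

    def __getitem__(self, item):
        val = self._data[item]
        if isinstance(val, bytes):              # scalar access: decode on the fly
            return val.decode('ascii')
        return np.char.decode(val, 'ascii')     # slice: decode just that piece

col = LazyStrColumn(np.array([b'a', b'bb', b'ccc']))
print(col[1], col[:2])                          # bb ['a' 'bb']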

@astrofrog astrofrog changed the title WIP: Speed up writing of FITS tables with string columns Speed up writing of FITS tables with string columns Dec 2, 2017
@@ -236,7 +236,7 @@ def write_table_fits(input, output, overwrite=False):
Whether to overwrite any existing file without warning.
"""

-    table_hdu = table_to_hdu(input)
+    table_hdu = table_to_hdu(input, character_as_bytes=True)
Member Author
@astrofrog astrofrog Dec 2, 2017

Just FYI, this is deliberate - there are no situations I can think of under which we should expose character_as_bytes as a keyword argument in write_table_fits (unlike for reading)
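
(A quick illustration, not part of the diff, of why bytes is the natural representation on the write path: numpy unicode columns cost 4 bytes per character, while the byte-string form, which is what FITS character columns actually store, costs 1.)

import numpy as np

u = np.array(['some strings'])          # unicode column, 4 bytes per character
b = np.char.encode(u, 'ascii')          # byte-string column, 1 byte per character
print(u.dtype, u.dtype.itemsize)        # <U12 48
print(b.dtype, b.dtype.itemsize)        # |S12 12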

@astrofrog astrofrog added this to the v3.0.0 milestone Dec 2, 2017
@saimn
Contributor

saimn commented Dec 4, 2017

Specifically, doing data.field(idx) causes _convert_other to be called (see inside the definition for field), which in turn contains the clause about decoding the array.

Yes, I came to the same blocks of code. I think that it should be possible to add the character_as_bytes option to .field and then internally choose when to convert (for output) or not (for input). But I agree that we should keep this for later, it probably needs more time and work.

To be honest I'm a bit lost with all the different references to columns, and including what the purpose of columns._arrays is.

Same here 🙀, I'm always lost with all the cross-references between the BinTableHDU, FITS_rec, ColDefs and Column classes!

One last thing I was wondering - I think FITS doesn't allow vector columns for strings, correct? So maybe we can simplify _rstrip_inplace a little by assuming arrays are always 1D?

I don't know - is there a reason why multi-dimensional arrays of strings should be forbidden?

Otherwise, this looks good and is a clear improvement!

@astrofrog
Member Author

@saimn - thanks for approving! @taldcroft, can we go ahead with this?

@pllim pllim added this to Needs decision in 3.0 Feature Planning Dec 13, 2017
Member
@taldcroft taldcroft left a comment

I haven't dug through existing tests to check for coverage, but the points that I marked need to be tested somewhere. If you can point to existing tests that's fine. Otherwise looks great!

# Note: the code will work if this fails; the chunks will just be larger.
if b.ndim > 2:
    try:
        b.shape = -1, b.shape[-1]
Member

Need test of this case.

try:
    b.shape = -1, b.shape[-1]
except AttributeError:
    pass
Member

Need test of this case.

Member Author

Done

# mask which will tell whether we're in a sequence of trailing spaces.
mask = np.ones(c.shape[:-1], dtype=bool)
# loop over the characters in the strings, in reverse order.
for i in range(-1, -c.shape[-1], -1):
Member

Need explicit test of leading / trailing / embedded spaces, e.g. c = np.array([[' a b', 'bbbb'], [' c ', 'd']]).

Member Author

Done
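
(A sketch of the kind of test requested above; the import path and exact helper name are assumptions about where the function ends up living.)

import numpy as np
from astropy.io.fits.util import _rstrip_inplace   # assumed location

def test_rstrip_inplace_spaces():
    # leading, trailing and embedded spaces, for both bytes and unicode
    for dtype in ('S4', 'U4'):
        c = np.array([[' a b', 'bbbb'], [' c  ', 'd   ']], dtype=dtype)
        expected = np.array([[' a b', 'bbbb'], [' c', 'd']], dtype=dtype)
        _rstrip_inplace(c)
        np.testing.assert_array_equal(c, expected)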

c = b[j:j + bufsize]
# mask which will tell whether we're in a sequence of trailing spaces.
mask = np.ones(c.shape[:-1], dtype=bool)
# loop over the characters in the strings, in reverse order.
Member

This little algorithm is so clever that I would suggest expanding the comment. In particular the way that the backward loop works on the i-th character within every string at once and effectively terminates within each string by setting the mask to zero. Save any future generations the trouble I had in figuring this little gem out.

Member Author

Done

Contributor

Yes, good to add comments to my little trick... Which really mostly shows I'm too lazy to code in C...
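
(For future readers, a standalone simplification of the trick being discussed, bytes-only and without the chunking: view each string as a row of character codes and walk the positions from the end, masking out strings as soon as a non-space character is hit.)

import numpy as np

def rstrip_bytes_inplace(arr):
    b = arr.view(np.uint8).reshape(arr.shape + (arr.dtype.itemsize,))
    # mask is True only for strings still inside their run of trailing spaces
    mask = np.ones(b.shape[:-1], dtype=bool)
    for i in range(-1, -b.shape[-1] - 1, -1):   # last character first
        mask &= b[..., i] == ord(' ')           # still a trailing space?
        b[..., i][mask] = 0                     # replace it with NUL
    return arr

a = np.array([b'a  ', b'bb ', b' cc'])
print(rstrip_bytes_inplace(a))                  # [b'a' b'bb' b' cc']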

raise TypeError("This function can only be used on string arrays")
# View the array as appropriate integers. The last dimension will
# equal the number of characters in each string.
bpc = 1 if dt.kind == 'S' else 4
Member

Needs testing ('S' type and 'U' type, with the embedded spaces case).

Member Author

Done

@taldcroft
Member

As a more general (and naive) comment about FITS and our implementation, I guess I can infer from the existing code that stripping trailing spaces in input data is considered the correct and desired behavior? I note that leading spaces are preserved, and null-terminated (shorter) strings are handled correctly, but trailing spaces are just dropped.

@astrofrog
Member Author

@taldcroft - thanks for the review! I'll add some tests. I'm going to assume the reason for the behavior is something to do with the fact that this is what Numpy does:

In [2]: np.array(['a', 'bb', ' cc'])
Out[2]: 
array(['a', 'bb', ' cc'],
      dtype='<U3')

and one might want this kind of array to round-trip? I'm not sure though.

@taldcroft
Member

taldcroft commented Dec 15, 2017

@astrofrog - the point is that in numpy, leading and trailing spaces are always preserved, but when passing through FITS, any trailing spaces are lost (while leading spaces are preserved).

In [2]: c = np.array(['a', 'bb ', ' cc'])
In [5]: t = Table([c])
In [6]: t.write('junk.fits', overwrite=True)
In [7]: t2 = Table.read('junk.fits')
In [9]: t2['col0'].view(np.ndarray)
Out[9]: 
array([b'a', b'bb', b' cc'], 
      dtype='|S3')

But now I just happened on #2608, and it seems from the comment by @mdboom that there is a real-world motivation for right stripping. Older FITS files may use space-padding instead of terminating strings. So I rescind my question, AND it looks like we can close #2608 since io.fits now rstrips?

@astrofrog
Member Author

Ha, I guess the answer is always "older FITS files"

@astrofrog
Member Author

@taldcroft - I've implemented your changes, including a new test specifically for _rstrip_replace

Member
@taldcroft taldcroft left a comment

🎉

@saimn
Contributor

saimn commented Dec 18, 2017

So I rescind my question, AND it looks like we can close #2608 since io.fits now rstrips?

@taldcroft: @pllim's comment there suggests the opposite. I have not verified it, but I think io.fits rstrips and this is then lost when converting to Table.

Thanks for approval, so let's merge this 🎉
