Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In io.ascii, fall back to string if integers are too large #2234

Merged
merged 7 commits into from Mar 27, 2014

Conversation

taldcroft
Copy link
Member

If I try and read the following file:

a                             b
12121312311248721894712984728 1122

using Table, I get:

In [5]: t = Table.read('data', format='ascii')
ERROR: OverflowError: Python int too large to convert to C long [astropy.io.ascii.core]
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-5-bcb96ef3f221> in <module>()
----> 1 t = Table.read('data', format='ascii')

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/table/table.py in read(cls, *args, **kwargs)
   1725         passed through to the underlying data reader (e.g. `~astropy.io.ascii.ui.read`).
   1726         """
-> 1727         return io_registry.read(cls, *args, **kwargs)
   1728 
   1729     def write(self, *args, **kwargs):

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/io/registry.py in read(cls, *args, **kwargs)
    317 
    318         reader = get_reader(format, cls)
--> 319         table = reader(*args, **kwargs)
    320 
    321         if not isinstance(table, cls):

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/io/ascii/connect.py in read_asciitable(filename, **kwargs)
     19 def read_asciitable(filename, **kwargs):
     20     from .ui import read
---> 21     return read(filename, **kwargs)
     22 
     23 io_registry.register_reader('ascii', Table, read_asciitable)

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/io/ascii/ui.py in read(table, guess, **kwargs)
    154         guess = _GUESS
    155     if guess:
--> 156         dat = _guess(table, new_kwargs)
    157     else:
    158         reader = get_reader(**new_kwargs)

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/io/ascii/ui.py in _guess(table, read_kwargs)
    196 
    197             reader = get_reader(**guess_kwargs)
--> 198             dat = reader.read(table)
    199 
    200             # When guessing require at least two columns

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/io/ascii/core.py in read(self, table)
    851 
    852         self.data.masks(cols)
--> 853         table = self.outputter(cols, self.meta)
    854         self.cols = self.header.cols
    855 

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/io/ascii/core.py in __call__(self, cols, meta)
    649 
    650     def __call__(self, cols, meta):
--> 651         self._convert_vals(cols)
    652 
    653         # If there are any values that were filled and tagged with a mask bit then this

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/io/ascii/core.py in _convert_vals(self, cols)
    631                     if not issubclass(converter_type, col.type):
    632                         raise TypeError()
--> 633                     col.data = converter_func(col.str_vals)
    634                     col.type = converter_type
    635                 except (TypeError, ValueError):

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/io/ascii/core.py in converter(vals)
    590 
    591     def converter(vals):
--> 592         return numpy.array(vals, numpy_type)
    593     return converter, converter_type
    594 

OverflowError: Python int too large to convert to C long

Maybe when the integers are too large, strings should be used instead? This is a simplified version from the issue described in http://stackoverflow.com/questions/22617428/overflowerror-python-int-too-large-to-convert-to-c-long-with-astropy-table.

@astrofrog astrofrog added this to the v0.3.2 milestone Mar 25, 2014
@mdboom
Copy link
Contributor

mdboom commented Mar 25, 2014

Or maybe use Python longs stored in an object array. This would keep operations on these columns numerical rather than string-based (though obviously they'd still be far slower to work with than "native" ints).

@embray
Copy link
Member

embray commented Mar 25, 2014

I agree--an object array could be used for that column. Not ideal, but better than strings.

@taldcroft
Copy link
Member

There is a really simple fix that leads to an alternative behavior. Essentially what's happening is that the code below which tries various conversion options is missing OverflowError as a possible exception. If I put that in then it fails to do the int conversion if the native int size is too small and instead falls through to float.

/Volumes/Raptor/Library/Python/3.3/lib/python/site-packages/astropy-0.4.dev7437-py3.3-macosx-10.8-x86_64.egg/astropy/io/ascii/core.py in _convert_vals(self, cols)
    631                     if not issubclass(converter_type, col.type):
    632                         raise TypeError()
--> 633                     col.data = converter_func(col.str_vals)
    634                     col.type = converter_type
    635                 except (TypeError, ValueError):

Adding OverflowError above yields:

In [4]: print Table.read(['a b', '12121312311248721894712984728 1122'], format='ascii')
        a          b  
----------------- ----
1.21213123112e+28 1122

To me this seems like a better alternative than string or objects. What do you think? Doing objects is a bit scary to me because I'm not sure what else will break.

I'll try to post on stackoverflow later, but there is also an option on a 32-bit machine to explicitly specify the converter as int64, or if supported on a particular platform, int128.

@astrofrog
Copy link
Member Author

@taldcroft - converting to a float sounds good to me.

@taldcroft
Copy link
Member

Code attached.

@astrofrog
Copy link
Member Author

@taldcroft - this needs rebasing for some reason

@jkyeung
Copy link

jkyeung commented Mar 25, 2014

Speaking as a regular Python user (not an astropy user, and not even a NumPy user): I would be extremely wary of automatic conversion to float. It could be that incoming data was very carefully and specifically crafted to be integer. Silently introducing potential loss of precision doesn't strike me as the right thing to do. But I dunno; maybe astropy's domain is such that float is always a reasonable choice.

@astrofrog
Copy link
Member Author

Maybe a compromise is that this should emit a warning?

@mdboom
Copy link
Contributor

mdboom commented Mar 26, 2014

I was going to say the same thing as @jkyeung. Putting arbitrary precision Python longs in object arrays, while slower, doesn't introduce any data loss, which I think is more important. Maybe we can provide a flag to select the desired behavior here? Of course, the user can always convert from "object array of long ints" to floats if necessary after the fact, but the inverse is not true without data loss.

@astrofrog
Copy link
Member Author

Thinking about this more, I think that it's highly unlikely whoever stored >64-bit integers stored them as actual numbers, and much more likely they were intended as some kind of ID. In that case, I would argue that it would make more sense to use an array of strings than an object array of long ints. One can still easily convert an array of strings to an array of normal ints or long ints. This would also be in line with the idea that if a value can't be parsed, it stays as a string. Object arrays have the potential to confuse people, and will be less efficient than string arrays.

@taldcroft
Copy link
Member

OK, I'm persuaded that float is no good.

I'm on the fence about object vs. string. I'll say that in the past I've been burned by having a numpy array that looked like a normal int array but was actually an object dtype. I can't remember the specific problem, but it was one of those subtle issues that took a while to understand. Users with more limited knowledge might have a difficult time with that.

If you get a string instead then it's immediately obvious what's going on.

As a slight technical issue, at the point of the code in question, what you actually have is a list of strings. The only thing you know is that doing np.array(str_vals, dtype=np.int) gave an overflow error. So there is no list of Python ints at that point, but I suppose it could be created.

@taldcroft
Copy link
Member

I forgot to say that I do think there are cases where the big int was really intended as an int and getting a string will be initially surprising / annoying.

In any case a warning can be emitted which should help out.

@taldcroft
Copy link
Member

Thinking about this even more, the conversion function can also be user-specified, so you don't even know that int is the intended target. From that perspective the only universally applicable response is to leave the values as strings and emit a warning like "Warning: overflow error occurred during type conversion for column , leaving as string".

@taldcroft
Copy link
Member

I've reworked this to fall through to strings and emit a warning. What do y'all think?

@astrofrog
Copy link
Member Author

This works for me! Just a couple of comments:

  • It looks like the Travis failure is genuine
  • Should this be 0.4.0 since technically it changes the API (at least it changes the output type for some table columns)?

EDIT: ignore the second comment - at the moment the code would crash, so it's not like it's working at all. This can go in 0.3.2.

@taldcroft
Copy link
Member

@astrofrog - you can see the resolution of this in 0bfbd1f. I think this is sufficiently in the corner-case realm to leave as a known issue. Eventually numpy 1.5 support will go away.

@astrofrog
Copy link
Member Author

Sounds good! I've restarted Travis - feel free to merge once it passes.

@taldcroft
Copy link
Member

OK there is one unrelated Travis failure (SAMP), so I'm merging.

taldcroft added a commit that referenced this pull request Mar 27, 2014
In io.ascii, fall back to string if integers are too large
@taldcroft taldcroft merged commit 9c5debf into astropy:master Mar 27, 2014
@taldcroft taldcroft deleted the catch-overflow branch March 27, 2014 15:28
taldcroft added a commit that referenced this pull request Mar 27, 2014
In io.ascii, fall back to string if integers are too large
Conflicts:

	CHANGES.rst
	astropy/io/ascii/core.py
@dhomeier dhomeier mentioned this pull request Sep 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants