Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ascii.read `name` keyword cuts off first row #1692

Merged
merged 8 commits into from Oct 31, 2013

Conversation

@taldcroft
Copy link
Member

taldcroft commented Oct 29, 2013

I've included the text of the file I'm reading at the bottom of this issue. If I try the following:

from astropy.io import ascii 
t = ascii.read('file', names= ['a', 'b', 'c'])
len(t)

it says t has 26 rows, while the actual number of rows is 27. However, if I instead do

t = ascii.read('file')

it says t has 27 rows!

I guess there's a bug here somewhere in the names parsing, but it's not clear where...

cc @taldcroft

file:

3 3.503374e-02  27.543909 
4 4.143642e-02  23.133362 
6 5.296628e-02  17.879936 
8 6.335184e-02  14.784860 
10 7.294666e-02  12.708647 
12 8.194863e-02  11.202767 
16 9.863398e-02  9.138494 
20 1.140191e-01  7.770461 
40 1.798141e-01  4.561298
60 2.355798e-01  3.244847
80 2.860021e-01  2.496477
100 3.331038e-01  2.002067
120 3.780198e-01  1.645364
140 4.214853e-01  1.372562
160 4.640228e-01  1.155066
180 5.060299e-01  0.976168
200 5.478243e-01  0.825403
220 5.896708e-01  0.695861
240 6.317970e-01  0.582787
260 6.744042e-01  0.482790
280 7.176742e-01  0.393390
300 7.617747e-01  0.312724
320 8.068627e-01  0.239368
340 8.530875e-01  0.172213
360 9.005927e-01  0.110380
380 9.495178e-01  0.053166
400 1.000000e+00  0.000000

@ghost ghost assigned taldcroft Oct 28, 2013
@eteq

This comment has been minimized.

Copy link
Member Author

eteq commented Oct 28, 2013

Oh, and in the first case (when names is provided), the first row is the one that's missing (the one that starts with 3).

@astrofrog

This comment has been minimized.

Copy link
Member

astrofrog commented Oct 28, 2013

I ran into the same thing the other day - I think if names is provided, the header length should default to 0. What do you think @taldcroft?

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Oct 28, 2013

The workaround is to explicitly specify the no_header format:

In [3]: ascii.read(['1 2'], names=('a', 'b'))
Out[3]: 
<Table rows=0 names=('a','b')>
array([], 
      dtype=[('a', '<i8'), ('b', '<i8')])

In [4]: ascii.read(['1 2'], names=('a', 'b'), format='no_header')
Out[4]: 
<Table rows=1 names=('a','b')>
array([(1, 2)], 
      dtype=[('a', '<i8'), ('b', '<i8')])

The issue here is that names is providing a dual-role: if there are no names in the file is it supplying the column names, but if there are already names it is letting you override the existing ones (which precludes just setting the header length to 0). Between that and the guessing process (which includes automatically assuming that numeric column names are probably not real) the solution is not immediately obvious to me, but something does need fixing.

Somehow this issue seems familiar. I'm not sure if it actually exists or we just thought about it. @hamogu?

@hamogu

This comment has been minimized.

Copy link
Member

hamogu commented Oct 28, 2013

@taldcroft I think what you have in mind might be #885 (when header_start is given, data_start should default to header_start +1), which is related, but not that same.

And, just for the record, instead of no_header you can also explictely state in which line the data starts

In [2]: import astropy.io.ascii as ascii

In [3]: data = ['1 2', '3 4']

In [4]: ascii.read(data)
Out[4]: 
<Table rows=2 names=('col1','col2')>
array([(1, 2), (3, 4)], 
      dtype=[('col1', '<i8'), ('col2', '<i8')])

In [5]: ascii.read(data, names=['a', 'b'])
Out[5]: 
<Table rows=1 names=('a','b')>
array([(3, 4)], 
      dtype=[('a', '<i8'), ('b', '<i8')])

In [6]: ascii.read(data, names=['a', 'b'], data_start=0)
Out[6]: 
<Table rows=2 names=('a','b')>
array([(1, 2), (3, 4)], 
      dtype=[('a', '<i8'), ('b', '<i8')])

So, the question here and in #885 really is: What should the default for data_start be?
Looking at this example tt seems natural to say "a early as possible", that is:

  • default to 0 (with or without names given)
  • set to 1, if line 0 has characters and the rest of the has numbers
  • if header_start is given, then data_start+1.

We can probably implement that by putting no_header before Basic in the guess list. Currently, it's the other way round. (It's a one-liner, but let us agree that that's what we want, before I do it.)

However, that means that a table that is composed of text only would by default not have a header and thus always have columns called col1, col2 etc., so that for text tables one would explicetly have to set header_start=0, data_start=1 or maybe Reader=HasHeader.

That's the fundamental problem: If the program cannot determine if a tables has headers (i.e. the table is all numbers or all strings) should the first line be interpreted as a header or not?

@hamogu

This comment has been minimized.

Copy link
Member

hamogu commented Oct 28, 2013

This issue is labelled as a bug, but it might actually be called a feature ;-)

@astrofrog

This comment has been minimized.

Copy link
Member

astrofrog commented Oct 28, 2013

Another report of this on the mailing list! I think we do indeed need to change the defaults :)

@hamogu

This comment has been minimized.

Copy link
Member

hamogu commented Oct 28, 2013

I assume @taldcroft would want to go through a depreciation release with
a warning like
"This behaviour changed. If the first line if your string tables really
are header, use the following commend: XXX"?

I would advocate not to do that in this case, since the existing
behaviour confuses people already.

On 10/28/2013 12:00 PM, Thomas Robitaille wrote:

Another report of this on the mailing list! I think we do indeed need
to change the defaults :)


Reply to this email directly or view it on GitHub
#1692 (comment).

@eteq

This comment has been minimized.

Copy link
Member Author

eteq commented Oct 28, 2013

Also, is this reasonable to get fixed within the next week? (i.e., for the v0.3 RC?) I think @taldcroft said he's busy for much of this week, so perhaps @hamogu could do this?

And I agree deprecation may not be necessary here - given the ambiguity of bug vs. feature, you can always say it was a bug and thereby not require deprecation :)

@embray

This comment has been minimized.

Copy link
Member

embray commented Oct 28, 2013

I agree this should be considered a bug. Even though it was the "intended" behavior that's more by accident than by conscious design, as the default behavior is confusing enough to be called a bug.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Oct 28, 2013

PR almost ready... Trying to get to my next flight. :-)

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Oct 29, 2013

Code attached.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Oct 29, 2013

This PR makes two real changes (corresponding to the first two commits):

  • Refactor handling of the names arg (extending work that @hamogu did) so that the associated processing happens after the table is parsed rather than before.
  • This allows part two, which is to add a strict_names attribute the BaseReader. When True the reader checks the original (as-read) column names to see they look like reasonable column names. This happens in the guessing process prior to the above names processing, thus avoiding the confusing/wrong behavior that was reported.
@hamogu

This comment has been minimized.

Copy link
Member

hamogu commented Oct 29, 2013

Looks good to me.

@astrofrog

This comment has been minimized.

Copy link
Member

astrofrog commented Oct 30, 2013

@taldcroft - looks great to me! The guessing works well in the cases I tried. I have one small question - given:

In [15]: t2 = Table.read("""
a 2 3
4 5.5 6
""", format='ascii')

how do I tell it that I want it to use the first line as column names, even if it doesn't make sense? (i.e. how do I override the defaults?).

This also needs an entry in the CHANGES.rst, and maybe a short sentence in the docs?

@hamogu

This comment has been minimized.

Copy link
Member

hamogu commented Oct 30, 2013

@astrofrog I have not tried this, but if I understand the code correctly then guess=False should revert to the basic reader (which assumes that the first line is a header).
More general, all the options like data_start=5 and header_start=3 are still there (and if I remember correctly there are tests for that and since all tests passed, it should work).

@astrofrog

This comment has been minimized.

Copy link
Member

astrofrog commented Oct 30, 2013

@hamogu - you are correct, this is exactly what happens (turning off guess=False). So this looks good in my opinion. I think it just needs an entry in CHANGES.rst and it'll be good to go.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Oct 30, 2013

@astrofrog - about your question, the direct way to tell it to use the first line as column names is to give format='ascii.basic'. [EDIT] Sorry, it's true that you now also need guess=False.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Oct 30, 2013

The new commit e6b52de should make things "just work" more frequently for these common cases. It turns off the strict column name checking if you have explicitly specified a format (or Reader).

>>> from astropy.io import ascii
>>> print ascii.read(['a 1', '1 2'])
col1 col2
---- ----
   a    1
   1    2
>>> print ascii.read(['a 1', '1 2'], format='basic')
 a   1 
--- ---
  1   2
>>> from astropy.table import Table
>>> print Table.read(['a 1', '1 2'], format='ascii.basic')
 a   1 
--- ---
  1   2
>>> print Table.read(['a 1', '1 2'], format='ascii.no_header')
col1 col2
---- ----
   a    1
   1    2
>>> print Table.read(['1', '2'], format='ascii.basic')
 1 
---
  2

Hopefully this makes it less likely that you need to use guess=False. I need to add a few tests of this and update CHANGES.rst.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Oct 30, 2013

Assuming tests pass this is ready for final review and hopefully merge.

@astrofrog - please take note of 947d1a3, which makes Table.read behave more closely to io.ascii.read. I don't remember the precise rationale for defaulting to guess=False for Table.read. It passes tests locally so this change didn't break anything actually in our tests.

@astrofrog

This comment has been minimized.

Copy link
Member

astrofrog commented Oct 31, 2013

This works perfectly - thanks @taldcroft!

astrofrog added a commit that referenced this pull request Oct 31, 2013
ascii.read `name` keyword cuts off first row
@astrofrog astrofrog merged commit aa2786d into astropy:master Oct 31, 2013
1 check passed
1 check passed
default The Travis CI build passed
Details
@taldcroft taldcroft deleted the taldcroft:ascii/guess-basic branch Feb 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.