Read units row from ascii files #9665

Open
keflavich opened this issue Nov 24, 2019 · 9 comments
Comments

@keflavich
Contributor

Related to #4743 and #756, and loosely related to #9639 (which provides the workaround): we should have a mechanism for reading the units from the second (or n'th) header row of an ascii table.

Presently the only (?) human-readable format supported by astropy.table is the IPAC format, which includes additional information and requires very specific formatting, but it should be straightforward to include a units-parsing component in the generic ascii reader.

(if we have this feature somewhere, I was unable to find it by reading the docs...)

@taldcroft
Member

I would argue that ECSV is human-readable. What ECSV is NOT is human-writeable.

# %ECSV 0.9
# ---
# datatype:
# - {name: a, datatype: int64}
# - {name: b, unit: m, datatype: float64}
# - {name: c, datatype: string}
# schema: astropy-2.0
a b c
1 1.0 c
2 2.0 d

My question related to these requests to read such files is whether the data files with units in the second line are showing up "in the wild" (from collaborators or whatnot), or if you want to purposely write such a file in a new, implicitly-defined format.

@keflavich
Contributor Author

I disagree: ECSV is not very human-readable. It's not entirely unreadable, but the header is difficult to parse by eye and doesn't generalize well as more fields are added (while fixed width does):
image

Re: questions:
(1) Yes, these show up in the wild all the time.
(2) Yes, I'd like to write to this format. Generally I want to use it for small tables, not gigantic ones, but it's useful there.
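
To make the layout being asked for in (2) concrete, here is a minimal sketch in plain Python that renders the small a/b/c table from the ECSV example above with a units line under the header. `format_units_row_table` is a hypothetical helper written for this illustration; it is not an astropy writer, and the ` | ` delimiter is an assumption.

```python
def format_units_row_table(names, units, rows, delimiter=" | "):
    """Render columns as fixed-width text with a units line under the header.

    Hypothetical helper, not an astropy writer. `units` maps a column name
    to a unit string; columns without an entry get a blank unit cell.
    """
    table = [list(names), [units.get(name, "") for name in names]]
    table += [[str(value) for value in row] for row in rows]
    # Each column is padded to the width of its longest cell.
    widths = [max(len(line[i]) for line in table) for i in range(len(names))]
    return "\n".join(
        delimiter.join(cell.ljust(w) for cell, w in zip(line, widths)).rstrip()
        for line in table
    )


text = format_units_row_table(
    ["a", "b", "c"],
    {"b": "m"},  # only column b carries a unit, as in the ECSV header
    [(1, 1.0, "c"), (2, 2.0, "d")],
)
print(text)
```

The second line of the output holds only the unit cells, so a reader that knows which row to treat as units can recover them without any YAML header.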

@pllim
Member

pllim commented Nov 25, 2019

I am hesitant to support random formats on a whim. It is a slippery slope. We already have options for you to write your own parser by subclassing Table.

@keflavich
Contributor Author

I am not suggesting creating or supporting new formats, really. I just want to be able to treat a specified header row as 'units' in the general case. Sure, my use case is mostly for human-readable formats, but the application is broad.

@pllim
Member

pllim commented Nov 25, 2019

I have a lot of trauma from trying to parse human-written units in the past. For example, "angstroms" vs "angstrom" or "erg" vs "ergs" or "ct" vs "cts" vs "count" vs "counts". And when you get to flux or magnitude units, it is even worse. Where do we draw the line? Is it human-readable when the unit string in your header is very, very long compared to the actual data for that column?
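
As a rough illustration of the spelling problem described here, the sketch below maps the variants quoted above onto one spelling each before any unit parser sees them. The alias table and both function names are hypothetical, the canonical spellings on the right are an assumption, and nothing here is part of astropy.

```python
import re

# Hypothetical alias table: the left-hand spellings are the ones quoted in
# the comment above; the canonical names on the right are an assumption.
UNIT_ALIASES = {
    "angstroms": "angstrom",
    "ergs": "erg",
    "ct": "count",
    "cts": "count",
    "counts": "count",
}


def normalize_unit_token(token):
    """Return the canonical spelling of one unit word, or the word unchanged."""
    return UNIT_ALIASES.get(token.lower(), token)


def normalize_unit_string(text):
    """Normalize each alphabetic word; operators, digits and spacing are kept."""
    return re.sub(
        r"[A-Za-z]+", lambda match: normalize_unit_token(match.group(0)), text
    )


print(normalize_unit_string("ergs / s / cm2 / angstroms"))
print(normalize_unit_string("cts/s"))
```

A fixed table like this only covers spellings someone has already seen, which is exactly the "where do we draw the line" concern: every new file can bring a variant the table does not know.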

@keflavich
Contributor Author

But astropy.units already handles parsing. I'm perfectly fine with a reader that ignores units that it can't parse. If it's a human-readable file, it's also pretty easy to edit.

You've highlighted plenty of corner cases, but corner cases always exist. I'm not asking for a fully flexible, always-solves-all-problems solution here, but I do think it should be possible to read something as simple as this:

Field    | B3_res | B3_sens | B6_res | B6_sens
         | arcsec | mJy     | arcsec | mJy
W51-E    | 0.37   | 0.03    | 0.37   | 0.1
W51-IRS2 | 0.37   | 0.03    | 0.37   | 0.1
W43-MM1  | 0.37   | 0.03    | 0.37   | 0.1
W43-MM2  | 0.37   | 0.03    | 0.37   | 0.1

more simply than:

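from astropy import units as u
from astropy.io import ascii
from astropy.table import Table

# Read only the data rows, skipping the header line and the units line
tbl = ascii.read('/orange/adamginsburg/ALMA_IMF/analysis/requested.txt', data_start=2)
# Read again as fixed-width so that row 0 holds the unit strings
tbl_ = Table.read('/orange/adamginsburg/ALMA_IMF/analysis/requested.txt', format='ascii.fixed_width')
for colname in tbl.colnames:
    try:
        tbl[colname].unit = u.Unit(tbl_[colname][0])
    except (ValueError, TypeError):
        # Leave the column unitless if the cell is blank or cannot be parsed
        pass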

which is what I have to do now to get the units extracted from the same file as the data.
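
For comparison, the behaviour being requested (treat one specified header row as units) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `read_units_row_table` is a hypothetical helper, not an astropy reader; it assumes a `|` delimiter, leaves data cells as strings, and does not validate the unit strings.

```python
def read_units_row_table(text, delimiter="|", units_row=1):
    """Parse a delimited table whose `units_row`-th line holds the units.

    Hypothetical helper, not part of astropy. Returns (names, units, rows);
    blank unit cells become None and data cells are left as strings.
    """

    def split(line):
        return [cell.strip() for cell in line.split(delimiter)]

    lines = [line for line in text.splitlines() if line.strip()]
    names = split(lines[0])
    units = {
        name: (cell or None) for name, cell in zip(names, split(lines[units_row]))
    }
    rows = [
        dict(zip(names, split(line)))
        for index, line in enumerate(lines)
        if index not in (0, units_row)
    ]
    return names, units, rows


EXAMPLE = """\
Field    | B3_res | B3_sens | B6_res | B6_sens
         | arcsec | mJy     | arcsec | mJy
W51-E    | 0.37   | 0.03    | 0.37   | 0.1
W51-IRS2 | 0.37   | 0.03    | 0.37   | 0.1
"""

names, units, rows = read_units_row_table(EXAMPLE)
print(units)
```

The unit strings could then be handed to a unit parser one column at a time, skipping any that fail, which matches the "ignore units it can't parse" behaviour proposed above.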

@taldcroft
Member

taldcroft commented Nov 25, 2019

My real pushback against supporting easy reading of this in astropy core is that it encourages people to write in this format, which is IMO a bad thing. It is basically a new ASCII data format that is neither standardized nor a legacy format (unlike having the column names in the first row). But I do recognize that a certain degree of pragmatism is needed, and this issue comes up every so often, so I'm happy to hear arguments for accepting this format for reading.

In astropy core we could certainly add a subclass of the basic reader and a subclass of the fixed-width reader to do this. It might not be entirely trivial, because there are many options and we aim to make any reader class in the core work correctly for all of them.

With #9671 (and a slight mod to accept a masked Row for input) the code to read your file could be 2 lines:

units = Table.read(filename, format='ascii', data_start=1, data_end=2)[0]
tbl = Table.read(filename, format='ascii', units=units)

@taldcroft
Member

Sorry, I didn't get all the way back to your first response (about files in the wild), so ignore my first paragraph.

@keflavich
Contributor Author

👍 on #9671, that's a nice step in the right direction.

The fixed-width format with optional headers is, to me, an undefined legacy standard that has always existed. Fixed-width is the old FORTRAN standard, and adding headers to it is, imo, a strict improvement. I often encounter these as un-headered fixed-width files that I manually edit to include headers.
