Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interface Table to pandas DataFrame #3504

Merged
merged 10 commits into from Apr 25, 2015
Merged

Interface Table to pandas DataFrame #3504

merged 10 commits into from Apr 25, 2015

Conversation

@astrofrog
Copy link
Member

astrofrog commented Feb 14, 2015

This is simple but often requested by users. Still needs docs, but let me know if you have any feedback in the mean time.

Remaining to-dos:

  • Add docs

Closes #2804

@astrofrog astrofrog added the table label Feb 14, 2015
@astrofrog astrofrog force-pushed the astrofrog:pandas branch from a2e3113 to 51f5264 Feb 14, 2015
print(self.has_mixin_columns)
if self.has_mixin_columns:
raise ValueError("Cannot convert a table with mixin columns to a pandas DataFrame")
for column in self.columns:

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 15, 2015

Member

I think this can be a bit cleaner with for column in self.columns.values(): then check column.ndim > 1. If you like functional, if any(col.ndim > 1 for col in self.columns.values()): is even better.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Feb 15, 2015

This may be a can of worms you don't want to open, but what about masked data? This is supported in both Table and DataFrame, but in different ways. At the least you should do some checking and raise exceptions if not supported.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Feb 15, 2015

I think you can support mixin columns since they just go to object arrays in as_array():

In [66]: t = Table([Time(['2001:001', '2002:002']), ['a', 'b'], [1, 2]], names=['a', 'b', 'c'])

In [67]: nt = t.as_array()

In [68]: ntd = OrderedDict((name, nt[name]) for name in nt.dtype.names)

In [69]: df = DataFrame(ntd)

In [70]: df['a'][1].jyear
Out[70]: 2002.002737850787

Of course if you have a coordinate object with a million elements this will blow up...

@Cadair

This comment has been minimized.

Copy link
Member

Cadair commented Feb 15, 2015

from_pandas would be nice if such a thing doesn't already exist.

@astrofrog

This comment has been minimized.

Copy link
Member Author

astrofrog commented Feb 15, 2015

@Cadair - this PR does include from_pandas. It's pretty simple for now, but once I start messing with the masking it might start to get more complex.

@astrofrog

This comment has been minimized.

Copy link
Member Author

astrofrog commented Feb 15, 2015

@taldcroft - indeed we need to be able to support the masking. I'll play around with it.

@astrofrog

This comment has been minimized.

Copy link
Member Author

astrofrog commented Feb 15, 2015

Note to self: regarding the masking, I think that when masked columns are present they need to be converted to float to be able to represent NaN (and in that case a warning should be raised about the data type loss). For string arrays, missing values are represented by passing an object array that contains None.

@Cadair

This comment has been minimized.

Copy link
Member

Cadair commented Feb 15, 2015

@astrofrog I should probably read things better :p

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Feb 15, 2015

@astrofrog - yes now I remember not liking the way pandas handles missing values. Works great for float, no so much for any other type.

@astrofrog astrofrog force-pushed the astrofrog:pandas branch from 3c8dede to 6dc174a Feb 15, 2015
@astrofrog

This comment has been minimized.

Copy link
Member Author

astrofrog commented Feb 15, 2015

This should now support masked values. At the moment the caveat is that if a Pandas series does not contain any masked values, it gets converted to a Column, and if it contains any masked values, to a MaskedColumn. One can imagine having a masked column that has no masked values in a specific table but here this would just get converted to Column. I think this is ok, but I'd be interested in hearing if not. Also, different numerical types are lost for masked columns since pandas needs to convert to float64 for these (to be able to represent nan).

Anyway, this is ready for an initial review/testing to see if there is anything that I have overlooked.

@astrofrog

This comment has been minimized.

Copy link
Member Author

astrofrog commented Feb 15, 2015

If this becomes more complicated I could also move the code to pandas.py and then call helper functions from to_pandas and from_pandas.

@astrofrog astrofrog force-pushed the astrofrog:pandas branch from 57ad605 to adb0172 Feb 16, 2015
@gully

This comment has been minimized.

Copy link

gully commented Feb 17, 2015

OK, I just ran this on my table which was the output of a 791 x 17 astroquery job. It worked just fine (not that there was any doubt), with exactly the behavior I would expect. Thank you @astrofrog !
👍

@astrofrog astrofrog changed the title WIP: interface Table to pandas DataFrame Interface Table to pandas DataFrame Feb 18, 2015
@astrofrog

This comment has been minimized.

Copy link
Member Author

astrofrog commented Feb 18, 2015

For now I'm leaving the conversion disabled for mix-in columns since I think we may want to decide how to handle these better. For example for a Time column we should be able to actually make a date/time series in pandas, etc.

I've added a documentation page but I'm getting a weird issue - to_pandas and from_pandas are not showing up in the API docs. I must be missing something obvious so will look into it later (if anyone spots the issue in the mean time, please let me know!)

if self.has_mixin_columns:
raise ValueError("Cannot convert a table with mixin columns to a pandas DataFrame")

for column in self.columns:

This comment has been minimized.

Copy link
@mhvk

mhvk Feb 18, 2015

Contributor

Rather than check beforehand, maybe let DataFrame raise an error if something bad is passed in? It keeps the code cleaner, and if pandas starts to support multi-dimensional columns, things will just work. (or If you think it is important to give a clearer error message, maybe have the actual data frame assignments inside try/except clauses?).

This comment has been minimized.

Copy link
@astrofrog

astrofrog Feb 18, 2015

Author Member

The issue is that the pandas error messages are not always very explicit. For the mixin columns it might actually work but just be very inefficient. So I prefer to control it at this stage. We could put it in a try...except but I'm not sure if it's easy to know what the issue is so as to know what better error to raise. I can look into it, but would be interested in @taldcroft's opinion here too.

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 18, 2015

Member

@mhvk - the error that Pandas gives is pretty cryptic:

In [13]: t = Table([[[1,2]]])
In [14]: print(t)
col0 [2]
--------
  1 .. 2
In [15]: pandas.DataFrame(np.array(t))
...
ValueError: If using all scalar values, you must must pass an index

I think pre-catching the exception is better. Also, I don't think support for multi-d columns is coming any time soon. I think this would cut rather deeply into the fundamental data structures (though I'm not an expert in Pandas so I could be wrong). I asked Wes about this several years ago and the answer was hmm... would have to think about that...

This comment has been minimized.

Copy link
@mhvk

mhvk Feb 18, 2015

Contributor

Definitely no big deal. Though since every usage of the dtype property recreates the dtypes of all the columns, maybe write:

if any(getattr(col, 'ndim', 1) > 1 for col in self.columns.values()):
    raise ...

This comment has been minimized.

Copy link
@astrofrog

astrofrog Apr 24, 2015

Author Member

Done


class TestPandas(object):

@pytest.mark.skipif('not HAS_PANDAS')

This comment has been minimized.

Copy link
@mhvk

mhvk Feb 18, 2015

Contributor

Can the skipif be on the class?

This comment has been minimized.

Copy link
@astrofrog

astrofrog Feb 18, 2015

Author Member

Yes, I think so

This comment has been minimized.

Copy link
@astrofrog

astrofrog Apr 24, 2015

Author Member

Done

dataframe : :class:`pandas.DataFrame`
A pandas :class:`pandas.DataFrame` instance
"""
from pandas import DataFrame

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 18, 2015

Member

Is there a general policy on doing try/except ImportError around optional dependencies? I guess it would be a bit silly for someone to expect to_pandas to work without pandas being installed, but I wonder if it might be better to get a message like "pandas package must be installed to convert Table to DataFrame object".

This comment has been minimized.

Copy link
@mhvk

mhvk Feb 18, 2015

Contributor

I think here it is reasonable not to try/except -- too much clutter. However, one could add a Raises section in the docstring, which includes the ImportError: as well as the two ValueError. Indeed, the documentation should warn one way or another that multi-d and mixin columns are not (yet) supported, and that masked columns are converted.

This comment has been minimized.

Copy link
@astrofrog

astrofrog Apr 24, 2015

Author Member

Done

d = DataFrame()
for name, column in self.columns.items():
if isinstance(column, MaskedColumn):
if column.dtype.kind in ['i']:

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 18, 2015

Member

I think you need 'u' here as well for unsigned int.

This comment has been minimized.

Copy link
@astrofrog

astrofrog Apr 24, 2015

Author Member

Done

elif column.dtype.kind in ['f', 'c']:
d[name] = column.filled(np.nan)
else:
d[name] = column.astype(np.object).filled(np.nan)

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 18, 2015

Member

Missing an else: clause here if not isinstance(column, MaskedColumn). (Though without mixins that can't actually happen, but good to future-proof things). Inspired by @mdboom, I've been using an idiom of collecting columns in a dict prior to making the table. So something like this might generically work and doesn't require an intermediate numpy array in the non-masked case. This also could support special-casing for certain mixins like you mentioned, e.g. Time => datetime:

out = OrderedDict()
for name, column in self.columns.items():
    if isinstance(column, MaskedColumn):
        ....
    elif isinstance(column, Time):
        out[name] = column.datetime  # this actually works in pandas, don't know if there is a more efficient way though
    else:  # Not a masked column
        out[name] = column

return DataFrame(out)

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 18, 2015

Member

You can take this further and support the existing built-in mixins.

elif isinstance(column, Quantity):
    out[name] = column.value
elif isinstance(column, SkyCoord):
    for rcn in column.representation_component_names:
        outname = name + '__' + rcn  # I like double underscore here for better namespace separation
        if outname in self.colnames: 
            # do something...
        out[outname] = getattr(column, rcn)

This comment has been minimized.

Copy link
@mhvk

mhvk Feb 18, 2015

Contributor

👍 to going via an OrderDict rather than as_array. I like that in that case DataFrame gets set in only one place. Furthermore, it would also still allow the check on column dimensionality to be done in an try/except clause -- I mean to keep the if statement, but only execute it if there was an actual error:

try:
    d = DataFrame(out)
except:
    if any(...):
       raise ValueError("Multi...")
    else:
       raise

The advantage (I think) is that the normal path is both as readable and as fast as it can possibly be, while the exception is still informative.

This comment has been minimized.

Copy link
@astrofrog

astrofrog Apr 24, 2015

Author Member

I've switched to OrderedDict. Mixin column support can come later.

A `Table` instance
"""

self = cls()

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 18, 2015

Member

It's annoying to me, but at this moment it is still noticeably faster to collect columns in a dict out and then create the table from that object at the end with cls(out, names=dataframe.columns, copy=False)

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 18, 2015

Member

Actually you can use an OrderedDict here as well.

This comment has been minimized.

Copy link
@astrofrog

astrofrog Mar 7, 2015

Author Member

Sounds good, but note that I can't do copy=False:

        # TODO: is this restriction still needed with no ndarray?
        if not copy:
>           raise ValueError('Cannot use copy=False with a dict data input')
E           ValueError: Cannot use copy=False with a dict data input
@mhvk

This comment has been minimized.

Copy link
Contributor

mhvk commented Feb 18, 2015

On the mixin: @taldcroft - I like the idea of supporting Time and SkyCoord, but think we should not do that in this PR. This mostly because I think the way the objects should be stored in a pandas dataframe should be left to the classes themselves. Indeed, they seem really good examples of what could go into the ColumnAttribute class that we discussed before; just like we thought it could override __str__, it could have a to_pandas and from_pandas method. If so, in the routine here, one would simply check whether the column had that attribute; indeed, with that in place, it would be good to just define such an attribute in MaskedColumn rather than special-case it here. See new issue #3518.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Feb 18, 2015

For reference, there is a long-running but still-active pandas issue pandas-dev/pandas#2485 about allowing metadata to be attached to pandas objects (with milestone "someday"). In order to have any hope of round-tripping astropy mixin columns we would need a metadata container on the pandas side.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented Feb 18, 2015

@mhvk - I agree with your proposal for the optimal implementation using the framework of ColumnAttribute (when it exists), but I don't see why not put in functionality now in a hardwired way if it won't change the user API. That would allow testing the basic interface concept in astropy master sooner rather than later. (Of course for SkyCoord there is a real API decision about mapping one coordinate column to multiple pandas columns).

OTOH @astrofrog might prefer to just limit the scope of this PR to non-mixins for the sake of getting something done and merged, and that would be fine.

elif column == 's':
assert np.all(t['s'] == np.array([b'a',b'b',b'c']))
else:
assert_allclose(t[column], d[column])

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 18, 2015

Member

For the integral values in your test case, shouldn't the float representations be exact and allow equality (not close) comparisons? The default tolerances for allclose wouldn't catch a few bit error.

This comment has been minimized.

Copy link
@taldcroft

taldcroft Feb 18, 2015

Member

Compare the to_pandas() dtypes to the original? I notice in the doc example that a Table string column got turned into an object column when round-tripped through pandas. That doesn't seem right.

This comment has been minimized.

Copy link
@astrofrog

astrofrog Mar 7, 2015

Author Member

That does seem to be what pandas does:

In [1]: from pandas import DataFrame

In [2]: d = DataFrame()

In [3]: d['a'] = ['a','b','c']

In [4]: d
Out[4]: 
   a
0  a
1  b
2  c

In [5]: d['a'].dtype
Out[5]: dtype('O')

I'm not sure if there is any way to force a string/unicode column in Pandas.

This comment has been minimized.

Copy link
@taldcroft

taldcroft Mar 7, 2015

Member

Now I recall discussion about pandas and saying that it always uses a variable-length container for strings. In the context of general support for UTF-8 this is a good thing, but I don't know if pandas is literally using a Python string/unicode object under the hood. That would be pretty inefficient.

This comment has been minimized.

Copy link
@astrofrog

astrofrog Mar 7, 2015

Author Member

Passing a bytes array does work:

In [4]: d['a'] = np.array(['a','b','c'],dtype='S1')

In [5]: d['a'].dtype
Out[5]: dtype('S1')

but not if passed as part of a dictionary:

In [9]: d2 = DataFrame({'a':np.array(['a','b','c'],dtype='S1'),'b':np.array(['a','b','c'],dtype='U1')})

In [10]: d2
Out[10]: 
      a  b
0  b'a'  a
1  b'b'  b
2  b'c'  c

In [11]: d2['a'].dtype
Out[11]: dtype('O')

🐼

This comment has been minimized.

Copy link
@astrofrog

astrofrog Mar 7, 2015

Author Member

@taldcroft - based on the above, if I add columns one by one to the data frame then for byte arrays it will actually have the correct type. For unicode I think it's always going to be O

@astrofrog astrofrog force-pushed the astrofrog:pandas branch from 622ce61 to d03683c Apr 25, 2015
taldcroft added a commit that referenced this pull request Apr 25, 2015
Interface Table to pandas DataFrame
@taldcroft taldcroft merged commit 0cf8f67 into astropy:master Apr 25, 2015
3 checks passed
3 checks passed
continuous-integration/appveyor AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.02%) to 76.63%
Details
@astrofrog

This comment has been minimized.

Copy link
Member Author

astrofrog commented Apr 25, 2015

@barentsen - just FYI, this has been merged! Let me know if you try it out and have any issues!

@embray

This comment has been minimized.

Copy link
Member

embray commented May 4, 2015

Oh, I mistakenly assumed this was using the unified I/O framework. To which my question becomes, why not? Table.read(some_dataframe) might be nice?

@astrofrog

This comment has been minimized.

Copy link
Member Author

astrofrog commented May 4, 2015

@embray - we could support that but I'd still recommend we keep the explicit methods in addition. In particular, writing is a bit trickier API-wise.

@taldcroft

This comment has been minimized.

Copy link
Member

taldcroft commented May 4, 2015

I actually thing that read() should be restricted to reading from a file-like object. A more natural way would be t = Table(some_dataframe).

@embray

This comment has been minimized.

Copy link
Member

embray commented May 5, 2015

I'm not sure I necessary agree with that. For example an "HDUList" object (despite its odd name) is sort of a high-level "file-like" object representing a FITS file, though not in the sense you mean (in that it's like a stream with a read and possibly seek methods). But it's still an object used for reading out of a FITS file, and is the recommended object to use when passing around a FITS file open for reading between different parts of code.

I think for Pandas you have more of a point though. A Pandas DF is an abstract data structure that is not tied to any particular file format.

@abigailStev

This comment has been minimized.

Copy link
Contributor

abigailStev commented Oct 2, 2015

I'm attempting to use .to_pandas() on an astropy Table object (from astroquery) and it's not working. My astropy is up-to-date from conda. Could someone more knowledgeable help me with this?

I would attach the ipy notebook (using python 2) i've been testing it with but that's not allowed, so here's the code:

from astroquery.sdss import SDSS
import astropy ## Using astropy v 1.0.4 (from Conda)
import pandas as pd

#### Get some data from the SDSS SQL server
cmd = """
SELECT TOP 10000
    p.u, p.g, p.r, p.i, p.z, s.class, s.z, s.zerr
FROM
    PhotoObj AS p
JOIN
    SpecObj AS s ON s.bestobjid = p.objid
WHERE
    p.u BETWEEN 0 AND 19.6 AND
    p.g BETWEEN 0 AND 20 AND
    (s.class = 'STAR' OR s.class = 'GALAXY' OR s.class = 'QSO')
"""
data = SDSS.query_sql(cmd)  # Loads data as an astropy Table
type(data)
#### yields: astropy.table.table.Table
df = data.to_pandas()
#### Gives an error!
"""
 ---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-d0f0089f5e2a> in <module>()
----> 1 df = data.to_pandas()

>AttributeError: 'Table' object has no attribute 'to_pandas'
"""
@embray

This comment has been minimized.

Copy link
Member

embray commented Oct 2, 2015

@abigailStev This feature will be in Astropy v1.1.0 which isn't out yet. If you got Astropy through conda then you'd only get the most recent release which is v1.0.4 (unless there's some mechanism to get updates from master).

Where'd you find out about this feature?

@abigailStev

This comment has been minimized.

Copy link
Contributor

abigailStev commented Oct 2, 2015

In this issue right here. Hadn't realized it wasn't out yet! I'll be patient :)

@embray

This comment has been minimized.

Copy link
Member

embray commented Oct 2, 2015

@abigailStev Usually you can check by looking at the milestone. That will tend to correspond with the lowest version number that something is available in (though there isn't a great way to tell if that version has been released yet, though if the milestone isn't closed it probably hasn't been yet, or just check the tags or changelog or something :/)

@bsipocz

This comment has been minimized.

Copy link
Member

bsipocz commented Oct 3, 2015

@abigailStev The other option is to look at the version number in the upper left corder of the docs. For the pandas page on "latest", it's v1.1dev... meaning 1.1 is not out yet.
screen shot 2015-10-03 at 11 10 39

@embray

This comment has been minimized.

Copy link
Member

embray commented Oct 5, 2015

That's why I asked--if @abigailStev had found it via the docs I was going to point that out. Good to know anyways though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.