Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pandas backend support for Table read/write #8381

Merged
merged 9 commits into from Mar 18, 2019

Conversation

taldcroft
Copy link
Member

@taldcroft taldcroft commented Jan 26, 2019

This makes it easy to use pandas read_* functions or DataFrame.to_* methods to read/write tables via pandas I/O. E.g. see #8379. This is a relatively bare-bones start but it does the basics of round-tripping a simple table. There are other more obscure pandas formats like feather or parquet that could be supported, but these all require additional packages so my initial plan is to just cover the common formats.

To do:

  • Changelog
  • Docs
  • HTML reader has a bs4 or lxml dependency, need to capture this in testing.

With #8255 the underlying pandas function/method docstrings should also show up. @astrofrog - any chance of reviewing that?

Example:

In [2]: t = simple_table()

In [3]: t.write('junk.csv', format='pandas.csv')

In [4]: Table.read('junk.csv', format='pandas.csv')
Out[4]: 
<Table length=3>
  a      b     c  
int64 float64 str1
----- ------- ----
    1     1.0    c
    2     2.0    d
    3     3.0    e

@codecov
Copy link

codecov bot commented Jan 26, 2019

Codecov Report

Merging #8381 into master will increase coverage by <.01%.
The diff coverage is 88.7%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8381      +/-   ##
==========================================
+ Coverage    86.8%   86.81%   +<.01%     
==========================================
  Files         385      386       +1     
  Lines       58104    58166      +62     
  Branches     1060     1060              
==========================================
+ Hits        50440    50495      +55     
- Misses       7049     7056       +7     
  Partials      615      615
Impacted Files Coverage Δ
astropy/table/__init__.py 100% <100%> (ø) ⬆️
astropy/io/misc/pandas/connect.py 88.52% <88.52%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72e2fb8...4388d89. Read the comment docs.

@astrofrog astrofrog self-requested a review January 26, 2019 21:54
@pllim
Copy link
Member

pllim commented Jan 28, 2019

This would be so cool! Do we have to worry about minversion for pandas?

@taldcroft
Copy link
Member Author

This would be so cool! Do we have to worry about minversion for pandas?

@pllim - I'm glad you are psyched! Can you and/or @saimn review?

About the minversion, I looked back in the docs to pandas 0.12 from Jan 2014 and from what I can see it has the same functions and methods here, and similar enough API. In particular there are a few hardcoded args in the connect wrapper that appear to be available as of pandas 0.12.

@taldcroft taldcroft requested review from pllim and saimn and removed request for astrofrog March 15, 2019 21:27
@taldcroft
Copy link
Member Author

The docs look like this:
image

@bsipocz
Copy link
Member

bsipocz commented Mar 15, 2019

About the minversion, I looked back in the docs to pandas 0.12 from Jan 2014 and from what I can see it has the same functions and methods here, and similar enough API. In particular there are a few hardcoded args in the connect wrapper that appear to be available as of pandas 0.12.

if we could just run one of the test jobs with this min version, and another one with the latest, that could cover us pretty well I think.

@taldcroft
Copy link
Member Author

OK, good that you suggested checking older versions. I had to make a few changes and got this working back to pandas 0.14. Before that it fails, so we have a concrete min version now.

@@ -34,8 +34,8 @@ env:
- PYTEST_VERSION=3.10
- MAIN_CMD='python setup.py'
- CONDA_DEPENDENCIES='Cython jinja2'
- CONDA_ALL_DEPENDENCIES='Cython jinja2 scipy h5py matplotlib pyyaml pandas pytz beautifulsoup4 ipython mpmath bleach bottleneck'
- DEV_PIP_DEP='asdf>=2.3 Cython jinja2 scipy h5py matplotlib pyyaml scikit-image pandas pytz beautifulsoup4 ipython mpmath bleach bottleneck'
- CONDA_ALL_DEPENDENCIES='Cython jinja2 scipy h5py matplotlib pyyaml pandas pytz html5lib beautifulsoup4 ipython mpmath bleach bottleneck'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we need to list html5lib as optional dependency in docs/install.rst, and while editing that file, please mention that 0.14 is the minimum pandas version required.

@taldcroft
Copy link
Member Author

@bsipocz - I made the suggested edits. Please check to see if I've done the new optional dependency on html5lib fully and correctly. I added a couple more things in 3fd90c8.

Copy link
Member

@pllim pllim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nitpicky nitpicks. LGTM. Thanks!

astropy/io/misc/pandas/connect.py Show resolved Hide resolved
"""
Test round-trip through pandas write/read for supported formats.

:param fmt: format name, e.g. csv, html, json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, any reason to use epydoc formatting in the docstring here? This does not matter because test function docstring is not exposed to users, still... why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's just the PyCharm default that got in there, not really by purpose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe @cdeil as pycharm power user can chime in and tell how to change the default to be numpydoc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyCharm has a docstring setting - choose "NumPy" when working on Astropy.
Screenshot 2019-03-18 at 22 00 44

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! My PyCharm experiment is looking better each day!

@pllim
Copy link
Member

pllim commented Mar 18, 2019

@Gabriel-p , do you want to take this for a spin with your use case?

Co-Authored-By: taldcroft <taldcroft@gmail.com>
@taldcroft
Copy link
Member Author

Gah, should have made that last nit-pick commit with no-skip.

@taldcroft taldcroft merged commit 642a776 into astropy:master Mar 18, 2019
@taldcroft taldcroft deleted the table-pandas branch March 18, 2019 20:46
@Gabriel-p
Copy link
Contributor

@pllim is this implemented in the 3.1.2 release?

@taldcroft
Copy link
Member Author

@Gabriel-p - No, just merged to master yesterday. It will be in the 3.2 release.

@astrofrog
Copy link
Member

This is great!

@Gabriel-p
Copy link
Contributor

Gabriel-p commented Mar 19, 2019

@pllim still appears to be a no go in my case. Downloaded the master branch and tried:

$ python
Python 3.7.1 (default, Dec 14 2018, 19:28:38) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import astropy
WARNING: You appear to be trying to import astropy from within a source checkout without building the extension modules first.  Attempting to (re)build extension modules: [astropy]
Rebuilding extension modules [Done]
>>> astropy.__version__
'3.2.dev24000'
>>> from astropy.io import ascii
>>> data = ascii.read('survey_results_public.csv', format='fast_csv', guess=False)
Traceback (most recent call last):
  File "astropy/io/ascii/cparser.pyx", line 611, in astropy.io.ascii.cparser.CParser._convert_data
    cols[name] = self._convert_int(t, i, num_rows)
  File "astropy/io/ascii/cparser.pyx", line 688, in astropy.io.ascii.cparser.CParser._convert_int
    raise ValueError()
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "astropy/io/ascii/cparser.pyx", line 626, in astropy.io.ascii.cparser.CParser._convert_data
    cols[name] = self._convert_float(t, i, num_rows)
  File "astropy/io/ascii/cparser.pyx", line 743, in astropy.io.ascii.cparser.CParser._convert_float
    raise ValueError()
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabriel/Descargas/temp_del/astropy/astropy/io/ascii/ui.py", line 409, in read
    dat = reader.read(table)
  File "/home/gabriel/Descargas/temp_del/astropy/astropy/io/ascii/fastbasic.py", line 128, in read
    data, comments = self.engine.read(try_int, try_float, try_string)
  File "astropy/io/ascii/cparser.pyx", line 397, in astropy.io.ascii.cparser.CParser.read
    return self._convert_data(self.tokenizer, try_int, try_float,
  File "astropy/io/ascii/cparser.pyx", line 635, in astropy.io.ascii.cparser.CParser._convert_data
    cols[name] = self._convert_str(t, i, num_rows)
  File "astropy/io/ascii/cparser.pyx", line 794, in astropy.io.ascii.cparser.CParser._convert_str
    cdef np.ndarray col = np.array(fields_list, dtype=(np.str, max_len))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> 

@astrofrog
Copy link
Member

@Gabriel-p - try with format='pandas.csv'

@Gabriel-p
Copy link
Contributor

It doesn't recognize that as a valid format.

>>> data = ascii.read('survey_results_public.csv', format='pandas.csv', guess=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabriel/Descargas/temp_del/astropy/astropy/io/ascii/ui.py", line 319, in read
    Reader = _get_format_class(format, kwargs.get('Reader'), 'Reader')
  File "/home/gabriel/Descargas/temp_del/astropy/astropy/io/ascii/ui.py", line 188, in _get_format_class
    .format(format, sorted(core.FORMAT_CLASSES)))
ValueError: ASCII format 'pandas.csv' not in allowed list ['aastex', 'basic', 'cds', 'commented_header', 'csv', 'daophot', 'ecsv', 'fast_basic', 'fast_commented_header', 'fast_csv', 'fast_no_header', 'fast_rdb', 'fast_tab', 'fixed_width', 'fixed_width_no_header', 'fixed_width_two_line', 'html', 'ipac', 'latex', 'no_header', 'rdb', 'rst', 'sextractor', 'tab']

@taldcroft
Copy link
Member Author

@Gabriel-p - You need to use the Table interface:

from astropy.table import Table
data = Table.read('survey_results_public.csv', format='pandas.csv')

@Gabriel-p
Copy link
Contributor

That's it, loaded the table in a few seconds. It is a bit confusing though, having to import Table to read some data from a file. Isn't the io module a more reasonable place for this?

>>> from astropy.table import Table
>>> data = Table.read('survey_results_public.csv', format='pandas.csv')
__main__:1: DtypeWarning: Columns (8,12,13,14,15,16,50,51,52,53,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128) have mixed types. Specify dtype option on import or set low_memory=False.
>>> data
<Table masked=True length=98855>
Respondent Hobby OpenSource      Country       ... Dependents MilitaryUS            SurveyTooLong                 SurveyEasy    
  int64     str3    str3          str41        ...   object     object                  object                      object      
---------- ----- ---------- ------------------ ... ---------- ---------- ------------------------------------ ------------------
         1   Yes         No              Kenya ...        Yes         -- The survey was an appropriate length          Very easy
         3   Yes        Yes     United Kingdom ...        Yes         -- The survey was an appropriate length      Somewhat easy
         4   Yes        Yes      United States ...         --         --                                   --                 --
         5    No         No      United States ...         No         No The survey was an appropriate length      Somewhat easy
         7   Yes         No       South Africa ...        Yes         -- The survey was an appropriate length      Somewhat easy
         8   Yes         No     United Kingdom ...         No         -- The survey was an appropriate length      Somewhat easy
         9   Yes        Yes      United States ...         No         No The survey was an appropriate length      Somewhat easy
        10   Yes        Yes            Nigeria ...         No         --              The survey was too long Somewhat difficult
       ...   ...        ...                ... ...        ...        ...                                  ...                ...
    101396   Yes        Yes         Azerbaijan ...         --         --                                   --                 --
    101411    No         No      United States ...         --         --                                   --                 --
    101432   Yes         No              India ...         --         --                                   --                 --
    101478    No        Yes              India ...         --         --                                   --                 --
    101513   Yes        Yes      United States ...         --         --                                   --                 --
    101531    No        Yes              Spain ...         --         --                                   --                 --
    101541   Yes        Yes              India ...         --         --                                   --                 --
    101544   Yes         No Russian Federation ...         --         --                                   --                 --
    101548   Yes        Yes           Cambodia ...         --         --                                   --                 --
>>> 

@taldcroft
Copy link
Member Author

taldcroft commented Mar 19, 2019

@Gabriel-p - about where to come at this (from IO or from Table), that is a matter of perspective. Astropy has worked toward the idea that you mostly think about the data type (e.g. table or ND image) and then should not worry so much about the specific file format (ASCII table, FITS table, HDF5 table). So we have the unified I/O interface (https://astropy.readthedocs.io/en/latest/io/unified.html) that starts from the data type, in this case Table.

So most of the time in your data I/O with tables you should be using Table.read as the single way to read a table, not ascii.read or fits.open(); ... or h5py.open() ....

@Gabriel-p
Copy link
Contributor

Thanks but this is still not clear to me (perhaps this is not the place to be discussing this, if so let me know) When are we then supposed to use ascii.read() and when are we supposed to use Table.read()?

@taldcroft
Copy link
Member Author

You should almost always use Table.read() and rarely ascii.read(). Simple! 😄 E.g. see the first note in https://astropy.readthedocs.io/en/latest/io/ascii/index.html.

The only time I ever use ascii.read() directly is if I have a data table as a single string or list of strings. Table.read() doesn't handle that as an input, it needs a file name or file handle. Apart from that, Table.read() will do everything that ascii.read() does, and more since you can use the same basic command regardless of table format.

@Gabriel-p
Copy link
Contributor

Great! I was under the impression (perhaps because I've been using astropy almost from day 1) that ascii.read() was the way to go. I'll have to do some changes to my code. Thank you for the hard work and the clear explanation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants