Add pandas backend support for Table read/write #8381

taldcroft · 2019-01-26T15:53:16Z

This makes it easy to use pandas read_* functions or DataFrame.to_* methods to read/write tables via pandas I/O. E.g. see #8379. This is a relatively bare-bones start but it does the basics of round-tripping a simple table. There are other more obscure pandas formats like feather or parquet that could be supported, but these all require additional packages so my initial plan is to just cover the common formats.

To do:

Changelog
Docs
HTML reader has a bs4 or lxml dependency, need to capture this in testing.

With #8255 the underlying pandas function/method docstrings should also show up. @astrofrog - any chance of reviewing that?

Example:

In [2]: t = simple_table()

In [3]: t.write('junk.csv', format='pandas.csv')

In [4]: Table.read('junk.csv', format='pandas.csv')
Out[4]: 
<Table length=3>
  a      b     c  
int64 float64 str1
----- ------- ----
    1     1.0    c
    2     2.0    d
    3     3.0    e

codecov · 2019-01-26T16:13:16Z

Codecov Report

Merging #8381 into master will increase coverage by <.01%.
The diff coverage is 88.7%.

@@            Coverage Diff             @@
##           master    #8381      +/-   ##
==========================================
+ Coverage    86.8%   86.81%   +<.01%     
==========================================
  Files         385      386       +1     
  Lines       58104    58166      +62     
  Branches     1060     1060              
==========================================
+ Hits        50440    50495      +55     
- Misses       7049     7056       +7     
  Partials      615      615

Impacted Files	Coverage Δ
astropy/table/__init__.py	`100% <100%> (ø)`	⬆️
astropy/io/misc/pandas/connect.py	`88.52% <88.52%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72e2fb8...4388d89. Read the comment docs.

pllim · 2019-01-28T16:36:14Z

This would be so cool! Do we have to worry about minversion for pandas?

taldcroft · 2019-03-15T21:23:36Z

This would be so cool! Do we have to worry about minversion for pandas?

@pllim - I'm glad you are psyched! Can you and/or @saimn review?

About the minversion, I looked back in the docs to pandas 0.12 from Jan 2014 and from what I can see it has the same functions and methods here, and similar enough API. In particular there are a few hardcoded args in the connect wrapper that appear to be available as of pandas 0.12.

taldcroft · 2019-03-15T21:32:32Z

The docs look like this:

bsipocz · 2019-03-15T21:35:03Z

About the minversion, I looked back in the docs to pandas 0.12 from Jan 2014 and from what I can see it has the same functions and methods here, and similar enough API. In particular there are a few hardcoded args in the connect wrapper that appear to be available as of pandas 0.12.

if we could just run one of the test jobs with this min version, and another one with the latest, that could cover us pretty well I think.

taldcroft · 2019-03-16T13:06:18Z

OK, good that you suggested checking older versions. I had to make a few changes and got this working back to pandas 0.14. Before that it fails, so we have a concrete min version now.

bsipocz · 2019-03-16T21:55:20Z

.travis.yml

@@ -34,8 +34,8 @@ env:
        - PYTEST_VERSION=3.10
        - MAIN_CMD='python setup.py'
        - CONDA_DEPENDENCIES='Cython jinja2'
-        - CONDA_ALL_DEPENDENCIES='Cython jinja2 scipy h5py matplotlib pyyaml pandas pytz beautifulsoup4 ipython mpmath bleach bottleneck'
-        - DEV_PIP_DEP='asdf>=2.3 Cython jinja2 scipy h5py matplotlib pyyaml scikit-image pandas pytz beautifulsoup4 ipython mpmath bleach bottleneck'
+        - CONDA_ALL_DEPENDENCIES='Cython jinja2 scipy h5py matplotlib pyyaml pandas pytz html5lib beautifulsoup4 ipython mpmath bleach bottleneck'


I suppose we need to list html5lib as optional dependency in docs/install.rst, and while editing that file, please mention that 0.14 is the minimum pandas version required.

taldcroft · 2019-03-17T11:10:15Z

@bsipocz - I made the suggested edits. Please check to see if I've done the new optional dependency on html5lib fully and correctly. I added a couple more things in 3fd90c8.

pllim

Just some nitpicky nitpicks. LGTM. Thanks!

astropy/io/misc/pandas/connect.py

pllim · 2019-03-18T18:01:07Z

astropy/io/misc/tests/test_pandas.py

+    """
+    Test round-trip through pandas write/read for supported formats.
+
+    :param fmt: format name, e.g. csv, html, json


Hmm, any reason to use epydoc formatting in the docstring here? This does not matter because test function docstring is not exposed to users, still... why?

That's just the PyCharm default that got in there, not really by purpose.

maybe @cdeil as pycharm power user can chime in and tell how to change the default to be numpydoc?

PyCharm has a docstring setting - choose "NumPy" when working on Astropy.

Cool! My PyCharm experiment is looking better each day!

pllim · 2019-03-18T19:01:59Z

@Gabriel-p , do you want to take this for a spin with your use case?

Co-Authored-By: taldcroft <taldcroft@gmail.com>

taldcroft · 2019-03-18T19:20:22Z

Gah, should have made that last nit-pick commit with no-skip.

Gabriel-p · 2019-03-19T13:08:59Z

@pllim is this implemented in the 3.1.2 release?

taldcroft · 2019-03-19T13:43:14Z

@Gabriel-p - No, just merged to master yesterday. It will be in the 3.2 release.

astrofrog · 2019-03-19T13:59:28Z

This is great!

Gabriel-p · 2019-03-19T14:16:51Z

@pllim still appears to be a no go in my case. Downloaded the master branch and tried:

$ python
Python 3.7.1 (default, Dec 14 2018, 19:28:38) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import astropy
WARNING: You appear to be trying to import astropy from within a source checkout without building the extension modules first.  Attempting to (re)build extension modules: [astropy]
Rebuilding extension modules [Done]
>>> astropy.__version__
'3.2.dev24000'
>>> from astropy.io import ascii
>>> data = ascii.read('survey_results_public.csv', format='fast_csv', guess=False)
Traceback (most recent call last):
  File "astropy/io/ascii/cparser.pyx", line 611, in astropy.io.ascii.cparser.CParser._convert_data
    cols[name] = self._convert_int(t, i, num_rows)
  File "astropy/io/ascii/cparser.pyx", line 688, in astropy.io.ascii.cparser.CParser._convert_int
    raise ValueError()
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "astropy/io/ascii/cparser.pyx", line 626, in astropy.io.ascii.cparser.CParser._convert_data
    cols[name] = self._convert_float(t, i, num_rows)
  File "astropy/io/ascii/cparser.pyx", line 743, in astropy.io.ascii.cparser.CParser._convert_float
    raise ValueError()
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabriel/Descargas/temp_del/astropy/astropy/io/ascii/ui.py", line 409, in read
    dat = reader.read(table)
  File "/home/gabriel/Descargas/temp_del/astropy/astropy/io/ascii/fastbasic.py", line 128, in read
    data, comments = self.engine.read(try_int, try_float, try_string)
  File "astropy/io/ascii/cparser.pyx", line 397, in astropy.io.ascii.cparser.CParser.read
    return self._convert_data(self.tokenizer, try_int, try_float,
  File "astropy/io/ascii/cparser.pyx", line 635, in astropy.io.ascii.cparser.CParser._convert_data
    cols[name] = self._convert_str(t, i, num_rows)
  File "astropy/io/ascii/cparser.pyx", line 794, in astropy.io.ascii.cparser.CParser._convert_str
    cdef np.ndarray col = np.array(fields_list, dtype=(np.str, max_len))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>>

astrofrog · 2019-03-19T14:22:24Z

@Gabriel-p - try with format='pandas.csv'

Gabriel-p · 2019-03-19T14:27:33Z

It doesn't recognize that as a valid format.

>>> data = ascii.read('survey_results_public.csv', format='pandas.csv', guess=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gabriel/Descargas/temp_del/astropy/astropy/io/ascii/ui.py", line 319, in read
    Reader = _get_format_class(format, kwargs.get('Reader'), 'Reader')
  File "/home/gabriel/Descargas/temp_del/astropy/astropy/io/ascii/ui.py", line 188, in _get_format_class
    .format(format, sorted(core.FORMAT_CLASSES)))
ValueError: ASCII format 'pandas.csv' not in allowed list ['aastex', 'basic', 'cds', 'commented_header', 'csv', 'daophot', 'ecsv', 'fast_basic', 'fast_commented_header', 'fast_csv', 'fast_no_header', 'fast_rdb', 'fast_tab', 'fixed_width', 'fixed_width_no_header', 'fixed_width_two_line', 'html', 'ipac', 'latex', 'no_header', 'rdb', 'rst', 'sextractor', 'tab']

taldcroft · 2019-03-19T14:31:24Z

@Gabriel-p - You need to use the Table interface:

from astropy.table import Table
data = Table.read('survey_results_public.csv', format='pandas.csv')

Gabriel-p · 2019-03-19T14:35:57Z

That's it, loaded the table in a few seconds. It is a bit confusing though, having to import Table to read some data from a file. Isn't the io module a more reasonable place for this?

>>> from astropy.table import Table
>>> data = Table.read('survey_results_public.csv', format='pandas.csv')
__main__:1: DtypeWarning: Columns (8,12,13,14,15,16,50,51,52,53,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128) have mixed types. Specify dtype option on import or set low_memory=False.
>>> data
<Table masked=True length=98855>
Respondent Hobby OpenSource      Country       ... Dependents MilitaryUS            SurveyTooLong                 SurveyEasy    
  int64     str3    str3          str41        ...   object     object                  object                      object      
---------- ----- ---------- ------------------ ... ---------- ---------- ------------------------------------ ------------------
         1   Yes         No              Kenya ...        Yes         -- The survey was an appropriate length          Very easy
         3   Yes        Yes     United Kingdom ...        Yes         -- The survey was an appropriate length      Somewhat easy
         4   Yes        Yes      United States ...         --         --                                   --                 --
         5    No         No      United States ...         No         No The survey was an appropriate length      Somewhat easy
         7   Yes         No       South Africa ...        Yes         -- The survey was an appropriate length      Somewhat easy
         8   Yes         No     United Kingdom ...         No         -- The survey was an appropriate length      Somewhat easy
         9   Yes        Yes      United States ...         No         No The survey was an appropriate length      Somewhat easy
        10   Yes        Yes            Nigeria ...         No         --              The survey was too long Somewhat difficult
       ...   ...        ...                ... ...        ...        ...                                  ...                ...
    101396   Yes        Yes         Azerbaijan ...         --         --                                   --                 --
    101411    No         No      United States ...         --         --                                   --                 --
    101432   Yes         No              India ...         --         --                                   --                 --
    101478    No        Yes              India ...         --         --                                   --                 --
    101513   Yes        Yes      United States ...         --         --                                   --                 --
    101531    No        Yes              Spain ...         --         --                                   --                 --
    101541   Yes        Yes              India ...         --         --                                   --                 --
    101544   Yes         No Russian Federation ...         --         --                                   --                 --
    101548   Yes        Yes           Cambodia ...         --         --                                   --                 --
>>>

taldcroft · 2019-03-19T14:43:38Z

@Gabriel-p - about where to come at this (from IO or from Table), that is a matter of perspective. Astropy has worked toward the idea that you mostly think about the data type (e.g. table or ND image) and then should not worry so much about the specific file format (ASCII table, FITS table, HDF5 table). So we have the unified I/O interface (https://astropy.readthedocs.io/en/latest/io/unified.html) that starts from the data type, in this case Table.

So most of the time in your data I/O with tables you should be using Table.read as the single way to read a table, not ascii.read or fits.open(); ... or h5py.open() ....

Gabriel-p · 2019-03-19T14:54:28Z

Thanks but this is still not clear to me (perhaps this is not the place to be discussing this, if so let me know) When are we then supposed to use ascii.read() and when are we supposed to use Table.read()?

taldcroft · 2019-03-19T15:12:22Z

You should almost always use Table.read() and rarely ascii.read(). Simple! 😄 E.g. see the first note in https://astropy.readthedocs.io/en/latest/io/ascii/index.html.

The only time I ever use ascii.read() directly is if I have a data table as a single string or list of strings. Table.read() doesn't handle that as an input, it needs a file name or file handle. Apart from that, Table.read() will do everything that ascii.read() does, and more since you can use the same basic command regardless of table format.

Gabriel-p · 2019-03-19T15:16:51Z

Great! I was under the impression (perhaps because I've been using astropy almost from day 1) that ascii.read() was the way to go. I'll have to do some changes to my code. Thank you for the hard work and the clear explanation!

taldcroft added table io.registry Enhancement Work in progress labels Jan 26, 2019

taldcroft added this to the v3.2 milestone Jan 26, 2019

astrofrog self-requested a review January 26, 2019 21:54

taldcroft added 5 commits March 15, 2019 15:17

Add pandas backend for Table read/write

7347526

Add tests of pandas backend

a75a6e0

Use flavor=bs4 for HTML and add optional html5lib pkg for test

307ff64

Update change log

2f80876

Add docs

fed9477

taldcroft force-pushed the table-pandas branch from 66b0b5b to fed9477 Compare March 15, 2019 21:13

taldcroft removed the Work in progress label Mar 15, 2019

Fix trailing whitespace that is failing flake8

5e50006

taldcroft requested review from pllim and saimn and removed request for astrofrog March 15, 2019 21:27

Allow compatibility with pandas back to 0.14

35365a8

bsipocz reviewed Mar 16, 2019

View reviewed changes

Fix test/doc issues related to optional dependencies

3fd90c8

pllim added io.misc whatsnew-needed labels Mar 18, 2019

pllim approved these changes Mar 18, 2019

View reviewed changes

Update astropy/io/misc/pandas/connect.py

4388d89

Co-Authored-By: taldcroft <taldcroft@gmail.com>

taldcroft merged commit 642a776 into astropy:master Mar 18, 2019

taldcroft deleted the table-pandas branch March 18, 2019 20:46

pllim mentioned this pull request May 10, 2019

3.2rc1 Table tests that call pandas failed #8682

Closed

taldcroft added whatsnew-needed and removed whatsnew-needed labels May 10, 2019

taldcroft mentioned this pull request Jul 16, 2021

Reading and writing Excel spreadsheets in the unified table interface #3744

Open

NotQuiteAwake mentioned this pull request Jul 27, 2022

Add Table support for reading from / writing to excel files through pandas (astropy.io.misc.pandas) #13506

Closed

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pandas backend support for Table read/write #8381

Add pandas backend support for Table read/write #8381

taldcroft commented Jan 26, 2019 •

edited

codecov bot commented Jan 26, 2019 •

edited

pllim commented Jan 28, 2019

taldcroft commented Mar 15, 2019

taldcroft commented Mar 15, 2019

bsipocz commented Mar 15, 2019

taldcroft commented Mar 16, 2019

bsipocz Mar 16, 2019

taldcroft commented Mar 17, 2019

pllim left a comment

pllim Mar 18, 2019

taldcroft Mar 18, 2019

bsipocz Mar 18, 2019

cdeil Mar 18, 2019

bsipocz Mar 18, 2019

taldcroft Mar 18, 2019

pllim commented Mar 18, 2019

taldcroft commented Mar 18, 2019

Gabriel-p commented Mar 19, 2019

taldcroft commented Mar 19, 2019

astrofrog commented Mar 19, 2019

Gabriel-p commented Mar 19, 2019 •

edited

astrofrog commented Mar 19, 2019

Gabriel-p commented Mar 19, 2019

taldcroft commented Mar 19, 2019

Gabriel-p commented Mar 19, 2019

taldcroft commented Mar 19, 2019 •

edited

Gabriel-p commented Mar 19, 2019

taldcroft commented Mar 19, 2019

Gabriel-p commented Mar 19, 2019

Add pandas backend support for Table read/write #8381

Add pandas backend support for Table read/write #8381

Conversation

taldcroft commented Jan 26, 2019 • edited

codecov bot commented Jan 26, 2019 • edited

Codecov Report

pllim commented Jan 28, 2019

taldcroft commented Mar 15, 2019

taldcroft commented Mar 15, 2019

bsipocz commented Mar 15, 2019

taldcroft commented Mar 16, 2019

bsipocz Mar 16, 2019

Choose a reason for hiding this comment

taldcroft commented Mar 17, 2019

pllim left a comment

Choose a reason for hiding this comment

pllim Mar 18, 2019

Choose a reason for hiding this comment

taldcroft Mar 18, 2019

Choose a reason for hiding this comment

bsipocz Mar 18, 2019

Choose a reason for hiding this comment

cdeil Mar 18, 2019

Choose a reason for hiding this comment

bsipocz Mar 18, 2019

Choose a reason for hiding this comment

taldcroft Mar 18, 2019

Choose a reason for hiding this comment

pllim commented Mar 18, 2019

taldcroft commented Mar 18, 2019

Gabriel-p commented Mar 19, 2019

taldcroft commented Mar 19, 2019

astrofrog commented Mar 19, 2019

Gabriel-p commented Mar 19, 2019 • edited

astrofrog commented Mar 19, 2019

Gabriel-p commented Mar 19, 2019

taldcroft commented Mar 19, 2019

Gabriel-p commented Mar 19, 2019

taldcroft commented Mar 19, 2019 • edited

Gabriel-p commented Mar 19, 2019

taldcroft commented Mar 19, 2019

Gabriel-p commented Mar 19, 2019

taldcroft commented Jan 26, 2019 •

edited

codecov bot commented Jan 26, 2019 •

edited

Gabriel-p commented Mar 19, 2019 •

edited

taldcroft commented Mar 19, 2019 •

edited