Skip to content

Reading and writing Excel spreadsheets in the unified table interface #3744

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
hamogu opened this issue Apr 29, 2015 · 18 comments
Open

Reading and writing Excel spreadsheets in the unified table interface #3744

hamogu opened this issue Apr 29, 2015 · 18 comments
Labels
Enhancement good first issue Issues that are well-suited for new contributors Package-novice table unified-io

Comments

@hamogu
Copy link
Member

hamogu commented Apr 29, 2015

tl;dr: look at #8381 and add support for the pandas.read_excel function and tests.

The use of spreadsheet programs is not uncommon in astronomy. We, as a programming community sometimes tend to think of spreadsheet programs as an inferior way to view and manipulate data, but in fact astropy.table.Table by now (or after open PRs are merged) supports many features that are either directly borrowed from spreadsheet editors or at least can be found in spreadsheet editors as well: Output values can be formatted for displaying in a particular way, columns can be hidden from display, columns can have a certain data dtype (e.g. a time in mixin columns), columns can be calculated automatically based on values from other columns...

Given how many people use spreadsheets and how often I get xlsx files from collaborartors by email it does not seem unreasonable to provide an xlsx reader and writer for the astropy unified table reader and writer interface like so tab = astropy.table.Table.read('mysheet.xlsx')

Note that #1562 modified the csv reader so that io.ascii can read cvs files written by Microsoft Excel and similar spreadsheet programs, but more could be done if we could read those files directly. In particular, dates and times (difficult when exporting to a csv first) and the column formatting could be preserved (bonus points for hidden colums).

Obviously, reading and writing a spreadsheet as a table requires some assumptions which need to be documented, e.g. column headers are in the first row, units (if any) are in the second row, the formatting of numbers is the same for all numbers in a column, the data is read from the active sheet if no sheet name is given, etc. and will never be feature complete (e.g. xlsx files can contain plots and graphics).
Yet, many tables conform to "reasonable" formatting.

Using openpyxl (https://openpyxl.readthedocs.org/en/latest/) as a dependency, this should not be too hard to implement for a set of basic features.

In my opinion, this has a low priority.

@astrofrog
Copy link
Member

Pandas has a read_excel function and we now have Table.from_pandas so this might end up being pretty simple :) Or we can also use openpyxl as you suggest. Someone should try this out!

@astrofrog
Copy link
Member

Pandas actually relies on a package called xlrd (and if that or openpyxl are easy to use, no point relying on pandas too)

@astrofrog
Copy link
Member

On the other hand, they've done a lot of the hard work. Check this out:

screen shot 2015-04-29 at 10 58 42 pm

In [1]: from astropy.table import Table

In [2]: import pandas

In [3]: data = pandas.read_excel('/Users/tom/Desktop/test.xlsx')

In [4]: tab = Table.from_pandas(data)

In [5]: tab
Out[5]: 
<Table masked=False length=45>
       a                b       
    float64          float64    
---------------- ---------------
  0.590645978653  0.940560003647
 0.0209686414087  0.684967210823
  0.423907828133  0.707174587807
0.00973762842767  0.506741174321
  0.914415227502  0.379658311371
  0.904642670671 0.0119727057601
             ...             ...
 0.0776180046316  0.890611230055
  0.457925927845  0.657839891141
  0.757054469741  0.889551293441
  0.272012751841  0.789669554003
  0.723013376468  0.755585923547
  0.319739593282  0.260240194502
  0.251965376574  0.358252027644

So should be pretty easy to make a custom reader with that? On the other hand, writing might be harder.

@hamogu
Copy link
Member Author

hamogu commented Apr 29, 2015

I would have said that writing is easier because in writing you get to pick the conventions (empty line below the column names? Units?), in reading you have to assume that things are formatted as you think they are.

openpyxl and xlrd are closely related: http://www.python-excel.org/
One is for the (older) xls files, the other one for the (newer) xlsx files (but might understand the older files, too?)

@hamogu
Copy link
Member Author

hamogu commented Apr 29, 2015

I have not looked at the pandas code, but I just needed to read an Excel table and could do so in ~10 lines of code, so this should not be hard. I would not think that we save much by going via pandas, but I'll leave that for the author of a PR to decide.

@embray
Copy link
Member

embray commented Apr 30, 2015

I've used xlrd and found it pretty reasonable. However, I think a lot of effort has already gone into pandas on this, so I think unless we have any special needs I'd be in favor of adding Excel support via Pandas as an optional dependency (which we already have). So it would be just a think wrapper around the Pandas I/O plugin as @astrofrog just demonstrated.

@embray
Copy link
Member

embray commented Apr 30, 2015

On the other hand if we can easily avoid the indirection that's fine too.

@sunilk747
Copy link

If this has not been solved, I would like to work on it.

@hamogu
Copy link
Member Author

hamogu commented Jan 18, 2016

@sunilk747 the issue is still open, please go ahead and let us know if you need help at any point!

@sunilk747
Copy link

@hamogu I have read the above comment's from which what I understand is the task here is to add a method to read excel sheet and create table from that. Correct me if I am wrong.

@astrofrog
Copy link
Member

@sunilk747 - that's correct, and as we discussed above, you can do this very easy by using pandas which already does the hard work of calling xlrd

@sunilk747
Copy link

@astrofrog the from_pandas() method is not present in the master branch. Do I need to write this method also? what I found is, it's already there in Astropy version 1.1.

@astrofrog
Copy link
Member

@sunilk747 - from_pandas does exist in master: https://github.com/astropy/astropy/blob/master/astropy/table/table.py#L2470 - is your local git repository up-to-date?

@sunilk747
Copy link

@astrofrog From what I understand I come with this code segment, please have a look and tell me if I am going wrong.

@classmethod
def read(cls, path):

    import pandas

    data = pandas.read_excel(path)
    tab = Table.from_pandas(data)
    return tab

Here the path variable holds the path to the excel sheet.

@hamogu
Copy link
Member Author

hamogu commented Feb 21, 2016

@sunilk747 The best way to find out if it works, is to try it out.
Make an Excel table, check out a new branch of your repository (see the developer docs for help on how to do that), add the code you suggested, install, and see if it works.
If yes, add that as a test case to the tests, if no, try to debug until it works.

Let us know if you get stuck somewhere. We try to be helpful, but we are all contributing to astropy in our free time, so please try to go as far as you can yourself and then we'll review, comment and suggest improvements and try to help you out to solve any problems.

@pllim
Copy link
Member

pllim commented Dec 13, 2018

With the improved interface to pandas (#8247) and that pandas already support Excel, maybe this is no longer necessary?

@astrofrog
Copy link
Member

Alternatively one idea is to add a pandas 'reader', e.g. Table.read(..., format='pandas.csv') and Table.read(..., format='pandas.excel') that is a thin wrapper around the pandas readers and Table.from_pandas.

@bsipocz bsipocz added the Hacktoberfest Hacktoberfest annual event label Sep 26, 2019
@taldcroft taldcroft added good first issue Issues that are well-suited for new contributors unified-io and removed Effort-medium Hacktoberfest Hacktoberfest annual event table labels Jul 16, 2021
@taldcroft
Copy link
Member

one idea is to add a pandas 'reader', e.g. Table.read(..., format='pandas.csv')

This is done in #8381, so adding support for excel is likely quite easy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement good first issue Issues that are well-suited for new contributors Package-novice table unified-io
Projects
None yet
7 participants