Table.read_table() could be smarter about auto-detecting the file format #66

Closed
davidwagner opened this issue Sep 14, 2015 · 4 comments

@davidwagner
Member

Try this:

Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD')

Table.read_table() fails to recognize the columns; it stuffs everything into one column.

Compare to

Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv')

which does recognize that there are three columns.

Perhaps it is looking at the URL and trying to parse out the filename extension, and then using that to decide how to decode the data. If so, maybe it should be smarter about how to parse URLs (to remove fragments and parameters), or maybe it should ignore the URL/filename and have smarter format detection (e.g., auto-detect it as CSV based on the contents of the data rather than the filename).
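The first suggestion (parse the URL so that parameters and fragments are dropped before looking at the extension) could be sketched roughly as follows. This is only an illustration using the standard library; `path_extension` is a hypothetical helper, not something taken from the datascience package.

```python
import posixpath
from urllib.parse import urlsplit


def path_extension(address):
    """Return the lowercase extension of the path part of a URL or file path.

    Hypothetical helper for illustration; not part of datascience.
    """
    # urlsplit separates the path from the query string and the fragment.
    path = urlsplit(address).path
    # splitext returns (root, ext); ext includes the leading dot, or is ''.
    return posixpath.splitext(path)[1].lower()


with_query = 'https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD'
without_query = 'https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv'

print(path_extension(with_query))     # .csv
print(path_extension(without_query))  # .csv
```

With a helper like this, both URLs above would be treated as CSV regardless of the `?accessType=DOWNLOAD` suffix.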

@deculler
Contributor

read_table uses the pandas tool for this. In fact, it is the only thing we use out of pandas. It is pretty sophisticated, and it replaced the more straightforward CSV reader that was in my original tables implementation. We might need to offer that as an alternative when read_table can't figure it out. CSV is a really awful world in many ways.
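For reference, a content-based fallback of the kind discussed here could look something like the sketch below. It uses `csv.Sniffer` from the Python standard library, which is not what read_table does. The helper name `sniff_delimiter`, the candidate delimiter set, and the three-column sample text are all made up for illustration.

```python
import csv
import io


def sniff_delimiter(sample, candidates=',\t;|'):
    """Guess the delimiter from a text sample; return None if undetermined.

    Hypothetical helper for illustration; not part of datascience or pandas.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffer raises csv.Error when it cannot settle on a delimiter.
        return None


# Made-up sample; in practice this would be the first few KB of the download.
sample = "name,year,count\nalpha,2014,3\nbeta,2015,5\n"

# Fall back to a comma if the sniffer cannot decide.
delimiter = sniff_delimiter(sample) or ','
rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
print(delimiter, rows[0])
```

Sniffing is a heuristic and can guess wrong on short or irregular samples, so an explicit `sep` argument would still need to take precedence.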

David E. Culler
Friesen Professor of Computer Science
Electrical Engineering and Computer Sciences
University of California, Berkeley

@davidwagner
Member Author

Cool, thank you! I wonder if this line in datascience/tables.py is causing the problem:

    if filepath_or_buffer.endswith('.csv') and 'sep' not in vargs:
        vargs['sep'] = ','

Note to self: investigate when I get a chance.

Anyway, this is absolutely not a big deal, just a super-minor annoyance I thought I'd document.
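The hypothesis can be checked in isolation with a small stand-in for that check. `apply_csv_default` is a hypothetical wrapper written for this comment; only the two-line `if` is taken from the snippet above. If the separator is left unset and the underlying reader then falls back to a tab separator (the default for `pandas.read_table`), a comma-separated file would land in a single column, which matches the report.

```python
def apply_csv_default(filepath_or_buffer, **vargs):
    """Mimic the quoted check: default sep to ',' only for paths ending in '.csv'.

    Hypothetical wrapper for illustration; only the `if` comes from the issue.
    """
    if filepath_or_buffer.endswith('.csv') and 'sep' not in vargs:
        vargs['sep'] = ','
    return vargs


plain = 'https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv'
with_query = plain + '?accessType=DOWNLOAD'

print(apply_csv_default(plain))                # {'sep': ','}
print(apply_csv_default(with_query))           # {}  (the query string hides '.csv')
print(apply_csv_default(with_query, sep=','))  # {'sep': ','}
```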

@papajohn
Contributor

The table reader doesn't inspect the file, just the path. I think that behavior is here to stay. Instead, you'll have to specify the separator manually.

address = 'https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD'
Table.read_table(address, sep=',')

@papajohn
Contributor

Slightly improved in the new release (it handles the HTTP query string case).
