Table.read_table() could be smarter about auto-detecting the file format #66

davidwagner · 2015-09-14T19:04:18Z

Try this:

    Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD')

Table.read_table() fails to recognize the columns; it stuffs everything into one column.

Compare to

    Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv')

which does recognize that there are three columns.

Perhaps it is looking at the URL and trying to parse out the filename extension, and then using that to decide how to decode the data. If so, maybe it should be smarter about how to parse URLs (to remove fragments and parameters), or maybe it should ignore the URL/filename and have smarter format detection (e.g., auto-detect it as CSV based on the contents of the data rather than the filename).
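To illustrate the second idea (detecting the format from the contents rather than the filename), here is a minimal sketch using the standard library's csv.Sniffer. The helper name sniff_delimiter and the candidate delimiter set are assumptions made for illustration; this is not how datascience or pandas actually behaves.

```python
import csv

def sniff_delimiter(sample, candidates=",\t;|"):
    """Guess the delimiter from a text sample (hypothetical helper).

    Returns None when csv.Sniffer cannot settle on a delimiter.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return None

# A made-up three-column sample standing in for the first few lines of a download.
sample = "name,city,count\nalice,Oakland,3\nbob,Berkeley,5\n"
print(repr(sniff_delimiter(sample)))
```

A reader built this way would pass the guessed delimiter as sep and fall back to a default when the guess is None. Sniffing is a heuristic, so it can still misjudge unusual files.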

deculler · 2015-09-14T21:49:17Z

read_table uses the pandas tool for this. In fact, it is the only thing we use out of pandas. It is pretty sophisticated, and it replaced the more straightforward CSV reader that was in my original tables implementation. We might need to offer that reader as an alternative for when read_table can't figure out the format. CSV is a really awful world in many ways.

David E. Culler
Friesen Professor of Computer Science
Electrical Engineering and Computer Sciences
University of California, Berkeley

davidwagner · 2015-09-16T01:57:31Z

Cool, thank you! I wonder if this line in datascience/tables.py is causing the problem:

    if filepath_or_buffer.endswith('.csv') and 'sep' not in vargs:
        vargs['sep'] = ','
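
If that check is the cause, the reason would be that the query string comes after the extension, so the full URL no longer ends in '.csv'. A minimal sketch of testing only the path component, using urllib.parse from the standard library, follows. The helper name looks_like_csv is hypothetical, and this is not necessarily how the library handles it.

```python
from urllib.parse import urlparse

def looks_like_csv(filepath_or_url):
    """Check the extension of the path only, ignoring ?query and #fragment.

    Hypothetical helper for illustration; not part of datascience.
    """
    path = urlparse(filepath_or_url).path
    return path.lower().endswith('.csv')

url = 'https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD'
print(url.endswith('.csv'))   # False: the string ends with the query parameter
print(looks_like_csv(url))    # True: the path component is .../rows.csv
```

Plain local paths such as 'rows.csv' pass through urlparse unchanged, so the same check covers both cases.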

Note to self: investigate when I get a chance.

Anyway, this is absolutely not a big deal, just a super-minor annoyance I thought I'd document.

papajohn · 2015-09-17T21:03:36Z

The table reader doesn't inspect the file, just the path. I think that behavior is here to stay. Instead, you'll have to specify the separator manually.

    address = 'https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD'
    Table.read_table(address, sep=',')

papajohn · 2015-09-23T04:13:26Z

Slightly improved in the new release, which handles the HTTP query string case.

papajohn closed this as completed Sep 23, 2015
