Data('file.csv') no longer infers fields with string values #1254

bittlingmayer · 2015-09-17T08:06:24Z

On 0.8.0 this worked. Since I updated, instantiating Data from a .csv no longer picks up the column names in cases like the following:

x,tl,z
Be careful driving.,hy,en
Be careful.,hy,en
Can you translate this for me?,hy,en
Chicago is very different from Boston.,hy,en
Don't worry.,hy,en

Actual vs expected output:

Data('data/broken.csv').fields
['0', '1', '2']
Data('data/works.csv').fields
[u'x', u'tl', u'z']

If the values in the second column are numbers, then the fields are inferred correctly:

x,tl,z
Be careful driving.,1,en
Be careful.,12,en
Can you translate this for me?,2,en
Chicago is very different from Boston.,2,en
Don't worry.,2,en

If the column name is 'y', then it is also inferred as expected:

x,y,z
Be careful driving.,hy,en
Be careful.,hy,en
Can you translate this for me?,hy,en
Chicago is very different from Boston.,hy,en
Don't worry.,hy,en

So the header inference is very sensitive to both the values and the column names.
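One possible explanation for this sensitivity, sketched below: a vote-based header guess in the style of Python's csv.Sniffer.has_header, where each column with a consistent type or string length votes on whether the first row looks different from the rest. Whether Blaze relies on exactly this heuristic is an assumption, and guess_has_header, make_csv and the sample constants are hypothetical names used only for illustration.

```python
import csv
import io

SENTENCES = [
    "Be careful driving.",
    "Be careful.",
    "Can you translate this for me?",
    "Chicago is very different from Boston.",
    "Don't worry.",
]


def make_csv(header, middle_values):
    """Build a sample CSV like the ones in this report (hypothetical helper)."""
    rows = [f"{s},{m},en" for s, m in zip(SENTENCES, middle_values)]
    return "\n".join([header] + rows)


def _is_number(value):
    try:
        float(value)
    except ValueError:
        return False
    return True


def guess_has_header(text):
    """Simplified vote-based guess, modelled on csv.Sniffer.has_header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    if not body:
        return False
    votes = 0
    for col, name in enumerate(header):
        values = [row[col] for row in body]
        if all(_is_number(v) for v in values):
            # Numeric column: a non-numeric first cell looks like a header.
            votes += 1 if not _is_number(name) else -1
        elif len({len(v) for v in values}) == 1:
            # Fixed-width string column: compare the first cell's length.
            votes += 1 if len(name) != len(values[0]) else -1
        # Columns with mixed types or lengths cast no vote.
    return votes > 0


BROKEN = make_csv("x,tl,z", ["hy"] * 5)                  # 'tl' is as long as 'hy'
NUMERIC = make_csv("x,tl,z", ["1", "12", "2", "2", "2"])
RENAMED = make_csv("x,y,z", ["hy"] * 5)                  # 'y' is shorter than 'hy'

print(guess_has_header(BROKEN))   # False: 'tl' votes -1, 'z' votes +1
print(guess_has_header(NUMERIC))  # True
print(guess_has_header(RENAMED))  # True
```

Under this sketch the first file ends in a tie, because the 'tl' header has the same length as every 'hy' value, so no header is detected. Changing either the values or the column name breaks the tie.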


bittlingmayer · 2015-09-17T08:30:55Z

I will solve it for now by passing an argument to force the first row to be interpreted as column names.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

bittlingmayer · 2015-09-17T08:47:39Z

http://blaze.pydata.org/en/latest/csv.html#correcting-csv-dialects states: In the first case of incorrect guessing of CSV dialect (e.g. delimiter) Blaze respects and passes through all keyword arguments to pandas.read_csv.

However, passing header=0 through has no effect:

Data('data/broken.csv', header=0).fields
['0', '1', '2']

Moreover, pandas infers the fields as expected:

pandas.read_csv('data/broken.csv').columns
Index([u'x', u'tl', u'z'], dtype='object')

So for now the workaround is Data(pandas.read_csv('data/broken.csv')).

cpcloud · 2015-09-28T18:38:43Z

@bittlingmayer Pass has_header=True to the Data constructor:

d = Data('foo.csv', has_header=True)

cpcloud · 2015-09-28T18:39:31Z

I'm going to add a test and a bit of documentation for this issue and then close it.

bittlingmayer · 2015-09-29T07:00:44Z

Documentation would be good, although I also think the original (i.e. pandas) behaviour is a bit smarter.

cpcloud · 2015-12-04T20:39:02Z

@bittlingmayer Yes, it's smarter; however, this is a compromise between explicitness and convenience.

bittlingmayer closed this as completed Sep 17, 2015

bittlingmayer reopened this Sep 17, 2015

cpcloud added the documentation and user experience labels Sep 28, 2015

cpcloud added this to the 0.9.0 milestone Sep 28, 2015

cpcloud closed this as completed Dec 4, 2015
