Data('file.csv') no longer infers fields with string values #1254

Closed
bittlingmayer opened this issue Sep 17, 2015 · 6 comments

Comments

@bittlingmayer
commented Sep 17, 2015

This worked on 0.8.0. Since I updated, instantiating Data from a .csv file no longer picks up the header row in cases like the following:

x,tl,z
Be careful driving.,hy,en
Be careful.,hy,en
Can you translate this for me?,hy,en
Chicago is very different from Boston.,hy,en
Don't worry.,hy,en

Actual vs expected output:

Data('data/broken.csv').fields
['0', '1', '2']
Data('data/works.csv').fields
[u'x', u'tl', u'z']

If the values in the second column are numbers, the fields are inferred correctly:

x,tl,z
Be careful driving.,1,en
Be careful.,12,en
Can you translate this for me?,2,en
Chicago is very different from Boston.,2,en
Don't worry.,2,en

If the second column is named 'y' instead of 'tl', the fields are also inferred as expected:

x,y,z
Be careful driving.,hy,en
Be careful.,hy,en
Can you translate this for me?,hy,en
Chicago is very different from Boston.,hy,en
Don't worry.,hy,en

So header inference is very sensitive to both the column names and the types of the values.
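The pattern above can be reproduced with a length-and-type based header heuristic. The sketch below uses Python's standard-library `csv.Sniffer().has_header`; whether Blaze relies on this heuristic is an assumption that the thread does not confirm, and the helper `make_csv` and the sample names are hypothetical, introduced only for illustration.

```python
import csv

# First-column values shared by all three samples in the report.
SENTENCES = [
    "Be careful driving.",
    "Be careful.",
    "Can you translate this for me?",
    "Chicago is very different from Boston.",
    "Don't worry.",
]


def make_csv(header, second_column):
    """Build a CSV string with the given header line and second-column values."""
    lines = [header]
    lines += [f"{s},{v},en" for s, v in zip(SENTENCES, second_column)]
    return "\n".join(lines) + "\n"


samples = {
    "broken": make_csv("x,tl,z", ["hy"] * 5),                 # string values, header 'tl'
    "numeric": make_csv("x,tl,z", ["1", "12", "2", "2", "2"]),  # numeric values
    "renamed": make_csv("x,y,z", ["hy"] * 5),                 # string values, header 'y'
}

# NOTE: csv.Sniffer is used here as a stand-in; it may not be what Blaze does.
sniffer = csv.Sniffer()
results = {name: sniffer.has_header(text) for name, text in samples.items()}

for name, found in results.items():
    print(f"{name}: header detected = {found}")
```

With CPython's implementation, the first sample is reported as having no header while the other two are. The heuristic votes per column: 'tl' has the same length as every 'hy' value below it, so that column votes against a header and cancels the vote from the 'z' column. A one-character name or numeric values tip the balance the other way.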

@bittlingmayer

Author

commented Sep 17, 2015

I will solve it for now by passing an argument to force the first row to be interpreted as column names.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

@bittlingmayer

Author

commented Sep 17, 2015

http://blaze.pydata.org/en/latest/csv.html#correcting-csv-dialects states: In the first case of incorrect guessing of CSV dialect (e.g. delimiter) Blaze respects and passes through all keyword arguments to pandas.read_csv.

However, passing header=0 has no effect:

Data('data/broken.csv', header=0).fields
['0', '1', '2']

Moreover, pandas infers the fields as expected:

pandas.read_csv('data/broken.csv').columns
Index([u'x', u'tl', u'z'], dtype='object')

So for now the workaround is Data(pandas.read_csv('data/broken.csv')).
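A minimal sketch of that workaround, assuming pandas is installed. The in-memory buffer stands in for data/broken.csv, and the final Blaze call is left as a comment because it needs Blaze installed and is not exercised here.

```python
import io

import pandas as pd

# Stand-in for data/broken.csv (same content as the failing sample).
csv_text = (
    "x,tl,z\n"
    "Be careful driving.,hy,en\n"
    "Be careful.,hy,en\n"
    "Can you translate this for me?,hy,en\n"
    "Chicago is very different from Boston.,hy,en\n"
    "Don't worry.,hy,en\n"
)

# By default pandas.read_csv treats the first row as the header,
# regardless of what the values below it look like.
df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))

# The DataFrame can then be handed to Blaze, as in the workaround above:
# from blaze import Data
# d = Data(df)
```

The design difference is that pandas assumes a header unless told otherwise, whereas a sniffing approach has to guess from the data.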

@cpcloud

Member

commented Sep 28, 2015

@bittlingmayer Pass has_header=True to the Data constructor:

d = Data('foo.csv', has_header=True)

@cpcloud cpcloud added this to the 0.9.0 milestone Sep 28, 2015

@cpcloud

Member

commented Sep 28, 2015

I'm going to add a test and a bit of documentation for this issue and then close it.

@bittlingmayer

Author

commented Sep 29, 2015

Documentation would be good, although I also think the original (i.e. pandas) behaviour is a bit smarter.

@cpcloud

Member

commented Dec 4, 2015

@bittlingmayer Yes, it's smarter; however, this is a compromise between explicitness and convenience.

@cpcloud cpcloud closed this Dec 4, 2015
