Data('file.csv') no longer infers fields with string values #1254

Closed
bittlingmayer opened this issue Sep 17, 2015 · 6 comments

Comments

@bittlingmayer

This worked on 0.8.0. Since I updated, instantiating Data from a .csv no longer infers the column names from the first row in cases like the following:

x,tl,z
Be careful driving.,hy,en
Be careful.,hy,en
Can you translate this for me?,hy,en
Chicago is very different from Boston.,hy,en
Don't worry.,hy,en

Actual output (first call) vs. expected output (second call):

Data('data/broken.csv').fields
['0', '1', '2']
Data('data/works.csv').fields
[u'x', u'tl', u'z']

If the values in the second column are numbers, the header is inferred correctly:

x,tl,z
Be careful driving.,1,en
Be careful.,12,en
Can you translate this for me?,2,en
Chicago is very different from Boston.,2,en
Don't worry.,2,en

If the second column is named 'y' instead of 'tl', the header is also inferred as expected:

x,y,z
Be careful driving.,hy,en
Be careful.,hy,en
Can you translate this for me?,hy,en
Chicago is very different from Boston.,hy,en
Don't worry.,hy,en

So header inference is very sensitive to both the column names and the types of the values.
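
The pattern in these three samples can be reproduced with the header heuristic in Python's standard library, `csv.Sniffer.has_header`. The sketch below is only an illustration: whether Blaze's inference relies on this exact heuristic is an assumption, and `make_csv` and the sample names are hypothetical helpers introduced here.

```python
import csv

SENTENCES = [
    "Be careful driving.",
    "Be careful.",
    "Can you translate this for me?",
    "Chicago is very different from Boston.",
    "Don't worry.",
]


def make_csv(header, middle_values):
    """Build CSV text shaped like the samples in this report (hypothetical helper)."""
    lines = [header]
    for sentence, value in zip(SENTENCES, middle_values):
        lines.append("{},{},en".format(sentence, value))
    return "\n".join(lines) + "\n"


samples = {
    "broken": make_csv("x,tl,z", ["hy"] * 5),                    # string values, header 'tl'
    "numeric": make_csv("x,tl,z", ["1", "12", "2", "2", "2"]),   # numeric values
    "renamed": make_csv("x,y,z", ["hy"] * 5),                    # string values, header 'y'
}

# has_header() votes per column: a string column counts *for* a header only
# if the first row's length differs from the (constant) length of the values.
sniffer = csv.Sniffer()
results = {name: sniffer.has_header(text) for name, text in samples.items()}

for name, detected in results.items():
    print(name, detected)
```

With CPython's `csv` module the "broken" sample is reported as having no header, while the other two are reported as having one. In the broken case 'tl' has the same length as 'hy', so that column votes against a header and cancels the vote from the last column.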

@bittlingmayer
Author

I will solve it for now by passing an argument to force the first row to be interpreted as column names.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

@bittlingmayer
Author

http://blaze.pydata.org/en/latest/csv.html#correcting-csv-dialects states: In the first case of incorrect guessing of CSV dialect (e.g. delimiter) Blaze respects and passes through all keyword arguments to pandas.read_csv.

However, passing header=0 has no effect:

Data('data/broken.csv', header=0).fields
['0', '1', '2']

Moreover, pandas infers the fields as expected:

pandas.read_csv('data/broken.csv').columns
Index([u'x', u'tl', u'z'], dtype='object')

So for now the workaround is Data(pandas.read_csv('data/broken.csv')).

@cpcloud
Member

cpcloud commented Sep 28, 2015

@bittlingmayer Pass has_header=True to the Data constructor:

d = Data('foo.csv', has_header=True)
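
To illustrate why an explicit flag helps, here is a minimal standard-library sketch of the same idea: guess the header when nothing is specified, but let the caller override the guess. `infer_fields` is a hypothetical helper written for this example, not Blaze's implementation; only the `has_header` keyword name is taken from the suggestion above.

```python
import csv
import io


def infer_fields(text, has_header=None):
    """Return column names for CSV text (hypothetical helper, not Blaze code).

    has_header=None  -> guess with csv.Sniffer (convenient, but can be wrong)
    has_header=True  -> trust the first row as column names (explicit)
    has_header=False -> generate positional names '0', '1', ...
    """
    if has_header is None:
        has_header = csv.Sniffer().has_header(text)
    first_row = next(csv.reader(io.StringIO(text)))
    if has_header:
        return first_row
    return [str(i) for i in range(len(first_row))]


sample = (
    "x,tl,z\n"
    "Be careful driving.,hy,en\n"
    "Be careful.,hy,en\n"
    "Can you translate this for me?,hy,en\n"
    "Chicago is very different from Boston.,hy,en\n"
    "Don't worry.,hy,en\n"
)

guessed = infer_fields(sample)                  # relies on the heuristic
forced = infer_fields(sample, has_header=True)  # caller states the intent

print(guessed)
print(forced)
```

For this sample the guess falls back to positional names, while the explicit flag yields the names from the first row.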

@cpcloud
Member

cpcloud commented Sep 28, 2015

I'm going to add a test and a bit of documentation for this issue and then close it.

@bittlingmayer
Author

Documentation would be good, although I also think the original (i.e. pandas) behaviour is a bit smarter.

@cpcloud
Member

cpcloud commented Dec 4, 2015

@bittlingmayer Yes, it's smarter; however, this is a compromise between explicitness and convenience.

@cpcloud cpcloud closed this as completed Dec 4, 2015