[C++] CSV reader: Ability to not infer column types. #22232

Open
asfimport opened this issue Jun 30, 2019 · 8 comments

Comments

@asfimport

I'm trying to read CSVs as is, with all columns as strings. I don't know the schema of these CSVs, and they will vary because they are provided by users.

Right now I'm using pandas.read_csv(dtype=str), which works great, but since the final destination of these CSVs is Parquet files, it seems much more efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes available :)

I tried things like pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: 'string'))) but it doesn't work.

Maybe I just didn't find something that already exists? :)

Environment: Ubuntu Xenial
Reporter: Bogdan Klichuk

Note: This issue was originally created as ARROW-5811. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
No, column_types must be the full mapping of column names to data types. C++ doesn't know about defaultdict...

We could add more inference options, though, for example to select the datatypes for which inference is enabled.

@asfimport

Antoine Pitrou / @pitrou:
@wesm @nealrichardson do you have an idea about a desirable API here?

@asfimport

Wes McKinney / @wesm:
I think we need to create an abstract C++ type (or similar) that is a ConversionRule. We have other types of conversion rules where we have not defined an API yet, for example "timestamp with strptime-like format of $FORMAT". Whatever API we have, it needs to be extensible to accommodate new kinds of logic.

@asfimport

Neal Richardson / @nealrichardson:
I think I'm not understanding the problem. What's missing from the column_types we already support? https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L69

@asfimport

Antoine Pitrou / @pitrou:
The request is for no inference to occur, without knowing the column names or the number of columns in advance (so you cannot pass an explicit column_types).

@asfimport

Neal Richardson / @nealrichardson:
In principle, a user could parse the header row of the CSV separately to identify the column names, then use that to define column_types mapping each name to string type. So are we just talking about how to facilitate that, whether/how to internalize that logic and expose it as a simple argument? Or is there something else?

If column_types didn't have to be a map, maybe that would help. Perhaps it could also accept an array of length equal to the number of columns, or a single value, in which case it would recycle that type for every column. 
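
For reference, the workaround described above (read the header row first, then map every name to the string type) might look roughly like this in Python. This is only a sketch: the helper names `header_names` and `read_csv_all_strings` are made up for illustration, and it assumes a comma-delimited file with a single header row, unique column names, and the default text encoding.

```python
import csv


def header_names(path, delimiter=","):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="") as f:
        return next(csv.reader(f, delimiter=delimiter))


def read_csv_all_strings(path):
    """Read a CSV with pyarrow, forcing every column to the string type."""
    # Imported lazily so header_names() works without pyarrow installed.
    import pyarrow as pa
    from pyarrow import csv as pa_csv

    # Explicit types for every column, so no inference is needed for them.
    column_types = {name: pa.string() for name in header_names(path)}
    return pa_csv.read_csv(
        path,
        convert_options=pa_csv.ConvertOptions(column_types=column_types),
    )
```

The cost is that the file is opened twice, and the header parsing here has to agree with whatever delimiter and quoting settings are passed to the Arrow reader.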

 

@asfimport

Antoine Pitrou / @pitrou:
We're talking about C++ here. Dynamic typing isn't terribly idiomatic (though it's possible using std::variant) :-)

@asfimport

Wes McKinney / @wesm:
Yeah, so we could define a conversion rule to return string or binary, and then add an option to set a default conversion rule (where currently we have an implicit default of "use type inference").
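
To make the "default conversion rule" idea concrete, here is a toy model in plain Python. Nothing in it is an Arrow API: `ConversionRule`, `AsString`, `InferType`, and `read_columns` are all hypothetical names, and the "inference" is a deliberately simplistic stand-in (integer if every value parses, otherwise string). It only illustrates how a per-column rule could fall back to a configurable default instead of an implicit "always infer".

```python
import csv
import io
from abc import ABC, abstractmethod


class ConversionRule(ABC):
    """Hypothetical: decides how the raw text of one column is converted."""

    @abstractmethod
    def convert(self, values):
        ...


class AsString(ConversionRule):
    """Keep every value as the string that was read from the file."""

    def convert(self, values):
        return list(values)


class InferType(ConversionRule):
    """Toy inference: integers if every value parses, otherwise strings."""

    def convert(self, values):
        try:
            return [int(v) for v in values]
        except ValueError:
            return list(values)


def read_columns(text, default_rule, column_rules=None):
    """Convert each column with its own rule, or default_rule if none is set."""
    column_rules = column_rules or {}
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    columns = {}
    for i, name in enumerate(header):
        rule = column_rules.get(name, default_rule)
        columns[name] = rule.convert([row[i] for row in body])
    return columns
```

With `default_rule=AsString()`, no column names need to be known in advance, which is the case the original request is about; passing `InferType()` as the default would correspond to the current behavior.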
