[C++] CSV reader: Ability to not infer column types. #22232

asfimport · 2019-06-30T21:42:18Z

I'm trying to read CSVs as is, with all columns as strings. I don't know the schema of these CSVs, and they will vary because they are provided by users.

Right now I'm using pandas.read_csv(dtype=str), which works great, but since the final destination of these CSVs is Parquet files, it seems much more efficient to use pyarrow.csv.read_csv in the future, as soon as this becomes available :)

I tried things like pyarrow.csv.read_csv(convert_types=ConvertOptions(columns_types=defaultdict(lambda: 'string'))) but it doesn't work.

Maybe I just didn't find something that already exists? :)

Environment: Ubuntu Xenial
Reporter: Bogdan Klichuk

Note: This issue was originally created as ARROW-5811. Please see the migration documentation for further details.

asfimport · 2019-07-01T08:58:40Z

Antoine Pitrou / @pitrou:
No, column_types must be the full mapping of column names to data types. C++ doesn't know about defaultdict...

We could add more inference options, though, for example to select the datatypes for which inference is enabled.
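
To illustrate the defaultdict point in plain Python: a defaultdict has no entries until a key is looked up, so anything that copies its existing key/value pairs into another container ends up with nothing. This is a minimal sketch that assumes the bindings copy the mapping entry by entry. The column name "city" is made up for the example.

```python
from collections import defaultdict

# A defaultdict only materializes a key when that key is looked up.
column_types = defaultdict(lambda: "string")

# Copying the mapping entry by entry (roughly what filling a C++ map
# from it amounts to) sees no entries at all.
copied = dict(column_types.items())
print(copied)  # {}

# A lookup creates the entry, but only on the Python side.
column_types["city"]
print(dict(column_types))  # {'city': 'string'}
```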

asfimport · 2019-07-17T15:22:07Z

Antoine Pitrou / @pitrou:
@wesm @nealrichardson do you have an idea about a desirable API here?

asfimport · 2019-07-17T15:26:03Z

Wes McKinney / @wesm:
I think we need to create an abstract C++ type (or similar) that is a ConversionRule. We have other types of conversion rules where we have not defined an API yet, for example "timestamp with strptime-like format of $FORMAT". Whatever API we have, it needs to be extensible to accommodate new kinds of logic.

asfimport · 2019-07-17T15:31:36Z

Neal Richardson / @nealrichardson:
I think I'm not understanding the problem. What's missing from the column_types we already support? https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/options.h#L69

asfimport · 2019-07-17T15:33:38Z

Antoine Pitrou / @pitrou:
The request is for no inference to occur, without knowing the column names or the number of columns in advance (so you cannot pass an explicit column_types).

asfimport · 2019-07-17T15:41:46Z

Neal Richardson / @nealrichardson:
In principle, a user could parse the header row of the CSV separately to identify the column names, then use that to define column_types mapping each name to string type. So are we just talking about how to facilitate that, whether/how to internalize that logic and expose it as a simple argument? Or is there something else?

If column_types didn't have to be a map, maybe that would help. Perhaps it could also accept an array of length equal to the number of columns, or a single value, in which case it would recycle that type for every column.
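
Both ideas in this comment can be sketched on the Python side today. The helper names below (read_header, expand_column_types, read_csv_all_strings) are hypothetical and not part of pyarrow. The only pyarrow calls used are pa.string(), pyarrow.csv.ConvertOptions(column_types=...) and pyarrow.csv.read_csv(..., convert_options=...). The sketch also assumes the first row is the header, and that Python's csv module and Arrow's parser agree on how to split it.

```python
import csv
from collections.abc import Mapping, Sequence


def read_header(path, delimiter=","):
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="") as f:
        return next(csv.reader(f, delimiter=delimiter))


def expand_column_types(spec, names):
    """Expand `spec` into a full {column name: type} mapping.

    `spec` may be a mapping (copied as is), a list/tuple holding one type
    per column, or a single value that is recycled for every column.
    """
    if isinstance(spec, Mapping):
        return dict(spec)
    if isinstance(spec, Sequence) and not isinstance(spec, (str, bytes)):
        if len(spec) != len(names):
            raise ValueError("expected one type per column")
        return dict(zip(names, spec))
    return {name: spec for name in names}


def read_csv_all_strings(path):
    """Read a CSV with every column as a string (requires pyarrow)."""
    import pyarrow as pa
    from pyarrow import csv as pa_csv

    # Recycle the string type for every column named in the header.
    column_types = expand_column_types(pa.string(), read_header(path))
    options = pa_csv.ConvertOptions(column_types=column_types)
    return pa_csv.read_csv(path, convert_options=options)
```

Note that this reads the header twice, and that an empty file or duplicate column names would need extra handling. Both are reasons to prefer an option built into the reader.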

asfimport · 2019-07-17T15:48:15Z

Antoine Pitrou / @pitrou:
We're talking about C++ here. Dynamic typing isn't terribly idiomatic (though it's possible using std::variant) :-)

asfimport · 2019-07-17T15:56:28Z

Wes McKinney / @wesm:
Yeah, so we could define a conversion rule to return string or binary, and then add an option to set a default conversion rule (where currently we have an implicit default of "use type inference").
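
One way to picture the "default conversion rule" idea is the toy model below. It is written in Python for brevity and is not Arrow's API: ConversionRule, InferRule, StringRule and convert_columns are all hypothetical names, and the inference here (try int, then float, else keep strings) is deliberately simplistic.

```python
from abc import ABC, abstractmethod


class ConversionRule(ABC):
    """Hypothetical interface: turns a column's raw strings into values."""

    @abstractmethod
    def convert(self, raw_values):
        ...


class InferRule(ConversionRule):
    """Toy inference: try int, then float, otherwise keep the strings."""

    def convert(self, raw_values):
        for cast in (int, float):
            try:
                return [cast(v) for v in raw_values]
            except ValueError:
                continue
        return list(raw_values)


class StringRule(ConversionRule):
    """No inference: keep every value as the string that was read."""

    def convert(self, raw_values):
        return list(raw_values)


def convert_columns(columns, rules=None, default_rule=None):
    """Convert {name: raw strings}; columns without a rule use the default."""
    rules = rules or {}
    default_rule = default_rule or InferRule()  # today's implicit default
    return {
        name: rules.get(name, default_rule).convert(values)
        for name, values in columns.items()
    }


raw = {"id": ["1", "2"], "zip": ["01234", "90210"]}
inferred = convert_columns(raw)
as_strings = convert_columns(raw, default_rule=StringRule())
print(inferred)    # {'id': [1, 2], 'zip': [1234, 90210]}
print(as_strings)  # {'id': ['1', '2'], 'zip': ['01234', '90210']}
```

With the default switched to StringRule, no column names are needed up front, and values such as "01234" keep their leading zero. A per-column rule can still override the default.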
