[C++] Unable to read date64 or date32 in specific format from CSV #28303

asfimport · 2021-04-26T10:41:30Z

when importing csv data with dates in the format "%d-%b-%y" or "%d-%b-%Y" an error is given in conversion:

example:

import pyarrow as pa
from pyarrow import csv 

data = b"a,b\n1,15-OCT-15\n2,18-JUN-90\n"
tp = ["%d-%b-%y"]

try:
    schema_d64 = pa.schema([pa.field("a", pa.int64()), pa.field("b", pa.date64())])
    co_d64 = csv.ConvertOptions(timestamp_parsers=tp, column_types=schema_d64)
    a_d64 = csv.read_csv(pa.py_buffer(data), convert_options=co_d64)
except Exception as e:
    print(e)
try:
    schema_d32 = pa.schema([pa.field("a", pa.int64()), pa.field("b", pa.date32())])
    co_d32 = csv.ConvertOptions(timestamp_parsers=tp, column_types=schema_d32)
    a_d32 = csv.read_csv(pa.py_buffer(data), convert_options=co_d32)
except Exception as e:
    print(e)

Reporter: Stephen Bias

Related issues:

[C++] Parse time32 from string and infer in CSV reader (is related to)

_{Note: This issue was originally created as ARROW-12539. Please see the migration documentation for further details.}

The text was updated successfully, but these errors were encountered:

asfimport · 2021-04-28T13:55:35Z

Joris Van den Bossche / @jorisvandenbossche:
Related, it would also be good (and probably a prerequisite for having it in CSV) to be able to cast strings to date (which works for timestamp, but not for date):

In [3]: pa.array(["2012-01-01"]).cast(pa.timestamp('ms'))
Out[3]: 
<pyarrow.lib.TimestampArray object at 0x7fae22d778e0>
[
  2012-01-01 00:00:00.000
]

In [4]: pa.array(["2012-01-01"]).cast(pa.date32())
...
ArrowNotImplementedError: Unsupported cast from string to date32 using function cast_date32

yan-hic · 2023-04-17T17:01:59Z

Seems this has no traction - facing the same with another date format. Works if timestamp(s) dtype is passed but a date32 should indeed to cast accordingly.
I thought this was resolved as #27149 is closed for some time but OP's snippet still errors out.

felipecrv · 2023-04-24T18:50:15Z

@yan-hic this format seems a bit tricky since it requires knowledge about the language of the CSV -- English in this case.

yan-hic · 2023-04-24T19:42:39Z

@felipecrv I don't understand. The language is irrelevant. One passes the timestamp_parsers to parse from string, and arrow could be smart enough to cast to date32 instead if applicable i.e. if destination field is defined as such in the schema.
As @jorisvandenbossche wrote earlier, casting from string to date is (still) not possible today, which is in effect what the read_csv does.

Until then, it's a 2 step process: read_csv which parses to timestamp, and then cast all relevant columns to date32 (date64 not used in parquet)

fdzuJ · 2023-04-26T08:41:59Z

Which also means that it's impossible to read some dates

jorisvandenbossche · 2023-04-26T09:20:37Z

I thought this was resolved as #27149 is closed for some time but OP's snippet still errors out.

That issue was about inferring strings in default ISO date format as date type, not about being able to specify a custom string format (this issue).

Which also means that it's impossible to read some dates

Can you clarify this? What kind of date is impossible to read as a timestamp?

jorisvandenbossche · 2023-04-26T09:24:23Z

this format seems a bit tricky since it requires knowledge about the language of the CSV -- English in this case.

@felipecrv that's a general issue for parsing that kind of strings, i.e. the same locale-dependent behaviour applies to parsing it as a timestamp as well, and that's something we already support:

>>> pa.compute.strptime(["15-OCT-15", "18-JUN-90"], format="%d-%b-%y", unit="s")
<pyarrow.lib.TimestampArray object at 0x7f2c4f7b1f00>
[
  2015-10-15 00:00:00,
  1990-06-18 00:00:00
]

The issue here is also enabling passing such custom format for date32/date64 columns, and not only for timestamp columns.

fdzuJ · 2023-04-26T09:32:06Z

Can you clarify this? What kind of date is impossible to read as a timestamp?

Sorry, I was wrong - assumed date32/date64 is stored as number of days since UNIX epoch

davlee1972 · 2024-05-01T18:54:19Z

This is still an issue.. It also extends to time32[s]..

CSV conversion error to time32[s]: invalid value '7:55:00'

Right now the timestamp_parser will only convert different string formats to timestamp[x] types..

If you have a schema column with date32, date64 or time32 then the CSV conversion will fail.
Your date column in the CSV has to be in YYYY-MM-DD HH:MM:SS format or it will fail.. There is no alternative format..

The solution is to allow the format lists in timestamp_parser to apply to date32, date64, time32 columns..
or
create date_parser and time_parser options..

I have a current hack I implemented to be able to parse DATEs out of CSV files..
If the schema being used to read a CSV file has any column data types that start with "DATE"..
Change the schema column to timestamp[s] type.
Include the alternative Date formats in timestamp_parsers.. i.e. %d-%m-%Y, %m/%d/%Y, etc..
Read the CSV file which will read the date columns in as timestamps..
Convert the timestamp[s] columns of the result back to date32/64[day] using pyarrow.compute.cast().

Code to swap out date columns with timestamp columns in a schema for dataset api

        new_fields = []
        for field in self.arrow_schema.names:
            new_field = self.arrow_schema.field(field)
            if str(new_field.type).startswith("date"):
                new_fields.append(pa.field(field, pa.timestamp("s")))
            else:
                new_fields.append(self.arrow_schema.field(field))
        new_schema = pa.schema(new_fields)

Expression code to cast timestamp columns to dates when reading CSV files using dataset.to_table

                # convert column list into column dict selection
                if isinstance(columns, List):
                    columns = {column: dataset.field(column) for column in columns}

                # cast timestamps to date32 or date64 in schema definition
                columns = {
                    column: (
                        dataset.field(column).cast(
                            str(self.arrow_schema.field(column).type)
                        )
                        if column in self.arrow_schema.names
                        and str(self.arrow_schema.field(column).type).startswith("date")
                        else expr
                    )
                    for column, expr in columns.items()
                }

asfimport mentioned this issue Jan 11, 2023

[C++] Parse time32 from string and infer in CSV reader #27146

Closed

davlee1972 mentioned this issue May 1, 2024

CSV reader cannot parse dates or times #41488

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Unable to read date64 or date32 in specific format from CSV #28303

[C++] Unable to read date64 or date32 in specific format from CSV #28303

asfimport commented Apr 26, 2021 •

edited

asfimport commented Apr 28, 2021

yan-hic commented Apr 17, 2023

felipecrv commented Apr 24, 2023

yan-hic commented Apr 24, 2023 •

edited

fdzuJ commented Apr 26, 2023

jorisvandenbossche commented Apr 26, 2023

jorisvandenbossche commented Apr 26, 2023

fdzuJ commented Apr 26, 2023 •

edited

davlee1972 commented May 1, 2024 •

edited

[C++] Unable to read date64 or date32 in specific format from CSV #28303

[C++] Unable to read date64 or date32 in specific format from CSV #28303

Comments

asfimport commented Apr 26, 2021 • edited

Related issues:

asfimport commented Apr 28, 2021

yan-hic commented Apr 17, 2023

felipecrv commented Apr 24, 2023

yan-hic commented Apr 24, 2023 • edited

fdzuJ commented Apr 26, 2023

jorisvandenbossche commented Apr 26, 2023

jorisvandenbossche commented Apr 26, 2023

fdzuJ commented Apr 26, 2023 • edited

davlee1972 commented May 1, 2024 • edited

asfimport commented Apr 26, 2021 •

edited

yan-hic commented Apr 24, 2023 •

edited

fdzuJ commented Apr 26, 2023 •

edited

davlee1972 commented May 1, 2024 •

edited