Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Unable to read date64 or date32 in specific format from CSV #28303

Open
asfimport opened this issue Apr 26, 2021 · 9 comments
Open

[C++] Unable to read date64 or date32 in specific format from CSV #28303

asfimport opened this issue Apr 26, 2021 · 9 comments

Comments

@asfimport
Copy link

asfimport commented Apr 26, 2021

when importing csv data with dates in the format "%d-%b-%y" or "%d-%b-%Y" an error is given in conversion:

example:

import pyarrow as pa
from pyarrow import csv 

data = b"a,b\n1,15-OCT-15\n2,18-JUN-90\n"
tp = ["%d-%b-%y"]

try:
    schema_d64 = pa.schema([pa.field("a", pa.int64()), pa.field("b", pa.date64())])
    co_d64 = csv.ConvertOptions(timestamp_parsers=tp, column_types=schema_d64)
    a_d64 = csv.read_csv(pa.py_buffer(data), convert_options=co_d64)
except Exception as e:
    print(e)
try:
    schema_d32 = pa.schema([pa.field("a", pa.int64()), pa.field("b", pa.date32())])
    co_d32 = csv.ConvertOptions(timestamp_parsers=tp, column_types=schema_d32)
    a_d32 = csv.read_csv(pa.py_buffer(data), convert_options=co_d32)
except Exception as e:
    print(e)

 

Reporter: Stephen Bias

Related issues:

Note: This issue was originally created as ARROW-12539. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
Related, it would also be good (and probably a prerequisite for having it in CSV) to be able to cast strings to date (which works for timestamp, but not for date):

In [3]: pa.array(["2012-01-01"]).cast(pa.timestamp('ms'))
Out[3]: 
<pyarrow.lib.TimestampArray object at 0x7fae22d778e0>
[
  2012-01-01 00:00:00.000
]

In [4]: pa.array(["2012-01-01"]).cast(pa.date32())
...
ArrowNotImplementedError: Unsupported cast from string to date32 using function cast_date32

@yan-hic
Copy link

yan-hic commented Apr 17, 2023

Seems this has no traction - facing the same with another date format. Works if timestamp(s) dtype is passed but a date32 should indeed to cast accordingly.
I thought this was resolved as #27149 is closed for some time but OP's snippet still errors out.

@felipecrv
Copy link
Contributor

@yan-hic this format seems a bit tricky since it requires knowledge about the language of the CSV -- English in this case.

@yan-hic
Copy link

yan-hic commented Apr 24, 2023

@felipecrv I don't understand. The language is irrelevant. One passes the timestamp_parsers to parse from string, and arrow could be smart enough to cast to date32 instead if applicable i.e. if destination field is defined as such in the schema.
As @jorisvandenbossche wrote earlier, casting from string to date is (still) not possible today, which is in effect what the read_csv does.

Until then, it's a 2 step process: read_csv which parses to timestamp, and then cast all relevant columns to date32 (date64 not used in parquet)

@fdzuJ
Copy link
Contributor

fdzuJ commented Apr 26, 2023

Which also means that it's impossible to read some dates

@jorisvandenbossche
Copy link
Member

I thought this was resolved as #27149 is closed for some time but OP's snippet still errors out.

That issue was about inferring strings in default ISO date format as date type, not about being able to specify a custom string format (this issue).

Which also means that it's impossible to read some dates

Can you clarify this? What kind of date is impossible to read as a timestamp?

@jorisvandenbossche
Copy link
Member

this format seems a bit tricky since it requires knowledge about the language of the CSV -- English in this case.

@felipecrv that's a general issue for parsing that kind of strings, i.e. the same locale-dependent behaviour applies to parsing it as a timestamp as well, and that's something we already support:

>>> pa.compute.strptime(["15-OCT-15", "18-JUN-90"], format="%d-%b-%y", unit="s")
<pyarrow.lib.TimestampArray object at 0x7f2c4f7b1f00>
[
  2015-10-15 00:00:00,
  1990-06-18 00:00:00
]

The issue here is also enabling passing such custom format for date32/date64 columns, and not only for timestamp columns.

@fdzuJ
Copy link
Contributor

fdzuJ commented Apr 26, 2023

Can you clarify this? What kind of date is impossible to read as a timestamp?

Sorry, I was wrong - assumed date32/date64 is stored as number of days since UNIX epoch

@davlee1972
Copy link

davlee1972 commented May 1, 2024

This is still an issue.. It also extends to time32[s]..

CSV conversion error to time32[s]: invalid value '7:55:00'

Right now the timestamp_parser will only convert different string formats to timestamp[x] types..

If you have a schema column with date32, date64 or time32 then the CSV conversion will fail.
Your date column in the CSV has to be in YYYY-MM-DD HH:MM:SS format or it will fail.. There is no alternative format..

The solution is to allow the format lists in timestamp_parser to apply to date32, date64, time32 columns..
or
create date_parser and time_parser options..

I have a current hack I implemented to be able to parse DATEs out of CSV files..
If the schema being used to read a CSV file has any column data types that start with "DATE"..
Change the schema column to timestamp[s] type.
Include the alternative Date formats in timestamp_parsers.. i.e. %d-%m-%Y, %m/%d/%Y, etc..
Read the CSV file which will read the date columns in as timestamps..
Convert the timestamp[s] columns of the result back to date32/64[day] using pyarrow.compute.cast().

Code to swap out date columns with timestamp columns in a schema for dataset api

        new_fields = []
        for field in self.arrow_schema.names:
            new_field = self.arrow_schema.field(field)
            if str(new_field.type).startswith("date"):
                new_fields.append(pa.field(field, pa.timestamp("s")))
            else:
                new_fields.append(self.arrow_schema.field(field))
        new_schema = pa.schema(new_fields)

Expression code to cast timestamp columns to dates when reading CSV files using dataset.to_table

                # convert column list into column dict selection
                if isinstance(columns, List):
                    columns = {column: dataset.field(column) for column in columns}

                # cast timestamps to date32 or date64 in schema definition
                columns = {
                    column: (
                        dataset.field(column).cast(
                            str(self.arrow_schema.field(column).type)
                        )
                        if column in self.arrow_schema.names
                        and str(self.arrow_schema.field(column).type).startswith("date")
                        else expr
                    )
                    for column, expr in columns.items()
                }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants