Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Confusion over Date64 array values #5288

Closed
Jefffrey opened this issue Jan 7, 2024 · 3 comments · Fixed by #5323
Closed

Confusion over Date64 array values #5288

Jefffrey opened this issue Jan 7, 2024 · 3 comments · Fixed by #5323
Labels
question Further information is requested

Comments

@Jefffrey
Copy link
Contributor

Jefffrey commented Jan 7, 2024

Which part is this question about

Date64 array values.

Describe your question

Docs for Date64 type states:

/// A signed 64-bit date representing the elapsed time since UNIX epoch (1970-01-01)
/// in milliseconds (64 bits). Values are evenly divisible by 86400000.
Date64,

Values are evenly divisible by 86400000

This seems to suggest that Date64 should NOT store time, and should only represent days since UNIX epoch, akin to Date32 (but as milliseconds, not days).

What is the point of Date64 type, then? It would be the same as Date32 but multiplied by 86400000 assuming it's used according to spec.

The bold is important, as there are examples where you can set values that are not evenly divisible by the factor, and the printing code even shows the time as well:

#[test]
fn test_pretty_format_date_64() {
let expected = vec![
"+---------------------+",
"| f |",
"+---------------------+",
"| 2005-03-18T01:58:20 |",
"| |",
"+---------------------+",
];
check_datetime!(Date64Array, 1111111100000, expected);
}

The C++ implementation seems to have a validate function, see apache/arrow#12014

But I can still set 'invalid' values via PyArrow as this full validation is optional:

>>> import pyarrow as pa
>>> days = pa.array([0, 1, 2], type=pa.date64())
>>> days
<pyarrow.lib.Date64Array object at 0x7f6810ecba60>
[
  1970-01-01,
  1970-01-01,
  1970-01-01
]
>>> days.validate()
>>> days.validate(full=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 1630, in pyarrow.lib.Array.validate
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: date64[ms] 1 does not represent a whole number of days
>>>

So I'm just wondering, even if we implement some sort of validation on these values (there is this old issue on the arrow repo: apache/arrow#26853), if this is not made mandatory, then what is the point of having that restriction on Date64 type?

Do we need to implement this optional validation on arrow-rs too, and also fix the print code to not show the time for Date64? Or just embrace that Date64 will also store time, contrary to the docs (both in arrow-rs and the official arrow repo)?

Additional context

This might be a wider arrow discussion, I'm not sure if it's been had before, feel free to link if so.

Came across this whilst looking into #5266

As I wasn't sure, given the case of a Date64 array, whether extracting the millisecond part should always return 0 (assuming the array contains valid values) or should return the actual milliseconds part (though that would technically mean the value is invalid?)

@Jefffrey Jefffrey added the question Further information is requested label Jan 7, 2024
@tustvold
Copy link
Contributor

tustvold commented Jan 7, 2024

https://lists.apache.org/thread/q036r1q3cw5ysn3zkpvljx3s9ho18419

TLDR, its for compatibility with some language's native libraries. It is certainly an odd one

full validation is optional

This would be consistent with how we handle decimals FWIW, validating always is too much of a performance hit, not to mention complex to implement

@Jefffrey
Copy link
Contributor Author

Jefffrey commented Jan 8, 2024

Right, I see.

So I guess for the sake of backward compatibility, arrow-rs would maintain its current behaviour for Date64 (essentially treating it like a timestamp, being able to extract a non-zero hour/minute/second from it and displaying the time as part of pretty print), but some changes can be made to clarify things for users:

  • Introduce an optional validate function similar to what C++ did
  • Fix up the doc on Date64 data type to document what the spec says, vs how its treated in reality (i.e. like a timestamp, unless validated)

@tustvold
Copy link
Contributor

tustvold commented Jan 8, 2024

Could probably go so far as to state that most use-cases should prefer Date32 or Timestamp depending on their use-case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants