Parse Time32/Time64 from formatted string #3101
This looks good to me; I especially love the test coverage. Before merging, I would like to get consensus on whether we want to support such a broad range of representations. I could definitely see an argument for supporting only RFC 3339-style times. Thoughts, @waitingkuo @alamb?
It's a fair point, as I was unsure which to support too. I was planning to base it on other Arrow implementations, like pyarrow, but had difficulty tracking down the actual parsing code to reference, so I just added a bunch of formats that seemed reasonable. Happy to cut some if it's too broad.
Can definitely add the support for arrow-rs/arrow-csv/src/reader.rs Lines 587 to 604 in 3ca41f5
I think this looks great personally -- thanks @Jefffrey
My opinion on the wide range of formats is that it is a good feature. My rationale is:
- It is consistent with the variety of timestamp handling we have
- It is a better user experience (if I have a Time made by Excel or some other program, I don't want to have to specify a custom timestamp format to deal with it).
If the speed of parsing a CSV file is the core issue, perhaps @tustvold's suggestion of allowing a custom format string (#3101 (comment)) would be one way to speed it up.
parser_primitive!(Time32MillisecondType);
parser_primitive!(Time32SecondType);
impl Parser for Time64NanosecondType {
    fn parse(string: &str) -> Option<Self::Native> {
I think accepting a wide range of formats is consistent with string_to_timestamp_nanos, for better or worse.
The only thing I recommend is adding docstring documentation (that will show up on docs.rs) for the types of formats accepted. We could follow the example of string_to_timestamp_nanos :
https://github.com/Jefffrey/arrow-rs/blob/3ca41f50d0e8b6da95d83e5bf0b09fd518e2110f/arrow-cast/src/parse.rs#L23-L54
One other option to speed this up might be to do a pass through the string and compute
- Number of colons
- Presence of space
- Presence of capital M
- Presence of decimal point
And use this to prune the list of candidates
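A minimal sketch of such a pre-pass, using only the standard library. The `TimeShape` struct, the function names, and the pruning table are hypothetical and are not the code from this PR; the format strings are the ones quoted elsewhere in this review, and shapes not listed simply yield no candidates.

```rust
/// Shape of a time string, gathered in a single pass (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
struct TimeShape {
    colons: usize,
    has_space: bool,
    has_capital_m: bool,
    has_decimal: bool,
}

/// Walk the string once and record the four properties listed above.
fn scan_time_string(s: &str) -> TimeShape {
    let mut shape = TimeShape {
        colons: 0,
        has_space: false,
        has_capital_m: false,
        has_decimal: false,
    };
    for b in s.bytes() {
        match b {
            b':' => shape.colons += 1,
            b' ' => shape.has_space = true,
            b'M' => shape.has_capital_m = true,
            b'.' => shape.has_decimal = true,
            _ => {}
        }
    }
    shape
}

/// Illustrative pruning table (an assumption, not the PR's table):
/// a space suggests a 12-hour format with an AM/PM marker.
fn candidate_formats(shape: &TimeShape) -> &'static [&'static str] {
    match (shape.colons, shape.has_space) {
        (2, false) => &["%H:%M:%S%.9f"],
        (2, true) => &["%I:%M:%S%.9f %p", "%l:%M:%S%.9f %P", "%l:%M:%S%.9f %p"],
        _ => &[],
    }
}
```

With this table, a string such as "23:59:59.123456789" is only tried against the 24-hour format, so the 12-hour candidates are never attempted.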
Thanks for the feedback. I'll update the documentation and take a shot at implementing that string pre-pass to prune the formats to attempt.
Implemented the parse_formatted function, and also implemented a preprocessing pass on the string to prune the formats to attempt parsing with.
thank you @Jefffrey
here are my comments:
- I wonder whether there's any public API that other crates could use (like string_to_timestamp_nanos for timestamps)?
- Move the default display format to first place (commented below).
- Consider supporting the leap second 23:59:60 (commented below).
This is what PostgreSQL does:
willy=# select time '23:59:60';
time
----------
24:00:00
(1 row)
- Consider the timezone.
We could discuss whether to drop the timezone directly or to shift to UTC first and then take the time.
PostgreSQL drops the timezone directly:
willy=# select time '00:00:00+08:00';
time
----------
00:00:00
(1 row)
willy=# select timestamp '2000-01-01T00:00:00+08:00';
timestamp
---------------------
2000-01-01 00:00:00
(1 row)
while datafusion's timestamp shifts it to utc first
❯ select timestamp '2000-01-01T00:00:00+08:00';
+-----------------------------------+
| Utf8("2000-01-01T00:00:00+08:00") |
+-----------------------------------+
| 1999-12-31T16:00:00 |
+-----------------------------------+
1 row in set. Query took 0.003 seconds.
We could submit another issue/PR for 3 or 4 (the leap second and timezone items) later if it's too large or we need time to discuss.
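The two timezone options above can be compared with a little arithmetic. This is a sketch under simplifying assumptions: times are whole seconds since midnight, the offset is a fixed number of seconds east of UTC, and both function names are hypothetical.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Option A (what the PostgreSQL session above shows): ignore the offset
/// and keep the wall-clock time as written.
fn drop_offset(local_secs: i64, _offset_secs: i64) -> i64 {
    local_secs
}

/// Option B (what the DataFusion timestamp output above shows): subtract
/// the offset to reach UTC, wrapping around midnight into [0, 86_400).
fn shift_to_utc(local_secs: i64, offset_secs: i64) -> i64 {
    (local_secs - offset_secs).rem_euclid(SECS_PER_DAY)
}
```

For '00:00:00+08:00', `drop_offset(0, 8 * 3600)` stays at 0 (00:00:00), while `shift_to_utc(0, 8 * 3600)` gives 57_600 (16:00:00), matching the time portion of the two query results above. Note that the shifted value has silently crossed into the previous day, which a bare time cannot record.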
arrow-cast/src/parse.rs
Outdated
"%I:%M:%S%.9f %p",
"%l:%M:%S%.9f %P",
"%l:%M:%S%.9f %p",
"%H:%M:%S%.9f",
I'd suggest moving this to first place, as it is our default display format (23:59:59.123456789).
this one %H:%M:%S%.9f
Sure thing, will do
Should no longer be relevant as I've implemented a preprocess pass which should ensure there isn't a priority clash between 24 and 12 hour time formats now
arrow-cast/src/parse.rs
Outdated
.map(|nt| {
    nt.num_seconds_from_midnight() as i64 * 1_000_000
        + (nt.nanosecond() as i64) / 1_000
})
We could consider whether to support the leap second 23:59:60.
%S already captures it: "Second number (00–60), zero-padded to 2 digits."
22:59:60 is parsed as 82800000000000 nanos, which works.
23:59:60 is parsed as 86400000000000 nanos, which overflows while we construct the array by Time64NanosecondArray::from(vec![86400000000000]); so it'll return a null.
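The two values quoted above follow from plain arithmetic. A minimal sketch, assuming a hypothetical helper that takes already-split fields and allows `second == 60` for a leap second (this is not chrono's API or the PR's code):

```rust
const NANOS_PER_SEC: i64 = 1_000_000_000;

/// Convert clock fields to nanoseconds since midnight.
/// `second` may be 60 to represent a leap second.
fn time_to_nanos(hour: u32, minute: u32, second: u32, frac_nanos: u32) -> Option<i64> {
    if hour > 23 || minute > 59 || second > 60 || frac_nanos >= 1_000_000_000 {
        return None;
    }
    let secs = hour as i64 * 3_600 + minute as i64 * 60 + second as i64;
    Some(secs * NANOS_PER_SEC + frac_nanos as i64)
}
```

`time_to_nanos(22, 59, 60, 0)` yields 82_800_000_000_000 (the same as 23:00:00), while `time_to_nanos(23, 59, 60, 0)` yields 86_400_000_000_000, which is exactly one full day and therefore one past the largest ordinary time-of-day value, 86_399_999_999_999.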
All four should now support having a leap second, per what chrono NaiveTime also supports
I originally was going to suggest this, but I decided against it as the semantics are actually a little bit funky. In particular, courtesy of the wonders of daylight savings, a non-FixedOffset timezone requires the date in order to be interpreted. I think it is acceptable to only handle timezones for timestamps, and not for times. FWIW this is the same approach taken by chrono - there is
3ca41f5 to 1fba65a
A behaviour I've changed which is worth noting is that it is valid to pass in fractions of a second, even if the type you're parsing for doesn't support that precision; it'll simply be truncated from the final representation. See:
assert_eq!(Time32SecondType::parse("02:10:01.1"), Some(7_801));
This technically was already happening for milli/micro/nano seconds anyway, but has been extended to seconds as well, to centralize all the behaviour. Let me know any thoughts on whether it should instead be stricter and fail the parsing, rather than passing and truncating.
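The truncation described here amounts to integer division of the parsed value by the target unit. A small sketch, assuming the parsed time is held as nanoseconds since midnight; the function name and the tuple return are illustrative only:

```rust
/// Truncate nanoseconds-since-midnight to (seconds, milliseconds, microseconds).
/// Integer division discards any finer-grained remainder instead of failing.
fn truncate_nanos(nanos: i64) -> (i32, i32, i64) {
    let seconds = (nanos / 1_000_000_000) as i32;
    let millis = (nanos / 1_000_000) as i32;
    let micros = nanos / 1_000;
    (seconds, millis, micros)
}
```

For "02:10:01.1" the parsed value is 7_801_100_000_000 ns, so the seconds representation is 7_801 (the fraction is dropped), while milliseconds keep it as 7_801_100.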
Left some minor suggestions, but also happy for this to go in as is. Nice work 👍
arrow-cast/src/parse.rs
Outdated
.fold((0, false, false), |tup, char| match char {
    ':' => (tup.0.saturating_add(1), tup.1, tup.2),
- .fold((0, false, false), |tup, char| match char {
-     ':' => (tup.0.saturating_add(1), tup.1, tup.2),
+ .fold((0_usize, false, false), |tup, char| match char {
+     ':' => (tup.0 + 1, tup.1, tup.2),
Using a usize means this can't actually overflow
arrow-cast/src/parse.rs
Outdated
// colon count, presence of decimal, presence of whitespace
fn preprocess_time_string(string: &str) -> (u8, bool, bool) {
    string
        .chars()
- .chars()
+ .as_bytes()
+ .iter()
And then match using b':'.
We don't actually need to use chars here, as the nature of the UTF-8 encoding is such that ASCII can be compared without ambiguity (see https://en.wikipedia.org/wiki/UTF-8#Encoding).
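A short illustration of why the byte-wise scan is safe: in UTF-8, every byte of a multi-byte character has its high bit set, so none of them can equal an ASCII byte such as b':'. The two helper names below are hypothetical.

```rust
/// Count ASCII colons by decoding full characters.
fn count_colons_chars(s: &str) -> usize {
    s.chars().filter(|&c| c == ':').count()
}

/// Count ASCII colons by comparing raw bytes, with no UTF-8 decoding.
fn count_colons_bytes(s: &str) -> usize {
    s.as_bytes().iter().filter(|&&b| b == b':').count()
}
```

Both functions agree on any valid `&str`, including input that mixes multi-byte characters with ASCII colons. A fullwidth colon (U+FF1A) is counted by neither, since it is a different character whose bytes are all non-ASCII.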
    Some(86_400_000_000_000)
);

// custom format
👍
arrow-cast/src/parse.rs
Outdated
    })
}

fn naive_time_parser(string: &str, formats: &[&str]) -> Option<NaiveTime> {
It might be cleaner to make the match block evaluate to &[&str] and then have this follow, e.g. something like:
let formats = match preprocess_time_string(s.trim()) {
...
};
formats
.iter()
.find_map(|f| NaiveTime::parse_from_str(string, f).ok())
1fba65a to dea411d
Benchmark runs are scheduled for baseline = c99d2f3 and contender = c95eb4c. c95eb4c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
* Parse Time32/Time64 from formatted string * PR comments * PR comments refactoring
Which issue does this PR close?
Closes #3100.
Rationale for this change
What changes are included in this PR?
Enable parsing Time32/Time64 from formatted string.
Enable reading Time32/Time64 from CSV files.
Are there any user-facing changes?
Able to parse Time32/Time64 types from formatted string, and from CSV.