
Parse Time32/Time64 from formatted string #3101

Merged: 3 commits merged into apache:master from Jefffrey:read_time_from_csv, Nov 15, 2022

Conversation

@Jefffrey (Contributor) commented Nov 13, 2022

Which issue does this PR close?

Closes #3100.

Rationale for this change

What changes are included in this PR?

Enable parsing Time32/Time64 from formatted string.

Enable reading Time32/Time64 from CSV files.

Are there any user-facing changes?

Yes: Time32/Time64 types can now be parsed from formatted strings, and read from CSV.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 13, 2022
@tustvold (Contributor) commented Nov 14, 2022

This looks good to me, I especially love the test coverage.

Before merging I would like to get consensus on whether we want to support such a broad range of representations; I could definitely see an argument for supporting only RFC3339-style times, i.e. %H:%M:%S and %H:%M:%S%.f.

Thoughts @waitingkuo @alamb ?

@Jefffrey (Contributor Author)

It's a fair point, as I was unsure which to support too. I was planning to base it on other Arrow implementations, like pyarrow, but had difficulty tracking down the actual parsing code to reference, so I went with a set of formats that seemed reasonable. Happy to cut some if it's too broad.

@tustvold (Contributor)

One other option would be to support a default representation, but allow specifying a custom format string via Parse::parse_formatted, originally added by @sum12 in #1451.

@Jefffrey
Copy link
Contributor Author

> One other option would be to support a default representation, but support specifying a custom format string by using Parse::parse_formatted originally added by @sum12 in #1451

Can definitely add support for Parse::parse_formatted, but that still leaves the question of what the default(s) should be, especially for the original use case of parsing from CSV. Ideally we wouldn't want to call parse_formatted for all the different representations here:

DataType::Time32(TimeUnit::Second) => {
    build_primitive_array::<Time32SecondType>(line_number, rows, i, None)
}
DataType::Time32(TimeUnit::Millisecond) => {
    build_primitive_array::<Time32MillisecondType>(line_number, rows, i, None)
}
DataType::Time64(TimeUnit::Microsecond) => {
    build_primitive_array::<Time64MicrosecondType>(line_number, rows, i, None)
}
DataType::Time64(TimeUnit::Nanosecond) => {
    build_primitive_array::<Time64NanosecondType>(line_number, rows, i, None)
}

@alamb (Contributor) left a comment

I think this looks great personally -- thanks @Jefffrey

My opinion on the wide range of formats is that it is a good feature. My rationale is:

  1. It is consistent with the variety of timestamp handling we have
  2. It is a better user experience (if I have a Time produced by Excel or some other program, I don't want to have to specify a custom timestamp format to deal with it).

If the speed of parsing a CSV file is the core issue, perhaps @tustvold's suggestion of allowing a custom format string #3101 (comment) would be one way to speed it up.

parser_primitive!(Time32MillisecondType);
parser_primitive!(Time32SecondType);
impl Parser for Time64NanosecondType {
    fn parse(string: &str) -> Option<Self::Native> {
Contributor:

I think accepting a wide range of formats is consistent with string_to_timestamp_nanos for better or worse

https://github.com/Jefffrey/arrow-rs/blob/3ca41f50d0e8b6da95d83e5bf0b09fd518e2110f/arrow-cast/src/parse.rs#L71-L133

The only thing I recommend is adding docstring documentation (that will show up on docs.rs) for the types of formats accepted. We could follow the example of string_to_timestamp_nanos:
https://github.com/Jefffrey/arrow-rs/blob/3ca41f50d0e8b6da95d83e5bf0b09fd518e2110f/arrow-cast/src/parse.rs#L23-L54

Contributor:

One other option to speed this up might be to do a pass through the string and compute

  • Number of colons
  • Presence of space
  • Presence of capital M
  • Presence of decimal point

And use this to prune the list of candidates
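A std-only sketch of such a pre-pass (illustrative only; scan_time_string is a hypothetical helper, not code from this PR):

```rust
// Scan the string once and record features that discriminate between
// candidate time formats: colon count, plus presence of a space, a
// capital 'M' (as in AM/PM), and a decimal point.
fn scan_time_string(s: &str) -> (usize, bool, bool, bool) {
    let mut colons = 0;
    let (mut space, mut meridiem, mut decimal) = (false, false, false);
    for c in s.chars() {
        match c {
            ':' => colons += 1,
            ' ' => space = true,
            'M' => meridiem = true,
            '.' => decimal = true,
            _ => {}
        }
    }
    (colons, space, meridiem, decimal)
}

fn main() {
    // Two colons and a decimal point: try the %H:%M:%S%.f family first.
    assert_eq!(scan_time_string("23:59:59.123"), (2, false, false, true));
    // A space and a capital M: only the 12-hour AM/PM formats are candidates.
    assert_eq!(scan_time_string("11:59 PM"), (1, true, true, false));
}
```

Each candidate format implies a fixed feature tuple, so the scan lets the parser skip formats whose features cannot match.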

Contributor Author:

Thanks for the feedback; I'll update the documentation and take a shot at implementing that string pre-pass to prune the formats to attempt parsing with.

Contributor Author:

Implemented the parse_formatted function, and also added a preprocessing pass on the string to prune the formats to attempt parsing with.

@waitingkuo (Contributor) left a comment

Thank you @Jefffrey, here are my comments:

  1. I wonder whether there's any public API that other crates could use (like string_to_timestamp_nanos for timestamps)?
  2. Move the default display format to first place (commented below).
  3. Consider supporting the leap second 23:59:60 (commented below). This is what PostgreSQL does:

willy=# select time '23:59:60';
   time   
----------
 24:00:00
(1 row)

  4. Consider the timezone. We could discuss either dropping the timezone directly or shifting to UTC and then taking the time. PostgreSQL drops the timezone directly:

willy=# select time '00:00:00+08:00';
   time   
----------
 00:00:00
(1 row)

willy=# select timestamp '2000-01-01T00:00:00+08:00';
      timestamp      
---------------------
 2000-01-01 00:00:00
(1 row)

while DataFusion's timestamp shifts it to UTC first:

select timestamp '2000-01-01T00:00:00+08:00';
+-----------------------------------+
| Utf8("2000-01-01T00:00:00+08:00") |
+-----------------------------------+
| 1999-12-31T16:00:00               |
+-----------------------------------+
1 row in set. Query took 0.003 seconds.

We could submit another issue/PR for 3 or 4 later if it's too large or we need time to discuss.

"%I:%M:%S%.9f %p",
"%l:%M:%S%.9f %P",
"%l:%M:%S%.9f %p",
"%H:%M:%S%.9f",
Contributor:

I'd suggest moving this to first place, as this is our default display format (23:59:59.123456789).

Contributor:

i.e. this one: %H:%M:%S%.9f

Contributor Author:

Sure thing, will do

Contributor Author:

Should no longer be relevant, as I've implemented a preprocessing pass which should ensure there isn't a priority clash between 24-hour and 12-hour time formats now.

Comment on lines 257 to 260
.map(|nt| {
    nt.num_seconds_from_midnight() as i64 * 1_000_000
        + (nt.nanosecond() as i64) / 1_000
})
Contributor:

We could consider whether to support the leap second 23:59:60.

%S already captures "Second number (00–60), zero-padded to 2 digits."

22:59:60 is parsed as 82800000000000 nanos, which works.

23:59:60 is parsed as 86400000000000 nanos, which overflows when we construct the array via Time64NanosecondArray::from(vec![86400000000000]), so it'll return a null.
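As a quick arithmetic check of the overflow described above (a sketch, assuming valid Time64 nanosecond values lie strictly below one day's worth of nanoseconds):

```rust
fn main() {
    const NANOS_PER_SEC: i64 = 1_000_000_000;
    const NANOS_PER_DAY: i64 = 86_400 * NANOS_PER_SEC;

    // 22:59:60 is 82_800 seconds from midnight: still below the day
    // boundary, so it maps to a valid Time64 nanosecond value.
    let t1 = (22 * 3600 + 59 * 60 + 60) as i64 * NANOS_PER_SEC;
    assert_eq!(t1, 82_800_000_000_000);
    assert!(t1 < NANOS_PER_DAY);

    // 23:59:60 is 86_400 seconds from midnight: exactly the day length,
    // so it falls outside the valid range (hence the null noted above).
    let t2 = (23 * 3600 + 59 * 60 + 60) as i64 * NANOS_PER_SEC;
    assert_eq!(t2, 86_400_000_000_000);
    assert!(t2 >= NANOS_PER_DAY);
}
```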

Contributor Author:

All four should now support having a leap second, in line with what chrono's NaiveTime supports.

@tustvold (Contributor)

> consider the timezone

I originally was going to suggest this, but decided against it as the semantics are actually a little funky. In particular, courtesy of the wonders of daylight savings, a non-FixedOffset timezone requires the date in order to be interpreted. I think it is acceptable to only handle timezones for timestamps, and not for times. FWIW this is the same approach taken by chrono: there is DateTime<Tz> and NaiveDateTime, but only NaiveTime and no Time<Tz>.

@Jefffrey (Contributor Author)

A behaviour I've changed which is worth noting: it is now valid to pass in fractions of a second even if the type you're parsing for doesn't support that precision; the excess precision is simply truncated from the final representation.

See:

assert_eq!(Time32SecondType::parse("02:10:01.1"), Some(7_801));

This was technically already happening for milli-/micro-/nanoseconds anyway, but has been extended to seconds as well, to centralize all the behaviour. Let me know if you think parsing should instead be stricter and fail, rather than pass and truncate.
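The truncation behaviour can be illustrated with a std-only sketch (parse_time32_seconds is a hypothetical helper written for this example, not the arrow-cast implementation; it handles only the plain %H:%M:%S%.f shape):

```rust
// Parse "HH:MM:SS[.fff...]" into whole seconds since midnight,
// silently dropping any fractional part, as described above.
fn parse_time32_seconds(s: &str) -> Option<i32> {
    // Split off any fractional seconds; they are simply ignored.
    let whole = s.split('.').next()?;
    let mut parts = whole.split(':');
    let h: i32 = parts.next()?.parse().ok()?;
    let m: i32 = parts.next()?.parse().ok()?;
    // 60 is allowed to admit a leap second.
    let sec: i32 = parts.next()?.parse().ok()?;
    if !(0..24).contains(&h) || !(0..60).contains(&m) || !(0..=60).contains(&sec) {
        return None;
    }
    Some(h * 3600 + m * 60 + sec)
}

fn main() {
    // The ".1" is truncated, matching the assertion quoted above.
    assert_eq!(parse_time32_seconds("02:10:01.1"), Some(7_801));
    assert_eq!(parse_time32_seconds("23:59:59"), Some(86_399));
    assert_eq!(parse_time32_seconds("25:00:00"), None);
}
```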

@tustvold (Contributor) left a comment

Left some minor suggestions, but also happy for this to go in as is. Nice work 👍

Comment on lines 155 to 156
.fold((0, false, false), |tup, char| match char {
    ':' => (tup.0.saturating_add(1), tup.1, tup.2),
Contributor:

Suggested change:
- .fold((0, false, false), |tup, char| match char {
-     ':' => (tup.0.saturating_add(1), tup.1, tup.2),
+ .fold((0_usize, false, false), |tup, char| match char {
+     ':' => (tup.0 + 1, tup.1, tup.2),

Using a usize means this can't actually overflow, since the colon count is bounded by the string length, which is itself a usize.

// colon count, presence of decimal, presence of whitespace
fn preprocess_time_string(string: &str) -> (u8, bool, bool) {
    string
        .chars()
Contributor:

Suggested change:
- .chars()
+ .as_bytes()
+ .iter()
And then match using b':'

We don't actually need to use chars here, as the nature of the UTF-8 encoding is such that ASCII can be compared without ambiguity - https://en.wikipedia.org/wiki/UTF-8#Encoding
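A byte-wise version of the pre-pass per this suggestion might look like the following (a hypothetical standalone sketch, not the exact code from the PR):

```rust
// UTF-8 guarantees that bytes in 0x00..=0x7F only ever encode ASCII
// characters (continuation bytes are always >= 0x80), so matching on
// b':' over as_bytes() is unambiguous and avoids char decoding.
// Returns (colon count, presence of decimal, presence of whitespace).
fn preprocess_time_string(s: &str) -> (usize, bool, bool) {
    s.as_bytes()
        .iter()
        .fold((0_usize, false, false), |(colons, decimal, space), &b| match b {
            b':' => (colons + 1, decimal, space),
            b'.' => (colons, true, space),
            b' ' => (colons, decimal, true),
            _ => (colons, decimal, space),
        })
}

fn main() {
    assert_eq!(preprocess_time_string("23:59:59.123"), (2, true, false));
    // Multi-byte characters can never be mistaken for an ASCII ':' byte.
    assert_eq!(preprocess_time_string("ＡＭ 12:34"), (1, false, true));
}
```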

Some(86_400_000_000_000)
);

// custom format
Contributor:

👍

})
}

fn naive_time_parser(string: &str, formats: &[&str]) -> Option<NaiveTime> {
Contributor:

It might be cleaner to make the match block evaluate to &[&str] and then have this follow.

e.g. something like

let formats = match preprocess_time_string(s.trim()) {
    ...
};


formats
    .iter()
    .find_map(|f| NaiveTime::parse_from_str(string, f).ok())

@tustvold tustvold merged commit c95eb4c into apache:master Nov 15, 2022
@Jefffrey Jefffrey deleted the read_time_from_csv branch November 15, 2022 22:10
@ursabot commented Nov 15, 2022

Benchmark runs are scheduled for baseline = c99d2f3 and contender = c95eb4c. c95eb4c is a master commit associated with this PR. Results will be available as each benchmark for each run completes.

jimexist pushed a commit that referenced this pull request Nov 16, 2022
* Parse Time32/Time64 from formatted string

* PR comments

* PR comments refactoring