Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use kernel utility for parsing timestamps in csv reader. #832

Merged
merged 5 commits into from
Oct 20, 2021

Conversation

novemberkilo
Copy link
Contributor

Which issue does this PR close?

This is related to a datafusion issue -- apache/datafusion#958

Rationale for this change

apache/datafusion#958 (comment)

What changes are included in this PR?

So as to support more timestamp formats, this PR changes the implementation of parse for Microsecond and Nanosecond type Timestamps to use kernels::cast_utils::string_to_timestamp_nanos

Are there any user-facing changes?

I've checked that a user can now parse timestamps in CSV files as described in the referenced datafusion issue. Specifically, you can now do:

fn nyctaxi_schema() -> Schema {
    Schema::new(vec![
        Field::new("VendorID", DataType::Utf8, true),
        Field::new("pickup_datetime", DataType::Timestamp(TimeUnit::Microsecond, None), true),
        Field::new("dropoff_datetime", DataType::Timestamp(TimeUnit::Microsecond, None), true),
        ...

// @alamb

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 17, 2021
@codecov-commenter
Copy link

codecov-commenter commented Oct 17, 2021

Codecov Report

Merging #832 (4aad3dc) into master (7472e5e) will increase coverage by 0.03%.
The diff coverage is 98.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #832      +/-   ##
==========================================
+ Coverage   82.60%   82.64%   +0.03%     
==========================================
  Files         168      168              
  Lines       48040    48088      +48     
==========================================
+ Hits        39685    39742      +57     
+ Misses       8355     8346       -9     
Impacted Files Coverage Δ
arrow/src/csv/reader.rs 91.77% <98.07%> (+1.86%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7472e5e...4aad3dc. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @novemberkilo -- this change will ensure that the timestamp format supported by the csv parser are the same as the timestamp formats supported by the cast kernel (utf8 -> timestamp) 👍

arrow/src/csv/reader.rs Outdated Show resolved Hide resolved
arrow/src/csv/reader.rs Outdated Show resolved Hide resolved
arrow/src/csv/reader.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Oct 18, 2021

I think this PR will pass CI checks once it is rebased against master to pick up #839

}
_ => panic!(
"Unexpected failure converting {} to local datetime",
stringify! { naive_datetime }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this stringify is needed here: it converts naive_datetime into the string -- so this message always looks like:

thread 'main' panicked at 'Unexpected failure converting naive_datetime to local datetime', src/main.rs:2:3

Which you can see at the playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=9f6fbabade9152946ea5a84db486c2de

Since it is in test code, this is likely not a critical problem but I noticed it so figured I would point it out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out Andrew. Will fix.

@alamb alamb merged commit 4cfe621 into apache:master Oct 20, 2021
@alamb
Copy link
Contributor

alamb commented Oct 20, 2021

Thanks again @novemberkilo !

alamb pushed a commit that referenced this pull request Oct 24, 2021
* Use kernel utility for parsing timestamps in csvs.

* Remove cruft.

* Cleanup.

* Lint.

* Remove erroneous stringify.
alamb added a commit that referenced this pull request Oct 24, 2021
* Use kernel utility for parsing timestamps in csvs.

* Remove cruft.

* Cleanup.

* Lint.

* Remove erroneous stringify.

Co-authored-by: Navin <navin@novemberkilo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants