Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow using custom datetime format for inference and parsing csv file #1112

Merged
merged 3 commits into from
Jan 2, 2022

Conversation

sum12
Copy link
Contributor

@sum12 sum12 commented Dec 31, 2021

The patch extends the current implementation to allow passing a custom
datetime_re and datetime_format to the ReaderBuilder.

datetime_re is used infer schema of the csv and then datetime_format is
used to parse the actual string to a Date64.
ofcourse passing non-compatible datetime_re and datetime_format values
is going to fail the parsing or inference, however it is an expected but
hard-to-detect failure.

Which issue does this PR close?

Closes #.

Rationale for this change

Sometimes csv files follow non-standard datetime format. Arrow does not really care for the datetime format stored in the csv but merely guesses/assumes it to be of a certain fomrat. The change extends the csv reader to use a custom fromat.

What changes are included in this PR?

Are there any user-facing changes?

Any external libraries relying on ReaderBuilder and Reader structs will have to be updated to the new new API,

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 31, 2021
@sum12 sum12 force-pushed the read_csv_with_formatted_datetime branch from 9d01b17 to af441b8 Compare December 31, 2021 01:54
@codecov-commenter
Copy link

codecov-commenter commented Dec 31, 2021

Codecov Report

Merging #1112 (2e91790) into master (2ad99ec) will decrease coverage by 0.06%.
The diff coverage is 72.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1112      +/-   ##
==========================================
- Coverage   82.34%   82.28%   -0.07%     
==========================================
  Files         168      168              
  Lines       49479    49670     +191     
==========================================
+ Hits        40743    40870     +127     
- Misses       8736     8800      +64     
Impacted Files Coverage Δ
arrow/src/csv/reader.rs 88.10% <72.78%> (-2.48%) ⬇️
arrow/src/csv/writer.rs 72.13% <100.00%> (+0.10%) ⬆️
arrow/src/compute/kernels/comparison.rs 90.87% <0.00%> (-2.60%) ⬇️
arrow/src/datatypes/datatype.rs 66.38% <0.00%> (-0.43%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
parquet/src/encodings/encoding.rs 93.52% <0.00%> (-0.20%) ⬇️
arrow/src/array/transform/mod.rs 85.69% <0.00%> (+0.13%) ⬆️
arrow/src/datatypes/field.rs 53.68% <0.00%> (+0.30%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ad99ec...2e91790. Read the comment docs.

@alamb alamb added the api-change Changes to the arrow API label Jan 1, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution @sum12 !

}
DataType::Date64 => {
build_primitive_array::<Date64Type>(line_number, rows, i)
build_primitive_array::<Date32Type>(line_number, rows, i, None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this also have datetime_format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this would need a date_format.
Since datetime_re/DATETIME_RE regex rejected the string. Then using the same format for date32 and date64 can lead to parsing errors

line_number,
rows,
i,
None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datefime_format?

line_number,
rows,
i,
None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also here?

arrow/src/csv/reader.rs Show resolved Hide resolved
arrow/src/csv/reader.rs Outdated Show resolved Hide resolved
@sum12 sum12 force-pushed the read_csv_with_formatted_datetime branch from af441b8 to 856cf39 Compare January 1, 2022 14:45
The patch extends the current implementation to allow passing a custom
datetime_re and datetime_format to the ReaderBuilder.

datetime_re is used infer schema of the csv and then datetime_format is
used to parse the actual string to a Date64.
ofcourse  passing non-compatible datetime_re and datetime_format values
is going to fail the parsing or inference, however it is an expected but
hard-to-detect failure.
The patch adds a new struct to collect all these options together and
then passes the struct around. Ideally the struct could be embedded into
the reader but that can be done as separate exercise.
The patch decides on using NaiveDateTime or DateTime from chrono lib
based on presence of timezone components

chrono expects timezone to be presetn if DateTime is used, errors
otherwise. Whereas NaiveDateTime ignores timezone even if explicitly
provided.
@sum12 sum12 force-pushed the read_csv_with_formatted_datetime branch from 856cf39 to 2e91790 Compare January 1, 2022 15:06
use chrono::format::Fixed;
use chrono::format::StrftimeItems;
let fmt = StrftimeItems::new(format);
let has_zone = fmt.into_iter().any(|item| match item {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to avoid evaluation of has_zone on very call. Not sure if I can have some "attributes" on this particular implementation of Parser/Date64Type and avoid having to re-evaluate it everytime

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a cool idea -- I wonder if we could implement Date64TypeWithZone or something -- or move the has_zone attribute into Date64Type directly?

escape: Option<u8>,
quote: Option<u8>,
terminator: Option<u8>,
roptions: ReaderOptions,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is much nicer

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sum12 -- I think this one is ready to go 👍

@alamb alamb merged commit d852229 into apache:master Jan 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants