Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Format Timestamps as RFC3339 #2939

Merged
merged 4 commits into from
Oct 28, 2022
Merged

Format Timestamps as RFC3339 #2939

merged 4 commits into from
Oct 28, 2022

Conversation

waitingkuo
Copy link
Contributor

Which issue does this PR close?

Closes #2934
Closes #2937

  1. follows rfc3339 to format timestamp as 2000-01-01T00:00:00+08:00 instead of 2000-01-01 00:00:00 +08:00
  2. prettyprint to support timezone

this

use arrow::array::{
    Array, ArrayData, BooleanArray, Int32Array, Int32Builder, ListArray, PrimitiveArray,
    StringArray, StructArray, TimestampSecondArray,
};
use arrow::datatypes::*;
use arrow::record_batch::RecordBatch;
use std::sync::Arc;

use arrow::util::pretty::{pretty_format_batches, print_batches};

fn main() {
    println!("------------------------------------------------------------------------------------------------------------------");

    let arr =
        TimestampSecondArray::from_vec(vec![0, 1, 2, 3], Some("Asia/Taipei".to_string()));
    let schema = Arc::new(Schema::new(vec![Field::new(
        &arr.data_type().to_string(),
        arr.data_type().clone(),
        true,
    )]));

    let batch = RecordBatch::try_new(schema, vec![Arc::new(arr)]).unwrap();
    print_batches(&[batch]).unwrap();

    println!("------------------------------------------------------------------------------------------------------------------");

    let arr = TimestampSecondArray::from_vec(vec![0], Some("Asia/Taipei2".to_string()));
    let schema = Arc::new(Schema::new(vec![Field::new(
        &arr.data_type().to_string(),
        arr.data_type().clone(),
        true,
    )]));

    let batch = RecordBatch::try_new(schema, vec![Arc::new(arr)]).unwrap();
    print_batches(&[batch]).unwrap();

    println!("------------------------------------------------------------------------------------------------------------------");

    let arr = TimestampSecondArray::from_vec(vec![0], Some("+08:00".to_string()));
    let schema = Arc::new(Schema::new(vec![Field::new(
        &arr.data_type().to_string(),
        arr.data_type().clone(),
        true,
    )]));

    let batch = RecordBatch::try_new(schema, vec![Arc::new(arr)]).unwrap();
    print_batches(&[batch]).unwrap();

    println!("------------------------------------------------------------------------------------------------------------------");

    let arr: PrimitiveArray<TimestampMicrosecondType> =
        vec![Some(37800000000), None, Some(86339000000)].into();
    let schema = Arc::new(Schema::new(vec![Field::new(
        &arr.data_type().to_string(),
        arr.data_type().clone(),
        true,
    )]));

    let batch = RecordBatch::try_new(schema, vec![Arc::new(arr)]).unwrap();
    print_batches(&[batch]).unwrap();

    println!("-----------------------------------------------------------------------------------------------------------------");

    let arr: PrimitiveArray<TimestampMicrosecondType> =
        vec![Some(37800000000), None, Some(86339000000)].into();

    let arr = arr.with_timezone("Australia/Sydney".to_string());

    let schema = Arc::new(Schema::new(vec![Field::new(
        &arr.data_type().to_string(),
        arr.data_type().clone(),
        true,
    )]));

    let batch = RecordBatch::try_new(schema, vec![Arc::new(arr)]).unwrap();
    print_batches(&[batch]).unwrap();
}

outputs

---------------------------------
+----------------------------------------+
| Timestamp(Second, Some("Asia/Taipei")) |
+----------------------------------------+
| 1970-01-01T08:00:00+08:00              |
| 1970-01-01T08:00:01+08:00              |
| 1970-01-01T08:00:02+08:00              |
| 1970-01-01T08:00:03+08:00              |
+----------------------------------------+
------------------------------------------------------------------------------------------------------------------
+--------------------------------------------------------+
| Timestamp(Second, Some("Asia/Taipei2"))                |
+--------------------------------------------------------+
| 1970-01-01T00:00:00 (Unknown Time Zone 'Asia/Taipei2') |
+--------------------------------------------------------+
------------------------------------------------------------------------------------------------------------------
+-----------------------------------+
| Timestamp(Second, Some("+08:00")) |
+-----------------------------------+
| 1970-01-01T08:00:00+08:00         |
+-----------------------------------+
------------------------------------------------------------------------------------------------------------------
+------------------------------+
| Timestamp(Microsecond, None) |
+------------------------------+
| 1970-01-01T10:30:00          |
|                              |
| 1970-01-01T23:58:59          |
+------------------------------+
-----------------------------------------------------------------------------------------------------------------
+--------------------------------------------------+
| Timestamp(Microsecond, Some("Australia/Sydney")) |
+--------------------------------------------------+
| 1970-01-01T20:30:00+10:00                        |
|                                                  |
| 1970-01-02T09:58:59+10:00                        |

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

yes, timestamp format changed

@tustvold tustvold changed the title standarize timestamp output format Format Timestamps as RFC3339 Oct 27, 2022
@tustvold tustvold added the api-change Changes to the arrow API label Oct 27, 2022
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just some minor nits, that can also be addressed in subsequent cleanup PRs.

I would like to double-check with @alamb as this will likely churn tests in downstreams

make_string_datetime!(array::TimestampSecondArray, column, row)
DataType::Timestamp(unit, tz_string_opt) if *unit == TimeUnit::Second => {
match tz_string_opt {
Some(tz_string) => make_string_datetime_with_tz!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be nice to combine the make_string_datetime_with_tz and make_string_datetime macros together?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be awesome to avoid having to make a string at all (like return a impl Display) but that is definitely not something to do in this PR 😆

/// Generate an array with type $ARRAYTYPE with a numeric value of
/// $VALUE, and compare $EXPECTED_RESULT to the output of
/// formatting that array with `pretty_format_batches`
macro_rules! check_datetime_with_timezone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think passing an optional timezone to check_datetime and calling with_timezone_opt would avoid some duplication.

On a related note, I don't think either need to be macros and therefore probably shouldn't be.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in "make them functions" 👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach and results -- thank you @waitingkuo

I had some comments on the implementation, but this PR is quite close I think

/// // Corresponds to single element array with entry 1970-05-09T14:25:11+0:00
/// let arr = TimestampSecondArray::from(vec![11111111]);
/// // OR
/// let arr = TimestampSecondArray::from(vec![Some(11111111)]);
/// let utc_offset = FixedOffset::east(0);
/// let utc_tz: Tz = "+00:00".parse().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to show an example of going from a chrono timezone (like FixedOffset) to a Arrow::tz (though maybe that is already available -- I didn't check)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
I can't add it here as inner data is private now

error[E0423]: cannot initialize a tuple struct which contains private fields
   --> src/array/primitive_array.rs:143:14
    |
12  | let utc_tz = Tz(FixedOffset::east(0));
    |              ^^
    |
note: constructor is not visible here due to private fields
   --> /Users/willy/willy/waitingkuo/arrow-rs/arrow-array/src/timezone.rs:246:19
    |
246 |     pub struct Tz(FixedOffset);
    |                   ^^^^^^^^^^^ private field

/// An Arrow [`TimeZone`]
#[derive(Debug, Copy, Clone)]
pub struct Tz(FixedOffset);

as #2909 is to add a timezone abstraction, do we still encourage user to use chrono api directly?

@tustvold do you have any comments?

make_string_datetime!(array::TimestampSecondArray, column, row)
DataType::Timestamp(unit, tz_string_opt) if *unit == TimeUnit::Second => {
match tz_string_opt {
Some(tz_string) => make_string_datetime_with_tz!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be awesome to avoid having to make a string at all (like return a impl Display) but that is definitely not something to do in this PR 😆

.unwrap_or_else(|| "ERROR CONVERTING DATE".to_string())
};

Ok(s)
}};
}

macro_rules! make_string_datetime_with_tz {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might also be worth commenting that date/times are formatted as rfc3339?

.unwrap_or_else(|| "ERROR CONVERTING DATE".to_string()),
Err(_) => array
.value_as_datetime($row)
.map(|d| format!("{:?} (Unknown Time Zone '{}')", d, $tz_string))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right -- isn't this error for the case when $ts_string.parse() fails? As written I think it will be returned when the entry in array at $row is null (which should never happen).

I think we should remove the match statement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic behind is:

while parsing TZ failed, parse datetime only

  1. if it works, show the datetime part with a unknown time zone warning
+--------------------------------------------------------+
| Timestamp(Second, Some("Asia/Taipei2"))                |
+--------------------------------------------------------+
| 1970-01-01T00:00:00 (Unknown Time Zone 'Asia/Taipei2') |
+--------------------------------------------------------+
  1. if it doesn't work, display something (similar as make_string_datetime) like this
+--------------------------------------------------------+
| Timestamp(Second, Some("Asia/Taipei2"))                |
+--------------------------------------------------------+
| ERROR CONVERTING DATE                                  |
+--------------------------------------------------------+

@alamb @tustvold
I wonder if it's preferred. we could also show followings directly while failed to parse timezone

+--------------------------------------------------------+
| Timestamp(Second, Some("Asia/Taipei2"))                |
+--------------------------------------------------------+
| Unknown Time Zone 'Asia/Taipei2'                       |
+--------------------------------------------------------+

i agree the logic here is confusing, i should've added some comments here

let s = if array.is_null($row) {
"".to_string()
} else {
match $tz_string.parse::<Tz>() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as written this is going to parse the timezone string for every row -- perhaps we can parse it once per array 🤔

Copy link
Contributor

@tustvold tustvold Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this display code is very inefficient, I think there is a broader issue to clean it up.

See the disclaimer on https://docs.rs/arrow/latest/arrow/util/display/fn.array_value_to_string.html

/// Generate an array with type $ARRAYTYPE with a numeric value of
/// $VALUE, and compare $EXPECTED_RESULT to the output of
/// formatting that array with `pretty_format_batches`
macro_rules! check_datetime_with_timezone {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in "make them functions" 👍

@tustvold
Copy link
Contributor

Lets get this in so that it can make the release, further cleanups can be performed as follow on PRs

@tustvold tustvold merged commit cbee739 into apache:master Oct 28, 2022
@ursabot
Copy link

ursabot commented Oct 28, 2022

Benchmark runs are scheduled for baseline = 843a2e5 and contender = cbee739. cbee739 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

prettyprint to show timezone offset for timestamp with timezone Cast Timestamps to RFC3339 strings
4 participants