Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-27199: Read TIMESTAMP WITH LOCAL TIME ZONE columns from text files using custom formats #4170

Closed
wants to merge 5 commits into from

Conversation

zabetak
Copy link
Contributor

@zabetak zabetak commented Mar 29, 2023

What changes were proposed in this pull request?

  1. Support parsing TimestampTZ using the TimestampParser, which accepts multiple DateTimeFormatters.
  2. Pass timestamp.formats in Lazy inspector handling TIMESTAMP WITH LOCAL TIME ZONE and instantiate a TimestampParser.
  3. Refactor TimestampTZUtil to allow passing different DateTimeFormatters.
  4. Add tests covering timestamps with 3 different formats (built-in, plus 2 more not covered by the default).

Why are the changes needed?

  1. Provide flexibility to users storing timestamps with custom formats in text files.
  2. Align TIMESTAMP and TIMESTAMP WITH LOCAL TIME ZONE as far as concerns custom format support.

Does this PR introduce any user-facing change?

Yes, users can now specify custom timestamp.formats for both TIMESTAMP and TIMESTAMP WITH LOCAL TIME ZONE data types.
For existing tables containing timestamp.formats and TIMESTAMP WITH LOCAL TIME ZONE type results may change from NULL to a "valid" timestamp value.

How was this patch tested?

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=timestamptz_formats.q

…enrich tests

Before the changes only space was allowed between date and time parts.
2022-05-29 12:23:43

After the changes following literals become valid (were not before)
2022-05-29T 12:23:43
2022-05-29T12:23:43
2022-05-2912:23:43

Regarding the TZL parser there still a few inconsistencies:

Valid: 2022-05-29T 12:23:43
Invalid: 2022-05-29 T12:23:43
Invalid: 2022-05-29 T 12:23:43

General observation is that TZL parser is much more lenient than the TZ parser; not sure we want that.
The change fixes some inconsistencies which were introduced by the previous change:
2016-05-03T 12:26:34 -> NULL
2016-05-0312:26:34 -> NULL

but introduces some new:

# Case I
Valid TZ: 2016-05-03T12:26:34
Invalid TZL: 2016-05-03T12:26:34

# Case II
Valid TZL: 2016-05-03 12:26:34
Invalid TZL: 2016-05-03T12:26:34
@github-actions
Copy link

github-actions bot commented Mar 29, 2023

@check-spelling-bot Report

🔴 Please review

See the files view or the action log for details.

Unrecognized words (1)

TZL

Previously acknowledged words that are now absent aarry timestamplocal yyyy
To accept these unrecognized words as correct (and remove the previously acknowledged and now absent words), run the following commands

... in a clone of the git@github.com:zabetak/hive.git repository
on the tzlocalformat branch:

update_files() {
perl -e '
my @expect_files=qw('".github/actions/spelling/expect.txt"');
@ARGV=@expect_files;
my @stale=qw('"$patch_remove"');
my $re=join "|", @stale;
my $suffix=".".time();
my $previous="";
sub maybe_unlink { unlink($_[0]) if $_[0]; }
while (<>) {
if ($ARGV ne $old_argv) { maybe_unlink($previous); $previous="$ARGV$suffix"; rename($ARGV, $previous); open(ARGV_OUT, ">$ARGV"); select(ARGV_OUT); $old_argv = $ARGV; }
next if /^(?:$re)(?:(?:\r|\n)*$| .*)/; print;
}; maybe_unlink($previous);'
perl -e '
my $new_expect_file=".github/actions/spelling/expect.txt";
use File::Path qw(make_path);
use File::Basename qw(dirname);
make_path (dirname($new_expect_file));
open FILE, q{<}, $new_expect_file; chomp(my @words = <FILE>); close FILE;
my @add=qw('"$patch_add"');
my %items; @items{@words} = @words x (1); @items{@add} = @add x (1);
@words = sort {lc($a)."-".$a cmp lc($b)."-".$b} keys %items;
open FILE, q{>}, $new_expect_file; for my $word (@words) { print FILE "$word\n" if $word =~ /\w/; };
close FILE;
system("git", "add", $new_expect_file);
'
}

comment_json=$(mktemp)
curl -L -s -S \
-H "Content-Type: application/json" \
"https://api.github.com/repos/apache/hive/issues/comments/1488812049" > "$comment_json"
comment_body=$(mktemp)
jq -r ".body // empty" "$comment_json" > $comment_body
rm $comment_json

patch_remove=$(perl -ne 'next unless s{^</summary>(.*)</details>$}{$1}; print' < "$comment_body")

patch_add=$(perl -e '$/=undef; $_=<>; if (m{Unrecognized words[^<]*</summary>\n*```\n*([^<]*)```\n*</details>$}m) { print "$1" } elsif (m{Unrecognized words[^<]*\n\n((?:\w.*\n)+)\n}m) { print "$1" };' < "$comment_body")

update_files
rm $comment_body
git add -u
If the flagged items do not appear to be text

If items relate to a ...

  • well-formed pattern.

    If you can write a pattern that would match it,
    try adding it to the patterns.txt file.

    Patterns are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your lines.

    Note that patterns can't match multiline strings.

  • binary file.

    Please add a file path to the excludes.txt file matching the containing file.

    File paths are Perl 5 Regular Expressions - you can test yours before committing to verify it will match your files.

    ^ refers to the file's path from the root of the repository, so ^README\.md$ would exclude README.md (on whichever branch you're using).

…es using custom formats

1. Support parsing TimestampTZ using the TimestampParser, which accepts
multiple DateTimeFormatters.
2. Pass timestamp.formats in Lazy inspector handling
TIMESTAMP WITH LOCAL TIME ZONE and instantiate a TimestampParser.
3. Refactor TimestampTZUtil to allow passing different
DateTimeFormatters.
4. Add tests covering timestamps with 3 different formats (built-in,
plus 2 more not covered by the default).
@zabetak zabetak changed the title Parse experiments for TIMESTAMP WITH LOCAL TIME ZONE data type HIVE-27199: Read TIMESTAMP WITH LOCAL TIME ZONE columns from text files using custom formats Mar 30, 2023
@zabetak zabetak marked this pull request as ready for review March 30, 2023 15:23
@jfsii
Copy link
Contributor

jfsii commented Mar 30, 2023

I took a look - nothing stands out to me, not even nits :) LGTM pending tests. This is a nice approach to handle the general problem.

Copy link
Member

@ayushtkn ayushtkn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sonarcloud
Copy link

sonarcloud bot commented Apr 17, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@zabetak zabetak closed this in f6ac8d4 Apr 18, 2023
@zabetak zabetak deleted the tzlocalformat branch April 18, 2023 08:30
henrib pushed a commit to henrib/hive that referenced this pull request Apr 24, 2023
…es using custom formats (Stamatis Zampetakis reviewed by Ayush Saxena, John Sherman, Attila Turóczy)

1. Support parsing TimestampTZ using the TimestampParser, which accepts
multiple DateTimeFormatters.
2. Pass timestamp.formats in Lazy inspector handling
TIMESTAMP WITH LOCAL TIME ZONE and instantiate a TimestampParser.
3. Refactor TimestampTZUtil to allow passing different
DateTimeFormatters.
4. Add tests covering timestamps with 3 different formats (built-in,
plus 2 more not covered by the default).

These changes give more flexibility to users reading timestamps from
text files and it also aligns the way TIMESTAMP and
TIMESTAMP WITH LOCAL TIME ZONE behave when a custom format is provided.

Closes apache#4170
yeahyung pushed a commit to yeahyung/hive that referenced this pull request Jul 20, 2023
…es using custom formats (Stamatis Zampetakis reviewed by Ayush Saxena, John Sherman, Attila Turóczy)

1. Support parsing TimestampTZ using the TimestampParser, which accepts
multiple DateTimeFormatters.
2. Pass timestamp.formats in Lazy inspector handling
TIMESTAMP WITH LOCAL TIME ZONE and instantiate a TimestampParser.
3. Refactor TimestampTZUtil to allow passing different
DateTimeFormatters.
4. Add tests covering timestamps with 3 different formats (built-in,
plus 2 more not covered by the default).

These changes give more flexibility to users reading timestamps from
text files and it also aligns the way TIMESTAMP and
TIMESTAMP WITH LOCAL TIME ZONE behave when a custom format is provided.

Closes apache#4170
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…es using custom formats (Stamatis Zampetakis reviewed by Ayush Saxena, John Sherman, Attila Turóczy)

1. Support parsing TimestampTZ using the TimestampParser, which accepts
multiple DateTimeFormatters.
2. Pass timestamp.formats in Lazy inspector handling
TIMESTAMP WITH LOCAL TIME ZONE and instantiate a TimestampParser.
3. Refactor TimestampTZUtil to allow passing different
DateTimeFormatters.
4. Add tests covering timestamps with 3 different formats (built-in,
plus 2 more not covered by the default).

These changes give more flexibility to users reading timestamps from
text files and it also aligns the way TIMESTAMP and
TIMESTAMP WITH LOCAL TIME ZONE behave when a custom format is provided.

Closes apache#4170
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants