
[ORC-341] Support time zone as a parameter for Java reader and writer #249

Closed · wanted to merge 3 commits

Conversation

@jcamachor (Contributor)

No description provided.

@wgtmac (Member) commented Apr 16, 2018

Why do we need this change?

@jcamachor (Author)

@wgtmac , see discussion in https://issues.apache.org/jira/browse/HIVE-12192 for more context.

@wgtmac (Member) commented Apr 17, 2018

AFAIK, ORC doesn't have any issue in HIVE-12192. What ORC guarantees is that readers always get the same wall-clock time representation, without a timezone. The current Java implementation leverages java.sql.Timestamp, which uses the local timezone; that's why the writer and reader always use timestamp values in the local timezone. To adopt your change here, we would need a new TimestampColumnVector that enforces the UTC timezone.
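The dependence of java.sql.Timestamp on the JVM's default timezone is easy to demonstrate with plain JDK code. This minimal, self-contained sketch (not ORC code) renders the same epoch instant under two different default zones:

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class TimestampZoneDemo {
    // Render one epoch-millis instant under a given JVM default timezone.
    static String render(long epochMillis, String zoneId) {
        TimeZone saved = TimeZone.getDefault();
        try {
            TimeZone.setDefault(TimeZone.getTimeZone(zoneId));
            // Timestamp.toString() formats using the current default zone.
            return new Timestamp(epochMillis).toString();
        } finally {
            TimeZone.setDefault(saved); // restore the original default
        }
    }

    public static void main(String[] args) {
        long instant = 0L; // 1970-01-01T00:00:00Z
        System.out.println(render(instant, "UTC"));                 // 1970-01-01 00:00:00.0
        System.out.println(render(instant, "America/Los_Angeles")); // 1969-12-31 16:00:00.0
    }
}
```

This is why the wall-clock representation a writer records depends on where the JVM runs, unless both sides agree on a fixed zone such as UTC.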

@omalley (Contributor) commented Apr 17, 2018

@jcamachor I'd suggest a much simpler API:

  • Instead of passing in the reader timezone, make a boolean option to useUtcForTimestamp.
  • Extend TimestampColumnVector to have a boolean isUTC field.
  • The TimestampTreeWriter can use the isUTC in the ColumnVector to determine if it is in UTC.
  • The reader can set isUTC appropriately based on the option.

Thoughts?
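A minimal Java sketch of the flag those bullets propose (names follow the suggestion above; the real TimestampColumnVector lives in the storage-api module and carries the actual time arrays):

```java
// Illustrative sketch only: a column vector of epoch-millis timestamps
// with the proposed isUTC flag, mirroring the suggested API shape.
public class TimestampColumnVectorSketch {
    public final long[] time;      // epoch milliseconds per row
    private boolean isUTC = false; // default: values are in the local zone

    public TimestampColumnVectorSketch(int size) {
        this.time = new long[size];
    }

    // The reader would set this based on its useUTCTimestamp option.
    public void setIsUTC(boolean value) { this.isUTC = value; }

    // The TimestampTreeWriter would consult this to interpret the values.
    public boolean isUTC() { return isUTC; }
}
```

The key design point is that the flag travels with the data in the vector, so the writer never has to guess which zone the values are in.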

@omalley (Contributor) commented Apr 17, 2018

Also note that the C++ reader already uses UTC for its TimestampColumnVector. :)

@jcamachor (Author)

@omalley , it seems like a good idea, let me explore it and refresh the PR. I will adapt HIVE-19226 to these new changes too.

@wgtmac , do I understand correctly that you are suggesting this can be fixed from the Hive side only? The problem is that existing ORC files should still be read properly, hence you would need to distinguish old vs. new ORC files. In addition, you would apply the displacement twice when reading/writing: once in Hive and once in ORC. It seems to me the cleaner solution is being able to tell ORC from the Java reader/writer that the timestamp data is in UTC. FWIW, the change to stringify in TimestampColumnVector is indeed needed.

@wgtmac (Member) commented Apr 17, 2018

@jcamachor Yes, I just meant that ORC doesn't have problems dealing with timestamps itself. Using UTC everywhere definitely makes things much easier.

@jcamachor (Author)

@omalley , I have been trying to add the boolean useUTCTimestamp as suggested. Making it work with the reader/writer does not seem to be a problem, since I can pass the information through the context. However, we also create column vectors in the TypeDescription class, where we do not have any context information, just the type's string representation. Unless we pass the information through that representation, we cannot know the value of the boolean when we create the column vector there, and I do not think we want to go in that direction. Any ideas?

If we do not go in that direction, I thought I could change the current patch to use a boolean instead of the TimeZone itself (but without storing it).

Please, let me know what you think.

@wgtmac (Member) commented Apr 18, 2018

For the reader, can we set useUTCTimestamp in TimestampTreeReader::nextVector? For the writer, it is the caller's responsibility to set useUTCTimestamp before calling TimestampTreeWriter::writeBatch. Does this help? @jcamachor

@jcamachor (Author)

@wgtmac , thanks for the feedback. Please bear with me for a bit, as it is the first time I am touching the ORC code base.
OK, I think TypeDescription is not a problem then, since we set the value at the reader/writer, independently of the default used at creation time. For the reader, everything seems easy. However, the writer is a bit trickier, since the stripe footer stores the time zone information, so it should be set beforehand using, e.g., the context or options objects. Does that seem reasonable?

@jcamachor (Author)

Pushed a new commit with the changes.

We would still need a storage-api release for the TimestampColumnVector changes.

@wgtmac (Member) commented Apr 19, 2018

@jcamachor You are right. WriterOptions/WriterContext are ideal places to set this kind of values.

@@ -975,6 +992,8 @@ public void nextVector(ColumnVector previousVector,
TimestampColumnVector result = (TimestampColumnVector) previousVector;
super.nextVector(previousVector, isNull, batchSize);

// TODO: If context.isUseUTCTimestamp(), set TimestampColumnVector.useUTC to true
Member:

I think it is better to set storage-api to 3.0.0 and fix this TODO in this patch as well.

Contributor Author:

There is a vote going on for a storage-api release; next week I can rebase the patch to consume it, and hopefully we can check it in. Thanks @wgtmac !

@jcamachor (Author)

I have just updated the patch now that we have moved to the new storage-api version. I will run some tests with Hive locally asap and will get back confirming that everything is working as expected.

@jcamachor force-pushed the ORC-341 branch 2 times, most recently from 0f617ee to e52e79e on May 2, 2018, 21:20
@jcamachor (Author)

I have been testing the patch from Hive and everything seems to be working as expected.

I have rebased the patch and merged both commits. Also, I had to extend my changes to the newly created WriterImplV2.

@omalley , @wgtmac , could you take a final look and merge if it is OK? Thanks

@wgtmac (Member) left a comment:

@jcamachor Please see my comments.

// for unit tests to set different time zones
this.baseEpochSecsLocalTz = Timestamp.valueOf(BASE_TIMESTAMP_STRING).getTime() / MILLIS_PER_SECOND;
if (writer.isUseUTCTimestamp()) {
this.localTimezone = TimeZone.getTimeZone("UTC");
Member:

We'd better change its name to this.writeTimezone to avoid confusion in the future.
Same for localDateFormat and baseEpochSecsLocalTz below.

@@ -990,6 +1007,10 @@ public void nextVector(ColumnVector previousVector,
TimestampColumnVector result = (TimestampColumnVector) previousVector;
super.nextVector(previousVector, isNull, batchSize);

if (context.isUseUTCTimestamp()) {
result.setIsUTC(true);
Member:

result.setIsUTC(context.isUseUTCTimestamp());

Just in case result is in UTC but context.isUseUTCTimestamp() is false.

import java.sql.Timestamp;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TimestampTreeWriter extends TreeWriterBase {
Member:

We should also change the writeBatch function below.

The input vector.isUTC may be true while writer.isUseUTCTimestamp() is false, or vice versa. In these cases, we need to convert the values to the correct writer timezone.
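A rough sketch of the conversion this implies, assuming millisecond values and ignoring DST-boundary edge cases (names here are illustrative, not the actual TimestampTreeWriter code): shift the epoch value so its wall-clock reading is preserved when reinterpreted in the target zone.

```java
import java.util.TimeZone;

public class WallClockShift {
    // Shift epoch millis so the wall-clock reading in `to` matches the
    // wall-clock reading the same value had in `from`.
    static long shiftWallClock(long millis, TimeZone from, TimeZone to) {
        return millis + from.getOffset(millis) - to.getOffset(millis);
    }

    // Example: a value carried as UTC, re-based for a writer whose local
    // zone is America/Los_Angeles.
    static long demoShiftUtcToLA(long millis) {
        return shiftWallClock(millis,
                TimeZone.getTimeZone("UTC"),
                TimeZone.getTimeZone("America/Los_Angeles"));
    }
}
```

For epoch 0 (midnight UTC), the shift is +8 hours, so the adjusted value reads as midnight again when interpreted in Los Angeles.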

@omalley (Contributor) left a comment:

This is looking good, although a couple of points need to get fixed.

  • the TimestampColumnStatistics on the read side need to be fixed. I think the write side will just work.
  • As Gang pointed out, the writer needs to use the value in the ColumnVector to interpret the values.
  • Effectively the writer option is really about setting the writerTimezone to UTC.
  • This does mean that ORC readers older than Hive 1.2 will misinterpret timestamps from these files unless their local timezone is UTC.

return this;
}

public boolean isUseUTCTimestamp() {
Contributor:

This should be getUseUTCTimestamp.

@@ -320,6 +321,16 @@ public ReaderOptions fileMetadata(final FileMetadata metadata) {
public FileMetadata getFileMetadata() {
return fileMetadata;
}

public ReaderOptions useUTCTimestamp(boolean value) {
Contributor:

This should also cause the TimestampStatistics to use UTC.

@@ -761,6 +782,10 @@ public boolean getWriteVariableLengthBlocks() {
public HadoopShims getHadoopShims() {
return shims;
}

public boolean isUseUTCTimestamp() {
Contributor:

Rename this to getUseUTCTimestamp.

@@ -373,7 +379,11 @@ private void flushStripe() throws IOException {
OrcProto.StripeFooter.Builder builder =
OrcProto.StripeFooter.newBuilder();
if (writeTimeZone) {
builder.setWriterTimezone(TimeZone.getDefault().getID());
if (useUTCTimeZone) {
builder.setWriterTimezone(TimeZone.getTimeZone("UTC").getID());
Contributor:

I'd be tempted to just use setWriterTimezone("UTC"), because we'll already fail if UTC is called something else.
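In other words, the stripe-footer ID can be chosen directly; a tiny illustrative helper (not the actual WriterImpl code) capturing that choice:

```java
import java.util.TimeZone;

public class WriterTimezoneId {
    // Timezone ID to record in the stripe footer: the literal "UTC" when
    // writing UTC timestamps, otherwise the JVM default zone's ID.
    static String writerTimezone(boolean useUTC) {
        return useUTC ? "UTC" : TimeZone.getDefault().getID();
    }
}
```

TimeZone.getTimeZone("UTC").getID() returns "UTC" anyway, so the literal is equivalent and shorter.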

} else {
this.localTimezone = TimeZone.getDefault();
}
this.localDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Contributor:

It sucks that there isn't a simpler way in Java to do this.
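For reference, pinning a SimpleDateFormat to UTC is at least compact with setTimeZone; a self-contained sketch using the same pattern string as the snippet above:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class UtcFormatDemo {
    // Format an instant in UTC regardless of the JVM default zone.
    static String formatUtc(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(formatUtc(0L)); // 1970-01-01 00:00:00
    }
}
```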

@jcamachor (Author)

@wgtmac , @omalley , thanks for the feedback. I think I have addressed all your points with the last two commits; could you take another look? Thanks

@asfgit asfgit closed this in aa790d4 May 7, 2018
morazow added a commit to exasol/cloud-storage-extension that referenced this pull request Jan 28, 2021
The Parquet and Avro formats store a long value (the number of milliseconds since epoch) as the timestamp value, so no timezone information is encoded when writing.

However, when we import the files using Exasol UDFs, we construct a Java `java.sql.Timestamp` object from the milliseconds since epoch and emit it into the table. At that moment, the Java Timestamp uses the JVM timezone, which is usually `Europe/Berlin` for the Exasol docker container, for instance. You can check the Exasol system timezone using `DBTIMEZONE` (`SELECT DBTIMEZONE`).

This introduces a difference in integration tests: timestamps written in UTC are returned in the Europe/Berlin timezone. Therefore, we shift the expected values to the database timezone.

This issue does not occur in the ORC format because it also encodes the timezone information in the file, so that ORC readers use that timezone (apache/orc#249).
morazow added a commit to exasol/cloud-storage-extension that referenced this pull request Feb 1, 2021
## Added data importer integration tests

## Fixed timestamp integration tests


## Added nested suites with single docker container stack