[ORC-341] Support time zone as a parameter for Java reader and writer #249
Conversation
Why do we need this change?
@wgtmac, see the discussion in https://issues.apache.org/jira/browse/HIVE-12192 for more context.
AFAIK, I don't think ORC has any issue in HIVE-12192. What ORC guarantees is that we should always get the same wall-clock time representation without a timezone. The current Java implementation leverages java.sql.Timestamp, which uses the local timezone, and that's why the writer and reader always use timestamp values in the local timezone. We would need to add a new TimestampColumnVector that enforces the UTC timezone to adopt your change here.
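To make the wall-clock point concrete, here is a quick illustration (not from the patch) of how java.sql.Timestamp ties the Java path to the JVM default timezone: the same wall-clock string maps to different epoch instants depending on the zone.

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class WallClockDemo {
  public static void main(String[] args) {
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
    long utcMillis = Timestamp.valueOf("2015-01-01 00:00:00").getTime();
    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
    long laMillis = Timestamp.valueOf("2015-01-01 00:00:00").getTime();
    // Same wall clock, different instants: eight hours apart in January.
    System.out.println((laMillis - utcMillis) / 3_600_000L); // 8
  }
}
```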
@jcamachor I'd suggest a much simpler API:
Thoughts?
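The body of the suggestion appears to have been lost from this quote. Judging from the options that land later in the thread, it was presumably along these lines; this is a sketch reconstructed from the later diffs, not the verbatim proposal.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;

public class UtcOptionSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    TypeDescription schema = TypeDescription.fromString("struct<ts:timestamp>");

    // Writer side: declare that timestamps in incoming batches are in UTC.
    OrcFile.WriterOptions writerOptions = OrcFile.writerOptions(conf)
        .setSchema(schema)
        .useUTCTimestamp(true);

    // Reader side: ask for TimestampColumnVectors flagged as UTC.
    OrcFile.ReaderOptions readerOptions = OrcFile.readerOptions(conf)
        .useUTCTimestamp(true);
  }
}
```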
Also note that the C++ reader already uses UTC for its TimestampColumnVector. :) |
@omalley, it seems like a good idea; let me explore it and refresh the PR. I will adapt HIVE-19226 to these new changes too. @wgtmac, I understand you are suggesting that this can be fixed only from the Hive side? The problem is that existing ORC files should still be read properly, so you would need to distinguish old from new ORC files. In addition, you would apply the displacement twice when reading/writing: once in Hive and once in ORC. It seems to me the cleaner solution is simply being able to tell ORC from the Java reader/writer that the timestamp data is in UTC. FWIW, the change to stringify in TimestampColumnVector is indeed needed.
@jcamachor Yes, I just meant that ORC doesn't have problems dealing with timestamps itself. Using UTC everywhere definitely makes things way easier.
@omalley, I have been trying to add the Boolean […]. If we do not go in that direction, I thought I could change the current patch to use a […]. Please let me know what you think.
For the reader, can we set useUTCTimestamp in TimestampTreeReader::nextVector? For the writer, it is the caller's responsibility to set useUTCTimestamp before calling TimestampTreeWriter::writeBatch. Does this help? @jcamachor
@wgtmac, thanks for the feedback. Please bear with me for a bit, as it is the first time I am touching the ORC code base.
Pushed a new commit with the changes. We would still need a storage-api release for the […].
@jcamachor You are right. WriterOptions/WriterContext are ideal places to set this kind of value.
@@ -975,6 +992,8 @@ public void nextVector(ColumnVector previousVector,
  TimestampColumnVector result = (TimestampColumnVector) previousVector;
  super.nextVector(previousVector, isNull, batchSize);

  // TODO: If context.isUseUTCTimestamp(), set TimestampColumnVector.useUTC to true
I think it is better to set storage-api to 3.0.0 and fix this TODO in this patch as well.
There is a vote going on for a storage-api release; next week I can rebase the patch to consume it, and hopefully we can check it in. Thanks @wgtmac!
I have just updated the patch now that we have moved to the new storage-api version. I will run some tests with Hive locally ASAP and will get back confirming that everything is working as expected.
@jcamachor Please see my comments.
  // for unit tests to set different time zones
  this.baseEpochSecsLocalTz = Timestamp.valueOf(BASE_TIMESTAMP_STRING).getTime() / MILLIS_PER_SECOND;
  if (writer.isUseUTCTimestamp()) {
    this.localTimezone = TimeZone.getTimeZone("UTC");
We'd better change its name to this.writeTimezone to avoid confusion in the future.
Same for localDateFormat and baseEpochSecsLocalTz below.
@@ -990,6 +1007,10 @@ public void nextVector(ColumnVector previousVector,
  TimestampColumnVector result = (TimestampColumnVector) previousVector;
  super.nextVector(previousVector, isNull, batchSize);

  if (context.isUseUTCTimestamp()) {
    result.setIsUTC(true);
result.setIsUTC(context.isUseUTCTimestamp());
Just in case result is in UTC but context.isUseUTCTimestamp() is false.
import java.sql.Timestamp;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class TimestampTreeWriter extends TreeWriterBase {
We should also change the writeBatch function below.
The input vector.isUTC may be true while writer.isUseUTCTimestamp() is false, or vice versa. In either case, we need to convert the values to the correct writer timezone.
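A hedged sketch of that reconciliation (the helper name and wiring are illustrative, not the patch's actual code): shift the epoch millis so the wall-clock reading is preserved across the two zones.

```java
import java.util.TimeZone;

// Illustrative only: the helper name and wiring are not the patch's code.
public class WallClockShiftSketch {

  // Shift epoch millis so that the wall-clock reading under `from`
  // becomes the same wall-clock reading under `to`. A production
  // version would need extra care around DST transitions, where the
  // offset at the shifted instant can differ from the offset used here.
  static long shiftWallClock(long millis, TimeZone from, TimeZone to) {
    return millis + from.getOffset(millis) - to.getOffset(millis);
  }

  public static void main(String[] args) {
    TimeZone utc = TimeZone.getTimeZone("UTC");
    TimeZone berlin = TimeZone.getTimeZone("Europe/Berlin");
    long millis = 1527854400000L; // 2018-06-01 12:00:00 UTC
    // Re-express the value so a Berlin-based reading shows 12:00:00:
    System.out.println(shiftWallClock(millis, utc, berlin)); // 1527847200000
  }
}
```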
This is looking good, although a couple of points need to be fixed:
- The TimestampColumnStatistics on the read side need to be fixed. I think the write side will just work.
- As Gang pointed out, the writer needs to use the value in the ColumnVector to interpret the values.
- Effectively, the writer option is really about setting the writerTimezone to UTC.
- This does mean that ORC readers older than Hive 1.2 will misinterpret timestamps from these files unless their local timezone is UTC.
    return this;
  }

  public boolean isUseUTCTimestamp() {
This should be getUseUTCTimestamp.
@@ -320,6 +321,16 @@ public ReaderOptions fileMetadata(final FileMetadata metadata) {
  public FileMetadata getFileMetadata() {
    return fileMetadata;
  }

  public ReaderOptions useUTCTimestamp(boolean value) {
This should also cause the TimestampStatistics to use UTC.
@@ -761,6 +782,10 @@ public boolean getWriteVariableLengthBlocks() {
  public HadoopShims getHadoopShims() {
    return shims;
  }

  public boolean isUseUTCTimestamp() {
Rename this to getUseUTCTimestamp.
@@ -373,7 +379,11 @@ private void flushStripe() throws IOException {
   OrcProto.StripeFooter.Builder builder =
       OrcProto.StripeFooter.newBuilder();
   if (writeTimeZone) {
-    builder.setWriterTimezone(TimeZone.getDefault().getID());
+    if (useUTCTimeZone) {
+      builder.setWriterTimezone(TimeZone.getTimeZone("UTC").getID());
I'd be tempted to just use setWriterTimezone("UTC"), because we'll already fail if UTC is called something else.
  } else {
    this.localTimezone = TimeZone.getDefault();
  }
  this.localDateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
It sucks that there isn't a simpler way in Java to do this.
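For what it's worth, java.time (Java 8+) is the simpler way nowadays; below is a sketch of equivalent formatting, assuming the writer could move off SimpleDateFormat. Unlike SimpleDateFormat, DateTimeFormatter is immutable and thread-safe.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class TimestampFormatSketch {
  // Same pattern as the SimpleDateFormat in the diff above.
  private static final DateTimeFormatter FMT =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

  static String format(long epochMillis, boolean useUtc) {
    ZoneId zone = useUtc ? ZoneOffset.UTC : ZoneId.systemDefault();
    return Instant.ofEpochMilli(epochMillis).atZone(zone).format(FMT);
  }

  public static void main(String[] args) {
    System.out.println(format(0L, true)); // 1970-01-01 00:00:00
  }
}
```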
The Parquet and Avro formats store timestamp values as long values (the number of milliseconds since the epoch), so no timezone information is encoded when writing. However, when we import the files using Exasol UDFs, we construct a Java `java.sql.Timestamp` object from the milliseconds since the epoch and emit it into the table. At this moment, the Java Timestamp uses the JVM timezone, which is usually `Europe/Berlin` for the Exasol docker container, for instance. You can check the Exasol system timezone using `DBTIMEZONE` (`SELECT DBTIMEZONE`). This introduces a difference in the integration tests: timestamps written in UTC are returned in the Europe/Berlin timezone. Therefore, we shift the expected values to the database timezone. This issue does not occur in the Orc format because it also encodes the timezone information in the file, so that Orc readers use that timezone (apache/orc#249).
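A minimal sketch of the behavior described above: the same epoch millis renders as a different wall clock depending on the JVM default timezone (switched here at runtime purely for demonstration).

```java
import java.sql.Timestamp;
import java.util.TimeZone;

public class JvmTimezoneDemo {
  public static void main(String[] args) {
    long epochMillis = 0L; // 1970-01-01T00:00:00Z
    TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
    System.out.println(new Timestamp(epochMillis)); // 1970-01-01 00:00:00.0
    TimeZone.setDefault(TimeZone.getTimeZone("Europe/Berlin"));
    System.out.println(new Timestamp(epochMillis)); // 1970-01-01 01:00:00.0
  }
}
```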