New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-6016 - Fix for Error reading INT96 created by Apache Spark #1166
Conversation
@parthchandra @vdiravka would you please review this? |
@parthchandra @vdiravka I do not have a test case for this. I have manually verified the scenario with and without the patch. The sample input file is attached with https://issues.apache.org/jira/browse/DRILL-6016. |
@rajrahul, thanks for submitting the patch. It looks good. I guess we missed dictionary encoded int96 timestamps (even though timestamps with nanosecond precision are the one thing that should never, ever, be dictionary encoded). Just to make sure, I tried the use the sample file in DRILL-6016, but I could not even unzip it! Can you please check and see if the file is correct? WE can use that to create the unit test as well. |
@parthchandra please use the link https://github.com/rajrahul/files/raw/master/result.tar.gz |
@rajrahul this link is good. As expected, the int96 column is dictionary encoded. @vdiravka TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange also uses an int96 that is dictionary encoded. Any idea whether (and why) it might be going thru a different code path? |
@parthchandra I will create a unit test with few time stamp fields. |
@parthchandra I have compared meta of files from But I believe we have some mess in names of that column readers. Maybe to make some refactoring would be a good point. What do you think? For example to remove |
@parthchandra @vdiravka I have added the test case using the same parquet file(2.9k bytes). I tried creating a smaller file using Spark, but could not replicate the behavior. I have rebased the changes on the same commit and PR. |
The schema given below creates the issue, as @vdiravka pointed int96 is marked required here. This parquet was generated with an older version of spark and is included in the test case.
Newer spark version created the schema below where int96 has become optional.
|
+1. LGTM |
@vdiravka I have made the changes. Please have a look. |
@Test | ||
public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception { | ||
try { | ||
mockUtcDateTimeZone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work without @RunWith(JMockit.class)
.
Also please enable above test case testImpalaParquetBinaryAsTimeStamp_DictChange
with the same change. And be sure that tests pass in the other time zone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see two ways of doing this within the code itself.
- Mock and run with UTC, and compare the results in UTC as in TestCastFunctions#testToDateForTimeStamp. Since TestParquetWriter already has a RunWith annotation, we might have to create another class and move both the methods.
- Run with the JVM timezone(no mocking) and compare the results after a 'convertToLocalTimestamp' as in TestParquetWriter#testInt96TimeStampValueWidth
Approach 2 does not used fixed UTC timezone. Which approach do you suggest?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vdiravka your thoughts on comment above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that in TestParquetWriter
only one parameter is used - repeat
. I think you can replace Parameterized
running of this. test with simple variable.
Other approach - you can write programmatically using of JMockit.
But I prefer not to use mocks if possible. So try to use convertToLocalTimestamp
. By using it you can enable also testHiveParquetTimestampAsInt96_basic
test and testImpalaParquetBinaryAsTimeStamp_DictChange
with removing redundant rows.
@vdiravka I have made similar changes for testSparkParquetBinaryAsTimeStamp_DictChange, testHiveParquetTimestampAsInt96_basic and testImpalaParquetBinaryAsTimeStamp_DictChange. All tests are passing, please have a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address minor comments
.baselineColumns("int96_ts") | ||
.baselineValues(new DateTime(convertToLocalTimestamp("1970-01-01 00:00:01.000"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One baselineValue is enough. Please use where
in the query.
@@ -35,6 +36,7 @@ | |||
import java.util.Map; | |||
|
|||
import com.google.common.base.Joiner; | |||
import mockit.integration.junit4.JMockit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same
@@ -27,6 +27,7 @@ | |||
import java.math.BigDecimal; | |||
import java.nio.file.Paths; | |||
import java.sql.Date; | |||
import java.sql.Timestamp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused import?
@vdiravka Done. Please review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thank you for editing additional tests.
@@ -61,6 +60,7 @@ | |||
import org.junit.runners.Parameterized; | |||
|
|||
@RunWith(Parameterized.class) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually not required, tried to add another RunWith for Mocking and removed later on leaving the newline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, just remove it
public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception { | ||
try { | ||
testBuilder() | ||
.sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts") | ||
.sqlQuery("select min(int96_ts) date_value from dfs.`parquet/int96_dict_change`") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try WHERE statement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not try a WHERE statement, MIN was used to select a single record to compare. Was there any specific reason to use WHERE?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is just more obvious what result is expected. But using MIN is ok.
@vdiravka removed the extra line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@rajrahul thanks for making all the changes (and of course for the fix)! |
This fixes DRILL-6016 where drill was failing to read int96 generated by Apache Spark even after setting store.parquet.reader.int96_as_timestamp to true.