DRILL-6016 - Fix for Error reading INT96 created by Apache Spark #1166

rajrahul · 2018-03-14T06:43:26Z

This fixes DRILL-6016, where Drill failed to read INT96 values generated by Apache Spark even after store.parquet.reader.int96_as_timestamp was set to true.

priteshm · 2018-03-14T06:59:40Z

@parthchandra @vdiravka would you please review this?

rajrahul · 2018-03-14T07:37:07Z

@parthchandra @vdiravka I do not have a test case for this. I have manually verified the scenario with and without the patch. The sample input file is attached to https://issues.apache.org/jira/browse/DRILL-6016.

parthchandra · 2018-03-14T09:57:02Z

@rajrahul, thanks for submitting the patch. It looks good. I guess we missed dictionary encoded int96 timestamps (even though timestamps with nanosecond precision are the one thing that should never, ever, be dictionary encoded).

Just to make sure, I tried to use the sample file in DRILL-6016, but I could not even unzip it! Can you please check and see if the file is correct? We can use that to create the unit test as well.
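
To illustrate the aside about dictionary encoding: a dictionary only pays off when values repeat. The sketch below is a generic, simplified dictionary encoder, not Parquet's or Drill's implementation, and the class and method names are hypothetical. It shows that a column of distinct nanosecond timestamps produces a dictionary as large as the data itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Parquet's or Drill's encoder.
class DictionaryEncodingSketch {

    /** Distinct values in first-seen order, plus one dictionary index per input value. */
    static final class Encoded {
        final List<Long> dictionary = new ArrayList<>();
        final List<Integer> indexes = new ArrayList<>();
    }

    static Encoded encode(long[] values) {
        Encoded out = new Encoded();
        Map<Long, Integer> positions = new HashMap<>();
        for (long value : values) {
            Integer index = positions.get(value);
            if (index == null) {
                // First time this value is seen: append it to the dictionary.
                index = out.dictionary.size();
                positions.put(value, index);
                out.dictionary.add(value);
            }
            out.indexes.add(index);
        }
        return out;
    }

    public static void main(String[] args) {
        // Repeated values: the dictionary stays small.
        Encoded repeated = encode(new long[] {5L, 5L, 5L, 7L});
        System.out.println("repeated: " + repeated.dictionary.size() + " entries");

        // Distinct nanosecond timestamps: one dictionary entry per value.
        long[] nanos = new long[1000];
        for (int i = 0; i < nanos.length; i++) {
            nanos[i] = 1_521_000_000_000_000_000L + i;
        }
        System.out.println("distinct: " + encode(nanos).dictionary.size() + " entries");
    }
}
```

With distinct values the encoded form stores every value once in the dictionary and then an index per row on top of that, so nothing is saved.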

rajrahul · 2018-03-14T10:44:57Z

@parthchandra please use the link https://github.com/rajrahul/files/raw/master/result.tar.gz
The files are present inside result/parquet/latest.

parthchandra · 2018-03-14T11:56:21Z

@rajrahul this link is good. As expected, the int96 column is dictionary encoded.
Is it possible for you to extract just a couple of records from this file and then use that for a unit test?
See TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange.

@vdiravka TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange also uses an int96 that is dictionary encoded. Any idea whether (and why) it might be going through a different code path?

rajrahul · 2018-03-14T12:36:43Z

@parthchandra I will create a unit test with a few timestamp fields.

vdiravka · 2018-03-14T14:18:22Z

@parthchandra I compared the metadata of the files used by TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange with the metadata of Rahul's dataset. The test case does query two parquet files, one dictionary encoded and the other not, but the dataMode of the column is Optional, which is why the Nullable column reader is used.
Rahul's dataset declares the INT96 column as required. That is the difference, and it is why a separate non-nullable column reader is necessary.
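
The selection described above can be sketched as a function of two properties of the column: its repetition (required or optional) and whether its pages are dictionary encoded. This is a minimal illustration under stated assumptions; the enum, the method, and the reader names it returns are hypothetical and are not Drill's actual classes.

```java
// Hypothetical illustration only; the names below are not Drill's reader classes.
class ColumnReaderChoiceSketch {

    enum Repetition { REQUIRED, OPTIONAL }

    /**
     * Returns a label for the reader that would handle an INT96 column.
     * Each of the four combinations needs its own code path; per the thread,
     * the test file exercised the optional (nullable) path, while the Spark
     * file needed the required (non-nullable) one.
     */
    static String chooseInt96Reader(Repetition repetition, boolean dictionaryEncoded) {
        String nullability = repetition == Repetition.OPTIONAL ? "Nullable" : "NonNullable";
        String encoding = dictionaryEncoded ? "Dictionary" : "Plain";
        return nullability + encoding + "Int96Reader";
    }

    public static void main(String[] args) {
        // Existing test file: optional column, dictionary encoded.
        System.out.println(chooseInt96Reader(Repetition.OPTIONAL, true));
        // Rahul's Spark file: required column, dictionary encoded.
        System.out.println(chooseInt96Reader(Repetition.REQUIRED, true));
    }
}
```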

But I believe the names of those column readers are a bit of a mess, and some refactoring might be worthwhile. What do you think? For example, we could remove the Dictionary prefix from the nested classes but keep it in the top-level class name.

rajrahul · 2018-03-15T10:03:52Z

@parthchandra @vdiravka I have added the test case using the same parquet file (2.9k bytes). I tried creating a smaller file using Spark but could not replicate the behavior. I have rebased the changes on the same commit and PR.

rajrahul · 2018-03-16T04:12:02Z

The schema below triggers the issue: as @vdiravka pointed out, int96 is marked required here. This parquet file was generated with an older version of Spark and is included in the test case.

message spark_schema {
  optional binary article_no (UTF8);
  optional binary qty (UTF8);
  required int96 run_date;
}

A newer Spark version produced the schema below, where int96 has become optional.

message spark_schema {
  optional binary country (UTF8);
  optional double sales;
  optional int96 targetDate;
}
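
For context, here is a hedged sketch of how an INT96 timestamp value is commonly laid out by writers such as Spark and Impala: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian. The class and method names are hypothetical, and this is not Drill's reader code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical illustration only; not Drill's INT96 reader.
class Int96TimestampSketch {

    // Julian day number of 1970-01-01, the Unix epoch.
    static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;
    static final long SECONDS_PER_DAY = 86_400L;
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Decodes a 12-byte INT96 value into a UTC instant. */
    static Instant decode(byte[] int96) {
        if (int96.length != 12) {
            throw new IllegalArgumentException("INT96 must be exactly 12 bytes");
        }
        ByteBuffer buffer = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buffer.getLong(); // first 8 bytes
        long julianDay = buffer.getInt();   // last 4 bytes
        long epochDay = julianDay - JULIAN_DAY_OF_EPOCH;
        return Instant.ofEpochSecond(epochDay * SECONDS_PER_DAY, nanosOfDay);
    }

    /** Inverse of decode, included so the round trip can be checked. */
    static byte[] encode(Instant instant) {
        long epochDay = Math.floorDiv(instant.getEpochSecond(), SECONDS_PER_DAY);
        long secondOfDay = Math.floorMod(instant.getEpochSecond(), SECONDS_PER_DAY);
        long nanosOfDay = secondOfDay * NANOS_PER_SECOND + instant.getNano();
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt((int) (epochDay + JULIAN_DAY_OF_EPOCH))
                .array();
    }

    public static void main(String[] args) {
        // One second after the epoch, the value used in the test baseline.
        byte[] raw = encode(Instant.ofEpochSecond(1));
        System.out.println(decode(raw)); // prints 1970-01-01T00:00:01Z
    }
}
```

The decoded value is an instant in UTC; how it is rendered as a local timestamp depends on the reader's time zone handling.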

parthchandra · 2018-03-21T09:54:36Z

+1. LGTM

closes apache#1166

vdiravka · 2018-03-24T18:22:27Z

@rajrahul The unit test from your PR relies on a particular time zone, similar to TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange.

Could you please edit the test case so that it works in any time zone?
Please see PR #904 for more details.

rajrahul · 2018-03-26T12:40:48Z

@vdiravka I have made the changes. Please have a look.

vdiravka · 2018-03-26T16:27:14Z

exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java

+  @Test
+  public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
+    try {
+      mockUtcDateTimeZone();


It doesn't work without @RunWith(JMockit.class).
Also, please enable the test case above, testImpalaParquetBinaryAsTimeStamp_DictChange, with the same change, and make sure the tests pass in other time zones.

rajrahul · 2018-03-27

I can see two ways of doing this within the code itself.

1. Mock and run with UTC, and compare the results in UTC, as in TestCastFunctions#testToDateForTimeStamp. Since TestParquetWriter already has a RunWith annotation, we might have to create another class and move both methods there.

2. Run with the JVM time zone (no mocking) and compare the results after a 'convertToLocalTimestamp', as in TestParquetWriter#testInt96TimeStampValueWidth.

Approach 2 does not use a fixed UTC time zone. Which approach do you suggest?

rajrahul · 2018-03-29

@vdiravka, your thoughts on the comment above?

vdiravka · 2018-03-29

I see that TestParquetWriter uses only one parameter, repeat. I think you can replace the Parameterized runner for this test with a simple variable.
Another approach is to use JMockit programmatically.

But I prefer not to use mocks if possible, so try convertToLocalTimestamp. With it you can also enable the testHiveParquetTimestampAsInt96_basic and testImpalaParquetBinaryAsTimeStamp_DictChange tests, removing the redundant rows.
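
The no-mock approach discussed above amounts to shifting the expected UTC value into whatever zone the test happens to run in, so the baseline matches without pinning the JVM to UTC. The sketch below is a stand-in written with java.time under that assumption; it is not Drill's convertToLocalTimestamp, and the class and method names are hypothetical.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in; Drill's own helper may behave differently.
class LocalTimestampSketch {

    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");

    /** Treats the text as a UTC wall-clock time and returns the same instant's wall-clock time in the given zone. */
    static LocalDateTime toZone(String utcText, ZoneId zone) {
        LocalDateTime utc = LocalDateTime.parse(utcText, FORMAT);
        return utc.atOffset(ZoneOffset.UTC).atZoneSameInstant(zone).toLocalDateTime();
    }

    public static void main(String[] args) {
        String expectedUtc = "1970-01-01 00:00:01.000";
        // In a test, the zone would be ZoneId.systemDefault(); fixed offsets are used here for clarity.
        System.out.println(toZone(expectedUtc, ZoneOffset.UTC));        // 1970-01-01T00:00:01
        System.out.println(toZone(expectedUtc, ZoneOffset.ofHours(2))); // 1970-01-01T02:00:01
    }
}
```

Because the expected value moves with the zone, the same assertion holds on any machine, which is what makes mocking the time zone unnecessary.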

rajrahul · 2018-03-29T13:23:19Z

@vdiravka I have made similar changes for testSparkParquetBinaryAsTimeStamp_DictChange, testHiveParquetTimestampAsInt96_basic and testImpalaParquetBinaryAsTimeStamp_DictChange. All tests are passing, please have a look.

vdiravka

Please address minor comments

vdiravka · 2018-03-29T14:19:16Z

exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java

          .baselineColumns("int96_ts")
+          .baselineValues(new DateTime(convertToLocalTimestamp("1970-01-01 00:00:01.000")))


One baselineValue is enough. Please use a WHERE clause in the query.

vdiravka · 2018-03-29T14:20:27Z

exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java

@@ -35,6 +36,7 @@
 import java.util.Map;

 import com.google.common.base.Joiner;
+import mockit.integration.junit4.JMockit;


vdiravka · 2018-03-29T14:24:55Z

exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java

@@ -27,6 +27,7 @@
 import java.math.BigDecimal;
 import java.nio.file.Paths;
 import java.sql.Date;
+import java.sql.Timestamp;


unused import?

rajrahul · 2018-03-30T04:39:04Z

@vdiravka Done. Please review.

vdiravka

+1, thank you for editing additional tests.

vdiravka · 2018-03-30T08:41:35Z

exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java

@@ -61,6 +60,7 @@
 import org.junit.runners.Parameterized;

 @RunWith(Parameterized.class)
+


rajrahul · 2018-03-30

Actually, it is not required. I tried adding another RunWith for mocking and removed it later, leaving the newline behind.

vdiravka · 2018-04-01

OK, just remove it.

vdiravka · 2018-03-30T08:43:11Z

exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/writer/TestParquetWriter.java

  public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
    try {
      testBuilder()
-          .sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
+          .sqlQuery("select min(int96_ts) date_value from dfs.`parquet/int96_dict_change`")


Did you try a WHERE clause?

rajrahul · 2018-03-30

I did not try a WHERE clause; MIN was used to select a single record to compare. Is there a specific reason to use WHERE?

vdiravka · 2018-04-01

It just makes the expected result more obvious. But using MIN is OK.

rajrahul · 2018-04-02T05:25:27Z

@vdiravka removed the extra line.

vdiravka

+1

parthchandra · 2018-04-02T09:53:56Z

@rajrahul thanks for making all the changes (and of course for the fix)!

rajrahul force-pushed the DRILL-6016 branch from 6d74884 to 9b69bf1 Compare March 15, 2018 09:18

vdiravka pushed a commit to vdiravka/drill that referenced this pull request Mar 24, 2018

DRILL-6016 - Fix for Error reading INT96 created by Apache Spark

88b9768

closes apache#1166

rajrahul force-pushed the DRILL-6016 branch from 9b69bf1 to b74769f Compare March 26, 2018 11:35

vdiravka requested changes Mar 26, 2018

rajrahul force-pushed the DRILL-6016 branch from b74769f to 37e2078 Compare March 29, 2018 12:36

vdiravka requested changes Mar 29, 2018

rajrahul force-pushed the DRILL-6016 branch from 37e2078 to 1e28d4d Compare March 30, 2018 03:51

vdiravka reviewed Mar 30, 2018

DRILL-6016 - Fix for Error reading INT96 created by Apache Spark

2750f7f

rajrahul force-pushed the DRILL-6016 branch from 1e28d4d to 2750f7f Compare April 2, 2018 04:38

vdiravka approved these changes Apr 2, 2018

asfgit closed this in 127e415 Apr 7, 2018
