Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-6016 - Fix for Error reading INT96 created by Apache Spark #1166

Closed
wants to merge 1 commit into from

Conversation

rajrahul
Copy link
Contributor

This fixes DRILL-6016 where drill was failing to read int96 generated by Apache Spark even after setting store.parquet.reader.int96_as_timestamp to true.

@priteshm
Copy link

priteshm commented Mar 14, 2018

@parthchandra @vdiravka would you please review this?

@rajrahul
Copy link
Contributor Author

@parthchandra @vdiravka I do not have a test case for this. I have manually verified the scenario with and without the patch. The sample input file is attached with https://issues.apache.org/jira/browse/DRILL-6016.

@parthchandra
Copy link
Contributor

parthchandra commented Mar 14, 2018

@rajrahul, thanks for submitting the patch. It looks good. I guess we missed dictionary encoded int96 timestamps (even though timestamps with nanosecond precision are the one thing that should never, ever, be dictionary encoded).

Just to make sure, I tried the use the sample file in DRILL-6016, but I could not even unzip it! Can you please check and see if the file is correct? WE can use that to create the unit test as well.

@rajrahul
Copy link
Contributor Author

@parthchandra please use the link https://github.com/rajrahul/files/raw/master/result.tar.gz
The files are present inside result/parquet/latest.

@parthchandra
Copy link
Contributor

@rajrahul this link is good. As expected, the int96 column is dictionary encoded.
Is it possible for you to extract just a couple of records from this file and then use that for a unit test?
see TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange

@vdiravka TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange also uses an int96 that is dictionary encoded. Any idea whether (and why) it might be going thru a different code path?

@rajrahul
Copy link
Contributor Author

@parthchandra I will create a unit test with few time stamp fields.

@vdiravka
Copy link
Member

@parthchandra I have compared meta of files from TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange and the meta from Rahul's dataset and found that test case indeed makes a query from two parquet files: one is dictionary encoded and other isn't. But the dataMode of column is Optional, that's why Nullable column reader is used.
Rahul's dataset contains required mode for INT96 column. This is a difference. Therefore other non-nullable column reader is necessary.

But I believe we have some mess in names of that column readers. Maybe to make some refactoring would be a good point. What do you think? For example to remove Dictionary prefixes from nested classes, but to leave it for top class name.

@rajrahul
Copy link
Contributor Author

@parthchandra @vdiravka I have added the test case using the same parquet file(2.9k bytes). I tried creating a smaller file using Spark, but could not replicate the behavior. I have rebased the changes on the same commit and PR.

@rajrahul
Copy link
Contributor Author

The schema given below creates the issue, as @vdiravka pointed int96 is marked required here. This parquet was generated with an older version of spark and is included in the test case.

message spark_schema {
  optional binary article_no (UTF8);
  optional binary qty (UTF8);
  required int96 run_date;
}

Newer spark version created the schema below where int96 has become optional.

message spark_schema {
  optional binary country (UTF8);
  optional double sales;
  optional int96 targetDate;
}

@parthchandra
Copy link
Contributor

+1. LGTM

vdiravka pushed a commit to vdiravka/drill that referenced this pull request Mar 24, 2018
@vdiravka
Copy link
Member

@rajrahul Unit test from your PR relies on particular timezone similar to TestParquetWriter.testImpalaParquetBinaryAsTimeStamp_DictChange.

Could you please edit test case for working within any time zone?
Please see this PR #904 for more details.

@rajrahul
Copy link
Contributor Author

@vdiravka I have made the changes. Please have a look.

@Test
public void testSparkParquetBinaryAsTimeStamp_DictChange() throws Exception {
try {
mockUtcDateTimeZone();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't work without @RunWith(JMockit.class).
Also please enable above test case testImpalaParquetBinaryAsTimeStamp_DictChange with the same change. And be sure that tests pass in the other time zone.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could see two ways of doing this within the code itself.

  1. Mock and run with UTC, and compare the results in UTC as in TestCastFunctions#testToDateForTimeStamp. Since TestParquetWriter already has a RunWith annotation, we might have to create another class and move both the methods.
  2. Run with the JVM timezone(no mocking) and compare the results after a 'convertToLocalTimestamp' as in TestParquetWriter#testInt96TimeStampValueWidth

Approach 2 does not used fixed UTC timezone. Which approach do you suggest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vdiravka your thoughts on comment above?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that in TestParquetWriter only one parameter is used - repeat. I think you can replace Parameterized running of this. test with simple variable.
Other approach - you can write programmatically using of JMockit.

But I prefer not to use mocks if possible. So try to use convertToLocalTimestamp. By using it you can enable also testHiveParquetTimestampAsInt96_basic test and testImpalaParquetBinaryAsTimeStamp_DictChange with removing redundant rows.

@rajrahul
Copy link
Contributor Author

@vdiravka I have made similar changes for testSparkParquetBinaryAsTimeStamp_DictChange, testHiveParquetTimestampAsInt96_basic and testImpalaParquetBinaryAsTimeStamp_DictChange. All tests are passing, please have a look.

Copy link
Member

@vdiravka vdiravka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address minor comments

.baselineColumns("int96_ts")
.baselineValues(new DateTime(convertToLocalTimestamp("1970-01-01 00:00:01.000")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One baselineValue is enough. Please use where in the query.

@@ -35,6 +36,7 @@
import java.util.Map;

import com.google.common.base.Joiner;
import mockit.integration.junit4.JMockit;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same

@@ -27,6 +27,7 @@
import java.math.BigDecimal;
import java.nio.file.Paths;
import java.sql.Date;
import java.sql.Timestamp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused import?

@rajrahul
Copy link
Contributor Author

@vdiravka Done. Please review.

Copy link
Member

@vdiravka vdiravka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thank you for editing additional tests.

@@ -61,6 +60,7 @@
import org.junit.runners.Parameterized;

@RunWith(Parameterized.class)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually not required, tried to add another RunWith for Mocking and removed later on leaving the newline.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, just remove it

public void testImpalaParquetBinaryAsTimeStamp_DictChange() throws Exception {
try {
testBuilder()
.sqlQuery("select int96_ts from dfs.`parquet/int96_dict_change` order by int96_ts")
.sqlQuery("select min(int96_ts) date_value from dfs.`parquet/int96_dict_change`")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try WHERE statement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not try a WHERE statement, MIN was used to select a single record to compare. Was there any specific reason to use WHERE?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just more obvious what result is expected. But using MIN is ok.

@rajrahul
Copy link
Contributor Author

rajrahul commented Apr 2, 2018

@vdiravka removed the extra line.

Copy link
Member

@vdiravka vdiravka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@parthchandra
Copy link
Contributor

@rajrahul thanks for making all the changes (and of course for the fix)!

@asfgit asfgit closed this in 127e415 Apr 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants