
DRILL-8037: Add V2 JSON Format Plugin based on EVF #2364

Merged
merged 3 commits into from Apr 27, 2022

Conversation


@vdiravka vdiravka commented Nov 5, 2021

DRILL-8037: Add V2 JSON Format Plugin based on EVF

Description

This adds a new beta "V2" JSON format plugin based on the "Extended Vector Framework" (EVF).
It is a follow-up to DRILL-6953 (which was closed with the decision to merge it in small pieces).
So it is based on #1913 and the rev2 work.

Documentation

"V1" - the "legacy" reader.
"V2" - the new beta JSON format plugin based on the result set loader.
The V2 version is a bit more robust and supports the row set framework. However, only V1 supports unions and reading corrupted JSON files.

The new "V2" JSON scan is controlled by a new option:
store.json.enable_v2_reader, which is true by default in this PR.

It adds a "projection type" to the column writer so that the JSON parser can receive a "hint" as to the expected type. The hint comes from the form of the projected column: a[0], a.b, or just a.
It also supports schema provisioning. Example:

ALTER SESSION SET `store.json.enable_v2_reader` = true;
apache drill (dfs.tmp)> select * from test;
+---------------+-------+---+
|       a       |   e   | f |
+---------------+-------+---+
| {"b":1,"c":1} | false | 1 |
| {"b":1,"c":1} | null  | 2 |
| {"b":1,"c":1} | true  | 3 |
+---------------+-------+---+
apache drill (dfs.tmp)> create or replace schema (`e` BOOLEAN default 'false', `new` VARCHAR not null default 'schema evolution') for table test;
apache drill (dfs.tmp)> select * from test;
+-------+------------------+---------------+---+
|   e   |       new        |       a       | f |
+-------+------------------+---------------+---+
| false | schema evolution | {"b":1,"c":1} | 1 |
| null  | schema evolution | {"b":1,"c":1} | 2 |
| true  | schema evolution | {"b":1,"c":1} | 3 |
+-------+------------------+---------------+---+

Testing

Many of the existing test cases run against both readers. This is needed as long as V1 is still present in the Drill code.


vdiravka commented Nov 5, 2021

@paul-rogers I've cherry-picked the commits from your rev2 branch. I didn't take all of them, because of big code conflicts, and some of them are outdated.
Draft until all tests from rev2 are enabled.

@vdiravka vdiravka self-assigned this Nov 5, 2021
@vdiravka vdiravka added the json JSON Format label Nov 5, 2021
@vdiravka vdiravka force-pushed the DRILL-8037 branch 4 times, most recently from c8a31ec to 6ccb92f Compare November 7, 2021 09:38
@paul-rogers
Contributor

@vdiravka, thanks for doing this! It will take me a bit to remember where I left this. I'll do the review in small bits.

@paul-rogers
Contributor

@vdiravka, turns out there was one more commit that I had failed to push to DRILL-6953-rev2. Just did that. This commit contains tests for the extended types and for a provided schema. Please incorporate those additional changes so we have a complete set of work.

Contributor

@paul-rogers paul-rogers left a comment

Thanks again for merging this code across. I'm sure it was not an easy task.

I went though this PR, comparing it to a diff of my original branch with its base. Looks like you got most of the changes. I only found a few areas, noted below, where there were differences.

Please also grab that last commit I just pushed and I'll do a final review.

I saw that you had made a few good cleanups in doing the merge, so that means you probably took a good look at the code. I'll assume that counts as a review of my changes, and I'm just reviewing the merge (so that I'm not approving my own changes!)

.fsConf(fsConf)
.defaultName(PLUGIN_NAME)
.readerOperatorType(READER_OPERATOR_TYPE)
.writerOperatorType(WRITER_OPERATOR_TYPE)
Contributor

Add the following:

   // Temporary until V2 is the default.

Member Author

It is enabled by default. So what is temporary here?

Contributor

Sorry, I saw this in the PR description:

The new "V2" JSON scan is controlled by a new option:
store.json.enable_v2_reader, which is false by default in this PR.

Which I thought meant that the V2 reader is not enabled by default, hence the suggested comment.

@@ -692,6 +692,7 @@ drill.exec.options: {
# Property name and value should be separated by =.
# Properties should be separated by new line (\n).
store.hive.conf.properties: "",
store.json.enable_v2_reader: true,
Contributor

Do we want to use the V2 reader by default? I think this should be false for now.

Member Author

Yes, we want it enabled by default.

Contributor

If so, then please change the PR description quoted above.

}
finally {
String set = "alter session set `"
+ ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG + "` = false";
+ ExecConstants.JSON_READER_SKIP_INVALID_RECORDS_FLAG + "` = false";
Contributor

There are some changes here and above that need to be copied over.

Member Author

What changes?

Contributor

The code uses a method to do the alter session rather than building up a string. See this file.
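As a sketch of what such a helper might look like (the class and method names here are hypothetical, not Drill's actual test-fixture API), the string-building can be centralized like this:

```java
// Hypothetical helper that centralizes ALTER SESSION statement building,
// instead of concatenating the SQL string at every call site.
// Names are illustrative only, not Drill's real test-fixture API.
public class SessionOptionSketch {

  // Formats an ALTER SESSION statement for a given option key and value.
  public static String alterSession(String key, Object value) {
    return String.format("ALTER SESSION SET `%s` = %s", key, value);
  }

  // Formats the corresponding reset statement.
  public static String resetSession(String key) {
    return String.format("ALTER SESSION RESET `%s`", key);
  }

  public static void main(String[] args) {
    // Prints: ALTER SESSION SET `store.json.enable_v2_reader` = true
    System.out.println(alterSession("store.json.enable_v2_reader", true));
  }
}
```

The benefit is that tests set and reset options through one code path rather than repeating quoting and concatenation logic.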

import org.junit.BeforeClass;
import org.junit.Test;

public class TestExtendedTypes extends BaseTestQuery {

Contributor

There is some cleanup in TestComplexTypeWriter.java to be copied across.

Member Author

Spaces? Could you point me to the cleanup you mean?

Contributor

Interesting. Seems OK now. Maybe I read the diff backwards...

@@ -78,10 +78,10 @@ public void testMongoExtendedTypes() throws Exception {
List<QueryDataBatch> resultList = testSqlWithResults(String.format("select * from dfs.`%s`", originalFile));
String actual = getResultString(resultList, ",");
String expected = "drill_timestamp_millies,bin,bin1\n2015-07-07 03:59:43.488,drill,drill\n";
Assert.assertEquals(expected, actual);
assertEquals(expected, actual);
Contributor

This file has some extended type tests to be copied over.

Member Author

Do you mean to add test cases for every extended type from the file?

Contributor

Seems fine now.

private void resetV2Reader() {
client.resetSession(ExecConstants.ENABLE_V2_JSON_READER_KEY);
}

Contributor

Changes in TestJsonNanInf.java need to be copied over.

Member Author

What changes?

Member Author

@vdiravka vdiravka left a comment

Hi @paul-rogers, I am going to enable V2 by default. All tests passed or were fixed, except the cases with the options for UNION and corrupted JSON files enabled, but those features are experimental, so we are fine to go without them for now.

So here are my additional changes, rebased onto the latest master and including all your commits from rev2.


lgtm-com bot commented Dec 3, 2021

This pull request fixes 2 alerts when merging 42863f4 into ecdb8db - view on LGTM.com

fixed alerts:

  • 2 for Result of multiplication cast to wider type

@paul-rogers
Contributor

Sorry, I perhaps read the diffs backward before: the things I thought were missing seem actually fine.

Just to double-check, there were some additional complex type handling and tests in this commit that would be useful to add.


vdiravka commented Dec 3, 2021

Sorry, I perhaps read the diffs backward before: the things I thought were missing seem actually fine.

Just to double-check, there were some additional complex type handling and tests in this commit that would be useful to add.

Hi Paul. I double-checked, and it looks like the changes from that commit are already incorporated. I made a cherry-pick and resolved 3 merge conflicts, and after that there were no new changes.

The other question: do you think it is useful to have a dfs plugin config to switch to the V1 JSON reader? I think that since the goal is to switch to V2 fully, the system/session option is enough.

@paul-rogers
Contributor

@vdiravka, the commit in question was something I pushed about a week ago: it was some last changes that were still on my local machine that I'd not yet pushed. Sorry about that! They were not on GitHub when you grabbed the original commits. For example, the new file UtcDateValueListener.java which I don't see in the list of files for this PR. Can you grab this commit also?


vdiravka commented Dec 4, 2021

@vdiravka, the commit in question was something I pushed about a week ago: it was some last changes that were still on my local machine that I'd not yet pushed. Sorry about that! They were not on GitHub when you grabbed the original commits. For example, the new file UtcDateValueListener.java which I don't see in the list of files for this PR. Can you grab this commit also?

For passing session options in tests I have created the new methods enableSessionOption and disableSessionOption, based on building the optionSettingQueriesForTestQuery string in TestBuilder.

I see you've already committed these changes and the file in DRILL-7717: Support Mongo extended types in V2 JSON loader #2068
[Screenshot attached: 2021-12-04 14-53-08]

Contributor

@paul-rogers paul-rogers left a comment

@vdiravka, thanks for the explanations. Hard to remember stuff that happened over a year ago...

The changes look good. Thanks much for getting this merged!

LGTM. +1

@vdiravka vdiravka marked this pull request as ready for review December 6, 2021 16:27

vdiravka commented Dec 7, 2021

I found one regression for the LATERAL and UNNEST operators with V2. A schema change starts to happen, but it shouldn't.
TestE2EUnnestAndLateral#testMultipleBatchesLateral_WithGroupByInParent shows it after reverting the changes in HashAggTemplate. Working on a fix now.

@luocooong
Member

@vdiravka Will the fix be included in this PR or in a new separate PR?


vdiravka commented Dec 7, 2021

I'll try to prepare the fix today in this PR. If it takes longer, it is fine to include it in a separate PR.

@jnturton jnturton marked this pull request as draft January 3, 2022 07:49

paul-rogers commented Feb 2, 2022

@vdiravka reached out to me on this bug. His explanation:

The issue is that the schema is changed for the second batch, and this is reported by SchemaTracker#isSameSchema.
I suppose the V1 reader just compared the field types, but the new one compares vectors by identity. So the schema change is the right thing for that test case. But Drill's hash aggregate doesn't support schema changes.

Here is my analysis:

You are right that the SchemaTracker is the thing in EVF which looks for schema changes. The rule is pretty simple: if the kind of vector differs from the previous batch, then a schema change occurred. This is based, in part, on the observation that the sort operator can't handle any schema changes at all: not INT/BIGINT, not NOT NULL to NULL. Sort combines vectors, and if they are of different types, the values won't fit together. The same is probably true of the hash agg.

OK, so we agree that a type change is bad. What you're describing is the same type, but a different vector. This is a bug in several ways, and not the way you might think.

First, it seems logical that if column x is a non-null INT in one batch, and is a non-null INT in the next batch, that no schema change has occurred. But, Drill will surprise you. Most operators assume that the vector itself has not changed: only the data stored within the vector. This is a subtle point, so let me expand it.

A vector is a permanent holder of a column. The vector holds one or more data buffers. Those buffers change from batch to batch. Think of it as a bucket brigade: the old-style way that people fought fires. A chain of people forms (the vectors). They hand buckets one to the next (the data buffers).

Now, I was pretty surprised when I first discovered this. So was Boaz when he wrote hash agg. But, most of Drill's code assumes that, in setup, the code binds to a specific vector. In the next batch, that same binding is valid; only the data changes. This shows up in code gen and other places. In fact, a goodly part of EVF is concerned with this need for "vector continuity" even across readers. (Let that sink in: two readers that have nothing to do with each other must somehow still share common vectors. That alone should tell us the vector design in Drill is a bit strange and needs rethinking. But, that's another topic.)
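The "bucket brigade" idea can be shown in miniature with a toy class (this Vector is a simplification for illustration, not Drill's actual ValueVector):

```java
// Toy model of "vector continuity": a downstream operator binds to the
// vector object once, and each new batch swaps only the data buffer held
// inside it. This is an illustration, not Drill's actual ValueVector.
public class VectorContinuity {

  static class Vector {
    int[] buffer; // the "bucket" handed along for each batch

    void loadBatch(int[] newData) {
      this.buffer = newData; // same vector object, new buffer
    }
  }

  public static void main(String[] args) {
    Vector v = new Vector();
    Vector binding = v; // downstream operator binds once, at setup time

    v.loadBatch(new int[] {1, 2, 3});
    System.out.println(binding.buffer[0]); // reads batch 1 via the old binding

    v.loadBatch(new int[] {4, 5, 6});
    System.out.println(binding.buffer[0]); // the same binding sees batch 2
  }
}
```

The point is that the binding made at setup stays valid across batches precisely because the vector object never changes; only its contents do.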

So, now let's review the bug. On the one hand, SchemaTracker has noticed the vectors changed, and is (correctly) telling the downstream operators that they must redo their bindings to the new vector.

But, on the other hand, SchemaTracker has been presented with two different vectors for the same column. That should never happen unless there is an actual type change (non-null to null, INT to BIGINT, etc.) There are elaborate mechanisms to reuse vectors from one reader to the next.

So, the first thing to check is if the types are at all different. (If the "major types" differ.) If so, then we have a legitimate schema change that Drill can't handle.

Now, if you find that the types are the same, then we have a bug in EVF, specifically in the ResultVectorCacheImpl class, since that's the thing which is supposed to do the magic.
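The rule described above — same name, type, and mode means no schema change — can be modeled with a small sketch (these classes are simplified stand-ins, not Drill's actual MaterializedField or SchemaTracker):

```java
import java.util.List;

// Simplified stand-in for Drill's column metadata (not the real
// MaterializedField): a schema change is flagged when a column's type
// or nullability ("mode") differs between batches.
public class SchemaSketch {

  enum Mode { REQUIRED, NULLABLE, REPEATED }

  static class Column {
    final String name;
    final String type;
    final Mode mode;

    Column(String name, String type, Mode mode) {
      this.name = name;
      this.type = type;
      this.mode = mode;
    }
  }

  // Two columns carry the same "major type" iff type and mode both match.
  static boolean sameMajorType(Column a, Column b) {
    return a.type.equals(b.type) && a.mode == b.mode;
  }

  // A batch schema is unchanged iff every column matches by name and major type.
  static boolean isSameSchema(List<Column> prev, List<Column> next) {
    if (prev.size() != next.size()) {
      return false;
    }
    for (int i = 0; i < prev.size(); i++) {
      if (!prev.get(i).name.equals(next.get(i).name)
          || !sameMajorType(prev.get(i), next.get(i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Column intReq = new Column("x", "INT", Mode.REQUIRED);
    Column intNull = new Column("x", "INT", Mode.NULLABLE);
    // NOT NULL -> NULLABLE counts as a schema change, per the rule above.
    System.out.println(isSameSchema(List.of(intReq), List.of(intNull))); // false
  }
}
```

In the bug under discussion the major types match, so a check like this would report "same schema" while the vector identities still differ — which is exactly the inconsistency being diagnosed.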

Let me ask another question. When this occurs, is it in the first batch of the second (or later) reader? If so, then that reader should have pulled the vector from the cache. The question is, why didn't it?

If the error occurs within the first reader, then the bug is more subtle. How could the JSON reader accomplish that? The ResultSetLoader hides all that detail, and it defers to the ResultVectorCacheImpl for vector continuity.

Yet another question is why these tests didn't fail 3 months ago when the tests first ran. And, why did the tests pass way back when I wrote this code? Did anything change elsewhere that could trigger this? I don't see how, but we should probably ask the question.

Still another possibility is that there has always been a schema change, the prior code failed to point out the schema change (by returning an OK_NEW_SCHEMA status), yet somehow the hash agg handled that case. If so, then there is a whole chain of bugs. That would not surprise me: I had to fix bug after bug after bug to get the EVF stuff working: often someone broke something in one place to work around something broken in another. When I fixed one, the other failed when it should not have. Loads of fun.

Fortunately, these are unit tests, so we should be able to sort out the issues without too much trouble.

I'll also note that there were a few fixes to EVF V2 that I made in my PR, but those were around a subtle bug in implicit columns.

Suggestion: give the above a shot to see if you can see the problem. Otherwise, tomorrow I'll try to squeeze in time to grab this branch and do some debugging. After all, I wrote this stuff originally, so I should try to make it work.

@paul-rogers
Contributor

Oddly, TestE2EUnnestAndLateral works just fine in Eclipse on my machine with this branch. Is that the correct test which was failing? Also, I see in the panel here in GitHub that all tests passed. Can you explain the issue a bit more?


lgtm-com bot commented Feb 4, 2022

This pull request fixes 2 alerts when merging 2fdec26 into 556b972 - view on LGTM.com

fixed alerts:

  • 2 for Result of multiplication cast to wider type


vdiravka commented Feb 4, 2022

Hi @paul-rogers, I have rebased the branch onto the master branch, and in a separate new commit removed the hack which hid the schema change in the HashAggTemplate (actually, one row is missing in the query result; the test case just doesn't check for it).
Thanks for the explanation of how vectors work; it helped me. It is clear now that the schema is changing because the RepeatedMapVector can't be obtained from the cache:

      // Don't get the map vector from the vector cache. Map vectors may
      // have content that varies from batch to batch. Only the leaf
      // vectors can be cached.

Obtaining the vector from the cache here leads to errors in this and other test cases:
mapVector = (RepeatedMapVector) parent.vectorCache().vectorFor(mapColSchema.schema());

org.apache.drill.common.exceptions.UserRemoteException: EXECUTION_ERROR ERROR: null

Read failed for reader: JsonBatchReader
....
Caused by: java.lang.AssertionError: 
	at org.apache.drill.exec.physical.resultSet.impl.TupleState$MapState.addOutputColumn(TupleState.java:475)
	at org.apache.drill.exec.physical.resultSet.impl.ColumnState.buildOutput(ColumnState.java:321)
	at org.apache.drill.exec.physical.resultSet.impl.TupleState.updateOutput(TupleState.java:206)
	at org.apache.drill.exec.physical.resultSet.impl.TupleState.updateOutput(TupleState.java:217)
	at org.apache.drill.exec.physical.resultSet.impl.TupleState$RowState.updateOutput(TupleState.java:430)
	at org.apache.drill.exec.physical.resultSet.impl.ResultSetLoaderImpl.harvest(ResultSetLoaderImpl.java:716)

So it looks to me like we need either to implement schema change support for the hash agg operator or to obtain the RepeatedMapVector from the cache. I lean towards the latter. What do you think?


lgtm-com bot commented Feb 5, 2022

This pull request fixes 2 alerts when merging 9eb445a into 556b972 - view on LGTM.com

fixed alerts:

  • 2 for Result of multiplication cast to wider type

@paul-rogers
Contributor

@vdiravka, good sleuthing! You did indeed find the hole in the system. Map (and repeated map) vectors are special: they are just holders for the actual data vectors. If they are reused, we get all the previous map members, which may or may not be a problem. I guess it would be a problem if reader 1 has a.b be an INT, while reader 2 wants a.b to be a VARCHAR.

I guess a question is whether the HashAgg maintains a pointer to the map itself, or only to the physical columns within it. If it has no pointers to the map itself, we can special-case maps: they are considered the same schema if their contents are the same, whether or not the map vector itself is the same.

Another choice would be to store the map vector in the cache, but strip all the physical columns out of it when the caller asks for it again. The caller then reassembles the physical columns, also from the cache, and hopefully creates the same map structure as the previous reader.

Based on what you learned of the HashAgg, which of these might work?
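The first option — treating two map columns as the same schema when their member lists match, regardless of map-vector identity — might look like this in outline (a simplified sketch, not Drill's RepeatedMapVector handling):

```java
// Toy model of "compare maps by content": two map columns are considered
// schema-equal when their member name -> type entries match, even if the
// map vector objects themselves differ. Simplified for illustration only.
import java.util.Map;

public class MapSchemaSketch {

  // Member name -> member type, e.g. {"b" -> "INT", "c" -> "INT"}.
  public static boolean sameMapSchema(Map<String, String> prevMembers,
                                      Map<String, String> nextMembers) {
    return prevMembers.equals(nextMembers);
  }

  public static void main(String[] args) {
    // Same members: no schema change, even across distinct map vectors.
    System.out.println(sameMapSchema(
        Map.of("b", "INT", "c", "INT"),
        Map.of("b", "INT", "c", "INT"))); // true
    // Reader 2 wants a.b as VARCHAR: a real schema change.
    System.out.println(sameMapSchema(
        Map.of("b", "INT"),
        Map.of("b", "VARCHAR"))); // false
  }
}
```

This captures the a.b INT vs. VARCHAR case mentioned above: identical member maps pass, a changed member type fails.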

@paul-rogers
Contributor

@vdiravka, which test do I run to see the failure? I tried that one test mentioned above, but it worked for me.

Actually, I should probably create a separate unit test for this scenario. I'll do that to see if I can reproduce the issue. If so, then I'll see if I can find a fix, perhaps one of the ideas mentioned above.

Then, we can validate the fix on that test case you have which is failing.

@vdiravka vdiravka force-pushed the DRILL-8037 branch 2 times, most recently from 950984e to 4e2c483 Compare April 12, 2022 00:14

lgtm-com bot commented Apr 12, 2022

This pull request fixes 2 alerts when merging 4e2c483 into 634ffa2 - view on LGTM.com

fixed alerts:

  • 2 for Result of multiplication cast to wider type

* Snapshot: reworked JSON field parser creation
* Updated JSON loader
* Redo value listener with tokens
* Extended long type works
* More simple extended types
* Added $date
* Binary type
* All extended type except arrays
* Extended arrays partly working
* More arrays work
* Refactor element parser interfaces
* Rename RowSetTests --> RowSetTest
* More factory cleanup
* Revised unknown field creation
* In middle of factory/parser restructuring
* Scalars, object, some variants work again
* JSON loader tests pass
* File cleanup
* Old extended types test passes
* Renamed JSON packages
* Tested extended provided types
@vdiravka
Member Author

@paul-rogers
I have rebased this PR onto the master branch and fixed DRILL-8195. So there are no new changes after the last review (except the "rebase to master" commit and DRILL-8195; the "rebase to master" commit will be squashed into the previous one after review).

Additionally, DRILL-8196, DRILL-8197 and DRILL-8201 have been opened and will be resolved in separate PRs soon.

@vdiravka vdiravka marked this pull request as ready for review April 26, 2022 14:29

lgtm-com bot commented Apr 26, 2022

This pull request fixes 3 alerts when merging 8fd294a into 5080424 - view on LGTM.com

fixed alerts:

  • 3 for Result of multiplication cast to wider type

Contributor

@paul-rogers paul-rogers left a comment

@vdiravka, thanks again for merging the code.
LGTM +1

* Enable store.json.enable_v2_reader by default
* Fix TestJsonReader double quotes test cases. Update jackson 2.12.1 -> 2.13.0
* Disable V2 for experimental UNION datatype
* Fix regressions
* Fix JSON schema provision (it wasn't provided for JsonLoaderBuilder). The previous schema provision was a fake; the reader schema was inferred from the JSON content. This fixes the scan and reader schema validation, and it starts to apply the provided schema to ANALYZE commands; fixed TestMetastoreWithEasyFormatPlugin#testAnalyzeOnJsonTable
@vdiravka
Member Author

Thanks for the review @paul-rogers
I'll fixup (squash) the "Rebase to the master branch" commit now and will merge the PR after CI passes.

@vdiravka vdiravka merged commit ead453c into apache:master Apr 27, 2022

lgtm-com bot commented Apr 27, 2022

This pull request fixes 3 alerts when merging 05b2933 into 7c37323 - view on LGTM.com

fixed alerts:

  • 3 for Result of multiplication cast to wider type
