DRILL-8188: Convert HDF5 format to EVF2 #2515
Conversation
LGTM +1. I'll wait for @paul-rogers review for the final word on the other changes, but the HDF5 reader looks good to me.
@luocooong, thanks for doing this conversion. I'm glad to see you are getting to be an expert in this part of Drill!
The review has a few minor comments and a few questions. Other than that, nice work!
// Case for datasets of greater than 2D
// These are automatically flattened
buildSchemaFor2DimensionalDataset(dataSet);
{ // Opens an HDF5 file
Is the nested block necessary? It is handy when we want to reuse variable names (as sometimes occurs in tests), but is it needed here?
Interesting thing: when we use EVF2, the batch reader only needs a constructor to start all initialization operations. But we end up writing a lot of code in that constructor: initializing the final variables, opening files, defining the schema, and so on. So I used the code blocks to group these actions and make them easier to read.
I guess some of these could become private methods but it's a minor point for me.
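To illustrate the suggestion, here is a minimal sketch (hypothetical class and method names, not the actual HDF5BatchReader code) of the same constructor-only initialization with each bare `{ ... }` block promoted to a named private method:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an EVF2-style batch reader, where the
// constructor drives all initialization work.
class HypotheticalBatchReader {
  private String filePath;          // stands in for the opened HDF5 file
  private List<String> columnNames; // stands in for the defined schema

  HypotheticalBatchReader(String path) {
    openFile(path);   // was: { // Opens an HDF5 file ... }
    defineSchema();   // was: { // Defines the schema ... }
  }

  // Each private method replaces one anonymous code block, giving the
  // grouped actions a name instead of just a brace pair.
  private void openFile(String path) {
    this.filePath = path;
  }

  private void defineSchema() {
    this.columnNames = Arrays.asList("path", "data_type", "file_name");
  }

  String filePath() { return filePath; }
  List<String> columnNames() { return columnNames; }
}
```

Both styles behave identically; the method names simply document the grouping that the bare blocks expressed implicitly.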
contrib/format-hdf5/src/main/java/org/apache/drill/exec/store/hdf5/HDF5BatchReader.java
@@ -173,7 +173,7 @@ public ColumnHandle insert(int posn, ColumnMetadata col) {
   }

   public ColumnHandle insert(ColumnMetadata col) {
-    return insert(insertPoint++, col);
+    return insert(insertPoint == -1 ? size() : insertPoint++, col);
What was the bug here? I'm wondering if there is something else broken, and the above is a workaround for that other bug.
When I fixed the previous error, I received a new one: if we dynamically create a repeated list field in next() (which succeeds), that field may be assigned an insert index of -1.
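The one-line fix can be modeled with a toy class (not the actual Drill TupleSchema code): when insertPoint is -1, no insert position has been established, so the column is appended at the end instead of being inserted at index -1, which would throw.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the fixed insert() logic. An insertPoint of -1 means
// "no implicit insert position yet"; the fix falls back to appending.
class ToySchema {
  private final List<String> columns = new ArrayList<>();
  private int insertPoint = -1; // -1 until an insert position is established

  void insert(int posn, String col) {
    // List.add(int, E) throws IndexOutOfBoundsException for posn == -1,
    // which is the failure the dynamic repeated-list column triggered.
    columns.add(posn, col);
  }

  // The fix: use size() (append at end) whenever insertPoint is -1.
  void insert(String col) {
    insert(insertPoint == -1 ? columns.size() : insertPoint++, col);
  }

  List<String> columns() { return columns; }
}
```

With the guard in place, columns discovered at read time simply append; without it, the first dynamically created column would fail immediately.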
@@ -189,7 +189,7 @@ private void insertColumn(ColumnMetadata col) {
   switch (mode) {
     case FIRST_READER_SCHEMA:
     case READER_SCHEMA:
-      if (schema.projectionType() != ProjectionType.ALL) {
+      if (schema.projectionType() != ProjectionType.ALL && !col.isArray()) {
Why is an array column special here? I think this is trying to say that, if the array is not projected, it should not have been created. There are dummy structures used instead. This fix suggests that there is a bug somewhere other than here.
Let's look at this case:
- The batch reader receives the projected columns from SELECT path, data_type, file_name FROM ... in the constructor.
- The schema is [TupleSchema [ProjectedColumn [path (LATE:REQUIRED)]], [ProjectedColumn [data_type (LATE:REQUIRED)]], [ProjectedColumn [file_name (LATE:REQUIRED)]]], so ProjectionType = 'SOME'.
- The batch reader creates a new repeated list column in next().
- The projection is applied to the schema:
  ColumnHandle existing = schema.find(colSchema.name());
  if (existing == null) {
    insertColumn(colSchema);
  }
- An IllegalStateException is caught and rethrown.
- The reader fails to read the format data.

In EVF1, creating repeated list fields dynamically is allowed and does not throw such exceptions.
This error check is correct: the problem must be in the ResultSetLoader. This error says that the ResultSetLoader is trying to add a materialized column, but that column is not projected. The result set loader should have created a dummy column and not passed the column along to this mechanism.
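The materialized-versus-dummy decision being described can be sketched with a toy class (hypothetical names, not Drill's actual ResultSetLoader): a projected column, or any column under a wildcard, gets a real writer; an unprojected column gets a dummy writer whose values are silently discarded, so the schema mechanism should never see it.

```java
import java.util.Set;

// Toy model of the projection decision for a newly discovered column.
class ToyProjection {
  private final boolean wildcard;        // SELECT *
  private final Set<String> projected;   // explicit SELECT a, b, c

  ToyProjection(boolean wildcard, Set<String> projected) {
    this.wildcard = wildcard;
    this.projected = projected;
  }

  // Returns which kind of column writer the loader should create.
  String writerFor(String colName) {
    if (wildcard || projected.contains(colName)) {
      return "materialized"; // real vector + real column writer
    }
    return "dummy";          // writes accepted but discarded
  }
}
```

Under this model, the repeated list column created in next() for a SELECT path, data_type, file_name query should have been routed to a dummy writer rather than reaching insertColumn().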
.addRepeatedList("int_list")
  .addArray(MinorType.INT)
.resumeSchema()
.addRepeatedList("long_list")
Thanks for testing this. The two messiest vector types in Drill are the repeated map and the repeated list. The repeated list has many, many problems. It isn't even well defined in SQL, since its types can change from row to row.

I wonder: why do we need a repeated list for this reader? Because we want a 2D array? What would a Drill user do with a 2D array?
These tests verify the above fix. Would it be better to split them into a new test class?
.addNullable(DATASET_DATA_TYPE_NAME, MinorType.VARCHAR)
.addNullable(DIMENSIONS_FIELD_NAME, MinorType.VARCHAR);

negotiator.tableSchema(builder.buildSchema(), false);
In this path we are telling EVF2 the schema to use. The false argument says we are free to add columns later.
It would be nice to have the GitHub comment above this one as a comment above line 214 in the code.
Dataset dataSet = hdfFile.getDatasetByPath(readerConfig.defaultPath);
dimensions = dataSet.getDimensions();

loader = negotiator.build();
In this path, we did not tell EVF2 about the schema. This means EVF2 expects us to discover columns as we go along. Is this the intent? The comment can be read as saying either "Drill can obtain the schema NOW" or "Drill can obtain the schema LATER".
But, it is odd, within the same reader, to provide a schema down one path, and not in the other path. It almost seems that there are two distinct readers in this case.
This PR is getting a bit complex with the bug or two that it uncovered. Let me explain a bit about how EVF2 works.

There are two cases: wildcard projection (SELECT *) and explicit projection (SELECT a, b, c). The way EVF2 works is different in these two cases. Then, for each reader, there are three other cases. The reader might know all its columns before the file is even opened. The PCAP reader is an example: all PCAP files have the same schema, so we don't need to look at the file to know the schema. The second case is files where we can learn the schema when opening the file. Parquet and CSV are examples: we can learn the Parquet schema from the file metadata, and the CSV schema from the headers. The last case is where we don't know the schema until we read each row. JSON is the best example. So now we have six cases to consider. This is why EVF2 is so complex!

For the wildcard, EVF2 "discovers" columns as the reader creates them: either via the up-front schema, or as the reader reads data. In JSON, for example, we can discover a new column at any time. Once a column is added, EVF2 will automatically fill in null values if values are missing. In the extreme case, it can fill in nulls for an entire batch. Because of the wildcard, all discovered columns are materialized and added to the result set. If we are reading JSON, and a column does not appear until the third batch, then the first two batches won't contain that column, but the third batch will have a schema change and will include the column. This can cause a problem for operators such as joins, sorts, or aggregations that have to store a collection of rows; not all of them can handle a schema change.

Now, for the explicit projection case, EVF2 knows what columns the user wants: those in the list. EVF2 waits as long as it can, hoping the reader will provide those columns. Again, the reader can provide them up front, before the first record, or as the read proceeds (as in JSON).

As the reader provides each column, EVF2 has to decide: do we need that column? If so, we create a vector and a column writer: we materialize the column. If the column is not needed, EVF2 creates a dummy column writer.

Now we get to another interesting part. What if we guessed, say, Varchar, but the column later shows up as a JSON array? We're stuck: we can't go back and redo the old batches. We end up with a "hard" schema change. Bad things happen unless the query is really simple. This is the fun of Drill's schemaless system.

With that background, we can try to answer your question. The answer is: it depends. If the reader says, "Hey, Mr. EVF2, here is the full schema I will read; I promise not to discover more columns", then EVF2 will throw an exception if later you say, "Ha! Just kidding. Actually, I discovered another one." I wonder if that's what is happening here. If, however, the reader left the schema open, and said, "here are the columns I know about now, but I might find more later", then EVF2 will expect more columns and will handle them as above: materialize them if they are projected or if we have a wildcard, provide a dummy writer if we have explicit projection and the column is not projected.

In this PR, we have two separate cases in the reader constructor. This means the reader is doing two entirely different things. In the … This seems horribly complicated! I wonder, are we missing logic in the …
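The open-versus-closed schema contract described above can be sketched with a toy class (hypothetical names, not EVF2's real API; the boolean mirrors the second argument to negotiator.tableSchema(schema, isComplete)): a reader that declared its schema complete may not discover new columns later.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the negotiated-schema contract: "complete" means the
// reader promised not to discover any more columns after construction.
class ToyNegotiatedSchema {
  private final List<String> columns = new ArrayList<>();
  private final boolean complete;

  ToyNegotiatedSchema(List<String> upFront, boolean complete) {
    columns.addAll(upFront);
    this.complete = complete;
  }

  // Called when the reader discovers a column while reading rows.
  void discoverColumn(String name) {
    if (complete) {
      // Mirrors EVF2 rejecting a column after a "full schema" promise.
      throw new IllegalStateException(
          "Schema was declared complete; cannot add column: " + name);
    }
    columns.add(name);
  }

  int columnCount() { return columns.size(); }
}
```

In this model, passing false (schema left open) lets next()-time discovery succeed, while passing true turns the same discovery into an IllegalStateException, which matches the failure mode discussed in this thread.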
Found the bug. It is in …
@paul-rogers Great! Thank you for the quick work.

Hi @luocooong Thank you for this PR. Where are we in terms of getting it merged?

Hi @cgivre Thank you for paying attention to this PR.

Converted to draft to prevent merging.

Hey @luocooong @paul-rogers I hope all is well. I wanted to check in on this PR to see where we are. At this point, nearly all the other format plugins have been converted to EVF V2. The other outstanding ones are the image format and LTSV. I'd really like to see this merged so that we can remove the EVF V1 code. Do you think we could get this ready to go soon?

I think I hosed the version control somehow.... This PR should only modify a few files in the HDF5 reader.

It seems you did this work on top of the master with my unsquashed commits. When you try to push, those commits come along for the ride. I think you should grab the latest master, then rebase your branch on it. Plan B is to a) grab the latest master, and b) create a new branch that cherry-picks the commit(s) you meant to add. If even this doesn't work, then I'll clean up this branch for you since I created the mess in the first place...

@paul-rogers I attempted to fix. I kind of suck at git, so I think it's more or less correct now, but there was probably a better way to do this.

I think you still want something like …

I see Git's "patch contents already upstream" feature doesn't automatically clean up the unwanted commits. I've dropped them manually in a new branch in my fork and now suggest …

@jnturton I did as you suggested. Would you mind please taking a look?

Just working through the review comments that @paul-rogers left (the ones unrelated to the needed functionality that was missing from EVF2).

Did the recent EVF revisions allow the tests for this PR to pass? Is there anything that is still missing? Also, did the excitement over my botched merge settle down and are we good now?

All the unit tests pass.... whether that means that everything is working.... this plugin has a decent amount of tests, so I'd feel pretty good.
}

private HDF5DataWriter buildWriter(MinorType dataType) {
  switch (dataType) {
    /*case GENERIC_OBJECT:
      return new HDF5EnumDataWriter(hdfFile, writerSpec, readerConfig.defaultPath);*/
    /* case GENERIC_OBJECT:
I think we should have an explanatory comment if we're going to keep commented out code.
DRILL-8188: Convert HDF5 format to EVF2
Description
Use EVF V2 instead of the old V1.
Included two bug fixes in the V2 framework:
1. Projected an unprojected column error in an array object
2. IndexOutOfBoundsException when adding a column

Note: these items were fixed in the context of DRILL-8375.
Documentation
N/A
Testing