DRILL-7503: Refactor the project operator #1944
Conversation
Force-pushed from d561f2e to 9287c92.
Hello @paul-rogers, I'm really happy that you started refactoring ProjectRecordBatch. I also looked at this class with tears in my eyes :'( What I see in the PR is the creation of some kind of VectorState mediator between the old ProjectRecordBatch and the new ProjectionMaterializer. Unfortunately, this split does not bring much clarity to the code, and ProjectRecordBatch still needs to be redesigned. Could you please invest more effort in order to finally design a mature solution for this operator? From my side, I'll also take a closer look at the code and share my ideas soon.
private enum WidthType {
  FIXED,
  VARIABLE
}
After a close look at the code, I believe this enum is unnecessary. All usages of the ColumnWidthInfo constructor pass WidthType.VARIABLE, and there is this block inside the update method:

if (columnWidthInfo.isFixedWidth()) {
  // fixed width columns are accumulated in totalFixedWidthColumnWidth
  ShouldNotReachHere();
} else {...}

Please remove the enum and the related code.
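For illustration, a minimal Java sketch of the shape this could take once the enum is removed; the class and field names below (other than ColumnWidthInfo and WidthType themselves) are hypothetical and not the actual Drill code:

// Hypothetical, self-contained sketch: with WidthType gone, every ColumnWidthInfo
// describes a variable-width column, so there is no width-type flag to store and
// no fixed-width branch (or ShouldNotReachHere() guard) in the update path.
class ColumnWidthInfoSketch {
  private final String columnName;    // hypothetical field
  private final int estimatedWidth;   // estimated width of the variable-width column

  ColumnWidthInfoSketch(String columnName, int estimatedWidth) {
    this.columnName = columnName;
    this.estimatedWidth = estimatedWidth;
  }

  // Callers simply accumulate the variable-width estimate; no width-type check needed.
  int accumulate(int runningTotal) {
    return runningTotal + estimatedWidth;
  }
}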
Done. I tried to limit the "blast radius" by gritting my teeth and leaving this class as it was. This seems to be an offshoot of my "temporary" BatchSizer. Just for background: some time ago we realized that not all Drill rows are the same width. (Who'd have guessed?) So we added "temporary" code to work out sizes in each operator. This was supposed to last only until batches carried their own size information. (Once we figure out the size of a vector, we shouldn't have to throw that work away and figure it out again in each operator.)
In fact, I deliberately chose the goofy name "BatchSizer" to remind folks that it was a temporary hack so I could focus on the "real work" of fixing the external sort. Sigh...
But the temporary solution seems to have become semi-permanent and has grown odd variations such as this one.
The "master plan" was not to try to predict batch sizes as we do here. Instead, the ResultSetLoader is intended to let the operator write to the batch until we hit the desired size limit. All the calculations are already done for the reader case. The goal was to use the same mechanism in other places where we don't know widths ahead of time. Project is the classic case: for all we know, the user is doing some silly function like repeating a big VARCHAR 100 times. So, if we can ever get there, all this temporary stuff will be swept away.
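For context, a rough sketch of that "write until the batch is full" pattern in the style of the row set framework. The column setup is omitted and the method names (startBatch, writer, isFull, start, save, harvest) are written from memory, so treat the exact signatures and package paths as assumptions rather than the definitive ResultSetLoader API:

import java.util.Iterator;
// Package paths are approximate; check the current tree for the exact locations.
import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
import org.apache.drill.exec.physical.resultSet.RowSetLoader;
import org.apache.drill.exec.record.VectorContainer;

public class WriteUntilFullSketch {
  // Copy incoming values until the loader reports the batch has hit its size budget,
  // instead of predicting column widths up front.
  VectorContainer copy(ResultSetLoader rsLoader, Iterator<String> incomingValues) {
    rsLoader.startBatch();
    RowSetLoader writer = rsLoader.writer();
    while (incomingValues.hasNext() && !writer.isFull()) {
      writer.start();                                      // begin a row
      writer.scalar(0).setString(incomingValues.next());   // possibly a huge VARCHAR
      writer.save();                                       // commit the row; loader tracks memory
    }
    return rsLoader.harvest();   // batch sized by the actual data, not by estimates
  }
}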
@ihuzenko, thanks for your thorough review, as usual. Here is my general philosophy on refactoring: do it in bite-size chunks. The first step here was simply to break code gen out of the operator itself. By leaving the bulk of the code unchanged, one can review by diffing the old and new files (not in GitHub, sadly) and see that blocks of code are identical; they have just been moved, in some cases from one huge function into smaller functions. Then, once we're satisfied that this simple refactoring works, we can think about the next step. If you agree with the step-by-step approach, then perhaps we can commit this first step to put us in position to make subsequent changes.
Longer term, it seems we might want to restructure how we generate code. Today, if we run a query across, say, 20 minor fragments in the same Drillbit, and they all see the same schema (the normal case), then all 20 will generate exactly the same code. Then, down in the compiler layer, the first thread will compile the code and put it into the code cache. Threads 2 through 20 will get the compiled code from the cache. But, today, the process is very inefficient. Each thread:
The only real benefit of this approach is that it has worked all these years. The better approach is to:
The work I did back in the managed sort, and that I'm starting here, at least splits code gen from the operator. I suspect one (very long term) improvement would be to introduce another layer of abstraction like we had in Impala. The code gen code tries to do the kind of logical type analysis normally done in a planner. But, because the goal is Java code gen, it tends to mix SQL type analysis with Java implementation details, leading to overly complex code. (That's what I'm fighting with in the typeof/UNION issue.) Such an approach would be doubly useful as we roll out the schema improvements your team has been doing. If we know the schema (types) at plan time, we can work out all the type conversion at that time. In fact, we can even play the Spark trick: generate the Java once in the planner and send it to the workers. I have only vague ideas here; I have not spent much time on it. It sounds like you've looked at this some. What do you suggest we do?
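To make the compile-once idea above concrete, here is a generic sketch of such a cache. This is not Drill's actual compiler or cache code; the names and structure are invented purely for illustration:

import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical compile-once cache: all 20 minor fragments generate identical source
// for the same schema, so keying on the source (or a hash of it) lets only the first
// thread pay the compilation cost while the rest reuse the compiled class.
class GeneratedCodeCacheSketch {
  private final ConcurrentHashMap<String, Class<?>> cache = new ConcurrentHashMap<>();

  // 'compiler' stands in for the expensive source -> bytecode step.
  Class<?> getOrCompile(String generatedSource, Function<String, Class<?>> compiler) {
    return cache.computeIfAbsent(generatedSource, compiler);
  }
}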
Force-pushed from fbf8f6f to 2cde64d.
@paul-rogers thanks for the quick update. I agree that using the already generated and compiled code whenever possible would be very good in the long run. Also, I agree with the step-by-step refactoring approach. With respect to this PR, I propose to extract the
@ihuzenko, good suggestion on
We could accomplish the same thing by defining an interface which
I agree that this can be improved, but I don't yet see an obvious move that leaves the code simpler. So, I thought to just leave it as is for now. Maybe it can evolve to own the allocation vectors, complex writers and so on. These are the kinds of questions we can ask now that we can start to see the pieces rather than just a big mess. Suggestions?
Hello @paul-rogers, if you don't want to extract the
@ihuzenko, the goal of this refactoring was simply to pull out the code gen part. The
I was really hoping not to change the execution part of the project operator in this PR, in order to limit the scope of changes. Maybe I'll go with the interface route (discussed in an earlier note), which will help with the longer-term goal outlined below. Let me play with the code to see how that approach might work.
One more thought about the CG. In working on another issue, it slowly dawned on me that there are a couple of other worthwhile goals in addition to those mentioned earlier. We noted above that CG combines logical planning and code gen, creating a very complex bit of code. Again, some of us are not smart enough to quickly understand the combined logic, so it would make sense to separate the two steps so that average folks can more easily understand and modify them. Then, we need to think about debugging and testing. As originally implemented, CG was a direct path from operator to running byte codes. The original authors were smart enough to debug the result by stepping through the compiled byte codes. Most of us are not that smart. So, a few years back we added the "plain Java" and "save code for debugging" modes, which made it possible for us newbies to view and step through the generated Java code. Still, in order to debug, we need to run Drill, find a useful unit test, run the query, and then debug CG in the context of a running query. We are limited in what we can test by which queries we can find or create to set up the scenarios of interest. Better would be to apply unit-testing principles: separate out CG so we can set up a set of inputs, run CG, and diff the resulting code against a golden copy. This is how we debugged the Impala planner and it worked very well. This way, we can easily set up every data type without needing a zillion different input file formats. That is:
Where the "exec plan" would be things like the knick-lacks that the This refactoring can be seen as a naive grasping toward that goal. We need to pull out CG, but it is not clear yet the final form. The same can be said of the earlier refactoring of the external sort to separate out CG. Sometimes, just by taking a step, we can get a bit better understanding of where we need to go. All that said, the goal here is just to take a step, not to implement a final solution. I realize, however, that makes this PR hard to review since it is incomplete: it takes a step, but does not clearly state "a step to where?" |
@ihuzenko, went ahead and adopted the interface solution. The interface is probably a better idea: if we try to isolate code gen for testing, we can create a mock implementation that skips the actual vector work. Please let me know if this satisfies your concerns.
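To sketch what such a mock might look like: the interface and method names below are invented for illustration only and are not the actual interface added in this PR.

// Hypothetical shape of the idea: the materializer talks to a small interface,
// and a test supplies a mock that records calls instead of touching value vectors.
interface BatchBuilderSketch {
  void addTransferField(String name);    // real impl would wire up a transfer pair
  void addEvalExpression(String expr);   // real impl would create an output vector
}

class MockBatchBuilder implements BatchBuilderSketch {
  final java.util.List<String> calls = new java.util.ArrayList<>();
  public void addTransferField(String name) { calls.add("transfer:" + name); }
  public void addEvalExpression(String expr) { calls.add("eval:" + expr); }
  // A code-gen unit test can run the materializer against this mock and assert
  // on 'calls' without allocating any vectors.
}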
@ihuzenko, refactored some additional steps to adopt the solution you suggested. This version still uses the interface, with an implementation in a class other than the project record batch. This turns out to be handy because, oddly, the project operator generates code for two separate incoming batches: the "real" one and a "fake" empty one. The
This version tries to eliminate all references to the incoming batch in the materializer and instead work with the batch schema. Annoyingly, the
The next refactoring move is to change this code to work with a schema (or an interface to obtain the schema) rather than the actual vectors. Now, as it turns out, the batch schema has limitations for complex types, which is one of the reasons we created the
Or, we can just pass an interface which will return the
Since the required work will be rather large, I propose we do that as a separate PR. Have we done enough for this one PR?
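For illustration of the "schema instead of vectors" direction, a small hypothetical interface; BatchSchema and TypedFieldId are real Drill classes, but the interface itself is invented here and is not part of the PR:

// Hypothetical provider: the materializer asks for schema details only, so the
// "real" incoming batch and the "fake" empty batch can both hide behind it.
interface IncomingSchemaSketch {
  org.apache.drill.exec.record.BatchSchema schema();               // column names and types
  org.apache.drill.exec.record.TypedFieldId fieldId(String path);  // resolve a column reference
}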
Hello @paul-rogers,
Thank you very much for finalizing the changes. I have executed the tests on the test cluster and all of them completed successfully 🥇. Please fix the last few tiny comments related to the javadocs and squash the commits.
LGTM, +1.
Special thanks for your previous work on making it possible to debug the code generated by operators. It's an extremely useful feature.
Best regards,
Igor
// to
// before
// the
// if
maybe have all this on one line?
Force-pushed from a677829 to 72fe471.
Thank you @ihuzenko and @arina-ielchiieva for your reviews. Addressed remaining minor issues. Squashed commits. @ihuzenko, your many suggestions made this a much better solution. It is almost, but not quite, clean enough that I could write code gen unit tests. Need to clean up that pesky
Breaks the big "setup" function into its own class, and
separates out physical vector setup from logical projection
planning. No functional change; just rearranging existing
code.
Testing: reran all unit tests.
Jira - DRILL-7503.