
DRILL-7503: Refactor the project operator #1944

Closed

Conversation

paul-rogers (Contributor) commented Dec 31, 2019

Breaks the big "setup" function into its own class, and
separates out physical vector setup from logical projection
planning. No functional change; just rearranging existing
code.

Testing: reran all unit tests.

Jira - DRILL-7503.

ihuzenko (Member) left a comment

Hello @paul-rogers, I'm really happy that you have started refactoring ProjectRecordBatch. I also looked at this class with tears in my eyes :'( What I see in the PR is the creation of a kind of VectorState mediator between the old ProjectRecordBatch and the new ProjectionMaterializer. Unfortunately, such splitting does not bring much clarity to this code, and it is still necessary to redesign the ProjectRecordBatch. Could you please invest more effort in order to finally design a mature solution for this operator? From my side, I'll also take a closer look at the code and share my ideas soon.

Comment on lines 86 to 89
private enum WidthType {
FIXED,
VARIABLE
}
Member

After a close look at the code, I believe this enum is unnecessary. All usages of the ColumnWidthInfo constructor pass WidthType.VARIABLE, and there is a block inside the update method:

      if (columnWidthInfo.isFixedWidth()) {
        // fixed width columns are accumulated in totalFixedWidthColumnWidth
        ShouldNotReachHere();
      } else {...}

Please remove the enum and related code.
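A minimal sketch of the suggested simplification, assuming the enum and its dead fixed-width branch are removed (class shape and method names here are illustrative stand-ins, not the actual Drill code):

```java
// Hypothetical sketch: with WidthType gone, every ColumnWidthInfo is
// variable-width by construction, so no isFixedWidth() check (and no
// ShouldNotReachHere() branch) is needed in update().
final class ColumnWidthInfo {
    private final String name;
    private final int estimatedWidth;

    ColumnWidthInfo(String name, int estimatedWidth) {
        this.name = name;
        this.estimatedWidth = estimatedWidth;
    }

    // All instances are variable-width; callers use the estimate directly.
    int estimatedWidth() { return estimatedWidth; }

    String name() { return name; }
}
```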

Contributor Author

Done. I tried to limit the "blast radius" by gritting my teeth and leaving this class as it was. This seems to be some offshoot of my "temporary" BatchSizer. Just for background, some time ago we realized that not all Drill rows are of the same width. (Who'd have guessed?) So, we added "temporary" code to work out sizes in each operator. This was supposed to be temporary until batches carried their own size info. (Once we figure it out for a vector, we don't have to throw it away and figure it out again for each operator.)

In fact, I deliberately chose a goofy name, "BatchSizer", to remind folks that this was a temporary hack so I could focus on the "real work" of fixing the external sort. Sigh...

But, the temporary solution seems to have become semi-permanent and has grown odd variations such as this.

The "master plan" was not to try to predict batch sizes as we do here. Instead, the ResultSetLoader is intended to just let the operator write to the batch until we hit the desired limit. All the calcs are already done for the reader case. The goal was to use the same mechanism in other places where we don't know widths ahead of time. Project is the classic case: for all we know, the user is doing some silly function like repeating a big VARCHAR 100 times. So, if we can ever get there, all this temporary stuff will be swept away.
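The "write until full" idea can be sketched in a few lines. This is a simplified stand-in for illustration only; the real mechanism is Drill's ResultSetLoader, and BatchWriter here is a hypothetical name:

```java
// Hypothetical sketch of the write-until-limit pattern: instead of predicting
// row widths up front, the writer accumulates actual bytes and reports when
// the batch is full, at which point the operator ships it downstream.
final class BatchWriter {
    private final int byteLimit;
    private int bytesUsed;
    private int rowCount;

    BatchWriter(int byteLimit) { this.byteLimit = byteLimit; }

    // Returns false when the row would exceed the limit; the caller then
    // starts a new batch. No width estimation is ever needed.
    boolean writeRow(byte[] rowBytes) {
        if (bytesUsed + rowBytes.length > byteLimit) {
            return false;
        }
        bytesUsed += rowBytes.length;
        rowCount++;
        return true;
    }

    int rowCount() { return rowCount; }
}
```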

paul-rogers (Contributor Author)

@ihuzenko, thanks for your thorough review, as usual. Here is my general philosophy on refactoring: do it in bite-size chunks. The first step here was simply to break code gen out of the operator itself. By leaving the bulk of the code unchanged, one can do a review by diffing the old and new files (not in GitHub, sadly) and see that blocks of code are identical; they have just been moved, in some cases from one huge function into smaller functions. Then, once we're satisfied that this simple refactoring works, we can think about the next step.

If you agree with the step-by-step approach, then perhaps we can commit this first step to put us in position to make subsequent changes.

paul-rogers (Contributor Author)

Longer term, it seems we might want to restructure how we generate code. Today, if we run a query across, say, 20 minor fragments in the same Drillbit, and they all see the same schema (the normal case), then all 20 will generate exactly the same code. Then, down in the compiler layer, the first thread will compile the code and put it into the code cache. Threads 2 through 20 will get the compiled code from the cache.

But, today, the process is very inefficient. Each thread:

  • Does the semantic expression analysis (function lookup, type resolution, etc.).
  • Generates the entire Java source.
  • Rewrites the source to remove the varying bits (the generated class name).
  • Hashes the entire source to get a key for the code-cache lookup.
  • If a match is found, compares the entire code block byte-by-byte to verify the match.
  • If no match is found, compiles the code, caches it, and uses the source code (which can be huge) as the cache key.

The only real benefit of this approach is that it has worked all these years.

The better approach is to:

  • Create a parameterized "descriptor" object that holds all the factors needed to generate the code. (Input schema, expressions, relevant session options.)
  • Use that descriptor as a key into the code lookup table. If found, reuse the compiled code without any code gen.
  • If not found, only then tell the descriptor to generate the needed code, which will then be shared by all fragments.
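The descriptor-as-key idea can be sketched as follows. All names here (CodeDescriptor, CodeCache) are hypothetical illustrations of the proposal, not existing Drill classes:

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical descriptor: a small value object capturing the factors that
// determine the generated code (input schema, expressions, session options),
// with equals/hashCode so it can serve directly as a cache key.
final class CodeDescriptor {
    final String schemaSignature;
    final String expressions;

    CodeDescriptor(String schemaSignature, String expressions) {
        this.schemaSignature = schemaSignature;
        this.expressions = expressions;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof CodeDescriptor)) return false;
        CodeDescriptor that = (CodeDescriptor) o;
        return schemaSignature.equals(that.schemaSignature)
            && expressions.equals(that.expressions);
    }

    @Override public int hashCode() {
        return Objects.hash(schemaSignature, expressions);
    }
}

// Hypothetical cache: code gen and compilation run only on a miss; the other
// fragments with the same descriptor reuse the compiled result with no gen.
final class CodeCache {
    private final Map<CodeDescriptor, Object> cache = new ConcurrentHashMap<>();

    Object lookup(CodeDescriptor key, Function<CodeDescriptor, Object> generator) {
        return cache.computeIfAbsent(key, generator);
    }
}
```

The key contrast with the current scheme: the lookup compares a few small fields instead of hashing and byte-comparing an entire generated source file.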

The work I did back in the managed sort, and that I'm starting here, at least splits code gen from the operator.

I suspect one (very long term) improvement would be to introduce another layer of abstraction like we had in Impala. The code gen code tries to do the kind of logical type analysis normally done in a planner. But, because the goal is Java code gen, it tends to mix SQL type analysis with Java implementation details, leading to overly complex code. (That's what I'm fighting with in the typeof/UNION issue.)

Such an approach would be doubly useful as we roll out the schema improvements your team has been doing. If we know the schema (types) at plan time, we can work out all the type conversion stuff at that time. In fact, we could even play the Spark trick: generate the Java once in the planner and send it to the workers.

I have only vague ideas here; I have not spent much time on it. It sounds like you've looked at this some. What do you suggest we do?

ihuzenko (Member) commented Jan 3, 2020

@paul-rogers thanks for the quick update. I agree that reusing already generated and compiled code whenever possible would be very good in the long run. I also agree with the step-by-step refactoring approach. With respect to this PR, I propose extracting VectorState into a separate class, together with all the ProjectRecordBatch fields that can be encapsulated. I believe that extracting VectorState will prevent mixed access to those fields in the future and open the way for later improvements. In order to better express my thought, I created the sample commit. Sorry for the late reply.

paul-rogers (Contributor Author)

@ihuzenko, good suggestion on VectorState: my original thought was to make it a static class, then move it to its own file. Then I thought better of it. You'll notice that, at present, it is kind of a "shim" class to expose certain operations on the vector objects to the "materializer" without making the entire project operator visible. In particular, it is an inner class so it can tinker with project operator state.

We could accomplish the same thing by defining an interface which ProjectRecordBatch would implement. But, in that case, all the complex setup logic would be back in the project operator, undoing the separation we just accomplished. So, the interface approach feels not quite right.

I agree that this can be improved, but I don't yet see an obvious move that leaves the code simpler. So, I thought to just leave it as is for now.

Maybe it can evolve to own the allocation vectors, complex writers and so on. These are the kinds of questions we can ask now that we can start to see the pieces rather than just a big mess. Suggestions?

ihuzenko (Member) commented Jan 6, 2020

Hello @paul-rogers, if you don't want to extract the VectorState class, then at least please put the related fields into it, so it will be visible which state is managed by the class. Extraction is still the better way in my view, since it won't let anybody easily mix access to related fields (transfers, allocationVectors, complexWriters, complexFieldReferences, etc.). P.S. There is no quick solution here; nothing will look really good without serious rewriting. :)

paul-rogers (Contributor Author) commented Jan 6, 2020

@ihuzenko, the goal of this refactoring was simply to pull out the code gen part. The VectorState is an admitted hack to create a clean interface on the CG side, though it is messy on the operator side.

I was really hoping to not change the execution part of the project operator in this PR in order to limit the scope of changes. Maybe I'll go with the interface route (discussed in an earlier note), which will help with the longer term goal outlined below. Let me play with the code to see how that approach might work.

paul-rogers (Contributor Author)

One more thought about the CG. While working on another issue, it slowly dawned on me that there are a couple of other worthwhile goals in addition to those mentioned earlier.

We mentioned above that CG combines logical planning and code gen, creating a very complex bit of code. Again, some of us are not smart enough to quickly understand the combined logic, so it would make sense to separate the two steps so that average folks can more easily understand and modify them.

Then, we need to think about debugging and testing. As originally implemented, CG was a direct path from operator to running byte codes. The original authors were smart enough to debug the result by stepping through the compiled byte codes. Most of us are not that smart. So, a few years back we added the "plain Java" and "save code for debugging" modes, which made it possible for us newbies to view and step through the Java code.

Still, in order to debug, we need to run Drill, find a useful unit test, run the query, and then debug CG in the context of a running query. We are limited in what we can test by what queries we can find or create to set up the scenarios of interest.

Better would be to apply unit-testing principles: separate out CG so we can set up a set of inputs, run CG, and diff the resulting code against a golden copy. This is how we debugged the Impala planner, and it worked very well. This way, we can easily set up every data type without needing a zillion different input file formats. That is:

(operator config, schema, options) --> CG --> (Java code, exec plan)

Where the "exec plan" would be things like the knick-knacks that the VectorState currently wraps. That is, rather than having CG set up vectors, maybe CG creates a plan that says, "here are the vectors you will need", and then the operator creates the vectors. Still pretty hand-wavy at this point.
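The golden-copy check itself is simple to sketch. GoldenCheck and firstDiff below are illustrative names for a test helper, not existing Drill test utilities:

```java
// Hypothetical golden-file comparison: a CG unit test would run code gen on
// fixed inputs (config, schema, options), then compare the generated source
// against a saved golden copy, reporting the first differing line on failure.
final class GoldenCheck {
    // Returns null when actual matches golden; otherwise a message naming
    // the first mismatching line, suitable for a test-failure report.
    static String firstDiff(String actual, String golden) {
        String[] a = actual.split("\n", -1);
        String[] g = golden.split("\n", -1);
        int n = Math.min(a.length, g.length);
        for (int i = 0; i < n; i++) {
            if (!a[i].equals(g[i])) {
                return "line " + (i + 1) + ": expected <" + g[i] + "> but was <" + a[i] + ">";
            }
        }
        return a.length == g.length
            ? null
            : "line count differs: " + a.length + " vs " + g.length;
    }
}
```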

This refactoring can be seen as a naive grasping toward that goal. We need to pull out CG, but its final form is not yet clear. The same can be said of the earlier refactoring of the external sort to separate out CG. Sometimes, just by taking a step, we get a bit better understanding of where we need to go.

All that said, the goal here is just to take a step, not to implement a final solution. I realize, however, that makes this PR hard to review since it is incomplete: it takes a step, but does not clearly state "a step to where?"

paul-rogers (Contributor Author)

@ihuzenko, went ahead and adopted the interface solution. The interface is probably a better idea: if we try to isolate code gen for testing, we can create a mock implementation that skips the actual vector work. Please let me know if this satisfies your concerns.

paul-rogers (Contributor Author)

@ihuzenko, I did some additional refactoring to adopt the solution you suggested. This version still uses the interface, with an implementation in a class other than the project record batch. This turns out to be handy because, oddly, the project operator generates code for two separate incoming batches: the "real" one and a "fake" empty one. The ProjectRecordBatchBuilder holds onto the input batch so we don't have to pass it into the materializer and back out.

This version tries to eliminate all references to the incoming batch in the materializer, and instead work with the batch schema. Annoyingly, the ExpressionTreeMaterializer needs the input batch so it can iterate over the vectors to get their schemas. If all we need is the schema, we don't need actual vectors. So, if we can pass in a schema, we can completely separate code gen from physical vectors.

The next refactoring move is to change this code to work with a schema (or interface to obtain the schema) rather than the actual vectors. Now, as it turns out, the batch schema has limitations for complex types, which is one of the reasons we created the TupleMetadata family of classes. So, perhaps we can convert the incoming batch to a TupleMetadata schema and use that. (The code to do that already exists in the RowSet classes.)

Or, we can just pass an interface which will return the TypedFieldId for each column. Or, do that conversion ahead of time, and pass in the results. Will have to play with it some to see which solution is the simplest.
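One of the options floated above, passing an interface that resolves columns to their field IDs instead of passing real vectors, can be sketched like this. ColumnSchema, SchemaLookup, and MapSchemaLookup are hypothetical names (the actual Drill types in play are TypedFieldId and TupleMetadata):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical column metadata: enough for code gen (name, type, field id)
// without any physical vector behind it.
final class ColumnSchema {
    final String name;
    final String type;   // e.g. "VARCHAR", "INT"
    final int fieldId;

    ColumnSchema(String name, String type, int fieldId) {
        this.name = name; this.type = type; this.fieldId = fieldId;
    }
}

// The materializer would depend only on this interface, fully decoupling
// code gen from the incoming record batch.
interface SchemaLookup {
    ColumnSchema column(String name);  // null if the column is unknown
}

// A map-backed implementation needs no vectors at all, which is what would
// let code-gen unit tests run without constructing a record batch.
final class MapSchemaLookup implements SchemaLookup {
    private final Map<String, ColumnSchema> columns = new LinkedHashMap<>();

    void add(ColumnSchema col) { columns.put(col.name, col); }

    @Override public ColumnSchema column(String name) { return columns.get(name); }
}
```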

Since the required work will be rather large, I propose we do it as a separate PR.

Have we done enough for this one PR?

ihuzenko (Member) left a comment

Hello @paul-rogers,

Thank you very much for finalizing the changes. I executed the tests on the test cluster and all of them completed successfully 🥇. Please fix the last few tiny comments related to the javadocs and squash the commits.

LGTM, +1.

Special thanks for your previous work on making it possible to debug the code generated by operators. It's an extremely useful feature.

Best regards,
Igor

}

/**
*
Member

Suggested change:
*

}
return write;
}
}
Member

Suggested change:
}
}

Comment on lines +243 to +247
// to
// before
// the
// if
Member

maybe have all this on one line?

paul-rogers (Contributor Author)

Thank you @ihuzenko and @arina-ielchiieva for your reviews. Addressed remaining minor issues. Squashed commits.

@ihuzenko, your many suggestions made this a much better solution. It is almost, but not quite, clean enough that I could write code gen unit tests. Need to clean up that pesky ExpressionTreeMaterializer issue, then we'll be able to write such tests.

@asfgit asfgit closed this in 0af04cd Jan 9, 2020