Spark 3.3: Choose readers based on task types #6345
Conversation
```diff
 import org.slf4j.LoggerFactory;

-class BatchDataReader extends BaseBatchReader<FileScanTask> {
+class BatchDataReader extends BaseBatchReader<FileScanTask>
```
This class was only used as a PartitionReader in SparkScan, where we extended it, mixed in PartitionReader, and called the resulting implementation BatchReader. After adding a common reader factory, we may have multiple batch readers, which is why BatchDataReader seemed like a more accurate name than BatchReader. Since no other places used this class, I decided to implement PartitionReader directly here.
Any feedback is welcome. See SparkScan below for the old usage.
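For illustration, a minimal toy sketch of a reader that implements PartitionReader directly, mirroring the pattern described above (this is not the PR's code; BatchDataReader in the PR inherits its reading logic from BaseBatchReader):

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Toy reader implementing PartitionReader directly, so no anonymous
// subclass mixing in the interface is needed at the scan level.
class DirectBatchReader implements PartitionReader<ColumnarBatch> {
  private final Iterator<ColumnarBatch> batches;
  private ColumnarBatch current = null;

  DirectBatchReader(Iterator<ColumnarBatch> batches) {
    this.batches = batches;
  }

  @Override
  public boolean next() {
    if (!batches.hasNext()) {
      return false;
    }
    current = batches.next();
    return true;
  }

  @Override
  public ColumnarBatch get() {
    return current;
  }

  @Override
  public void close() throws IOException {
    if (current != null) {
      current.close();
    }
  }
}
```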
```java
  }

  BatchDataReader(
      ScanTaskGroup<FileScanTask> task,
```
Most other readers use a different parameter order: table, taskGroup, expectedSchema, caseSensitive.
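For reference, a hedged sketch of the suggested order (signature only; the class name and body here are placeholders, not the PR's code):

```java
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.Schema;
import org.apache.iceberg.ScanTaskGroup;
import org.apache.iceberg.Table;

class BatchDataReaderSketch {
  // Hypothetical signature matching the order used by the other readers:
  // table, taskGroup, expectedSchema, caseSensitive.
  BatchDataReaderSketch(
      Table table,
      ScanTaskGroup<FileScanTask> taskGroup,
      Schema expectedSchema,
      boolean caseSensitive) {
    // body elided; in the PR the constructor delegates to BaseBatchReader
  }
}
```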
```diff
 public Batch toBatch() {
-  return new SparkChangelogBatch(
-      spark, table, readConf, taskGroups(), expectedSchema, hashCode());
+  return new SparkBatch(sparkContext, table, readConf, taskGroups(), expectedSchema, hashCode());
```
If you take a look at SparkChangelogBatch, it was identical to SparkBatch except for the reader factory. Since we are adding a common factory for all tasks, it seemed appropriate to always use one class.
```java
SparkInputPartition partition = (SparkInputPartition) inputPartition;

if (partition.allTasksOfType(FileScanTask.class)) {
```
@szehon-ho, this is where you would check the type of tasks and select your reader.
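A hedged sketch of the dispatch this enables (the changelog branch and the reader constructors are assumptions for illustration, not the PR's final code):

```java
import org.apache.iceberg.ChangelogScanTask;
import org.apache.iceberg.FileScanTask;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;

// Illustrative factory method: inspect the task type carried by the
// partition and pick a matching reader implementation.
public PartitionReader<InternalRow> createReader(InputPartition inputPartition) {
  SparkInputPartition partition = (SparkInputPartition) inputPartition;

  if (partition.allTasksOfType(FileScanTask.class)) {
    return new RowDataReader(partition); // hypothetical constructor
  } else if (partition.allTasksOfType(ChangelogScanTask.class)) {
    return new ChangelogRowReader(partition); // hypothetical constructor
  } else {
    throw new UnsupportedOperationException("Unsupported task group");
  }
}
```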
```java
  return (ScanTaskGroup<T>) taskGroup;
}

public <T extends ScanTask> boolean allTasksOfType(Class<T> javaClass) {
```
Curious question here: do we assume a single task type for every taskGroup or not?
Just wondering if we can optimize this to check only the type of the taskGroup.
The idea is to be able to check whether all tasks conform to a particular known parent type. For instance, for changelog tasks we only check whether all tasks implement ChangelogScanTask. The reader itself (e.g. ChangelogRowReader) may then downcast them to a particular child type.
We can't check the type of the taskGroup itself because of Java type erasure.
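A minimal sketch of such a check, assuming the partition can reach its task collection (the accessor below is an assumption; the PR's actual implementation may differ):

```java
import java.util.Collection;
import org.apache.iceberg.ScanTask;

// Type erasure removes ScanTaskGroup's type parameter at runtime, so the
// only reliable check is over the task instances themselves.
public <T extends ScanTask> boolean allTasksOfType(Class<T> javaClass) {
  Collection<? extends ScanTask> tasks = taskGroup().tasks(); // assumed accessor
  return tasks.stream().allMatch(javaClass::isInstance);
}
```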
Do we allow a taskGroup to have tasks of different types? I didn't think so. But I don't think it costs us much to check all of them instead of just the first element.
Makes sense. Yeah, my question was the same as Russell's: whether there is always a one-to-one relation (defined by TaskGroup). But I guess there's nothing in TaskGroup preventing tasks that are different subclasses of T, although that would be strange.
That's true; ScanTaskGroup may have arbitrary tasks.
```java
@Override
public boolean supportColumnarReads(InputPartition inputPartition) {
  return batchSize > 1;
```
Not sure how this relates to vectorized reads?
I see this was in the old code; I guess we keep it.
Correct, I just copied what we had before. Our SparkBatch decides whether vectorized reads are supported and passes a batch size > 1 if supported, 0 otherwise.
How about adding methods .vectorized(boolean) and .batchSize(int)? It would be more code, but logically cleaner.
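A hedged sketch of that suggestion, carrying an explicit vectorized flag instead of overloading batchSize > 1 (names and stub bodies are illustrative, not the PR's final code):

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;

// Hypothetical variant: "is this read vectorized?" and "what batch size
// should it use?" become two separate pieces of state.
class SketchPartitionReaderFactory implements PartitionReaderFactory {
  private final boolean vectorized;
  private final int batchSize;

  SketchPartitionReaderFactory(boolean vectorized, int batchSize) {
    this.vectorized = vectorized;
    this.batchSize = batchSize;
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public boolean supportColumnarReads(InputPartition partition) {
    return vectorized; // explicit flag, no magic batch-size check
  }
}
```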
```java
class SparkPartitionReaderFactory implements PartitionReaderFactory {
  private final int batchSize;

  SparkPartitionReaderFactory(int batchSize) {
```
Maybe we should just document here that a batch size > 1 will create a columnar reader; it seems like a bit of a magic parameter.
What about creating separate batch and non-batch reader factories? I just copied the existing code, but having separate classes seems more natural than checking whether batchSize is > 1.
Thoughts, @RussellSpitzer @szehon-ho?
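For illustration, a hedged sketch of what such a split could look like (class names and stub bodies are assumptions; per the follow-up below, the factory was indeed split into two):

```java
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Row-based factory: no batch size at all; supportColumnarReads keeps
// its default of false from PartitionReaderFactory.
class SparkRowReaderFactorySketch implements PartitionReaderFactory {
  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    throw new UnsupportedOperationException("sketch only");
  }
}

// Columnar factory: owns the batch size and always reads columnar.
class SparkColumnarReaderFactorySketch implements PartitionReaderFactory {
  private final int batchSize;

  SparkColumnarReaderFactorySketch(int batchSize) {
    this.batchSize = batchSize;
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    throw new UnsupportedOperationException("only columnar reads");
  }

  @Override
  public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition partition) {
    throw new UnsupportedOperationException("sketch only");
  }

  @Override
  public boolean supportColumnarReads(InputPartition partition) {
    return true;
  }
}
```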
RussellSpitzer
left a comment
This all makes sense to me. @szehon-ho is closer to the implementation here, though, so I'll defer to him on whether this makes sense for his delete file reader implementation.
szehon-ho
left a comment
Yep, I'll still have to see how my new table will fit (I'm a bit behind), but the refactor looks good to me, and we can revisit if needed.
flyrain
left a comment
Looks pretty good. Thanks @aokolnychyi!
I've split the factory into two. I feel that eliminates the confusion pretty well.
Thanks for reviewing, @szehon-ho @RussellSpitzer @flyrain!
This PR adds SparkPartitionReaderFactory, which creates readers based on the tasks in input partitions.