Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic filter pushdown for selective broadcast joins #103

Closed

Conversation

mbasmanova
Copy link
Contributor

@mbasmanova mbasmanova commented Aug 25, 2021

This commit introduces a mechanism to push down dynamically generated filters from one operator to another operator upstream. The new mechanism is used to push down a filter on the join keys from a highly-selective broadcast (or simply colocated) hash join into probe-side table scan. This allows the ORC reader to skip whole files and/or row groups based on the extra filter.

The workflow is:

  • HashProbe operator waits for the HashBulid operator to load all the data and build a hash table.
  • HashProbe operator checks if join keys in the hash table are low-cardinality, e.g. table_->hashMode() != BaseHashTable::HashMode::kHash. If not, no push-down will occur.
  • HashProbe operator asks the Driver if it can push down filters on the join key columns. If not, no push-down will occur. It is possible that Driver can push down filter for some join keys, but not all, in which case HashProbe will produce filters only for these keys.
  • HashProbe starts processing the input and monitor the selectivity of individual join keys, e.g. how many rows build-side VectorHasher is dropping for each key.
  • Join keys whose corresponding built-side VectorHashers drop at least 1/3 of rows (after seeing at least 10K rows of input) are used to generated filters.
  • Driver loop periodically checks if an operator has generated dynamic filters and, if so, pushes them down.
  • In this change, only TableScan accepts dynamic filters. Since Exchange operator doesn't accept dynamic filter, filter pushdown doesn't happen for non-broadcast joins.

Individual changes that make up the new mechanism are:

  • Extend connector interface to add DataSource::addDynamicFilter() method to add dynamically generated filter.
  • Update Hive connector to add dynamic filters to the scanSpec_ before processing the next batch of rows. If a column already has a filter, dynamic filter is merged into an existing filter using Filter::mergeWith().
  • The new Driver::canPushdownFilters() method returns whether it is possible to pushdown filters generated by a given operator for the specified columns. To push down a filter, there should be an upstream operator that accepts dynamic filters, e.g. Operator::canAddDynamicFilter() returns true, and the filter column should not be identity-projected from that operator all the way to the filter producing operator.
  • Modified driver loop to check if an operator produced some dynamic filters and, if so, push them down.
  • Modified HashProbe operator to generate dynamic filters for join keys which have highly-selective VectorHashers on the build side. This logic applies only to inner joins with array-based lookup on the build side.
  • Modified TableScan operator to pass dynamic filters to the connector.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 25, 2021
@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mbasmanova mbasmanova force-pushed the dynamic-filtering branch 2 times, most recently from 683db7b to e3ef9a1 Compare August 25, 2021 14:05
@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mbasmanova mbasmanova force-pushed the dynamic-filtering branch 12 times, most recently from 534e614 to 13295e2 Compare August 31, 2021 17:33
@mbasmanova mbasmanova changed the title [WIP] Dynamic filtering [WIP] Dynamic filter pushdown for selective broadcast joins Aug 31, 2021
@mbasmanova mbasmanova changed the title [WIP] Dynamic filter pushdown for selective broadcast joins Dynamic filter pushdown for selective broadcast joins Aug 31, 2021
@mbasmanova mbasmanova force-pushed the dynamic-filtering branch 2 times, most recently from 17a0b1a to b89dcec Compare August 31, 2021 18:30
@mbasmanova mbasmanova marked this pull request as ready for review August 31, 2021 18:30
@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@oerling
Copy link
Contributor

oerling commented Aug 31, 2021 via email

@mbasmanova
Copy link
Contributor Author

You might prefer an API contract that has the filters take effect on the next batch, not the next split. This is for the case where the filter replaces the whole operator, you need to know when it takes effect.

I was thinking about that. What would it take to allow changing ScanSpec mid-split?

@oerling oerling self-assigned this Aug 31, 2021
@oerling
Copy link
Contributor

oerling commented Aug 31, 2021 via email

@mbasmanova
Copy link
Contributor Author

@oerling Orri, thank you for reviewing. I appreciate the feedback. I updated the PR to apply the pushed down filter to the next batch vs. next split. With this we can make a follow-up PR to turn the join into a no-op under certain conditions after pushing filter down. Would you take another look?

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@oerling
Copy link
Contributor

oerling commented Sep 1, 2021 via email

@oerling
Copy link
Contributor

oerling commented Sep 1, 2021 via email

@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@mbasmanova has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mbasmanova
Copy link
Contributor Author

@oerling Orri, thanks for bring up the issue with filter caches. I added SelectiveColumnReader::resetFilterCaches() and DwrfRowReader::columnReader() methods and updated HiveDataSource::addDynamicFilter to use these:

  auto columnReader =
      dynamic_cast<SelectiveColumnReader*>(rowReader_->columnReader());
  columnReader->resetFilterCaches();

@facebook-github-bot
Copy link
Contributor

@mbasmanova merged this pull request in a6b6dca.

facebook-github-bot pushed a commit that referenced this pull request Feb 8, 2022
Summary:
Pull Request resolved: facebook/sapling#103

Automate maintenance of the edenscm_* github actions yamls

Add job file and name options and support for the Rust install section

Reviewed By: fanzeyi

Differential Revision: D34044422

fbshipit-source-id: 7d5f07d37bab1eff5de30a88e710dbf7479ca192
PHILO-HE added a commit to PHILO-HE/velox that referenced this pull request Dec 26, 2022
rui-mo pushed a commit to rui-mo/velox that referenced this pull request Jan 6, 2023
rui-mo pushed a commit to rui-mo/velox that referenced this pull request Jan 12, 2023
PHILO-HE added a commit to PHILO-HE/velox that referenced this pull request Feb 3, 2023
rui-mo pushed a commit to rui-mo/velox that referenced this pull request Feb 24, 2023
liujiayi771 pushed a commit to liujiayi771/velox that referenced this pull request Mar 3, 2023
liujiayi771 pushed a commit to liujiayi771/velox that referenced this pull request Mar 9, 2023
rui-mo pushed a commit to rui-mo/velox that referenced this pull request Mar 17, 2023
* Update the groupid to io.gluten

* Update Gluten name and version in all pom.xml

* Update gluten picture
liujiayi771 pushed a commit to liujiayi771/velox that referenced this pull request Apr 1, 2023
rui-mo pushed a commit to rui-mo/velox that referenced this pull request Apr 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants