Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix distinctness/sort enforcing #1683

Merged
merged 8 commits into from
Apr 4, 2023
Merged

Conversation

max-hoffman
Copy link
Contributor

@max-hoffman max-hoffman commented Mar 31, 2023

The original bug: dolthub/dolt#5651 duplicates a RIGHT_SEMI_LOOKUP_JOIN row because we were distincting right full row rather than the subset of join attributes.

This PR adds some more tests around ordering and sort enforcing in the memo.

The overview is that DISTINCT is weird because it is something in-between a property of a relational expression and the property of a relational group. It is an enforcer that we can implement as an ORDERED_DISTINCT or ditch altogether when child nodes provide supportive sort orders. We could imagine bifurcating the memo into buckets, with expression groups sectioned into groups based on sort orders, and costing considering the cardinality of children plus conditional sort enforcers. More work needed to think through how PG and CRDB do this generally.

Copy link
Member

@zachmu zachmu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, just some comments on naming

@@ -134,14 +134,14 @@ var joinCostTests = []struct {
// queries that test subquery hoisting
{
// case 1: condition uses columns from both sides
Query: "/*case1*/ select * from ab where exists (select * from xy where ab.a = xy.x + 3)",
Query: "/*+case1*/ select * from ab where exists (select * from xy where ab.a = xy.x + 3)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of adding the + to these comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/* breaks go's test regex matching, /*+ doesn't

}
relCost += dCost
} else {
n.setDistinct(noDistinctOp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a no-op?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently no. i was flip-flopping on whether the 0-code should be "unknown" or "no distinct"

@@ -548,6 +654,8 @@ type relBase struct {
c float64
// cnt is this relations output row count
cnt float64
// distinct indicates a relExpr should be checked for sort enforcement
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean distinctness?

case *mergeJoin:
var ret sql.Schema
for _, e := range r.innerScan.idx.Expressions() {
parts := strings.Split(e, ".")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fraught, column names can in fact include a "."

Not much you can do right now, but note it

// sortedOnDistinct returns true if a relation's inputs are sorted on the
// full output schema. The OrderedDistinct operator can be used in this
// case.
func sortedOnDistinct(rel relExpr) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be sortedInputs, no? None of this logic has anything to do with distinct


func sortedColsForRel(rel relExpr) sql.Schema {
switch r := rel.(type) {
case *tableScan:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check sufficient? We don't actually enforce that tables return their results in PK order, right? I think we do have an index sub interface that supports declaring that, maybe OrderedIndex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could maybe add an attribute to PK schema, or a OrderedTable interface to indicate sorting. We have a lot of tests that would break if Dolt PK's suddenly stopped being ordered. In practice all kinds of crazy corruptions happen when entries in the primary get out of order.

Base automatically changed from max/anti-join-hints to main April 4, 2023 01:10
@max-hoffman max-hoffman merged commit 3b15f9f into main Apr 4, 2023
@max-hoffman max-hoffman deleted the max/right-semi-join-distinct branch April 4, 2023 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants