-
-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix distinctness/sort enforcing #1683
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good, just some comments on naming
@@ -134,14 +134,14 @@ var joinCostTests = []struct { | |||
// queries that test subquery hoisting | |||
{ | |||
// case 1: condition uses columns from both sides | |||
Query: "/*case1*/ select * from ab where exists (select * from xy where ab.a = xy.x + 3)", | |||
Query: "/*+case1*/ select * from ab where exists (select * from xy where ab.a = xy.x + 3)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the point of adding the + to these comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/*
breaks go's test regex matching, /*+
doesn't
} | ||
relCost += dCost | ||
} else { | ||
n.setDistinct(noDistinctOp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a no-op?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
currently no. i was flip-flopping on whether the 0-code should be "unknown" or "no distinct"
sql/analyzer/memo.go
Outdated
@@ -548,6 +654,8 @@ type relBase struct { | |||
c float64 | |||
// cnt is this relations output row count | |||
cnt float64 | |||
// distinct indicates a relExpr should be checked for sort enforcement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean distinctness?
case *mergeJoin: | ||
var ret sql.Schema | ||
for _, e := range r.innerScan.idx.Expressions() { | ||
parts := strings.Split(e, ".") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fraught, column names can in fact include a "."
Not much you can do right now, but note it
sql/analyzer/memo.go
Outdated
// sortedOnDistinct returns true if a relation's inputs are sorted on the | ||
// full output schema. The OrderedDistinct operator can be used in this | ||
// case. | ||
func sortedOnDistinct(rel relExpr) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be sortedInputs, no? None of this logic has anything to do with distinct
|
||
func sortedColsForRel(rel relExpr) sql.Schema { | ||
switch r := rel.(type) { | ||
case *tableScan: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check sufficient? We don't actually enforce that tables return their results in PK order, right? I think we do have an index sub interface that supports declaring that, maybe OrderedIndex?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could maybe add an attribute to PK schema, or a OrderedTable interface to indicate sorting. We have a lot of tests that would break if Dolt PK's suddenly stopped being ordered. In practice all kinds of crazy corruptions happen when entries in the primary get out of order.
The original bug: dolthub/dolt#5651 duplicates a RIGHT_SEMI_LOOKUP_JOIN row because we were distincting right full row rather than the subset of join attributes.
This PR adds some more tests around ordering and sort enforcing in the memo.
The overview is that DISTINCT is weird because it is something in-between a property of a relational expression and the property of a relational group. It is an enforcer that we can implement as an ORDERED_DISTINCT or ditch altogether when child nodes provide supportive sort orders. We could imagine bifurcating the memo into buckets, with expression groups sectioned into groups based on sort orders, and costing considering the cardinality of children plus conditional sort enforcers. More work needed to think through how PG and CRDB do this generally.