Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect result from aggregating on table joined in both subquery and CTE #4353

Closed
snipesjr opened this issue Sep 15, 2022 · 15 comments
Closed
Assignees
Labels
bug Something isn't working integrations Issues with tools connecting to/querying Dolt sql Issue with SQL

Comments

@snipesjr
Copy link

Hi, earlier I tried to do a quick COUNT(*) to spot check some join logic in a CTE. The output appears to be a query planning bug? Seems to be skipping my join condition entirely.

My db is here:
https://www.dolthub.com/repositories/snipesjr/npi_registry

Subquery example (bad):

SELECT p.taxonomy_code AS taxonomy_classification
     , COUNT(*) AS provider_count # gives incorrect count
#      , COUNT(DISTINCT p.npi) AS provider_count # gives query plan error
    FROM (
        SELECT pro.npi
             , IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code
            FROM npi_provider pro
                LEFT JOIN npi_provider_extra ext
                    ON (pro.npi = ext.npi
                    AND ext.primary_taxonomy_switch IN ('Y', 'X'))
    ) AS p
GROUP BY p.taxonomy_code
ORDER BY COUNT(*) DESC;

CTE example (bad):

WITH providers AS (
    SELECT pro.npi
         , IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code
    FROM npi_provider pro
        LEFT JOIN npi_provider_extra ext
            ON (pro.npi = ext.npi
            AND ext.primary_taxonomy_switch IN ('Y', 'X'))
)

SELECT p.taxonomy_code
     , COUNT(*) AS provider_count
    FROM providers AS p
    GROUP BY p.taxonomy_code
    ORDER BY COUNT(*) DESC;

Non-subquery/CTE (good):

SELECT IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code
     , COUNT(*) AS provider_count
    FROM npi_provider pro
        LEFT JOIN npi_provider_extra ext
            ON (pro.npi = ext.npi
            AND ext.primary_taxonomy_switch IN ('Y', 'X'))
GROUP BY IFNULL(ext.taxonomy_code, 'UNKNOWN')
ORDER BY COUNT(*) DESC

My dolt version:

$ dolt version
dolt version 0.41.3
database storage format: OLD ( __LD_1__ )

I tried to test the query in the web ui but my query timed out.

@snipesjr
Copy link
Author

Update. The subquery works in dolt sql-client, but returns wrong result in Pycharm. Could be driver related?

DBMS: MySQL (ver. 5.7.9-Vitess)
Case sensitivity: plain=exact, delimited=exact
Driver: MySQL Connector/J (ver. mysql-connector-java-8.0.25 (Revision: 08be9e9b4cba6aa115f9b27b215887af40b159e0), JDBC4.2)
Effective version: Vitess (ver. 0.0)

@timsehn timsehn added bug Something isn't working sql Issue with SQL labels Sep 15, 2022
@timsehn
Copy link
Sponsor Contributor

timsehn commented Sep 15, 2022

Nice bug. @max-hoffman has been grinding on CTE issues so he'll grab it once he frees up a bit.

@timsehn timsehn added the integrations Issues with tools connecting to/querying Dolt label Sep 21, 2022
@max-hoffman
Copy link
Contributor

@snipesjr thank you for the report! I would appreciate any additional debugging context. For example, if the queries are returning incorrect results, sharing how the results are different than expected, or how the EXPLAIN WITH cte ... output plans are different with your connector would make it easier for us to repro on our side.

@timsehn
Copy link
Sponsor Contributor

timsehn commented Sep 21, 2022

Note, Max ran the queries in withdolt sqland they all seem to work/return the same results:

image

So, we also suspect some sort of client issue. We need more information to keep debugging.

@snipesjr
Copy link
Author

Yes, for sure. I will send the output that I am getting from within pycharm tomorrow (new laptop). Dolt’s cli client returned the correct results, so it might be an issue elsewhere. Nevertheless, wanted to report it since others connecting using jetbrains products might bump into it.

@timsehn
Copy link
Sponsor Contributor

timsehn commented Sep 22, 2022

We love the report! Just want to get it fixed for you :-)

@max-hoffman
Copy link
Contributor

@snipesjr Just to clarify, query plans or the modified query strings might be easier to share than full result sets. If you set --trace debug, the server should log query strings, which might be different than what you thought you ran (ex: they added a filter or limit clause). You can get query plans by running EXPLAIN SELECT .... Either of these should let us reproduce the different result sets you see.

@snipesjr
Copy link
Author

Sharing first 10 records from CTE example (will share the rest):

taxonomy_code provider_count
UNKNOWN 18
207R00000X 17
207Q00000X 17
1223G0001X 15
225100000X 14
1041C0700X 13
367500000X 12
111N00000X 12
183500000X 11
101YP2500X 10

@snipesjr
Copy link
Author

Explain CTE (PyCharm):

plan
Limit(10)
└─ Project
├─ columns: [p.taxonomy_code, COUNT(*) as provider_count]
└─ Sort(COUNT(*) DESC)
└─ GroupBy
├─ SelectedExprs(p.taxonomy_code, COUNT(*))
├─ Grouping(p.taxonomy_code)
└─ SubqueryAlias(p)
└─ Project
├─ columns: [pro.npi, ifnull(ext.taxonomy_code, 'UNKNOWN') as taxonomy_code]
└─ LeftIndexedJoin((pro.npi = ext.npi) AND (ext.primary_taxonomy_switch IN ('Y', 'X')))
├─ Exchange
│ └─ TableAlias(pro)
│ └─ Table(npi_provider)
│ └─ columns: [npi]
└─ TableAlias(ext)
└─ IndexedTableAccess(npi_provider_extra)
├─ index: [npi_provider_extra.npi,npi_provider_extra.idx]
└─ columns: [npi taxonomy_code primary_taxonomy_switch]

@snipesjr
Copy link
Author

Also noticed that the CTE returns different result each time I run it. Here's the log I get with mysql-connector-java:

$ dolt sql-server -u jrs -p password --config dolt.cfg 
Starting server with Config HP="localhost:3306"|T="28800000"|R="false"|L="debug"
2022-09-21T20:26:56-04:00 INFO [conn 1] NewConnection {DisableClientMultiStatements=false}
2022-09-21T20:27:02-04:00 INFO [conn 2] NewConnection {DisableClientMultiStatements=false}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* mysql-connector-java-8.0.25 (Revision: 08be9e9b4cba6aa115f9b27b215887af40b159e0) */SELECT @@session.auto_increment_increment AS auto_increment_increment, @@character_set_client AS character_set_client, @@character_set_connection AS character_set_connection, @@character_set_results AS character_set_results, @@character_set_server AS character_set_server, @@collation_server AS collation_server, @@collation_connection AS collation_connection, @@init_connect AS init_connect, @@interactive_timeout AS interactive_timeout, @@license AS license, @@lower_case_table_names AS lower_case_table_names, @@max_allowed_packet AS max_allowed_packet, @@net_write_timeout AS net_write_timeout, @@performance_schema AS performance_schema, @@query_cache_size AS query_cache_size, @@query_cache_type AS query_cache_type, @@sql_mode AS sql_mode, @@system_time_zone AS system_time_zone, @@time_zone AS time_zone, @@tx_isolation AS transaction_isolation, @@wait_timeout AS wait_timeout}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* mysql-connector-java-8.0.25 (Revision: 08be9e9b4cba6aa115f9b27b215887af40b159e0) */SELECT @@session.auto_increment_increment AS auto_increment_increment, @@character_set_client AS character_set_client, @@character_set_connection AS character_set_connection, @@character_set_results AS character_set_results, @@character_set_server AS character_set_server, @@collation_server AS collation_server, @@collation_connection AS collation_connection, @@init_connect AS init_connect, @@interactive_timeout AS interactive_timeout, @@license AS license, @@lower_case_table_names AS lower_case_table_names, @@max_allowed_packet AS max_allowed_packet, @@net_write_timeout AS net_write_timeout, @@performance_schema AS performance_schema, @@query_cache_size AS query_cache_size, @@query_cache_type AS query_cache_type, @@sql_mode AS sql_mode, @@system_time_zone AS system_time_zone, @@time_zone AS time_zone, @@tx_isolation AS transaction_isolation, @@wait_timeout AS wait_timeout}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET NAMES utf8mb4}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET NAMES utf8mb4}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET character_set_results = utf8}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET character_set_results = utf8}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET autocommit=1}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SET autocommit=1}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=select version(), @@version_comment, database()}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=select version(), @@version_comment, database()}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* ApplicationName=PyCharm 2022.2.2 */ set session transaction read write}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* ApplicationName=PyCharm 2022.2.2 */ set session transaction read write}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SELECT @@session.tx_read_only}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=SELECT @@session.tx_read_only}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=, query=/* ApplicationName=PyCharm 2022.2.2 */ use npi_registry}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 2 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ use npi_registry}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ select database()}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ select database()}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ SET net_write_timeout=600}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 0 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ SET net_write_timeout=600}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ SET SQL_SELECT_LIMIT=501}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 1 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ SET SQL_SELECT_LIMIT=501}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ WITH providers AS ( SELECT pro.npi , IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code FROM npi_provider pro LEFT JOIN npi_provider_extra ext ON (pro.npi = ext.npi AND ext.primary_taxonomy_switch IN ('Y', 'X')) ) SELECT p.taxonomy_code , COUNT(*) AS provider_count FROM providers AS p GROUP BY p.taxonomy_code ORDER BY COUNT(*) DESC LIMIT 10}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 284 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=/* ApplicationName=PyCharm 2022.2.2 */ WITH providers AS ( SELECT pro.npi , IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code FROM npi_provider pro LEFT JOIN npi_provider_extra ext ON (pro.npi = ext.npi AND ext.primary_taxonomy_switch IN ('Y', 'X')) ) SELECT p.taxonomy_code , COUNT(*) AS provider_count FROM providers AS p GROUP BY p.taxonomy_code ORDER BY COUNT(*) DESC LIMIT 10}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Starting query {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}
2022-09-21T20:27:02-04:00 DEBUG [conn 2] Query finished in 2 ms {connectTime=2022-09-21T20:27:02-04:00, connectionDb=npi_registry, query=SHOW WARNINGS}

@max-hoffman
Copy link
Contributor

Thank you for the additional details @snipesjr ! Made it much easier to figure out what's going on here.

I think the temporary workaround is to use ORDER BY provider_count:

> WITH providers AS (
    SELECT pro.npi
         , IFNULL(ext.taxonomy_code, 'UNKNOWN') AS taxonomy_code
    FROM npi_provider pro
        LEFT JOIN npi_provider_extra ext
            ON (pro.npi = ext.npi
            AND ext.primary_taxonomy_switch IN ('Y', 'X'))
)

SELECT p.taxonomy_code
     , COUNT(*) AS provider_count
    FROM providers AS p
    GROUP BY p.taxonomy_code
    ORDER BY provider_count DESC
    LIMIT 10;
+---------------+----------------+
| taxonomy_code | provider_count |
+---------------+----------------+
| UNKNOWN       | 1914495        |
| 183500000X    | 197214         |
| 1041C0700X    | 196532         |
| 225100000X    | 185787         |
| 101YM0800X    | 185663         |
| 390200000X    | 170594         |
| 106S00000X    | 169258         |
| 207Q00000X    | 146182         |
| 363LF0000X    | 139425         |
| 207R00000X    | 136415         |
+---------------+----------------+
10 rows in set (370.44 sec)

With the COUNT(*) query, we appear to push the sort too far. So instead of the results being sorted on provider_count, the results are being sorted randomly because it's counting the whole groupby result set:

-- old
 Limit(10)                                                                                                    
  └─ Project                                                                                                  
      ├─ columns: [p.taxonomy_code, COUNT(*) as provider_count]                                               
      └─ Sort(COUNT(*) DESC)                                                                                  
          └─ GroupBy

-- new
 Limit(10)                                                                                                    
  └─ TopN(Limit: [10]; provider_count DESC)                                                                   
      └─ Project                                                                                              
          ├─ columns: [p.taxonomy_code, COUNT(*) as provider_count]                                           
          └─ GroupBy

This does not explain why the two queries are equivalent without the Limit, so I need to do more testing.
I am working on a smaller repro to dig more and confirm that MySQL infers that the two COUNT(*) are the same expression.

@snipesjr
Copy link
Author

Thanks for the update!

Unfortunately, swapping to provider_count has the same result (random sort). The query plan does change though:

plan
Limit(10)
└─ TopN(Limit: [10]; provider_count DESC)
└─ Project
├─ columns: [p.taxonomy_code, COUNT(*) as provider_count]
└─ GroupBy
├─ SelectedExprs(p.taxonomy_code, COUNT(*))
├─ Grouping(p.taxonomy_code)
└─ SubqueryAlias(p)
└─ Project
├─ columns: [pro.npi, ifnull(ext.taxonomy_code, 'UNKNOWN') as taxonomy_code]
└─ LeftIndexedJoin((pro.npi = ext.npi) AND (ext.primary_taxonomy_switch IN ('Y', 'X')))
├─ Exchange
│ └─ TableAlias(pro)
│ └─ Table(npi_provider)
│ └─ columns: [npi]
└─ TableAlias(ext)
└─ IndexedTableAccess(npi_provider_extra)
├─ index: [npi_provider_extra.npi,npi_provider_extra.idx]
└─ columns: [npi taxonomy_code primary_taxonomy_switch]

Interestingly, I get the random sort without the limit as well; however I see that pycharm is doing a SET SQL SELECT_LIMIT=DEFAULT before executing my query.

@max-hoffman
Copy link
Contributor

Looks like we are selecting 501 random rows from the left join. We can manually disable parallelism (SINGLE_THREAD_FEATURE_FLAG dolt sql), which will give deterministic results, but I presume you want the full result set from the join before applying any limits.

Running a few comparison queries against MySQL, I think the correct behavior is to 1) only apply the default limit to the top-level query, and 2) use a query's LIMIT clause instead of the default when provided.

Alright two bugs:

  1. We need to infer the COUNT(*) sort alias and/or avoid pushing the sort below its enforcement scope

  2. sql_select_limit should only do one thing, apply a limit to the top-level scope if it does not already have one. We should not apply limits to arbitrary scopes.

In the meantime, disabling sql_select_limit should be a workaround for (2).

Thank you for the follow up information! Let us know if there are any other bugs we should clean up here.

@max-hoffman
Copy link
Contributor

@snipesjr I think this one should fix the sql_select_limit issue, available in the next release:
dolthub/go-mysql-server#1276

Let us know if you run into any more trouble.

@max-hoffman
Copy link
Contributor

The sql_select_limit fix should be included in the most recent release here https://github.com/dolthub/dolt/releases/tag/v0.41.6.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working integrations Issues with tools connecting to/querying Dolt sql Issue with SQL
Projects
None yet
Development

No branches or pull requests

3 participants