Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Translate Contains to IN with subquery instead of EXISTS where relevant #30955

Closed
roji opened this issue May 22, 2023 · 1 comment · Fixed by #30976
Closed

Translate Contains to IN with subquery instead of EXISTS where relevant #30955

roji opened this issue May 22, 2023 · 1 comment · Fixed by #30976
Assignees
Labels
area-perf area-query closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Milestone

Comments

@roji
Copy link
Member

roji commented May 22, 2023

Although our InExpression supports having a subquery (and various visitors handle that), we never actually generate this - instead we always generate an EXISTS subquery with a predicate (based on SQL Server baselines). Translating to InExpression is likely better - it doesn't require a correlated subquery and expresses the intent more succintly, without needing a predicate. For example, instead of this:

SELECT ...
FROM [PrimitiveCollectionsEntity] AS [p]
WHERE EXISTS (
    SELECT 1
    FROM (
        SELECT TOP(2) CAST([i].[value] AS int) AS [c], CAST([i].[key] AS int) AS [c0]
        FROM OPENJSON([p].[Ints]) AS [i]
        ORDER BY CAST([i].[key] AS int)
    ) AS [t]
    WHERE [t].[c] = 11)

We can translate to this:

SELECT ...
FROM [PrimitiveCollectionsEntity] AS [p]
WHERE 11 IN (
    SELECT TOP(2) CAST([i].[value] AS int)
    FROM OPENJSON([p].[Ints]) AS [i]
    ORDER BY CAST([i].[key] AS int)
)

However, note that our null semantics around InExpression with subquery currently seems broken (fortunately it's dead code): IN returns null when the item is null (or when the values contains null and a match isn't found). We'd need to make that logic actually work.

@roji roji self-assigned this May 22, 2023
@roji roji changed the title Actually use InExpression with subquery Translate Contains to IN with subquery instead of EXISTS where relevant May 26, 2023
roji added a commit to roji/efcore that referenced this issue May 26, 2023
roji added a commit to roji/efcore that referenced this issue May 26, 2023
@ajcvickers ajcvickers added this to the Backlog milestone May 26, 2023
@roji roji modified the milestones: Backlog, 8.0.0 May 29, 2023
@roji
Copy link
Member Author

roji commented May 29, 2023

Using IN rather than EXISTS provides some very serious performance advantages, on all databases except for SQL Server; this is probably due to the removal of the correlated subquery which EXISTS requires.

The benchmark below uses the data and queries from @FHTSean, comparing the following two SQLs (see #30938):

SELECT *
FROM `CustomerMainTable` AS `t`
WHERE `t`.`CustomerMainTableId` IN (
    SELECT MAX(`t0`.`CustomerMainTableId`)
    FROM `CustomerMainTable` AS `t0`
    GROUP BY `t0`.`CustomerMainTableCustomerId`
)

... with:

SELECT *
FROM `CustomerMainTable` AS `t`
WHERE EXISTS (
    SELECT 1
    FROM `CustomerMainTable` AS `t0`
    GROUP BY `t0`.`CustomerMainTableCustomerId`
    HAVING MAX(`t0`.`CustomerMainTableId`) = `t`.`CustomerMainTableId`)

On MariaDB:

Method Mean Error StdDev
In 7.007 ms 0.1394 ms 0.1164 ms
Exists 9,803.773 ms 172.2222 ms 143.8134 ms

(That's 7ms compared to 9 seconds)

On PostgreSQL I'm also seeing a very big performance difference, even if less dramatic:

Method Mean Error StdDev
In 3.712 ms 0.1110 ms 0.3254 ms
Exists 792.243 ms 12.4048 ms 11.6034 ms

Also on SQLite:

Method Mean Error StdDev
In 1.628 ms 0.0292 ms 0.0273 ms
Exists 4.962 ms 0.0980 ms 0.1274 ms

On SQL Server both queries have the exact same running time, I'm guessing SQL Server sees through the EXISTS and avoids the correlated subquery.

Full benchmark sources
BenchmarkRunner.Run<Benchmark>();

public class Benchmark
{
    private MySqlConnection _connection;

    [GlobalSetup]
    public async Task Setup()
    {
        _connection = new MySqlConnection("Server=localhost;Database=test;User ID=root;Password=Abcd5678;");
        await _connection.OpenAsync();
    }

    [Benchmark]
    public async Task In()
    {
        await using var command = new MySqlCommand(
"""
SELECT *
FROM `CustomerMainTable` AS `t`
WHERE `t`.`CustomerMainTableId` IN (
    SELECT MAX(`t0`.`CustomerMainTableId`)
    FROM `CustomerMainTable` AS `t0`
    GROUP BY `t0`.`CustomerMainTableCustomerId`);
""", _connection);

        await command.ExecuteNonQueryAsync();
    }

    [Benchmark]
    public async Task Exists()
    {
        await using var command = new MySqlCommand(
"""
SELECT *
FROM `CustomerMainTable` AS `t`
WHERE EXISTS (
    SELECT 1
    FROM `CustomerMainTable` AS `t0`
    GROUP BY `t0`.`CustomerMainTableCustomerId`
    HAVING MAX(`t0`.`CustomerMainTableId`) = `t`.`CustomerMainTableId`)
""", _connection);

        await command.ExecuteNonQueryAsync();
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-perf area-query closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. type-enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants