Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQL Server: Translate statistics aggregate functions #28104

Closed
roji opened this issue May 26, 2022 · 3 comments · Fixed by #28110
Closed

SQL Server: Translate statistics aggregate functions #28104

roji opened this issue May 26, 2022 · 3 comments · Fixed by #28110
Assignees
Labels
area-query area-sqlserver closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. customer-reported ef6-parity type-enhancement
Milestone

Comments

@roji
Copy link
Member

roji commented May 26, 2022

With #22957 done, we can now add translations for additional aggregate operators.

Name SQL Server PostgreSQL MySQL Oracle
Standard deviation (sample) STDEV stddev/stddev_samp STDDEV_SAMP STDDEV/STDDEV_SAMP1
Standard deviation (population) STDEVP stdddev_pop STDDEV/STDDEV_POP STDDEV_POP
Variance (sample) VAR variance/var_samp VAR_SAMP VARIANCE/VAR_SAMP
Variance (population) VARP var_pop VARIANCE/VAR_POP VAR_POP
  • SQLite is the only database that doesn't have built-in standard deviation and variance.
  • Unfortunately, these functions return different types based on their argument type across databases:
    • SQL Server always returns float
    • PG returns double precision for real or double precision, otherwise numeric.
    • MariaDB always returns double, with different precision for float/double than for int/decimal (though it doesn't seem to matter)
  • As a result, we probably shouldn't have .NET functions in relational, but add them to each provider separately (signatures are different).
  • When there are no input values, these functions return NULL on all databases.
  • Standard deviation was already discussed in #11850, but that issue now tracks aggregate UDFs.
  • Old EF6 supported translating to the SQL Server aggregate statistical functions (docs), so this is ef6-parity.
Detailed investigation across databases

SQL Server

DROP TABLE IF EXISTS data;
CREATE TABLE data
(
    int INT,
    real REAL,
    float FLOAT,
    decimal DECIMAL
);
INSERT INTO data (int, real, float, decimal) VALUES (1, 1, 1, 1), (10, 10, 10, 10), (NULL, NULL, NULL, NULL);

-- Output types:
SELECT SQL_VARIANT_PROPERTY(STDEV(int), 'BaseType') FROM data; -- float
SELECT SQL_VARIANT_PROPERTY(STDEV(real), 'BaseType') FROM data; -- float
SELECT SQL_VARIANT_PROPERTY(STDEV(float), 'BaseType') FROM data; -- float
SELECT SQL_VARIANT_PROPERTY(STDEV(decimal), 'BaseType') FROM data; -- float

-- No input values:
SELECT STDEV(decimal) FROM data WHERE decimal < 0; -- NULL

PostgreSQL

DROP TABLE IF EXISTS data;
CREATE TABLE data
(
    int INT,
    real REAL,
    double DOUBLE PRECISION,
    decimal DECIMAL
);
INSERT INTO data (int, real, double, decimal) VALUES (1, 1, 1, 1), (10, 10, 10, 10), (NULL, NULL, NULL, NULL);

-- Output types:
SELECT pg_typeof(stddev(int)) FROM data; -- numeric
SELECT pg_typeof(stddev(real)) FROM data; -- double
SELECT pg_typeof(stddev(double)) FROM data; -- double
SELECT pg_typeof(stddev(decimal)) FROM data; -- numeric

-- No input values:
SELECT stddev(decimal) FROM data WHERE decimal < 0; -- NULL

MariaDB

DROP TABLE IF EXISTS data;
CREATE TABLE data
(
    intc INT,
    floatc FLOAT,
    doublec DOUBLE,
    decimalc DECIMAL
);
INSERT INTO data (intc, floatc, doublec, decimalc) VALUES (1, 1, 1, 1), (10, 10, 10, 10), (NULL, NULL, NULL, NULL);

-- Output types:
DROP TEMPORARY TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo SELECT STDDEV_SAMP(intc) FROM data;
DESC foo; -- double(26,4)

DROP TEMPORARY TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo SELECT STDDEV_SAMP(floatc) FROM data;
DESC foo; -- double

DROP TEMPORARY TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo SELECT STDDEV_SAMP(doublec) FROM data;
DESC foo; -- double

DROP TEMPORARY TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo SELECT STDDEV_SAMP(decimalc) FROM data;
DESC foo; -- double(26,4)

-- No input values:
SELECT STDDEV_SAMP(decimalc) FROM data WHERE decimalc < 0; -- NULL
@roji roji self-assigned this May 26, 2022
@roji roji removed the area-sqlite label May 26, 2022
@roji roji changed the title Translate statistics aggregate functions SQL ServerÖ Translate statistics aggregate functions May 26, 2022
@roji roji changed the title SQL ServerÖ Translate statistics aggregate functions SQL Server: Translate statistics aggregate functions May 26, 2022
roji added a commit to roji/efcore that referenced this issue May 26, 2022
* string.Join (SQL Server and SQLite)
* string.Concat (SQL Server and SQLite)
* Standard deviation and variance (SQL Server)

Closes dotnet#2981
Closes dotnet#28104
@roji
Copy link
Member Author

roji commented May 27, 2022

@lauxjpn you may be interested in this for MySQL, though I'd hold off for a little while before implementing, since the infrastructure is still being locked down.

PostgreSQL issue: npgsql/efcore.pg#532

@roji
Copy link
Member Author

roji commented May 31, 2022

The design problem outlined by @smitpatel is what to do with top-level versions of these attributes. For example, the PR as it currently is only allows expressing standard deviation as follows:

_ = context.Products
.GroupBy(od => od.ProductID)
.Select(g => new
{
    ProductID = g.Key,
    StandardDeviation = EF.Functions.StandardDeviationSample(g.Select(od => od.UnitPrice)),
});

... but does not allow top-level usage:

_ = context.Products.Select(p => p.UnitPrice).StandardDeviation();
// or even possibly:
_ = context.Products.StandardDeviation(p => p.UnitPrice);

We have 3 options here:

The top-level variant(s) above would be extension methods over IQueryable<T>. The main problem with that is that the top-level syntax (extension) and the GroupBy syntax (EF.Functions) is different, which is weird and non-discoverable. So I think we have 3 options:

  1. As above, have different syntax: IQueryable extension for top-level, EF.Functions for GroupBy; this is weird and non-discoverable.
  2. Make the GroupBy variant also an extension over IEnumerable<T>. We try to avoid doing this, since it would work on any IEnumerable but would throw (unless we provide a client-side statistics implementation :trollface:).
  3. Make the top-level variant also EF.Function, making the query:
_ = EF.Functions.StandardDeviationSample(context.Products.Select(p => p.UnitPrice));

Note that this question is general for custom aggregates, and not specific to these statistics ones. We should look at what EF6 did here to maybe get inspiration.

@roji
Copy link
Member Author

roji commented Jun 5, 2022

I thought of an argument in favor of option 1, i.e. have different syntaxes for Enumerable and IQueryable (top-level) versions of the aggregate function.

For cases where we translate a built-in .NET method, e.g. string.Join, the top-level method will in any case have to be different from the Enumerable version. We could have an IQueryable extension method:

ctx.Blogs.Where(...).Select(b => b.Name).StringJoin(", ");
// or better:
ctx.Blogs.Where(...).StringJoin(b => b.Name, ", ");

... or if we don't want an extension over IQueryable, an EF.Function:

EF.Functions.StringJoin(", ", ctx.Blogs.Where(...));

... but in any case it's not going to be the same as the enumerable string.Join. We'll have the same case with the spatial methods (#13278).

So if we have this problem in any case when it comes to built-in .NET methods we translate, it may be OK to just always have it, and have a nice IQueryable extension for the top-level, and an EF.Functions for the enumerable version.

roji added a commit to roji/efcore that referenced this issue Jun 18, 2022
* string.Join (SQL Server and SQLite)
* string.Concat (SQL Server and SQLite)
* Standard deviation and variance (SQL Server)

Closes dotnet#2981
Closes dotnet#28104
roji added a commit to roji/efcore that referenced this issue Jun 18, 2022
* string.Join (SQL Server and SQLite)
* string.Concat (SQL Server and SQLite)
* Standard deviation and variance (SQL Server)

Closes dotnet#2981
Closes dotnet#28104
@ajcvickers ajcvickers added this to the 7.0.0 milestone Jun 21, 2022
@ajcvickers ajcvickers added the closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. label Jun 21, 2022
roji added a commit to roji/efcore that referenced this issue Jul 9, 2022
* string.Join (SQL Server and SQLite)
* string.Concat (SQL Server and SQLite)
* Standard deviation and variance (SQL Server)

Closes dotnet#2981
Closes dotnet#28104
@ghost ghost closed this as completed in #28110 Jul 9, 2022
ghost pushed a commit that referenced this issue Jul 9, 2022
* string.Join (SQL Server and SQLite)
* string.Concat (SQL Server and SQLite)
* Standard deviation and variance (SQL Server)

Closes #2981
Closes #28104
@ajcvickers ajcvickers modified the milestones: 7.0.0, 7.0.0-preview7 Jul 10, 2022
@ajcvickers ajcvickers modified the milestones: 7.0.0-preview7, 7.0.0 Nov 5, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-query area-sqlserver closed-fixed The issue has been fixed and is/will be included in the release indicated by the issue milestone. customer-reported ef6-parity type-enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants