Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect parameter nullability from query and suggest correct parameter functions #7

Closed
Zaid-Ajaj opened this issue Aug 19, 2020 · 24 comments

Comments

@Zaid-Ajaj
Copy link
Owner

Right now, when we derive the required parameters for a query, we only know what the type is of the parameter. However, we don't know whether a parameter is allowed to be null or not. Currently when the analyzer detects incorrect usage of SQL parameter, it suggests Sql.{type}, Sql.{type}orNone and Sql.dbnull. For non-nullable parameter types, it should only suggest Sql.{type} to use as parameter.

According to Npgsql#3115 Unable to detect parameter nullability when using DeriveParameters, this is a limitation of PostgreSQL itself when it describes the types of the available parameters.

Our best bet is build a SQL parser for the query and detect parameter nullability ourselves by using the definitions from the schema of the database. As well as detecting untyped parameters and inferring their types for example WHERE @p > 10 should infer that @p is int or bigint

@baronfel
Copy link
Contributor

baronfel commented Aug 19, 2020

The 'best' way to do this would be to make bindings against the native c postgresql API and just call the parser it provides: https://wiki.postgresql.org/wiki/Query_Parsing

This is what nodejs tools like https://github.com/zhm/pg-query-parser do, and it wouldn't be incredibly hard to do.

Then your task just changes to a tree-traversal.

To do this, you'd need:

  • A way to compile the c API for the platforms you want to target
  • A single c API binding to call the parse member

This could be put into a separate nuget package (nuget supports native dependencies), or bundled in this this library.

@Zaid-Ajaj
Copy link
Owner Author

@baronfel Last time we checked, the native parser was not cross platform. It wasn't working on windows and the fork that made it work isn't up to date anymore. I would love to use it is possible now

@YohDeadfall
Copy link

YohDeadfall commented Aug 19, 2020

To clarify, all types are nullable in PostgreSQL without any exclusion, so it correctly detects them. What you're trying to achieve is to get information about applied constraints. From the documentation:

It is very difficult to avoid such problems, because of SQL's general assumption that a null value is a valid value of every data type.

To support the feature at least partially you can determine nullability for domain types since you can take a value of the PostgresType property from NpsqlParameter and try to cast it to PostgresDomainType which has the NotNull property.

@Zaid-Ajaj
Copy link
Owner Author

@YohDeadfall There are many cases where we know exactly when an input parameter should not be null. The easiest example is when inserting values into tables for non-nullable columns. There are many other cases when comparing values of tables against values of parameters: where some_non_nullable_column = @input_parameter then we know that input_parameter should not be null.

Detecting a bunch of those case would make all the difference in this analyzer even though from the database perspective, null is allowed, in many cases the database clients don't want to provide nulls by mistake.

@Zaid-Ajaj
Copy link
Owner Author

libpg_query#33 Windows Support has been stale for quite some time and there doesn't seem to be interest making it work out of the box. My knowledge in C is lacking to know what the library needs for Windows support

@YohDeadfall
Copy link

Yeah, I just pointed that nullability isn't a type property (except for domains), but a constraint on a field. So what you need is to detect relationships between supplied parameters and their usages.

But there is one thing which makes your life easier. You can't compare any value with null using operators, so in that cases you can be sure that a value isn't nullable.

For example, the following query will return you NULL:

-- @p is new NpgsqlParameter<int?>("p", null)
SELECT @p = NULL::integer;

But IS NULL will produce true.

Therefore, if a parameter doesn't participate in a function, not a part of SELECT/VALUES/SET and not checked for being null, it's not nullable.

@Zaid-Ajaj
Copy link
Owner Author

Thanks @YohDeadfall I think that does make life easier. I think this project can start with parsing simple queries and try detect nullability when possible then falling back to "I don't know whether this is nullable or not" for more complicated queries. Let's see how it goes first because detecting the usages will be hard part 😄

@YohDeadfall
Copy link

In case if you would like to implement a full parser take a look at https://github.com/cockroachdb/cockroach written in Go (see contents of /pkg/sql/parser). It's a PostgreSQL like database engine, so should be mostly compatible.

It's okay to ping me and ask for help.

@Zaid-Ajaj
Copy link
Owner Author

I will check it out, thanks a lot @YohDeadfall ❤️

@costa100
Copy link

costa100 commented Aug 19, 2020

@YohDeadfall:

For example, the following query will return you NULL:

-- @p is new NpgsqlParameter<int?>("p", null)
SELECT @p = NULL::integer;

But IS NULL will produce true.

This is the SQL behavior. If you compare a value of NULL (stored in a field or a variable or simply NULL) with anything else (where that compare operation is not IS NULL or IS NOT NULL) that expression is going to return null. Also 100 + NULL returns NULL and so on, calling standard functions with NULL parameters might or might not return NULL values. If I call sin(NULL) it returns NULL but if we have a custom function, it might not return NULL.

What do you do if you have expressions that combine multiple parameters or case expressions?

Such a parser would have to actually compute types/nullability of an expression in the same way PostgreSql does it.

I would say parsing is the easy part. The harder part is establishing the rules that compute the nullability.

Imo, we should define the rules for the computation first and go from there.

re: Parsers, sorry, it might be an unpopular opinion in this context, but I am a fan of parser generators such as antlr. I haven't seen anything (i.e. articles) that compare parser generators against FP parser combinators. I know that the t-sql parser produced by Microsoft uses antlr. Quite a few years back, I was curious of how postgresql does the parsing - I think it was version 7, and they used a bison/yacc flavor grammar.

I would definitely not write SQL parsers by hand.

@Zaid-Ajaj
Copy link
Owner Author

I would definitely not write SQL parsers by hand.

@costa100 I don't want to write the parser by hand if there is something usable in .NET targeting netstandard2.0. ANTLR grammar I can use that generates C#/F# code? that is more than welcome.

I would say parsing is the easy part. The harder part is establishing the rules that compute the nullability. Such a parser would have to actually compute types/nullability of an expression in the same way PostgreSql does it.

Yeah that is the plan, but in the beginning I don't plan on supporting all possibilities of null detection. Starting with commonly used expressions and queries at first so I don't really need the full language parser but if it is available, I would love to use it 😄

@costa100
Copy link

costa100 commented Aug 19, 2020

My question is this. Look at the Microsoft transact-sql parser: https://docs.microsoft.com/en-us/dotnet/api/microsoft.sqlserver.transactsql.scriptdom.tsqlparser?view=sql-dacfx-150. Is this where you want to go ultimately for your postgresql parser, or perhaps a subset of the sql statements?

Anyway, I found this thread:
antlr/grammars-v4#1501

There are links to antlr grammars.

This seems to be the most up-to-date: https://github.com/pgcodekeeper/pgcodekeeper/blob/master/apgdiff/antlr-src/SQLParser.g4.

Antlr can generate C# code, you can configure the grammar to generate C# code.

And interestingly enough, this is the link to the postgresql yacc grammar: https://github.com/postgres/postgres/blob/master/src/backend/parser/gram.y.

Side note, I love the code in the postgresql yacc file - it is very readable.

@Zaid-Ajaj
Copy link
Owner Author

Thanks a lot for the links @costa100 💯 I will look into the code-gen story for C#, I haven't worked with it before but it looks doable 😄

@Zaid-Ajaj
Copy link
Owner Author

Alright, so to come up with a proof of concept, I started using FParsec and oh boy it is good! This short parser implementation can already parse SELECT queries with nested queries in them (see tests):

testSelect """
    SELECT username, email
    FROM users
    WHERE user_id IN (SELECT id FROM user_ids WHERE id IS NOT NULL)
""" {
    SelectExpr.Default with
        Columns = [Expr.Ident "username"; Expr.Ident "email"]
        From = Some (Expr.Ident "users")
        Where = Some (Expr.In(Expr.Ident "user_id", Expr.Query(TopLevelExpr.Select {
            SelectExpr.Default with
                Columns = [Expr.Ident "id"]
                From = Some (Expr.Ident "user_ids")
                Where = Some(Expr.Not(Expr.Equals(Expr.Null, Expr.Ident "id")))
        })))
}

I think building a simplified AST in F# makes the type inference and pattern matching easier later on. The parser doesn't need to actually check and validate the query, we let the database do that so here we only use for parameters and detecting which ones should be null or not.

I think things will start getting real complicated when I have to take table joins into account and how that affects nullability of parameters.

If you guys want to help out with the implementation, feel free to join in and hack away 😄

@YohDeadfall
Copy link

YohDeadfall commented Aug 20, 2020

What do you do if you have expressions that combine multiple parameters or case expressions?

PostgreSQL is a very flexible and extensible database engine which built around itself, so all required information about it's own functions and operators can be obtained from system tables. There are only three types of objects: types, attributes and functions.

For all types except some domain types NULL is a perfectly legal value, but some domain types may restrict the value to be non-nullable. Tables have their own types too (you can specify it explicitly, or the server will create it implicitly). That types represent records which can be stored in a table.

To understand does some type supports NULL as it's value or not, it's sufficient to check pg_type and see typnotnull. If there is composite type (applies to tables too) pg_attribute.attnotnull should be checked.

Expressions consist from operators and functions. An operator simply is a name and references to functions which actually gets executed. As @costa said the result of a function depends on its inputs. Some functions allow NULL, some not. This feature is controlled by CALLED ON NULL INPUT, RETURNS NULL ON NULL INPUT and STRICT of the CREATE FUNCTION statement, and information about it saved in pg_proc.proisstrict.

Having this information, the analyzer can request the server about interesting functions and types or all data at once to cache it. To understand what function, operator, type or table the user means in the query current_schemas should be invoked to receive a sorted list of schemas for that user. So if an object name isn't qualified, go through schemas and try to find that object.

I think things will start getting real complicated when I have to take table joins into account and how that affects nullability of parameters.

If there is no CROSS or INNER JOIN, then values from one part at least are nullable.

@YohDeadfall
Copy link

YohDeadfall commented Aug 20, 2020

If you guys want to help out with the implementation, feel free to join in and hack away

Don't know F#, but if there are simple issues where I can learn the language and help, ping me.

@roji
Copy link

roji commented Aug 20, 2020

There are many other cases when comparing values of tables against values of parameters: where some_non_nullable_column = @input_parameter then we know that input_parameter should not be null.

@Zaid-Ajaj that sounds odd... I may not be understanding what you're trying to do here, but it seems entirely reasonable to try to compare a nullable value to a non-nullable value. As has been noted above, when the nullable value is actually null, the result of the expression is null (which evaluates to false if it's the WHERE predicate) - but that's fine, isn't it? That's very different from trying to insert a null into a non-nullable column, in which case an error is triggered...

Stepping back a bit, I'll just say that in Entity Framework Core we deal with nullability calculation a lot, and it is an extremely difficult subject to get right (and to do it efficiently). I don't really know what's being attempted here - maybe a bit more context would be helpful.

@YohDeadfall
Copy link

It's just an analyzer which helps by hints and warnings like Rider loved by you (: It may not cover edge cases.

@Zaid-Ajaj
Copy link
Owner Author

@roji First we have to understand that this project is simply providing embedded static SQL analysis when using the library Npgsql.FSharp, so it validates the usage from the F# code against the structure of the query, the database schema definitions and the used parameters.

The analyzer is already able to properly validate F# code that incorrectly reads a values from a result set (see gif on README). However, when it comes to providing parameters, the analyzer doesn't know which a parameter should be nullable or not to properly correct the corresponding F# code usage.

Try it out on your visual studio by installing the extension from the releases page and trying out Npgsql.FSharp 😉

Scenario nr. 1: checking equality with a non-nullable column

CREATE TABLE users (user_id serial primary key, username text not null)

And the query

SELECT * FROM users WHERE user_id = @user_id

What should the type of @user_id be? The only possibility is a (signed) integer. However, this analyzer currently doesn't know that: it only knows that is can an integer or null. The analyzer in this case will suggest to use any of the functions

Sql.int 42
Sql.intOrNone <Some int | None>
Sql.dbnull

I want to extend the analyzer to eliminate the incorrect suggestions of Sql.intOrNone and Sql.dbnull and instead suggest the only possible value of a non-nullable integer with Sql.int

Scenario nr. 2: inserting values into non-nullable columns

Take the query

INSERT INTO users (username) VALUES (@input_username)` 

Again here, the parameter @input_username must be a non-null string to be provided using the function Sql.string, however, the analyzer currently suggests these functions

Sql.string
Sql.stringOrNone
Sql.dbnull

For this case, I want to make eliminate the suggestions Sql.stringOrNone and Sql.dbnull.

I don't plan on supporting every single scenario and every single edge case. I just want to obtain more information about what values are allowed for parameters and suggesting the users of Npgsql.FSharp with more correct code usage hints.

@roji
Copy link

roji commented Aug 20, 2020

@Zaid-Ajaj thanks for the detailed description!

So as I wrote above, the second scenario makes a lot more sense to me than the first - since a null will definitely cause an error. For the first case, if your user happens to have an intOrNone (which I assume is similar to a nullable int? am a F# noob...), isn't it perfectly reasonable to pass that as a parameter in that query? Passing a dbnull indeed doesn't make much sense, even if it still doesn't seem incorrect in the same sense as the second scenario (a warning does seem right in this case).

@Zaid-Ajaj
Copy link
Owner Author

isn't it perfectly reasonable to pass that as a parameter in that query?

@roji I admit that in the 1st scenario, I am making some assumptions, namely that someone wouldn't want to test a non-nullable column against a nullable value. In my opinion I find it more reasonable than suggesting to use intOrNone or dbnull even though it is syntactically correct by Postgres but again, I am just getting started here so I will think about it as I go. For the 2nd scenario, it definitely makes sense to obtain more nullability information on parameters

Yeah Sql.intOrNone is basically a function to provide a nullable int parameter

a warning does seem right in this case

All hints suggestions triggered by this analyzers are warnings

@roji
Copy link

roji commented Aug 20, 2020

OK, really interesting stuff - I'd be very interested in seeing where it goes. Note that Npgsql itself contains a basic lexical parser for the PG SQL dialect, since it needs to convert from named @p parameter placeholders to the PostgreSQL native positional ones: $1 (and also to split on statement boundaries). The parser only needs to understand when something is inside a string, so it's quite basic - but it may be a bit useful to you.

@Zaid-Ajaj
Copy link
Owner Author

@roji @YohDeadfall @costa100 in case you are interested, today I went live on stream and actually implemented an initial prototype of how this would look like. I managed to make the analyzer give correct suggestions to INSERT queries when using non-nullable columns. You can watch the stream here on Twitch

@Zaid-Ajaj
Copy link
Owner Author

I think this issue is resolved, I already published a new version of the analyzer which correctly detects non-nullable parameters in WHERE clauses, missing required INSERT columns and even better error messages 😄

parameter-nullability

better-insert

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants