- Feature Name: SQL typing
- Status: completed
- Authors: Andrei, knz, Nathan
- Start date: 2016-01-29
- RFC PR: [#4121](https://github.com/cockroachdb/cockroach/pull/4121),
  [#6189](https://github.com/cockroachdb/cockroach/pull/6189)
- Cockroach Issue: [#4024](https://github.com/cockroachdb/cockroach/issues/4024),
  [#4026](https://github.com/cockroachdb/cockroach/issues/4026),
  [#3633](https://github.com/cockroachdb/cockroach/issues/3633),
  [#4073](https://github.com/cockroachdb/cockroach/issues/4073),
  [#4088](https://github.com/cockroachdb/cockroach/issues/4088),
  [#327](https://github.com/cockroachdb/cockroach/pull/327),
  [#1795](https://github.com/cockroachdb/cockroach/issues/1795)

# Summary

This RFC proposes to revamp the SQL semantic analysis (what happens
after the parser and before the query is compiled or executed) with a
few goals in mind:

- address some limitations of the current type-checking implementation
- improve support for fancy (and not-so-fancy) uses of SQL and typing
  of placeholders for prepared queries
- improve the quality of the code internally
- pave the way for implementing sub-selects

To reach these goals the RFC proposes to:

- implement a new type system that is able to type more code than the
  one currently implemented
- split semantic analysis into separate phases after parsing
- unify all typed SQL statements (including `SELECT`/`UPDATE`/`INSERT`) as
  expressions (`SELECT` as an expression is already a prerequisite for
  sub-selects)
- structure typing as a visitor that annotates types as attributes in AST nodes
- extend `EXPLAIN` to pretty-print the inferred types. This will be approached
  by adding a new `EXPLAIN (TYPES)` command.

As with everything in software engineering, more intelligence requires more
work, and has the potential to make software less predictable.
Among the spectrum of possible design points, this RFC settles
on a typing system we call *Summer*, which can be implemented
as a rule-based depth-first traversal of the query AST.

Alternate earlier proposals called *Rick* and *Morty* are also recorded for posterity.

# Motivation

## Overview

We need a better typing system.

Why: some things that should really work currently do not. Some
other things behave incongruously and are difficult to understand, and
this runs counter to our design ideal to "make data easy".

How: let's look at a few examples, understand what goes Really Wrong,
propose some reasonable expected behavior(s), and see how to get there.

## Problems considered

This RFC considers specifically the following issues:

- overall architecture of semantic analysis in the SQL engine
- typing expressions involving only untyped literals
- typing expressions involving only untyped literals and placeholders
- overloaded function resolution in calls with untyped literals or
  placeholders as arguments

The following issues are related to typing but fall outside of the
scope of this RFC:

- "prepare" reports type X to client, client does not *know* X (and
  is thus unable to send the proper format byte in subsequent "execute")

  This issue can be addressed by extending/completing the client
  Postgres driver.

- program/client sends a string literal in a position of another type,
  expects a coercion like in pg.

  For this issue one can argue the client is wrong; this issue may be
  addressed at a later stage if real-world use shows that demand for
  legacy compatibility here is real.

- prepare reports type "int" to client, client feeds "string" during
  execute

  Same as previous point.

## What typing is about

There are 4 different roles for typing in SQL:

1. **soundness analysis**, the most important role, is shared with other
   languages: check that the code is semantically sound -- that the
   operations given are actually computable. Typing soundness analysis
   tells you e.g. that ``3 + "foo"`` does not make sense and should be
   rejected.

2. **overload resolution**: deciding what meaning to give
   to type-overloaded constructs in the language. For example some
   operators behave differently when given ``int`` or ``float``
   arguments (+, - etc). Additionally, there are overloaded functions
   (``length`` is different for ``string`` and ``bytes``) that behave
   differently depending on the provided arguments. Both are
   features shared with other languages, where overloading exists.

3. **inferring implicit conversions**, i.e. determining where to insert
   implicit casts in contexts with disjoint types, when your flavor of
   SQL supports this (as in a few other languages, like C).

4. **typing placeholders**: inferring the type of
   placeholders (``$1``..., sometimes also noted ``?``), because the
   client needs to know this after a ``prepare`` and before an
   ``execute``.

What we see in CockroachDB at this time, as well as in some other SQL
products, is that SQL engines have issues in all 4 aspects.

There are often identifiable reasons why this is so, for example
1) lack of specification of the SQL language itself 2) lack of
interest for this issue 3) organic growth of the machinery and 4)
general developer ignorance about typing.

## Examples that go wrong (arguably)

It's rather difficult to find examples where soundness goes wrong
because people tend to care about this most. That said, it is
reasonably easy to find example SQL code that seems to make logical
sense, but which engines reject as being unsound. For example:

```sql
prepare a as select 3 + case (4) when 4 then $1 end
```

This fails in Postgres because ``$1`` is always typed as ``string`` and
you can't add string to int (this is a soundness error). What we'd
rather want is to either infer ``$1`` as ``int`` (or decimal) and let
the operation succeed, or fail with a type inference error ("can't
decide the type"). In CockroachDB this does not even compile: there is
no inference available within ``CASE``.

Next to this, there are a number of situations where existing engines
have chosen a behavior that makes the implementation of the engine
easy, but may irk / surprise the SQL user. And Surprise is Bad.

For example:

1. pessimistic typing for numeric literals.

   For example:

   ```sql
   create table t (x float);
   insert into t(x) values (1e10000 * 1e-9999);
   ```

   This fails on both Postgres and CockroachDB with a complaint that
   the numbers do not fit in either int or float, despite the fact the
   result would.

2. incorrect typing for literals.

   For example:

   ```sql
   select length(E'\\000a'::bytea || 'b'::text)
   ```

   Succeeds (wrongly!) in Postgres and reports 7 as result. This
   should have failed with either "cannot concatenate bytes and string",
   or created a byte array of 3 bytes (`\x00ab`), or a string with a
   single character (b), or a 0-sized string.

3. engine throws hands up in the air and abandons something that could
   otherwise look perfectly fine:

   ```sql
   select floor($1 + $2)
   ```

   This fails in Postgres with "can't infer the types" whereas the
   context suggests that inferring ``decimal`` would be perfectly
   fine.

4. failure to use context information to infer types where this
   information is available.

   To simplify the explanation let's construct a simple example by
   hand. Consider a library containing the following functions:

       f(int) -> int
       f(float) -> float
       g(int) -> int

   Then consider the following statement:

   ```sql
   prepare a as select g(f($1))
   ```

   This fails with ambiguous/untypable $1, whereas one could argue (as
   is implemented in other languages) that ``g`` asking for ``int`` is
   sufficient to select the 1st overload for ``f`` and thus fully
   determine the type of $1.

5. Lack of clarity about the expected behavior of the division sign.

   Consider the following:

   ```sql
   create table w (x int, y float);
   insert into w values (3/2, 3/2);
   ```

   In PostgreSQL this inserts (1, 1.0), with perhaps a surprise on the
   2nd value. In CockroachDB this fails (arguably surprisingly) on
   the 1st expression (can't insert float into int), although the
   expression seems well-formed for the receiving column type.

6. Uncertainty on the typing of placeholders due to conflicting contexts:

   ```sql
   prepare a as select (3 + $1) + ($1 + 3.5)
   ```

   PostgreSQL resolves $1 as `decimal`. CockroachDB can't infer.
   Arguably both "int" and "float" may come to mind as well.
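
To make item 1 above (pessimistic typing of numeric literals) concrete, here is a minimal Python sketch. It is not CockroachDB code, and the helper name `fold_product` is hypothetical. It contrasts evaluating `1e10000 * 1e-9999` with floats against evaluating it with exact rational arithmetic, which is the kind of folding that would let the statement succeed.

```python
from fractions import Fraction
import math

def fold_product(a: str, b: str) -> Fraction:
    """Multiply two numeric literals using exact rational arithmetic."""
    return Fraction(a) * Fraction(b)

# Converting each literal to a float first loses the value entirely:
# 1e10000 overflows to infinity and 1e-9999 underflows to zero.
naive = float("1e10000") * float("1e-9999")

# Exact arithmetic keeps every digit, and the product fits easily in a float.
exact = fold_product("1e10000", "1e-9999")
```

Here `naive` is not a usable number, while `exact` is the small value the user intended to insert.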

## Things that look wrong but really aren't

1. loss of equivalence between prepared and direct statements:

   ```sql
   prepare a as select ($1 + 2)
   execute a(1.5)

   -- reports 3 (in Postgres)
   ```

   The issue here is that the + operator is overloaded, and the
   engine performs typing on $1 only considering the 2nd operand to
   the +, and not the fact that $1 may have a richer type.

   One may argue that a typing algorithm that only operates "locally"
   is sufficient, and that this statement can be reliably understood
   to perform an integer operation in all cases, with a forced cast of
   the value filled in the placeholder. The problem with this argument
   is that this interpretation loses the equivalence between a direct
   statement and a prepared statement, that is, the direct statement:

   ```sql
   select 1.5 + 2
   ```

   is not equivalent to:

   ```sql
   prepare a as select $1 + 2; execute a(1.5)
   ```

   The real issue however is that SQL's typing is essentially
   monomorphic and that prepare statements are evaluated independently
   of subsequent queries: there is simply no SQL type that can be
   inferred for the placeholder in a way that provides sensible
   behavior for all subsequent queries. And introducing polymorphic
   types (or type families) just for this purpose doesn't seem
   sufficiently justified, since an easy workaround is available:

   ```sql
   prepare a as select $1::float + 2;
   execute a(1.5)
   ```

2. Casts as type hints.

   Postgres uses casts as a way to indicate type hints on
   placeholders. One could argue that this is not intuitive, because a
   user may legitimately want to use a value of a given type in a
   context where another type is needed, without restricting the type
   of the placeholder. For example:

   ```sql
   create table t (x int, s string);
   insert into t (x, s) values ($1, "hello " + $1::string)
   ```

   Here intuition says we want this to infer "int" for $1, not get a
   type error due to conflicting types.

   However in any such case it is always possible to rewrite the
   query to both take advantage of type hints and also demand
   the required cast, for example:

   ```sql
   create table t (x int, s string);
   insert into t (x, s) values ($1::int, "hello " + ($1::int)::string)
   ```

   Therefore the use of casts as type hints should not be seen as a
   hurdle, and simply requires the documentation to properly mention
   to the user "if you intend to cast placeholders, explain the intended source
   type of your placeholder inside your cast first".

# Detailed design

Summary: Nathan spent some time trying to implement the first version of
this RFC. While doing so, he discovered it was more comfortable, performant,
and desirable to implement something in-between the current proposals
for Rick and Morty.

Since this new code is already largely written and seems to behave
as expected in almost all scenarios (all tests pass, examples from the
previous RFC are handled at least as well as Morty), we figured it warrants
a specification *a posteriori*. This will allow us to consider the new system
orthogonally from the code, and directly compare it to Rick and Morty.

The resulting type system is called **Summer**, after the name of
Morty's sister in the show. Summer is more mature and more predictable
than Morty, and gives the same or more desirable results in almost all
scenarios while being easier to understand externally.

## Overview of Summer

- Summer is also based on a set of rules that can be applied using
  a depth-first traversal of the query AST.
- Summer does slightly more work than Morty (more conditions checked
  at every level) but is not iterative like Rick.
- Summer requires constant folding early in the type resolution.
- Summer does not require or allow implicit type conversions, as opposed
  to Morty. In a similar approach to Go, it uses untyped literals
  to cover 90% of the use cases for implicit type conversions, and deems
  that it's preferable to require explicit programmer clarification for
  the remaining 10%.
- Summer only uses exact arithmetic during initial constant folding,
  and performs all further operations using SQL types, whereas Morty
  sometimes uses exact arithmetic during evaluation.

Criticism of Morty where Summer is better: EXPLAIN on Morty will
basically say to the user "I don't really know what the type of these
expressions is" (eval-time type assertions with an exact argument),
whereas Summer will always pick a type and be able to explain it.

## Proposed typing strategy

### High-level overview

To explain Summer to a newcomer it would be mostly correct to say
"Summer first determines the types of the operands of a complex
expression, then based on the operand types decides the type of the
complex expression", i.e. the intuitive description of a bottom-up type
inference.

The reason why Summer is more complex than this in reality (and the principle
underlying its design) is threefold:

- Expressions containing placeholders often contain insufficient
  information to determine a proper type in a bottom-up fashion. For
  example in the expression `floor($1 * $2)` we cannot type the
  placeholders unless we take into account the accepted argument types
  of `floor`.

- SQL number literals are usually valid values in multiple types
  (`int`, `float`, `decimal`). Not only do users expect a minimum
  amount of automatic type coercion, so that expressions like
  `1.5 + 123` are not rejected; there is also a conflict of interest between
  flexibility for the SQL user (which suggests picking the largest
  type) and performance (which suggests picking the smallest type).
  Summer does extra work to reach a balance here. For example
  `greatest(1, 1.2)` will pick `float` whereas
  `greatest(1, 1.2e10000)` will pick `decimal`.

- SQL has overloaded functions. If there are multiple candidates and
  the operand types do not match the candidates' expected types
  "exactly", Summer does extra work to find an acceptable candidate.

So another way to explain Summer that is somewhat less incorrect
than the naive explanation above would be:

1. the types of constant literals (numbers, strings, null) and
   placeholders are mostly determined by their parent expression
   depending on other rules (especially the expected type at that
   position), not by themselves. For example Summer does not "know"
   (determine) the constant "123" to be an `int` until it looks at
   its parent in the syntax tree. For complex expressions involving
   number constants, this requires Summer to first perform constant
   folding so that the immediate parent of a constant, often an
   overloaded operator, has enough information from its other
   operand(s) to decide a type for the constant. This constant folding
   is performed using exact arithmetic.

2. for functions that require homogeneous types (e.g. `GREATEST`,
   `CASE .. THEN` etc), the type expected by the context, if any, is used to
   restrict the operand types (rule 6.2); otherwise the first operand
   with a "possibly useful" type is used to restrict the type of the
   other operands (rules 6.3 and 6.4).

3. during overload resolution, the candidate list is first restricted
   to the candidates that are *compatible* with the arguments (rules
   7.1 to 7.3), then filtered down by compatibility between the
   candidate return types and the context (7.4), then by minimizing
   the amount of type conversions for literals (7.5), then by
   preferring homogeneous argument lists (7.6).
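
The `greatest(1, 1.2)` versus `greatest(1, 1.2e10000)` behavior described above can be approximated with a small Python sketch. The function `best_mutual_type` is a hypothetical helper, and the selection rule is an assumption made for illustration: pick the first type, in the order of the first literal's candidate list, that every literal can be represented as. The candidate lists are written out by hand here.

```python
def best_mutual_type(candidate_lists):
    """Return the first type of the first list accepted by all lists, or None.

    Assumption: each list is ordered by preference for its literal.
    """
    first, *rest = candidate_lists
    for sql_type in first:
        if all(sql_type in other for other in rest):
            return sql_type
    return None

# Hand-written candidate types for the literals 1, 1.2 and 1.2e10000.
ONE = ["int", "float", "decimal"]
ONE_POINT_TWO = ["float", "decimal"]
HUGE = ["decimal"]  # overflows a float, so only decimal can hold it

small = best_mutual_type([ONE, ONE_POINT_TWO])  # greatest(1, 1.2)
large = best_mutual_type([ONE, HUGE])           # greatest(1, 1.2e10000)
```

Under these assumptions `small` comes out as `float` and `large` as `decimal`, matching the two examples in the text.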

### Language extension

In order to clarify the typing rules below and to exercise
the proposed system, we found it was useful to "force" certain
expressions to be of a certain type.

Unfortunately the SQL cast expression (`CAST(... AS ...)` or
`...::...`) is not appropriate for this, because although it
guarantees a type to the surrounding expression it does not constrain
its argument. For example `sign(1.2)::int` does not disambiguate which
overload of `sign` to use.

Therefore we propose the following SQL extension, which is not
required to implement the typing system but offers opportunities to
better exercise it in tests. The examples below also use
this extension for explanatory purposes.

The extension is a new expression node "type annotation".

We also propose the following SQL syntax for this: "E ::: T".

For example: `1:::int` or `1 ::: int`.

The meaning of this at a first order approximation is "interpret the
expression on the left using the type on the right".

This is different from casts, as explained below.

The need for this type of extension is also implicitly
present/expressed in the alternate proposals Rick and Morty.

### First pass: placeholder annotations

In the first pass we check the following:

- if any given placeholder appears as immediate argument of an
  explicit annotation, then assign that type to the placeholder (and
  reject conflicting annotations after the 1st).

- otherwise (no direct annotations on a placeholder), if all
  occurrences of a placeholder appear as immediate argument
  to a cast expression then:

  - if all the cast(s) are homogeneous,
    then assign the placeholder the type indicated by the cast.

  - otherwise, assign the type "string" to the placeholder.

Examples:

```sql
select $1:::float, $1::string
-> $1 ::: float, execution will perform explicit cast float->string
select $1:::float, $1:::string
-> error: conflicting types
select $1::float, $1::float
-> $1 ::: float
select $1::float, $1::string
-> $1 ::: string, execution will perform explicit cast $1 -> float
select $1:::float, $1
-> $1 ::: float
select $1::float, $1
-> nothing done during 1st pass, typing below will resolve
```

(Note that this rule does not interfere with the separate rule,
customary in SQL interpreters, that the client may choose to disregard
the stated type of a placeholder during execute and instead pass the
value as a string. The query executor must then convert the string to
the type for the placeholder that was determined during type checking.
For example if a client prepares `select $1:::int + 2` and passes "123" (a string),
the executor must convert "123" to 123 (an int) before running the query. The
annotation expression is a type assertion, not a conversion, at run-time.)
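
The first-pass rules above can be sketched in Python as follows. This is only an illustration, not the actual implementation: the function `first_pass` and the input encoding are hypothetical. Each occurrence of a placeholder is given as a `(name, kind, type)` triple, where `kind` is `"annotation"` for `:::`, `"cast"` for `::`, and `"bare"` for an occurrence with neither.

```python
def first_pass(occurrences):
    """Assign placeholder types from annotations and casts (first pass only)."""
    annotations, casts, bare = {}, {}, set()
    for name, kind, sql_type in occurrences:
        if kind == "annotation":
            # The first annotation wins; a different later one is an error.
            if annotations.setdefault(name, sql_type) != sql_type:
                raise TypeError(f"conflicting types for {name}")
        elif kind == "cast":
            casts.setdefault(name, set()).add(sql_type)
        else:
            bare.add(name)

    assigned = dict(annotations)
    for name, cast_types in casts.items():
        # The cast rule applies only when there is no annotation and
        # every occurrence of the placeholder is the argument of a cast.
        if name in assigned or name in bare:
            continue
        homogeneous = len(cast_types) == 1
        assigned[name] = next(iter(cast_types)) if homogeneous else "string"
    return assigned
```

A placeholder missing from the returned mapping is one the first pass leaves alone, to be resolved by the later typing pass.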

### Second pass: constant folding

The second pass performs constant folding and annotates constant literals with
their possible types. Note that in practice, the first two passes could be implemented in a
single pass, but for the sake of understanding, it is easier to separate them
logically.

Constant expressions are folded using exact arithmetic. This is accomplished using a
depth-first, post-order traversal of the syntax tree. At the end of this phase,
the parents of constant values are either statements, or expression nodes where
one of the children is not a constant (either a column reference, a placeholder, or a
more complex non-constant expression).
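
As a rough illustration of this folding pass, the Python sketch below folds a toy expression tree in post-order. The tree encoding is an assumption made for the example: binary nodes are `(operator, left, right)` tuples, constants are `Fraction` values (standing in for exact numeric constants), and any other leaf (such as the string `"$1"`) is treated as non-constant. Error handling such as division by zero is left out.

```python
from fractions import Fraction
import operator

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,  # exact for Fraction operands
}

def fold(node):
    """Fold constant sub-expressions bottom-up using exact arithmetic."""
    if not isinstance(node, tuple):
        return node  # leaf: a constant, a placeholder or a column reference
    op, left, right = node
    left, right = fold(left), fold(right)  # children first (post-order)
    if isinstance(left, Fraction) and isinstance(right, Fraction):
        return OPS[op](left, right)
    return (op, left, right)  # at least one child is not a constant

# (2 * 3) + $1 keeps its non-constant parent but folds the product.
folded = fold(("+", ("*", Fraction(2), Fraction(3)), "$1"))
```

After folding, every remaining constant has a parent with at least one non-constant child, which is the property the paragraph above describes.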

Constant values are broken into two categories: **Numeric** and **String-like** constants,
which will be represented as the types `NumVal` and `StrVal` in the implemented typing system.
Numeric constants are stored internally as exact numeric values using the
[`go/constant`](https://golang.org/pkg/go/constant/) package. String-like constants are
stored internally using a `string` value.

After constant folding has occurred the remaining constants are represented as
literal constants in the syntax tree and annotated by an ordered list of SQL types that can
represent them with the least loss of information. We call this list the *resolvable type
ordered set*, or *resolvable type set* for short, and the head of this list the *natural type* of the
constant.

#### Numeric Constant Examples

| value                  | resolvable type set                                                       |
|:----------------------:|:--------------------------------------------------------------------------|
| 1                      | [int, float, decimal]                                                     |
| 1.0                    | [float, int, decimal]                                                     |
| 1.1                    | [float, decimal]                                                          |
| null                   | [null, int, float, decimal, string, bytes, timestamp, ...]                |
| 123..overflowInt..4567 | [decimal, float]                                                          |
| 12..overflowFloat..567 | [decimal]                                                                 |
| 1/2                    | [float, decimal]                                                          |
| 1/3                    | [float, decimal] // perhaps future feature: [fractional, float, decimal]  |

Notice: we use the lowercase SQL types in the RFC; these are reified in the code using either zero
values of `Datum` (original implementation) or optionally the enum values of a new `Type` type.

#### String-like Constant Examples

| value                  | resolvable type set       |
|:----------------------:|:--------------------------|
| 'abc'                  | [string, bytes]           |
| b'abc'                 | [bytes, string]           |
| b'a\00bc'              | [bytes]                   |

These traits will be used later during the type resolution phase of constants.
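
The numeric table above can be reproduced with a short Python sketch. This is a guess at rules that are consistent with the listed rows, not the actual implementation. The function `resolvable_types` is hypothetical, the `written_as_int` flag (distinguishing `1` from `1.0`) is an assumption about what the parser would record, and the bounds assume 64-bit `int` and `float` types. `null` is not covered.

```python
from fractions import Fraction
import sys

INT_MIN, INT_MAX = -2**63, 2**63 - 1      # assumption: 64-bit SQL int
FLOAT_MAX = Fraction(sys.float_info.max)  # assumption: 64-bit SQL float

def resolvable_types(value: Fraction, written_as_int: bool) -> list:
    """Return an ordered list of SQL types able to hold an exact constant."""
    is_integral = value.denominator == 1
    fits_int = is_integral and INT_MIN <= value <= INT_MAX
    fits_float = abs(value) <= FLOAT_MAX
    if fits_int:
        # `1` prefers int; `1.0` prefers float but can still be an int.
        if written_as_int:
            return ["int", "float", "decimal"]
        return ["float", "int", "decimal"]
    if not fits_float:
        return ["decimal"]
    if is_integral:
        # Too large for int: decimal is exact, float only approximate.
        return ["decimal", "float"]
    return ["float", "decimal"]
```

The first element of each returned list plays the role of the natural type.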

### Third pass: expression typing, a recursive traversal of the syntax tree

The recursive typing function T takes two input parameters: the node to work
on, and a specification of **desired types**.

Desired types are simple for expressions: that's the desired type of the
result of evaluating the expression. For a sub-select or other syntax nodes
that return tables, the specification is a map from column name to requested
type for that column.

A desired type is merely a hint to the sub-expression
being type checked; it is up to the caller to assert that a specific type
is returned from the expression typing and throw a type checking error
if necessary.

Desired type classes can be:

- fully unspecified (the top level specification is a wildcard)
- structurally specified (the top level specification is not a wildcard, but the desired
  type may contain wildcards)

We also say that a desired type is "fully specified" if it doesn't contain any wildcard.

We say that two desired type patterns are "equivalent" when they are structurally
equivalent given that a wildcard is equivalent to any type.

While a specified (fully or partially) desired type indicates a preference to
the sub-expression being type checked, an unspecified desired type indicates
no preference. This means that the sub-expression should type itself naturally
(see *natural type* discussion above). Generally speaking, wildcard types can be
thought of as accepting any type in their place. In certain cases, such as with
placeholders, a desired type must be fully-specified or an ambiguity error will
be thrown.
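
The notions of "fully specified" and "equivalent" desired types can be sketched in Python. The encoding is an assumption made for the example: a wildcard is the string `"*"`, scalar types are plain strings, and structured types are tuples whose elements are themselves desired types.

```python
WILDCARD = "*"

def is_fully_specified(desired) -> bool:
    """True when the desired type contains no wildcard at any level."""
    if desired == WILDCARD:
        return False
    if isinstance(desired, tuple):
        return all(is_fully_specified(part) for part in desired)
    return True

def equivalent(a, b) -> bool:
    """Structural equivalence where a wildcard matches any type."""
    if a == WILDCARD or b == WILDCARD:
        return True
    if isinstance(a, tuple) and isinstance(b, tuple):
        return len(a) == len(b) and all(
            equivalent(x, y) for x, y in zip(a, b)
        )
    return a == b
```

For instance, `("tuple", "int", "*")` is structurally specified but not fully specified, and it is equivalent to `("tuple", "int", "string")`.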

The alternative would be to propagate the desired type down as a constraint,
and fail as soon as this constraint is violated. However by doing so we would
dilute the clarity of the origin of the error. Consider for example `insert into (text_column) values (1+floor(1.5))`;
if we had desired types as constraints the error would be `1 is not a string` whereas by making the caller
that demands a type the checker, the error becomes `1+floor(1.5) is not a string`, which is arguably more desirable.
Meanwhile, the type checking of a node retains the option to accept
the type found for a sub-tree even if it's different from the desired type.

As an important optimization, we annotate the results of typing in the syntax
node. This way during normalization when the syntax structure is changed,
the new nodes created by normalization can reuse the types of their sub-trees
without having to recompute them (since normalization does not change
the type of the sub-trees). A new method on syntax nodes provides read access to
the type annotation.

The output of T is a new "typed expression" which is capable of returning
(without checking again) the type it will return when evaluated. T also stores
the inferred type in the input node itself (at every level) before returning.
In effect, this means that type checking will transform an untyped expression
tree where each node is unable to properly introspect its own return
type into a typed tree which can provide its inferred result type, and as such
can be evaluated later.

_In an effort to make this distinction clearer in code, a `TypedExpr` interface
will be created, which is a superset of the `Expr` interface, but also has the
ability to return its annotation and evaluate itself. This means that the `Eval`
method in `Expr` will be moved to `TypedExpr`, and that the `TypeCheck` method on
`Expr`s will return a `TypedExpr`._

The function then works as follows:

1. if the node is a constant literal: if the desired type is within the constant's
   **resolvable type set**, convert the literal to the desired type. Otherwise, resolve
   the literal as its **natural type**. Note that only fully-specified types will ever
   be in a constant's **resolvable type set**.

2. if the node is a column reference or a datum: use the type determined by the node regardless
   of the desired type.

3. if the node is a placeholder: if the desired type is not fully-specified, report an error.
   If there is a fully-specified desired type, and the placeholder was not yet assigned a type,
   assign the desired type to the placeholder. If the placeholder was already assigned a
   different type, report an error.

4. if the node is NULL: if there is a fully-specified desired type, annotate the node with
   the equivalent type and return that, otherwise return the NULL type as the expression's resolved
   type.

5. if the node is a simple statement (or sub-select, not CASE!): propagate the desired
   types down, then look at what comes up when the recursion returns, then check that the
   inferred types are compatible with the statement semantics.

6. for statements or variadic function calls with a homogeneity requirement, we use the rules in the section
   [below](#required-homogeneity) for typing.

7. if the node is a function call not otherwise handled in step #6 (including a binary or unary operation, or a comparison
   operation), perform overload resolution on the set of possible overloaded
   functions that could be used. See [below](#overload-resolution) for how
   this is performed.

8. if the node is a type annotation, then the desired type provided from the parent is ignored and the annotated
   type is required instead (sent down as desired type, checked upon type resolution; if they don't match an error is reported).
   The annotated type is resolved.
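
A few of the rules above (1, 2, 3, 4 and 8) can be sketched in Python on a toy node encoding. Everything here is hypothetical: nodes are tuples tagged with a kind, desired types are plain strings with `None` standing for "unspecified", and `placeholders` is a mutable mapping from placeholder name to assigned type. One point is an assumption rather than something rule 3 states: a placeholder that already has a type is allowed to return it when no desired type is given.

```python
UNSPECIFIED = None  # stands for a fully unspecified desired type

def type_check(node, desired, placeholders):
    """Return the type of `node`, updating `placeholders` as a side effect."""
    kind = node[0]
    if kind == "const":        # rule 1: ("const", resolvable_type_set)
        resolvable = node[1]
        return desired if desired in resolvable else resolvable[0]
    if kind == "column":       # rule 2: ("column", type)
        return node[1]
    if kind == "placeholder":  # rule 3: ("placeholder", name)
        name = node[1]
        assigned = placeholders.get(name)
        if desired is UNSPECIFIED:
            # Assumption: an already-typed placeholder keeps its type.
            if assigned is None:
                raise TypeError(f"cannot infer the type of {name}")
            return assigned
        if assigned is not None and assigned != desired:
            raise TypeError(f"{name} is {assigned}, not {desired}")
        placeholders[name] = desired
        return desired
    if kind == "null":         # rule 4: ("null",)
        return desired if desired is not UNSPECIFIED else "null"
    if kind == "annotation":   # rule 8: ("annotation", child, type)
        _, child, annotated = node
        found = type_check(child, annotated, placeholders)
        if found != annotated:
            raise TypeError(f"expected {annotated}, found {found}")
        return annotated
    raise NotImplementedError(kind)
```

With this encoding, `$1:::int` is the node `("annotation", ("placeholder", "$1"), "int")`; checking it assigns `int` to `$1` whatever type the parent desired.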

### Overload resolution

In the case of a function call (and all call-like expressions) there
is a set of overloads from which one must be chosen to resolve the correct operation implementation
to dispatch to during evaluation. This resolution can be broken down into a series of filtering
steps, whereby candidates which do not pass a filter are eliminated from the resolution process.

The resolution is based on an initial classification of all the argument expressions to the call into
3 categories (implemented as 3 vectors of (position, expression) in the implementation):

- constant numeric literals
- unresolved arguments: placeholder nodes that do not have an assigned type yet
- "pre-typable nodes" (which happen to be either unambiguously resolvable expressions or previously resolved placeholders or constant
  string literals)

The **first three** steps below are run unconditionally. After the 3rd step and after each
subsequent step, we check the remaining overload set:

- if there are no candidates left, type checking fails ("no matching overload").
- if there is only one candidate left, this is used as the implementation function for the call; any
  yet untyped placeholder or constant literal is typed recursively using the type defined by its argument position as desired type
  (it is possible to prove, and we could assert here, that the inferred type here is always
  equivalent to the desired type), then subsequent steps are skipped.
- if there is more than one candidate left, the next filter is applied and the resolution
  continues.
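
The control flow just described (three unconditional filters, then a check after each further filter) can be sketched in Python. The function `resolve_overload` is hypothetical, candidates are modeled as tuples of argument types, and each filter is a function from a candidate list to a candidate list. What happens when the filters run out with several candidates left is not stated in this section, so reporting an ambiguity error is an assumption. The example reuses `sign(1.2)`, with simplified stand-ins for steps 7.1 to 7.5 listed below.

```python
def resolve_overload(candidates, unconditional, conditional):
    """Apply the filters in order and return the single remaining overload."""
    for step in unconditional:   # steps 7.1 to 7.3 always run
        candidates = step(candidates)
    pending = list(conditional)  # steps 7.4 onwards run only if needed
    while True:
        if not candidates:
            raise TypeError("no matching overload")
        if len(candidates) == 1:
            return candidates[0]
        if not pending:
            raise TypeError("ambiguous call")  # assumption, see lead-in
        candidates = pending.pop(0)(candidates)

sign_overloads = [("int",), ("float",), ("decimal",)]
resolvable = ["float", "decimal"]  # resolvable type set of the literal 1.2

chosen = resolve_overload(
    sign_overloads,
    unconditional=[
        lambda cs: [c for c in cs if len(c) == 1],         # 7.1: arity
        lambda cs: cs,                                     # 7.2: nothing pre-typable
        lambda cs: [c for c in cs if c[0] in resolvable],  # 7.3: resolvable set
    ],
    conditional=[
        lambda cs: cs,                                        # 7.4: no desired type
        lambda cs: [c for c in cs if c[0] == resolvable[0]],  # 7.5: natural type
    ],
)
```

In this run step 7.3 removes the `int` overload and the natural-type part of step 7.5 settles on the `float` overload.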

1. (7.1) candidates are filtered based on the number of arguments

2. (7.2) the pre-typable sub-nodes (and only those) are typed, starting with an unspecified desired type.
   At every sub-node, the candidate list is filtered using the types found so far. If at
   any point there is only one candidate remaining, further pre-typable sub-nodes are typed using
   the remaining candidate's argument type at that position as desired type.

   Possible extension: if at any point the remaining candidates all accept the same type at the
   current argument position, that type is also used as desired type.

   Then the overload candidates are filtered based on the resulting types. If any argument of the call
   receives type null, then it is not used for filtering.

   For example: `select mod(extract(seconds from now()), $1*20)`. There
   are 3 candidates for `mod`, on `int`, `float` and `decimal`. The
   first argument `extract` is typed with an unspecified desired type and
   resolves to `int`. This selects the candidate `mod(int, int)`. From then on only one candidate
   remains so `$1*20` gets typed using desired type `int` and `$1` gets typed as `int`.

3. (7.3) candidates are filtered based on the resolvable type sets of constant number literals.
   Remember at this point all constant literals already have a resolvable type set since constant folding.

   The filtering is done left to right, eliminating at each argument all candidates that do not accept
   one of the types in the resolvable set at that position.

   Example: `select sign(1.2)`. `sign` has 3 candidates for `int`, `float` and `decimal`. Step 7.3 eliminates
   the candidate for `int`.

   After this point,
   the number of candidates left will be checked now and after each following step.
Apr 20, 2016
698
4. (7.4) candidates are filtered based on the desired return type, if one is provided

   Example: `insert into (str_col) values (left($1, 1))`
   With only rules 7.2 and 7.3 above we still have 2 candidates: `left(string, int)` and `left(bytes, int)`.
   With rule 7.4 `left(string, int)` is selected.

5. (7.5) If there are constant number literals in the argument list, then try to filter the candidate list
   down to 1 candidate using the natural type of the constants. If that fails (either 0 candidates left or >1), try again,
   this time trying to find a candidate that accepts the "best" mutual type in the resolvable type set of all constants
   (in the order defined in the resolvable type set).

   Example: `select sign(1.2)`
   With only rules 7.2 to 7.4 above we still have 2 candidates:
   `sign(float)` and `sign(decimal)`.
   Rule 7.5 with the natural type selects `sign(float)`.

   Example: `select div(1e10000,2.5)`
   With only rules 7.2 to 7.4 above we still have 2 candidates:
   `div(float,float)` and `div(decimal,decimal)`; however the natural types are `decimal` and `float`, respectively.
   The 2nd part of rule 7.5 picks `div(decimal,decimal)`.

6. (7.6) *for the final step, we look to prefer homogeneous argument types across candidates.
   For instance, overloads
   for `int + int` and `int + date` exist, so without preferring homogeneous overloads, `1 + $1` would
   be resolved as ambiguous. Therefore, we check if all previously resolved types are the same, and if
   so, follow the filtering step.*

   If there is at least one argument with a resolved type, and all resolved types for arguments so far are homogeneous in type,
   and all remaining constants have this type in their resolvable type set, and there is at least one candidate
   that accepts this type in the yet untyped positions,
   choose that candidate.

   Example: `select div(1, $1)` still has candidates for `int`, `float` and `decimal`.

Another approach would be to go through each overload and attempt to type check each
argument expression with the parameter's type. If any of these expressions type checked to a
different type then we could discard the overload. This would avoid some of the issues noted in
step 2, but would create a few other issues:

- it would ignore constants' **natural** types. This could be special cased, but only one level deep
- it could be expensive if the function was high up in an expression tree
- it would ignore preferred homogeneity. Again though, this could be special cased

Because of these issues, this approach is not being considered.

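
The filtering steps 7.1 to 7.4 can be illustrated with a minimal Go sketch. It is not the RFC's implementation: the `Type`, `Overload` and `Arg` types and the `resolve` function are hypothetical names introduced here, rules 7.2 and 7.3 are collapsed into a single left-to-right pass, and the propagation of desired types into sub-nodes is left out.

```go
package main

import "fmt"

// Type is a hypothetical stand-in for a SQL type name.
type Type string

// Overload is a hypothetical candidate signature.
type Overload struct {
	Params []Type
	Ret    Type
}

// Arg describes one call argument as seen by the resolver. Resolved is
// set for pre-typed sub-nodes (rule 7.2); TypeSet is the resolvable
// type set of a constant number literal (rule 7.3). An untyped
// placeholder has neither and does not take part in filtering.
type Arg struct {
	Resolved Type
	TypeSet  []Type
}

// filter keeps the candidates for which keep returns true.
func filter(cands []Overload, keep func(Overload) bool) []Overload {
	var out []Overload
	for _, c := range cands {
		if keep(c) {
			out = append(out, c)
		}
	}
	return out
}

// accepts reports whether type t is a member of set.
func accepts(set []Type, t Type) bool {
	for _, s := range set {
		if s == t {
			return true
		}
	}
	return false
}

// resolve applies simplified versions of rules 7.1 to 7.4 and returns
// the remaining candidates. An empty desired type means "wildcard".
func resolve(cands []Overload, args []Arg, desired Type) []Overload {
	// 7.1: filter on the number of arguments.
	cands = filter(cands, func(c Overload) bool { return len(c.Params) == len(args) })
	// 7.2 and 7.3: filter on resolved types and resolvable type sets.
	for i, a := range args {
		switch {
		case a.Resolved != "":
			cands = filter(cands, func(c Overload) bool { return c.Params[i] == a.Resolved })
		case len(a.TypeSet) > 0:
			cands = filter(cands, func(c Overload) bool { return accepts(a.TypeSet, c.Params[i]) })
		}
	}
	// 7.4: filter on the desired return type, if one is provided and
	// more than one candidate is left.
	if desired != "" && len(cands) > 1 {
		cands = filter(cands, func(c Overload) bool { return c.Ret == desired })
	}
	return cands
}

func main() {
	number := []Type{"int", "float", "decimal"}
	left := []Overload{
		{Params: []Type{"string", "int"}, Ret: "string"},
		{Params: []Type{"bytes", "int"}, Ret: "bytes"},
	}
	// insert into (str_col) values (left($1, 1)): desired type "string".
	got := resolve(left, []Arg{{}, {TypeSet: number}}, "string")
	fmt.Println(len(got), got[0].Params)
}
```

Without a desired type, the same call keeps both `left` candidates, which is the situation rule 7.4 is meant to resolve.
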
### Required homogeneity

There are a number of cases where it is required that the types of all expressions are the
same. For example: COALESCE, CASE (conditions and values), IF, NULLIF, RANGE, CONCAT, LEAST/GREATEST (MIN/MAX variadic)....

These situations may or may not also desire a given type for all subexpressions. Two
examples of this type of situation are in CASE statements (both for the condition set and the
value set) and in COALESCE statements. Because this is a common need for a number of statement
types, the typing resolution of this situation should be specified. Here we present a list of
rules to be applied to a given list of untyped expressions and a desired type.

1. (6.1) as we did with overload resolution, split the provided expressions into three groups:
   - pre-typable nodes: unambiguously resolvable expressions, previously resolved arguments, and constant string literals
   - constant numeric literals
   - unresolved placeholders

2. (6.2) if there is a specified (partially or fully) desired type, type all the sub-nodes using
   this type as desired type. If any of the sub-nodes resolves to a different type, report an
   error (expecting X, got Y).

3. (6.3) otherwise (wildcard desired type), if there is any pre-typable node, then
   type this node with an unspecified desired type.
   Call the resulting type T.
   Then type each remaining sub-node desiring T. If the resulting type is different from T, report an error.
   The result of typing is T.

4. (6.4) (wildcard desired type, no pre-typable node, all remaining nodes are either constant number literals or untyped placeholders)

   If there is at least one constant literal, then pick the best mutual type of all constant literals, if any, call that T,
   type all sub-nodes using T as desired type, and return T as resolved type.

5. (6.5) Fail with ambiguous typing.

## Examples with Summer

```sql
prepare a as select 3 + case (4) when 4 then $1 end
Tree:
    select
      |
      +
     / \
    3  case
      / | \
     4  4  $1
```

Constant folding happens, nothing changes.

Typing of the select begins. Since this is not a sub-select there is a wildcard desired type.
Typing of "+" begins. Again wildcard desired type.

Rule 7.1 then 7.2 applies.

Typing of "case" begins without a specified desired type.

Then "case" recursively types its condition variable with a wildcard desired type.
Typing of "4" begins. Unspecified desired type here, resolves to int as natural type, [int, float, dec] as resolvable type set.
Typing of "case" continues. Now that it knows the condition is an "int", it will demand "int" for the WHEN branches.
Typing of "4" (the 2nd one) begins. Type "int" is desired so the 2nd "4" is typed to that.
Typing of "case" continues. Here rule 6.4 applies to the THEN values: there is no constant literal, only the untyped placeholder $1, so a failure occurs (rule 6.5).

```sql
prepare a as select 3 + case (4) when 4 then $1 else 42 end
```

Typing of the select begins. Since this is not a sub-select there is a wildcard desired type.
Typing of "+" begins. Again wildcard desired type.

Rule 7.1 then 7.2 applies.

Typing of "case" begins without a specified desired type.

Then "case" recursively types its condition variable with a wildcard desired type.
Typing of "4" begins. Wildcard desired type, resolves to int as natural type, [int, float, dec] as resolvable type set.
Typing of "case" continues. Now that it knows the condition is an "int", it will demand "int" for the WHEN branches.
Typing of "4" (the 2nd one) begins. Type "int" is desired so the 2nd "4" is typed to that.

Here rule 6.4 applies. "42" decides int, so $1 gets assigned "int" and
case resolves as "int".

Then typing of "+" resumes.

Based on the resolved type for case, "+" reduces the overload set to
(int, int), (date, int), (timestamp, int).

Rule 7.3 applies. This eliminates the candidates that take non-number as 1st argument. Only (int, int) remains.
This decides the overload, "3" gets typed as "int", and the
typing completes for "+" with "int".
Typing completes.


```sql
create table t (x float);
insert into t(x) values (1e10000 * 1e-9999);
```

First pass: constant folding using exact arithmetic. The expression 1e10000 * 1e-9999 gets simplified to 10.

Typing insert. The target of the insert is looked at first. This determines desired type "float" for the 1st column in the values clause.
Typing of the values clause begins, with desired type "float".
Typing of "10" begins with desired type "float". "10" gets converted to float.
Typing of insert ends. All is well. Result:

```
insert
  |
```


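
The exact constant folding used in the example above can be reproduced with Go's `math/big` package. This is a sketch only: `foldMul` is a hypothetical helper, and using `big.Rat` as the exact representation is an assumption made for illustration, not something the RFC prescribes.

```go
package main

import (
	"fmt"
	"math/big"
)

// foldMul multiplies two numeric literals using exact rational
// arithmetic. It returns false if either literal cannot be parsed.
func foldMul(a, b string) (*big.Rat, bool) {
	x, okX := new(big.Rat).SetString(a)
	y, okY := new(big.Rat).SetString(b)
	if !okX || !okY {
		return nil, false
	}
	return x.Mul(x, y), true
}

func main() {
	// Neither operand fits in a float64, but the exact product does.
	r, ok := foldMul("1e10000", "1e-9999")
	if !ok {
		fmt.Println("cannot parse literals")
		return
	}
	f, exact := r.Float64()
	fmt.Println(r.RatString(), f, exact)
}
```

Folding before typing is what lets the product be typed as `float` here, even though `1e10000` on its own is outside the range of a 64-bit float.
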
```
select floor($1 + $2)
```

Assuming `floor` is only defined for floats.
Typing of "floor" begins with an unspecified desired type.

Rule 7.2 applies.
There is only one candidate, so there is a desired type for the remaining arguments (here the only one of them) based on the arguments taken by floor.

Typing of "+" begins with desired type "float".

Rule 7.2 applies: nothing to do.
Rule 7.3 applies: nothing to do.
Rule 7.4 applies: +(float, float) is selected.
Then $1 and $2 are assigned the types desired by the remaining candidate.

Typing of "floor" resumes, finds a "float" argument.
Rule 7.2 completes with 1 candidate, and
typing of "floor" completes with type "float".

```
        select
          |
     floor:::float
        /     \
$1:::float   $2:::float
```


```sql
select ($1+$1)+current_date()
        select
         +(a)
   +(b)      current_date()
 $1    $1
```

Typing of "+(a)" begins with a wildcard desired type.
Rule 7.2 applies.
All candidates for "+" take different types,
so we don't find any desired type.

Typing of "+(b)" begins without a specified desired type.
Rules 7.1 to 7.6 fail to reduce the overload set, so typing fails with ambiguous types.

Possible fix:

- annotate during constant folding all the nodes known to not contain placeholders or constants underneath
- in rule 7.2 order the typing of pre-typable sub-nodes by starting
  with those nodes.


Consider a library containing the following functions:

    f(int) -> int
    f(float) -> float
    g(int) -> int

Then consider the following statement:

```sql
prepare a as select g(f($1))
```

Typing starts for "select".
Typing starts for the call to "g" without a specified desired type.
Rule 7.2 applies. Only 1 candidate so the sub-nodes are typed
with its argument type as desired type.

Typing starts for the call to "f" with desired type "int".
Rule 7.4 applies, selecting only 1 candidate.
Then typing of "f" completes, "$1" gets assigned "int",
"f" resolves to "int".

"g" sees only 1 candidate, resolves to that candidate's
return type "int".
Typing completes.

```sql
INSERT INTO t(int_col) VALUES (4.5)
```

Insert demands "int", "4.5" has natural type float and doesn't have
"int" in its resolvable type set.
Typing fails (like in Rick and Morty).

```sql
INSERT INTO t(int_col) VALUES ($1 + 1)
```

Insert demands "int".
Typing of "+" begins with desired type "int".
Rule 7.4 applies, chooses +(int, int).
Only 1 candidate, $1 and 1 get assigned "int".
Typing completes.

```sql
insert into (int_col) values ($1 - $2)
```

Do not forget:

    -(int, int) -> int
    -(date, date) -> int

Ambiguous on overload resolution of "-".


```sql
insert into (str_col) values (coalesce(1, "foo"))
-- must say "1" is not string
select coalesce(1, "foo")
-- must say "foo" is not int
```

(to check in testing: Rules 6.1-6.5 do this)

```sql
SELECT ($1 + 2) + ($1 + 2.5)
```

($1 + 2) types as int, $1 gets assigned int,
then ($1 + 2.5) doesn't type.

(Morty would have done $1 = exact)


```sql
create table t (x float);
insert into t(x) values (3 / 2)
```
Constant folding reduces 3/2 into 1.5.

Typing "1.5" starts with desired type "float", succeeds, 1.5 gets inserted.

```sql
create table u (x int);
insert into u(x) values (((9 / 3) * (1 / 3))::int)
```
Constant folding folds this down to ... values("1") with "1"
annotated with natural type "int" and resolvable type set [int].

Then typing succeeds.


### Example 8

```sql
create table t (x int, s text);
insert into t (x, s) values ($1, "hello " + $1::text)
```

First "$1" gets typed with desired type "int", gets assigned "int".
Then "+" is typed.
Rule 7.2 applies.
The cast "cast ($1 as text)" is typed with a wildcard desired type.
This succeeds, leaves the $1 unchanged (it is agnostic of its argument)
and resolves to type "text".
"+" resolves to 1 candidate, is typed as "string".
Typing ends. $1 is int.
(better than Morty!)


### Example 9

```sql
select $1::int
```

First pass annotates $1 as int (all occurrences are arguments of
a cast). Typing completes with int.


### Example 10

```sql
f:int,int->int
f:float,float->int
PREPARE a AS SELECT f($1, $2), $2::float
```

Typing of "f" starts.
Multiple candidates remain after overload resolution.
Typing fails with ambiguous types.


### Example 11

#### Part a

```sql
f:int,int->int
f:float,float->int
PREPARE a AS SELECT f($1, $2), $2:::float
```

$2 is assigned "float" during the first phase.
Then typing of "f" starts;
the arguments have reduced the candidate set to just one.
Typing completes.
$1 is assigned "float".

```sql
PREPARE a AS SELECT ($1 + 4) + $1::int
```

Typing of top "+" starts.
Typing of inner "+" starts.
Candidates filtered to +(date,int) and +(int, int).
Rule 7.6 applies, $1 gets assigned "int".
"+" resolves 1 candidate.
Top level plus is +(int,int)->int.
Typing ends with int.

```sql
PREPARE a AS SELECT ($1 + 4) + $1:::int
```

"$1" gets assigned "int" during the first phase.
"+" resolves 1 candidate.
Typing ends.

```sql
PREPARE a AS SELECT ($2 - $2) * $1:::int, $2:::int
```

$2 is assigned int during the first pass.
Typing of "*" begins.
It sees that its 2nd argument already has a type,
so the candidate list is reduced to *(int,int),
and typing of "-" starts with desired type "int".
-(int, int) -> int
-(date, date) -> int
$2 already has type int, so one candidate remains.
[...]
Typing ends successfully.


### Example 12

```sql
INSERT INTO t (int_a, int_b) VALUES (f($1), $1 - $2)
-- succeeds (f types $1::int first, then $2 gets typed int),
-- however:
INSERT INTO t (int_b, int_a) VALUES ($1 - $2, f($1))
-- fails with ambiguous typing for $1-$2, f not visited yet.
```

```sql
SELECT CASE a_int
  WHEN 1 THEN 'one'
  WHEN 2 THEN
     CASE language_str
       WHEN 'en' THEN $1
     END
  END
```

Rule 6.3 applies for the outer case, "one" gets typed as "string".
Then "string" is desired for the inner case.
Then typing of "$1" assigns "string" (desired).
Then typing completes.


```sql
select max($1, $1):::int
```

Annotation demands "int" so rule 6 demands "int" from max, resolves "int" for $1 and max.


### Example 14

```sql
select array_length(ARRAY[1, 2, 3])
```

Typing starts for "select".
Typing starts for the call to "array_length" without a specified desired type.
Rule 7.2 applies. Only 1 candidate is available so the sub-nodes are typed with
its argument type as a desired type, which is "array<*>".

Typing starts for the ARRAY constructor with desired type "array<*>".
The ARRAY expression checks that the desired type is present and has a
base type of "array". Because it does, it unwraps the desired type, pulls
out the parameterized type "*", and passes this as the desired type when
requiring homogeneous types for all elements.

Typing starts for the array's expressions. These elements, in the presence
of an unspecified desired type, naturally type themselves as "int"s using
rule 6.4.

The ARRAY expression types itself as "array\<int\>".

The overload resolution for "array_length" finds that this resolved type is
equivalent to its single candidate's parameter (`array<*> ≡ array<int>`), so
it picks that candidate and resolves to that candidate's return type of "int".

Typing completes.


# Alternatives

## Overview of Morty

- Morty is a simple set of rules; they're applied locally (single
  depth-first, post-order traversal) to AST nodes for making typing
  decisions.

  One thing that conveniently makes a bunch of simple examples just
  work is that we keep numerical constants untyped as much as possible
  and introduce the one and only implicit cast from an untyped number
  constant to any other numeric type;

- Morty has only two implicit conversions, one for arithmetic on
  untyped constants and placeholders, and one for string literals.

- Morty does not require but can benefit from constant folding.

We use the following notations below:

    E :: T  => the regular SQL cast, equivalent to `CAST(E as T)`
    E [T]   => an AST node representing `E`
               with an annotation that indicates it has type T

## AST changes and new types

These are common to both Rick and Morty.

`SELECT`, `INSERT` and `UPDATE` should really be **EXPR** s.

The type of a `SELECT` expression should be an **aggregate**.

Table names should type as the **aggregate type** derived from their
schema.

An insert/update should really be seen as an expression like
a **function call** where the type of the arguments
is determined by the column names targeted by the insert.


## Proposed typing strategy for Morty

First pass: populating initial types for literals and placeholders.

- for each numeric literal, annotate with an internal type
  `exact`. Just like for Rick, we can do arithmetic in this type for
  constant folding.

- for each placeholder, process immediate casts if any by annotating
  the placeholder by the type indicated by the cast *when there is no
  other type discovered earlier for this placeholder* during this
  phase. If the same placeholder is encountered a 2nd time with a
  conflicting cast, report a typing error ("conflicting types for $n
  ...")

Second pass (optional, not part of type checking): constant folding.

Third pass, type inference and soundness analysis:

1. Overload resolution is done using only already typed arguments. This
   includes non-placeholder arguments, and placeholders with a type discovered
   earlier (either from the first pass, or earlier in this pass in traversal order).
2. If, during overload resolution, an expression E of type `exact` is
   found at some argument position and no candidate accepts `exact` at
   that position, and *also* there is only one candidate that accepts
   a numeric type T at that position, then the expression E is
   automatically substituted by `TYPEASSERT_NUMERIC(E,T)[T]` and
   typing continues assuming `E[T]` (see rule 11 below for a definition of `TYPEASSERT_NUMERIC`).
3. If, during overload resolution, a *literal* `string` E is
   found at some argument position and no candidate accepts `string`
   at that position, and *also* there is only one candidate left based
   on other arguments that accepts type T at that position *which does
   not have a native literal syntax*, then the expression E is
   automatically substituted by `TYPEASSERT_STRING(E,T)[T]` and typing
   continues assuming `E[T]`. See rule 12 below.
4. If no candidate overload can be found after steps #2 and #3, typing
   fails with "no known function with these argument types".
5. If an overload has only one candidate based on rules #2 and #3,
   then any placeholders it has as immediate arguments that are not yet
   typed receive the type indicated by their argument position.
6. If overload resolution finds more than 1 candidate, typing fails
   with "ambiguous overload".
7. `INSERT`s and `UPDATE`s come with the same inference rules
   as function calls.
8. If no type can be inferred for a placeholder (e.g. it's used only
   in overloaded function calls with multiple remaining candidates or
   only comes in contact with other untyped placeholders), then again
   fail with "ambiguous typing for the placeholder".
9. literal NULL is typed "unknown" unless there's an immediate cast just
   afterwards, and the _type_ "unknown" propagates up expressions until
   either the top level (that's an error) or a function that explicitly
   takes unknown as input type to do something with it (e.g. is_null,
   comparison, or INSERT with nullable columns);
10. "then" clauses (and the entire surrounding case expression) get
    typed by first attempting to type all the expressions after
    "then"; then once this is done, take the 1st expression that has a
    type (if any) and type check the other expressions against that
    type (possibly assigning types to untyped placeholders/exact
    expressions in that process, as per rule 2/3). If there are "then"
    clauses with no types after this, a typing error is reported.
11. `TYPEASSERT_NUMERIC(<expression>, <type>)` accepts an expression of type
    `exact` as first argument and a numeric type name as 2nd
    argument. If at run-time the value of the expression fits into the
    specified type (at least preserving the amplitude for float, and
    without any information loss for integer and decimal), the value
    of the expression is returned, cast to the type. Otherwise, a
    SQL error is generated.
12. `TYPEASSERT_STRING(<expression>, <type>)` accepts an expression of
    type `string` as first argument and a type with a possible
    conversion from string as 2nd argument. If at run-time the
    converted value of the expression fits into the specified type
    (the format is correct, and the conversion is at least preserving
    the amplitude for float, and without any information loss for
    integer and decimal), the value of the expression is returned,
    converted to the type. Otherwise, a SQL error is generated.

You can see that Morty is simpler than Rick: there are no sets of type candidates for any expressions.
Another difference is that Morty relies on the introduction of a
guarded implicit cast. This is because of the following cases:

```sql
(1) INSERT INTO t(int_col) VALUES (4.5)
```

This is a type error in Rick. Without Morty's rule 2, and with a "blind"
implicit cast instead, this would insert `4`, which would be undesirable. With
rule 2, the semantics become:

```sql
(1) INSERT INTO t(int_col) VALUES (TYPEASSERT_NUMERIC(4.5, int)[int])
```

And this would fail, as desired.

`Exact` is obviously not supported by the pgwire protocol, or by
clients, so we'd report `numeric` when `exact` has been inferred for a
placeholder.

Similarly, and in a fashion compatible with many SQL engines, string
values are autocasted when there is no ambiguity (rule 3); for
example:

```sql
(1b) INSERT INTO t(timestamp_col) VALUES ('2012-02-01 01:02:03')

-- gets replaced by:

(1b) INSERT INTO t(timestamp_col) VALUES (TYPEASSERT_STRING('2012-02-01 01:02:03', timestamp)[timestamp])

-- which succeeds, and

(1c) INSERT INTO t(timestamp_col) VALUES ('4.5')

-- gets replaced by:

(1c) INSERT INTO t(timestamp_col) VALUES (TYPEASSERT_STRING('4.5', timestamp)[timestamp])

-- which fails at run-time.
```

Morty's rule 3 is proposed for convenience, observing that
once the SQL implementation starts to provide custom / extended types,
clients may not support a native wire representation for them. It can
be observed in many SQL implementations that clients will pass values
of "exotic" types (interval, timestamps, ranges, etc) as strings,
expecting the Right Thing to happen automatically. Rule 3 is
our proposal to go in this direction.

Rule 3 is restricted to literals however, because we probably don't
want to support things like `insert into x(timestamp_column) values
(substring(...) || 'foo')` without an explicit cast to make the
intention clear.


Regarding typing of placeholders:

```sql
(2) INSERT INTO t(int_col) VALUES ($1)
(3) INSERT INTO t(int_col) VALUES ($1 + 1)
```

In `(2)`, `$1` is inferred to be `int`. Passing the value `"4.5"` for
`$1` in `(2)` would be a type error during execute.

In `(3)`, `$1` is inferred to be `exact` and reported as `numeric`; the
client can then send numbers as either int, floats or decimal down the
wire during execute. (We propose to change the parser to accept any
client-provided numeric type for a placeholder when the AST expects
exact.)

Meanwhile, because the expression `$1 + 1` is also
`exact`, the semantics are automatically changed to become:

```sql
(3) INSERT INTO t(int_col) VALUES (TYPEASSERT_NUMERIC($1 + 1, int)[int])
```

This way the statement only effectively succeeds when the client
passes integers for the placeholder.

Although another type system could have chosen to infer `int` for `$1`
based on the appearance of the constant 1 in the expression, the true
strength of Morty comes with statements of the following form:

```sql
(4) INSERT INTO t(int_col) VALUES ($1 + 1.5)
```

Here `$1` is typed `exact`, clients see `numeric`, and thanks to the
type assertion, using `$1 = 3.5` for example will actually succeed
because the result fits into an int.

Typing of constants as `exact` seems to come in handy in some
situations that Rick didn't handle very well:

```sql
SELECT ($1 + 2) + ($1 + 2.5)
```

Here Rick would throw a type error for `$1`, whereas Morty infers `exact`.

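
The run-time check behind `TYPEASSERT_NUMERIC` (rule 11), as relied upon by statements (1), (3) and (4), might look like the Go sketch below. Several things here are assumptions made for illustration: the function names are hypothetical, `exact` values are modeled with `big.Rat`, `int` is taken to be a 64-bit integer, and "preserving the amplitude" for floats is read as "the value neither overflows to infinity nor underflows to zero".

```go
package main

import (
	"fmt"
	"math"
	"math/big"
)

// typeAssertInt accepts an exact value only if it is an integer that
// fits in 64 bits, so that no information is lost.
func typeAssertInt(lit string) (int64, error) {
	v, ok := new(big.Rat).SetString(lit)
	if !ok {
		return 0, fmt.Errorf("invalid numeric value %q", lit)
	}
	if !v.IsInt() || !v.Num().IsInt64() {
		return 0, fmt.Errorf("value %s cannot be represented as int without loss", lit)
	}
	return v.Num().Int64(), nil
}

// typeAssertFloat accepts an exact value if its magnitude survives the
// conversion; rounding of the mantissa is tolerated.
func typeAssertFloat(lit string) (float64, error) {
	v, ok := new(big.Rat).SetString(lit)
	if !ok {
		return 0, fmt.Errorf("invalid numeric value %q", lit)
	}
	f, _ := v.Float64()
	if math.IsInf(f, 0) || (f == 0 && v.Sign() != 0) {
		return 0, fmt.Errorf("value %s is out of range for float", lit)
	}
	return f, nil
}

func main() {
	fmt.Println(typeAssertInt("4.5"))       // statement (1): rejected
	fmt.Println(typeAssertInt("5"))         // statement (4) with $1 = 3.5
	fmt.Println(typeAssertFloat("1.5"))     // accepted
	fmt.Println(typeAssertFloat("1e10000")) // rejected: overflows float
}
```
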
## Examples of Morty's behavior

```sql
create table t (x float);
insert into t(x) values (3 / 2)
```

`3/2` gets typed as `3::exact / 2::exact`, division gets exact 1.5,
then exact gets autocasted to float for insert (because float
preserves the amplitude of 1.5).

```sql
create table u (x int);
insert into u(x) values (((9 / 3) * (1 / 3))::int)
```

`(9/3)*(1/3)` gets typed and computes down to exact 1, then exact
gets cast to int as requested.

Note that in this specific case the cast is not required any more
because the implicit conversion from exact to int would take place
anyways.

```sql
create table t (x float);
insert into t(x) values (1e10000 * 1e-9999);
```

Numbers get typed and cast as exact, multiplication
evaluates to exact 10, this gets autocasted back to float for insert.

```sql
select length(E'\\000a'::bytea || 'b'::text)
```

Type error, concat only works for homogeneous types.

```sql
select floor($1 + $2)
```

Type error, ambiguous resolve for `+`.
This can be fixed by `floor($1::float + $2)`, then there's only
one type remaining for `$2` and all is well.

```sql
f(int) -> int
f(float) -> float
g(int) -> int
prepare a as select g(f($1))
```

Ambiguous, tough luck. Try with `g(f($1::int))` then all is well.

```sql
prepare a as select ($1 + 2)
execute a(1.5)
```

`2` typed as exact, so `$1` too. `numeric` reported to client, then
`a(1.5)` sends `1.5` down the wire, all is well.

```sql
create table t (x int, s text);
insert into t (x, s) values ($1, "hello " + $1::text)
```

`$1` typed during first phase by collecting the hint `::text`:

```sql
insert into t (x, s) values ($1[text], "hello "[text] + $1::text)
```

Then during type checking, text is found where int is expected in the
1st position of `values`, and typing fails. The user can force the
typing for `int` by using explicit hints:

```sql
create table t (x int, s text);
insert into t (x, s) values ($1::int, "hello " + $1::int::text)
```

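
The first-pass handling of placeholder casts that both statements above rely on might be sketched in Go as follows. The `castHint` type and `collectPlaceholderTypes` function are hypothetical names; the sketch assumes the immediate casts have already been gathered from the AST in traversal order.

```go
package main

import "fmt"

// Type is a hypothetical stand-in for a SQL type name.
type Type string

// castHint is a hypothetical record of one `$n::T` occurrence, where T
// is the type of the cast applied directly to the placeholder.
type castHint struct {
	Placeholder string
	Type        Type
}

// collectPlaceholderTypes records the type given by the first immediate
// cast seen for each placeholder and rejects a later conflicting cast.
func collectPlaceholderTypes(hints []castHint) (map[string]Type, error) {
	types := make(map[string]Type)
	for _, h := range hints {
		if prev, ok := types[h.Placeholder]; ok && prev != h.Type {
			return nil, fmt.Errorf("conflicting types for %s: %s and %s",
				h.Placeholder, prev, h.Type)
		}
		types[h.Placeholder] = h.Type
	}
	return types, nil
}

func main() {
	// values ($1, "hello " + $1::text): the only hint is text.
	fmt.Println(collectPlaceholderTypes([]castHint{{"$1", "text"}}))
	// values ($1::int, "hello " + $1::int::text): both hints are int.
	fmt.Println(collectPlaceholderTypes([]castHint{{"$1", "int"}, {"$1", "int"}}))
	// A hypothetical statement mixing $1::int and $1::text is rejected.
	fmt.Println(collectPlaceholderTypes([]castHint{{"$1", "int"}, {"$1", "text"}}))
}
```

In the first statement the pass succeeds and assigns `text`; the failure described above only appears later, when that type meets the `int` column.
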
Regarding case statements:

```sql
prepare a as select 3 + case (4) when 4 then $1 end
```

Because the only `then` clause has no type, typing fails.
The user can fix this by providing a type hint. However, with:


```sql
prepare a as select 3 + case (4) when 4 then $1 else 42 end
```

`42` gets typed as `exact`, so `exact` is assumed for the other `then` branches
including `$1` which gets typed as `exact` too.

Indirect overload resolution:

```sql
f:int,int->int
f:float,float->int
PREPARE a AS SELECT f($1, $2), $2::float
```

Morty sees `$2::float` first, thus types `$2` as float then `$1` as
float too by rule 5. Likewise:


```sql
PREPARE a AS SELECT $1 + 4 + $1::int
```

Morty sees `$1::int` first, then autocasts 4 to `int` and the
operation is performed on int arguments.

## Alternatives around Morty
1522
1523
Morty is an asymmetric algorithm: how much an how well
1524
the type of a placeholder is typed depends on the order of syntax
1525
elements. HFor example:
1526
1527
```sql
1528
f : int -> int
1529
INSERT INTO t (a, b) VALUES (f($1), $1 + $2)
Feb 17, 2016
1530
-- succeeds (f types $1::int first, then $2 gets typed int),
1531
-- however:
1532
INSERT INTO t (b, a) VALUES ($1 + $2, f($1))
1533
-- fails with ambiguous typing for $1+$2, f not visited yet.
1534
```

Of course we could explain this in documentation and suggest the use
of explicit casts in ambiguous contexts. However, if this property is
deemed too uncomfortable to expose to users, we could make the
algorithm iterative and repeatedly apply Morty's rule 5 to all
expressions as long as it manages to type new placeholders. This way:

```sql
INSERT INTO t (b, a) VALUES ($1 + $2, f($1))
--                              ^  fail, but continue
--                           $1 + $2, f($1)      continue
--                                    ^
--                           ....   , f($1::int)  now retry
--
--                           $1::int + $2, ...
--                                   ^ aha! new information
--                           $1::int + $2::int, f($1::int)

-- all is well!
```

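
The iterative variant described above can be sketched as a fixpoint loop. This is only an illustration under strong assumptions: the `expr` type, `typeOnce` and `typeAll` are hypothetical names, functions have a single overload with one argument, and rule 5 is reduced to "both operands of a binary operator share one type".

```go
package main

import "fmt"

// expr is a hypothetical, heavily simplified expression: either a call to a
// single-overload function, or a binary operation on two placeholders.
type expr struct {
	fnArg    string // placeholder passed to the function, if this is a call
	fnType   string // argument type required by that function
	lhs, rhs string // placeholders of the binary operator otherwise
}

// typeOnce tries to type e with what is known so far and records any newly
// inferred placeholder type. It reports whether e could be typed.
func typeOnce(e expr, known map[string]string) bool {
	if e.fnArg != "" {
		if t, ok := known[e.fnArg]; ok && t != e.fnType {
			return false // conflicting types
		}
		known[e.fnArg] = e.fnType
		return true
	}
	l, lok := known[e.lhs]
	r, rok := known[e.rhs]
	switch {
	case lok && rok:
		return l == r
	case lok:
		known[e.rhs] = l
		return true
	case rok:
		known[e.lhs] = r
		return true
	}
	return false // ambiguous for now, retry on a later pass
}

// typeAll keeps revisiting the untyped expressions for as long as a pass
// manages to type at least one more of them.
func typeAll(exprs []expr) (map[string]string, bool) {
	known := map[string]string{}
	done := make([]bool, len(exprs))
	for progress := true; progress; {
		progress = false
		for i, e := range exprs {
			if !done[i] && typeOnce(e, known) {
				done[i] = true
				progress = true
			}
		}
	}
	for _, d := range done {
		if !d {
			return known, false
		}
	}
	return known, true
}

func main() {
	// VALUES ($1 + $2, f($1)) with f : int -> int
	types, ok := typeAll([]expr{
		{lhs: "$1", rhs: "$2"},
		{fnArg: "$1", fnType: "int"},
	})
	fmt.Println(ok, types["$1"], types["$2"])
}
```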
## Implementation notes

(These may evolve as the RFC gets implemented. This section
is likely to become outdated a few months after the RFC gets accepted.)

1. All AST nodes (produced by the parser) implement `Expr`.

   `INSERT`, `SELECT`, `UPDATE` nodes become visitable by
   visitors. This will unify the way we do processing on the AST.
2. The ``TypeCheck`` method from ``Expr`` becomes a separate
   visitor. Expr gets a ``type`` field populated by this visitor. This
   will make it clear when type inference and type checking have run
   (and that they run only once). This is in contrast with
   ``TypeCheck`` being called at random times by random code.
3. During typing there will be a need for a data structure to collect
   the type candidate sets per AST node (``Expr``) and
   placeholder. This should be done using a separate map, where either
   AST nodes or placeholder names are keys.
4. Semantic analysis will be done as a new step doing constant
   folding, type inference, type checking.

The semantic analysis will thus look like:

```go
type placeholderTypes = map[ValArg]Type

// semanticAnalysis mutates the tree, populating the .type field of each
// node, and returns the types inferred for the placeholders.
func semanticAnalysis(root Expr) (placeholderTypes, error) {
	// Fold constant expressions that involve only untyped literals.
	untypedFolder := UntypedConstantFoldingVisitor{}
	untypedFolder.Visit(root)

	// Type checking and type inference combined.
	typeChecker := TypeCheckVisitor{}
	if err := typeChecker.Visit(root); err != nil {
		// Report the ambiguity or typing error to the caller.
		return nil, err
	}
	assignments := typeChecker.GetPlaceholderTypes()

	// Optional in Morty: fold the constants that are now typed.
	constantFolder := ConstantFoldingVisitor{}
	constantFolder.Visit(root)

	return assignments, nil
}
```

When sending values over pgwire during bind, the client sends the
arguments positionally. For each argument, it specifies a "format"
(different from a type). The format can be binary or text, and
specifies the encoding of that argument. Every type has a text
encoding; only some also have binary encodings. The client does not
send an OID back, or anything to identify the type. So the server just
needs to parse whatever it got assuming the type it previously
inferred.

The issue of parsing these arguments is not really a typing
issue. Formally Morty (and Rick, its alternative) just assumes that it gets whatever
type it asked for. Whoever implements the parsing of these arguments
(our pgwire implementation) uses the same code/principles as a
`TYPEASSERT_STRING` (but this has nothing to do with the AST of our
query, which ideally should have been already saved from the prepare
phase).

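
To illustrate the point about bind arguments, here is a minimal sketch of decoding a text-format argument using only the type inferred at prepare time. The function `parseTextArg` and the type names it switches on are hypothetical and are not part of any pgwire implementation; binary formats are left out.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseTextArg decodes a text-format bind argument. The client sends no type
// information, so the only guide is the type inferred for the placeholder.
// Both the function and the type names are assumptions of this sketch.
func parseTextArg(inferred string, raw []byte) (interface{}, error) {
	s := string(raw)
	switch inferred {
	case "int":
		return strconv.ParseInt(s, 10, 64)
	case "float":
		return strconv.ParseFloat(s, 64)
	case "bool":
		return strconv.ParseBool(s)
	case "string":
		return s, nil
	}
	return nil, fmt.Errorf("unsupported placeholder type %q", inferred)
}

func main() {
	v, err := parseTextArg("int", []byte("42"))
	fmt.Println(v, err)

	// The same bytes fail to parse if the inferred type does not match.
	_, err = parseTextArg("int", []byte("4.2"))
	fmt.Println(err != nil)
}
```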
## Overview of Rick

The precursor of, and an alternative to, Morty was called *Rick*. We
present it here to keep historical records and possibly serve as another
point of reference if the topic is revisited in the future.

- Rick is an iterative (multiple traversals) algorithm that tries
  harder to find a type for placeholders that accommodates all their
  occurrences;

- Rick allows for flexible implicit conversions;

- Rick really can't work without constant folding to simplify complex
  expressions involving only constants;

- Rick tries to "optimize" the type given to a literal constant
  depending on context.

## Proposed typing strategy for Rick

We use the following notation below:

    E :: T  => the regular SQL cast, equivalent to `CAST(E as T)`
    E [T]   => an AST node representing `E`
               with an annotation that indicates it has type T

For conciseness, we also introduce the notation E[\*N] to mean that
`E` has an unknown number type (`int`, `float` or `decimal`).

We assume that an initial/earlier phase has performed the reduction of
casted placeholders (but only placeholders!), that is, folding:

    $1::T      => $1[T]
    x::T       => x :: T  (for any x that is not a placeholder)
    $1::T :: U => $1[T] :: U
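
The three folding rules above can be sketched on a toy AST. The node types (`Placeholder`, `Cast`, `Ident`) and the function `foldPlaceholderCasts` are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// Expr is a minimal, hypothetical AST interface for this sketch.
type Expr interface{ String() string }

// Placeholder is $N; Type is its annotation, empty while still unknown.
type Placeholder struct {
	Name string
	Type string
}

// Cast is the SQL cast Operand::Type.
type Cast struct {
	Operand Expr
	Type    string
}

// Ident stands for any expression that is not a placeholder.
type Ident struct{ Name string }

func (p *Placeholder) String() string {
	if p.Type == "" {
		return p.Name
	}
	return p.Name + "[" + p.Type + "]"
}
func (c *Cast) String() string  { return c.Operand.String() + " :: " + c.Type }
func (i *Ident) String() string { return i.Name }

// foldPlaceholderCasts rewrites $1::T into $1[T] and keeps every other cast.
func foldPlaceholderCasts(e Expr) Expr {
	c, ok := e.(*Cast)
	if !ok {
		return e
	}
	// A cast applied directly to a bare placeholder becomes an annotation.
	if p, ok := c.Operand.(*Placeholder); ok && p.Type == "" {
		return &Placeholder{Name: p.Name, Type: c.Type}
	}
	// Otherwise the cast stays, and folding continues inside its operand.
	return &Cast{Operand: foldPlaceholderCasts(c.Operand), Type: c.Type}
}

func main() {
	p := &Placeholder{Name: "$1"}
	fmt.Println(foldPlaceholderCasts(&Cast{Operand: p, Type: "T"}))
	fmt.Println(foldPlaceholderCasts(&Cast{Operand: &Ident{Name: "x"}, Type: "T"}))
	fmt.Println(foldPlaceholderCasts(&Cast{Operand: &Cast{Operand: p, Type: "T"}, Type: "U"}))
}
```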

Then we type using the following phases, detailed below:

- Phase 1: constant folding for untyped constants (mandatory)
- Phases 2-6: type assignment and checking
- Phase 7: constant folding for remaining typed constants (optional)

The details:

1. Constant folding.

   This reduces complex expressions without losing information (like
   in [Go](https://blog.golang.org/constants)!). Literal constants are
   evaluated using either their type, if intrinsically known (for
   unambiguous literals like true/false, strings, byte arrays), or an
   internal exact implementation type for ambiguous literals
   (numbers). This is performed for all expressions involving only
   untyped literals and function applications applied only to such
   expressions. For number literals, the implementation type from the
   [go/constant](https://golang.org/pkg/go/constant/) arithmetic library
   can be used.

   While the constant expressions are folded, the results must be typed
   using either the known type if any operands had one, or the unknown
   numeric type when none of the operands had a known type.

   For example:

       true and false               => false[bool]
       'a' + 'b'                    => 'ab'[string]
       12 + 3.5                     => 15.5[*N]
       case 1 when 1 then x         => x[?]
       case 1 when 1 then 2         => 2[*N]
       3 + case 1 when 1 then 2     => 5[*N]
       abs(-2)                      => 2[*N]
       abs(-2e10000)                => 2e10000[*N]

   Note that folding does not take place for functions/operators that
   are overloaded and when the operands have different types (we
   might resolve type coercions at a later phase):

       23 + 'abc'                   => 23[*N] + 'abc'[string]
       23 + sin(23)                 => 23[*N] + -0.8462204041751706[float]

   Folding does "as much work as possible", for example:

       case x when 1 + 2 then 3 - 4 => (case x[?] when 3[*N] then -1[*N])

   Note that casts select a specific type, but may stop the fold
   because the surrounding operation becomes applied to different
   types:

       true::bool and false         => false[bool] (both operands of "and" are bool)
       1::int + 23                  => 1[int] + 23[*N]
       (2 + 3)::int + 23            => 5[int] + 23[*N]

   Constant function evaluation only takes place for a limited
   subset of supported functions: they need to be pure and have an
   implementation for the exact type.

2. Culling and candidate type collection.

   This phase collects candidate types for AST nodes, does a
   pre-selection of candidates for overloaded calls and computes
   intersections.

   This is a depth-first, post-order traversal. At every node:

   1. the candidate types of the children are computed first

   2. the current node is looked at, some candidate overloads may be
      filtered out

   3. in case of call to an overloaded op/fun, the argument types
      are used to restrict the candidate set of the direct child
      nodes (set intersection)

   4. if the steps above determine there are no
      possible types for a node, fail as a typing error.

   (Note: this is probably a point where we can look at implicit
   coercions)

   Simple example:

       5[int] + 23[*N]

   This filters the candidates for + to only the one taking `int` and
   `int` (rule 2). Then by rule 2.3 the annotation on 23 is changed,
   and we obtain:

       ( 5[int] + 23[int] )[int]

   Another example:

       f:int->int
       f:float->float
       f:string->string
       (12 + $1) + f($1)

   We type as follows:

       (12[*N] + $1) + f($1)
        ^

       (12[*N] + $1[*N]) + f($1[*N])
                 ^
       -- Note that the placeholders in the AST share
       their type annotation between all their occurrences
       (this is unique to them, e.g. literals have
       separate type annotations)

       (12[*N] + $1[*N])[*N] + f($1[*N])
               ^

       (12[*N] + $1[*N])[*N] + f($1[*N])
                                 ^
           (nothing to do anymore)

       (12[*N] + $1[*N])[*N] + f($1[*N])
                               ^

   At this point, we are looking at `f($1[int,float,decimal,...])`.
   Yet the only overloads of f that accept a numeric argument are those
   for int and float, therefore, we restrict
   the set of candidates to those allowed by the type of $1 at that
   point, and that reduces us to:

       f:int->int
       f:float->float

   And the typing continues, restricting the type of $1:

       (12[*N] + $1[int,float])[*N] + f($1[int,float])
                 ^^                   ^ ^^

       (12[*N] + $1[int,float])[*N] + f($1[int,float])[int,float]
                                      ^ ^^

       (12[*N] + $1[int,float])[*N] + f($1[int,float])[int,float]
                                    ^

   Aha! Now the plus sees an operand on the right more restricted
   than the one on the left, so it filters out all the inapplicable
   candidates, and only the following are left over:

       +: int,int->int
       +: float,float->float

   And thus this phase completes with:

       ((12[*N] + $1[int,float])[int,float] + f($1[int,float])[int,float])[int,float]
                                ^^          ^

   Notice how the restrictions only apply to the direct children
   nodes when there is a call and are not pushed further down (e.g. to
   `12[*N]` in this example).

3. Repeat step 2 as long as there is at least one candidate set with more
   than one type, and until the candidate sets do not evolve any more.

   This simplifies the example above to:

       ((12[int,float] + $1[int,float])[int,float] + f($1[int,float])[int,float])[int,float]

4. Refine the type of numeric constants.

   This is a depth-first, post-order traversal.

   For every constant with more than one type in its candidate type
   set, pick the best type that can represent the constant: we use
   the preference order `int`, `float`, `decimal`
   and pick the first that can represent the value we've computed.

   For example:

       12[int,float] + $1[int,float] => 12[int] + $1[int,float]

   The reason why we consider constants here (and not placeholders) is
   that the programmers express an intent about typing in the form of
   their literals. That is, there is a special meaning expressed by
   writing "2.0" instead of "2". (Weak argument?)

   Also see section
   [Implementing Rick](#implementing-rick-untyped-numeric-literals).

5. Run steps 2 and 3 again. This will refine the type of placeholders
   automatically.

6. If there is any remaining candidate type set with more than one
   candidate, fail with an ambiguity error.

7. Perform further constant folding on the remaining constants that now have a specific type.

## Revisiting the examples from earlier with Rick

From section [Examples that go wrong (arguably)](#examples-that-go-wrong-arguably):

```sql
prepare a as select 3 + case (4) when 4 then $1 end
-- 3[*N] + $1[?]       (rule 1)
-- 3[*N] + $1[*N]      (rule 2)
-- 3[int] + $1[*N]     (rule 4)
-- 3[int] + $1[int]    (rule 2)
--OK

create table t (x decimal);
insert into t(x) values (3/2)
-- (3/2)[*N]           (rule 1)
-- (3/2)[decimal]      (rule 2)
--OK

create table u (x int);
insert into u(x) values (((9 / 3) * (1 / 3))::int)
-- (3 * (1/3))::int    (rule 1)
-- 1::int              (rule 1)
-- 1[int]              (rule 1)
--OK

create table t (x float);
insert into t(x) values (1e10000 * 1e-9999)
-- 10[*N]              (rule 1)
-- 10[float]           (rule 2)
--OK

select length(E'\\000' + 'a'::bytes)
-- E'\\000'[string] + 'a'[bytes]     (input, pretype)
-- then failure, no overload for + found
--OK

select length(E'\\000a'::bytes || 'b'::string)
-- E'\\000a'[bytes] || 'b'[string]
-- then failure, no overload for || found
--OK
```

Fancier example that shows the power of the proposed
type system, with an example where Postgres would
give up:

```sql
f:int,float->int
f:string,string->int
g:float,decimal->int
g:string,string->int
h:decimal,float->int
h:string,string->int
prepare a as select f($1,$2) + g($2,$3) + h($3,$1)
--                  ^
--                  f($1[int,string],$2[float,string]) + ....
--                                                       ^
--                  f(...)+g($2[float,string],$3[decimal,string]) + ...
--                                                                  ^
--                  f(...)+g(...)+h($3[decimal,string],$1[string])
--                               ^
-- (step 2 re-iterates)
--                  f($1[string],$2[string]) + ...
--                                             ^
--                  f(...)+g($2[string],$3[string]) + ...
--                                                    ^
--                  f(...)+g(...)+h($3[string],$1[string])
--                               ^

-- (iteration stops, all types have been resolved)

-- => $1, $2, $3 must be strings
```

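
The repeated culling of steps 2 and 3 can be sketched on the `f`/`g`/`h` example. This is one possible reading of those steps, not the RFC's implementation: the names `call`, `typeSet`, `overloads`, `cull` and `resolve` are hypothetical, only calls on placeholders are modeled, and intermediate candidate sets may differ from the trace above even though the final assignment is the same.

```go
package main

import (
	"fmt"
	"sort"
)

// call is a hypothetical AST node: a function applied to placeholders.
type call struct {
	fn   string
	args []string
}

// typeSet is a set of candidate type names; nil means "unconstrained".
type typeSet map[string]bool

// overloads maps a function name to the argument types of each overload.
type overloads map[string][][]string

// cull drops the overloads of c that contradict the current candidate sets,
// then narrows every argument to the types accepted by the survivors.
// It reports whether any candidate set changed.
func cull(c call, ovs overloads, cands map[string]typeSet) bool {
	allowed := make([]typeSet, len(c.args))
	for i := range allowed {
		allowed[i] = typeSet{}
	}
	for _, ov := range ovs[c.fn] {
		fits := true
		for i, arg := range c.args {
			if set := cands[arg]; set != nil && !set[ov[i]] {
				fits = false
			}
		}
		if fits {
			for i, t := range ov {
				allowed[i][t] = true
			}
		}
	}
	changed := false
	for i, arg := range c.args {
		// Surviving types are always a subset of the old set, so a
		// difference in size is enough to detect a change.
		if old := cands[arg]; old == nil || len(old) != len(allowed[i]) {
			changed = true
		}
		cands[arg] = allowed[i]
	}
	return changed
}

// resolve repeats the culling pass until the candidate sets stop evolving.
func resolve(calls []call, ovs overloads) map[string]typeSet {
	cands := map[string]typeSet{}
	for changed := true; changed; {
		changed = false
		for _, c := range calls {
			if cull(c, ovs, cands) {
				changed = true
			}
		}
	}
	return cands
}

func main() {
	ovs := overloads{
		"f": {{"int", "float"}, {"string", "string"}},
		"g": {{"float", "decimal"}, {"string", "string"}},
		"h": {{"decimal", "float"}, {"string", "string"}},
	}
	calls := []call{
		{"f", []string{"$1", "$2"}},
		{"g", []string{"$2", "$3"}},
		{"h", []string{"$3", "$1"}},
	}
	cands := resolve(calls, ovs)
	for _, p := range []string{"$1", "$2", "$3"} {
		var names []string
		for t := range cands[p] {
			names = append(names, t)
		}
		sort.Strings(names)
		fmt.Println(p, names)
	}
}
```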
## Drawbacks of Rick

The following example types differently from PostgreSQL:

```sql
select (3 + $1) + ($1 + 3.5)
-- (3[*N] + $1[*N]) + ($1[*N] + 3.5[*N])       rule 2
-- (3[int] + $1[*N]) + ($1[*N] + 3.5[float])   rule 4
-- (3[int] + $1[int]) + ...
-- (3[int] + $1[int]) + ($1[int] + 3.5[float])
--                               ^  failure, unknown overload
```

Here Postgres would infer "decimal" for `$1` whereas our proposed
algorithm fails.

The following situations are not handled, although they were mentioned
in section
[Examples that go wrong (arguably)](#examples-that-go-wrong-arguably)
as possible candidates for an improvement:

```sql
select floor($1 + $2)
-- $1[*N] + $2[*N]     (rule 2)
-- => failure, ambiguous types for $1 and $2

f(int) -> int
f(float) -> float
g(int) -> int
prepare a as select g(f($1))
-- $1[int,float]     (rule 2)
-- => failure, ambiguous types for $1
```

## Alternatives around Rick (other than Morty)

There are cases where the type inference doesn't quite work, like

    floor($1 + $2)
    g(f($1))
    CASE a
      WHEN 1 THEN 'one'
      WHEN 2 THEN
         CASE language
           WHEN 'en' THEN $1
         END
    END

Another category of failures involves dependencies between choices of
types. E.g.:

    f: int,int->int
    f: float,float->int
    f: char, string->int
    g: int->int
    g: float->int
    h: int->int
    h: string->int

    f($1, $2) + g($1) + h($2)

Here the only possibility is `$1[int], $2[int]` but the algorithm is not
smart enough to figure that out.

To support these, one might
suggest making Rick super-smart via
the application of a "bidirectional" typing algorithm, where
the allowable types in a given context guide the typing of
sub-expressions. These are akin to constraint-driven typing and a number
of established algorithms exist, such as Hindley-Milner.

The introduction of a more powerful typing system would certainly
attract attention to CockroachDB and probably attract a crowd of
language enthusiasts, with possible benefits in terms of external
contributions.

However, from a practical perspective, more complex type systems are
also more complex to implement and troubleshoot (they are usually
implemented functionally and need to be first translated to
non-functional Go code) and may have non-trivial run-time costs
(e.g. extensions to Hindley-Milner to support overloading resolve in
quadratic time).

## Implementing Rick: untyped numeric literals

To implement untyped numeric literals which will enable exact
arithmetic, we will use
https://golang.org/pkg/go/constant/. This will require a
change to our Yacc parser and lexical scanner, which will parse all
numeric-looking values (`ICONST` and `FCONST`) as `NumVal`.

We will then introduce a constant folding pass before type checking is
initially performed (ideally using a folding visitor instead of the
current interface approach). While constant folding these untyped
literals, we can use
[BinaryOp](https://golang.org/pkg/go/constant/#BinaryOp) and
[UnaryOp](https://golang.org/pkg/go/constant/#UnaryOp) to
retain exact precision.

Next, during type checking, ``NumVals`` will be evaluated as their
logical `Datum` types. Here, they will be converted to `int`, `float` or
`decimal`, based on their `Value.Kind()` (e.g. using
[Int64Val](https://golang.org/pkg/go/constant/#Int64Val) or
`decimal.SetString(Value.String())`). Some kinds will result in a
panic because they should not be possible based on our
parser. However, we could eventually introduce Complex literals using
this approach.

Finally, once type checking has occurred, we can proceed with folding
for all typed values and expressions.

Untyped numeric literals become typed when they interact with other
types. E.g.: `(2 + 10) / strpos('hello', 'o')`: 2 and 10 would be
added using exact arithmetic in the first folding phase to
get 12. However, because the constant function `strpos` returns a
typed value, we would not fold its result further in the first phase.
Instead, we would type the 12 to a `DInt` in the type check phase, and
then perform the rest of the constant folding on the `DInt` and the
return value of `strpos` in the second constant folding phase. **Once
an untyped constant literal needs to be typed, it can never become
untyped again.**

## Comments on Rick, leading to Morty

Rick seems both imperfect (it can fail to find the unique type
assignment that makes the expression sound) and complicated. Moreover
one can easily argue that it can infer too much and appear magic.
E.g. the `f($1,$2) + g($2,$3) + h($3,$1)` example where it might be
better to just ask the user to give type hints.

It also makes some pretty arbitrary decisions about programmer intent,
e.g. for `f` overloaded on `int` and `float`, `f((1.5 - 0.5) + $1)`,
the constant expression `1.5 - 0.5` evaluates to an `int` and forces
`$1` to be an `int` too.

The complexity and perhaps excessive intelligence of Rick stimulated a
discussion about the simplest alternative that's still useful for
enough common cases. Morty was born from this discussion: a simple set
of rules operating in two simple passes on the AST; there's no recursion
and no iteration.

## Examples where Morty differs from Rick

```sql
f: int -> int
f: float -> float
SELECT f(1)
```

*M* says it can't choose an overload. *R* would type `1` as `int`.

```sql
f:int->int
f:float->float
f:string->string
PREPARE a AS SELECT (12 + $1) + f($1)
```

*M* infers `exact` and says that `f` is ambiguous for an `exact` argument; *R* infers `int`.

```sql
f:int->int
f:float->float
g:float->int
g:numeric->int
PREPARE a AS SELECT f($1) + g($1)
```

*M* can't infer anything; *R* intersects candidate sets and figures
out `float` for `$1`.

## Implementation notes for Rick

Constant folding for Rick will actually be split in two parts: one
running before type checking and doing folding of untyped
numerical computations, the other running after type checking and
doing folding of any constant expression (typed literals, function
calls, etc.). This is because we want to do untyped computations
before having to figure out types, so we can possibly use the
resulting value when deciding the type (e.g. 3.5 - 0.5 could be
inferred as ``int``).

## Unresolved questions for Rick

Note that some of the reasons why implicit casts would be otherwise
needed go away with the untyped constant arithmetic that we're suggesting,
and also because we'd now have type inference for values used in `INSERT`
and `UPDATE` statements (`INSERT INTO tab (float_col) VALUES (42)` works as
expected). If we choose to have some implicit casts in the language, then the
type inference algorithm probably needs to be extended to rank overload options based on
the number of casts required.

What's the story for `NULL` constants (literals or the result of a
pure function) in Rick? Do they need to be typed?

Generally, do we need to have nullable and non-nullable types?

# Unresolved questions

How much Postgres compatibility is really required?