New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed #33397 -- Corrected resolving output_field for DateField/DateTimeField/TimeField/DurationFields. #15271
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the very detailed breakdown and patch Luke!
These happen because QuerySet.filter() just builds a SQL expression that it passes on to the database, without ever using output_field(), in contrast to annotate().
Right, that's an unfortunate side effect of lookups not being resolved like other expressions are.
I thought about changing the wrong logic in BaseExpression._resolve_output_field, but it would be a nightmare, because it’s a base class and too much code assumes the current behaviour. For example, we’ve got many functions in django.db.models.functions that inherit this behaviour via Func and would have to be fixed up, and worse are all the 3rd party functions that would be affected.
I think that this is something we should ultimately do though if we decide to venture down this road and properly address this issue. It doesn't necessarily have to be done in this PR/ticket but I think that the general consensus is that this pattern is flawed.
Adding special cases for combining None with another numeric type
I'm not convinced there is value in special casing None
in this PR tbh as that's a whole can of worms if you consider nullable fields. I think this could be a standalone ticket.
It was then easy to add the following behaviours, which seem to match what all the databases do (I mostly checked on Postgres, the tests cover at least some of these).
DateField + DurationField -> DateTimeField
Not sure if this is already covered by tests and I realized we document this caveat but I do wonder how things will fare for non-PostgreSQL backends in this case.
django/docs/ref/models/fields.txt
Lines 757 to 761 in 4400d85
.. note:: | |
Arithmetic with ``DurationField`` works in most cases. However on all | |
databases other than PostgreSQL, comparing the value of a ``DurationField`` | |
to arithmetic on ``DateTimeField`` instances will not work as expected. |
Therefore, I think the best route is that we don’t attempt type inference for any combination involving DateField/ DateTimeField/ DurationField combined with Value(None), but raise FieldError for all these cases.
We reached the same conclusion I believe :)
However, to avoid people having to specify output_field() for such custom fields, perhaps we could document and expose the register_combinable_fields function as a public API.
Not against documenting register_combinable_fields
but I'd suggest we start by deprecating the current BaseExpression._resolve_output_field
behaviour and document that output_field
must be provided. During the deprecation cycle we should get a bit of feedback about the necessity of documenting register_combinable_fields
Adding such an API is tracked in ticket-26355 which had #6395 as a tentative PR.
# This guess is mostly a bad idea, but there is quite a lot of code | ||
# (especially 3rd party Func subclasses) that depend on it, we'd need a | ||
# deprecation path to fix it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely agree that this was a bad idea, it shouldn't be too hard to deprecate and I assume it wouldn't be too disruptive if we defined explicit output_field
(or proper logic) for all built-in expressions.
This is something ticket-30446/#11359 forced us to do in a few instances already anyway (e.g. ticket-33114)
I added that because the original logic supported that case, and a test fails without it: django/tests/annotations/tests.py Line 383 in 0ab58c1
I'd rather not have it at all, but it feels like if we've got a test then we've got some kind of commitment to preserving that behaviour. |
Makes sense, thanks for the clarifications. @django/triage-review any other thoughts on the current state of the PR? To me this seems relatively fit for inclusion as is assuming we don't want to document |
I rebased and apply |
I moved tests that work without this patch to separate PR, see #15556. |
Rebased after merging 04ad0f2. |
5381e9d
to
dbbf44f
Compare
@spookylukey Thanks 👍 I moved adding |
…meField/TimeField/DurationFields. This includes refactoring of CombinedExpression._resolve_output_field() so it no longer uses the behavior inherited from Expression of guessing same output type if argument types match, and instead we explicitly define the output type of all supported operations. This also makes nonsensical operations involving dates (e.g. date + date) raise a FieldError, and adds support for automatically inferring output_field for cases such as: * date - date * date + duration * date - duration * time + duration * time - time
Summary
This PR attempts to address ticket 33398:
Refactors type inference (i.e. automatic calculation of
output_field
) forCombinedExpression
so that it doesn’t use the guessing logic from its base class, but explicitly lists all the combinations it supports.Corrects inference of
output_field
involving nonsensical date/duration operations, such asdate + date
, so that they raiseFieldError
.Corrects inference of
output_field
for expressions using well-defined date/duration operations, such asdate + duration
, so that you don’t need to specifyoutput_field
manually.Preamble
Sorry for the length of this PR, this turned out to be very complex!
All of this work affects only cases where we are trying to resolve the type of an expression within Django. That means it doesn’t affect various cases involving
F
expressions:Without this patch, you can do
QuerySet.filter(some_date_field=F('other_date_field') + Value(timedelta(days=1)))
and it works as expected.With this patch, if you do
QuerySet.filter(nonsense=F('some_date_field') + F('other_date_field')
, you do not get aFieldError
, but usually get an error later on, depending on your database.These happen because
QuerySet.filter()
just builds a SQL expression that it passes on to the database, without ever usingoutput_field()
, in contrast toannotate()
. For testing, make sure you are using.annotate()
or.alias()
, not.filter()
, and be careful when reading existing unit tests, since the names of the tests typically don’t distinguish between these cases.In the text below, in most cases I will use “date” informally to cover both Python
date
/DjangoDateField
and Pythondatetime
/DjangoDateTimeField
, and sometimesTimeField
as well which usually works similarly.Existing behaviour and root causes
The existing behaviour on
BaseExpression._resolve_output_field
is as follows:An expression where all the types of the sub-expressions are the same can be assumed to have the same type.
This is at the root of ticket #33397, and causes a lot of problems in general. As an assumption, it is fairly obviously wrong. For example, there is no reason to think that a function call whose arguments are all strings will return a string, and there are obvious counter examples like
LENGTH
. It just happens that it is often true, so it’s basically a guess.When inherited by
CombinedExpression
, this logic is particularly bad. It can’t handle things likedate + duration
, which ought to produce a datetime, and it incorrectly handles things likedate + date
(which does not produce a date, but is an error).There is also some other existing behaviour regarding
None
/NULL
supported by this logic, which we have tests against. So, for example, if you haveannotate(value=F('some_integer_field') + None)
, the type resolves toIntegerField
, on the basis that theNone
gets ignored and the remaining fields are allIntegerField
. This is especially problematic for dates, as discussed below.Approaches
I thought about changing the wrong logic in
BaseExpression._resolve_output_field
, but it would be a nightmare, because it’s a base class and too much code assumes the current behaviour. For example, we’ve got many functions indjango.db.models.functions
that inherit this behaviour viaFunc
and would have to be fixed up, and worse are all the 3rd party functions that would be affected.The next best approach is if
CombinedExpression._resolve_output_field
simply doesn’t usesuper()
at all, and that way we can avoid the bugs, and also avoid any contortions required to work around its incorrect logic.In order to do this, we have to fold into the
CombinedExpression._resolve_output_field
dict any correct behaviour that waspreviously supplied by
BaseExpression._resolve_output_field
. That is trickier than it sounds, because we have to decide which behaviour is “good”, plus we mustn’t forget any combinations.My principles have been:
FieldError
.Numeric operations
There are a lot of edge cases, subtleties and potential bugs here, which I investigated in the course of writing this patch, but for the purposes of this ticket, I think the best approach here is to just try to preserve existing behaviour. That means:
Adding “Same type arguments -> same output type” logic for a bunch of types/operations into the
_connector_combinations
dictionary:ADD/SUB/MUL/DIV/POW/MOD for IntegerField, FloatField, DecimalField
BITAND/BITOR/BITLEFTSHIFT/BITRIGHTSHIFT/BITXOR for IntegerField. In theory, these should be added for
FloatField
andDecimalField
too if we want to exactly preserve previous behaviour, but these operations do not make sense for anything butIntegerField
, and some databases will raise errors in many/most cases. (For example, with Postgres, the only case it doesn’t produce an error for is where you useDecimal()
with an argument that happens to be an integer).Adding special cases for combining
None
with another numeric typeIn this patch, I have preserved the “type promotion” behaviour of ADD/SUB/MUL/DIV, but not added it for the POW/MOD cases (which stood out to me as a probably an accidental omission). For example, you will get “You must set output_field” if you attempt to do
Value(1.5) ** Value(2)
. This could be changed separately once the right behaviour has been decided.I have also left integer-integer division alone, just adding a note that the current behaviour does not match MySQL/Oracle’s behaviour.
NULLs
The patch explicitly reproduces the previous behaviour for things like
something + Value(None)
wheresomething
resolves toIntegerField
,DecimalField
orFloatField
- i.e. we resolve to whateversomething
is.Date and time
With the overall change in strategy done, we get the following desired behaviours for nonsensical operations:
and others.
It was then easy to add the following behaviours, which seem to match what all the databases do (I mostly checked on Postgres, the tests cover at least some of these).
DateField + DurationField -> DateTimeField
DateField - DurationField -> DateTimeField
DateTimeField + DurationField -> DateTimeField
DateTimeField - DurationField -> DateTimeField
TimeField + DurationField -> TimeField
TimeField - DurationField -> TimeField
DurationField + DurationField -> DurationField
DurationField - DurationField -> DurationField
(and some more combos, see code).
NULLs
NULLs with date/duration values is one of the trickiest bits.
For various cases, we have an issue defining what the
output_field
should be if one of the operands is inferred as NULL (e.g.Value(None)
)Example 1
duration + duration -> duration
date + duration -> date
NULL + duration -> ?
Example 2
date - date -> duration
date - duration -> date
date - NULL -> ?
Example 3
duration - duration -> duration
date - duration -> date
NULL - duration -> ?
The fundamental issue here is that
+
and-
operators have been overloaded in ways that make them very unlike numeric operators, but there isn’t much we can do about that.There are some cases where, with the combinations we currently support, we could infer an output type based on the idea that there is only one output type that wouldn’t be an error. For example:
date + NULL
-> we could inferdate
hereduration - NULL
-> we could inferduration
hereHowever, that is based on the assumption that in the future we won’t add more supported combinations. For reference, Postgres operators support some interesting uses of
+
and-
that we don’t support, such asduration - time -> time
, which is weird but we might support it in the future.Therefore, I think the best route is that we don’t attempt type inference for any combination involving
DateField
/DateTimeField
/DurationField
combined withValue(None)
, but raiseFieldError
for all these cases.Another supporting argument for this is that Postgres also raises errors for some combinations, for example
date + NULL
, with essentially the same issue - it cannot decide on which overload to use:So even if we tried to infer
date
fordate + NULL
, we’d have to also add casts for the NULL to get Postgres to not return an error like this.There are some backwards compatibility issues with the approach I’m suggesting, but I think minimal. Affected code looks like this:
..where
x
is sometimesNone
, and sometimes a sensible value (a date/datetime object for the first case, a timedelta for the last two).In these cases, previously Django would work as expected, but with the patch, it will require setting
output_type
explicitly to resolve the ambiguity. I think this is a reasonable trade-off - I think we can legitimately classify these as ambiguous cases where the old behaviour just happens to work. Only with the third case (duration - NULL) could you argue that there is no ambiguity.Other fields
First, are there other fields in Django itself that need to have support for
+
etc., where we were relying on the logic of “same output type if input types match”?The test suite discovered
SearchVectorField
as something I had missed, but there may be others. I created theregister_combinable_fields
interface for this, so that core code doesn’t need to importcontrib.postgres
.Second, this PR could cause breakage for 3rd party fields that allow
+
etc. at the database and for which the logic of “same output type if input types match” was working well. I think this is reasonable given the constraints and the fact that, in general, that logic is incorrect. Also, in many cases, I suspect 3rd party fields will be covered due to inheriting fromIntegerField
orFloatField
, for example.However, to avoid people having to specify
output_field()
for such custom fields, perhaps we could document and expose theregister_combinable_fields
function as a public API.Another option possibility is that we detect 3rd party fields and fall back to the old logic in that case.
For Postgres, the following query may help find operator overloads that we may need to explicitly re-add support for:
I have not checked all these, but wanted to get some feedback first.
End
Thanks for your consideration!