Fixed #28805 -- Added regular expression database functions. #12438
Ok. So I think this is ready for a first round of review.
django/db/models/functions/text.py
Outdated
# FIXME: This emulated version doesn't handle NULL pattern correctly.
expression, pattern, flags = self.source_expressions.copy()
expr = RegexpSubstr(expression, pattern, flags, no_wrap=True)
expr = StrIndex(expression, Coalesce(expr, Value('<<fail>>')))
Emulating REGEXP_INSTR in PostgreSQL is tough. Unfortunately, the empty string has an index of 1, so we need to coalesce to some other sentinel value that will not match to get the expected index of 0 for no match. This breaks when NULL is passed as pattern. We might be able to solve this by wrapping with Case.
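To make the sentinel problem concrete, here is a minimal pure-Python model of the emulation. It is only a sketch: `regexp_substr`, `coalesce`, `str_index` and `emulated_instr` are hypothetical stand-ins for the database functions (using Python's `re` rather than PostgreSQL's regex engine), with `None` playing the role of SQL NULL.

```python
import re

SENTINEL = "<<fail>>"  # same sentinel string as in the snippet under review


def regexp_substr(expression, pattern):
    """Stand-in for a regexp substring lookup: first match, or None (NULL)."""
    if expression is None or pattern is None:
        return None
    match = re.search(pattern, expression)
    return match.group(0) if match else None


def coalesce(*values):
    """First non-None argument, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


def str_index(haystack, needle):
    """1-based position of needle, 0 when absent, None if either side is NULL."""
    if haystack is None or needle is None:
        return None
    return haystack.find(needle) + 1


def emulated_instr(expression, pattern, fallback=SENTINEL):
    """Model of StrIndex(expression, Coalesce(substr, fallback))."""
    found = regexp_substr(expression, pattern)
    return str_index(expression, coalesce(found, fallback))


print(emulated_instr("hello world", "o w"))  # a real match: position 5
print(emulated_instr("hello", "x+"))         # no match: 0, thanks to the sentinel
print(emulated_instr("hello", "x+", ""))     # empty-string fallback wrongly yields 1
print(emulated_instr("hello", None))         # NULL pattern yields 0 where NULL is expected
```

The last call shows the wart discussed here: coalescing hides the NULL that a NULL pattern should propagate.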
It should be possible to do CASE WHEN <pattern> IS NULL THEN NULL ELSE <expr> END in PostgreSQL, but I can't work out if this is possible with Django's Case and When...
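The shape of that CASE guard can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: SQLite's `instr()` stands in for the string-index function, the `substr` parameter stands in for whatever a regexp substring lookup would have returned, and `guarded_index` is a hypothetical helper name. It says nothing about how to spell the same thing with Django's Case and When.

```python
import sqlite3

# CASE guard: a NULL pattern short-circuits to NULL before the sentinel
# coalescing gets a chance to turn it into 0.
SQL = """
SELECT CASE
    WHEN :pattern IS NULL THEN NULL
    ELSE instr(:expression, coalesce(:substr, '<<fail>>'))
END
"""


def guarded_index(expression, pattern, substr):
    """substr: the matched text, or None when the pattern did not match."""
    conn = sqlite3.connect(":memory:")
    try:
        params = {"expression": expression, "pattern": pattern, "substr": substr}
        return conn.execute(SQL, params).fetchone()[0]
    finally:
        conn.close()


print(guarded_index("hello world", "o w", "o w"))  # match found
print(guarded_index("hello", "x+", None))          # no match
print(guarded_index("hello", None, None))          # NULL pattern
```

With the guard in place, the three calls give a position, 0, and NULL (`None`) respectively.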
Am now thinking that I'll probably just skip the test for NULL being passed to pattern and chalk it up as a wart. I'm having to do that for Oracle anyway for REGEXP_REPLACE as it treats NULL as '' for pattern there, returning the original string instead of NULL.
class RegexpReplace(RegexpFlagMixin, Func):
    function = 'REGEXP_REPLACE'
    inline_flags = {'mariadb'}
    output_field = CharField()
@felixxm I've got some test failures on Oracle and was wondering if you as an oracle on Oracle would understand.
It seems that the value returned is an instance of cx_Oracle.LOB and I have to cast that to str to get a sensible value.
If I change this line to TextField the tests pass, but that differs from all of the other functions in this module, which prefer CharField. I didn't have these failures originally and think that they may have appeared after 1e38f11, and I probably had to add this output_field line here after that due to mixed CharField and TextField. Maybe @charettes will understand what is going on here.
I don't have a quick answer. I would expect that all text functions are affected by this issue (but I couldn't reproduce it with LPad or Left 🤔). Oracle should return LOB output for CLOB inputs. When we set output_field to CharField(), the value is not converted to str, in contrast to TextField; see
django/django/db/backends/oracle/operations.py
Lines 205 to 208 in 35b0378
def convert_textfield_value(self, value, expression, connection):
    if isinstance(value, Database.LOB):
        value = value.read()
    return value
Adding conversion will fix tests, but I'm not sure if that's a good approach:
diff --git a/django/db/backends/oracle/operations.py b/django/db/backends/oracle/operations.py
index 1e5dc70613..bd15c0a140 100644
--- a/django/db/backends/oracle/operations.py
+++ b/django/db/backends/oracle/operations.py
@@ -176,7 +176,7 @@ END;
     def get_db_converters(self, expression):
         converters = super().get_db_converters(expression)
         internal_type = expression.output_field.get_internal_type()
-        if internal_type in ['JSONField', 'TextField']:
+        if internal_type in ['JSONField', 'TextField', 'CharField']:
             converters.append(self.convert_textfield_value)
         elif internal_type == 'BinaryField':
             converters.append(self.convert_binaryfield_value)
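The effect of that one-line change can be modelled without an Oracle connection. The following is a toy sketch, not Django's code: `FakeLOB` is a hypothetical stand-in for a driver LOB object such as cx_Oracle.LOB, and `get_db_converters` / `fetch` are simplified free functions that only mimic the converter lookup keyed on the output field's internal type.

```python
class FakeLOB:
    """Hypothetical stand-in for a driver LOB handle (e.g. cx_Oracle.LOB)."""

    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data


def convert_textfield_value(value):
    # Same idea as the converter quoted above: unwrap LOBs, pass others through.
    if isinstance(value, FakeLOB):
        value = value.read()
    return value


def get_db_converters(internal_type, lob_types=("JSONField", "TextField")):
    """Return the converters that apply to a given output field type."""
    converters = []
    if internal_type in lob_types:
        converters.append(convert_textfield_value)
    return converters


def fetch(value, internal_type, **kwargs):
    """Run a raw driver value through the converters for its field type."""
    for converter in get_db_converters(internal_type, **kwargs):
        value = converter(value)
    return value


raw = FakeLOB("replaced text")
print(fetch(raw, "TextField"))   # unwrapped to a str
print(type(fetch(raw, "CharField")).__name__)  # still a FakeLOB: the reported failure
patched = ("JSONField", "TextField", "CharField")
print(fetch(raw, "CharField", lob_types=patched))  # unwrapped once CharField is listed
```

This only illustrates why switching output_field to TextField made the tests pass; whether converting every CharField result is a good approach is the open question raised above.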
@pope1ni Thanks for this patch 👍 and investigation 🕵️ The main issue for me is that it's not adjustable for 3rd-party database backends. Ideally, we could add several hooks and call them in as_{vendor} methods instead of checking multiple vendors in as_sql().
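As a rough illustration of that suggestion, the sketch below mimics the lookup Django's SQL compiler performs: it tries an `as_<vendor>` method on the expression and falls back to `as_sql()`. Everything else is a toy assumption made for the example: `Connection`, `compile_node`, `ToyRegexpReplace` and the generated SQL strings are hypothetical and much simpler than the real expression API.

```python
class Connection:
    """Hypothetical minimal connection object: it only carries a vendor name."""

    def __init__(self, vendor):
        self.vendor = vendor


def compile_node(node, connection):
    """Prefer a vendor-specific method, otherwise fall back to as_sql()."""
    vendor_impl = getattr(node, "as_" + connection.vendor, None)
    if vendor_impl is not None:
        return vendor_impl(connection)
    return node.as_sql(connection)


class ToyRegexpReplace:
    def as_sql(self, connection, function="REGEXP_REPLACE", extra_args=()):
        # Generic path: no vendor checks in here at all.
        args = ["%s", "%s", "%s", *extra_args]
        return f"{function}({', '.join(args)})"

    def as_postgresql(self, connection):
        # Vendor hook: lower-case name plus an extra placeholder for flags.
        return self.as_sql(connection, function="regexp_replace", extra_args=("%s",))


class ThirdPartyRegexpReplace(ToyRegexpReplace):
    def as_mybackend(self, connection):
        # A third-party backend can add its own hook without touching as_sql().
        return self.as_sql(connection, function="RX_REPLACE")


print(compile_node(ToyRegexpReplace(), Connection("postgresql")))
print(compile_node(ToyRegexpReplace(), Connection("oracle")))
print(compile_node(ThirdPartyRegexpReplace(), Connection("mybackend")))
```

The point of the design is that `as_sql()` never branches on the vendor, so a backend outside Django can customise behaviour by supplying its own hook.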
So I've rejigged this in a way that should address this concern. Feel free to review those changes. I'm now just waiting for the Oracle tests to complete to see if I still get the failures I was experiencing. I suspect that I'll have to follow the advice recently added in 9369f0c.
So the failures with Oracle remain:
I've noticed that the failing tests all have one thing in common: They pass @felixxm I'm wondering whether this is a more general problem with Oracle? I see that the majority of tests for text database functions don't test with
A string of ``flags`` can be provided to adjust the matching and replacement
behavior for all of the above functions:

* ``c``: Perform case-sensitive matching of ``pattern``.
I got to this bit and now I'm wondering what the defaults are?
The defaults are in theory that no flags are passed. Hence the flags=Value('') in the function signatures above.
In reality, it's very messy because almost none of the backends agree on how to do regular expressions. So we smooth out some of the discrepancies between them to get it as consistent as possible. For PostgreSQL we pass p by default, for MySQL we pass c, and for MariaDB we use (?-i) inline. This is all carefully explained in the admonition.
I wish it were easier, but I believe that this is the best we can do. 🙁
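The per-backend defaults described in this reply could be sketched roughly as follows. This is a hypothetical helper, not the code from the pull request: `apply_default_flags` is an invented name, the vendor strings are assumptions, and the defaults are prepended on the assumption (stated elsewhere in this thread) that later flags take precedence over earlier ones.

```python
def apply_default_flags(vendor, pattern, flags=""):
    """Return (pattern, flags) with the backend's default behaviour applied."""
    if vendor == "postgresql":
        # 'p' is passed by default; user flags follow so they can override it.
        return pattern, "p" + flags
    if vendor == "mysql":
        # 'c' makes matching case-sensitive unless the user asks for 'i' later.
        return pattern, "c" + flags
    if vendor == "mariadb":
        # No flags argument here: case handling goes inline, last 'c'/'i' wins.
        inline = "(?-i)"
        for flag in flags:
            if flag == "i":
                inline = "(?i)"
            elif flag == "c":
                inline = "(?-i)"
        return inline + pattern, ""
    # Other backends: nothing to smooth over in this sketch.
    return pattern, flags


for vendor in ("postgresql", "mysql", "mariadb", "oracle"):
    print(vendor, apply_default_flags(vendor, "a.c", "i"))
```

With no user flags, the MariaDB branch yields the `(?-i)` prefix mentioned above, and the PostgreSQL and MySQL branches yield `p` and `c`.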
@felixxm This has been nearly ready for a long time so I thought I'd try to finish it. I've tested it out with the change that you suggested in #12438 (comment) and all the tests pass.
@ngnpope Thanks for all your efforts 👍 and really sorry for the late reply. I'm still a bit skeptical about juggling so many database-specific flags; it can be really hard to maintain. Personally, I'm -0 about this feature.
I left some comments, but we need to do a few more things to make it reviewable again:
- rebase from the main branch,
- apply black, and
- re-target to Django 4.1.
It happens!
After two years of keeping this going with no discouragement, I hope we don't jump to a no too quickly... We'll have to see if we can break things up further to deal with the flags better.
Have rebased, blackened, re-targeted to 4.1, and addressed the majority of comments. A couple of the comments will require a bit more thought, however. As an aside, PostgreSQL 15 looks like it is going to learn
buildbot, test on oracle.
This is a work-in-progress attempt to implement RegexpReplace(), etc. for ticket-28805.

Backend Support:

PostgreSQL:
- regexp_count(expr, pattern, position?, flags?) -- v15+ only
- regexp_instr(expr, pattern, position?, occurrence?, return_opt?, flags?, subexpr?) -- v15+ only
- regexp_replace(expr, pattern, replacement, position?, occurrence?, flags?) -- position and occurrence for v15+ only
- regexp_substr(expr, pattern, position?, occurrence?, flags?, subexpr?) -- v15+ only
- substr(expr from pattern) -- required for <v15

MySQL:
- REGEXP_REPLACE(expr, pattern, replacement, position?, occurrence?, flags?)
- REGEXP_SUBSTR(expr, pattern, position?, occurrence?, flags?)
- REGEXP_INSTR(expr, pattern, position?, occurrence?, return_opt?, flags?)

MariaDB:
- REGEXP_REPLACE(expr, pattern, replacement)
- REGEXP_SUBSTR(expr, pattern)
- REGEXP_INSTR(expr, pattern)

Oracle:
- REGEXP_REPLACE(expr, pattern, replacement, position?, occurrence?, flags?)
- REGEXP_SUBSTR(expr, pattern, position?, occurrence?, flags?, subexpr?)
- REGEXP_INSTR(expr, pattern, position?, occurrence?, return_opt?, flags?, subexpr?)
- REGEXP_COUNT(expr, pattern, position?, flags?)

SQLite:
- re.sub(pattern, replacement, expr, count?, flags?)

Implementation:
RegexpReplace(expression, pattern, replacement, flags)
RegexpSubstr(expression, pattern, flags)
RegexpStrIndex(expression, pattern, flags)

Notes:
- position is only supported on MySQL, Oracle, and PostgreSQL 15+ and is used to indicate where matches should begin. Support for this will not be implemented, save to pass in the default value of 1 where subsequent function arguments are required.
- PostgreSQL replaces only the first occurrence unless 'g' is passed in flags to replace all occurrences. MySQL & Oracle replace all occurrences by default (occurrence=0) or the specified occurrence. SQLite, using Python's re.sub(), will replace all occurrences by default (count=0) or up to count occurrences in the string. MariaDB only supports replacing all occurrences.
  We will pass 1 to the underlying count/occurrence parameter by default and accept 'g' in flags to pass to PostgreSQL or 0 to the underlying count/occurrence. MariaDB will ignore the 'g' flag as it will always replace everything.
- return_opt is only supported on MySQL and Oracle for REGEXP_INSTR and is used to control which position value is returned. Support for this will not be implemented, save to pass in the default value of 0 where subsequent function arguments are required.
- Case-sensitive ('c') and case-insensitive ('i') flags are supported by most backends. It seems that later flags specified take precedence over earlier ones. SQLite, using Python, is always case-sensitive by default and only supports 'i', but if 'c' is present after 'i', we can cancel the case-insensitivity.
  MariaDB only supports inline flags. We can support 'c' and 'i' by prefixing pattern with (?-i) and (?i) respectively. It also seems that MariaDB is case-insensitive by default (unless we have some weird collation configuration on the Django CI).
  MySQL also seems to be case-insensitive by default.
- Multiline ('m') seems to work similarly across all backends. PostgreSQL supports the value 'm', but the canonical value for the flag is 'n'.
- Dot-all ('s') seems to work similarly across all backends, but is the default on PostgreSQL. We pass the 'p' flag by default for PostgreSQL to get it to behave like other backends -- it is also documented. Oracle and MySQL use the value 'n' for this flag instead.
- Extended ('x') is not supported by MySQL. Oracle does not support comments in the pattern when using 'x', but all other backends do.

Issues:
- MySQL tests are currently skipped as Django CI doesn't have 8.0.4+. (No longer a problem - this was started so long ago!)
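
For the SQLite side, the notes above translate fairly directly onto Python's re.sub(). The following is a minimal sketch of one way that translation might look, not the code from this pull request: `sqlite_regexp_replace` is a hypothetical name, `None` stands in for NULL, and rejecting unknown flags with ValueError is an assumption made for the example.

```python
import re


def sqlite_regexp_replace(expression, pattern, replacement, flags=""):
    """Replace the first match by default; 'g' in flags replaces every match."""
    if expression is None or pattern is None or replacement is None:
        return None  # propagate NULL
    count = 1            # first occurrence only, matching the chosen default
    ignore_case = False  # Python is case-sensitive unless told otherwise
    re_flags = 0
    for flag in flags:
        if flag == "g":
            count = 0    # re.sub() treats count=0 as "replace all"
        elif flag == "i":
            ignore_case = True
        elif flag == "c":
            ignore_case = False  # a later 'c' cancels an earlier 'i'
        elif flag == "m":
            re_flags |= re.MULTILINE
        elif flag == "s":
            re_flags |= re.DOTALL
        elif flag == "x":
            re_flags |= re.VERBOSE
        else:
            raise ValueError(f"Unsupported flag: {flag!r}")
    if ignore_case:
        re_flags |= re.IGNORECASE
    return re.sub(pattern, replacement, expression, count=count, flags=re_flags)


print(sqlite_regexp_replace("aaa", "a", "b"))         # first occurrence only
print(sqlite_regexp_replace("aaa", "a", "b", "g"))    # every occurrence
print(sqlite_regexp_replace("Hello", "h", "j", "i"))  # case-insensitive
print(sqlite_regexp_replace("Hello", "h", "j", "ic")) # 'c' after 'i' wins
```

Such a function could be registered on a connection with sqlite3's create_function(); how the flags argument is wired through in practice is left open here.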