Fixed #28805 -- Added regular expression database functions. #12438

ngnpope · 2020-02-09T14:20:47Z

This is a work-in-progress attempt to implement RegexpReplace(), etc. for ticket-28805.

Backend Support:

PostgreSQL:

regexp_count(expr, pattern, position?, flags?) -- v15+ only
regexp_instr(expr, pattern, position?, occurrence?, return_opt?, flags?, subexpr?) -- v15+ only
regexp_replace(expr, pattern, replacement, position?, occurrence?, flags?)
- position and occurrence for v15+ only
regexp_substr(expr, pattern, position?, occurrence?, flags?, subexpr?) -- v15+ only
substr(expr from pattern) -- required for <v15

MySQL:

MariaDB:

Oracle:

SQLite:

re.sub(pattern, replacement, expr, count?, flags?)
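
As a rough illustration of the SQLite side, a Python callable built on re.sub() can be exposed to SQL through the standard library's sqlite3 module. This is only a sketch: the helper name `_sqlite_regexp_replace`, the three-argument signature, and the NULL handling are assumptions for the example, not necessarily what the patch does.

```python
import re
import sqlite3


def _sqlite_regexp_replace(expr, pattern, replacement):
    """Hypothetical helper: replace the first match of pattern in expr."""
    # Assumption: any NULL argument yields NULL, as most SQL functions do.
    if expr is None or pattern is None or replacement is None:
        return None
    # count=1 limits re.sub() to the first occurrence.
    return re.sub(pattern, replacement, expr, count=1)


conn = sqlite3.connect(":memory:")
# Expose the Python callable to SQL under the name REGEXP_REPLACE.
conn.create_function("REGEXP_REPLACE", 3, _sqlite_regexp_replace)

first_only = conn.execute(
    "SELECT REGEXP_REPLACE('2020-02-09', '-', '/')"
).fetchone()[0]
null_pattern = conn.execute(
    "SELECT REGEXP_REPLACE('2020-02-09', NULL, '/')"
).fetchone()[0]
```

Here `first_only` holds the string with only the first hyphen replaced, and `null_pattern` is None because the pattern was NULL.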

Implementation:

RegexpReplace(expression, pattern, replacement, flags)
RegexpSubstr(expression, pattern, flags)
RegexpStrIndex(expression, pattern, flags)

Notes:

position is only supported on MySQL, Oracle, and PostgreSQL 15+ and is used to indicate where matches should begin. Support for this will not be implemented, save to pass in the default value of 1 where subsequent function arguments are required.
By default PostgreSQL replaces the first occurrence only, unless 'g' is passed in flags to replace all occurrences. MySQL & Oracle replace all occurrences by default (occurrence=0) or the specified occurrence. SQLite, using Python's re.sub(), will replace all occurrences by default (count=0) or up to count occurrences in the string. MariaDB only supports replacing all occurrences.

We will pass 1 to the underlying count/occurrence parameter by default and accept 'g' in flags to pass to PostgreSQL or 0 to the underlying count/occurrence. MariaDB will ignore the 'g' flag as it will always replace everything.
return_opt is only supported on MySQL and Oracle for REGEXP_INSTR and is used to control which position value is returned. Support for this will not be implemented, save to pass in the default value of 0 where subsequent function arguments are required.
The case-sensitive ('c') and case-insensitive ('i') flags are supported by most backends. It seems that later flags specified take precedence over earlier ones. SQLite, using Python, is always case-sensitive by default and only supports 'i', but if 'c' is present after 'i', we can cancel the case-insensitivity.

MariaDB ~~doesn't support being passed~~ only supports inline flags. We can support 'c' and 'i', by prefixing pattern with (?-i) and (?i) respectively. It also seems that MariaDB is case-insensitive by default (unless we have some weird collation configuration on the Django CI).

MySQL also seems to be case-insensitive by default.
The multi-line flag ('m') seems to work similarly across all backends. PostgreSQL supports the value 'm', but the canonical value for the flag is 'n'.
The dotall flag ('s') seems to work similarly across all backends, but is the default on PostgreSQL. We ~~may want to~~ pass the 'p' flag by default for PostgreSQL to get it to behave like other backends -- it is also documented. Oracle and MySQL use the value 'n' for this flag instead.
The extended flag ('x') is not supported by MySQL. Oracle does not support comments in the pattern when using 'x', but all other backends do.
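
To make the flag rules above concrete for the Python-backed SQLite case, here is a minimal sketch of translating a flags string into re.sub() arguments. The function names `python_regex_args` and `regexp_replace` are hypothetical, and the "later flag wins" handling of 'c' and 'i' follows the behavior described in the notes rather than any confirmed implementation.

```python
import re

# Map single-letter flags to Python re flag bits (plain ints for simple bit math).
_FLAG_BITS = {
    "i": int(re.IGNORECASE),
    "m": int(re.MULTILINE),
    "s": int(re.DOTALL),
    "x": int(re.VERBOSE),
}


def python_regex_args(flags=""):
    """Hypothetical helper: return (re_flags, count) for re.sub()."""
    re_flags = 0
    count = 1  # Default: replace the first occurrence only.
    for flag in flags:
        if flag == "c":
            # A later 'c' cancels an earlier 'i'.
            re_flags &= ~_FLAG_BITS["i"]
        elif flag == "g":
            count = 0  # count=0 makes re.sub() replace every occurrence.
        elif flag in _FLAG_BITS:
            re_flags |= _FLAG_BITS[flag]
        else:
            raise ValueError(f"Unsupported flag: {flag!r}")
    return re_flags, count


def regexp_replace(expr, pattern, replacement, flags=""):
    """Hypothetical emulation of the normalized replace semantics."""
    re_flags, count = python_regex_args(flags)
    return re.sub(pattern, replacement, expr, count=count, flags=re_flags)
```

With this sketch, `regexp_replace("aAa", "a", "x")` replaces only the first match, adding 'g' replaces all matches, and 'ic' behaves like no case flag at all.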

Issues:

~~MySQL tests are currently skipped as Django CI doesn't have 8.0.4+.~~ (No longer a problem - this was started so long ago!)

django/db/models/functions/text.py

ngnpope

Ok. So I think this is ready for a first round of review.

django/db/models/functions/text.py

ngnpope · 2020-07-01T21:04:16Z

django/db/models/functions/text.py

+        # FIXME: This emulated version doesn't handle NULL pattern correctly.
+        expression, pattern, flags = self.source_expressions.copy()
+        expr = RegexpSubstr(expression, pattern, flags, no_wrap=True)
+        expr = StrIndex(expression, Coalesce(expr, Value('<<fail>>')))


Emulating REGEXP_INSTR in PostgreSQL is tough. Unfortunately the empty string has an index of 1, so we need to coalesce to some other sentinel value that will not match in order to get the expected index of 0 for no match. This breaks when NULL is passed as the pattern. We might be able to solve this by wrapping with Case.

It should be possible to do CASE WHEN <pattern> IS NULL THEN NULL ELSE <expr> END in PostgreSQL, but I can't work out if this is possible with Django's Case and When...

Am now thinking that I'll probably just skip the test for NULL being passed to pattern and chalk it up as a wart. I'm having to do that for Oracle anyway for REGEXP_REPLACE as it treats NULL as '' for pattern there, returning the original string instead of NULL.
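
The sentinel problem and the CASE WHEN guard discussed above can be reproduced outside PostgreSQL. The sketch below uses an in-memory SQLite database purely as a stand-in: `regexp_substr` is a hypothetical Python-backed function registered for the example, and SQLite's instr() plays the role of StrIndex. It is not the SQL that Django generates.

```python
import re
import sqlite3


def regexp_substr(expr, pattern):
    """Hypothetical stand-in: return the first match, or NULL if none."""
    if expr is None or pattern is None:
        return None
    match = re.search(pattern, expr)
    return match.group(0) if match else None


conn = sqlite3.connect(":memory:")
conn.create_function("regexp_substr", 2, regexp_substr)

# Emulation as in the diff: coalesce "no match" to a sentinel that is not found.
EMULATED = "SELECT instr(:e, coalesce(regexp_substr(:e, :p), '<<fail>>'))"
# Same expression wrapped in a guard so that a NULL pattern yields NULL.
GUARDED = (
    "SELECT CASE WHEN :p IS NULL THEN NULL "
    "ELSE instr(:e, coalesce(regexp_substr(:e, :p), '<<fail>>')) END"
)


def run(sql, expr, pattern):
    return conn.execute(sql, {"e": expr, "p": pattern}).fetchone()[0]


found = run(EMULATED, "hello world", r"w\w+")
missing = run(EMULATED, "hello world", r"\d+")
null_unguarded = run(EMULATED, "hello world", None)
null_guarded = run(GUARDED, "hello world", None)
```

In this stand-in, the unguarded query reports 0 for a NULL pattern (indistinguishable from "no match"), while the guarded query returns NULL.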

django/db/models/functions/text.py

ngnpope · 2020-08-16T11:23:59Z

django/db/models/functions/text.py

+class RegexpReplace(RegexpFlagMixin, Func):
+    function = 'REGEXP_REPLACE'
+    inline_flags = {'mariadb'}
+    output_field = CharField()


@felixxm I've got some test failures on Oracle and was wondering if you as an oracle on Oracle would understand.

It seems that the value returned is an instance of cx_Oracle.LOB and I have to cast that to str to get a sensible value.

If I change this line to TextField the tests pass, but that differs to all of the other functions in this module which prefer CharField. I didn't have these failures originally and think that they may have appeared after 1e38f11 and I probably had to add this output_field line here after that due to mixed CharField and TextField. Maybe @charettes will understand what is going on here.

I don't have a quick answer. I would expect that all text functions are affected by this issue (but I couldn't reproduce it with LPad or Left 🤔). Oracle should return LOB output for CLOB inputs. When we set output_field to CharField(), the value is not converted to str, in contrast to TextField; see

django/django/db/backends/oracle/operations.py

Lines 205 to 208 in 35b0378

def convert_textfield_value(self, value, expression, connection):
    if isinstance(value, Database.LOB):
        value = value.read()
    return value

Adding conversion will fix tests, but I'm not sure if that's a good approach:

diff --git a/django/db/backends/oracle/operations.py b/django/db/backends/oracle/operations.py
index 1e5dc70613..bd15c0a140 100644
--- a/django/db/backends/oracle/operations.py
+++ b/django/db/backends/oracle/operations.py
@@ -176,7 +176,7 @@ END;
     def get_db_converters(self, expression):
         converters = super().get_db_converters(expression)
         internal_type = expression.output_field.get_internal_type()
-        if internal_type in ['JSONField', 'TextField']:
+        if internal_type in ['JSONField', 'TextField', 'CharField']:
             converters.append(self.convert_textfield_value)
         elif internal_type == 'BinaryField':
             converters.append(self.convert_binaryfield_value)
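
To show why the choice of output_field matters here, the following is a simplified, self-contained model of the converter selection. `FakeLOB` is a stand-in for cx_Oracle.LOB, and the free functions only mirror the shape of the backend methods; none of this is the real Oracle backend code.

```python
class FakeLOB:
    """Stand-in for cx_Oracle.LOB: data is only available through read()."""

    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data


def convert_textfield_value(value):
    # Mirrors the converter quoted above: unwrap LOBs, pass everything else through.
    if isinstance(value, FakeLOB):
        value = value.read()
    return value


def get_db_converters(internal_type, converted_types):
    # Simplified model: only the listed field types get the LOB converter.
    converters = []
    if internal_type in converted_types:
        converters.append(convert_textfield_value)
    return converters


def fetch(value, internal_type, converted_types):
    for converter in get_db_converters(internal_type, converted_types):
        value = converter(value)
    return value


BEFORE = {"JSONField", "TextField"}
AFTER = {"JSONField", "TextField", "CharField"}

unconverted = fetch(FakeLOB("abc"), "CharField", BEFORE)
converted = fetch(FakeLOB("abc"), "CharField", AFTER)
```

With the original set, a CharField output leaves the LOB wrapper in place; with 'CharField' added, the value comes back as a plain string.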

django/db/models/functions/text.py

felixxm

@pope1ni Thanks for this patch 👍 and investigation 🕵️ The main issue for me is that it's not adjustable for 3rd-party database backends. Ideally, we could add several hooks and call them in as_{vendor} methods instead of checking multiple vendors in as_sql().
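
A toy model of the suggested structure may help. Django's SQL compiler looks for an `as_<vendor>` method on an expression before falling back to `as_sql()`; the sketch below imitates that lookup with plain Python. The class `RegexpSketch`, the `get_flags` hook, and the `compile_for` function are hypothetical names for illustration only.

```python
class RegexpSketch:
    """Toy expression: vendor quirks live in as_<vendor>() methods."""

    def __init__(self, flags=""):
        self.flags = flags

    def get_flags(self, default_flags=""):
        # Hypothetical hook: backend defaults go first, user flags after them.
        return default_flags + self.flags

    def as_sql(self, default_flags=""):
        return "REGEXP_REPLACE(%s, %s, %s, %s)", [self.get_flags(default_flags)]

    def as_postgresql(self):
        return self.as_sql(default_flags="p")

    def as_mysql(self):
        return self.as_sql(default_flags="c")


def compile_for(node, vendor):
    # Prefer a vendor-specific method when one exists, else use as_sql().
    vendor_impl = getattr(node, "as_" + vendor, None)
    if vendor_impl is not None:
        return vendor_impl()
    return node.as_sql()


# A third-party backend can add its own method without touching as_sql().
RegexpSketch.as_thirdparty = lambda self: self.as_sql(default_flags="p")

pg_params = compile_for(RegexpSketch("i"), "postgresql")[1]
oracle_params = compile_for(RegexpSketch("i"), "oracle")[1]
thirdparty_params = compile_for(RegexpSketch("i"), "thirdparty")[1]
```

The point of the design is that `as_sql()` contains no vendor checks, so an external backend only needs to attach one method.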

docs/ref/models/database-functions.txt

ngnpope · 2020-09-17T14:41:34Z

The main issue for me is that it's not adjustable for 3rd-party database backends. Ideally, we could add several hooks and call them in as_{vendor} methods instead of checking multiple vendors in as_sql().

So I've rejigged this in a way that should address this concern. Feel free to review those changes.

I'm now just waiting for the oracle tests to complete to see if I still get the failures I was experiencing. I suspect that I'll have to follow the advice recently added in 9369f0c.

ngnpope · 2020-09-21T11:01:18Z

I'm now just waiting for the oracle tests to complete to see if I still get the failures I was experiencing. I suspect that I'll have to follow the advice recently added in 9369f0c.

So the failures with Oracle remain:

db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_dotall_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_extended_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_global_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_multiline_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_dotall_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_extended_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_multiline_flag

I've noticed that the failing tests all have one thing in common: they pass Article.text (a TextField) as their first argument, not Article.title (a CharField) as the other tests do.

@felixxm I'm wondering whether this is a more general problem with Oracle? I see that the majority of tests for text database functions don't test with TextField inputs. You mentioned being unable to reproduce the issue with LPad or Left, but were you using a CharField? Do you get the same issue if you pass in a TextField, e.g. Article.text, as the first argument? (I don't currently have an Oracle instance to test...) 🕵️

docs/ref/models/database-functions.txt

smithdc1 · 2021-07-16T05:07:54Z

docs/ref/models/database-functions.txt

+A string of ``flags`` can be provided to adjust the matching and replacement
+behavior for all of the above functions:
+
+* ``c``: Perform case-sensitive matching of ``pattern``.


I got to this bit and now I'm wondering what the defaults are?

The defaults are in theory that no flags are passed. Hence the flags=Value('') in the function signatures above.

In reality, it's very messy because almost none of the backends agree on how to do regular expressions. So we smooth out some of the discrepancies between them to make the behavior as consistent as possible. For PostgreSQL we pass p by default, for MySQL we pass c, and for MariaDB we use (?-i) inline. This is all carefully explained in the admonition.
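
The per-backend defaults described here could be sketched as follows. The function name `apply_backend_defaults` is hypothetical, defaults are placed before user flags on the assumption that later flags take precedence, and the MariaDB branch only handles 'c' and 'i'; other flags are left out of this sketch.

```python
def apply_backend_defaults(vendor, pattern, flags=""):
    """Hypothetical helper: return (pattern, flags) adjusted for a backend."""
    if vendor == "postgresql":
        # 'p' is passed first so that user-supplied flags can override it.
        return pattern, "p" + flags
    if vendor == "mysql":
        # Case-sensitive by default; a later 'i' from the user still wins.
        return pattern, "c" + flags
    if vendor == "mariadb":
        # MariaDB only supports inline flags, so rewrite the pattern instead.
        prefix = "(?-i)"
        for flag in flags:
            if flag == "i":
                prefix = "(?i)"
            elif flag == "c":
                prefix = "(?-i)"
        return prefix + pattern, ""
    # Other backends: pass everything through unchanged.
    return pattern, flags
```

For example, `apply_backend_defaults("mariadb", "abc", "i")` yields the pattern `(?i)abc` with no separate flags argument, while PostgreSQL receives the flags string `pi`.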

I wish it were easier, but I believe that this is the best we can do. 🙁

ngnpope · 2021-07-17T11:25:19Z

@felixxm This has been nearly ready for a long time so I thought I'd try to finish it.

I've tested it out with the change that you suggested in #12438 (comment) and all the tests pass.
It doesn't appear to affect anything else. Is there any reason why that wouldn't be acceptable?

felixxm

@ngnpope Thanks for all your efforts 👍 and really sorry for the late reply. I'm still a bit skeptical about juggling so many database-specific flags; it can be really hard to maintain. Personally, I'm -0 about this feature.

I left some comments, but we need to do a few more things to make it reviewable again:

rebase from the main branch,
apply black, and
re-target to Django 4.1.

django/db/backends/mysql/features.py

django/db/backends/sqlite3/base.py

django/db/models/functions/text.py

tests/db_functions/text/test_regexpreplace.py

tests/db_functions/text/test_regexpstrindex.py

tests/db_functions/text/test_regexpsubstr.py

ngnpope · 2022-03-20T22:51:59Z

Thanks for all your efforts 👍 and really sorry for the late reply.

It happens!

I'm still a bit skeptical about juggling so many database-specific flags; it can be really hard to maintain. Personally, I'm -0 about this feature.

After two years of keeping this going with no discouragement, I hope we don't jump to a no too quickly...

We'll have to see if we can break things up further to deal with the flags better.

I left some comments, but we need to do a few more things to make it reviewable again:

Have rebased, blackened, re-targeted to 4.1, and addressed the majority of comments. A couple of the comments will require a bit more thought, however.

As an aside, PostgreSQL 15 looks like it is going to learn REGEXP_INSTR, REGEXP_REPLACE and REGEXP_SUBSTR - among others - which will eventually allow some of the hacks for PostgreSQL to be phased out:

ngnpope · 2023-07-31T09:31:23Z

buildbot, test on oracle.

ngnpope force-pushed the ticket-28805 branch 7 times, most recently from 10cc472 to 2677efc Compare February 12, 2020 00:05

charettes reviewed Feb 12, 2020

ngnpope force-pushed the ticket-28805 branch 2 times, most recently from 37509f3 to b806947 Compare February 15, 2020 20:36

ngnpope changed the title ~~Fixed #28805 -- Added the RegexpReplace database function.~~ Fixed #28805 -- Added regular expression database functions. Feb 16, 2020

ngnpope force-pushed the ticket-28805 branch from b806947 to e54691d Compare March 7, 2020 23:10

ngnpope force-pushed the ticket-28805 branch from e54691d to dfcbce3 Compare July 1, 2020 20:56

ngnpope commented Jul 1, 2020

ngnpope marked this pull request as ready for review July 1, 2020 21:08

ngnpope force-pushed the ticket-28805 branch from dfcbce3 to f02edd1 Compare July 3, 2020 15:45

charettes reviewed Jul 3, 2020

ngnpope force-pushed the ticket-28805 branch from 7384654 to 2c08f7c Compare July 9, 2020 19:23

ngnpope force-pushed the ticket-28805 branch 3 times, most recently from f6f61f6 to 1597a68 Compare July 23, 2020 20:19

ngnpope force-pushed the ticket-28805 branch from 1597a68 to cdd0115 Compare July 28, 2020 19:08

ngnpope force-pushed the ticket-28805 branch 2 times, most recently from acb76b3 to fff84cc Compare August 9, 2020 20:16

ngnpope force-pushed the ticket-28805 branch 2 times, most recently from 438755a to 93c706b Compare August 16, 2020 11:09

ngnpope commented Aug 16, 2020

felixxm reviewed Sep 17, 2020

felixxm reviewed Sep 17, 2020

ngnpope force-pushed the ticket-28805 branch from 93c706b to 46c5cea Compare September 17, 2020 14:18

Base automatically changed from master to main March 9, 2021 06:21

ngnpope force-pushed the ticket-28805 branch from 46c5cea to c18ea00 Compare March 29, 2021 10:24

ngnpope mentioned this pull request Jul 1, 2021

MySQL 8 regex functions adamchainz/django-mysql#454

Closed

ngnpope force-pushed the ticket-28805 branch from c18ea00 to e11642d Compare July 15, 2021 22:29

smithdc1 reviewed Jul 16, 2021

ngnpope force-pushed the ticket-28805 branch from e11642d to 12b6c2a Compare July 17, 2021 11:20

felixxm reviewed Mar 18, 2022

ngnpope force-pushed the ticket-28805 branch from 12b6c2a to a3a0e90 Compare March 20, 2022 22:37

ngnpope force-pushed the ticket-28805 branch 2 times, most recently from dfb1016 to 66099dd Compare October 17, 2022 16:17

ngnpope force-pushed the ticket-28805 branch 2 times, most recently from 98d3049 to 9acb5bc Compare February 12, 2023 21:11

ngnpope force-pushed the ticket-28805 branch 3 times, most recently from ddd1bca to 0e708aa Compare June 11, 2023 19:59

SilviaAmAm mentioned this pull request Jun 15, 2023

[#1530] Confirmation email with CC on cosign open-formulieren/open-forms#3157

Merged

ngnpope force-pushed the ticket-28805 branch from 0e708aa to 384f271 Compare July 31, 2023 09:30

ngnpope force-pushed the ticket-28805 branch 3 times, most recently from 671957c to 9ce112f Compare September 13, 2023 20:18

ngnpope and others added 2 commits September 18, 2023 21:43

HACK: Attempt to fix for Oracle.

e91409f

Fixed #28805 -- Added regular expression database function.

a42fa9c

ngnpope force-pushed the ticket-28805 branch from 9ce112f to a42fa9c Compare September 18, 2023 20:47
