Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed #28805 -- Added regular expression database functions. #12438

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

ngnpope
Copy link
Member

@ngnpope ngnpope commented Feb 9, 2020

This is a work in progress attempt to implement RegexpReplace(), etc. for ticket-28805.

Backend Support:

PostgreSQL:

MySQL:

MariaDB:

Oracle:

SQLite:

Implementation:

  • RegexpReplace(expression, pattern, replacement, flags)
  • RegexpSubstr(expression, pattern, flags)
  • RegexpStrIndex(expression, pattern, flags)

Notes:

  • position is only supported on MySQL, Oracle, and PostgreSQL 15+ and is used to indicate where matches should begin. Support for this will not be implemented, save to pass in the default value of 1 where subsequent function arguments are required.

  • By default PostgreSQL replaces the first occurrence only unless 'g' is passed in flags to replace all occurrences. MySQL & Oracle replace all occurrences by default (occurrence=0) or the specified occurrence. SQLite, using Python's re.sub() will replace all occurrences by default (count=0) or up to count occurrences in the string. MariaDB only supports replacing all occurrences.

    We will pass 1 to the underlying count/occurrence parameter by default and accept 'g' in flags to pass to PostgreSQL or 0 to the underlying count/occurrence. MariaDB will ignore the 'g' flag as it will always replace everything.

  • return_opt is only supported on MySQL and Oracle for REGEXP_INSTR and is used to control which position value is returtned. Support for this will not be implemented, save to pass in the default value of 0 where subsequent function arguments are required.

  • The case-sensitive ('c') and case-insensitive ('i') flags are supported by most backends. It seems that later flags specified take precedence over earlier ones. SQLite, using Python, is always case-sensitive by default and only supports 'i', but if 'c' is present after 'i', we can cancel the case-insensitivity.

    MariaDB doesn't support being passed only supports inline flags. We can support 'c' and 'i', by prefixing pattern with (?-i) and (?i) respectively. It also seems that MariaDB is case-insensitive by default (unless we have some weird collation configuration on the Django CI).

    MySQL also seems to be case-insensitive by default.

  • The multi-line flag ('m') seems to work similarly across all backends. PostgreSQL supports the value 'm', but the canonical value for the flag is 'n'.

  • The dotall flag ('s') seems to work similarly across all backends, but is the default on PostgreSQL. We may want to pass the 'p' flag by default for PostgreSQL to get it to behave like other backends -- it is also documented. Oracle and MySQL use the value 'n' for this flag instead.

  • The extended flag ('x') is not supported by MySQL. Oracle does not support comments in the pattern when using 'x', but all other backends do.

Issues:

  • MySQL tests are currently skipped as Django CI doesn't have 8.0.4+. (No longer a problem - this was started so long ago!)

@ngnpope ngnpope force-pushed the ticket-28805 branch 7 times, most recently from 10cc472 to 2677efc Compare February 12, 2020 00:05
@ngnpope ngnpope force-pushed the ticket-28805 branch 2 times, most recently from 37509f3 to b806947 Compare February 15, 2020 20:36
@ngnpope ngnpope changed the title Fixed #28805 -- Added the RegexpReplace database function. Fixed #28805 -- Added regular expression database functions. Feb 16, 2020
Copy link
Member Author

@ngnpope ngnpope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. So I think this is ready for a first round of review.

django/db/models/functions/text.py Outdated Show resolved Hide resolved
# FIXME: This emulated version doesn't handle NULL pattern correctly.
expression, pattern, flags = self.source_expressions.copy()
expr = RegexpSubstr(expression, pattern, flags, no_wrap=True)
expr = StrIndex(expression, Coalesce(expr, Value('<<fail>>')))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emulating REGEXP_INSTR in PostgreSQL is tough. Unfortunately empty string has an index of 1 so we need to coalesce to some other sentinel value that will not match to get the expected index of 0 for no match. This breaks for NULL being passed to pattern. We might be able to solve this by wrapping with Case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to do CASE WHEN <pattern> IS NULL THEN NULL ELSE <expr> END in PostgreSQL, but I can't work out if this is possible with Django's Case and When...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am now thinking that I'll probably just skip the test for NULL being passed to pattern and chalk it up as a wart. I'm having to do that for Oracle anyway for REGEXP_REPLACE as it treats NULL as '' for pattern there, returning the original string instead of NULL.

@ngnpope ngnpope marked this pull request as ready for review July 1, 2020 21:08
django/db/models/functions/text.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
class RegexpReplace(RegexpFlagMixin, Func):
function = 'REGEXP_REPLACE'
inline_flags = {'mariadb'}
output_field = CharField()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixxm I've got some test failures on Oracle and was wondering if you as an oracle on Oracle would understand.

It seems that the value returned is an instance of cx_Oracle.LOB and I have to cast that to str to get a sensible value.

If I change this line to TextField the tests pass, but that differs to all of the other functions in this module which prefer CharField. I didn't have these failures originally and think that they may have appeared after 1e38f11 and I probably had to add this output_field line here after that due to mixed CharField and TextField. Maybe @charettes will understand what is going on here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a quick answer. I would expect that all text-functions are affected by this issue (but I couldn't reproduce it with LPad or Left 🤔). Oracle should return LOB`s output for CLOB`s inputs, when we set output_field to CharField() it's not converted to str in contrast to TextField, see

def convert_textfield_value(self, value, expression, connection):
if isinstance(value, Database.LOB):
value = value.read()
return value

Adding conversion will fix tests, but I'm not sure if that's a good approach:

diff --git a/django/db/backends/oracle/operations.py b/django/db/backends/oracle/operations.py
index 1e5dc70613..bd15c0a140 100644
--- a/django/db/backends/oracle/operations.py
+++ b/django/db/backends/oracle/operations.py
@@ -176,7 +176,7 @@ END;
     def get_db_converters(self, expression):
         converters = super().get_db_converters(expression)
         internal_type = expression.output_field.get_internal_type()
-        if internal_type in ['JSONField', 'TextField']:
+        if internal_type in ['JSONField', 'TextField', 'CharField']:
             converters.append(self.convert_textfield_value)
         elif internal_type == 'BinaryField':
             converters.append(self.convert_binaryfield_value)

django/db/models/functions/text.py Show resolved Hide resolved
Copy link
Member

@felixxm felixxm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pope1ni Thanks for this patch 👍 and investigation 🕵️ The main issue for me is that's it's not adjustable for 3rd-party database backends. Ideally, we could add a several hooks and call them in as_{vendor} methods instead of checking multiple vendors in as_sql().

@ngnpope
Copy link
Member Author

ngnpope commented Sep 17, 2020

The main issue for me is that's it's not adjustable for 3rd-party database backends. Ideally, we could add a several hooks and call them in as_{vendor} methods instead of checking multiple vendors in as_sql().

So I've rejigged this in a way that should address this concern. Feel free to review those changes.

I'm now just waiting for the oracle tests to complete to see if I still get the failures I was experiencing. I suspect that I'll have to follow the advice recently added in 9369f0c.

@ngnpope
Copy link
Member Author

ngnpope commented Sep 21, 2020

I'm now just waiting for the oracle tests to complete to see if I still get the failures I was experiencing. I suspect that I'll have to follow the advice recently added in 9369f0c.

So the failures with Oracle remain:

db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_dotall_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_extended_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_global_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_multiline_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_dotall_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_extended_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_multiline_flag

I've noticed that the failing tests all have one thing in common: They pass Article.text as their first argument - a TextField - and not Article.title as the other tests use - a CharField.

@felixxm I'm wondering whether this is a more general problem with Oracle? I see that the majority of tests for text database functions don't test with TextField inputs. You mentioned being unable to reproduce the issue with LPad or Left, but were you using a CharField? Do you get the same issue if you pass in a TextField, e.g. Article.text, as the first argument? (I don't currently have an Oracle instance to test...) 🕵️

docs/ref/models/database-functions.txt Outdated Show resolved Hide resolved
A string of ``flags`` can be provided to adjust the matching and replacement
behavior for all of the above functions:

* ``c``: Perform case-sensitive matching of ``pattern``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got to this bit and now I'm wondering what the defaults are?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The defaults are in theory that no flags are passed. Hence the flags=Value('') in the function signatures above.

In reality, it's very messy because nearly none of the backends agree on how to do regular expressions. So we smooth out some of the discrepancies between them to get it as consistent as possible. For PostgreSQL we pass p by default, for MySQL we pass c, and for MariaDB we use (?-i) inline. This is all carefully explained in the admonition.

I wish it were easier, but I believe that this is the best we can do. 🙁

@ngnpope
Copy link
Member Author

ngnpope commented Jul 17, 2021

@felixxm This has been nearly ready for a long time so I thought I'd try to finish it.

I've tested it out with the change that you suggested in #12438 (comment) and all the tests pass.
It doesn't appear to affect anything else. Is there any reason why that wouldn't be acceptable?

Copy link
Member

@felixxm felixxm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ngnpope Thanks for all you efforts 👍 and really sorry for the late reply. I'm still a bit skeptical about juggling so many database-specific flags, it's can be really hard to maintain. Personally, I'm -0 about this feature.

I left some comments, but we need to make few more things to make it reviewable again:

  • rebase from the main branch,
  • apply black, and
  • re-target to Django 4.1.

django/db/backends/mysql/features.py Outdated Show resolved Hide resolved
django/db/backends/sqlite3/base.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
django/db/models/functions/text.py Outdated Show resolved Hide resolved
tests/db_functions/text/test_regexpreplace.py Outdated Show resolved Hide resolved
tests/db_functions/text/test_regexpreplace.py Outdated Show resolved Hide resolved
tests/db_functions/text/test_regexpstrindex.py Outdated Show resolved Hide resolved
tests/db_functions/text/test_regexpsubstr.py Outdated Show resolved Hide resolved
@ngnpope
Copy link
Member Author

ngnpope commented Mar 20, 2022

Thanks for all you efforts +1 and really sorry for the late reply.

It happens!

I'm still a bit skeptical about juggling so many database-specific flags, it's can be really hard to maintain. Personally, I'm -0 about this feature.

After two years of keeping this going with no discouragement, I hope we don't jump to a no too quickly...

We'll have to see if we can break things up further to deal with the flags better.

I left some comments, but we need to make few more things to make it reviewable again:

Have rebased, blackened, re-targeted to 4.1, and addressed the majority of comments. A couple of the comments will require a bit more thought, however.


As an aside, PostgreSQL 15 looks like it is going to learn REGEXP_INSTR, REGEXP_REPLACE and REGEXP_SUBSTR - among others - which will eventually allow some of the hacks for PostgreSQL to be phased out:

@ngnpope
Copy link
Member Author

ngnpope commented Jul 31, 2023

buildbot, test on oracle.

@ngnpope ngnpope force-pushed the ticket-28805 branch 3 times, most recently from 671957c to 9ce112f Compare September 13, 2023 20:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants