Fixed #28041 -- Added prefix matching for PostgreSQL full text search #12727

kaedroho · 2020-04-15T16:53:12Z

This pull request is based on an earlier one: #8313

I've rebased it on the latest master and made the following tweaks:

Added some tests and docs
Added escaping for quotes and backslashes
Added support for the ~ operator on Lexemes
Replaced the RawSearchQuery with the new SearchQuery(search_type='raw')

jacobtylerwalls

Hi Karl, thanks for the patch. Although I'm not able to offer a substantive review, I left some comments on your docs to ensure CI will pass. I also updated the flags on the ticket so that this will be more visible for reviewers who can give you a substantive review. Cheers

jacobtylerwalls · 2020-10-31T14:38:34Z

docs/ref/contrib/postgres/search.txt

+==============
+
+Lexeme objects allow search operators to be used along with strings from an untrusted source. The contents
+of each lexeme is excaped so cannot contain any operators that PostgreSQL will recognise, but they support


also plural "the contents of each lexeme are escaped"
and "so cannot contain" is not clear, sounds like it's missing a word.

"recognize" for American English, don't blame me, it's CI that will complain :-)

Thanks! I think changing "contents" to singular sounds a bit better, but I'm not sure!

"the content of each lexeme is escaped"

I've had a go at rewriting this paragraph, let me know what you think! I'm not sure about the wording of the first sentence in particular.

alextatarinov · 2020-11-12T18:42:15Z

django/contrib/postgres/search.py

+        super().__init__(value, output_field=output_field)
+
+    def as_sql(self, compiler, connection):
+        param = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\")


I am not 100% certain, but it may be necessary to escape more than single quotes and backslashes: specifically, all to_tsquery operators such as & and |. For example, Lexeme('red!') should fail. Here is the code I used to implement the escaping some time ago:

    import re

    _SEARCH_SPEC_CHARS = r"['\0\[\]()|&:*!@<>\\]"
    _spec_chars_re = re.compile(_SEARCH_SPEC_CHARS)
    multiple_spaces_re = re.compile(r'\s{2,}')


    def normalize_spaces(val):
        """Converts multiple spaces to single and strips from both sides"""
        if not val:
            return None
        return multiple_spaces_re.sub(' ', val.strip())


    def psql_escape(query: str):
        # replace unsafe chars with space
        query = _spec_chars_re.sub(' ', query)
        query = normalize_spaces(query)  # convert multiple spaces to single
        return query

It looks like it should be handled by 'raw' search type, but it's not, and since quote and backslash are escaped, I assumed it should also handle other special characters.

On the other hand, it was probably omitted intentionally in the 'raw' query implementation, so we probably shouldn't try to do it here either.

Thanks! I've updated it to use your code snippet for escaping

This did change how some words are indexed, though. For example, L'amour was previously indexed as L''amour, and now it is L amour. I'm not that familiar with French, so I'm not sure what impact that has.

I've applied the suggestion from #12727 (comment), which brought back the quotes, but it also doesn't escape !.

I think this is fine because the whole lexeme is surrounded by quotes, so your previous example would be escaped as 'red!' and those quotes should prevent it from interpreting the ! (I guess!).
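The two escaping strategies discussed in this thread behave differently on the same input, which a small stand-alone sketch can make concrete. The helper names `strip_operators` and `quote_lexeme` are hypothetical and exist only for this illustration; `strip_operators` is a condensed variant of the snippet quoted above, and `quote_lexeme` mirrors the quote-and-backslash doubling from the diff line under review. Neither is claimed to be the code that ended up in the patch.

```python
import re

# Characters treated as special in the snippet quoted earlier in this thread.
_SPECIAL_CHARS_RE = re.compile(r"['\0\[\]()|&:*!@<>\\]")


def strip_operators(text):
    """Replace special characters with spaces, then collapse whitespace."""
    without_operators = _SPECIAL_CHARS_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_operators).strip()


def quote_lexeme(text):
    """Keep every character, doubling quotes and backslashes, then wrap in quotes."""
    return "'%s'" % text.replace("'", "''").replace("\\", "\\\\")


for sample in ("L'amour", "red!"):
    print(repr(sample), "->", repr(strip_operators(sample)), "vs", repr(quote_lexeme(sample)))
```

Stripping loses the apostrophe in L'amour and the ! in red!, while quoting keeps both inside a single quoted token. Whether PostgreSQL then treats the quoted ! as plain text is the open question raised in the comment above.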

felixxm

@kaedroho Thanks for this patch 👍 I left initial comments.

felixxm · 2020-11-21T07:20:40Z

django/contrib/postgres/search.py

+
+    def __init__(self, value, output_field=None, *, invert=False, prefix=False, weight=None):
+        self.prefix = prefix
+        self.invert = invert


I would use negated.

Suggested change

-        self.invert = invert
+        self.negated = negated

The existing SearchQuery class also uses invert: https://github.com/django/django/blob/master/django/contrib/postgres/search.py#L169

felixxm · 2020-11-21T07:42:43Z

django/contrib/postgres/search.py

@@ -300,3 +303,90 @@ class TrigramSimilarity(TrigramBase):
 class TrigramDistance(TrigramBase):
    function = ''
    arg_joiner = ' <-> '
+
+
+class LexemeCombinable(Expression):


This creates a diamond-shape inheritance

            Expression
            /        \
    LexemeCombinable  Value
            \        /
              Lexeme

which is not necessary. I think we should use:

    class LexemeCombinable:
        ...

    class Lexeme(LexemeCombinable, Value):
        ...

    class CombinedLexeme(LexemeCombinable, CombinedExpression):
        ...

Done, thanks!
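The effect of the suggested hierarchy can be checked with stub classes. In this sketch, `Expression`, `Value` and `CombinedExpression` are empty stand-ins for the Django classes of the same names (an assumption made so the example runs without Django), and `mro_names` is a hypothetical helper. It only illustrates that a plain mixin leaves a single path to `Expression`.

```python
class Expression:
    """Empty stand-in for Django's Expression (illustration only)."""


class Value(Expression):
    """Empty stand-in for Django's Value."""


class CombinedExpression(Expression):
    """Empty stand-in for Django's CombinedExpression."""


class LexemeCombinable:
    """Plain mixin: it does not inherit from Expression itself."""


class Lexeme(LexemeCombinable, Value):
    pass


class CombinedLexeme(LexemeCombinable, CombinedExpression):
    pass


def mro_names(cls):
    """Return the class names in method resolution order."""
    return [klass.__name__ for klass in cls.__mro__]


print(mro_names(Lexeme))
print(mro_names(CombinedLexeme))
```

With the mixin detached from `Expression`, each concrete class reaches `Expression` only through its second base, so the diamond disappears.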

felixxm · 2020-11-21T07:47:49Z

tests/postgres_tests/test_search.py

+    def test_as_sql(self):
+        query = Line.objects.all().query
+        compiler = query.get_compiler(connection.alias)
+
+        tests = (
+            (Lexeme('a'), '%s', ["'a'"]),
+            (Lexeme('a', invert=True), '%s', ["!'a'"]),
+            (~Lexeme('a'), '%s', ["!'a'"]),
+            (Lexeme('a', prefix=True), '%s', ["'a':*"]),
+            (Lexeme('a', weight='D'), '%s', ["'a':D"]),
+            (Lexeme('a', invert=True, prefix=True, weight='D'), '%s', ["!'a':*D"]),
+            (Lexeme('a') | Lexeme('b') & ~Lexeme('c'), '%s', ["('a' | ('b' & !'c'))"]),
+            (~(Lexeme('a') | Lexeme('b') & ~Lexeme('c')), '%s', ["(!'a' & (!'b' | 'c'))"]),
+
+            # Some escaping tests
+            (Lexeme("L'amour piqué par une abeille"), '%s', ["'L''amour piqué par une abeille'"]),
+            (Lexeme("'starting quote"), '%s', ["'''starting quote'"]),
+            (Lexeme("ending quote'"), '%s', ["'ending quote'''"]),
+            (Lexeme("double quo''te"), '%s', ["'double quo''''te'"]),
+            (Lexeme("triple quo'''te"), '%s', ["'triple quo''''''te'"]),
+            (Lexeme("backslash\\"), '%s', ["'backslash\\\\'"]),
+        )
+        for expression, expected_sql, expected_params in tests:
+            with self.subTest(expression=expression):
+                resolved = expression.resolve_expression(query)
+                sql, params = resolved.as_sql(compiler, connection)
+                self.assertEqual(sql, expected_sql)
+                self.assertEqual(params, expected_params)


Low-level tests are not necessary; it should be feasible to test using Lexeme() with special characters in a query.

Ahh, so I should change this to make real queries? That makes sense!
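The expected strings in the test table above can be reproduced outside Django with a small stand-alone sketch. `Combinable`, `Term` and `Combined` are hypothetical names used only here, not classes from the patch. Inverting a combined expression is assumed to follow De Morgan's laws, which is what the last two operator cases in the table imply.

```python
from dataclasses import dataclass
from typing import Optional


class Combinable:
    """Mixin providing & and | for the sketch classes below."""

    def __and__(self, other):
        return Combined(self, "&", other)

    def __or__(self, other):
        return Combined(self, "|", other)


@dataclass(frozen=True)
class Term(Combinable):
    value: str
    invert: bool = False
    prefix: bool = False
    weight: Optional[str] = None

    def __invert__(self):
        return Term(self.value, not self.invert, self.prefix, self.weight)

    def render(self):
        # Double quotes and backslashes, then wrap the lexeme in single quotes.
        text = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\")
        suffix = ("*" if self.prefix else "") + (self.weight or "")
        if suffix:
            text += ":" + suffix
        return ("!" if self.invert else "") + text


@dataclass(frozen=True)
class Combined(Combinable):
    lhs: object
    connector: str
    rhs: object

    def __invert__(self):
        # De Morgan: negate both sides and swap the connector.
        flipped = "|" if self.connector == "&" else "&"
        return Combined(~self.lhs, flipped, ~self.rhs)

    def render(self):
        return "(%s %s %s)" % (self.lhs.render(), self.connector, self.rhs.render())


print(Term("a", invert=True, prefix=True, weight="D").render())
print((Term("a") | Term("b") & ~Term("c")).render())
print((~(Term("a") | Term("b") & ~Term("c"))).render())
```

Python's own operator precedence (~ binds tighter than &, which binds tighter than |) determines the grouping, which is why no explicit parentheses are needed in the second expression.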

felixxm · 2020-11-21T08:08:51Z

django/contrib/postgres/search.py

+        super().__init__(value, output_field=output_field)
+
+    def as_sql(self, compiler, connection):
+        param = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\")


Using psycopg2's adapt() should be sufficient in all cases:

Suggested change

-        param = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\")
+        param = psycopg2.extensions.adapt(self.value).getquoted().decode()

Done, thanks!

Getting a crash when using this with unicode characters:

    >>> psycopg2.extensions.adapt("L'amour piqué par une abeille").getquoted().decode()
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 14: invalid continuation byte
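The byte and position in that traceback are consistent with the quoted value having been encoded as Latin-1 and then decoded as UTF-8. The thread does not confirm the cause, so this is an assumption; one plausible reading is that an adapter that has not been bound to a connection falls back to a Latin-1 encoding. The sketch below reproduces the same error with plain Python and no psycopg2 calls.

```python
# The escaped form of the value from the report above.
escaped = "'%s'" % "L'amour piqué par une abeille".replace("'", "''")

# Assumption: the quoted bytes were produced with a Latin-1 encoder.
quoted_bytes = escaped.encode("latin-1")

error = None
try:
    quoted_bytes.decode()  # bytes.decode() defaults to UTF-8
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the codec that produced the bytes round-trips cleanly.
print(quoted_bytes.decode("latin-1"))
```

If that reading is right, either decoding with the adapter's own encoding or preparing the adapter with a live connection before calling getquoted() should avoid the crash. Both remain untested suggestions here.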

felixxm · 2020-11-21T08:22:36Z

django/contrib/postgres/search.py

+    def __init__(self, value, output_field=None, *, invert=False, prefix=False, weight=None):
+        self.prefix = prefix
+        self.invert = invert
+        self.weight = weight


As far as I'm aware, weight and prefix are mutually exclusive, so we should raise a ValueError in that case:

raise ValueError('prefix and weight are mutually exclusive')

Section "12.3.2. Parsing Queries" of https://www.postgresql.org/docs/10/textsearch-controls.html has some examples that appear to combine weights and prefix queries (and there's currently a test for this in postgres_tests.test_search.TestLexemes.test_as_sql).

jacobtylerwalls · 2020-11-25T13:27:17Z

docs/ref/contrib/postgres/search.txt

+==============
+
+Lexeme objects allow search operators to be used with strings from an
+untrusted source. The content of each lexeme is escaped so any operators that


Rewrite sounds good to my ear! If you want fine-tuning suggestions:

"so that any operators that may exist in the string itself"

First sentence sounds fine to me too, but if you're looking for an alternative:

"Lexeme objects allow for safely applying search operators to strings from untrusted sources."

kaedroho · 2020-11-25T14:06:25Z

Thanks for the reviews! I'll get back to this on the weekend to update tests, fix the CI errors and address any other comments

felixxm · 2021-03-31T19:11:52Z

Closing due to inactivity.

gasman mentioned this pull request Apr 27, 2020

Some cleanups in PostgreSQL search module wagtail/wagtail#5953

Merged

auvipy approved these changes May 5, 2020

jacobtylerwalls reviewed Oct 31, 2020

alextatarinov reviewed Nov 12, 2020

felixxm reviewed Nov 21, 2020

joetsoi and others added 13 commits November 25, 2020 12:15

Add prefix matching for Postgres full text search

a7a6afd

Add tests for postgres prefix matching

5742f6a

Add 'or' search for postgres fts prefix matching

e780b68

Add weighting to lexemes

876de7c

Remove second parameter to compiler.compile()

8674b6f

Escaping fixes

3d6faca

Add some more tests

caf510a

Allow ~ syntax on Lexemes

10740ab

Replace RawSearchQuery with SearchQuery(search_type='raw')

493c5f0

Add some basic docs

284ce6a

Docs fixes

c483495

Thanks @jacobtylerwalls for suggestions!

Change method of escaping lexemes

57748c2

As per django#12727 (review)

Fix broken tests

5665ade

kaedroho force-pushed the postgres-fts-prefix branch from 22bfc7e to 5665ade Compare November 25, 2020 12:22

jacobtylerwalls reviewed Nov 25, 2020

Base automatically changed from master to main March 9, 2021 06:21

felixxm closed this Mar 31, 2021
