Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed #28041 -- Added prefix matching for PostgreSQL full text search #12727

Closed
wants to merge 13 commits into from

Conversation

kaedroho
Copy link
Contributor

This pull request is based on an earlier one: #8313

I've rebased it on the latest master and made the following tweaks:

  • Added some tests and docs
  • Added escaping for quotes and backslashes
  • Added support for the ~ operator on Lexemes
  • Replaced the RawSearchQuery with the new SearchQuery(search_type='raw')

Copy link
Member

@jacobtylerwalls jacobtylerwalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Karl, thanks for the patch. Although I'm not able to offer a substantive review, I left some comments on your docs to ensure CI will pass. I also updated the flags on the ticket so that this will be more visible for reviewers who can give you a substantive review. Cheers

django/contrib/postgres/search.py Outdated Show resolved Hide resolved
docs/ref/contrib/postgres/search.txt Outdated Show resolved Hide resolved
docs/ref/contrib/postgres/search.txt Show resolved Hide resolved
==============

Lexeme objects allow search operators to be used along with strings from an untrusted source. The contents
of each lexeme is excaped so cannot contain any operators that PostgreSQL will recognise, but they support
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"escaped"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also plural "the contents of each lexeme are escaped"
and "so cannot contain" is not clear, sounds like it's missing a word.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"recognize" for American English, don't blame me, it's CI that will complain :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think changing "contents" to singular sounds a bit better, but I'm not sure!

"the content of each lexeme is escaped"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a go at rewriting this paragraph, let me know what you think! I'm not sure about the wording of first sentence in particular

django/contrib/postgres/search.py Outdated Show resolved Hide resolved
super().__init__(value, output_field=output_field)

def as_sql(self, compiler, connection):
param = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% certain, but it may require to escape more than a single quote and backslash, specifically, all to_tsquery operators like & and |. The following should fail Lexeme('red!'). Here is the code I used to implement the escaping some time ago.


_SEARCH_SPEC_CHARS = r"['\0\[\]()|&:*!@<>\\]"
_spec_chars_re = re.compile(_SEARCH_SPEC_CHARS)
multiple_spaces_re = re.compile(r'\s{2,}')


def normalize_spaces(val):
    """Converts multiple spaces to single and strips from both sides"""
    if not val:
        return None
    return multiple_spaces_re.sub(' ', val.strip())

def psql_escape(query: str):
    # replace unsafe chars with space
    query = _spec_chars_re.sub(' ', query)
    query = normalize_spaces(query)  # convert multiple spaces to single 
    return query

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it should be handled by 'raw' search type, but it's not, and since quote and backslash are escaped, I assumed it should also handle other special characters.

On the other hand, it was probably omitted intentionally in 'raw' query implementation, so probably we shouldn't try to do it here also.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've updated it to use your code snippet for escaping

This did change to how some words are indexed though. For example: L'amour was previously indexed as L''amour, and now it is L amour. I'm not that familiar with French, so I'm not sure what impact that has.

Copy link
Contributor Author

@kaedroho kaedroho Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've applied the suggestion from #12727 (comment) which bought back the quotes, but it also doesn't escape !.

I think this is fine because the whole lexeme is surrounded by quotes, so your previous example would be escaped as 'red!' and those quotes should prevent it from interpreting the ! (I guess!).

Copy link
Member

@felixxm felixxm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaedroho Thanks for this patch 👍 I left initial comments.


def __init__(self, value, output_field=None, *, invert=False, prefix=False, weight=None):
self.prefix = prefix
self.invert = invert
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use negated.

Suggested change
self.invert = invert
self.negated = negated

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

django/contrib/postgres/search.py Show resolved Hide resolved
@@ -300,3 +303,90 @@ class TrigramSimilarity(TrigramBase):
class TrigramDistance(TrigramBase):
function = ''
arg_joiner = ' <-> '


class LexemeCombinable(Expression):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a diamond-shape inheritance

             Expression
              /      \
LexemeCombinable     Value
              \      /
               Lexeme

which is not necessary. I think we should use:

class LexemeCombinable:
     ...

class Lexeme(LexemeCombinable, Value):
    ...

class CombinedLexeme(LexemeCombinable, CombinedExpression):
    ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

Comment on lines 728 to 770
def test_as_sql(self):
query = Line.objects.all().query
compiler = query.get_compiler(connection.alias)

tests = (
(Lexeme('a'), '%s', ["'a'"]),
(Lexeme('a', invert=True), '%s', ["!'a'"]),
(~Lexeme('a'), '%s', ["!'a'"]),
(Lexeme('a', prefix=True), '%s', ["'a':*"]),
(Lexeme('a', weight='D'), '%s', ["'a':D"]),
(Lexeme('a', invert=True, prefix=True, weight='D'), '%s', ["!'a':*D"]),
(Lexeme('a') | Lexeme('b') & ~Lexeme('c'), '%s', ["('a' | ('b' & !'c'))"]),
(~(Lexeme('a') | Lexeme('b') & ~Lexeme('c')), '%s', ["(!'a' & (!'b' | 'c'))"]),

# Some escaping tests
(Lexeme("L'amour piqué par une abeille"), '%s', ["'L''amour piqué par une abeille'"]),
(Lexeme("'starting quote"), '%s', ["'''starting quote'"]),
(Lexeme("ending quote'"), '%s', ["'ending quote'''"]),
(Lexeme("double quo''te"), '%s', ["'double quo''''te'"]),
(Lexeme("triple quo'''te"), '%s', ["'triple quo''''''te'"]),
(Lexeme("backslash\\"), '%s', ["'backslash\\\\'"]),
)
for expression, expected_sql, expected_params in tests:
with self.subTest(expression=expression):
resolved = expression.resolve_expression(query)
sql, params = resolved.as_sql(compiler, connection)
self.assertEqual(sql, expected_sql)
self.assertEqual(params, expected_params)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low-level tests are not necessary, it should be feasible to test using Lexeme() with special characters in a query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, so I should change this to make real queries? that makes sense!

super().__init__(value, output_field=output_field)

def as_sql(self, compiler, connection):
param = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using psycopg2's adapt() should be sufficient in all cases:

Suggested change
param = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\")
param = psycopg2.extensions.adapt(self.value).getquoted().decode()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting a crash when using this with unicode characters:

>>> psycopg2.extensions.adapt("L'amour piqué par une abeille").getquoted().decode()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 14: invalid continuation byte

def __init__(self, value, output_field=None, *, invert=False, prefix=False, weight=None):
self.prefix = prefix
self.invert = invert
self.weight = weight
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I'm aware weight and prefix are mutually exclusive, we should raise a ValueError in such case:

raise ValueError('prefix and weight are mutually exclusive')

Copy link
Contributor Author

@kaedroho kaedroho Nov 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.postgresql.org/docs/10/textsearch-controls.html under section "12.3.2. Parsing Queries" there are some examples that appear to combine weights and prefix queries (and there's currently a test for this in postgres_tests.test_search.TestLexemes.test_as_sql)

==============

Lexeme objects allow search operators to be used with strings from an
untrusted source. The content of each lexeme is escaped so any operators that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrite sounds good to my ear! If you want fine-tuning suggestions:

  • "so that any operators that may exist in the string itself"

First sentence sounds fine to me too, but if you're looking for an alternative:

  • "Lexeme objects allow for safely applying search operators to strings from untrusted sources."

@kaedroho
Copy link
Contributor Author

Thanks for the reviews! I'll get back to this on the weekend to update tests, fix the CI errors and address any other comments

Base automatically changed from master to main March 9, 2021 06:21
@felixxm
Copy link
Member

felixxm commented Mar 31, 2021

Closing due to inactivity.

@felixxm felixxm closed this Mar 31, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants