New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed #28041 -- Added prefix matching for PostgreSQL full text search #12727
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Karl, thanks for the patch. Although I'm not able to offer a substantive review, I left some comments on your docs to ensure CI will pass. I also updated the flags on the ticket so that this will be more visible for reviewers who can give you a substantive review. Cheers
docs/ref/contrib/postgres/search.txt
Outdated
============== | ||
|
||
Lexeme objects allow search operators to be used along with strings from an untrusted source. The contents | ||
of each lexeme is excaped so cannot contain any operators that PostgreSQL will recognise, but they support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"escaped"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also plural "the contents of each lexeme are escaped"
and "so cannot contain" is not clear, sounds like it's missing a word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"recognize" for American English, don't blame me, it's CI that will complain :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I think changing "contents" to singular sounds a bit better, but I'm not sure!
"the content of each lexeme is escaped"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a go at rewriting this paragraph, let me know what you think! I'm not sure about the wording of first sentence in particular
django/contrib/postgres/search.py
Outdated
super().__init__(value, output_field=output_field) | ||
|
||
def as_sql(self, compiler, connection): | ||
param = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% certain, but it may require to escape more than a single quote and backslash, specifically, all to_tsquery operators like & and |. The following should fail Lexeme('red!')
. Here is the code I used to implement the escaping some time ago.
_SEARCH_SPEC_CHARS = r"['\0\[\]()|&:*!@<>\\]"
_spec_chars_re = re.compile(_SEARCH_SPEC_CHARS)
multiple_spaces_re = re.compile(r'\s{2,}')
def normalize_spaces(val):
"""Converts multiple spaces to single and strips from both sides"""
if not val:
return None
return multiple_spaces_re.sub(' ', val.strip())
def psql_escape(query: str):
# replace unsafe chars with space
query = _spec_chars_re.sub(' ', query)
query = normalize_spaces(query) # convert multiple spaces to single
return query
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it should be handled by 'raw' search type, but it's not, and since quote and backslash are escaped, I assumed it should also handle other special characters.
On the other hand, it was probably omitted intentionally in 'raw' query implementation, so probably we shouldn't try to do it here also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I've updated it to use your code snippet for escaping
This did change to how some words are indexed though. For example: L'amour
was previously indexed as L''amour
, and now it is L amour
. I'm not that familiar with French, so I'm not sure what impact that has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've applied the suggestion from #12727 (comment) which bought back the quotes, but it also doesn't escape !
.
I think this is fine because the whole lexeme is surrounded by quotes, so your previous example would be escaped as 'red!'
and those quotes should prevent it from interpreting the !
(I guess!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kaedroho Thanks for this patch 👍 I left initial comments.
|
||
def __init__(self, value, output_field=None, *, invert=False, prefix=False, weight=None): | ||
self.prefix = prefix | ||
self.invert = invert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use negated
.
self.invert = invert | |
self.negated = negated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing SearchQuery
class also uses invert
: https://github.com/django/django/blob/master/django/contrib/postgres/search.py#L169
@@ -300,3 +303,90 @@ class TrigramSimilarity(TrigramBase): | |||
class TrigramDistance(TrigramBase): | |||
function = '' | |||
arg_joiner = ' <-> ' | |||
|
|||
|
|||
class LexemeCombinable(Expression): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This creates a diamond-shape inheritance
Expression
/ \
LexemeCombinable Value
\ /
Lexeme
which is not necessary. I think we should use:
class LexemeCombinable:
...
class Lexeme(LexemeCombinable, Value):
...
class CombinedLexeme(LexemeCombinable, CombinedExpression):
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
tests/postgres_tests/test_search.py
Outdated
def test_as_sql(self): | ||
query = Line.objects.all().query | ||
compiler = query.get_compiler(connection.alias) | ||
|
||
tests = ( | ||
(Lexeme('a'), '%s', ["'a'"]), | ||
(Lexeme('a', invert=True), '%s', ["!'a'"]), | ||
(~Lexeme('a'), '%s', ["!'a'"]), | ||
(Lexeme('a', prefix=True), '%s', ["'a':*"]), | ||
(Lexeme('a', weight='D'), '%s', ["'a':D"]), | ||
(Lexeme('a', invert=True, prefix=True, weight='D'), '%s', ["!'a':*D"]), | ||
(Lexeme('a') | Lexeme('b') & ~Lexeme('c'), '%s', ["('a' | ('b' & !'c'))"]), | ||
(~(Lexeme('a') | Lexeme('b') & ~Lexeme('c')), '%s', ["(!'a' & (!'b' | 'c'))"]), | ||
|
||
# Some escaping tests | ||
(Lexeme("L'amour piqué par une abeille"), '%s', ["'L''amour piqué par une abeille'"]), | ||
(Lexeme("'starting quote"), '%s', ["'''starting quote'"]), | ||
(Lexeme("ending quote'"), '%s', ["'ending quote'''"]), | ||
(Lexeme("double quo''te"), '%s', ["'double quo''''te'"]), | ||
(Lexeme("triple quo'''te"), '%s', ["'triple quo''''''te'"]), | ||
(Lexeme("backslash\\"), '%s', ["'backslash\\\\'"]), | ||
) | ||
for expression, expected_sql, expected_params in tests: | ||
with self.subTest(expression=expression): | ||
resolved = expression.resolve_expression(query) | ||
sql, params = resolved.as_sql(compiler, connection) | ||
self.assertEqual(sql, expected_sql) | ||
self.assertEqual(params, expected_params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Low-level tests are not necessary, it should be feasible to test using Lexeme()
with special characters in a query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, so I should change this to make real queries? that makes sense!
django/contrib/postgres/search.py
Outdated
super().__init__(value, output_field=output_field) | ||
|
||
def as_sql(self, compiler, connection): | ||
param = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using psycopg2's adapt()
should be sufficient in all cases:
param = "'%s'" % self.value.replace("'", "''").replace("\\", "\\\\") | |
param = psycopg2.extensions.adapt(self.value).getquoted().decode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting a crash when using this with unicode characters:
>>> psycopg2.extensions.adapt("L'amour piqué par une abeille").getquoted().decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 14: invalid continuation byte
def __init__(self, value, output_field=None, *, invert=False, prefix=False, weight=None): | ||
self.prefix = prefix | ||
self.invert = invert | ||
self.weight = weight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I'm aware weight
and prefix
are mutually exclusive, we should raise a ValueError
in such case:
raise ValueError('prefix and weight are mutually exclusive')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.postgresql.org/docs/10/textsearch-controls.html under section "12.3.2. Parsing Queries" there are some examples that appear to combine weights and prefix queries (and there's currently a test for this in postgres_tests.test_search.TestLexemes.test_as_sql
)
Thanks @jacobtylerwalls for suggestions!
22bfc7e
to
5665ade
Compare
============== | ||
|
||
Lexeme objects allow search operators to be used with strings from an | ||
untrusted source. The content of each lexeme is escaped so any operators that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrite sounds good to my ear! If you want fine-tuning suggestions:
- "so that any operators that may exist in the string itself"
First sentence sounds fine to me too, but if you're looking for an alternative:
- "Lexeme objects allow for safely applying search operators to strings from untrusted sources."
Thanks for the reviews! I'll get back to this on the weekend to update tests, fix the CI errors and address any other comments |
Closing due to inactivity. |
This pull request is based on an earlier one: #8313
I've rebased it on the latest master and made the following tweaks:
~
operator on LexemesRawSearchQuery
with the newSearchQuery(search_type='raw')