FIX: Inject extra lexemes for host lexeme. #10198
Conversation
Force-pushed e9ee6b6 to a2f2d81.
Force-pushed a2f2d81 to c8241de.
```ruby
tsvector.scan(/'(([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+)'\:([\w+,]+)/).reduce(additional_lexemes) do |array, (lexeme, _, positions)|
  array.concat(lexeme.split('.').map { |l| "'#{l}':#{positions}" })
end
```
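To make the behaviour of this `scan`/`reduce` concrete, here is a minimal standalone sketch; the input tsvector string is an example I chose, not taken from the PR, and the real code runs against the full stored tsvector:

```ruby
# Sample tsvector containing a single host lexeme (illustrative input).
tsvector = "'www.discourse.org':1"
additional_lexemes = []

# The regex captures the dotted host and its positions; we then split the
# host on '.' and emit one extra lexeme per label, reusing the positions.
tsvector.scan(/'(([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+)'\:([\w+,]+)/).reduce(additional_lexemes) do |array, (lexeme, _, positions)|
  array.concat(lexeme.split('.').map { |l| "'#{l}':#{positions}" })
end

puts additional_lexemes.inspect
# => ["'www':1", "'discourse':1", "'org':1"]
```

Note that the second capture group (the trailing `label.` repetition) is discarded via the `_` in the destructuring; only the full host and the positions are used.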
My big question, though: is this injection still too much? If my post contains `sam.i.am.hello`, should the word `hello` really find this post? Should this injection be specific to domains, eg: https://www.discourse.org? Even if it is only for domains, should `org` be included as well? What about query params? Surely we don't want to inject in query params, eg: https://domain.com?a.b.c.g.e.f.h=1. Do we really need `h` here as a distinct piece?
```
discourse_development=# SELECT * FROM TS_DEBUG('https://domain.com?a.b.c.g.e.f.h=1');
  alias   |    description    |     token     | dictionaries | dictionary |     lexemes
----------+-------------------+---------------+--------------+------------+-----------------
 protocol | Protocol head     | https://      | {}           |            |
 host     | Host              | domain.com    | {simple}     | simple     | {domain.com}
 blank    | Space symbols     | ?             | {}           |            |
 file     | File or path name | a.b.c.g.e.f.h | {simple}     | simple     | {a.b.c.g.e.f.h}
 blank    | Space symbols     | =             | {}           |            |
 uint     | Unsigned integer  | 1             | {simple}     | simple     | {1}
```
Ahh, I thought PG was smart enough to only retrieve the host... it looks like it treats the query params as a `file` token here. I wonder if we can identify what type a lexeme is.
> If my post contains `sam.i.am.hello`, should the word `hello` really find this post?

I think so, since this was one of the bug reports we received previously: https://meta.discourse.org/t/discourses-internal-search-does-not-find-the-phrase-pagedowncustom-but-google-does/35406/8

> What about query params... surely we don't want to inject in query params, eg: https://domain.com?a.b.c.g.e.f.h=1. Do we really need `h` here as a distinct piece?
Query params are tricky and it seems like the PG default parser isn't very smart.
```
discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org?test.dot=1');
                to_tsvector
-------------------------------------------
 '1':3 'meta.discourse.org':1 'test.dot':2
(1 row)

discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org/?test.dot=1');
                                to_tsvector
----------------------------------------------------------------------------
 '/?test.dot=1':3 'meta.discourse.org':2 'meta.discourse.org/?test.dot=1':1
(1 row)
```
I guess one thing we can do is to drop query params from the search index?
Force-pushed e4123d3 to 89d5a64.
```ruby
expect(Search.execute('http://wb').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra.org').posts.map(&:id)).to eq([post.id])
expect(Search.execute('org.uk').posts.map(&:id)).to eq([post.id])
expect(Search.execute('camra.org.uk').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra.org.uk').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra.org.uk/latest').posts.map(&:id)).to eq([post.id])
expect(Search.execute('/latest#test').posts.map(&:id)).to eq([post.id])
```
@SamSaffron IMO the searches I've added above are all valid and make search better.
```ruby
loop do
  count += 1
  break if count >= 10 # Safeguard here to prevent infinite loop when a term has many dots
```
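To show how such a capped loop behaves end to end, here is a hedged, self-contained sketch that derives dotted-suffix lexemes from a host term; the exact splitting scheme and variable names are my illustration, not the PR's code, but the `count >= 10` safeguard mirrors the excerpt above:

```ruby
# Illustrative term; the safeguard matters for pathological inputs with
# dozens of dots, where we stop after 10 iterations.
term = "wb.camra.org.uk"
lexemes = [term]
count = 0

loop do
  count += 1
  break if count >= 10 # Safeguard to prevent an infinite loop when a term has many dots

  # Drop the leading label and keep the remaining dotted suffix.
  _, _, remainder = term.partition(".")
  break if remainder.empty?

  lexemes << remainder
  term = remainder
end

puts lexemes.inspect
# => ["wb.camra.org.uk", "camra.org.uk", "org.uk", "uk"]
```

With at most 10 iterations, even a term like the one quoted below can only ever add a bounded number of lexemes.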
Realized we need a safeguard here to prevent extreme cases like `some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end`.
```ruby
data.gsub!(EmailCook.url_regexp) do |url|
  uri = URI.parse(url)
  uri.query = nil
```
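The query-stripping step above can be exercised in isolation with Ruby's standard `URI` library; the URL below is an example of mine, not from the PR:

```ruby
require "uri"

# Setting the query component to nil drops everything after the '?',
# so only the host and path are left to be indexed.
url = "https://www.discourse.org/latest?test=2&test2=3"
uri = URI.parse(url)
uri.query = nil

puts uri.to_s
# => https://www.discourse.org/latest
```

Note that `URI.parse` raises `URI::InvalidURIError` on malformed input, so the real code path presumably needs to tolerate URLs the regexp matches but `URI` rejects.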
@SamSaffron I think we should just strip the query string from URLs. The following seems to work as expected:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                      to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```

However, once a path is present:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                          to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```
The parsing here is really inconsistent and complex query strings will end up generating noise in our search data.
I still feel we should refine this further, but I don't want to block the further optimisation of the excerpt parsing. Can you go ahead and merge?
```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
 alias |       lexemes
-------+---------------------
 host  | {www.discourse.org}

discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
      to_tsvector
-----------------------
 'www.discourse.org':1
```

Given the above lexeme, we will inject additional lexemes by splitting the host on `.`. The actual tsvector stored will look something like

```
                               tsvector
-----------------------------------------------------------------------
 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
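The stored tsvector above contains both the individual labels and the dotted suffixes of the host. A minimal Ruby sketch of that splitting scheme, assuming each host yields every label plus every dotted suffix (variable names and the hardcoded `:1` position are illustrative; the real code carries positions over from the original lexeme):

```ruby
host = "www.discourse.org"
parts = host.split(".") # ["www", "discourse", "org"]

# Every dotted suffix: "www.discourse.org", "discourse.org", "org".
suffixes = (0...parts.length).map { |i| parts[i..].join(".") }

# Union of labels and suffixes, rendered in tsvector notation.
lexemes = (parts | suffixes).sort.map { |l| "'#{l}':1" }

puts lexemes.join(" ")
# => 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```

This is what lets a search for `discourse.org` or for the bare word `discourse` match a post that only contains the full host.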
Indexing query strings in URLs produces inconsistent results in PG and pollutes the search data for very little gain. The following seems to work as expected:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                      to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```

However, once a path is present:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                          to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```

The lexeme contains both the path and the query string.
Force-pushed 1c4f394 to 2196d0b.