
FIX: Inject extra lexemes for host lexeme. #10198

Merged
merged 2 commits into discourse:master from remove_inject_extra_terms on Jul 14, 2020

Conversation

tgxworld
Contributor

@tgxworld tgxworld commented Jul 9, 2020

```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
 alias |       lexemes
-------+---------------------
 host  | {www.discourse.org}

discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
      to_tsvector
-----------------------
 'www.discourse.org':1
```

Given the above lexeme, we will inject additional lexemes by splitting
the host on `.`. The actual tsvector stored will look something like:

```
               tsvector
---------------------------------------
 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
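
For illustration, here is a minimal Ruby sketch of the idea. The helper name is made up and this is not the code in this PR, just the shape of deriving extra lexemes from a host:

```ruby
# Illustrative sketch only (hypothetical helper, not the PR's implementation):
# derive extra lexemes from a host lexeme by splitting on '.' and keeping
# every trailing sub-domain alongside the individual segments.
def extra_host_lexemes(host)
  parts = host.split('.')
  suffixes = (0...parts.length).map { |i| parts[i..-1].join('.') }
  (parts + suffixes).uniq
end

extra_host_lexemes('www.discourse.org')
# => ["www", "discourse", "org", "www.discourse.org", "discourse.org"]
```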

@tgxworld tgxworld changed the title FIX: Search for whole URLs wasn't working. FIX: Inject extra lexemes for host lexeme. Jul 9, 2020

```ruby
tsvector.scan(/'(([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+)'\:([\w+,]+)/).reduce(additional_lexemes) do |array, (lexeme, _, positions)|
  array.concat(lexeme.split('.').map { |l| "'#{l}':#{positions}" })
end
```
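
As a standalone illustration of what the snippet above produces, here is a minimal run against the tsvector from the description (`additional_lexemes` is a plain array standing in for the accumulator used in the real code):

```ruby
# Self-contained run of the scan/reduce above against a sample tsvector string.
additional_lexemes = []
tsvector = "'www.discourse.org':1"

tsvector.scan(/'(([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+)'\:([\w+,]+)/).reduce(additional_lexemes) do |array, (lexeme, _, positions)|
  array.concat(lexeme.split('.').map { |l| "'#{l}':#{positions}" })
end

additional_lexemes
# => ["'www':1", "'discourse':1", "'org':1"]
```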
Member

@SamSaffron SamSaffron Jul 9, 2020


My big question, though, is: is this injection still too much?

If my post contains `sam.i.am.hello`, should the word `hello` really find this post? Should this injection be specific to domains, e.g. https://www.discourse.org? Even if this is only for domains, should `org` be included as well? And what about query params? Surely we don't want to inject into query params, e.g. https://domain.com?a.b.c.g.e.f.h=1. Do we really need `h` there as a distinct piece?

Contributor Author


```
discourse_development=# SELECT * FROM TS_DEBUG('https://domain.com?a.b.c.g.e.f.h=1');
  alias   |    description    |     token     | dictionaries | dictionary |     lexemes
----------+-------------------+---------------+--------------+------------+-----------------
 protocol | Protocol head     | https://      | {}           |            |
 host     | Host              | domain.com    | {simple}     | simple     | {domain.com}
 blank    | Space symbols     | ?             | {}           |            |
 file     | File or path name | a.b.c.g.e.f.h | {simple}     | simple     | {a.b.c.g.e.f.h}
 blank    | Space symbols     | =             | {}           |            |
 uint     | Unsigned integer  | 1             | {simple}     | simple     | {1}
```

Ahh I thought PG was smart enough to only retrieve the host... looks like it treats query params as file here. I wonder if we can identify what type a lexeme is.
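
One rough way to do that (a sketch only, not what this PR ends up doing) is to lean on the `alias` column that `ts_debug` already returns and keep just the host tokens:

```ruby
# Sketch: use ts_debug's "alias" column to tell host tokens apart from
# file/query-param tokens. Assumes an ActiveRecord connection to the same
# PG database; not the implementation in this PR.
url = "https://domain.com?a.b.c.g.e.f.h=1"
quoted = ActiveRecord::Base.connection.quote(url)

ActiveRecord::Base.connection
  .select_rows("SELECT alias, token FROM TS_DEBUG(#{quoted})")
  .select { |token_type, _token| token_type == "host" }
  .map { |_token_type, token| token }
# => ["domain.com"]
```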

Contributor Author


If my post contains sam.i.am.hello should the word hello really find this post?

I think so, since this was one of the bug reports we got previously: https://meta.discourse.org/t/discourses-internal-search-does-not-find-the-phrase-pagedowncustom-but-google-does/35406/8.

what about query params... surely we don't want to inject in query params eg: https://domain.com?a.b.c.g.e.f.h=1 do we really need h here as a distinct piece?

Query params are tricky and it seems like the PG default parser isn't very smart.

```
discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org?test.dot=1');
                to_tsvector
-------------------------------------------
 '1':3 'meta.discourse.org':1 'test.dot':2
(1 row)

discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org/?test.dot=1');
                                to_tsvector
----------------------------------------------------------------------------
 '/?test.dot=1':3 'meta.discourse.org':2 'meta.discourse.org/?test.dot=1':1
(1 row)
```

I guess one thing we can do is to drop query params from the search index?

@tgxworld tgxworld force-pushed the remove_inject_extra_terms branch 2 times, most recently from e4123d3 to 89d5a64 on July 9, 2020 07:53
Comment on lines +1209 to +1277
```ruby
expect(Search.execute('http://wb').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra.org').posts.map(&:id)).to eq([post.id])
expect(Search.execute('org.uk').posts.map(&:id)).to eq([post.id])
expect(Search.execute('camra.org.uk').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra.org.uk').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra.org.uk/latest').posts.map(&:id)).to eq([post.id])
expect(Search.execute('/latest#test').posts.map(&:id)).to eq([post.id])
```
Contributor Author


@SamSaffron IMO the searches which I've added above are all valid and make search better.


```ruby
loop do
  count += 1
  break if count >= 10 # Safeguard here to prevent infinite loop when a term has many dots
```
Contributor Author


Realized we need a safeguard here to prevent extreme cases like `some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end`.
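
Roughly, the shape of that safeguard (illustrative helper and names, not the exact code in this PR):

```ruby
# Illustrative only: derive trailing sub-terms from a dotted term, but cap the
# number of iterations so a pathological term with many dots cannot loop forever.
def suffix_lexemes(term, max_iterations = 10)
  lexemes = []
  count = 0

  loop do
    count += 1
    break if count >= max_iterations # safeguard against terms with many dots

    term = term.split('.', 2)[1] # drop the leading segment, keep the rest
    break if term.nil? || !term.include?('.')

    lexemes << term
  end

  lexemes
end

suffix_lexemes('www.discourse.org') # => ["discourse.org"]
```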


```ruby
data.gsub!(EmailCook.url_regexp) do |url|
  uri = URI.parse(url)
  uri.query = nil
```
Contributor Author


@SamSaffron I think we should just strip the query string from URLs. The following seems to work as expected:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                     to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```

However, once a path is present:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```

The parsing here is really inconsistent, and complex query strings will end up generating noise in our search data.
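
The stripping itself is just plain stdlib `URI`, e.g.:

```ruby
# Plain-Ruby illustration of dropping the query string before indexing.
require 'uri'

uri = URI.parse("https://www.discourse.org/latest?test=2&test2=3")
uri.query = nil
uri.to_s
# => "https://www.discourse.org/latest"
```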

@SamSaffron
Member

I still feel we should refine this further, but I don't want to block the further optimisation of the excerpt parsing. Can you go ahead and merge?

```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
 alias |       lexemes
-------+---------------------
 host  | {www.discourse.org}

discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
      to_tsvector
-----------------------
 'www.discourse.org':1
```

Given the above lexeme, we will inject additional lexemes by splitting
the host on `.`. The actual tsvector stored will look something like:

```
               tsvector
---------------------------------------
 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
Indexing query strings in URLs produces inconsistent results in PG and
pollutes the search data for really little gain.

The following seems to work as expected...

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                     to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```

However, once a path is present:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```

The lexeme contains both the path and the query string.
@tgxworld tgxworld merged commit 2196d0b into discourse:master Jul 14, 2020
@tgxworld tgxworld deleted the remove_inject_extra_terms branch July 14, 2020 07:48