
FIX: Inject extra lexemes for host lexeme. #10198

Merged
merged 2 commits into discourse:master from remove_inject_extra_terms on Jul 14, 2020

Conversation

tgxworld
Contributor

@tgxworld tgxworld commented Jul 9, 2020

```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
 alias |       lexemes
-------+---------------------
 host  | {www.discourse.org}

discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
      to_tsvector
-----------------------
 'www.discourse.org':1
```

Given the above lexeme, we will inject additional lexemes by splitting
the host on `.`. The actual tsvector stored will look something like:

```
               tsvector
---------------------------------------
 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
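
For illustration, here is a minimal Ruby sketch of the idea. The helper name is made up and this is not the code in this PR, just the shape of deriving extra lexemes from a host:

```ruby
# Illustrative sketch only (hypothetical helper, not the PR's implementation):
# derive extra lexemes from a host lexeme by splitting on '.' and keeping
# every trailing sub-domain alongside the individual segments.
def extra_host_lexemes(host)
  parts = host.split('.')
  suffixes = (0...parts.length).map { |i| parts[i..-1].join('.') }
  (parts + suffixes).uniq
end

extra_host_lexemes('www.discourse.org')
# => ["www", "discourse", "org", "www.discourse.org", "discourse.org"]
```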

@tgxworld tgxworld changed the title FIX: Search for whole URLs wasn't working. FIX: Inject extra lexemes for host lexeme. Jul 9, 2020

```ruby
tsvector.scan(/'(([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+)'\:([\w+,]+)/).reduce(additional_lexemes) do |array, (lexeme, _, positions)|
  array.concat(lexeme.split('.').map { |l| "'#{l}':#{positions}" })
end
```
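
As a standalone illustration of what the snippet above produces, here is a minimal run against the tsvector from the description (`additional_lexemes` is a plain array standing in for the accumulator used in the real code):

```ruby
# Self-contained run of the scan/reduce above against a sample tsvector string.
additional_lexemes = []
tsvector = "'www.discourse.org':1"

tsvector.scan(/'(([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+)'\:([\w+,]+)/).reduce(additional_lexemes) do |array, (lexeme, _, positions)|
  array.concat(lexeme.split('.').map { |l| "'#{l}':#{positions}" })
end

additional_lexemes
# => ["'www':1", "'discourse':1", "'org':1"]
```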
Member

@SamSaffron SamSaffron Jul 9, 2020


My big question, though, is: is this injection still too much?

If my post contains `sam.i.am.hello`, should the word `hello` really find this post? Should this injection be specific to domains, e.g. https://www.discourse.org? Even if this is only for domains, should `org` be included as well? And what about query params? Surely we don't want to inject into query params, e.g. https://domain.com?a.b.c.g.e.f.h=1. Do we really need `h` there as a distinct piece?

Contributor Author


```
discourse_development=# SELECT * FROM TS_DEBUG('https://domain.com?a.b.c.g.e.f.h=1');
  alias   |    description    |     token     | dictionaries | dictionary |     lexemes
----------+-------------------+---------------+--------------+------------+-----------------
 protocol | Protocol head     | https://      | {}           |            |
 host     | Host              | domain.com    | {simple}     | simple     | {domain.com}
 blank    | Space symbols     | ?             | {}           |            |
 file     | File or path name | a.b.c.g.e.f.h | {simple}     | simple     | {a.b.c.g.e.f.h}
 blank    | Space symbols     | =             | {}           |            |
 uint     | Unsigned integer  | 1             | {simple}     | simple     | {1}
```

Ahh I thought PG was smart enough to only retrieve the host... looks like it treats query params as file here. I wonder if we can identify what type a lexeme is.
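
One rough way to do that (a sketch only, not what this PR ends up doing) is to lean on the `alias` column that `ts_debug` already returns and keep just the host tokens:

```ruby
# Sketch: use ts_debug's "alias" column to tell host tokens apart from
# file/query-param tokens. Assumes an ActiveRecord connection to the same
# PG database; not the implementation in this PR.
url = "https://domain.com?a.b.c.g.e.f.h=1"
quoted = ActiveRecord::Base.connection.quote(url)

ActiveRecord::Base.connection
  .select_rows("SELECT alias, token FROM TS_DEBUG(#{quoted})")
  .select { |token_type, _token| token_type == "host" }
  .map { |_token_type, token| token }
# => ["domain.com"]
```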

Contributor Author


If my post contains sam.i.am.hello should the word hello really find this post?

I think so, since this was one of the bug reports we got previously: https://meta.discourse.org/t/discourses-internal-search-does-not-find-the-phrase-pagedowncustom-but-google-does/35406/8.

what about query params... surely we don't want to inject in query params eg: https://domain.com?a.b.c.g.e.f.h=1 do we really need h here as a distinct piece?

Query params are tricky and it seems like the PG default parser isn't very smart.

```
discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org?test.dot=1');
                to_tsvector
-------------------------------------------
 '1':3 'meta.discourse.org':1 'test.dot':2
(1 row)

discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org/?test.dot=1');
                                to_tsvector
----------------------------------------------------------------------------
 '/?test.dot=1':3 'meta.discourse.org':2 'meta.discourse.org/?test.dot=1':1
(1 row)
```

I guess one thing we can do is to drop query params from the search index?

@tgxworld tgxworld force-pushed the remove_inject_extra_terms branch 2 times, most recently from e4123d3 to 89d5a64 on July 9, 2020 07:53
Comment on lines +1209 to +1277
```ruby
expect(Search.execute('http://wb').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra.org').posts.map(&:id)).to eq([post.id])
expect(Search.execute('org.uk').posts.map(&:id)).to eq([post.id])
expect(Search.execute('camra.org.uk').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra.org.uk').posts.map(&:id)).to eq([post.id])
expect(Search.execute('wb.camra.org.uk/latest').posts.map(&:id)).to eq([post.id])
expect(Search.execute('/latest#test').posts.map(&:id)).to eq([post.id])
```
Contributor Author


@SamSaffron IMO the searches which I've added above are all valid and make search better.


```ruby
loop do
  count += 1
  break if count >= 10 # Safeguard here to prevent infinite loop when a term has many dots
```
Contributor Author


Realized we need a safeguard here to prevent extreme cases like `some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end`.
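
Roughly, the shape of that safeguard (illustrative helper and names, not the exact code in this PR):

```ruby
# Illustrative only: derive trailing sub-terms from a dotted term, but cap the
# number of iterations so a pathological term with many dots cannot loop forever.
def suffix_lexemes(term, max_iterations = 10)
  lexemes = []
  count = 0

  loop do
    count += 1
    break if count >= max_iterations # safeguard against terms with many dots

    term = term.split('.', 2)[1] # drop the leading segment, keep the rest
    break if term.nil? || !term.include?('.')

    lexemes << term
  end

  lexemes
end

suffix_lexemes('www.discourse.org') # => ["discourse.org"]
```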


```ruby
data.gsub!(EmailCook.url_regexp) do |url|
  uri = URI.parse(url)
  uri.query = nil
```
Contributor Author


@SamSaffron I think we should just strip the query string from URLs. The following seems to work as expected:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                     to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```

However, once a path is present:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```

The parsing here is really inconsistent, and complex query strings will end up generating noise in our search data.
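
The stripping itself is just plain stdlib `URI`, e.g.:

```ruby
# Plain-Ruby illustration of dropping the query string before indexing.
require 'uri'

uri = URI.parse("https://www.discourse.org/latest?test=2&test2=3")
uri.query = nil
uri.to_s
# => "https://www.discourse.org/latest"
```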

@SamSaffron
Member

I still feel we should refine this further, but I don't want to block the further optimisation of the excerpt parsing. Can you go ahead and merge?

```
discourse_development=# SELECT alias, lexemes FROM TS_DEBUG('www.discourse.org');
 alias |       lexemes
-------+---------------------
 host  | {www.discourse.org}

discourse_development=# SELECT TO_TSVECTOR('www.discourse.org');
      to_tsvector
-----------------------
 'www.discourse.org':1
```

Given the above lexeme, we will inject additional lexemes by splitting
the host on `.`. The actual tsvector stored will look something like:

```
               tsvector
---------------------------------------
 'discourse':1 'discourse.org':1 'org':1 'www':1 'www.discourse.org':1
```
Indexing query strings in URLs produces inconsistent results in PG and
pollutes the search data for really little gain.

The following seems to work as expected...

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                     to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1
```

However, once a path is present:

```
discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1
```

The lexeme contains both the path and the query string.
@tgxworld tgxworld merged commit 2196d0b into discourse:master Jul 14, 2020
@tgxworld tgxworld deleted the remove_inject_extra_terms branch July 14, 2020 07:48