FIX: Inject extra lexemes for host lexeme. #10198

Merged · 2 commits · Jul 14, 2020
51 changes: 29 additions & 22 deletions app/services/search_indexer.rb
@@ -16,26 +16,9 @@ def self.scrub_html_for_search(html, strip_diacritics: SiteSetting.search_ignore
     HtmlScrubber.scrub(html, strip_diacritics: strip_diacritics)
   end
 
-  def self.inject_extra_terms(raw)
-    return raw if !SiteSetting.search_inject_extra_terms
-
-    # insert some extra words for I.am.a.word so "word" is tokenized
-    # I.am.a.word becomes I.am.a.word am a word
-    raw.gsub(/[^[:space:]]*[\.]+[^[:space:]]*/) do |with_dot|
-
-      split = with_dot.split(/https?:\/\/|[?:;,.\/]/)
-
-      if split.length > 1
-        with_dot + ((+" ") << split[1..-1].reject { |x| x.blank? }.join(" "))
-      else
-        with_dot
-      end
-    end
-  end
-
   def self.update_index(table: , id: , raw_data:)
     search_data = raw_data.map do |data|
-      inject_extra_terms(Search.prepare_data(data || "", :index))
+      Search.prepare_data(data || "", :index)
     end
 
     table_name = "#{table}_search_data"
@@ -53,15 +36,39 @@ def self.update_index(table: , id: , raw_data:)

     indexed_data = search_data.select { |d| d.length > 0 }.join(' ')
 
-    params = {
+    ranked_params = {
       a: search_data[0],
       b: search_data[1],
       c: search_data[2],
       d: search_data[3],
+    }
+
+    tsvector = DB.query_single("SELECT #{ranked_index}", ranked_params)[0]
+    additional_lexemes = []
+
+    tsvector.scan(/'(([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+)'\:([\w+,]+)/).reduce(additional_lexemes) do |array, (lexeme, _, positions)|
+      count = 0
+
+      loop do
+        count += 1
+        break if count >= 10 # Safeguard here to prevent infinite loop when a term has many dots
[Review comment · Contributor (author)]
Realized we need a safeguard here to prevent extreme cases like `some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end.some.super.long.file.name.that.will.never.end`.

+        term, _, remaining = lexeme.partition(".")
+        break if remaining.blank?
+        array << "'#{term}':#{positions} '#{remaining}':#{positions}"
+        lexeme = remaining
+      end
+
+      array
+    end
[Review comment · @SamSaffron (Member) · Jul 9, 2020]
My big question though is: is this injection still too much?

If my post contains sam.i.am.hello, should the word hello really find this post? Should this injection be specific to domains, e.g. https://www.discourse.org? Even if this is for domains, should org be included as well? What about query params? Surely we don't want to inject in query params, e.g. https://domain.com?a.b.c.g.e.f.h=1. Do we really need h here as a distinct piece?

[Review comment · Contributor (author)]

discourse_development=# SELECT * FROM TS_DEBUG('https://domain.com?a.b.c.g.e.f.h=1');
  alias   |    description    |     token     | dictionaries | dictionary |     lexemes     
----------+-------------------+---------------+--------------+------------+-----------------
 protocol | Protocol head     | https://      | {}           |            | 
 host     | Host              | domain.com    | {simple}     | simple     | {domain.com}
 blank    | Space symbols     | ?             | {}           |            | 
 file     | File or path name | a.b.c.g.e.f.h | {simple}     | simple     | {a.b.c.g.e.f.h}
 blank    | Space symbols     | =             | {}           |            | 
 uint     | Unsigned integer  | 1             | {simple}     | simple     | {1}

Ahh, I thought PG was smart enough to only retrieve the host... looks like it treats query params as a file token here. I wonder if we can identify what type a lexeme is.
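
For reference, a minimal sketch (not part of this PR): TS_DEBUG's alias column does report the token type, so the host token can be picked out directly. This assumes Discourse's DB query helper, as used elsewhere in this diff; the URL is just an example.

# Hypothetical: select only the tokens PG classified as 'host'.
host = DB.query_single(<<~SQL, url: "https://domain.com?a.b.c.g.e.f.h=1").first
  SELECT token FROM TS_DEBUG(:url) WHERE alias = 'host'
SQL

host # => "domain.com"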

[Review comment · Contributor (author)]

> If my post contains sam.i.am.hello should the word hello really find this post?

I think so, since this was one of the bug reports we got previously: https://meta.discourse.org/t/discourses-internal-search-does-not-find-the-phrase-pagedowncustom-but-google-does/35406/8

> What about query params... surely we don't want to inject in query params, e.g. https://domain.com?a.b.c.g.e.f.h=1. Do we really need h here as a distinct piece?

Query params are tricky, and it seems like the PG default parser isn't very smart.

discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org?test.dot=1');
                to_tsvector                
-------------------------------------------
 '1':3 'meta.discourse.org':1 'test.dot':2
(1 row)

discourse_development=# SELECT TO_TSVECTOR('https://meta.discourse.org/?test.dot=1');
                                to_tsvector                                 
----------------------------------------------------------------------------
 '/?test.dot=1':3 'meta.discourse.org':2 'meta.discourse.org/?test.dot=1':1
(1 row)

I guess one thing we can do is to drop query params from the search index?
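
As an illustration of the loop in this diff, here is a self-contained sketch of what the scan/partition pass produces, assuming a tsvector string in the format PG returns (positions are made up for the example, and plain Ruby's String#empty? stands in for Rails' #blank?):

tsvector = "'meta.discourse.org':2 'some.png':3"
additional_lexemes = []

tsvector.scan(/'(([a-zA-Z0-9]+\.)+[a-zA-Z0-9]+)':([\w+,]+)/) do |lexeme, _, positions|
  count = 0

  loop do
    count += 1
    break if count >= 10 # same safeguard against terms with many dots

    term, _, remaining = lexeme.partition(".")
    break if remaining.empty?

    additional_lexemes << "'#{term}':#{positions} '#{remaining}':#{positions}"
    lexeme = remaining
  end
end

additional_lexemes
# => ["'meta':2 'discourse.org':2", "'discourse':2 'org':2", "'some':3 'png':3"]

So a search for "discourse" or "org" can now match a post that only contained the dotted host.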


+    tsvector = "#{tsvector} #{additional_lexemes.join(' ')}"
+
+    params = {
       raw_data: indexed_data,
       id: id,
       locale: SiteSetting.default_locale,
-      version: INDEX_VERSION
+      version: INDEX_VERSION,
+      tsvector: tsvector,
     }
 
     # Would be nice to use AR here but not sure how to execute Postgres functions
@@ -71,7 +78,7 @@ def self.update_index(table: , id: , raw_data:)
       SET
         raw_data = :raw_data,
         locale = :locale,
-        search_data = #{ranked_index},
+        search_data = (:tsvector)::tsvector,
         version = :version
       WHERE #{foreign_key} = :id
     SQL
@@ -80,7 +87,7 @@ def self.update_index(table: , id: , raw_data:)
       DB.exec(<<~SQL, params)
         INSERT INTO #{table_name}
         (#{foreign_key}, search_data, locale, raw_data, version)
-        VALUES (:id, #{ranked_index}, :locale, :raw_data, :version)
+        VALUES (:id, (:tsvector)::tsvector, :locale, :raw_data, :version)
       SQL
     end
   rescue
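One detail worth spelling out: the UPDATE and INSERT above now bind the concatenated lexeme string and cast it back to a tsvector in SQL. A minimal sketch of that round trip, again assuming Discourse's DB helper (values are illustrative):

vector = "'meta.discourse.org':2 'meta':2 'discourse.org':2"

DB.query_single("SELECT (:v)::tsvector", v: vector).first
# => "'discourse.org':2 'meta':2 'meta.discourse.org':2" (PG re-sorts the lexemes)
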
5 changes: 1 addition & 4 deletions config/site_settings.yml
@@ -379,7 +379,7 @@ login:
   discord_trusted_guilds:
     default: ""
     type: list
-  external_auth_skip_create_confirm: 
+  external_auth_skip_create_confirm:
     default: false
     client: true
   enable_sso:
@@ -1750,9 +1750,6 @@ search:
   search_ranking_normalization:
     default: '1'
     hidden: true
-  search_inject_extra_terms:
-    default: true
-    hidden: true
   min_search_term_length:
     client: true
     default: 3
9 changes: 9 additions & 0 deletions lib/search.rb
@@ -86,6 +86,15 @@ def self.prepare_data(search_data, purpose = :query)
         data = strip_diacritics(data)
       end
     end
 
+    data.gsub!(EmailCook.url_regexp) do |url|
+      uri = URI.parse(url)
+      uri.query = nil
[Review comment · Contributor (author)]
@SamSaffron I think we should just strip the query string from URLs. The following seems to work as expected:

discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org?test=2&test2=3');
                     to_tsvector
------------------------------------------------------
 '2':3 '3':5 'test':2 'test2':4 'www.discourse.org':1

However, once a path is present:

discourse_development=# SELECT TO_TSVECTOR('https://www.discourse.org/latest?test=2&test2=3');
                                         to_tsvector
----------------------------------------------------------------------------------------------
 '/latest?test=2&test2=3':3 'www.discourse.org':2 'www.discourse.org/latest?test=2&test2=3':1

The parsing here is really inconsistent, and complex query strings will end up generating noise in our search data.

+      uri.to_s
+    rescue URI::Error
+      # Don't fail even if URL turns out to be invalid
+    end
+
     data
   end

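The gsub above leans on URI from Ruby's standard library; a minimal sketch of the query stripping on its own (sample URL is illustrative):

require "uri"

# Drop the query string so the PG parser never sees "?test.dot=1" style noise.
uri = URI.parse("https://meta.discourse.org/latest?test.dot=1")
uri.query = nil
uri.to_s # => "https://meta.discourse.org/latest"
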
18 changes: 12 additions & 6 deletions spec/components/search_spec.rb
@@ -1255,20 +1255,26 @@ def search
     ])
   end
 
-  it 'can tokenize dots' do
+  it 'can search for terms with dots' do
     post = Fabricate(:post, raw: 'Will.2000 Will.Bob.Bill...')
     expect(Search.execute('bill').posts.map(&:id)).to eq([post.id])
     expect(Search.execute('bob').posts.map(&:id)).to eq([post.id])
     expect(Search.execute('2000').posts.map(&:id)).to eq([post.id])
   end

   it 'can search URLS correctly' do
     post = Fabricate(:post, raw: 'i like http://wb.camra.org.uk/latest#test so yay')
 
     expect(Search.execute('http://wb.camra.org.uk/latest#test').posts.map(&:id)).to eq([post.id])
     expect(Search.execute('camra').posts.map(&:id)).to eq([post.id])
 
     complex_url = "https://test.some.site.com/path?some.range_input=74235a"
     post2 = Fabricate(:post, raw: "this is a complex url #{complex_url} so complex")
 
     expect(Search.execute(complex_url).posts.map(&:id)).to eq([post2.id])
+    expect(Search.execute('http://wb').posts.map(&:id)).to eq([post.id])
+    expect(Search.execute('wb.camra').posts.map(&:id)).to eq([post.id])
+    expect(Search.execute('wb.camra.org').posts.map(&:id)).to eq([post.id])
+    expect(Search.execute('org.uk').posts.map(&:id)).to eq([post.id])
+    expect(Search.execute('camra.org.uk').posts.map(&:id)).to eq([post.id])
+    expect(Search.execute('wb.camra.org.uk').posts.map(&:id)).to eq([post.id])
+    expect(Search.execute('wb.camra.org.uk/latest').posts.map(&:id)).to eq([post.id])
+    expect(Search.execute('/latest#test').posts.map(&:id)).to eq([post.id])
[Review comment on lines +1270 to +1277 · Contributor (author)]
@SamSaffron IMO the searches I've added above are all valid and make search better.

   end
 
   it 'supports category slug and tags' do
33 changes: 22 additions & 11 deletions spec/services/search_indexer_spec.rb
@@ -17,16 +17,6 @@ def scrub(html, strip_diacritics: false)
     SearchIndexer.scrub_html_for_search(html, strip_diacritics: strip_diacritics)
   end
 
-  it 'can correctly inject if http or https links exist' do
-
-    val = "a https://cnn.com?bob=1, http://stuff.com.au?bill=1 b abc.net/xyz=1"
-    result = SearchIndexer.inject_extra_terms(val)
-
-    expected = "a https://cnn.com?bob=1, cnn com bob=1 http://stuff.com.au?bill=1 stuff com au bill=1 b abc.net/xyz=1 net xyz=1"
-
-    expect(result).to eq(expected)
-  end

   it 'correctly indexes chinese' do
     SiteSetting.default_locale = 'zh_CN'
     data = "你好世界"

@@ -151,7 +141,28 @@ def scrub(html, strip_diacritics: false)
       topic = post.topic
 
       expect(post.post_search_data.raw_data).to eq(
-        "#{topic.title} #{topic.category.name} https://meta.discourse.org/some.png meta discourse org some png"
+        "#{topic.title} #{topic.category.name} https://meta.discourse.org/some.png"
       )
     end

+    it 'should tokenize host of a URL and remove query string' do
+      category = Fabricate(:category, name: 'awesome category')
+      topic = Fabricate(:topic, category: category, title: 'this is a test topic')
+
+      post = Fabricate(:post, topic: topic, raw: <<~RAW)
+        a https://cnn.com?bob=1, http://stuff.com.au?bill=1 b abc.net/xyz=1
+      RAW
+
+      post.rebake!
+      post.reload
+      topic = post.topic
+
+      expect(post.post_search_data.raw_data).to eq(
+        "#{topic.title} #{category.name} a https://cnn.com , http://stuff.com.au b http://abc.net/xyz=1 abc.net/xyz=1"
+      )
+
+      expect(post.post_search_data.search_data).to eq(
+        "'/xyz=1':14,17 'abc':13,16 'abc.net':13,16 'abc.net/xyz=1':12,15 'au':10 'awesom':6B 'b':11 'categori':7B 'cnn':9 'cnn.com':9 'com':9,10 'com.au':10 'net':13,16 'stuff':10 'stuff.com.au':10 'test':4A 'topic':5A"
+      )
+    end
