Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEATURE: Adds option to search only in titles #5538

Closed

Conversation

jorgemanrubia
Copy link
Contributor

@jorgemanrubia jorgemanrubia commented Jan 27, 2018

It adds a new search option in:title that will make it search only in titles.

It also adds a new checkbox at the top of the advanced search UI:

screen recording 2018-01-27 at 02 01 pm

See:

Overview of changes

It changes the way posts and topics are indexed to use PostgreSQL full-text search weights. Then, it uses the weight associated with title to filter by title when searching posts.

Before, it was concatenating title + body + category and storing the indexed result. Now, it will store the concatenation of weighted indexed parts.

This is my first experience with PostgreSQL full-text search. I think it is the proper way to do it after checking the docs. The weights system is a little bit bizarre with 4 values (A, B, C, D) but seems to work pretty well.

As a side effect, results will take weight into consideration when sorting. So title > body > category. From 12.3.3. Ranking Search Results:

If no weights are provided, then these defaults are used:
{0.1, 0.2, 0.4, 1.0}

So for posts, weights will be 1.0 for title title, 0.4 for body and 0.2 for category. I think this is ok, but it's definitely something good to check before merging the PR. I tested this locally with a database with 19k posts from a forum I used to run, and the sorting made sense to me.

I will add some comments in the code to explain the changes next.

Reindexing required

in:title will only work after re-indexing all the posts. Not sure how to do it automatically when shipping this:

  • Should it trigger a reindexing job as part of the deploy? Is there support in place for such tasks? I guess we could launch a job from a migration if not
  • Should it just increase the INDEX_VERSION and rely on the indexing job that runs every day?

In development I have used rake search:reindex for testing this.

@discoursebot
Copy link

You've signed the CLA, jorgemanrubia. Thank you! This pull request is ready for review.

# first one will have a priority A; he second one a priority B, etc.
# When only 1 entry is provided, no weights will be used.
def self.update_index(table, id, *raw_entries)
raw_data = Search.prepare_data(raw_entries.join(' '), :index)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method now supports providing multiple entries to index and, when more than 1 entry is provided, it will assign them a weight.

I didn't add support for providing the specific weights and, instead, used the convention of deducing it from the order. So the first entry gets an A, the second a B, and so. I think this works well for what we need and keeps the code simpler.

end.join(" || ")
end
end

def self.update_topics_index(topic_id, title, cooked)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While weights were only needed for posts, I added them for topics too for consistency

@discoursebot
Copy link

This pull request has been mentioned on Discourse Meta. There might be relevant details there:

https://meta.discourse.org/t/search-only-within-topic-titles/27600/10

# don't allow concurrency to mess up saving a post
end

# for user login and name use "simple" lowercase stemmer
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extracted a few methods from SearchIndexer#update_index

updateSearchTermForSpecialInLikes() {
const match = this.filterBlocks(REGEXP_SPECIAL_IN_LIKES_MATCH);
const inFilter = this.get('searchedTerms.special.in.likes');
updateSearchTermForSpecialIn(key, regexp){
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted method to remove some duplicated code for the "searching in" logic

"indexable_fragment_#{index}".to_sym
end

def self.build_ts_vector(stemmer, indexable_entries)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the place where we add weights to entries when more than 1 entry is provided. When only 1 entry is provided (users, categories) it works like it used to.

@jorgemanrubia jorgemanrubia changed the title [FEATURE] Adds option to search only in titles FEATURE: Adds option to search only in titles Jan 27, 2018
@jorgemanrubia
Copy link
Contributor Author

When searching exact terms with "..." it won't work with titles. This is already happening in current version: titles are ignored when performing exact searches.

The reason is that these queries check posts.raw, which does not include the title. I can tweak that and use post_search_data.raw_data which is the concatenation of title+cooked+category name but wasn't really sure if we want to tweak that.

@SamSaffron
Copy link
Member

That 1.freeze I may have added there sure is odd, I would just bump version to 2 and rely on it the indexing job.

Can you review the existing code that attempts to give a bump to title. Perhaps there is a bunch of code that can be removed now? eg: https://github.com/discourse/discourse/blob/master/lib/search.rb#L746

I think that checkbox in the UI has a bit too much prominence I would bump it to the bottom of the advanced area.

Overall the PR looks great, nice feature.

@jorgemanrubia
Copy link
Contributor Author

@SamSaffron I added some commits addressing your comments, thanks!

Regarding the ordering, this commit will use the new weights to bump titles. With the default weights of {0.1, 0.2, 0.4, 1.0} it wasn't bumping titles as in the current implementation, where title matches always appear first. I experimented with a few values and making the relation 1 to 0.1 keeps the same behaviour.

A problem with this change is that, until the database is reindexed, title bumping in results won't work. Not sure if that's ok. If it is not, we can rollback that change and deploy it a few days after getting this shipped to be safe.

@jorgemanrubia
Copy link
Contributor Author

jorgemanrubia commented Jan 28, 2018

I was having a look at Job::ReindexSearch and I am seeing that it has a limit of 10000 for posts and that it only fetches a list of ids. I wonder if that limit is too short for very active communities (such as Meta). I understand it's for performance reasons, but what about getting rid of the limit, adding a select for only loading ids into the records, and then using find_each to process them in batches and prevent memory issues?

Something like:

posts = Post.joins(:topic)
            .select(:id)
            .where(...)

posts.find_each {|post| indexing_stuff}

@SamSaffron
Copy link
Member

OK, I took some time today to re-do this work based on your awesome changes. per: 86d12bd

I opted to hack this in myself cause I was very concerned about maintenance and needed to know exactly how everything works together.

I opted out of UI changes for now for a couple more betas so the index has some time to rebuild.

Overall I was very happy with this change and I think it provides discourse with a significantly better search.

ranking wise I went with title category tags body this seems to be doing the trick.

@SamSaffron SamSaffron closed this Feb 20, 2018
@coding-horror
Copy link
Contributor

yes this was an excellent nudge to improve search on titles which we did get a lot of complaints about .. glad we were actually doing something wrong, that makes it easier to fix.. thank you @jorgemanrubia

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants