Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search discussion titles too #257

Closed
tobyzerner opened this issue Aug 27, 2015 · 14 comments
Closed

Search discussion titles too #257

tobyzerner opened this issue Aug 27, 2015 · 14 comments

Comments

@tobyzerner
Copy link
Contributor

Currently fulltext search is only performed on post text, not discussion titles. This needs to be fixed.

What needs to be done:

  • Work out a strategy to do this. Do we run two queries, one on the posts table and one on the discussions? (How would we rank results then?) Or do we duplicate the discussion title in every post record and perform a joint full text search on the title/content fields?
@amdad
Copy link

amdad commented Sep 16, 2015

IMO Search should me more sensitive. Why It doesn't handle 3 signs long phrases? Like "php" or "403" for example? Even worse "php 5", and still not searching...

@tobyzerner
Copy link
Contributor Author

Limitations of MySQL's FULLTEXT search engine. We're planning to make it possible to use more powerful drivers like ElasticSearch/Sphinx instead. See #509.

@amdad
Copy link

amdad commented Sep 16, 2015

ElasticSearch sounds good. Current solution - without comment. :)

@tobyzerner
Copy link
Contributor Author

@franzliedke Any thoughts on this? How do you include titles in the search in FluxBB?

@kirkbushell
Copy link
Contributor

If we want to use elasticsearch, the search implementation will need to be abstracted, which we have done at work. I can paste in some examples.

@franzliedke
Copy link
Contributor

We built an index upon post creation / editing that stored all words along with their appearances in all posts. Easy to adapt for topic titles...

Search then simply looked for all search terms in the index, found all posts in which they appeared and assembled the resulting list.

@tobyzerner
Copy link
Contributor Author

@kirkbushell We already have an abstraction, I'm not sure if it'll be sufficient...

@franzliedke That's very interesting, I've never thought about doing that with MySQL. How does it scale though? What about posts with tens of thousands of words? And forums with millions of posts. Isn't it essentially implementing a less efficient version of MySQL's FULLTEXT engine?

@franzliedke
Copy link
Contributor

It does work on the fairly large forums we have (1M+ posts), I think. I believe the approach was originally copied from PhpBB.

The bigger problem is the very limited functionality. Word-based stuff works well for Western languages, but even then matches need to be complete (i.e. "shoot" will not match "shot"). I'll experiment with that stuff at some point, I took a whole class on the topic, that should be worth something ;)

@tobyzerner
Copy link
Contributor Author

Relevant: http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql

Thinking it might be best to go with MySQL ISAM FULLTEXT, mirroring the discussion title on the posts table. Then recommend Sphinx/ElasticSearch for improved results.

@sijad
Copy link
Contributor

sijad commented Jan 9, 2016

i think index table is a better idea, I'm not sure if you want to search tags or not, but this way you can do that too, big community softwares use this method, with a good indexing it should be ok I guess

@sijad
Copy link
Contributor

sijad commented Feb 7, 2016

please take a look at this:

            'titlewords' => array(
                'postid' => 'INT UNSIGNED NOT NULL',
                'wordid' => 'INT UNSIGNED NOT NULL',
                'KEY postid (postid)',
                'KEY wordid (wordid)',
                'CONSTRAINT ^titlewords_ibfk_1 FOREIGN KEY (postid) REFERENCES ^posts(postid) ON DELETE CASCADE',
                'CONSTRAINT ^titlewords_ibfk_2 FOREIGN KEY (wordid) REFERENCES ^words(wordid)',
            ),
            'contentwords' => array(
                'postid' => 'INT UNSIGNED NOT NULL',
                'wordid' => 'INT UNSIGNED NOT NULL',
                'count' => 'TINYINT UNSIGNED NOT NULL', // how many times word appears in the post - anything over 255 can be ignored
                'type' => "ENUM('Q', 'A', 'C', 'NOTE') NOT NULL", // the post's type (copied here for quick searching)
                'questionid' => 'INT UNSIGNED NOT NULL', // the id of the post's antecedent parent (here for quick searching)
                'KEY postid (postid)',
                'KEY wordid (wordid)',
                'CONSTRAINT ^contentwords_ibfk_1 FOREIGN KEY (postid) REFERENCES ^posts(postid) ON DELETE CASCADE',
                'CONSTRAINT ^contentwords_ibfk_2 FOREIGN KEY (wordid) REFERENCES ^words(wordid)',
            ),
            'tagwords' => array(
                'postid' => 'INT UNSIGNED NOT NULL',
                'wordid' => 'INT UNSIGNED NOT NULL',
                'KEY postid (postid)',
                'KEY wordid (wordid)',
                'CONSTRAINT ^tagwords_ibfk_1 FOREIGN KEY (postid) REFERENCES ^posts(postid) ON DELETE CASCADE',
                'CONSTRAINT ^tagwords_ibfk_2 FOREIGN KEY (wordid) REFERENCES ^words(wordid)',
            ),

3 different table for searching in titles, contents and tags, you can see the usage here

@franzliedke franzliedke modified the milestone: 0.1.0 Apr 7, 2016
@tobyzerner tobyzerner removed this from the 0.1.0 milestone Jul 22, 2017
@BartVB
Copy link
Sponsor

BartVB commented Aug 7, 2017

Using a DB table with word_id/post_id won't scale to millions of posts (we tried) and it's very limited in functionality/flexibility. Besides; there are much better solutions out there nowadays which include weighing search terms, filtering stopwords, stemming, fuzzy search, distributed search, etc. The SQL route is fine if you don't want to use external dependencies and if you can't use MySQL FULLTEXT, but for other use cases I would strongly recommend against it.

I would propose using MySQL FULLTEXT for small to medium sites and using something like Sphinx or Elasticsearch on big/demanding sites.

But to do this properly the abstraction in:
https://github.com/flarum/core/blob/master/src/Core/Search/Discussion/Fulltext/DriverInterface.php

Is too limited.

https://github.com/flarum/core/blob/master/src/Core/Search/AbstractSearch.php

would be more suitable. But #1042 needs to be fixed first.

It would be nice if the search interface would at least offer:

  • String to match
  • Limit by tags
  • Limit by users
  • Sort order (relevance/time)
  • Limit nr of results/pagination

Nice to have:

  • Only search first post
  • Search by IP
  • Only search title/content

@amdad
Copy link

amdad commented Aug 7, 2017

What I want to say is this: https://github.com/teamtnt/tntsearch

@tobyzerner
Copy link
Contributor Author

This was implemented in #1339

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants