New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search discussion titles too #257

Closed
tobscure opened this Issue Aug 27, 2015 · 14 comments

Comments

Projects
None yet
6 participants
@tobscure
Member

tobscure commented Aug 27, 2015

Currently fulltext search is only performed on post text, not discussion titles. This needs to be fixed.

What needs to be done:

  • Work out a strategy to do this. Do we run two queries, one on the posts table and one on the discussions? (How would we rank results then?) Or do we duplicate the discussion title in every post record and perform a joint full text search on the title/content fields?

@tobscure tobscure referenced this issue Aug 28, 2015

Closed

v0.1.0 roadmap (old) #74

19 of 53 tasks complete

@tobscure tobscure added the Bug label Aug 28, 2015

@justjavac justjavac referenced this issue Sep 7, 2015

Open

Flarum v0.1.0 开发路线图 #3

18 of 53 tasks complete

@tobscure tobscure added the Backend label Sep 16, 2015

@amdad

This comment has been minimized.

Show comment
Hide comment
@amdad

amdad Sep 16, 2015

IMO Search should me more sensitive. Why It doesn't handle 3 signs long phrases? Like "php" or "403" for example? Even worse "php 5", and still not searching...

amdad commented Sep 16, 2015

IMO Search should me more sensitive. Why It doesn't handle 3 signs long phrases? Like "php" or "403" for example? Even worse "php 5", and still not searching...

@tobscure

This comment has been minimized.

Show comment
Hide comment
@tobscure

tobscure Sep 16, 2015

Member

Limitations of MySQL's FULLTEXT search engine. We're planning to make it possible to use more powerful drivers like ElasticSearch/Sphinx instead. See #509.

Member

tobscure commented Sep 16, 2015

Limitations of MySQL's FULLTEXT search engine. We're planning to make it possible to use more powerful drivers like ElasticSearch/Sphinx instead. See #509.

@amdad

This comment has been minimized.

Show comment
Hide comment
@amdad

amdad Sep 16, 2015

ElasticSearch sounds good. Current solution - without comment. :)

amdad commented Sep 16, 2015

ElasticSearch sounds good. Current solution - without comment. :)

@tobscure

This comment has been minimized.

Show comment
Hide comment
@tobscure

tobscure Oct 15, 2015

Member

@franzliedke Any thoughts on this? How do you include titles in the search in FluxBB?

Member

tobscure commented Oct 15, 2015

@franzliedke Any thoughts on this? How do you include titles in the search in FluxBB?

@kirkbushell

This comment has been minimized.

Show comment
Hide comment
@kirkbushell

kirkbushell Oct 15, 2015

Contributor

If we want to use elasticsearch, the search implementation will need to be abstracted, which we have done at work. I can paste in some examples.

Contributor

kirkbushell commented Oct 15, 2015

If we want to use elasticsearch, the search implementation will need to be abstracted, which we have done at work. I can paste in some examples.

@franzliedke

This comment has been minimized.

Show comment
Hide comment
@franzliedke

franzliedke Oct 15, 2015

Member

We built an index upon post creation / editing that stored all words along with their appearances in all posts. Easy to adapt for topic titles...

Search then simply looked for all search terms in the index, found all posts in which they appeared and assembled the resulting list.

Member

franzliedke commented Oct 15, 2015

We built an index upon post creation / editing that stored all words along with their appearances in all posts. Easy to adapt for topic titles...

Search then simply looked for all search terms in the index, found all posts in which they appeared and assembled the resulting list.

@tobscure

This comment has been minimized.

Show comment
Hide comment
@tobscure

tobscure Oct 16, 2015

Member

@kirkbushell We already have an abstraction, I'm not sure if it'll be sufficient...

@franzliedke That's very interesting, I've never thought about doing that with MySQL. How does it scale though? What about posts with tens of thousands of words? And forums with millions of posts. Isn't it essentially implementing a less efficient version of MySQL's FULLTEXT engine?

Member

tobscure commented Oct 16, 2015

@kirkbushell We already have an abstraction, I'm not sure if it'll be sufficient...

@franzliedke That's very interesting, I've never thought about doing that with MySQL. How does it scale though? What about posts with tens of thousands of words? And forums with millions of posts. Isn't it essentially implementing a less efficient version of MySQL's FULLTEXT engine?

@franzliedke

This comment has been minimized.

Show comment
Hide comment
@franzliedke

franzliedke Oct 16, 2015

Member

It does work on the fairly large forums we have (1M+ posts), I think. I believe the approach was originally copied from PhpBB.

The bigger problem is the very limited functionality. Word-based stuff works well for Western languages, but even then matches need to be complete (i.e. "shoot" will not match "shot"). I'll experiment with that stuff at some point, I took a whole class on the topic, that should be worth something ;)

Member

franzliedke commented Oct 16, 2015

It does work on the fairly large forums we have (1M+ posts), I think. I believe the approach was originally copied from PhpBB.

The bigger problem is the very limited functionality. Word-based stuff works well for Western languages, but even then matches need to be complete (i.e. "shoot" will not match "shot"). I'll experiment with that stuff at some point, I took a whole class on the topic, that should be worth something ;)

@tobscure

This comment has been minimized.

Show comment
Hide comment
@tobscure

tobscure Oct 19, 2015

Member

Relevant: http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql

Thinking it might be best to go with MySQL ISAM FULLTEXT, mirroring the discussion title on the posts table. Then recommend Sphinx/ElasticSearch for improved results.

Member

tobscure commented Oct 19, 2015

Relevant: http://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql

Thinking it might be best to go with MySQL ISAM FULLTEXT, mirroring the discussion title on the posts table. Then recommend Sphinx/ElasticSearch for improved results.

@sijad

This comment has been minimized.

Show comment
Hide comment
@sijad

sijad Jan 9, 2016

Contributor

i think index table is a better idea, I'm not sure if you want to search tags or not, but this way you can do that too, big community softwares use this method, with a good indexing it should be ok I guess

Contributor

sijad commented Jan 9, 2016

i think index table is a better idea, I'm not sure if you want to search tags or not, but this way you can do that too, big community softwares use this method, with a good indexing it should be ok I guess

@sijad

This comment has been minimized.

Show comment
Hide comment
@sijad

sijad Feb 7, 2016

Contributor

please take a look at this:

            'titlewords' => array(
                'postid' => 'INT UNSIGNED NOT NULL',
                'wordid' => 'INT UNSIGNED NOT NULL',
                'KEY postid (postid)',
                'KEY wordid (wordid)',
                'CONSTRAINT ^titlewords_ibfk_1 FOREIGN KEY (postid) REFERENCES ^posts(postid) ON DELETE CASCADE',
                'CONSTRAINT ^titlewords_ibfk_2 FOREIGN KEY (wordid) REFERENCES ^words(wordid)',
            ),
            'contentwords' => array(
                'postid' => 'INT UNSIGNED NOT NULL',
                'wordid' => 'INT UNSIGNED NOT NULL',
                'count' => 'TINYINT UNSIGNED NOT NULL', // how many times word appears in the post - anything over 255 can be ignored
                'type' => "ENUM('Q', 'A', 'C', 'NOTE') NOT NULL", // the post's type (copied here for quick searching)
                'questionid' => 'INT UNSIGNED NOT NULL', // the id of the post's antecedent parent (here for quick searching)
                'KEY postid (postid)',
                'KEY wordid (wordid)',
                'CONSTRAINT ^contentwords_ibfk_1 FOREIGN KEY (postid) REFERENCES ^posts(postid) ON DELETE CASCADE',
                'CONSTRAINT ^contentwords_ibfk_2 FOREIGN KEY (wordid) REFERENCES ^words(wordid)',
            ),
            'tagwords' => array(
                'postid' => 'INT UNSIGNED NOT NULL',
                'wordid' => 'INT UNSIGNED NOT NULL',
                'KEY postid (postid)',
                'KEY wordid (wordid)',
                'CONSTRAINT ^tagwords_ibfk_1 FOREIGN KEY (postid) REFERENCES ^posts(postid) ON DELETE CASCADE',
                'CONSTRAINT ^tagwords_ibfk_2 FOREIGN KEY (wordid) REFERENCES ^words(wordid)',
            ),

3 different table for searching in titles, contents and tags, you can see the usage here

Contributor

sijad commented Feb 7, 2016

please take a look at this:

            'titlewords' => array(
                'postid' => 'INT UNSIGNED NOT NULL',
                'wordid' => 'INT UNSIGNED NOT NULL',
                'KEY postid (postid)',
                'KEY wordid (wordid)',
                'CONSTRAINT ^titlewords_ibfk_1 FOREIGN KEY (postid) REFERENCES ^posts(postid) ON DELETE CASCADE',
                'CONSTRAINT ^titlewords_ibfk_2 FOREIGN KEY (wordid) REFERENCES ^words(wordid)',
            ),
            'contentwords' => array(
                'postid' => 'INT UNSIGNED NOT NULL',
                'wordid' => 'INT UNSIGNED NOT NULL',
                'count' => 'TINYINT UNSIGNED NOT NULL', // how many times word appears in the post - anything over 255 can be ignored
                'type' => "ENUM('Q', 'A', 'C', 'NOTE') NOT NULL", // the post's type (copied here for quick searching)
                'questionid' => 'INT UNSIGNED NOT NULL', // the id of the post's antecedent parent (here for quick searching)
                'KEY postid (postid)',
                'KEY wordid (wordid)',
                'CONSTRAINT ^contentwords_ibfk_1 FOREIGN KEY (postid) REFERENCES ^posts(postid) ON DELETE CASCADE',
                'CONSTRAINT ^contentwords_ibfk_2 FOREIGN KEY (wordid) REFERENCES ^words(wordid)',
            ),
            'tagwords' => array(
                'postid' => 'INT UNSIGNED NOT NULL',
                'wordid' => 'INT UNSIGNED NOT NULL',
                'KEY postid (postid)',
                'KEY wordid (wordid)',
                'CONSTRAINT ^tagwords_ibfk_1 FOREIGN KEY (postid) REFERENCES ^posts(postid) ON DELETE CASCADE',
                'CONSTRAINT ^tagwords_ibfk_2 FOREIGN KEY (wordid) REFERENCES ^words(wordid)',
            ),

3 different table for searching in titles, contents and tags, you can see the usage here

@franzliedke franzliedke modified the milestone: 0.1.0 Apr 7, 2016

@tobscure tobscure removed this from the 0.1.0 milestone Jul 22, 2017

@BartVB

This comment has been minimized.

Show comment
Hide comment
@BartVB

BartVB Aug 7, 2017

Using a DB table with word_id/post_id won't scale to millions of posts (we tried) and it's very limited in functionality/flexibility. Besides; there are much better solutions out there nowadays which include weighing search terms, filtering stopwords, stemming, fuzzy search, distributed search, etc. The SQL route is fine if you don't want to use external dependencies and if you can't use MySQL FULLTEXT, but for other use cases I would strongly recommend against it.

I would propose using MySQL FULLTEXT for small to medium sites and using something like Sphinx or Elasticsearch on big/demanding sites.

But to do this properly the abstraction in:
https://github.com/flarum/core/blob/master/src/Core/Search/Discussion/Fulltext/DriverInterface.php

Is too limited.

https://github.com/flarum/core/blob/master/src/Core/Search/AbstractSearch.php

would be more suitable. But #1042 needs to be fixed first.

It would be nice if the search interface would at least offer:

  • String to match
  • Limit by tags
  • Limit by users
  • Sort order (relevance/time)
  • Limit nr of results/pagination

Nice to have:

  • Only search first post
  • Search by IP
  • Only search title/content

BartVB commented Aug 7, 2017

Using a DB table with word_id/post_id won't scale to millions of posts (we tried) and it's very limited in functionality/flexibility. Besides; there are much better solutions out there nowadays which include weighing search terms, filtering stopwords, stemming, fuzzy search, distributed search, etc. The SQL route is fine if you don't want to use external dependencies and if you can't use MySQL FULLTEXT, but for other use cases I would strongly recommend against it.

I would propose using MySQL FULLTEXT for small to medium sites and using something like Sphinx or Elasticsearch on big/demanding sites.

But to do this properly the abstraction in:
https://github.com/flarum/core/blob/master/src/Core/Search/Discussion/Fulltext/DriverInterface.php

Is too limited.

https://github.com/flarum/core/blob/master/src/Core/Search/AbstractSearch.php

would be more suitable. But #1042 needs to be fixed first.

It would be nice if the search interface would at least offer:

  • String to match
  • Limit by tags
  • Limit by users
  • Sort order (relevance/time)
  • Limit nr of results/pagination

Nice to have:

  • Only search first post
  • Search by IP
  • Only search title/content
@amdad

This comment has been minimized.

Show comment
Hide comment

amdad commented Aug 7, 2017

What I want to say is this: https://github.com/teamtnt/tntsearch

@tobscure

This comment has been minimized.

Show comment
Hide comment
@tobscure

tobscure Jun 16, 2018

Member

This was implemented in #1339

Member

tobscure commented Jun 16, 2018

This was implemented in #1339

@tobscure tobscure closed this Jun 16, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment