Ignore BBCode tags in search #17

Closed
crazedpsyc opened this Issue Sep 28, 2012 · 2 comments

@crazedpsyc

Currently, searching the forums returns unparsed BBCode in the results, and the BBCode tags themselves can be matched by a search (e.g. "code" will match every post containing a [code] block).

@jbarrett
DuckDuckGo member

This appears to still be an issue. One approach I was considering is filtering the results in Perl: remove the BBCode from the returned text, then check that we still have a match. I think this would break paging (and possibly other stuff) badly though, since rows would be dropped after the database has already split the results into pages.
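To illustrate the paging concern, here is a minimal Python sketch (not the forum's Perl code). The function name `page_then_filter` and the sample rows are made up for illustration; the regex is the same crude "anything in square brackets" pattern. Paging is applied to the raw matches first, as a LIMIT/OFFSET query would do, and the tag-only matches are dropped afterwards.

```python
import re

# Crude pattern: anything between square brackets is treated as a tag.
TAG_RE = re.compile(r"\[[^\]]*\]")


def page_then_filter(rows, term, page, page_size):
    """Page over the raw matches, then drop rows that only matched inside a tag."""
    term = term.lower()
    # What the database returns today: every row whose raw text contains the term.
    raw = [r for r in rows if term in r.lower()]
    # LIMIT/OFFSET is applied to the raw matches, before any filtering.
    window = raw[page * page_size:(page + 1) * page_size]
    # Post-filter in application code: keep rows that still match without tags.
    return [r for r in window if term in TAG_RE.sub("", r).lower()]


# Hypothetical data: 5 raw matches for "code", only 2 of them real.
ROWS = [
    "[code]a[/code]",
    "[code]b[/code]",
    "some code here",
    "more code",
    "[code]c[/code]",
]
```

With a page size of 2, the first page of this sample comes back empty even though two genuine matches exist further on, and the total page count no longer reflects the real number of results.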

Another option might be to create a Postgres function to filter BBCode from content. Our search currently returns 5 comments containing 'code':

ddgc=# select count(*) from comment where content ilike '%code%';
 count 
-------
     5
(1 row)

If we add a 'strip_bbcode' function to our schema:

ddgc=# create function strip_bbcode(TEXT)
       returns TEXT as $$
       select regexp_replace($1,'\[[^\]]*\]','','g')
       $$ language sql;

To demonstrate what this does:

ddgc=# select strip_bbcode('[code]printf()[/code]');
 strip_bbcode 
--------------
 printf()
(1 row)

We can then:

ddgc=# select count(*) from comment where strip_bbcode(content) ilike '%code%';
 count
-------
     2
(1 row)

So we only get results back where the text itself contains 'code'. Note that the strip_bbcode function is pretty crude as it stands: it strips all text within square brackets, whether or not it is actually a BBCode tag.
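One way to make the stripping less crude would be to remove only recognised tag names. The following is a Python sketch of that idea, not a drop-in replacement for the SQL function. The tag list `BBCODE_TAGS` and the helper `matches` are assumptions for illustration; the forum's actual tag set is not stated in this thread.

```python
import re

# Hypothetical whitelist; the forum's real BBCode tag set is not given here.
BBCODE_TAGS = ("b", "i", "u", "s", "code", "quote", "url", "img", "list", "size", "color")

# Matches [tag], [/tag] and [tag=value] for whitelisted tag names only.
_TAG_RE = re.compile(
    r"\[/?(?:" + "|".join(BBCODE_TAGS) + r")(?:=[^\]]*)?\]",
    re.IGNORECASE,
)


def strip_bbcode(text):
    """Remove recognised BBCode tags, leaving other bracketed text alone."""
    return _TAG_RE.sub("", text)


def matches(content, term):
    """Case-insensitive substring match against the tag-stripped text."""
    return term.lower() in strip_bbcode(content).lower()
```

Under this sketch, bracketed text such as an array index `a[0]` survives, while `[code]` and `[url=...]` tags are removed before matching. The same pattern could probably be carried over to regexp_replace in Postgres, though that has not been tried here.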

There might be a case to be made for creating a search index table which aggregates data in this fashion at regular intervals, so potentially expensive regexes aren't being performed with every search.
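A rough sketch of the index-table idea, using Python with an in-memory SQLite database as a stand-in for Postgres so that it is self-contained. The `comment_search` table, the `id` column, and the function names are all assumptions; only the `comment` table and its `content` column come from the queries above.

```python
import re
import sqlite3

# Same crude pattern as strip_bbcode: drop anything in square brackets.
TAG_RE = re.compile(r"\[[^\]]*\]")


def create_schema(conn):
    """Create the source table and the hypothetical search index table."""
    conn.executescript(
        """
        CREATE TABLE comment (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
        CREATE TABLE comment_search (
            comment_id INTEGER PRIMARY KEY,
            stripped   TEXT NOT NULL
        );
        """
    )


def rebuild_search_index(conn):
    """Recompute the stripped text for every comment; meant to run periodically."""
    rows = conn.execute("SELECT id, content FROM comment").fetchall()
    with conn:  # one transaction: searches see the old index or the new one
        conn.execute("DELETE FROM comment_search")
        conn.executemany(
            "INSERT INTO comment_search (comment_id, stripped) VALUES (?, ?)",
            [(cid, TAG_RE.sub("", content)) for cid, content in rows],
        )


def search(conn, term):
    """Return ids of comments whose stripped text contains the term.

    SQLite's LIKE is case-insensitive for ASCII, roughly like ILIKE here.
    Wildcard characters (% and _) in the term are not escaped in this sketch.
    """
    cur = conn.execute(
        "SELECT comment_id FROM comment_search WHERE stripped LIKE ? ORDER BY comment_id",
        ("%" + term + "%",),
    )
    return [row[0] for row in cur]
```

The regex cost is paid once per rebuild rather than once per search. The trade-off is that new or edited comments only become searchable after the next rebuild.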

@crazedpsyc

Fixed in dezi-search.

@crazedpsyc crazedpsyc closed this Feb 25, 2014