Hack on the token_stream_to_hash to account for searching for URL
strings.  Also notes in the default.base on enabling URL searching from
the Sphinx side.

Signed-off-by: Snax Fauna <evan+fauna@cloudbur.st>
commit 053ab3a8e8f4b5b4a1a3dc2cb6ef2d5599268582 1 parent 518c247
monde authored, Snax Fauna committed
Showing with 14 additions and 2 deletions.
  1. +5 −0 examples/default.base
  2. +9 −2 lib/ultrasphinx/search/parser.rb
5 examples/default.base
@@ -79,6 +79,11 @@ index
# Enable these if you need wildcard searching. They will slow down indexing significantly.
# min_infix_len = 1
# enable_star = 1
+
+ # # URL search options
+ # # add " @, /, :," before " a-z," in the charset_table and uncomment prefix_fields
+ # to search URL and email addresses
+ # prefix_fields = url, domain
charset_type = utf-8 # or sbcs (Single Byte Character Set)
charset_table = 0..9, A..Z->a..z, -, _, ., &, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F,U+C5->U+E5, U+E5, U+C4->U+E4, U+E4, U+D6->U+F6, U+F6, U+16B, U+0c1->a, U+0c4->a, U+0c9->e, U+0cd->i, U+0d3->o, U+0d4->o, U+0da->u, U+0dd->y, U+0e1->a, U+0e4->a, U+0e9->e, U+0ed->i, U+0f3->o, U+0f4->o, U+0fa->u, U+0fd->y, U+104->U+105, U+105, U+106->U+107, U+10c->c, U+10d->c, U+10e->d, U+10f->d, U+116->U+117, U+117, U+118->U+119, U+11a->e, U+11b->e, U+12E->U+12F, U+12F, U+139->l, U+13a->l, U+13d->l, U+13e->l, U+141->U+142, U+142, U+143->U+144, U+144,U+147->n, U+148->n, U+154->r, U+155->r, U+158->r, U+159->r, U+15A->U+15B, U+15B, U+160->s, U+160->U+161, U+161->s, U+164->t, U+165->t, U+16A->U+16B, U+16B, U+16e->u, U+16f->u, U+172->U+173, U+173, U+179->U+17A, U+17A, U+17B->U+17C, U+17C, U+17d->z, U+17e->z,
11 lib/ultrasphinx/search/parser.rb
@@ -131,7 +131,14 @@ def token_stream_to_hash(token_stream)
# Remove some spaces
content.gsub!(/^"\s+|\s+"$/, '"')
# Convert fields into sphinx style, reformat the stream object
- if content =~ /(.*?):(.*)/
+ if content =~ /(^(http|https):\/\/[a-z0-9]+([-.]{1}[a-z0-9]*)+. [a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix
+ # XXX hack, it's somewhat common to search for URLs. be sure to add
+ # " @, /," in the charset_table of the US config to search on all
+ # URLs and email addresses, and add:
+ # prefix_fields = url, domain
+ # to your US config
+ token_hash[nil] += [[operator, content]]
+ elsif content =~ /(.*?):(.*)/
token_hash[$1] += [[operator, $2]]
else
token_hash[nil] += [[operator, content]]
@@ -143,4 +150,4 @@ def token_stream_to_hash(token_stream)
end
end
-end
+end
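
Note on the parser change: the order of the branches matters, because a URL also satisfies the older field pattern /(.*?):(.*)/ and would otherwise be split into the field "http" and the value "//example.com". The following is a minimal sketch of that branch order, not the ultrasphinx code itself. The classify helper and the URL_PATTERN constant are hypothetical names introduced for illustration; only the two regular expressions are copied from the diff above.

```ruby
# Regex copied verbatim from the commit. Under the /x flag the literal space
# is ignored, and the unescaped "." before the TLD part matches any character.
URL_PATTERN = /(^(http|https):\/\/[a-z0-9]+([-.]{1}[a-z0-9]*)+. [a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix

# Hypothetical helper (not part of ultrasphinx): returns the [field, text]
# pair that the patched branch order would file one token's content under.
def classify(content)
  if content =~ URL_PATTERN
    [nil, content]   # the whole URL stays an unfielded search term
  elsif content =~ /(.*?):(.*)/
    [$1, $2]         # ordinary field:value query
  else
    [nil, content]   # plain term
  end
end

p classify("http://example.com/")
p classify("title:ruby")
p classify("plain")
```

Because the "." is not escaped, the pattern is looser than it looks: a hyphenated host with no dot, such as http://foo-bar, also takes the URL branch. Escaping it as "\." would tighten the match, if that is the intent.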