Fuzzy Word Matching #56
Comments
Yes, that's something that we had considered in the past, but never implemented as we did not want to kill the underlying database with too many queries. Can you please list a few of the Ruby Gems you had found? |
From my limited understanding, I can only imagine a DB hit when the front end gets an updated string to compare (I use the "as you type" method in my setup, which could actually skip key presses like One that keeps popping up in SO questions is Another interesting looking one is From the lib weight point of view, it looks like |
What about full-text search with elasticsearch-rails? |
Also consider thinking-sphinx, which may fit the current feature list (filter by project_id, issue_id, status) better than Elasticsearch. |
👍 |
I just pulled the changes referenced above into my fork of the repo, but I still don't get any fuzzy searching.. When I migrate plugins, I do get a warning to say "Sphinx cannot be found on your system". Tried installing the gem manually ( @korin, was that thumbs-up meant to imply that @swiatkiewicz's changes work? Or have you not had a chance to test them yet? |
It works; you need to install Sphinx. See the ThinkingSphinx quickstart guide; Sphinx is already available in most distros. To be honest, @swiatkiewicz's changes allow replacing the SQL LIKE search with the Sphinx indexer, which is the faster option. That is only one step away from a fuzzy matching feature. |
Thanks for the link. Unfortunately I need to run my system in a Windows environment, so it looks like I've got some reading to do before I get that part working. The SQL |
I can't get this feature to work :( |
Show us your plugin settings: /settings/plugin/redmine_didyoumean |
It's in Polish, but you know all the settings. |
Should work, I have similar settings. Any errors in redmine log file? or in browser development console? |
You can also try rebuild thinking sphinx index with rake ts:rebuild. |
No errors so far noticed :( I'll try out this with new tickets, |
There is only a small difference concerning configuration; see http://pat.github.io/thinking-sphinx/rake_tasks.html |
Ok, so basically ts:rebuild is same as stop+index+start, great :) |
Have you tried testing it in the Rails console? |
Trying now: 2.0.0-p594 :001 > Issue.search 'somethig' CustomField Load (0.6ms) SELECT `custom_fields`.* FROM `custom_fields` WHERE `custom_fields`.`type` = 'IssueCustomField' AND `custom_fields`.`searchable` = 1 Role Load (0.3ms) SELECT `roles`.* FROM `roles` WHERE `roles`.`builtin` = 2 LIMIT 1 GroupAnonymous Load (0.6ms) SELECT `users`.* FROM `users` WHERE `users`.`type` IN ('GroupAnonymous') ORDER BY id LIMIT 1 Member Load (0.4ms) SELECT `members`.* FROM `members` INNER JOIN `projects` ON `projects`.`id` = `members`.`project_id` WHERE (projects.status <> 9) AND (members.user_id = 2 OR (projects.is_public = 1 AND members.user_id = 49)) (15.0ms) SELECT COUNT(DISTINCT `issues`.`id`) FROM `issues` LEFT OUTER JOIN `projects` ON `projects`.`id` = `issues`.`project_id` LEFT OUTER JOIN `journals` ON `journals`.`journalized_id` = `issues`.`id` AND (journals.private_notes = 0 OR (1=0)) AND `journals`.`journalized_type` = 'Issue' WHERE (((projects.status <> 9 AND projects.id IN (SELECT em.project_id FROM enabled_modules em WHERE em.name='issue_tracking')) AND ((projects.is_public = 1 AND ((issues.is_private = 0)))))) AND (((LOWER(subject) LIKE '%somethig%') OR (LOWER(issues.description) LIKE '%somethig%') OR (LOWER(journals.notes) LIKE '%somethig%') OR issues.id IN (SELECT cfs.customized_id FROM custom_values cfs WHERE cfs.customized_type='Issue' AND cfs.customized_id=issues.id AND LOWER(cfs.value) LIKE '%somethig%' AND cfs.custom_field_id IN (2,4) AND ((1=1) AND (issues.tracker_id IN (SELECT tracker_id FROM custom_fields_trackers WHERE custom_field_id = cfs.custom_field_id)) AND (EXISTS (SELECT 1 FROM custom_fields ifa WHERE ifa.is_for_all = 1 AND ifa.id = cfs.custom_field_id) OR issues.project_id IN (SELECT project_id FROM custom_fields_projects WHERE custom_field_id = cfs.custom_field_id)))))) SQL (23.8ms) SELECT `issues`.`id` AS t0_r0, `issues`.`tracker_id` AS t0_r1, `issues`.`project_id` AS t0_r2, `issues`.`subject` AS t0_r3, `issues`.`description` AS t0_r4, 
`issues`.`due_date` AS t0_r5, `issues`.`category_id` AS t0_r6, `issues`.`status_id` AS t0_r7, `issues`.`assigned_to_id` AS t0_r8, `issues`.`priority_id` AS t0_r9, `issues`.`fixed_version_id` AS t0_r10, `issues`.`author_id` AS t0_r11, `issues`.`lock_version` AS t0_r12, `issues`.`created_on` AS t0_r13, `issues`.`updated_on` AS t0_r14, `issues`.`start_date` AS t0_r15, `issues`.`done_ratio` AS t0_r16, `issues`.`estimated_hours` AS t0_r17, `issues`.`parent_id` AS t0_r18, `issues`.`root_id` AS t0_r19, `issues`.`lft` AS t0_r20, `issues`.`rgt` AS t0_r21, `issues`.`is_private` AS t0_r22, `issues`.`ir_position` AS t0_r23, `issues`.`closed_on` AS t0_r24, `issues`.`sprint_id` AS t0_r25, `issues`.`position` AS t0_r26, `projects`.`id` AS t1_r0, `projects`.`name` AS t1_r1, `projects`.`description` AS t1_r2, `projects`.`homepage` AS t1_r3, `projects`.`is_public` AS t1_r4, `projects`.`parent_id` AS t1_r5, `projects`.`created_on` AS t1_r6, `projects`.`updated_on` AS t1_r7, `projects`.`identifier` AS t1_r8, `projects`.`status` AS t1_r9, `projects`.`lft` AS t1_r10, `projects`.`rgt` AS t1_r11, `projects`.`inherit_members` AS t1_r12, `projects`.`default_assignee_id` AS t1_r13, `projects`.`product_backlog_id` AS t1_r14, `journals`.`id` AS t2_r0, `journals`.`journalized_id` AS t2_r1, `journals`.`journalized_type` AS t2_r2, `journals`.`user_id` AS t2_r3, `journals`.`notes` AS t2_r4, `journals`.`created_on` AS t2_r5, `journals`.`private_notes` AS t2_r6 FROM `issues` LEFT OUTER JOIN `projects` ON `projects`.`id` = `issues`.`project_id` LEFT OUTER JOIN `journals` ON `journals`.`journalized_id` = `issues`.`id` AND (journals.private_notes = 0 OR (1=0)) AND `journals`.`journalized_type` = 'Issue' WHERE (((projects.status <> 9 AND projects.id IN (SELECT em.project_id FROM enabled_modules em WHERE em.name='issue_tracking')) AND ((projects.is_public = 1 AND ((issues.is_private = 0)))))) AND (((LOWER(subject) LIKE '%somethig%') OR (LOWER(issues.description) LIKE '%somethig%') OR 
(LOWER(journals.notes) LIKE '%somethig%') OR issues.id IN (SELECT cfs.customized_id FROM custom_values cfs WHERE cfs.customized_type='Issue' AND cfs.customized_id=issues.id AND LOWER(cfs.value) LIKE '%somethig%' AND cfs.custom_field_id IN (2,4) AND ((1=1) AND (issues.tracker_id IN (SELECT tracker_id FROM custom_fields_trackers WHERE custom_field_id = cfs.custom_field_id)) AND (EXISTS (SELECT 1 FROM custom_fields ifa WHERE ifa.is_for_all = 1 AND ifa.id = cfs.custom_field_id) OR issues.project_id IN (SELECT project_id FROM custom_fields_projects WHERE custom_field_id = cfs.custom_field_id)))))) ORDER BY issues.id ASC => [[], 0] It seems to be SQL, isn't it? |
Check Setting.plugin_redmine_didyoumean['search_method'] in the Rails console: 0 = SQL, 1 = TS (Thinking Sphinx). |
I'm trying in the console: 2.0.0-p594 :006 > Issue.sphinx_search 'test' Issue Load (0.8ms) SELECT `issues`.* FROM `issues` WHERE `issues`.`id` IN (231, 1, 51, 52, 53, 114, 150, 153, 167, 173, 232, 235, 244, 284, 381, 523, 687, 717, 747, 912) (20 results), and the same for the word 'testy' gives me only one result. |
That seems to be OK:
    2.0.0-p594 :009 > Setting.plugin_redmine_didyoumean['search_method']
    => "1"
And that's:
    def search_class
      case Setting.plugin_redmine_didyoumean['search_method']
      when "0"
        SqlSearch
      when "1"
        ThinkingSphinxSearch
      else
        raise 'There is no search method selected!'
      end
    end
So it's Sphinx. I tried modifying searching_by_thinking_sphinx.rb and that had an effect, so it's using it for sure. The question is what is wrong with Sphinx that makes the results wrong. |
@dominch My case was: 'test', 'tester', 'testowy'. Before these steps I got only 1 result where there should have been 3; now, after these steps, I get the correct result (3). Check your application log for something like : |
How can I turn on debug mode? script/rails server webrick -e production -d -p 3000 plus: http://localhost:3000/searchissues?project_id=1&issue_id=&query=testowy gives me: Processing by SearchIssuesController#index as HTML Parameters: {"project_id"=>"1", "issue_id"=>"", "query"=>"testowy"} Current user: dominik.chmaj (id=3) Completed 200 OK in 485.4ms (Views: 1.0ms | ActiveRecord: 15.7ms) previous setting for :enable_star was 1, changed that to true but still no effect :( |
@dominch There is another problem with ThinkingSphinx: if you add a new issue or edit an existing one, you have to run ts:index to update the indexes. I'm trying to implement real-time indexing, but it doesn't work as I expected and it may take a while. |
Ok, debug logs are working and I have:
    Sphinx Query (0.8ms) SELECT * FROM `issue_core` WHERE MATCH('*testowej*') AND `project_id` IN (1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 17, 18, 19, 20, 23, 24, 25, 26) AND `sphinx_deleted` = 0 AND `sphinx_internal_id` NOT IN (0) LIMIT 0, 5
    Sphinx Found 3 results
    Issue Load (0.6ms) SELECT `issues`.* FROM `issues` WHERE `issues`.`id` IN (336, 717, 962)
So that's proof: Sphinx is working, but somehow it does not return many results. Only exact words. My development.sphinx.conf looks like:
    indexer
    {
    }
    searchd
    {
      listen = 127.0.0.1:9306:mysql41
      log = /var/data/redmine/log/development.searchd.log
      query_log = /var/data/redmine/log/development.searchd.query.log
      pid_file = /var/data/redmine/log/development.sphinx.pid
      workers = threads
      binlog_path = /var/data/redmine/tmp/binlog/development
    }
    source issue_core_0
    {
      type = mysql
      sql_host = localhost
      sql_user = redmine
      sql_pass = ***
      sql_db = redmine
      sql_query_pre = SET TIME_ZONE = '+0:00'
      sql_query_pre = SET NAMES utf8
      sql_query = SELECT SQL_NO_CACHE `issues`.`id` * 2 + 0 AS `id`, `issues`.`subject` AS `subject`, `issues`.`id` AS `sphinx_internal_id`, 'Issue' AS `sphinx_internal_class`, 0 AS `sphinx_deleted`, `issues`.`id` AS `id`, `issues`.`status_id` AS `status_id`, `issues`.`project_id` AS `project_id` FROM `issues` WHERE (`issues`.`id` BETWEEN $start AND $end) GROUP BY `issues`.`id`, `issues`.`subject`, `issues`.`id`, `issues`.`id`, `issues`.`status_id`, `issues`.`project_id` ORDER BY NULL
      sql_query_range = SELECT IFNULL(MIN(`issues`.`id`), 1), IFNULL(MAX(`issues`.`id`), 1) FROM `issues`
      sql_attr_uint = sphinx_internal_id
      sql_attr_uint = sphinx_deleted
      sql_attr_uint = id
      sql_attr_uint = status_id
      sql_attr_uint = project_id
      sql_attr_string = sphinx_internal_class
      sql_field_string = subject
      sql_query_info = SELECT `issues`.* FROM `issues` WHERE (`issues`.`id` = ($id - 0) / 2)
    }
    index issue_core
    {
      type = plain
      path = /var/data/redmine/db/sphinx/development/issue_core
      docinfo = extern
      charset_type = utf-8
      min_infix_len = 2
      enable_star = 1
      source = issue_core_0
    }
    index issue
    {
      type = distributed
      local = issue_core
    }
Everything seems to work except for getting the right results :) This has to be something to do with tokenization etc. I assume that the language is not that important, because it should look for any word in any language with the described rules. Is that correct? |
@dominch This seems to be ok, right? |
The description says: "faulty" also returned a match for "fault", "saving" matched "saved" or "save". From the plugin settings: - Thinking Sphinx - first searches words 1:1, then subtracts the last character and searches again ('Running' will look for 'Runner', 'Running', etc.). Subtracts down to the min word length, which is defined below. So in your case, for the word "tester" it should find everything containing the word "test" (2 letters subtracted). Right now, for your example, it's looking for test, and I have the same thing - above You can find: "FROM Try "tester" - it should find # 101497 and of course # 101512 I expect that this word should trigger searches for "tester" + "teste" + "test" + "tes", assign weights, etc. That should give many more results. |
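The truncation strategy quoted above from the plugin settings can be sketched in a few lines of Ruby. This is only an illustration of the idea, not the plugin's actual code: the method name `truncation_candidates` and its `min_length` parameter are hypothetical, and weighting of the results is left out.

```ruby
# Hypothetical helper, not part of redmine_didyoumean: lists the terms that a
# "drop the last character and search again" strategy would try, in order.
def truncation_candidates(word, min_length)
  # Words at or below the minimum length are searched as-is.
  return [word] if word.length <= min_length

  # Walk from the full length down to min_length, taking a prefix each time.
  word.length.downto(min_length).map { |len| word[0, len] }
end

p truncation_candidates("tester", 3)
# → ["tester", "teste", "test", "tes"]
```

Each candidate would then be passed to the search backend in turn, with earlier (longer) candidates presumably ranked higher.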
It seems that sphinxsearch does not stem words by default (it only tokenizes them). To make it work, install it with the libstemmer library. links: |
I just changed my config and now it's working.
    production:
      morphology: stem_en
      mem_limit: 128M
      wordforms: "/var/data/redmine/config/sphinx/wordforms.txt"
      stopwords: "/var/data/redmine/config/sphinx/stopwords.txt"
This added morphology and a stemmer to my generated files. Now it's working great! :) Thank you for the help; I wasn't sure whether I needed anything more in my configuration. |
@dominch Hello, I am very interested in what you found. Could you please describe in a little more detail how you achieved that?
I would like to use a language other than English or Russian, and after reading the documentation it appears I have to take more steps to get the morphology search working. |
I found the beginning of an answer:
I ran The morphology still does not work :( |
Yes, it's inside the Redmine config directory. Edit this file (thinking_sphinx.yml); after ts:rebuild you should notice the change in the production.sphinx.conf file (in the same dir), which is generated when the command is executed. I found both files on the internet and placed them inside redmine_dir/config/sphinx (and reflected that path in the config above). That is enough for my needs: wordforms change my complex words to basic ones, like "thinking > think" in the example above. Best of luck! :) |
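To make the "thinking > think" idea concrete, here is a minimal Ruby sketch of what a wordforms mapping does to a query. It only mimics the concept: Sphinx applies wordforms itself at index and query time, and the helpers `parse_wordforms` and `normalize_query` are hypothetical names invented for this illustration.

```ruby
# Hypothetical helper: parses wordforms-style lines ("source > target")
# into a Hash. Blank or malformed lines are skipped.
def parse_wordforms(text)
  text.each_line.with_object({}) do |line, map|
    source, target = line.strip.split(/\s*>\s*/, 2)
    next if source.nil? || source.empty? || target.nil? || target.empty?

    map[source] ||= target # keep the first mapping seen for each source word
  end
end

# Hypothetical helper: replaces every token that has a wordform with its base.
def normalize_query(query, wordforms)
  query.downcase.split.map { |token| wordforms.fetch(token, token) }.join(" ")
end

forms = parse_wordforms("thinking > think\nsaved > save\nsaving > save\n")
puts normalize_query("Thinking about saving", forms)
# → think about save
```

Because both the indexed text and the query are mapped to the same base form, "saving" and "saved" end up matching each other.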
Thanks for your answer. My
So I suppose, sphinx is aware that I would like morphology but what I want to achieve first is if I search for |
What bothers me is after using the After reading the documentation I would assume that I would find Am I wrong? |
No, it should be "something > somethingElse" |
Ok. So my doubts were well founded :) I just found that my dict file was corrupted, which certainly led to a bad wordforms.txt |
My dict file is now clean, I regenerated my wordforms.txt. It is a clean Thanks for your help. |
I found a solution. My problem was : Your .dic file must have a list of My solution : Now I have a valid |
When ran My I ran this But when I run I don't know how to clean former indexes as I tried |
Duplicates are not only word1 > word1; word1 > word2 and word1 > word3 are duplicates too. You need to have only one word to convert to. In other words, a unique index is built and the engine needs to know exactly what should be replaced. Try to grep your wordforms for any example from the warning ("word1 > ") :) |
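As an alternative to grepping by hand, a short Ruby script can list every source word that appears more than once in a wordforms file. This is a sketch under the assumption that each line has the form "source > target"; the method name `conflicting_wordforms` is hypothetical.

```ruby
# Hypothetical helper: returns the source words that occur on more than one
# line, together with every target they map to.
def conflicting_wordforms(text)
  targets = Hash.new { |hash, key| hash[key] = [] }

  text.each_line do |line|
    source, target = line.strip.split(/\s*>\s*/, 2)
    next if source.nil? || source.empty? || target.nil? || target.empty?

    targets[source] << target
  end

  # A source word listed more than once is a duplicate, whether the targets
  # are identical or different.
  targets.select { |_source, list| list.size > 1 }
end

sample = "word1 > word2\nword1 > word3\nother > base\n"
p conflicting_wordforms(sample)
# → {"word1"=>["word2", "word3"]}
```

For a real file, the text could be read with `File.read` using the path configured under `wordforms` in thinking_sphinx.yml.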
Mmm I think I found my real problem : the accents ! I don't have Probably a UTF-8 problem. |
Then check the encoding of your database and its tables; e.g. MySQL uses latin1_swedish by default. I needed to change both the db and all tables to utf8, but this happened some time ago. New tables were not being created in utf8, and now they are. I'm not sure whether that information is stored in the db on index, but some documents are. |
They are all UTF8. I think it is more a Sphinx problem. I found several posts about UTF8 problems. Still searching :) |
Ok ... This is strange. I was totally focused on my warnings but in fact, the morphology seems to work. When I search for Now, my problem is in the example of the did_you_mean plugin, there is the I have an issue Maybe I misunderstood the example. |
That example is based on en morphology which cuts all words to min length and then tes+t = tes+ts |
As I added in
Because I have a Edit : ah ok. So I suppose my morphology is half operating :) |
Could DYM be extended to use the idea of "approximate string matching" (Wikipedia reference)? Not everybody phrases their sentences the same way.. I honestly believe this plugin would be a million times more useful if words like "faulty" also returned a match for "fault", "saving" matched "saved" or "save", etc etc.
Unfortunately I'm not familiar enough with Ruby to do this myself, but a quick Google search shows many Ruby Gems that provide this type of functionality. So I'm guessing (read: hoping) this isn't an unreasonable request. :)
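For readers unfamiliar with approximate string matching, one common building block is the Levenshtein edit distance. The plain-Ruby sketch below is only an illustration of that idea: it is not how the plugin or Sphinx works, and the names `levenshtein` and `fuzzy_match?` as well as the threshold of 2 are assumptions made for this example.

```ruby
# Levenshtein distance: the minimum number of single-character insertions,
# deletions and substitutions needed to turn string a into string b.
def levenshtein(a, b)
  # previous[j] holds the distance between the processed prefix of a
  # and the first j characters of b.
  previous = (0..b.length).to_a

  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion from a
        current[j - 1] + 1,    # insertion into a
        previous[j - 1] + cost # substitution (or match)
      ].min
    end
    previous = current
  end

  previous.last
end

# Hypothetical helper: treats two words as "similar" within max_distance edits.
def fuzzy_match?(a, b, max_distance = 2)
  levenshtein(a.downcase, b.downcase) <= max_distance
end

p levenshtein("faulty", "fault") # → 1
p fuzzy_match?("Faulty", "fault") # → true
```

Computing this distance against every issue subject would not scale well, which is why the thread moves toward an indexed search engine with stemming instead.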