Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRM-20565 - Improve ajax dedupe lookups on contact add form #10341

Merged
merged 3 commits into from Jun 5, 2018

Conversation

Projects
None yet
5 participants
@colemanw
Copy link
Member

commented May 12, 2017

@colemanw colemanw force-pushed the colemanw:CRM-20565 branch from 528a74b to e50e50f May 12, 2017

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented May 12, 2017

I'm going to try to take a look at this next week

@colemanw

This comment has been minimized.

Copy link
Member Author

commented May 12, 2017

Great, thanks @eileenmcnaughton .

$this->addRadio('contact_ajax_check_similar', ts('Check for Similar Contacts'), array(
'1' => ts('While Typing'),
'0' => ts('When Saving'),
'2' => ts('Never'),

This comment has been minimized.

Copy link
@colemanw

colemanw May 13, 2017

Author Member

Note: In all existing installs this preference is already set to either '1' or '0'. These new options keep the original meaning of 1 and 0 so no upgrade script is needed.

@colemanw

This comment has been minimized.

Copy link
Member Author

commented Jun 3, 2017

Thoughts on this one @eileenmcnaughton ?

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2017

Sorry - I still have this on my radar - I'm expecting to do significant work on dedupes going into next quarter - with extracting to an extension & making all interaction with the dedupeFinder & merger via api being high priorities

@colemanw colemanw force-pushed the colemanw:CRM-20565 branch from e50e50f to d506e9c Oct 15, 2017

@colemanw

This comment has been minimized.

Copy link
Member Author

commented Oct 15, 2017

@eileenmcnaughton I've rebased this. Can you take a look?

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Oct 19, 2017

I've started looking at this & in general I agree with the change. I'm still working through it so these are not final comments.

  1. I note with the api call we need to pass 'check_permissions' to the contact.get call - perhaps it will filter out some info, or not, but it's the right thing to do.
  2. I'm not a fan of how that function does different things depending on return. I'm kinda inclined to deprecate it rather than hack it. Then we would wind up with 2 functions eventually

Contact.getmatches
Contact.getduplicates

The former is this function - matches for the data we have. The second would be pairs of duplicates to potentially merge (like the merge screen uses but less nasty). (The second obviously does not exist as yet.).

  1. I want to look a bit more closely at a tidy up of the admin form for this.
  2. From a user point of view we have the situation where if I have the default settings I no longer get presented with duplicates until I have entered first name, last name AND email. This is a sort of regression from getting them as soon as I've entered last name.

In general the defaults for Supervised & Unsupervised seem to be the reverse of what they should be. ie. Unsupervised is happy to given anyone the bash if there is a matching email whereas Supervised is conservative about putting up options. I'm not quite sure how to deal with this but as a user I feel like I would want matches as soon as I've entered first name in & then the top matches should get increasingly specific as I provide more data.

@lcdservices I think you understand the thinking behind Supervised vs Unsupervised - which I'm grappling with.

@colemanw

This comment has been minimized.

Copy link
Member Author

commented Oct 19, 2017

  1. You're right. I've added it.

  2. Actually it returns the same format. If you leave off return then it sends back:

    [
    12 => ['id' => 12],
    34 => ['id' => 34],
    ]

and if you add e.g. return => ['display_name', 'email']` then it sends back:

[
  12 => ['id' => 12, 'display_name' => 'Joe Tester', 'email' => 'joe@test.er'],
  34 => ['id' => 34, 'display_name' => 'Jim Tester', 'email' => 'jim@test.er'],
]
  1. Ok. Note my inline comment in the PR about that.
  2. Well the code in this PR doesn't do that, it starts looking for matches as soon as you enter something in any field. But the supervised dedupe rule might get in your way depending on how you've set it up. A more lax rule will indeed give more results. What would be really nice would be to sort those matches by weight (e.g. closer matches first) but our BAO doesn't currently do that I don't think :(
@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Oct 19, 2017

Regarding 4 - I'm just working off what people have as the default out-of-the-box settings. I realise people can customise it but most don't. The experience would be that they are used to having it resolve when they enter last name & now it won't - so it would feel like a bug/regression to a lot of people.

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Oct 19, 2017

(your reply to 2 makes sense & I agree that is fine)

@colemanw

This comment has been minimized.

Copy link
Member Author

commented Oct 20, 2017

I just tested and you're right about number 4. It's not very useful that it only works when all 3 fields are an exact match. I honestly don't know how that qualifies as a "supervised" rule as anyone with exactly the same first&last name and email is unquestionably the same person.

That said, what we had in place prior to this PR wasn't exactly gold-standard either. It matched on last name and nothing else, so Jan Smith would match Bob Smith, etc. - also not very useful results, and not at all configurable as it is now.

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Oct 20, 2017

My instinct is that the Supervised & Unsupervised rules are messed up & were implemented in reverse of how they were conceived - the reason I was hoping to get input from @lcdservices

I feel like this WOULD feel like a regression to some users as is & I feel like we should consider adding a third rule type that exists by default called 'InputAssistance' or something & making it match on last name or email being sufficient.

Overall I feel like this might not be universally better & we should probably solicit wider input. Possibly from the dev list although it's more a UI question than a dev question but our only other list seems to be 'partners' which is definitely not right and please not a double post....

From a code point of view I'm happy with the code apart from wanting to dig a bit further into the admin form side of things and agree it is an improvement. I just realised I haven't tested that we still do awful ghastly things to people when they choose ' on save ' for the setting

@lcdservices

This comment has been minimized.

Copy link
Contributor

commented Oct 20, 2017

@eileenmcnaughton sorry -- missed the earlier ping on this. I'm happy to give my 2 cents on the rules... ;-)

  1. I've never liked that the unsupervised rule was so loose by default (email only). I argued for stricter default rules many versions ago but Dave G felt that since email is the only field required on contrib forms and event reg forms, and the inclusion of profiles is optional, we shouldn't create a default dedupe rule that may be invalidated by the default configuration of contrib pages/events. While I appreciate that argument, I would argue we should opt for a more strict rule regardless, as the unsupervised rule should always err on the side of being more restrictive (to prevent false positives). Further -- since that discussion several years ago we've added a validation rule so that if the user selects a dedupe rule where the fields are not present in the form via the profile, we throw a warning that the configuration is not capable of creating a match.
  2. In principle, I agree with @eileenmcnaughton that the two rules are better switched. Unsupervised should match on more values, Supervised on less.
  3. I also agree that the supervised rule would benefit from being more nuanced -- which I gather is the impetus for this PR. personally I think a match on first name + last name is a sufficient default rule.
  4. the two default "optimized" rules would benefit from some review. those were created in conjunction with the creation of the dedupe hook, (v4.2?). we had that hook added for NYSS and proposed the optimized rules at that time (that's when the discussion about defaults took place). but I feel like they might benefit from a code review. (I realize that's not the purview of this issue)
@lcdservices

This comment has been minimized.

Copy link
Contributor

commented Oct 20, 2017

One more comment after reading the JIra issue --
I agree that the lookup should be done via ajax or postprocess validate, but not both -- and I'd vote for the AJAX method. in my thinking, the ajax method should derive from the supervised rule, and should only be triggered when all fields in that rule actually have data in them. so if the rule is defined as first + last name, it should not be triggered after the first name is added, but only after both first and last have been added.

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Oct 20, 2017

So to attempt to pull things together - I think both @lcdservices & I would be happy to see only TWO options

  1. ajax checks
  2. no check

I think you'd struggle to find anyone who likes the on-save validation - but we could ask.

  1. We don't think the default rules are right (which would affect usablity of this) but the question is - how do you deal with that. The best thing I can think of is to add a new type that is 'fit-for-purpose' by default. I feel like last name OR email match would be a good default for a data-entry rule (although there might be performance issues with that default on larger sites).
@colemanw

This comment has been minimized.

Copy link
Member Author

commented Oct 23, 2017

I agree about a made-for-deduping rule. I actually custom coded one of those for Woolman years ago. It did all kinds of fuzzy OR logic, taking phone numbers, nick names, email, etc into consideration. It wasn't very performant but it was awesome at nailing hard-to-find dupes
https://github.com/woolman/drupal-modules/blob/master/woolman_website/woolman_dupe_query.inc#L54.

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Nov 16, 2017

I had another go at this and the usability seemed ok if I created a new for-purpose rule & used that instead of the 'Supervised' rule (in practice I did this by editing the Supervised rule to have a threshold of 5 & deleting CRM/Dedupe/BAO/QueryBuilder/IndividualSupervised.php). Using the existing built in rule didn't give me any feedback until all 3 fields matched which was a worse user experience than without the patch.

With my 'new custom rule' the performance was 'ok' even on the wmf database. By that I mean that on many names it resolved in real time but even on 'John Smith' it didn't cause queries to spin off and bring the database down. I did get a very strange UI experience on the slow ones as nothing showed up & then suddenly I had 3 matching boxes of matches.

However, the problem with the code as is is that the names are sorted by Sort name. Since there are more than 20 Eileen's in our database the list of Eileen's displayed once I entered first_name did not include me. Once I entered my email it fired again - but presented the list of names in the same order so I was still only to see the first 20 possible matches which include me.

I tried to make it sort by weight so the best matches would float to the top - the code is my efforts, but while I managed to get the api to sort by weight the UI still did not.

Doing data entry the whole thing would work better if the email address field were the first one you hit because then it could try to match on that first & also I think it's realistically more important for dataentry than fields like Job Title & Contact Type - tangental...

diff --git a/CRM/Dedupe/BAO/RuleGroup.php b/CRM/Dedupe/BAO/RuleGroup.php
index 7557af3..f374863 100644
--- a/CRM/Dedupe/BAO/RuleGroup.php
+++ b/CRM/Dedupe/BAO/RuleGroup.php
@@ -211,8 +211,7 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
     $exclWeightSum = array();
 
     // create temp table
-    $dao = new CRM_Core_DAO();
-    $dao->query($tempTableQuery);
+    $dao = CRM_Core_DAO::executeQuery($tempTableQuery);
 
     CRM_Utils_Hook::dupeQuery($this, 'table', $tableQueries);
 
Discard this hunk from worktree [y,n,q,a,d,/,j,J,g,e,?]? q

wmf1411:civicrm emcnaughton$ git diff
diff --git a/CRM/Dedupe/BAO/RuleGroup.php b/CRM/Dedupe/BAO/RuleGroup.php
index 7557af3..f374863 100644
--- a/CRM/Dedupe/BAO/RuleGroup.php
+++ b/CRM/Dedupe/BAO/RuleGroup.php
@@ -211,8 +211,7 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
     $exclWeightSum = array();
 
     // create temp table
-    $dao = new CRM_Core_DAO();
-    $dao->query($tempTableQuery);
+    $dao = CRM_Core_DAO::executeQuery($tempTableQuery);
 
     CRM_Utils_Hook::dupeQuery($this, 'table', $tableQueries);
 
@@ -384,7 +383,8 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
       $query = "SELECT dedupe.id1 as id
                 FROM dedupe JOIN civicrm_contact ON dedupe.id1 = civicrm_contact.id {$this->_aclFrom}
                 WHERE contact_type = '{$this->contact_type}' {$this->_aclWhere}
-                AND weight >= {$this->threshold}";
+                AND weight >= {$this->threshold}
+                 ORDER BY weight DESC ";
     }
     else {
       $this->_aclWhere = ' AND c1.is_deleted = 0 AND c2.is_deleted = 0';
@@ -399,7 +399,8 @@ class CRM_Dedupe_BAO_RuleGroup extends CRM_Dedupe_DAO_RuleGroup {
                        LEFT JOIN civicrm_dedupe_exception exc ON dedupe.id1 = exc.contact_id1 AND dedupe.id2 = exc.contact_id2
                 WHERE c1.contact_type = '{$this->contact_type}' AND
                       c2.contact_type = '{$this->contact_type}' {$this->_aclWhere}
-                      AND weight >= {$this->threshold} AND exc.contact_id1 IS NULL";
+                      AND weight >= {$this->threshold} AND exc.contact_id1 IS NULL
+                      ORDER BY weight DESC ";
     }
 
     CRM_Utils_Hook::dupeQuery($this, 'threshold', $query);
diff --git a/CRM/Dedupe/Finder.php b/CRM/Dedupe/Finder.php
index fd5d1c9..108fec4 100644
--- a/CRM/Dedupe/Finder.php
+++ b/CRM/Dedupe/Finder.php
@@ -148,8 +148,8 @@ class CRM_Dedupe_Finder {
 
     $rgBao->params = $params;
     $rgBao->fillTable();
-    $dao = new CRM_Core_DAO();
-    $dao->query($rgBao->thresholdQuery($params['check_permission']));
+    $dao = CRM_Core_DAO::executeQuery($rgBao->thresholdQuery($params['check_permission']));
+
     $dupes = array();
     while ($dao->fetch()) {
       if (isset($dao->id) && $dao->id) {
diff --git a/api/v3/Contact.php b/api/v3/Contact.php
index 38a1172..86846e0 100644
--- a/api/v3/Contact.php
+++ b/api/v3/Contact.php
@@ -1371,13 +1371,15 @@ function civicrm_api3_contact_duplicatecheck($params) {
   );
   $values = array();
   if ($dupes && !empty($params['return'])) {
-    return civicrm_api3('Contact', 'get', array(
+    $result = civicrm_api3('Contact', 'get', array(
       'return' => $params['return'],
       'id' => array('IN' => $dupes),
       'options' => CRM_Utils_Array::value('options', $params),
-      'sequential' => CRM_Utils_Array::value('sequential', $params),
       'check_permissions' => CRM_Utils_Array::value('check_permissions', $params),
     ));
+    
+    $result['values'] = array_replace(array_flip($dupes), $result['values']);
+    return $result;
   }
   foreach ($dupes as $dupe) {
     $values[$dupe] = array('id' => $dupe);
diff --git a/templates/CRM/Contact/Form/Contact.tpl b/templates/CRM/Contact/Form/Contact.tpl
index 384fe48..cdc4b9a 100644
--- a/templates/CRM/Contact/Form/Contact.tpl
+++ b/templates/CRM/Contact/Form/Contact.tpl
@@ -300,7 +300,6 @@
       CRM.api3('contact', 'duplicatecheck', {
         match: match,
         rule_type: 'Supervised',
-        options: {sort: 'sort_name'},
         return: ['display_name', 'email']
       }).done(function(data) {
         var title = data.count == 1 ? {/literal}"{ts escape='js'}Similar Contact Found{/ts}" : "{ts escape='js'}Similar Contacts Found{/ts}"{literal},
@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented May 10, 2018

@colemanw what should we do with this? I think it seems like to get it mergeable we would need to

  1. create a new inbuilt rule ('date entry'?) that mimicked existing behaviour - people could edit from there.
  2. figure out how to order by weight per the last comment

I think it's a nice improvement - but it also seems not quite mergeable at the moment & I don't think either of us are quite committed enough to get it over the line at the moment? If that is the case we should close the PR & track it from gitlab (we won't lose access to the changes - although if you ALSO delete the branch & we haven't linked into the commit specifically we will struggle to find it again

@colemanw

This comment has been minimized.

Copy link
Member Author

commented May 15, 2018

The trouble with adding a new rule is that we bump against the limitations of the rule categories. Currently we have "Supervised" "Unsupervised" and "General". We could create a new category but that's yet another can-o-worms to open.

IMO we ought to create a new hard-coded rule called "Name or Email or Phone or Address" and make it the default Supervised rule. The current default Supervised rule isn't very good. For existing sites, if they have the current default "Name and Email" default Supervised rule, we switch it. If they've configured a custom Supervised rule, we leave it alone.

That could be done in a separate PR. Just to think out loud, I think the ideal fuzzy rule would do something like

IF (
  (first_name && last_name match) OR
  (nick_name && last_name match) OR
  (email matches) OR
  (phone matches) OR
  (street_address matches)
)

@eileenmcnaughton you know more about optimizing queries on huge databases than I do. Is the above too crazy for large databases to handle?

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented May 15, 2018

OK - so to answer that I think I need to feel more comfortable with how the existing Supervised rule is used - ie. audit the places. I realise my discomfort with the idea is about the places it is used. TODO audit this.

My inclination is that adding a DataEntry category is safer because it involves changing behaviour in less places. Where it differs logically from the 'normal' concept of 'Supervised' is that we would want it to be more aggressive about suggestions - ie. as you type 'Coleman' suggestions for 'Coleman' would appear (for a large DB they might tweak that but on a small DB that's a good rule).

Regarding whether it should be a hardcoded rule - I can do some tests - I'm not sure that the hard-coded rules are more efficient from my previous tests in this instance.

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented May 29, 2018

@colemanw I had an idea here.

On one side of this patch we have a significant UI improvement on the other we have a change in behaviour to the selection of which contacts are duplicates that makes me uncomfortable. Could we just get the first part commited - ie. write a (deprecated) api or ajax call that does the same look up as right now & use that & then review (if we choose) switching the mechanism?

@colemanw

This comment has been minimized.

Copy link
Member Author

commented May 30, 2018

@eileenmcnaughton well the current implementation of the ajax lookup leaves a lot to be desired. It matches on last_name and nothing else.
I suppose now that the api3 supports the OR operator I could fire off a Contact.get api call to match on first name or last name or email, without going through the dedupe engine; at least that would be an improvement.

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented May 30, 2018

@colemanw Yep & we could get that merged & then consider improving the rules.

colemanw added some commits May 12, 2017

CRM-20565 - Use contact api for now
Per discussion on #10341
Since the default dedupe rules are inadequate for this, we'll just
use the contact api for now.

@colemanw colemanw force-pushed the colemanw:CRM-20565 branch from 6128f69 to 01ee39a May 31, 2018

@colemanw

This comment has been minimized.

Copy link
Member Author

commented May 31, 2018

@eileenmcnaughton done. I also added some throttling to prevent multiple messages popping up if the user fills out the form quickly.

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Jun 5, 2018

test this please

@eileenmcnaughton

This comment has been minimized.

Copy link
Contributor

commented Jun 5, 2018

I just tested this & it feels better from a UI POV & I didn't spot any issues - I think it's good to merge

@monishdeb

This comment has been minimized.

Copy link
Member

commented Jun 5, 2018

Merging as per tag.

@monishdeb monishdeb merged commit c47a8ed into civicrm:master Jun 5, 2018

1 check passed

default Build finished.
Details
@monishdeb

This comment has been minimized.

Copy link
Member

commented Jun 5, 2018

@colemanw @eileenmcnaughton please close the JIRA ticket. Thanks!

@eileenmcnaughton eileenmcnaughton deleted the colemanw:CRM-20565 branch Jun 5, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.