Skip to content

fix: overhaul duplicate detection scoring, add address matching, trigger after imports#22

Closed
bashar-qassis wants to merge 45 commits into
mainfrom
fix/duplicate-detection
Closed

fix: overhaul duplicate detection scoring, add address matching, trigger after imports#22
bashar-qassis wants to merge 45 commits into
mainfrom
fix/duplicate-detection

Conversation

@bashar-qassis
Copy link
Copy Markdown
Owner

Summary

  • Fixed broken scoring formula that silently missed duplicates sharing the same email (scored 0.35, below 0.4 threshold) or phone (scored 0.25). Replaced additive scoring with max-signal + bonus approach where each signal independently qualifies: email=0.85, phone=0.75, address=0.60, name=pg_trgm similarity.
  • Added address matching on normalized line1 + postal_code (case-insensitive, trimmed).
  • Fixed email matching to be case-insensitive and filter both sides of the field pair by contact_field_type.
  • Fixed protocol matching to use LIKE 'mailto%' pattern, handling the colon inconsistency between seeded and custom-created field types.
  • Triggered duplicate detection after imports — all three import workers (MonicaApiCrawlWorker, ImportSourceWorker, ImportWorker) now enqueue DuplicateDetectionWorker on successful completion.
  • Added 20 comprehensive tests for the detection worker covering all match types, scoring, edge cases, and account isolation.

Test plan

  • mix compile --warnings-as-errors — clean
  • mix test — 1035 tests, 0 failures
  • mix quality — format, credo, sobelow, dialyzer all pass
  • Manual: trigger Monica import → verify duplicate candidates appear afterward
  • Manual: click "Scan now" → verify contacts sharing email/phone/address are detected

…ger after imports

The duplicate detection worker had several bugs preventing it from catching
obvious duplicates:

- Scoring formula (name*0.4 + email*0.35 + phone*0.25 with threshold 0.4)
  meant contacts sharing the same email but with different names scored 0.35,
  below the threshold — silently missed.
- Email comparison was case-sensitive.
- Only one side of email/phone field pairs had its type verified.
- Address data was completely ignored.
- No import worker triggered duplicate detection after completion.

Fixes:
- Replace additive scoring with max-signal + bonus approach where each signal
  independently qualifies (email=0.85, phone=0.75, address=0.60, name=similarity)
- Add case-insensitive email matching via LOWER() fragments
- Filter both cf1 and cf2 contact_field_types in email/phone queries
- Use LIKE 'mailto%' pattern to handle protocol colon inconsistency
- Add address matching on normalized line1 + postal_code
- Enqueue DuplicateDetectionWorker after successful completion in all three
  import workers (MonicaApiCrawlWorker, ImportSourceWorker, ImportWorker)
- Add comprehensive test suite (20 tests) for the detection worker
@bashar-qassis bashar-qassis force-pushed the fix/duplicate-detection branch from 5850c3f to 38cadb8 Compare April 4, 2026 16:42
list_candidates now takes limit/offset opts (default 20 per page).
The LiveView loads one page at a time with a "Load more" button.
Dismiss removes the candidate from the current list without reloading.
The /contacts/duplicates route uses ContactLive.Index, not the standalone
Duplicates LiveView. Added limit/offset pagination with Load more button
and optimistic dismiss (no full re-query) to match the standalone page.
Photos with the same content_hash on both contacts caused a unique
constraint violation during merge. Now deletes duplicate photos from
the non-survivor before transferring the rest, matching the pattern
used for contact_tags and activity_contacts.

Also collapsed the merge flow from 4 steps to 3 by combining the
preview and confirm steps into a single "Review & merge" step.
From the duplicates page (contact preselected), merge is now 2 clicks
instead of 3.
Replace the custom hand-rolled Oban dashboard (412 LOC LiveView) with
the official `oban_web` package, now open-source and free as of Oban
2.20. Mount at /admin/oban behind a new :require_admin on_mount hook
gated by Kith.Policy.can?(user, :manage, :oban). Hide the "Jobs" nav
link from non-admin users.

Drops the dead photo_sync queue from Oban config — its worker
(PhotoBatchSyncWorker) was removed in commit e474853 when Monica
imports moved to API crawling.
Make every published host port configurable through env-var
substitution so contributors can resolve local port conflicts without
editing compose files. Defaults preserve current behavior with no
.env present.

Dev (docker-compose.dev.yml): MAILPIT_SMTP_PORT, MAILPIT_WEB_PORT,
APP_PORT join the existing DB_PORT and MINIO_*_PORT vars.

Prod (docker-compose.prod.yml): HTTP_PORT, HTTPS_PORT for Caddy.

Internal container ports remain hardcoded so services can address
each other on standard ports over the Docker network.
Photo import previously ran inline as Phase 4 of MonicaApi.crawl/5, which
buried it inside the contact-crawl job and left the import_history "Photo
Sync" UI panel stuck on "in progress" forever — sync_summary was never
written because the refactor that deleted PhotoBatchSyncWorker (commit
e474853) removed the only writers.

Move the photo crawl into its own MonicaPhotoSyncWorker on the photo_sync
queue, enqueued by MonicaApiCrawlWorker when api_options["photos"] is true.
The worker passes credentials through job args (matching the
MonicaDocumentImportWorker pattern) so the main worker can wipe the API key
from the DB immediately after contact import completes.

Drop the unauthenticated link fallback from the decoder — Monica's
/api/photos endpoint always returns dataUrl, so the previous Req.get(link)
path was likely 401'ing on protected storage URLs. If a photo lacks
dataUrl, it's now surfaced as a failed entry in sync_summary with a
no_data_url reason, instead of being silently dropped.

The worker writes sync_summary after each page so the UI shows live
progress — total/synced/failed/not_found counts plus a per-photo table —
and logs each page boundary at :info under the [MonicaPhotoSync] prefix.
Per-photo decisions log at :debug.

Tests cover: dataUrl import + avatar set, not_found on missing contact,
failed on missing dataUrl, dedup by content_hash, and mid-flight
sync_summary writes between pages.
The :photo_sync queue was removed in commit e474853 along with
PhotoBatchSyncWorker, but the Oban queues config in config.exs was
updated to drop it. Jobs queued to :photo_sync would sit forever with
no consumers. Switch to :imports to match MonicaApiCrawlWorker and
MonicaDocumentImportWorker.
- CLAUDE.md: remove photo_sync + api_supplement from the Oban queues list
  (those queues were removed in commit e474853); update queue count from 9
  to 7 to match config/config.exs.
- MonicaApiCrawlWorker moduledoc: clarify that photo import runs as a
  separate MonicaPhotoSyncWorker job, not inline.
- Delete docs/superpowers/specs/2026-03-21-extensible-import-system-design.md
  and docs/superpowers/plans/2026-03-22-extensible-import-system.md. Both
  describe the pre-refactor design (PhotoSyncWorker, ApiSupplementWorker,
  file-based Monica import, per-photo job model) that no longer exists.
  CLAUDE.md plus the live moduledocs are now the source of truth.
Spec captures the photo-sync-after-reset bug, its root cause (orphaned
import_records pointing at deleted contacts), and the decomposition of
AccountResetWorker into a thin orchestrator plus one Cleanup module per
data domain. Covers module layout, order-of-operations, account-scoped
Oban job cancellation, error handling, and the test plan including a
mandatory cross-account isolation check on every Cleanup module.
…efinement

Plan decomposes the work into 13 TDD tasks, each producing one cleanup module
(or the worker refactor) per commit. Spec refinement: split the proposed
TagsAndActivitiesCleanup into Contacts.Cleanup (handles tags as part of the
contacts axis) and Activities.Cleanup (its own context), aligning with the
SOLID-elixir SRP guidance flagged in the brainstorming pass.
reminder_rules is account-scoped (not reminder-scoped) and has no FK
relationship to reminders, so it's not CASCADE-cleared. Rules are 3
seeded-per-account pre-notification defaults treated as reference data.
…n explosion, E.164 normalization

Three independent bugs combined to make 1000 Monica imports surface as
~6000 entries in the duplicates tab even when "Auto-merge definite
duplicates" was checked:

Bug C (primary cause of "auto-merge did nothing"):
  MonicaApiCrawlWorker.build_opts/1 only forwarded "extra_notes" — every
  other wizard option, including "auto_merge_duplicates", was silently
  dropped before reaching MonicaApi.crawl/5. Auto-merge was structurally
  unreachable from the UI; only unit tests calling crawl/5 directly with
  their own opts had ever exercised it. Now forwards api_options
  unchanged, preserving the legacy extra_notes default-on semantic.

Bug A (primary cause of "6000 entries"):
  DuplicateDetectionWorker.find_phone_matches/1 joined contact_fields on
  the digit-normalized value but pre-filtered the *raw* value, so any
  phone whose digits-stripped form was empty ("+", "()", "-", "N/A")
  passed the filter, normalized to "", and matched every other zero-digit
  phone — C(N,2) false candidates. The email side had a smaller analog:
  no TRIM, no cf2 filter. Now strict equality on canonical values for
  phone, TRIM(LOWER(...)) plus per-side non-empty filters for email.

Bug B (auto-merge predicate too narrow):
  Within a name group the predicate compared values raw — Monica's
  CardDAV-sync duplicates (monicahq/monica#6175) escape it on trivial
  whitespace/casing artifacts. Name key now trims + collapses whitespace;
  predicate accepts shared email OR phone OR address with normalized
  comparators; addresses preloaded.

Bug D (heuristic phone storage):
  PhoneFormatter.normalize/1 was heuristic — "10 digits stays as-is",
  "11 digits starting with 1 becomes +1...", "00" IDD prefix unhandled —
  so the same number written two ways was stored as two different
  values. Replaced with ex_phone_number (libphonenumber port) producing
  E.164. normalize/2 takes a default_region for bare numbers; bare input
  without a region round-trips unchanged. PhoneRenormalizeWorker
  backfills existing rows once.

UX:
  auto_merge_duplicates wizard default flipped to true. New region
  picker pre-populated from account.locale, listing every
  libphonenumber-supported region with localized country names via
  ex_cldr_territories ∩ ExPhoneNumber.Metadata.get_supported_regions/0.

Detection worker phone match simplified to plain equality now that
values are canonical at write-time, removing the per-row regex and the
filter mismatch that caused the cartesian explosion.

Tests:
  - Boundary test in monica_api_crawl_worker_test that round-trips the
    wizard flag through build_opts (would have caught Bug C directly).
  - Cartesian-explosion regression in duplicate_detection_worker_test.
  - Email TRIM and phone-normalization-on-import in respective files.
  - phone_formatter_test rewritten for explicit-region semantics.
  - New phone_renormalize_worker_test (5 tests).
  - Production libphonenumber metadata in :test (was test-only metadata
    that diverged from real validation on "555" NANP prefixes).

Dialyzer:
  Added .dialyzer_ignore.exs suppressing two :contract_supertype
  warnings against Kith.Cldr.Territory — these are emitted from
  generated code in ex_cldr_territories, not actionable from our side.

1122 tests pass, mix quality clean.
…iner clustering

DNSCluster connects via `Node.connect(:"basename@<ip>")` — it uses the
raw IP as the host part of the node name. That requires each peer's
actual node name to be `name@<ip>`, which conflicts with Phoenix
release's default of `name@<hostname>`.

The user's Portainer deployment exposes containers under stable service
names (`app`, `worker`) that resolve via Docker DNS — but the BEAM
nodes are named after the container ID (`kith@64c98536e88c`), so
`Node.connect(:"kith@app")` fails the handshake.

Switch to libcluster's Epmd strategy which connects by explicit node
name (no IP rewriting). Each container is configured via env to:
  - `RELEASE_NODE=kith@app` (or `kith@worker`)
  - `KITH_CLUSTER_HOSTS=kith@app,kith@worker`
  - `RELEASE_COOKIE=<shared>`
  - `RELEASE_DISTRIBUTION=name`

libcluster runs Cluster.Strategy.Epmd which polls `Node.connect/1`
for each host periodically; once one direction connects, the
bidirectional distribution is established and PubSub spans both.

Dev and test are unaffected: `KITH_CLUSTER_HOSTS` is unset, so the
libcluster topology list is empty and Cluster.Supervisor no-ops.
@bashar-qassis
Copy link
Copy Markdown
Owner Author

Superseded by #23. Branch grew beyond original duplicate-detection scope to cover account reset completeness, Monica import overhaul, worker clustering, and infra changes. New PR has accurate title + full description across all five themes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant