fix: overhaul duplicate detection scoring, add address matching, trigger after imports#22
Closed
bashar-qassis wants to merge 45 commits into
Closed
fix: overhaul duplicate detection scoring, add address matching, trigger after imports#22bashar-qassis wants to merge 45 commits into
bashar-qassis wants to merge 45 commits into
Conversation
…ger after imports The duplicate detection worker had several bugs preventing it from catching obvious duplicates: - Scoring formula (name*0.4 + email*0.35 + phone*0.25 with threshold 0.4) meant contacts sharing the same email but with different names scored 0.35, below the threshold — silently missed. - Email comparison was case-sensitive. - Only one side of email/phone field pairs had its type verified. - Address data was completely ignored. - No import worker triggered duplicate detection after completion. Fixes: - Replace additive scoring with max-signal + bonus approach where each signal independently qualifies (email=0.85, phone=0.75, address=0.60, name=similarity) - Add case-insensitive email matching via LOWER() fragments - Filter both cf1 and cf2 contact_field_types in email/phone queries - Use LIKE 'mailto%' pattern to handle protocol colon inconsistency - Add address matching on normalized line1 + postal_code - Enqueue DuplicateDetectionWorker after successful completion in all three import workers (MonicaApiCrawlWorker, ImportSourceWorker, ImportWorker) - Add comprehensive test suite (20 tests) for the detection worker
5850c3f to
38cadb8
Compare
list_candidates now takes limit/offset opts (default 20 per page). The LiveView loads one page at a time with a "Load more" button. Dismiss removes the candidate from the current list without reloading.
The /contacts/duplicates route uses ContactLive.Index, not the standalone Duplicates LiveView. Added limit/offset pagination with Load more button and optimistic dismiss (no full re-query) to match the standalone page.
Photos with the same content_hash on both contacts caused a unique constraint violation during merge. Now deletes duplicate photos from the non-survivor before transferring the rest, matching the pattern used for contact_tags and activity_contacts. Also collapsed the merge flow from 4 steps to 3 by combining the preview and confirm steps into a single "Review & merge" step. From the duplicates page (contact preselected), merge is now 2 clicks instead of 3.
Replace the custom hand-rolled Oban dashboard (412 LOC LiveView) with the official `oban_web` package, now open-source and free as of Oban 2.20. Mount at /admin/oban behind a new :require_admin on_mount hook gated by Kith.Policy.can?(user, :manage, :oban). Hide the "Jobs" nav link from non-admin users. Drops the dead photo_sync queue from Oban config — its worker (PhotoBatchSyncWorker) was removed in commit e474853 when Monica imports moved to API crawling.
Make every published host port configurable through env-var substitution so contributors can resolve local port conflicts without editing compose files. Defaults preserve current behavior with no .env present. Dev (docker-compose.dev.yml): MAILPIT_SMTP_PORT, MAILPIT_WEB_PORT, APP_PORT join the existing DB_PORT and MINIO_*_PORT vars. Prod (docker-compose.prod.yml): HTTP_PORT, HTTPS_PORT for Caddy. Internal container ports remain hardcoded so services can address each other on standard ports over the Docker network.
Photo import previously ran inline as Phase 4 of MonicaApi.crawl/5, which buried it inside the contact-crawl job and left the import_history "Photo Sync" UI panel stuck on "in progress" forever — sync_summary was never written because the refactor that deleted PhotoBatchSyncWorker (commit e474853) removed the only writers. Move the photo crawl into its own MonicaPhotoSyncWorker on the photo_sync queue, enqueued by MonicaApiCrawlWorker when api_options["photos"] is true. The worker passes credentials through job args (matching the MonicaDocumentImportWorker pattern) so the main worker can wipe the API key from the DB immediately after contact import completes. Drop the unauthenticated link fallback from the decoder — Monica's /api/photos endpoint always returns dataUrl, so the previous Req.get(link) path was likely 401'ing on protected storage URLs. If a photo lacks dataUrl, it's now surfaced as a failed entry in sync_summary with a no_data_url reason, instead of being silently dropped. The worker writes sync_summary after each page so the UI shows live progress — total/synced/failed/not_found counts plus a per-photo table — and logs each page boundary at :info under the [MonicaPhotoSync] prefix. Per-photo decisions log at :debug. Tests cover: dataUrl import + avatar set, not_found on missing contact, failed on missing dataUrl, dedup by content_hash, and mid-flight sync_summary writes between pages.
The :photo_sync queue was removed in commit e474853 along with PhotoBatchSyncWorker, but the Oban queues config in config.exs was updated to drop it. Jobs queued to :photo_sync would sit forever with no consumers. Switch to :imports to match MonicaApiCrawlWorker and MonicaDocumentImportWorker.
- CLAUDE.md: remove photo_sync + api_supplement from the Oban queues list (those queues were removed in commit e474853); update queue count from 9 to 7 to match config/config.exs. - MonicaApiCrawlWorker moduledoc: clarify that photo import runs as a separate MonicaPhotoSyncWorker job, not inline. - Delete docs/superpowers/specs/2026-03-21-extensible-import-system-design.md and docs/superpowers/plans/2026-03-22-extensible-import-system.md. Both describe the pre-refactor design (PhotoSyncWorker, ApiSupplementWorker, file-based Monica import, per-photo job model) that no longer exists. CLAUDE.md plus the live moduledocs are now the source of truth.
Spec captures the photo-sync-after-reset bug, its root cause (orphaned import_records pointing at deleted contacts), and the decomposition of AccountResetWorker into a thin orchestrator plus one Cleanup module per data domain. Covers module layout, order-of-operations, account-scoped Oban job cancellation, error handling, and the test plan including a mandatory cross-account isolation check on every Cleanup module.
…efinement Plan decomposes the work into 13 TDD tasks, each producing one cleanup module (or the worker refactor) per commit. Spec refinement: split the proposed TagsAndActivitiesCleanup into Contacts.Cleanup (handles tags as part of the contacts axis) and Activities.Cleanup (its own context), aligning with the SOLID-elixir SRP guidance flagged in the brainstorming pass.
reminder_rules is account-scoped (not reminder-scoped) and has no FK relationship to reminders, so it's not CASCADE-cleared. Rules are 3 seeded-per-account pre-notification defaults treated as reference data.
…n explosion, E.164 normalization
Three independent bugs combined to make 1000 Monica imports surface as
~6000 entries in the duplicates tab even when "Auto-merge definite
duplicates" was checked:
Bug C (primary cause of "auto-merge did nothing"):
MonicaApiCrawlWorker.build_opts/1 only forwarded "extra_notes" — every
other wizard option, including "auto_merge_duplicates", was silently
dropped before reaching MonicaApi.crawl/5. Auto-merge was structurally
unreachable from the UI; only unit tests calling crawl/5 directly with
their own opts had ever exercised it. Now forwards api_options
unchanged, preserving the legacy extra_notes default-on semantic.
Bug A (primary cause of "6000 entries"):
DuplicateDetectionWorker.find_phone_matches/1 joined contact_fields on
the digit-normalized value but pre-filtered the *raw* value, so any
phone whose digits-stripped form was empty ("+", "()", "-", "N/A")
passed the filter, normalized to "", and matched every other zero-digit
phone — C(N,2) false candidates. The email side had a smaller analog:
no TRIM, no cf2 filter. Now strict equality on canonical values for
phone, TRIM(LOWER(...)) plus per-side non-empty filters for email.
Bug B (auto-merge predicate too narrow):
Within a name group the predicate compared values raw — Monica's
CardDAV-sync duplicates (monicahq/monica#6175) escape it on trivial
whitespace/casing artifacts. Name key now trims + collapses whitespace;
predicate accepts shared email OR phone OR address with normalized
comparators; addresses preloaded.
Bug D (heuristic phone storage):
PhoneFormatter.normalize/1 was heuristic — "10 digits stays as-is",
"11 digits starting with 1 becomes +1...", "00" IDD prefix unhandled —
so the same number written two ways was stored as two different
values. Replaced with ex_phone_number (libphonenumber port) producing
E.164. normalize/2 takes a default_region for bare numbers; bare input
without a region round-trips unchanged. PhoneRenormalizeWorker
backfills existing rows once.
UX:
auto_merge_duplicates wizard default flipped to true. New region
picker pre-populated from account.locale, listing every
libphonenumber-supported region with localized country names via
ex_cldr_territories ∩ ExPhoneNumber.Metadata.get_supported_regions/0.
Detection worker phone match simplified to plain equality now that
values are canonical at write-time, removing the per-row regex and the
filter mismatch that caused the cartesian explosion.
Tests:
- Boundary test in monica_api_crawl_worker_test that round-trips the
wizard flag through build_opts (would have caught Bug C directly).
- Cartesian-explosion regression in duplicate_detection_worker_test.
- Email TRIM and phone-normalization-on-import in respective files.
- phone_formatter_test rewritten for explicit-region semantics.
- New phone_renormalize_worker_test (5 tests).
- Production libphonenumber metadata in :test (was test-only metadata
that diverged from real validation on "555" NANP prefixes).
Dialyzer:
Added .dialyzer_ignore.exs suppressing two :contract_supertype
warnings against Kith.Cldr.Territory — these are emitted from
generated code in ex_cldr_territories, not actionable from our side.
1122 tests pass, mix quality clean.
…iner clustering DNSCluster connects via `Node.connect(:"basename@<ip>")` — it uses the raw IP as the host part of the node name. That requires each peer's actual node name to be `name@<ip>`, which conflicts with Phoenix release's default of `name@<hostname>`. The user's Portainer deployment exposes containers under stable service names (`app`, `worker`) that resolve via Docker DNS — but the BEAM nodes are named after the container ID (`kith@64c98536e88c`), so `Node.connect(:"kith@app")` fails the handshake. Switch to libcluster's Epmd strategy which connects by explicit node name (no IP rewriting). Each container is configured via env to: - `RELEASE_NODE=kith@app` (or `kith@worker`) - `KITH_CLUSTER_HOSTS=kith@app,kith@worker` - `RELEASE_COOKIE=<shared>` - `RELEASE_DISTRIBUTION=name` libcluster runs Cluster.Strategy.Epmd which polls `Node.connect/1` for each host periodically; once one direction connects, the bidirectional distribution is established and PubSub spans both. Dev and test are unaffected: `KITH_CLUSTER_HOSTS` is unset, so the libcluster topology list is empty and Cluster.Supervisor no-ops.
9 tasks
Owner
Author
|
Superseded by #23. Branch grew beyond original duplicate-detection scope to cover account reset completeness, Monica import overhaul, worker clustering, and infra changes. New PR has accurate title + full description across all five themes. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
line1 + postal_code(case-insensitive, trimmed).LIKE 'mailto%'pattern, handling the colon inconsistency between seeded and custom-created field types.DuplicateDetectionWorkeron successful completion.Test plan
mix compile --warnings-as-errors— cleanmix test— 1035 tests, 0 failuresmix quality— format, credo, sobelow, dialyzer all pass