fix(skilify): plug the three holes that produced cross-author duplicate skills #119
The worker stamped `new Date().toISOString()` on both createdAt and updatedAt at every INSERT to the skills table — so after MERGE, the v+1 row reported the merge timestamp as creation date, losing the original creation date entirely. Across the 27 rows in prod today, every single one has created_at == updated_at. The local SKILL.md frontmatter already preserved created_at correctly across merges (skill-writer.ts mergeSkill inherits the v=1 timestamp). This change threads that value through SkillWriteResult so the worker can pass it to insertSkillRow instead of stamping a fresh now(). Tests assert that mergeSkill returns the v=1 createdAt unchanged and a fresh updatedAt, matching the frontmatter behavior.
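A minimal sketch of the threading described above. Only the names `SkillWriteResult`, `mergeSkill`, `createdAt` and `updatedAt` come from the commit message; the field layout, the `SkillFrontmatter` and `SkillRow` types, and the `toSkillRow` helper are hypothetical and stand in for whatever `insertSkillRow` actually receives.

```typescript
// Hypothetical shapes for illustration; the real types live in skill-writer.ts.
interface SkillFrontmatter {
  name: string;
  version: number;
  createdAt: string; // ISO timestamp written at v=1
  updatedAt: string;
}

interface SkillWriteResult {
  action: "create" | "merge";
  version: number;
  createdAt: string;
  updatedAt: string;
}

// A merge inherits the v=1 creation date and only refreshes updatedAt.
function mergeSkill(existing: SkillFrontmatter, now: Date = new Date()): SkillWriteResult {
  return {
    action: "merge",
    version: existing.version + 1,
    createdAt: existing.createdAt,
    updatedAt: now.toISOString(),
  };
}

// Hypothetical row shape for the skills table.
interface SkillRow {
  name: string;
  version: number;
  created_at: string;
  updated_at: string;
}

// The worker forwards both timestamps instead of stamping now() on each.
function toSkillRow(name: string, result: SkillWriteResult): SkillRow {
  return {
    name,
    version: result.version,
    created_at: result.createdAt,
    updated_at: result.updatedAt,
  };
}
```

With this shape, a v+1 row keeps the original `created_at`, so `created_at == updated_at` should only hold for rows that were never merged.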
Two developers cloning the same repo with different URL styles (SSH vs HTTPS, with/without .git suffix, with embedded credentials) were getting different project_key values for the same project, because deriveProjectKey just sha1'd the raw output of `git config --get remote.origin.url`. Concrete cost: 5 distinct keys for the same repo across realistic clone variants, which means the dedup gate, the per-project state file, and the skills table all treated the same logical project as multiple separate ones. Cross-cloner skill dedup didn't work at all.

normalizeGitRemoteUrl now collapses all common surface forms down to `host/owner/repo` (lowercased) before hashing:
- strip URL scheme (https://, git://, ssh://, …)
- convert SCP-style git@host:path → host/path
- drop user@ / user:pass@ credentials
- drop trailing .git and trailing slashes
- lowercase the result

Tests exercise the full set of clone styles seen in the wild plus a fallback case for non-URL inputs (the cwd-hash path still works when the cwd isn't a git repo).
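The normalization steps listed above can be sketched as follows. This is an illustrative reimplementation, not the code in `src/skilify/state.ts`: the regexes are assumptions, and edge cases such as passwords containing `@` or `/` are not handled.

```typescript
// Sketch of the described steps; the real normalizeGitRemoteUrl may differ.
function normalizeGitRemoteUrl(raw: string): string {
  let s = raw.trim();
  const schemeRe = /^[a-z][a-z0-9+.-]*:\/\//i;
  const hadScheme = schemeRe.test(s);

  // 1. Strip the URL scheme (https://, git://, ssh://, ...).
  s = s.replace(schemeRe, "");

  // 2. Drop user@ or user:pass@ credentials in front of the host.
  s = s.replace(/^[^/@]+@/, "");

  // 3. SCP-style host:path becomes host/path. Only applied when there was
  //    no scheme, so an explicit port in a URL form is left alone.
  if (!hadScheme) {
    s = s.replace(/^([^/:]+):/, "$1/");
  }

  // 4. Drop trailing slashes and a trailing .git suffix.
  s = s.replace(/\/+$/, "").replace(/\.git$/i, "").replace(/\/+$/, "");

  // 5. Lowercase the result.
  return s.toLowerCase();
}

// Clone styles that should collapse to one canonical string.
const variants = [
  "git@github.com:activeloopai/hivemind.git",
  "git@github.com:activeloopai/hivemind",
  "https://github.com/activeloopai/hivemind.git",
  "https://github.com/activeloopai/hivemind",
  "https://github.com/activeloopai/hivemind/",
  "https://emanuele@github.com/activeloopai/hivemind.git",
  "https://emanuele:secret@github.com/activeloopai/hivemind.git",
  "ssh://git@github.com/activeloopai/hivemind.git",
];
const canonical = new Set(variants.map(normalizeGitRemoteUrl));
```

Under these assumptions `canonical` ends up with a single entry, `github.com/activeloopai/hivemind`, so every variant hashes to the same project key.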
…eable

The autopull lands pulled skills under ~/.claude/skills/<name>--<author>/ regardless of scope-config, while the worker was reading only <cwd>/.claude/skills/. With the default install=project setting (no ~/.deeplake/state/skilify/config.json present), the gate saw zero existing skills no matter how many had been autopulled. That is the root cause of every cross-author duplicate currently in the table.

This change moves the existing-skills enumeration into a dedicated testable module that reads from BOTH roots, dedupes by name (project wins), and tags each entry with its source so the gate prompt can distinguish them. Skills are now rendered as either:

    --- existing skill [project]: <name> ---
    --- existing skill [global, read-only]: <name> ---

The prompt restricts MERGE targets to [project] only. Globally-pulled skills are reference-only so the gate can avoid duplicating work already covered there, but cannot silently overwrite teammates' rows. A full "edit teammates' skills" model (contributors column, auto-promote scope on cross-author MERGE) is tracked separately in #118.

The refactor is mechanical: the prompt-rendering logic moved verbatim into src/skilify/existing-skills.ts. New unit tests cover:
- both roots scanned, source tagged correctly
- same name in both roots: project wins
- MERGE target list excludes [global] entries
- char-cap path produces the omitted-count placeholder

Bundle-scan test updated to look for the new gate-prompt heading.
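The dedupe-and-tag behavior described above can be sketched as a pure function. Directory scanning is left out, and the names `combineExistingSkills`, `renderHeading`, `RawSkill` and the exact `TaggedSkill` fields are hypothetical; only the two heading strings and the "project wins" rule come from the commit message. The sketch also assumes global names have already had their `--<author>` suffix removed.

```typescript
type SkillSource = "project" | "global";

interface RawSkill {
  name: string;
  body: string;
}

interface TaggedSkill extends RawSkill {
  source: SkillSource;
}

// Merge both roots, dedupe by name, and let the project copy win.
function combineExistingSkills(
  project: RawSkill[],
  global: RawSkill[],
): { skills: TaggedSkill[]; mergeTargetNames: string[] } {
  const byName = new Map<string, TaggedSkill>();
  for (const s of global) byName.set(s.name, { ...s, source: "global" });
  // Inserted second so a project skill overwrites a same-named global one.
  for (const s of project) byName.set(s.name, { ...s, source: "project" });

  const skills = Array.from(byName.values());
  // Only project skills are eligible MERGE targets.
  const mergeTargetNames = skills.filter((s) => s.source === "project").map((s) => s.name);
  return { skills, mergeTargetNames };
}

// Heading shown in the gate prompt for each existing skill.
function renderHeading(skill: TaggedSkill): string {
  const tag = skill.source === "project" ? "project" : "global, read-only";
  return `--- existing skill [${tag}]: ${skill.name} ---`;
}
```

Keeping the merge logic free of filesystem access is one way to make the "same name in both roots" and "MERGE list excludes global" cases cheap to unit test.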
📝 Walkthrough
This PR adds Git remote URL normalization for stable project keys, introduces a new existing-skills management system that aggregates project and global skills with separate merge eligibility, and extends skill metadata to include creation and update timestamps that propagate through the persistence pipeline.
Changes
Skilify System Enhancement: Git Normalization, Existing Skills, and Metadata Timestamps
Sequence Diagram(s)
sequenceDiagram
participant Agent as Skilify Agent
participant Writer as Skill Writer
participant Existing as Existing Skills
participant Prompt as Prompt Builder
participant Deeplake as Deeplake Table
Agent->>Existing: listAllExistingSkills(cwd)
Existing->>Existing: load project + global skills
Existing->>Existing: deduplicate, tag by source
Existing-->>Agent: [TaggedSkill[], mergeTargetNames]
Agent->>Prompt: buildPrompt(pairs, existing)
Prompt->>Prompt: render "EXISTING SKILLS" block
Prompt->>Prompt: restrict MERGE to mergeTargetNames
Prompt-->>Agent: curated_prompt
Agent->>Writer: writeNewSkill() or mergeSkill()
Writer->>Writer: extract createdAt, updatedAt from frontmatter
Writer-->>Agent: {path, action, version, createdAt, updatedAt}
Agent->>Deeplake: insertSkillRow(..., createdAt, updatedAt)
Coverage Report
Scope: files changed in this PR. Enforced threshold: 90% per metric (per file via
File Coverage — 4 files changed
Generated for commit 310d60d. |
Actionable comments posted: 2
🧹 Nitpick comments (1)
claude-code/tests/skilify-state.test.ts (1)
70-84: ⚡ Quick win: Add regression cases for explicit default ports.
The canonicalization suite currently misses `ssh://...:22/...` and `https://...:443/...` variants, which are common enough to reintroduce project-key fragmentation if behavior regresses.

✅ Suggested test additions

     it("collapses all common git URL forms to one canonical string", () => {
       const variants = [
         "git@github.com:activeloopai/hivemind.git",
         "git@github.com:activeloopai/hivemind",
         "https://github.com/activeloopai/hivemind.git",
         "https://github.com/activeloopai/hivemind",
    +    "https://github.com:443/activeloopai/hivemind.git",
         "https://github.com/activeloopai/hivemind/",
         "https://emanuele@github.com/activeloopai/hivemind.git",
         "https://emanuele:secret@github.com/activeloopai/hivemind.git",
         "ssh://git@github.com/activeloopai/hivemind.git",
    +    "ssh://git@github.com:22/activeloopai/hivemind.git",
       ];

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@claude-code/tests/skilify-state.test.ts` around lines 70 - 84, Add regression cases for explicit default ports to the canonicalization test: update the variants array in skilify-state.test.ts (the test that calls normalizeGitRemoteUrl) to include SSH with :22 (e.g., "ssh://git@github.com:22/activeloopai/hivemind.git" and "git@github.com:22:activeloopai/hivemind.git" or equivalent scp-style if supported) and HTTPS with :443 (e.g., "https://github.com:443/activeloopai/hivemind.git" and credentialed forms like "https://user:pass@github.com:443/activeloopai/hivemind.git"), then assert they normalize to the same canonical "github.com/activeloopai/hivemind".
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@claude-code/bundle/session-end.js`:
- Around line 356-369: The normalizeGitRemoteUrl function currently preserves
explicit default ports (e.g. :22, :443) so equivalent remotes normalize to
different strings; update normalizeGitRemoteUrl to detect and strip default
ports for common schemes (SSH -> :22, HTTPS -> :443, HTTP -> :80) before
removing .git and trailing slashes. Specifically, when parsing the input in
normalizeGitRemoteUrl, capture the scheme (if any) and host:port, and if the
port equals the scheme's default, remove the :port fragment (while keeping
non-default ports intact); ensure this logic works both for normal URL forms and
for scp-style remotes handled by the existing scp branch so projectKey
generation becomes stable.
In `@src/skilify/state.ts`:
- Around line 75-102: normalizeGitRemoteUrl currently leaves explicit default
ports (e.g., :22 or :443) in the host path which causes duplicate project keys;
update normalizeGitRemoteUrl to strip default SSH and HTTPS ports from the host
portion (remove trailing :22 and :443 when they appear right after the hostname
and before a slash or end) while preserving non-default ports, so
deriveProjectKey sees canonical signatures; modify the hostname normalization
logic inside normalizeGitRemoteUrl (the code handling scp/scheme stripping and
host/path assembly) to run a regex that removes :22 and :443 only for
default-scheme cases and not arbitrary numeric suffixes.
📒 Files selected for processing (20)
claude-code/bundle/capture.js
claude-code/bundle/session-end.js
claude-code/bundle/skilify-worker.js
claude-code/tests/skilify-bundle-scan.test.ts
claude-code/tests/skilify-existing-skills.test.ts
claude-code/tests/skilify-skill-writer.test.ts
claude-code/tests/skilify-state.test.ts
codex/bundle/skilify-worker.js
codex/bundle/stop.js
cursor/bundle/capture.js
cursor/bundle/session-end.js
cursor/bundle/skilify-worker.js
hermes/bundle/capture.js
hermes/bundle/session-end.js
hermes/bundle/skilify-worker.js
pi/bundle/skilify-worker.js
src/skilify/existing-skills.ts
src/skilify/skilify-worker.ts
src/skilify/skill-writer.ts
src/skilify/state.ts
…-bugs

# Conflicts:
#   claude-code/bundle/skillify-worker.js
#   claude-code/tests/skillify-bundle-scan.test.ts
#   codex/bundle/skillify-worker.js
#   cursor/bundle/skillify-worker.js
#   hermes/bundle/skillify-worker.js
#   pi/bundle/skilify-worker.js
#   src/skillify/existing-skills.ts
Equivalent remotes were still producing different project_keys when the URL carried the scheme's default port explicitly:

    https://github.com:443/org/repo
    ssh://git@github.com:22/org/repo
    git://github.com:9418/org/repo
    http://github.com:80/org/repo

vs the same URLs without the explicit port. Rare in hand-typed config but realistic for automation / hosting that emits canonical URLs with ports, so worth folding in.

Non-default ports (e.g. :8443 for an internal git server) stay: they identify the remote and must not be stripped.
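One way to fold the default-port rule in, sketched under the assumption that it runs on the raw URL before the scheme is stripped. The helper name `stripDefaultPort` and the `DEFAULT_PORTS` table are hypothetical; the four scheme/port pairs are the ones listed in the commit message.

```typescript
// Default ports for the schemes named in the commit message.
const DEFAULT_PORTS: Record<string, string> = {
  https: "443",
  http: "80",
  ssh: "22",
  git: "9418",
};

// Remove an explicit port only when it equals the scheme's default.
function stripDefaultPort(url: string): string {
  const m = /^([a-z][a-z0-9+.-]*):\/\/([^/]*)(.*)$/i.exec(url);
  if (!m) return url; // no scheme (e.g. SCP-style remote): leave untouched

  const [, scheme, authority, rest] = m;
  const defaultPort: string | undefined = DEFAULT_PORTS[scheme.toLowerCase()];
  // The port, if any, is the numeric suffix of the authority part.
  const portMatch = /:(\d+)$/.exec(authority);

  if (defaultPort !== undefined && portMatch !== null && portMatch[1] === defaultPort) {
    const host = authority.slice(0, authority.length - portMatch[0].length);
    return `${scheme}://${host}${rest}`;
  }
  return url; // non-default or absent port stays as-is
}
```

Matching the port against the scheme, instead of stripping `:22` or `:443` anywhere, keeps a remote such as `https://git.internal:8443/org/repo` distinct from its port-less form.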
      return join(STATE_DIR, `${projectKey}.lock`);
    }

    /**
where do we use git for skills and why
deriveProjectKey produces a stable per-project identifier used in three places:
- `skills` table partition key: every row carries `project_key`. Combined with `name` it's the dedup key the gate uses to decide KEEP vs MERGE, and what `pull.ts` uses to namespace skills cross-project (`<root>/<project_key>/...` in the legacy layout, now flattened to `<name>--<author>/`).
- Per-project state file: `~/.deeplake/state/skillify/<project_key>.json` (Stop counter, watermark UUID, last mined date). Two projects mustn't share this file.
- Autopull manifest: `pulled.json` records each pulled skill with its `project_key` so `unpull` knows which entries belong to which project.
Why git remote URL specifically as the input to the hash: it's the most reliable cross-machine identifier of "the same logical project". A repo cloned by Alice into /Users/alice/work/foo and by me into /home/emanuele/code/foo shares git remote.origin.url — so both their skills get the same project_key, the dedup gate can reason about them as one project, and pulling Alice's skill lands it in my namespace correctly.
If we hashed the absolute cwd instead, every clone of the same repo on every machine would have a different key → duplicate skills across cloners, no cross-cloner dedup. The cwd is only a fallback when there's no git remote (e.g. mining in a non-repo working dir like /home/ec2-user) — and that fallback is exactly what produced some of the duplicates we cleaned up in this PR (the three hivemind-*-testing skills emanuele had: one project cwd had git remote, one didn't, so they hashed differently).
The fix in this PR (normalizeGitRemoteUrl) makes the git-URL path actually deliver on that promise — before the fix, SSH vs HTTPS vs .git-suffix variants of the same URL were producing 5 different hashes for the same repo (test cases in claude-code/tests/skillify-state.test.ts).
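The remote-first, cwd-fallback behavior described in this reply can be sketched as below. The function name `deriveProjectKeySketch`, its parameters, and the injected `normalize` callback are hypothetical; the real `deriveProjectKey` reads the remote via git and may truncate or format the hash differently.

```typescript
import { createHash } from "node:crypto";

function sha1Hex(input: string): string {
  return createHash("sha1").update(input).digest("hex");
}

// Hash the normalized remote when there is one; otherwise fall back to cwd.
function deriveProjectKeySketch(
  remoteUrl: string | null,
  cwd: string,
  normalize: (url: string) => string,
): string {
  const hasRemote = remoteUrl !== null && remoteUrl.trim() !== "";
  // Same remote on different machines yields the same signature;
  // without a remote, the key is tied to the local path.
  const signature = hasRemote ? normalize(remoteUrl as string) : cwd;
  return sha1Hex(signature);
}

// Stand-in normalizer for the example; the real one is normalizeGitRemoteUrl.
const lower = (url: string): string => url.trim().toLowerCase();

const aliceKey = deriveProjectKeySketch("github.com/org/foo", "/Users/alice/work/foo", lower);
const emanueleKey = deriveProjectKeySketch("GitHub.com/org/foo", "/home/emanuele/code/foo", lower);
const noRemoteKey = deriveProjectKeySketch(null, "/home/ec2-user", lower);
```

Here `aliceKey` and `emanueleKey` agree because the normalized remotes agree, while `noRemoteKey` depends only on the working directory, which is the fallback path that produced some of the cleaned-up duplicates.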
openclaw is missing
openclaw was the only agent ever excluded from the per-agent bundle-scan iteration, even though its `dist/skillify-worker.js` ships exactly the same compiled worker code as the others. That left a real gap: if a build regression dropped the gate prompt or the migration helper specifically from openclaw's bundle (esbuild config drift, cache staleness), no test would have caught it.

Adds "openclaw" to AGENTS and maps it to `dist/` via a small special-case in `bundlePath` (openclaw is a separate npm sub-package whose build output is conventionally `dist/`, gitignored, and regenerated by `npm run build`). Same assertions now apply:
- skillify-worker bundle ships, contains EXISTING SKILLS heading and the [project] are MERGE-eligible clause
- UPDATE-on-skills anti-pattern check
- legacy state-dir migration helper present and called from readState

The dedicated openclaw test at the bottom (inlined migration helper inside index.js) stays: that helper is genuinely a different shape (`migrateOpenclawSkillifyLegacyStateDir`, not the shared `migrateLegacyStateDir`) and lives in a different bundle.

Test count: 19 -> 21.
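A rough sketch of the special-case described above. The contents of `AGENTS` are inferred from the bundle paths listed earlier in this PR, and the body of `bundlePath` is an assumption about how the test helper might look, not a copy of it.

```typescript
// Agent list assumed from the bundle directories touched by this PR, plus openclaw.
const AGENTS = ["claude-code", "codex", "cursor", "hermes", "pi", "openclaw"] as const;
type Agent = (typeof AGENTS)[number];

// openclaw builds into dist/; every other agent ships a bundle/ directory.
function bundlePath(agent: Agent, file: string): string {
  const dir = agent === "openclaw" ? "dist" : "bundle";
  return `${agent}/${dir}/${file}`;
}

// Paths the per-agent bundle-scan assertions would iterate over.
const workerBundles = AGENTS.map((agent) => bundlePath(agent, "skillify-worker.js"));
```

Routing the exception through one helper means the shared assertions can loop over `AGENTS` without a separate branch per test.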
Good catch, addressed in The dedicated test on
Summary
Three tactical fixes for the cross-author and cross-project duplicate skills problem on the `skills` Deeplake table. Each commit is independent and focused; landing all three closes the gap that produced the duplicates we currently have in prod (e.g. `meta-harness-continual-learning` ↔ `continual-learning-meta-harness`, the three near-identical `hivemind-*-testing` skills).

What was actually broken
- `created_at` was reset on every MERGE. All 27 rows in the prod `skills` table have `created_at == updated_at` because the worker stamped `now()` on both fields at every INSERT. The local SKILL.md frontmatter already preserved the v=1 creation date correctly; only the DB row didn't see it.
- `deriveProjectKey` hashed the raw git remote URL. Five surface forms of the same repo (`git@...`, `https://...`, with/without `.git`, with embedded credentials) produced five different `project_key` values. Two devs cloning the same repo with different URL styles ended up with disjoint state.
- The gate only read `<cwd>/.claude/skills/`, never `~/.claude/skills/`. Autopull lands skills under the global root regardless of scope-config; the worker only looked at the project root. With the default `install=project`, the gate saw zero existing skills no matter how many had been autopulled, so every mining run started from a blank slate and re-created near-identical skills.

Commits
- `fix(skilify): preserve created_at across MERGE in skills table`: thread the v=1 `created_at` through `SkillWriteResult` instead of stamping `now()`.
- `fix(skilify): normalize git remote URL before hashing for projectKey`: collapse SSH/HTTPS/.git/cred variants to a canonical `host/owner/repo` form.
- `fix(skilify): gate reads project + global skills, only [project] mergeable`: extract the existing-skills logic into a testable module, read both roots, tag entries `[project]` (MERGE-eligible) or `[global, read-only]` (reference only). Cross-author MERGE intentionally stays forbidden in this PR.

Out of scope (filed separately)
- `contributors` column and auto-promote `me → team` on cross-author edits: see "skills table: add `contributors` column for team/org scope edits" #118. The interim restriction in this PR (MERGE only into your own project skills) avoids silently overwriting teammates' work until that lands.
- […] `skills-autopull-fanout-symlinks` versions): purely a storage-bloat concern, not a correctness one.

Test plan
- `npm run build` clean
- `npm test`: 2189 passing (was 2175 before; +14 from new tests)
- `skilify-state.test.ts` (URL normalization, 4 cases), `skilify-skill-writer.test.ts` (createdAt threading through `SkillWriteResult`), `skilify-existing-skills.test.ts` (7 cases: both roots, dedupe project-wins, source tagging, char-cap placeholder)
- […] `[global, read-only]`