aboutcode-org · johnmhoran · May 23, 2026 · May 21, 2026 · May 21, 2026 · May 21, 2026
diff --git a/cc-by-4.0.LICENSE b/cc-by-4.0.LICENSE
diff --git a/website/README.md b/website/README.md
@@ -1,6 +1,7 @@
-# Website
+# AboutCode.org website
 
-This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator.
+The AboutCode.org website is built using [Docusaurus](https://docusaurus.io/),
+a static website generator.
 
 ## Installation
 
@@ -14,28 +15,19 @@ yarn
 yarn start
 ```
 
-This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.
+This command starts a local development server and opens up a browser window. 
+Most changes are reflected live without having to restart the server.
 
 ## Build
 
 ```bash
 yarn build
 ```
 
-This command generates static content into the `build` directory and can be served using any static contents hosting service.
+This command generates static content into the `build` directory, which can be
+served using any static content hosting service.
 
-## Deployment
 
-Using SSH:
+## License
 
-```bash
-USE_SSH=true yarn deploy
-```
-
-Not using SSH:
-
-```bash
-GIT_USER=<Your GitHub username> yarn deploy
-```
-
-If you are using GitHub pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch.
+CC-BY-SA-4.0
diff --git a/...log/agentic-scancode-port-case-study/2026-06-23-agentic-rust-port-case-study.md b/...log/agentic-scancode-port-case-study/2026-06-23-agentic-rust-port-case-study.md
@@ -0,0 +1,112 @@
+---
+slug: agentic-scancode-port-case-study
+title: An AI agent ported our codebase from Python to Rust
+authors: [pombredanne]
+tags: [scancode, rust, agent, genai]
+hide_table_of_contents: false
+---
+
+# An AI agent ported our codebase from Python to Rust
+
+## A case study, not an isolated incident
+
+ScanCode detects licenses, copyrights, package dependencies, vulnerabilities, and a few more things in both source code and binary files. The use cases include license and security compliance and software supply chain management. It is the product of over a decade of careful design, architecture, and testing by an open source community of over 700 contributors, backed by more than 40,000 automated tests covering license detection alone and over 90,000 automated tests overall.
+
+The core module is ScanCode Toolkit, the industry-leading open source code scanning engine. In early 2026, an agentic LLM system ported ScanCode Toolkit from Python to Rust, published the derived results under a name that infringed the ScanCode trademark, stripped copyright and license notices from both ScanCode and the third-party code we had vendored and carefully attributed, and started an outreach campaign, all without ever engaging the AboutCode community.
+
+This incident is not isolated. AboutCode and many other open source projects are experiencing a steady influx of AI-generated issues and pull requests that are superficially plausible, templated, often duplicates of existing reports, and almost never grounded in actual use of the software. Maintainers across the open source ecosystem call this AI slop. It consumes human triage time, degrades signal in issue trackers, and erodes the social contract between users, contributors, and maintainers. The porting incident described in this post is the same phenomenon at a larger scale and with higher stakes.
+
+<!-- truncate -->
+
+This article documents what happened technically, what it reveals about the current state of AI-assisted development, and what the open source community needs to do when dealing with AI-generated code.
+
+## What the agent did
+
+The porting was driven by an LLM orchestration harness (using OpenCode and an OpenClaw-vibe coded OpenCode plugin). The agent's approach was straightforward: take a mature, well-tested Python codebase and translate it into Rust. This is neither an independent rewrite nor merely inspired by ScanCode, as the port claims. It is a mechanical translation, and it is exactly the kind of task LLMs are well suited for.
+
+Why? Code translation is fundamentally a language translation task, and Large Language Models (LLMs) were originally designed for such language tasks. The extensive ScanCode test suite provided the specification and the guardrails. The agent did not need to understand the algorithms; it only needed to produce code that passed the tests.
+
+This is worth repeating: a comprehensive test suite, decent documentation, and curated datasets are what make automated porting possible. They are also what make a codebase easier to replicate without understanding it.
+
+The agent's initial approach, using an existing Rust license-detection library, failed to match ScanCode's output quality. The agent then did what any translator would do when a loose paraphrase fails: it copied the original more closely. The final port reproduces ScanCode's core algorithms, code organization, and data-driven architecture in Rust, not because the agent understood them, but because it had enough training data and test feedback to converge on equivalent code.
+
+## Performance claims
+
+The Rust port published a "benchmark" that claimed 10x to 100x improvements in performance. Many benchmarks are fundamentally flawed because they are designed to document and assert their own tool's feature or performance superiority to help sell or promote that tool.
+
+Compiled Rust is capable of outperforming interpreted Python. In the published "benchmarks", the Rust port runs faster than ScanCode, but when checked it returns incorrect results, with missing detections and skipped files. ScanCode runs the standard ScanCode test suite faster than the Rust port, even though the Rust port covers fewer tests. After applying optimizations similar to those in the Rust port, ScanCode runs as fast as or faster than the Rust port, while maintaining correctness and attribution.
+
+Testing correctness or speed on a subset does not equate to superiority on the whole.
+
+This also demonstrates a core problem of AI-assisted software development. The agent replicated ScanCode's structure well enough to pass some tests, but not well enough to pass all of them. The port applied performance optimizations and caching strategies to appear faster, but sacrificed critical data correctness and completeness.
+
+## License and copyright failures
+
+ScanCode is Apache-2.0 licensed. The Apache open source license is among the most permissive available, with minimal requirements:
+
+1. Retain the original NOTICE file.
+2. Preserve license and copyright headers, including in modified files.
+3. Note changes made to modified files.
+4. Do not reuse the project name without permission.
+
+The port violated all four requirements. Requirements 1 and 4 were partially corrected after ScanCode maintainers reached out. Requirements 2 and 3 were not.
+
+This impacts more than ScanCode itself and its authors and contributors. ScanCode incorporates code from dozens of other open source projects, each with its own license and copyright. We track all of this meticulously with origin files, per-file copyright headers, and attribution notices. The agent stripped all of it, extending the license violations to every upstream project whose code passed through ScanCode into the port. Note also that the Apache license is not graduated: you either comply or you are not licensed. As of this publication, the port is not compliant.
+
+The irony is not subtle. ScanCode is the product of the collective expertise of the compliance community, and is a tool that the industry uses to detect exactly this kind of license and copyright violation.
+
+## LLMs do not track provenance
+
+The most important technical observation is not about speed or correctness. It is about attribution.
+
+LLMs, by design, do not track provenance. When an agent translates code, it produces output. It does not record that the output derives from a specific file, authored by a specific contributor, under a specific license. That metadata is not part of the model's output representation.
+
+This is a structural problem, not a configuration issue. Agents copying from open source projects will strip attribution by default unless explicit post-processing steps are added to detect and preserve license headers and carefully track the code origin and license. No such steps were taken here. The result is that LLM-assisted porting, as currently practiced, is a plagiarism pipeline with no attribution layer.
+
+This obfuscation is not always passive. Our review of the commit history and structure of the Rust port found evidence that the agent actively worked to distance the output from its source, either directly or steered through prompting. Variable names were changed, comments were rewritten or stripped, and additional references to ported code lines were added. Evidence in the generated code and the issue tracker also shows that the claim of an "independent rewrite merely inspired by ScanCode" was baked into the project's framing from the start.
+
+Prompting for originality does not produce originality. The agent was following instructions. If you prompt an agent to produce an "original implementation", it will generate whatever surface-level variation is possible while the code underneath remains derived from the original project. It produces the appearance of originality, which is a worse outcome than straightforward copying because it is harder to detect.
+
+The same dynamic occurs at a smaller scale in everyday AI-assisted development. When a developer uses a code generation tool to produce a utility function, a parser, or a data structure, the generated code may closely reproduce implementation patterns from open source code present in the model's training data, without any indication of that lineage. Most developers do not know to check for this. Most tools do not flag it.
+
+This is a warning to both sides of the AI-assisted development discussion. For open source developers, your licenses and your contributors' credits are invisible to the agent. For developers producing AI-generated code, the output your tools produce may carry unresolved obligations to authors whose work was used without attribution.
+
+## This is a case study, not an isolated incident
+
+This episode is not primarily about one project or one set of actors. It is a preview of a pattern that will repeat across the open source ecosystem.
+
+The specific conditions that made ScanCode a target are the same conditions that characterize most successful open source projects: a mature codebase, comprehensive tests, plenty of documentation, lots of curated content, a large downstream user base, an active community, and a well-known and trusted name. The tools and techniques used are becoming routine with AI-generated commits, contributions, and rewrites: agentic orchestration, automated issue crawling, and targeted community outreach.
+
+The human and social dimensions of this incident are as important as, if not more important than, the technical ones. The agent crawled ScanCode's issue tracker and implemented old, outdated, or incorrect features, such as a three-year-old feature request for yum database support, a tool that Fedora deprecated a decade ago and whose repository was archived in March 2026. The agent also reported the development of new features, but these features already exist in other AboutCode open source projects.
+
+This is what automated development without community context produces: technically functional work that is socially and strategically incoherent, creating mostly useless or redundant technical debt and bypassing the ecosystem domain expertise and collective wisdom needed to select which feature to implement.
+
+This is one of the less-discussed costs of AI slop at scale. It is not just noise; it is misdirected effort that consumes real resources on both sides. Maintainers spend time triaging and closing low-quality issues. Automated systems spend compute implementing stale or irrelevant features. Neither produces value. And the accumulated technical debt in cluttered issue trackers, undiscovered license violations, and replicated but misunderstood code falls on human maintainers to clean up.
+
+The community outreach campaign by the Rust port team, which contacted users to suggest replacing ScanCode, reflects the same absence of community understanding. The Rust port developers never engaged with ScanCode's public community channels, joined the weekly meetings, or chatted with maintainers until that campaign began. An automated system optimizing for adoption does not naturally model the trust relationships and collaborative norms that open source communities are built on.
+
+## Feedback for the community
+
+The path forward is not to litigate this one case. The path forward is to develop best practices.
+
+Benchmark suites and clear performance profiles matter more than ever, both to guide legitimate contributors and to provide ground truth against inflated claims. License compliance tooling, including tools like ScanCode, should be routinely applied to AI-generated contributions. Attribution gaps are not always intentional; they are often invisible without explicit checking. And we are building more open source tools to help ensure open source authors are properly credited for their work.
+
+To open source maintainers, you should care for and protect the integrity of your brand, copyright, and license. With the cost of code generation approaching zero, these are your key assets.
+
+To developers using agentic coding tools, license and copyright compliance does not happen automatically.
+
+If your tooling generates code ported from or inspired by existing projects, you are responsible for the output's obligations. Build attribution checking into your workflow, not as an afterthought.
+
+AI and ML practitioners and enthusiasts, please understand that the open source projects you train on, benchmark against, and port from are maintained by communities. These communities are composed of people, and have norms, governance structures, contribution pathways, and years of accumulated context and domain expertise.
+
+Participating in these communities, before, during, and after building on their work, is not a formality. It is how the ecosystem that enables your work is sustained. The performance gains available in ScanCode come from that community's accumulated expertise.
+
+## Onwards and upwards: Taking the high road
+
+The open source community's response to AI-generated code cannot be purely defensive. We need to develop shared norms for attribution in agent-assisted development, better tooling for detecting and flagging provenance gaps, and clearer frameworks for what constitutes a derivative work versus an independent implementation.
+
+We also need practitioners on both sides of this conversation, open source maintainers and AI engineers, to work together to build those frameworks. That is harder than forking and porting a repository. It is also the only approach that scales.
+
+Building those frameworks begins with the next conversation, and that conversation is open to everyone.
+
+P.S. Detecting this kind of AI-assisted obfuscated porting is exactly the problem that motivated our recent AI-Generated Code Search project, now integrated into the AboutCode MatchCode code matching engine. MatchCode is designed to identify code that has been structurally reproduced across language boundaries. It matches not just token-level similarity but also algorithmic and architectural similarities using fingerprints. If you are a maintainer concerned that your project may have been ported or replicated without attribution, it is worth joining us so we can assist you in running your codebase through MatchCode as a baseline. The tool exists precisely because provenance questions like this one are becoming more common. This is not a silver bullet, but it can help.
diff --git a/...ed-licenses-public-database-scancode-licensedb/ScanCode-LicenseDB-1536x1085.png b/...ed-licenses-public-database-scancode-licensedb/ScanCode-LicenseDB-1536x1085.png
diff --git a/...e/blog/curated-licenses-public-database-scancode-licensedb/scancode-db-blog.png b/...e/blog/curated-licenses-public-database-scancode-licensedb/scancode-db-blog.png
diff --git a/website/blog/purls-of-wisdom/atom_grey-1024x683.png b/website/blog/purls-of-wisdom/atom_grey-1024x683.png
diff --git a/website/blog/purls-of-wisdom/standards_2x.png b/website/blog/purls-of-wisdom/standards_2x.png
diff --git a/website/blog/tags.yml b/website/blog/tags.yml
@@ -3,6 +3,11 @@ advisories:
   permalink: /advisories
   description: advisories tag description
 
+agent:
+  label: agent
+  permalink: /agent
+  description: agent tag description
+
 api:
   label: api
   permalink: /api
@@ -38,6 +43,11 @@ java:
   permalink: /java
   description: java tag description
 
+genai:
+  label: GenAI
+  permalink: /genai
+  description: GenAI tag description
+
 license clarity scoring:
   label: license clarity scoring
   permalink: /license clarity scoring
@@ -63,6 +73,16 @@ SCA automation:
   permalink: /SCA automation
   description: SCA automation tag description
 
+rust:
+  label: Rust
+  permalink: /rust
+  description: Rust tag description
+
+scancode:
+  label: ScanCode
+  permalink: /scancode
+  description: ScanCode tag description
+
 vcio:
   label: vcio
   permalink: /vcio

diff --git a/website/docs/getting_started/getting_started-compliance.md b/website/docs/getting_started/getting_started-compliance.md
@@ -39,7 +39,7 @@ reference database for the 2400 licenses detected by ScanCode. It is limited
 to public license texts but not to only those licenses that meet the OSI definition of open source. ScanCode's objective is to identify licenses regardless of whether they are open source, proprietary or in-between. Each 
 license in the LicenseDB is labelled with a License Category, such as 'Copyleft', 'Permissive' or 'Public Domain'.
 
-There are also other [AboutCode projects](/#scancode-projects) that are components or extensions of ScanCode.
+There are also other [AboutCode projects](/projects/) that are components or extensions of ScanCode.
 
 ## Apply license usage policies
 The only feasible way to automate license compliance for third-party software
@@ -109,5 +109,3 @@ includes an obligation to provide instructions and tools to build from source.
 You can use **DejaCode** to track Product packages or components that are subject to source redistribution obligations and their deployment/distribution
 status. **DejaCode** also provides reports to create a source redistribution
 checklist in case you receive a request for source.
-
-
diff --git a/website/docs/getting_started/getting_started-getting-started.md b/website/docs/getting_started/getting_started-getting-started.md
@@ -22,7 +22,7 @@ The use cases are grouped according to 3 major topics:
 
 If you already know which AboutCode projects you are interested in you can
 find project information in the **AboutCode Projects Overview** section of the
-home page of this website. Each project card provides comprehensive project
+Projects page of this website. Each project card provides comprehensive project
 information including:
 - Description
 - Documentation URL
@@ -32,17 +32,17 @@ information including:
 - Platform
 
 The projects are presented in 5 categories:
-- [Applications](/#application-projects): These projects offer an application
+- [Applications](/projects/): These projects offer an application
   that you can install in the cloud or a local environment.
-- [ScanCode](/#scancode-projects): These projects are components or extensions
+- [ScanCode](/projects/): These projects are components or extensions
   of ScanCode.
-- [Package-URL](/#purl-projects): These projects provide tools and data to
+- [Package-URL](/projects/): These projects provide tools and data to
   support the use of the PURL (Package-URL) or VERS (Version Range Specifier)
   specifications.
-- [Inspectors](/#inspectors): AboutCode Inspectors are special-purpose
+- [Inspectors](/projects/): AboutCode Inspectors are special-purpose
   analysis tools. You can run them as a ScanCode Toolkit plugin, as steps in
   a ScanCode.io pipeline, or from the command line.
-- [Libraries](/#libraries): AboutCode libraries are key building blocks for
+- [Libraries](/projects/): AboutCode libraries are key building blocks for
   the AboutCode software and data stack - they have also been incorporated
   into other major FOSS projects and are available for use by anyone.