Merged
49 commits
3026b7d
Ensure logo is transparent
pombredanne May 21, 2026
de74c2e
Use darker navbar for contrast
pombredanne May 21, 2026
0c9acce
Fix typo in project name
pombredanne May 21, 2026
ddac524
Use AboutCode.svg for logo
pombredanne May 21, 2026
5607f29
Use solid orange for favicon
pombredanne May 21, 2026
fbb9abe
Remove unused files
pombredanne May 21, 2026
5ac03ac
Improve home page hero and intro
pombredanne May 21, 2026
1da9846
Remove unused component and page
pombredanne May 21, 2026
9507bfe
Use one color for header and footer
pombredanne May 22, 2026
cb25dd5
Add new home section with features
pombredanne May 22, 2026
13961c8
Add new home section for adopters
pombredanne May 22, 2026
b49248e
Add new home section for standards
pombredanne May 22, 2026
ebe5156
Add new sectiosn to home page
pombredanne May 22, 2026
e7eaaec
Improve navbar
pombredanne May 22, 2026
f95cd44
Improve homepage and standard
pombredanne May 22, 2026
e784f5f
Enable local browsing with file:///
pombredanne May 22, 2026
2e47c5c
Reorganize home page for clarity
pombredanne May 22, 2026
f1af1b6
Improve homepage and standard
pombredanne May 22, 2026
610bd3d
Make supporters more compact
pombredanne May 22, 2026
6bab57f
Streamline README, add license
pombredanne May 22, 2026
3338e4f
Add logos and icons
pombredanne May 22, 2026
fbbfbad
Add new section with pillars and logos
pombredanne May 22, 2026
162b3e0
Add new scrolling banner with adopters
pombredanne May 22, 2026
109ae72
Add new grid with supported ecosystems
pombredanne May 22, 2026
d154792
Shorten labels in project cards
pombredanne May 22, 2026
5e6f812
Improve hero tag, more direct
pombredanne May 22, 2026
e0721ba
Remove unused code
pombredanne May 22, 2026
53f8017
Add new content to homepage
pombredanne May 22, 2026
a69a118
Extend adopter banner
pombredanne May 22, 2026
05a56a8
Improve ecosyetms grid
pombredanne May 22, 2026
83a48e1
Extend supporters list
pombredanne May 22, 2026
ab329af
Make project cards more compact
pombredanne May 22, 2026
4a2cfb7
Add more logos
pombredanne May 22, 2026
c1d6ed6
Add community and dowloand numbers
pombredanne May 22, 2026
95a5d1b
Improve hero and tag line
pombredanne May 22, 2026
7dfc7e4
Merge branch 'main' of https://github.com/aboutcode-org/www.aboutcode…
pombredanne May 22, 2026
ae45fb0
Improve adopter section
pombredanne May 22, 2026
074e284
Cleanup: remove unused files and dupe images
pombredanne May 22, 2026
b9c49e7
Merge logos in a single dir
pombredanne May 22, 2026
9dd9bb0
No orange title, but vlue
pombredanne May 22, 2026
c8640a9
Increase FSFE logo size
pombredanne May 22, 2026
5b299f5
Move projects to its page
pombredanne May 22, 2026
602085f
Add new blog post
pombredanne May 22, 2026
0fba1d2
Fix number of ecosystems
pombredanne May 22, 2026
452e7e9
Minor adjustments
pombredanne May 22, 2026
8335434
Fix formatting and typos
pombredanne May 22, 2026
c684d8b
Format more and adding missing tags
pombredanne May 22, 2026
e42e39c
Remove unused image tag
pombredanne May 22, 2026
17e1ea6
Add truncate tag to post
pombredanne May 22, 2026
395 changes: 395 additions & 0 deletions cc-by-4.0.LICENSE

Large diffs are not rendered by default.

26 changes: 9 additions & 17 deletions website/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Website
# AboutCode.org website

This website is built using [Docusaurus](https://docusaurus.io/), a modern static website generator.
The AboutCode.org website is built using [Docusaurus](https://docusaurus.io/),
a static website generator.

## Installation

@@ -14,28 +15,19 @@ yarn
yarn start
```

This command starts a local development server and opens up a browser window. Most changes are reflected live without having to restart the server.
This command starts a local development server and opens up a browser window.
Most changes are reflected live without having to restart the server.

## Build

```bash
yarn build
```

This command generates static content into the `build` directory and can be served using any static contents hosting service.
This command generates static content into the `build` directory and can be
served using any static contents hosting service.

## Deployment

Using SSH:
## License

```bash
USE_SSH=true yarn deploy
```

Not using SSH:

```bash
GIT_USER=<Your GitHub username> yarn deploy
```

If you are using GitHub pages for hosting, this command is a convenient way to build the website and push to the `gh-pages` branch.
CC-BY-SA-4.0
@@ -0,0 +1,112 @@
---
slug: agentic-scancode-port-case-study
title: An AI agent ported our codebase from Python to Rust
authors: [pombredanne]
tags: [scancode, rust, agent, genai]
hide_table_of_contents: false
---

# An AI agent ported our codebase from Python to Rust

## A case study, not an isolated incident

ScanCode detects licenses, copyrights, package dependencies, vulnerabilities, and more in both source code and binary files. The use cases include license and security compliance and software supply chain management. It is the product of over a decade of careful design, architecture, and testing by an open source community of over 700 contributors, backed by more than 40,000 automated tests covering license detection alone and over 90,000 automated tests overall.

The core module is ScanCode Toolkit, the industry-leading open source code scanning engine. In early 2026, an agentic LLM system ported ScanCode Toolkit from Python to Rust, published the derived results under a name that infringed the ScanCode trademark, stripped copyright and license notices from both ScanCode and the third-party code we vendored and carefully attributed, and started an outreach campaign, all without ever engaging the AboutCode community.

This incident is not isolated. AboutCode and many other open source projects are experiencing a steady influx of AI-generated issues and pull requests that are superficially plausible, templated, often duplicates of existing reports, and almost never grounded in actual use of the software. Maintainers across the open source ecosystem call this AI slop. It consumes human triage time, degrades signal in issue trackers, and erodes the social contract between users, contributors, and maintainers. The porting incident described in this post is the same phenomenon at a larger scale and with higher stakes.

<!-- truncate -->

This article documents what happened technically, what it reveals about the current state of AI-assisted development, and what the open source community needs to do when dealing with AI-generated code.

## What the agent did

The porting was driven by an LLM orchestration harness (using OpenCode and an OpenClaw-vibe coded OpenCode plugin). The agent's approach was straightforward: take a mature, well-tested Python codebase and refactor it in Rust. This is neither an independent rewrite nor merely inspired by ScanCode, as the port claims. It is a mechanical translation, and it is exactly the kind of task LLMs are well suited for.

Why? Code translation is fundamentally like a language translation task, and Large Language Models (LLMs) were originally designed for such language tasks. The extensive ScanCode test suite provided the specification and the guide rails. The agent did not need to understand the algorithms; it only needed to produce code that passed the tests.

This is worth repeating: a comprehensive test suite, decent documentation, and curated datasets are what make automated porting possible. They are also what make a codebase easier to replicate without understanding it.

The agent's initial approach, using an existing Rust license-detection library, failed to match ScanCode's output quality. The agent then did what any translator would do when a loose paraphrase fails: it copied the original more closely. The final port reproduces ScanCode's core algorithms, code organization, and data-driven architecture in Rust, not because the agent understood them, but because it had enough training data and test feedback to converge on equivalent code.

## Performance claims

The Rust port published a "benchmark" that claimed 10x to 100x improvements in performance. Many benchmarks are fundamentally flawed because they are designed to document and assert their own tool's feature or performance superiority to help sell or promote that tool.

Compiled Rust is capable of outperforming interpreted Python. In the published "benchmarks", the Rust port runs faster than ScanCode, but when checked it returns incorrect results, missing detections and skipping files. ScanCode runs the standard ScanCode test suite faster than the Rust port, even though the Rust port covers fewer tests. After applying optimizations similar to those in the Rust port, ScanCode runs as fast as or faster than the Rust port, while maintaining correctness and attribution.

Testing correctness or speed on a subset does not equate to superiority on the whole.

This also demonstrates a core problem of AI-assisted software development. The agents replicated ScanCode's structure well enough to pass some tests, but not well enough to pass all tests. The port applied performance optimizations and caching strategies to appear faster, but sacrificed critical data correctness and completeness.
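
A speed comparison only means something when both tools report the same results on the same inputs. The following is a minimal sketch of that gate, not part of ScanCode: the function name `compare_runs` and the result shape (a mapping of file path to a set of detection keys) are assumptions made for illustration.

```python
def compare_runs(reference, candidate):
    """Compare two scan results before comparing their speed.

    Both arguments are hypothetical result mappings of
    file path -> set of detection keys (for example license keys).
    """
    # Files the reference scanned but the candidate silently skipped.
    skipped = sorted(set(reference) - set(candidate))

    # Detections present in the reference but absent from the candidate.
    missing = {
        path: sorted(reference[path] - candidate[path])
        for path in reference
        if path in candidate and reference[path] - candidate[path]
    }

    return {
        "skipped_files": skipped,
        "missing_detections": missing,
        # Timing numbers are only comparable when nothing was lost.
        "comparable": not skipped and not missing,
    }
```

In this sketch a candidate that skips files or drops detections is flagged as not comparable, however fast it ran.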

## License and copyright failures

ScanCode is Apache-2.0 licensed. The Apache open source license is among the most permissive available, with minimal requirements:

1. Retain the original NOTICE file.
2. Preserve license and copyright headers, including in modified files.
3. Note changes made to modified files.
4. Do not reuse the project name without permission.

The port violated all four requirements. Requirements 1 and 4 were partially corrected after ScanCode maintainers reached out. Requirements 2 and 3 were not.

This impacts more than ScanCode itself and its authors and contributors. ScanCode incorporates code from dozens of other open source projects, each with its own license and copyright. We track all of this meticulously with origin files, per-file copyright headers, and attribution notices. The agent stripped all of it, extending the license violations to every upstream project whose code passed through ScanCode into the port. Note also that the Apache license is not graduated: you either comply or you are not licensed. As of this publication, the port is not compliant.

The irony is not subtle. ScanCode is the product of the collective expertise of the compliance community, and is a tool that the industry uses to detect exactly this kind of license and copyright violation.

## LLMs do not track provenance

The most important technical observation is not about speed or correctness. It is about attribution.

LLMs, by design, do not track provenance. When an agent translates code, it produces output. It does not record that the output derives from a specific file, authored by a specific contributor, under a specific license. That metadata is not part of the model's output representation.

This is a structural problem, not a configuration issue. Agents copying from open source projects will strip attribution by default unless explicit post-processing steps are added to detect and preserve license headers and carefully track the code origin and license. No such steps were taken here. The result is that LLM-assisted porting, as currently practiced, is a plagiarism pipeline with no attribution layer.
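
As an illustration of what such a post-processing step could look like, here is a minimal sketch that checks whether the copyright and SPDX license lines of an original file survive in its ported counterpart. It is a naive, regex-based assumption for illustration only: the names `extract_notices` and `missing_notices` are hypothetical, and real notice detection (as done by ScanCode) is far more involved.

```python
import re

# Naive pattern for copyright statements and SPDX license identifiers.
# This is an illustrative assumption, not ScanCode's detection logic.
NOTICE_PATTERN = re.compile(
    r"(copyright\s+(\(c\)|©|\d{4}).*|SPDX-License-Identifier:\s*\S+)",
    re.IGNORECASE,
)


def extract_notices(text):
    """Return the set of notice strings found in text, whitespace-normalized."""
    notices = set()
    for line in text.splitlines():
        match = NOTICE_PATTERN.search(line)
        if match:
            notices.add(" ".join(match.group(0).split()))
    return notices


def missing_notices(original_text, ported_text):
    """Return notices present in the original but absent from the port."""
    return sorted(extract_notices(original_text) - extract_notices(ported_text))
```

A non-empty result for any original and ported file pair would be a signal to stop and restore attribution before publishing.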

This obfuscation is not always passive. A review of the commit history and structure of the Rust port shows evidence that the agent actively worked to distance the output from its source, either directly or steered through prompting. Variable names were changed, comments were rewritten or stripped, and additional references to ported code lines were added. Based on evidence found in the generated code and the issue tracker, the claim of an "independent rewrite merely inspired by ScanCode" was baked into the project's framing from the start.

Prompting for originality does not produce originality. The agent was following instructions. If you prompt an agent to produce an "original implementation", it will generate whatever surface-level variation is possible while the code underneath remains derived from the original project. It produces the appearance of originality, which is a worse outcome than straightforward copying because it is harder to detect.

The same dynamic occurs at a smaller scale in everyday AI-assisted development. When a developer uses a code generation tool to produce a utility function, a parser, or a data structure, the generated code may closely reproduce implementation patterns from open source code present in the model's training data, without any indication of that lineage. Most developers do not know to check for this. Most tools do not flag it.

This is a warning to both sides of the AI-assisted development discussion. For open source developers, your licenses and your contributors' credits are invisible to the agent. For developers producing AI-generated code, the output your tools produce may carry unresolved obligations to authors whose work was used without attribution.

## This is a case study, not an isolated incident

This episode is not primarily about one project or one set of actors. It is a preview of a pattern that will repeat across the open source ecosystem.

The specific conditions that made ScanCode a target are the same conditions that characterize most successful open source projects: a mature codebase, comprehensive tests, plenty of documentation, lots of curated content, a large downstream user base, an active community, and a well-known and trusted name. The tools and techniques used are becoming routine with AI-generated commits, contributions, and rewrites: agentic orchestration, automated issue crawling, and targeted community outreach.

The human and social dimensions of this incident are as important as, if not more important than, the technical ones. The agent crawled ScanCode's issue tracker and implemented old, outdated, or incorrect features, such as a three-year-old feature request for yum database support, a tool that Fedora deprecated a decade ago and whose repository was archived in March 2026. The agent also reported the development of new features, but these features already exist in other AboutCode open source projects.

This is what automated development without community context produces: technically functional work that is socially and strategically incoherent, creating mostly useless or redundant technical debt and bypassing the ecosystem domain expertise and collective wisdom needed to select which feature to implement.

This is one of the less-discussed costs of AI slop at scale. It is not just noise; it is misdirected effort that consumes real resources on both sides. Maintainers spend time triaging and closing low-quality issues. Automated systems spend compute implementing stale or irrelevant features. Neither produces value. And the accumulated technical debt in cluttered issue trackers, undiscovered license violations, and replicated but misunderstood code falls on human maintainers to clean up.

The community outreach campaign by the Rust port team, contacting users to suggest replacing ScanCode, reflects the same absence of community understanding. The Rust port developers never engaged with ScanCode's public community channels, attended weekly meetings, or chatted with maintainers until that campaign began. An automated system optimizing for adoption does not naturally model the trust relationships and collaborative norms that open source communities are built on.

## Feedback for the community

The path forward is not to litigate this one case. The path forward is to develop best practices.

Benchmark suites and clear performance profiles matter more than ever, both to guide legitimate contributors and to provide ground truth against inflated claims. License compliance tooling, including tools like ScanCode, should be routinely applied to AI-generated contributions. Attribution gaps are not always intentional; they are often invisible without explicitly checking. And we are building more open source tools to help ensure open source authors are properly credited for their work.

To open source maintainers: care for and protect the integrity of your brand, copyright, and license. With the cost of code generation reaching zero, this is your key asset.

To developers using agentic coding tools: license and copyright compliance does not happen automatically.

If your tooling generates code ported from or inspired by existing projects, you are responsible for the output's obligations. Build attribution checking into your workflow, not as an afterthought.

AI and ML practitioners and enthusiasts, please understand that the open source projects you train on, benchmark against, and port from are maintained by communities. These communities are composed of people, and have norms, governance structures, contribution pathways, and years of accumulated context and domain expertise.

Participating in these communities, before, during, and after building on their work, is not a formality. It is how the ecosystem that enables your work is sustained. The performance gains available in ScanCode come from that community's accumulated expertise.

## Onwards and upwards: Taking the high road

The open source community's response to AI-generated code cannot be purely defensive. We need to develop shared norms for attribution in agent-assisted development, better tooling for detecting and flagging provenance gaps, and clearer frameworks for what constitutes a derivative work versus an independent implementation.

We also need practitioners on both sides of this conversation, open source maintainers and AI engineers, to work together to build those frameworks. That is harder than forking and porting a repository. It is also the only approach that scales.

Building those frameworks begins with the next conversation, and that conversation is open to everyone.

P.S. Detecting this kind of AI-assisted obfuscated porting is exactly the problem that motivated our recent AI-Generated Code Search project, now integrated into the AboutCode MatchCode code matching engine. MatchCode is designed to identify code that has been structurally reproduced across language boundaries. It matches not just token-level similarity but also algorithmic and architectural similarities using fingerprints. If you are a maintainer concerned that your project may have been ported or replicated without attribution, it is worth joining us so we can assist you in running your codebase through MatchCode as a baseline. The tool exists precisely because provenance questions like this one are becoming more common. This is not a silver bullet, but it can help.
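
To give a rough intuition for fingerprint-based structural matching, here is a toy sketch. It is explicitly not MatchCode's algorithm: it swaps in a simple, generic technique (identifier-normalized token shingles compared with Jaccard similarity), and every name in it (`structural_tokens`, `fingerprint`, `similarity`, the tiny `KEYWORDS` set) is an assumption made for illustration.

```python
import hashlib
import re

# Toy tokenizer: identifiers, integers, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
# Tiny illustrative keyword set; a real tool would be language-aware.
KEYWORDS = {"if", "else", "for", "while", "return"}


def structural_tokens(source):
    """Replace identifiers and numbers with placeholders, keep structure."""
    tokens = []
    for tok in TOKEN.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif re.match(r"[A-Za-z_]", tok):
            tokens.append("ID")
        elif tok.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def fingerprint(source, size=4):
    """Hash every run of `size` consecutive structural tokens (shingles)."""
    tokens = structural_tokens(source)
    count = max(len(tokens) - size + 1, 0)
    shingles = (" ".join(tokens[i:i + size]) for i in range(count))
    return {hashlib.sha1(s.encode("utf-8")).hexdigest() for s in shingles}


def similarity(a, b):
    """Jaccard similarity of two fingerprints, from 0.0 to 1.0."""
    fa, fb = fingerprint(a), fingerprint(b)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Because identifiers are normalized away, renaming variables alone does not change the fingerprint in this sketch, which is the property that makes this family of techniques useful against surface-level obfuscation.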
Binary file not shown.
Binary file not shown.
Binary file removed website/blog/purls-of-wisdom/atom_grey-1024x683.png
Binary file not shown.
Binary file removed website/blog/purls-of-wisdom/standards_2x.png
Binary file not shown.
20 changes: 20 additions & 0 deletions website/blog/tags.yml
@@ -3,6 +3,11 @@ advisories:
  permalink: /advisories
  description: advisories tag description

agent:
  label: agent
  permalink: /agent
  description: agent tag description

api:
  label: api
  permalink: /api
@@ -38,6 +43,11 @@ java:
  permalink: /java
  description: java tag description

genai:
  label: GenAI
  permalink: /genai
  description: GenAI tag description

license clarity scoring:
  label: license clarity scoring
  permalink: /license clarity scoring
@@ -63,6 +73,16 @@ SCA automation:
  permalink: /SCA automation
  description: SCA automation tag description

rust:
  label: Rust
  permalink: /rust
  description: Rust tag description

scancode:
  label: ScanCode
  permalink: /scancode
  description: ScanCode tag description

vcio:
  label: vcio
  permalink: /vcio
4 changes: 1 addition & 3 deletions website/docs/getting_started/getting_started-compliance.md
@@ -39,7 +39,7 @@ reference database for the 2400 licenses detected by ScanCode. It is limited
to public license texts but not to only those licenses that meet the OSI definition of open source. ScanCode's objective is to identify licenses regardless of whether they are open source, proprietary or in-between. Each
license in the LicenseDB is labelled with a License Category, such as 'Copyleft', 'Permissive' or 'Public Domain'.

There are also other [AboutCode projects](/#scancode-projects) that are components or extensions of ScanCode.
There are also other [AboutCode projects](/projects/) that are components or extensions of ScanCode.

## Apply license usage policies
The only feasible way to automate license compliance for third-party software
@@ -109,5 +109,3 @@ includes an obligation to provide instructions and tools to build from source.
You can use **DejaCode** to track Product packages or components that are subject to source redistribution obligations and their deployment/distribution
status. **DejaCode** also provides reports to create a source redistribution
checklist in case you receive a request for source.


12 changes: 6 additions & 6 deletions website/docs/getting_started/getting_started-getting-started.md
@@ -22,7 +22,7 @@ The use cases are grouped according to 3 major topics:

If you already know which AboutCode projects you are interested in you can
find project information in the **AboutCode Projects Overview** section of the
home page of this website. Each project card provides comprehensive project
Projects page of this website. Each project card provides comprehensive project
information including:
- Description
- Documentation URL
@@ -32,17 +32,17 @@ information including:
- Platform

The projects are presented in 5 categories:
- [Applications](/#application-projects): These projects offer an application
- [Applications](/projects/): These projects offer an application
that you can install in the cloud or a local environment.
- [ScanCode](/#scancode-projects): These projects are components or extensions
- [ScanCode](/projects/): These projects are components or extensions
of ScanCode.
- [Package-URL](/#purl-projects): These projects provide tools and data to
- [Package-URL](/projects/): These projects provide tools and data to
support the use of the PURL (Package-URL) or VERS (Version Range Specifier)
specifications.
- [Inspectors](/#inspectors): AboutCode Inspectors are special-purpose
- [Inspectors](/projects/): AboutCode Inspectors are special-purpose
analysis tools. You can run them as a ScanCode Toolkit plugin, as steps in
a ScanCode.io pipeline, or from the command line.
- [Libraries](/#libraries): AboutCode libraries are key building blocks for
- [Libraries](/projects/): AboutCode libraries are key building blocks for
the AboutCode software and data stack - they have also been incorporated
into other major FOSS projects and are available for use by anyone.
