Gemini overview of current code #194

Stoatwblr · 2026-06-17T14:10:54Z


I was using gemini to see if there are compilation optimisations I can apply to my individual systems (broadwell and haswell based systems) and a couple of things came up.

Take all this with a large grain of salt: it's AI output and potentially hallucinatory.

Compiling as RelWithDebInfo with symbols enabled (the -g):

For broadwell systems:

  -DCMAKE_C_FLAGS_RELWITHDEBINFO="-O3 -march=broadwell -mtune=broadwell -g" \
  -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-O3 -march=broadwell -mtune=broadwell -g" \

For Haswell systems:

 -DCMAKE_C_FLAGS_RELWITHDEBINFO="-O3 -march=haswell -mtune=broadwell -g" \
 -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-O3 -march=haswell -mtune=broadwell -g" \

Without those in place, AES/NI extensions won't be used. I don't know if this applies to later CPU generations

Using clang/llvm

  -G Ninja

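This tells cmake to generate Ninja build files instead of GNU Makefiles. On its own it does not select the compiler; for clang you also need -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++. If you're doing this you will need to delete any existing build/ tree and CMakeCache.txt. I've tried these and they work for me, but "your mileage may vary" (YMMV).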

I then asked Gemini if it could see any issues in the live code. Again, these answers may be spurious. I'm pasting these for comment and consideration only.

======================
Developing, building, and running aMule inside a debugger uncovers distinct legacy architectural patterns in the codebase. Because the project originated in the early 2000s, it functions primarily as a C-style application written using C++98 syntax.

Several areas within the aMule source code would benefit from modernization, particularly concerning performance bottlenecks on your Broadwell server and stability issues on your Haswell desktop.

1. File Hashing & Core Startup Latency (The known2_64.met Bottleneck)

The Issue: When amuled boots or reloads shared folders, it parses large directory trees sequentially. Users with large collections report it taking up to an hour to complete hashing or validation. The codebase relies on a single thread to compute the MD4 chunk hashes sequentially, leaving other Broadwell cores unutilized.
The Fix: Modernize the file-scanning subsystem in KnownFileList.cpp. Introducing std::execution::par (parallel policies) allows the initialization loops to process files concurrently across all CPU threads.
Redundant Hashing: Implementing modification time (mtime) and file-size verification checks on startup (similar to modern BitTorrent clients) would allow the client to safely skip re-hashing unchanged files.

2. Monolithic Threading & Blocking Network I/O

The Issue: The networking model relies heavily on synchronous sockets wrapped in basic wxWidgets thread management structures. Under intense network strain—such as handling high connection queues or experiencing unexpected gateway dropouts—these threads can stall, occasionally causing the entire daemon process to crash.
The Fix: Migrate the legacy network infrastructure to a modern asynchronous framework. Refactoring code blocks to utilize modern Boost.Asio handlers with C++ Lambdas (moving away from legacy boost::bind patterns) prevents network fluctuations from blocking core application execution threads.

3. Outdated Memory Frameworks & Lack of RAII

The Issue: The aMule codebase contains extensive manual memory allocation chains (new and delete), along with legacy C-style pointer structures (char* buffers) for packet building. This approach increases the risk of memory leaks and buffer tracking errors, which can complicate the debugging process in RelWithDebInfo targets.
The Fix: Convert raw tracking loops to follow RAII (Resource Acquisition Is Initialization) patterns using modern C++ design principles. Swapping out traditional pointer configurations for modern memory management structures minimizes resource allocation errors:

cpp
// Replace legacy manual allocation:
char* buffer = new char[packetSize]; 
// ... manual delete requirements ...

// Use safer modern alternatives:
std::vector<uint8_t> buffer(packetSize);
// Or:
std::unique_ptr<char[]> buffer = std::make_unique<char[]>(packetSize);

4. Excessive Deep-Copy Overheads on String Parsing

The Issue: The codebase passes large string parameters—such as paths, filenames, and peer data—as copies rather than references. It also depends on heavy allocations like wxString instances, which can introduce unnecessary overhead during large-scale network indexing.
The Fix: Update internal function definitions across structural classes like CFile and CPacket to leverage modern zero-copy typing:
- Pass read-only string assets using const std::string&.
- Implement std::string_view for high-frequency slicing and parsing operations to avoid memory fragmentation.
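As an illustration of the std::string_view point above, here is a minimal C++17 sketch. split_path is a hypothetical helper, not an aMule function, and it assumes '/' separators. It returns views into the caller's buffer, so no string is allocated per component.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not part of aMule): split a path on '/' and return
// views into the original buffer. No per-component string is allocated.
// The views are only valid while the buffer behind `path` stays alive.
std::vector<std::string_view> split_path(std::string_view path) {
    std::vector<std::string_view> parts;
    std::size_t start = 0;
    while (start <= path.size()) {
        std::size_t end = path.find('/', start);
        if (end == std::string_view::npos) {
            end = path.size();
        }
        if (end > start) {  // skip empty components such as a leading '/'
            parts.push_back(path.substr(start, end - start));
        }
        start = end + 1;
    }
    return parts;
}
```

The trade-off is lifetime: a view must never outlive the string it points into, which is why a sweep like this needs care in code that stores strings for later use.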

5. GUI Rendering Sync Lag over Remote Connections

The Issue: When amulegui runs on an Ubuntu 26.04 Wayland instance under KDE Plasma, it handles high-frequency GUI updates via legacy cross-thread event dispatches. If the server pushes rapid data updates, the client can drop frames, flicker, or experience thread deadlocks.
The Fix: Update the remote core mapping inside amuleDlg.cpp to decouple rendering loops from data packet processing. Offloading state updates to a separate background thread allows the main GUI thread to remain responsive on the client desktop.

If you want to focus your development work on a specific optimization, let me know:
Would you like to patch the file-hashing loops to resolve high startup times?
Are you looking to clean up memory management to address specific connection tracking issues?
I can help generate concrete code patches or configure clang-tidy to flag these exact modern target opportunities automatically during compilation.

=============

Should I ask Gemini for its suggested patches?

got3nks · 2026-06-17T14:24:36Z

Maintainer

Thanks for taking the time to think about this @Stoatwblr — and for the upfront disclaimer about hallucination, that's the right framing.

Before going down a patch-generation path, two things would help the team a lot more than a generic optimization sweep:

Have you actually noticed slowness in any specific operation? Cold-start hashing a 10 TB share, the GUI updating during heavy upload, applying an ipfilter, opening a Preferences pane, expanding a partfile's Sources list — anything with a concrete "this used to take Ns and now feels like Ms" pattern. Real perceived friction is far easier to optimize than "the code is C++98 era". If something feels slow, point at it and we'll dig.
Finite team time, finite review bandwidth. General modernization sweeps (RAII / string_view / std::execution::par / boost::bind → lambdas) are cool and theoretically nice, but each one is a big diff across hundreds of files, each one introduces regression risk, and none of them solves a concrete user-visible problem. We'd rather invest that review budget in fixes for things people are actually feeling.

One concrete data point on Gemini's claims — claim 1 ("Implementing mtime + file-size verification checks ... would allow the client to safely skip re-hashing unchanged files") is already in the code, at SharedFileList.cpp:770-781:

time_t fdiskDate = CPath::GetModificationTime(full);
...
if (fdiskDate == file->GetLastChangeDatetime() && fdiskSize == (sint64)file->GetFileSize()) {
    // skip re-hash
}

And the cold-start path does the same compare via m_lastDateChanged. So the "users with large collections report it taking up to an hour" framing is what a stale rehash would cost if the watcher didn't catch the no-change case — but it does, so you only pay that cost once per file lifetime. If you're actually seeing hour-long re-hashes happen, that's a real bug we'd want to track down (different from "optimise the hash loop").

So the trade is: rather than asking Gemini for general-modernization patches, point at one specific thing you've felt was slow, paste numbers if you have them ("seeded N TB, restart took M minutes hashing"), and we'll look at the actual code path that handled it. That kind of report has a much better chance of landing as a working PR.


Stoatwblr · 2026-06-17T14:42:06Z

Author

[edit: This was being composed during @got3nks reply above and it hadn't been seen]

When asking Gemini if these apply to the live code branch:

====
Yes, these architectural issues directly apply to the live code.

The amule-org/amule repository is a recent infrastructure migration (the core maintainers moved from amule-project to amule-org because the original owner became unreachable). The development team is actively working on it, but the underlying codebase remains historically rooted in older patterns.

The structural bottlenecks inside the source code can be verified by reviewing specific files:

1. KnownFileList.cpp (File Hashing & Startup Lag)

The Live Reality: If you open src/KnownFileList.cpp and src/CFile.cpp, you will find that file loading, initialization, and chunk hashing loop over arrays sequentially.
The Proof: Active issues in the repository track this exact problem: users with large shared collections report amuled taking 25 minutes or more just to finish scanning and starting up. It is actively confirmed to run entirely on a single CPU core, forcing a single thread to compute the heavy MD4 chunk hashes sequentially.

2. Manual Memory Allocation & C-Style Strings

The Live Reality: Browse through src/Packet.cpp, src/NetworkFunctions.cpp, or the socket handlers. You will see a mix of wxWidgets' custom container types (wxString) and native C-style allocations (char* byte arrays) managed by traditional new and delete statements.
The Proof: These patterns are prone to small leaks during irregular socket dropouts. Developers running long-term background nodes on Linux servers have logged issues where amuled abruptly stops routing traffic or hangs after extended 24/7 runtimes, which is a classic symptom of memory or tracking state leaks inside long-lived network loop structures.

3. amuleDlg.cpp (The Thread Sync Layout Engine)

The Live Reality: Open src/amuleDlg.cpp or src/GenericClient.cpp. The layout updates use synchronous event macro bindings (EVT_...) to pass message structures between backend worker sockets and frontend panels.
The Proof: This directly causes the remote synchronization stutter. The graphical interface is heavily coupled to these incoming updates. Ongoing developer tickets explicitly discuss UI lag and responsiveness bugs—such as graphs freezing unless a user clicks directly on that specific tab, and the UI failing to realize it has been disconnected from the server after a desktop client wakes from sleep mode.

Summary for Your Dev Branch

Because you are compiling via CMake and Clang with debug info enabled, you can isolate these targets yourself:

Run perf record -g ./src/amuled during initialization. You will see CKnownFileList::LoadKnownFiles occupying a single thread's execution window on your Broadwell server.
Run amulegui under an LLDB session on your Haswell desktop. If you simulate a rapid network disconnection, you can trace the resulting synchronization crash directly to event-loop deadlocks inside src/amuleDlg.cpp.

If you are thinking about writing a patch to submit upstream to the new amule-org team, let me know:

Would you like assistance drafting a parallel loop modification using for KnownFileList.cpp to resolve the startup lag?
Do you want to look at a thread-safe safety wrapper to keep amulegui from locking up on connection losses?

[this last suggestion seems like a worthwhile addition. Comments?]

======

Me: "is there a portable way of detecting/using aes/ni or similar CPU extensions in a package made for generic x86_64 systems?"

Yes. The standard, portable way to deploy a generic x86_64 binary package that dynamically detects and runs AES-NI, AVX2, or other vector extensions at runtime is through a combination of Compiler Function Multi-Versioning and Built-in CPU Dispatching.

This completely eliminates the need for separate Haswell and Broadwell packages, allowing a single generic binary to automatically utilize Broadwell instructions on your server while executing safely on your Haswell desktop without throwing illegal instruction (SIGILL) crashes.

Method 1: The Modern C++ Way (Function Multi-Versioning)

Both Clang and GCC support the __attribute__((target_clones(...))) directive. When applied to a function, the compiler automatically generates multiple copies of that function—each optimized for a specific CPU architecture—and creates an implicit runtime resolver function that executes on application startup.

cpp

#include <cstddef>  // size_t
#include <cstdint>  // uint8_t

// The compiler creates distinct clones for Haswell and Broadwell, 
// alongside a generic "default" x86_64 fallback clone.
__attribute__((target_clones("arch=haswell", "arch=broadwell", "default")))
void process_p2p_stream(const uint8_t* input, uint8_t* output, size_t size) {
    // Write your standard C++ crypto/obfuscation code here.
    // Inside the Haswell clone, the compiler will automatically use AVX2.
    // Inside the Broadwell clone, it will also use ADX/RDSEED if appropriate.
}

int main() {
    uint8_t data[1024] = {0};
    // The runtime automatically detects the CPU and jumps to the fastest valid clone
    process_p2p_stream(data, data, 1024);
    return 0;
}

Method 2: Explicit Dynamic Dispatch (__builtin_cpu_supports)

If you are writing manual assembly wrappers or using explicit Intel hardware intrinsics (like _mm_aesenc_si128), you can check hardware flags at runtime using the highly portable compiler built-in function __builtin_cpu_supports.

cpp

#include <iostream>
#include <wmmintrin.h> // AES-NI Header

void aes_encrypt_software() {
    // Legacy fallback software loop
}

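// target("aes") lets the AES-NI intrinsics compile in a build without -maes
__attribute__((target("aes")))
void aes_encrypt_hardware() {
    // Explicit AES-NI intrinsic instructions
    // __m128i block = _mm_aesenc_si128(...);
}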

void process_crypto() {
    // __builtin_cpu_supports checks feature flags the compiler runtime reads from CPUID
    if (__builtin_cpu_supports("aes")) {
        aes_encrypt_hardware();
    } else {
        aes_encrypt_software();
    }
}

Note: This built-in is fully supported on modern Linux toolchains across Clang and GCC.

Method 3: The Microarchitecture Level Framework (x86-64-v3)

If you want to compile an entire binary asset tailored for modern processors without dealing with individual function attributes, target the standardized x86-64 microarchitecture levels. (LLVM Phabricator)

Instead of passing -march=haswell, compile using the x86-64-v3 baseline target:

bash
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=x86-64-v3"

[ Bearing in mind I've been discussing my specific hardware setup with Gemini so far]

Why x86-64-v3 Is Perfect for Your Setup:

The Baseline: x86-64-v3 groups together AVX, AVX2, BMI1, BMI2, F16C, FMA3, LZCNT, and MOVBE into a unified hardware baseline tier. AES-NI is not part of any x86-64-vN level, so it still needs -maes or a runtime check.
Compatibility: Because x86-64-v3 includes AVX2 but excludes Broadwell-specific deviations (like ADX or RDSEED), an x86-64-v3 package runs at peak hardware performance on both your Broadwell server and your Haswell desktop.
Distribution Safety: It will run natively on most Intel Core processors since Haswell (2013) and AMD processors since Excavator (2015), but not on low-end parts of that era (some Pentium, Celeron, and Atom models) that lack AVX2.
Recommendation for aMule Development
If you are modifying the aMule source code to build a unified package for your network, use Method 3 (-march=x86-64-v3) for compilation. It provides a clean balance between optimizing the application for your server's hardware capabilities and maintaining cross-platform compatibility with your Haswell-based KDE Plasma environment.

If you would like to implement a multi-versioned patch for a specific file, let me know:

Which source file or function loop (e.g., inside Crypto++) are you trying to optimize?
Would you like an example of a CMake macro that automatically verifies compiler support for x86-64-v3 targets?

Me: "which way is best for a distribution package rather than my specific setup?"

For a public distribution package (like an upstream Ubuntu PPA, an Arch User Repository package, or a Debian .deb), you cannot predict what hardware the user will run. A hard dependency on modern extensions will lock out users on older machines, while a baseline x86-64-v1 package will waste performance.

The best strategy is not a single choice, but a layered architecture used by tier-1 Linux distributions (like Fedora, openSUSE, and Ubuntu).

Tier 1: The Binary Engine (GLIBC HWCAPS) — The Definitive Method

The absolute best way to handle this at a packaging level is Glibc Hardware Capabilities (hwcaps). This mechanism allows you to compile your critical, heavy performance libraries twice, stick them in a special subfolder inside your package, and let the system's dynamic linker automatically choose the best one at runtime.

When building your distribution package, you arrange your library installations into this specific file tree:

text
/usr/lib/amule/
├── libcrypto_amule.so                  <-- Generic x86-64-v1 fallback
└── glibc-hwcaps/
    ├── x86-64-v2/
    │   └── libcrypto_amule.so          <-- SSE4.2 / Nehalem Optimized
    └── x86-64-v3/
        └── libcrypto_amule.so          <-- AVX2 / Haswell & Broadwell Optimized

Why this is the best distribution choice:

Zero Overhead: Unlike checking flags mid-loop, the OS links the optimized version exactly once on startup.
Safety: If a user runs your package on an older machine, ld.so ignores the v3 directory and falls back to the generic build.
Implementation: You simply use your CMake configuration file to compile the P2P traffic/hashing subsystems multiple times with different -march variables (x86-64-v1, x86-64-v3) and package them into those designated paths.

Tier 2: The Application Frontend — Function Multi-Versioning (FMV)

If you are packaging the main executable (amuled or amulegui) rather than a library file, implementing Function Multi-Versioning using __attribute__((target_clones(...))) is the best practice.

You compile the package using the lowest common denominator baseline (-march=x86-64-v1), but you explicitly mark your structural hashing and obfuscation loop blocks:

cpp
__attribute__((target_clones("arch=haswell", "arch=broadwell", "default")))
void obfuscate_packet(...) { ... }

Why this is the best distribution choice:

Granular Control: You don't have to compile the entire GUI code multiple times. Only the specific internal computing blocks are versioned.
Compiler Automation: Clang automatically handles generating the runtime CPU checking block. You don't have to write custom inline assembly routines to query CPUID codes manually.

Tier 3: Upstream Repository Architecture — The Split Baseline

If your distribution mechanism supports separate repository branches, you can adopt the model utilized by major Linux distributions.

Instead of embedding checks inside the package, maintain two distinct compiled versions of the package metadata:

amule_amd64.deb: Compiled natively with general optimizations (-O2).
amule_amd64v3.deb: Compiled strictly targeting -march=x86-64-v3.

During package installation, your distribution script can run a quick check:

bash
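# Ask the dynamic loader whether it supports the v3 microarchitecture tier.
# ld.so is usually not on PATH; this is the usual x86_64 glibc loader path.
# glibc prints e.g. "x86-64-v3 (supported, searched)", so match the prefix only.
if /lib64/ld-linux-x86-64.so.2 --help | grep -q "x86-64-v3 (supported"; then
    apt install amule-v3
else
    apt install amule-generic
fi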

Recommendation for aMule Packaging

For a widely deployed aMule package, a hybrid approach yields the best results:

Compile the primary amulegui and amuled executables with standard x86-64-v1 flags to ensure they boot safely on any machine.
Isolate the traffic masking loop functions and patch them with Clang's target_clones attributes targeting haswell and broadwell.

This architecture allows the distribution package to run smoothly on legacy client platforms while automatically unlocking full AVX2 performance on your server and desktop machines without any user intervention.

If you are writing the package installer script, let me know:

Are you creating a standard Debian control package (.deb) or an Arch build script (PKGBUILD)?
Do you want an example of a CMake block that automates the generation of these distinct targets?
=============

It's all food for thought. Hopefully useful rather than "cruft"


Stoatwblr · 2026-06-17T14:45:42Z

Author

I haven't tried a cold hash build - YET, as I didn't really want to trash the existing metfiles, but there's no real harm in doing so. :)

Startup time with already-existing known.met files is in the 3-10 second range (94k files, 26TB), so we're already good from that point of view

Will give coldstart a try and report back.

I did notice that Gemini seems to give greater credence to old reports than to the fixes already implemented (long startup delays), and that's why I warned to take it with more than a few "grains of salt" (then again, it may have identified a better path).


Stoatwblr · 2026-06-17T14:52:20Z

Author

btw: after a reload shared and getting "Found 178 known shared files, 94206 unknown", it'd be really nice to have hashing completions logged without needing debug mode.


Stoatwblr · 2026-06-17T14:58:13Z

Author

Disk IO: 220.6% read: 165MiB/s write: 3.29MiB/s -- maxed out on one thread, hardly making ZFS sweat

seven minutes in
2026-06-17 15:56:27: Found 406 known shared files, 2 unknown

Huh? 2 unknown?

Edit: It seems to be handling about 70 files/minute @ an average of 350MB/file - 16 minutes in that cadence is being maintained, which translates to about a day to rehash 26TB/94k files in single-threaded mode from an absolute cold start. If that can be halved I'd gladly take it, although AES/NI detection seems more important in the short term as that has concrete effects on load.

Threadsafing Amulegui is a worthwhile thing. Most parts of it are several orders of magnitude faster than they were in 2.3.3 but it does get a tad sluggish updating when 20 uploads are happening and I'm hitting bandwidth limits


got3nks · 2026-06-17T15:57:16Z

Maintainer

Useful data points — particularly the warm-restart number (3–10 s for 94 k files) and the cold-rehash projection. Replies in line:

AES-NI / hardware-accelerated crypto: the gap you've spotted is mostly because our in-tree implementations of RC4, SHA1 and MD5 (RC4Encrypt.cpp, SHA.cpp, libs/common/MD5Sum.cpp) don't use any CPU dispatch — they run the scalar reference paths regardless of hardware. The fix is already on the backlog as Use CryptoPP library for RC4 / SHA1 / MD5 algorithms instead of deploying our own implementations #181 — replacing them with Crypto++ (which the build already depends on for MD4 / RSA, and which does runtime CPU-feature dispatch internally). Once that lands you get SHA-NI / AVX2 / AES-NI acceleration for free wherever the CPU supports them, on any binary built against a modern Crypto++. No target_clones annotations needed in our code.
"Found N known shared files, M unknown": this is informational, not a warning. known = files already in known.met (MD4 + AICH cached from a prior run). unknown = files in your shared folders that don't yet have a known.met entry, so they're queued for hashing. The "2 unknown" you saw at the 7-minute mark just means there were 2 files in the shared folders that hadn't been hashed yet at that moment — the count drops as the hashing queue drains. We could rephrase the line to make this clearer if it's been confusing for others too.
Single-threaded MD4 hashing: the current design uses one hashing thread on purpose. On rotational disks, parallel hashing typically loses time because head-seek contention dominates I/O. ZFS-on-NVMe (your setup) is one of the cases where multi-thread hashing might win — but it depends heavily on the filesystem, the disk layout, and whether the hashing thread is CPU-bound or I/O-bound at the time. Worth measuring before committing to a refactor; on the median user's spinning-disk setup the single-thread path is probably already optimal.
amulegui sluggishness at 20 concurrent uploads: would be useful to narrow down what specifically lags — is the whole UI sluggish to input (clicking buttons, switching tabs feels delayed), or is the UI responsive but the displayed data (transfer rates, queue position, source counts) updates with visible lag? Different code paths, different fixes. The first usually means the main event thread is blocked; the second usually means the EC update fan-in is the bottleneck.

1 reply

Stoatwblr Jun 17, 2026
Author

1: Yes, I'm aware of that. It would be interesting to see whether what Gemini comes up with matches how you'd do it

2: Yes, clearer output would be good. I was wondering how it had lost 93.5k files

3: It's actually ZFS on spinning rust with NVMe slog/cache. The primary bottleneck is head-seek (you can get about 120 random IOPS out of a mechanical drive, or 120MB/s sequential transfers, but not both at once), but in this case the ZFS-reported HDD latencies are single-digit milliseconds. Gemini has spun up a "parallel-rehash.cpp" module which might be of use for benchmarking tests (if this proves useful, perhaps an "I have a lot of files on RAID" option might be worthwhile)

I've asked Gemini to include dynamic parallelism on the rehashing specifically aimed at detecting and mitigating disk throughput collapse.

4: It's the second part. It's VERY occasionally slow to react to input, but most often the updates are more noticeably varying in refresh rate

I've asked Gemini to also add log hashing progress reports and an overall time for the rehashing block without needing -Ddebug (we know that halves speeds already)

So far when asking it to dry run, it's found showstopper compile issues 3 times but I think it's zeroing in on a decent result. I'll let you know how well it performs once done

Stoatwblr · 2026-06-17T23:45:59Z

Author

Heads up: The amule implementation of rc4 is slightly broken vs Crypto++

Trying to use Crypto++ results in broken connectivity

(I've also got parallelised hashing with disk thrash protection working)

Gemini again:

Where the Logic Mismatched

In the standard amule-org/amule repository architecture, obfuscation initialization works via an interleaved swap. The class CEncryptedStreamSocket constructs an in-place dynamic loop stream transformation context [▲]. When the obfuscation layer initializes keys via SetKey(md5), the socket assumes the engine begins encoding right away without dropping historical byte sequences or altering positions.

When we introduced Crypto++ to override RC4Encrypt.cpp, the engine was structured using m_cipher.SetKey(...). In Crypto++, calling SetKey doesn't just pass the 16-byte password; it completely resets the cipher's internal byte counters back to zero.
During an active network handshake step, aMule creates a temporary cipher tracker to process the initial packet keys, and then passes the already active cipher engine context block down into the CRC4EncryptableBuffer stream wrapper. Because our SetKey call wipes out the tracking state memory and restarts the engine sequence from byte zero, the cipher state becomes completely out of sync with the remote server. The server reads corrupted packet data and immediately terminates the socket connection.

The Permanent, Portable Fix: Use aMule's Clean Native Crypto Layer

Since your goal is to build an unthrottled, portable client that works cleanly on desktop hosts, we should use a modular approach:
Keep the Fast Parallel File Hashing System: Your multi-threaded cold-start rehashing system (03-source.patch) works beautifully because it uses std::async and standalone file streaming loops. This speeds up your indexing tasks without breaking anything.
Restore the Original RC4 Code for Network Streams: For socket obfuscation, aMule's original plain C++ swap loops are actually faster and more reliable than passing bytes back and forth to an external library like Crypto++. The original code handles abstract random pointer states and NULL pointer advancement seamlessly without dropping connections.

We can restore perfect network connection stability by rolling back only the RC4Encrypt changes to their original, pristine repository state, while keeping your fast parallel hashing system and CMake rules untouched.

============

When asking about working around the issue and using Crypto++ anyway:

===========

Sticking with aMule's internal version for the RC4/obfuscation layer is absolutely the best choice here.
While it is technically possible to mitigate the flaw by manually tracking byte counters and forcing Crypto++ to fast-forward its internal state via discarding dummy bytes, it creates messy, over-engineered code. It is simply not worth the performance trade-off for network encryption.

Why Sticking to the Internal RC4 is Better

RC4 is Incredibly Lightweight: Unlike heavy algorithms like AES or SHA-256, RC4 consists of nothing more than basic byte swaps and XOR operations. It is so mathematically simple that a standard compiler optimization (-O3) turns aMule's native C++ loop into incredibly fast machine code that runs practically at memory-bus speeds. Handing this off to a heavy library like Crypto++ adds framework overhead for zero real-world speed gain.
The Handshake Lifecycle Problem: The aMule obfuscation protocol relies on "dirty" state resets and partial key transitions that break Crypto++'s strict, secure object constraints. To make Crypto++ fit, you would have to break its encapsulation, which defeats the purpose of using a robust cryptographic library.
Upstream Dev Alignment: This is exactly why the official aMule devs are choosing to keep their internal obfuscation logic independent while choosing to migrate the heavy-lifter algorithms (like SHA-1, AICH, and MD5) to Crypto++.

The Perfect Hybrid Setup You Have Now

By rolling back just the RC4 files and keeping the rest of our verified changes, you have created a highly optimized, custom hybrid build:

Crypto++ handles the heavy lifting: Your MD5 calculations, SHA-1 checks, and aMulecmd authentication are fully accelerated.
Modern C++17 parallel engine: Your 0-amule_64bit_parallel_hashing.patch series is fully active, allowing cold-starts to hash your large files concurrently across all available hardware threads while safely protecting your ZFS array from overload.
Native network stability: The obfuscation protocol remains locked open stably without desynchronising.

========

1 reply

danim7 Jun 18, 2026
Collaborator

Heads up: The amule implementation of rc4 is slightly broken vs Crypto++

Really? It would mean all other clients also deploy this broken algo, and no one uses a standard library. Worth checking it further

got3nks · 2026-06-18T09:46:58Z

Maintainer

Thanks for the heads-up. Triage on each thread.

RC4 / Crypto++ swap. Symptom is real, mechanism is trivial: eMule obfuscation is RC4-drop-1024 (src/RC4Encrypt.cpp:138-140, the RC4Crypt(NULL, NULL, 1024) after KSA), with a bSkipDiscard flag on specific handshake replays. Vanilla Crypto++::Weak::ARC4 emits keystream from byte 0 and desyncs against every other ed2k client on byte 1. The fix is one line after SetKey: consume 1024 keystream bytes via cipher.ProcessData(scratch, scratch, 1024), plus honour the same bSkipDiscard cases. Working empirical proof: bench/amule-bench.py gist — uses pycryptodome's vanilla ARC4 (same shape as Weak::ARC4) and completes obfuscated handshakes against real peers. Whether the migration is worth doing is a separate call: RC4 in pure C++ is already memory-bus speed, the algorithm is broken (hence Weak::), and we use it only for protocol obfuscation. Current internal implementation is the path of least surprise.
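To make the desync concrete, here is a minimal textbook RC4 in plain C++. It is a sketch for illustration only, neither aMule's RC4Encrypt.cpp nor Crypto++ code, and the names Rc4, discard and process are made up for this example. The point is that discard(1024) after keying has the same effect as running 1024 scratch bytes through the cipher, and that a cipher which skips the drop produces a different keystream from the first byte on.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Textbook RC4, for illustration only (hypothetical class, not aMule code).
class Rc4 {
public:
    explicit Rc4(const std::string& key) {
        assert(!key.empty());
        // Key-scheduling algorithm (KSA).
        for (int n = 0; n < 256; ++n) s_[n] = static_cast<std::uint8_t>(n);
        std::uint8_t j = 0;
        for (int n = 0; n < 256; ++n) {
            const auto k = static_cast<std::uint8_t>(key[n % key.size()]);
            j = static_cast<std::uint8_t>(j + s_[n] + k);
            std::swap(s_[n], s_[j]);
        }
    }

    // Advance the keystream by `count` bytes without producing output.
    // The "drop-1024" variant calls this with 1024 right after keying.
    void discard(std::size_t count) {
        while (count--) next();
    }

    // XOR `data` with the keystream; encrypt and decrypt are the same call.
    std::vector<std::uint8_t> process(const std::vector<std::uint8_t>& data) {
        std::vector<std::uint8_t> out;
        out.reserve(data.size());
        for (std::uint8_t b : data) {
            out.push_back(static_cast<std::uint8_t>(b ^ next()));
        }
        return out;
    }

private:
    // One step of the pseudo-random generation algorithm (PRGA).
    std::uint8_t next() {
        i_ = static_cast<std::uint8_t>(i_ + 1);
        j_ = static_cast<std::uint8_t>(j_ + s_[i_]);
        std::swap(s_[i_], s_[j_]);
        return s_[static_cast<std::uint8_t>(s_[i_] + s_[j_])];
    }

    std::array<std::uint8_t, 256> s_{};
    std::uint8_t i_ = 0;
    std::uint8_t j_ = 0;
};
```

Two peers only stay in sync if both sides key the cipher and then both call discard(1024), or both skip it in the bSkipDiscard cases.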

Parallelised hashing + disk-thrash protection. If the branch works, please open a scoped PR against master. Net-new perf work is 3.1.0 material (3.0.x stays bugfix-only), but landing it on master now keeps it reviewable instead of carried as a third-party patch. One ask on the shape: please expose the worker count as a user-tunable preference (Advanced preferences panel, range 1..N, default 1). For most setups the bottleneck is the I/O device the files sit on, so multiple hasher threads buy nothing and the disk-thrash protection has to fight a synthetic problem; the real wins are on RAID and multi-spindle layouts where parallel reads come from independent devices. Letting the user dial it in matches their hardware. Keep the PR narrow: hashing change in one commit, the preference + UI wiring in a second commit, the I/O throttle / thrash knob in a third if it shares the thread pool. Call out the C++17 thread-primitives dependency in the PR description so the right CI rows get attention.
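A rough sketch of the "range 1..N, default 1" preference described above, assuming N is the number of hardware threads. The function names are hypothetical and nothing here is taken from the aMule preferences code.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: clamp a user-supplied worker count into 1..N.
// hardwareThreads may be 0 when the platform cannot report it, so the
// upper bound is forced to at least 1.
unsigned clampHasherThreads(unsigned requested, unsigned hardwareThreads) {
    const unsigned upper = std::max(1u, hardwareThreads);
    return std::min(std::max(requested, 1u), upper);
}

// Default of 1 keeps today's single-hasher behaviour unless the user opts in.
unsigned effectiveHasherThreads(unsigned requested = 1) {
    return clampHasherThreads(requested, std::thread::hardware_concurrency());
}
```

Keeping the clamp separate from the hardware query makes the range logic easy to unit-test without depending on the build machine.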

Crypto++ migration for SHA-1 / AICH / MD5. Already on our roadmap; would land as its own scoped PR — please don't bundle with the hashing or RC4 work. Per-digest commits inside that PR also make a future bisect tractable if any one swap surfaces a wire-format mismatch.

General ask: one PR per scope, current master as base. Each piece ships on its own cadence — RC4 may not ship at all, the digest migration is straightforward, the parallel hasher is a self-contained feature.

3 replies

Stoatwblr Jun 18, 2026
Author

No problems. I was planning on creating diffs for the crypto items and giving them to you for further massage. They're relatively small and almost all involve code stripping. I am the kind of coder who makes spaghetti look good and someone else's eye is absolutely needed. On the other hand these gemini patches actually work and that's always a good starting point in my opinion

Disk hashing is a tad more difficult than it first seems. The current setup uses a HARDCODED once per second timer along with mtime checking to rate limit hashing. That made sense back in the days of sub-1GB single core systems but it's a severe bottleneck now. Computationally even a modest post 2016 CPU core can hash at about 1500-2000MB/s.

The actual hasher and disk thrash testing is working - it turns out the hard part is getting Gemini to work out a way to NOT be beholden to that timer. It's loading in the partfiles and known files at startup but not the coldstart queue (unknown files) and those then fall through to the 1-per-second timer and single threaded processing

Yes, the delay in question can be dialled down - a matter of editing the GUI update delay timer entries in both amule and amuled - but that would give zero protection against disk thrash and defeats the purpose of what I'm trying to achieve. That said, even single-threaded processing at "disk speed" rate would be a huge win over the current heavily bottlenecked approach. I could tweak access priorities - in Linux it defaults to BE/4 but can (should) be moved to IDLE/4 - however I don't know if that's portable. (man ionice)
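On the portability question: `ionice` wraps the Linux-specific `ioprio_set` system call, so a guarded wrapper that does nothing elsewhere is one way to approach it. The sketch below is an assumption about how that could look, not code from any patch in this thread. The function names are hypothetical, and the constants are copied from the kernel's ioprio ABI so the code does not depend on `linux/ioprio.h` being installed.

```cpp
#include <cassert>
#if defined(__linux__)
#include <sys/syscall.h>
#include <unistd.h>
#endif

// Values mirror the Linux ioprio ABI.
constexpr int kIoprioClassShift = 13;
constexpr int kIoprioClassBestEffort = 2;  // the "BE" class reported by ionice
constexpr int kIoprioClassIdle = 3;        // the "IDLE" class
constexpr int kIoprioWhoProcess = 1;

// Pack an I/O class and a level (0-7) into the value ioprio_set expects.
constexpr int ioprioValue(int ioClass, int level) {
    return (ioClass << kIoprioClassShift) | level;
}

// Hypothetical helper: move the calling thread to the idle I/O class.
// Returns true on success, false on failure or on non-Linux platforms.
bool lowerHashingIoPriority() {
#if defined(__linux__) && defined(SYS_ioprio_set)
    // which == 0 selects the calling thread.
    return syscall(SYS_ioprio_set, kIoprioWhoProcess, 0,
                   ioprioValue(kIoprioClassIdle, 0)) == 0;
#else
    return false;
#endif
}
```

A hasher thread would call `lowerHashingIoPriority()` once at startup and carry on regardless of the result, so the same source still builds and runs on platforms without I/O priority classes.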

Getting Gemini to work in ways compatible with the existing source is somewhat like trying to herd a goat - goats are incredibly stubborn and always circle back to the way they want to do things (or the bush they want to eat), and it takes a lot of persuasion to convince them there are other approaches. One of its constant failings is assuming standardised naming conventions rather than referencing the actual source code, despite being repeatedly told not to do that. That results in lots of iterations until it gives up and actually looks at the github source (or I paste the relevant sections in). It also constantly needs reminding that generated code must be portable across the entire target base and must not be customised to what it's learned about my filesystem layout during the session. It's been both a fun and a frustrating way to waste a few 11 hour sessions

got3nks Jun 18, 2026
Maintainer

I'd suggest you to try VS Code + Claude Code 😅
https://code.claude.com/docs/en/vs-code

Stoatwblr Jun 18, 2026
Author

I'd suggest you to try VS Code + Claude Code 😅 https://code.claude.com/docs/en/vs-code

Yeah, I'm just seeing what it can do and it's always worth comparing their approaches

Stoatwblr · 2026-06-18T13:19:55Z

Stoatwblr
Jun 18, 2026
Author

https://github.com/Stoatwblr/amule/blob/master/amule-SHA-MD5-patches.TGZ

These WILL need more work if used - for starters they only know about Crypto++

RC4 patches left in, but these are NOT used in the final product for the reasons discussed above

0 replies

Stoatwblr · 2026-06-18T19:44:16Z

Stoatwblr
Jun 18, 2026
Author

Woohoo!

2026-06-18 20:25:28: Adding file /ZFSMEDIA/AMULE/Mule-1/Temp/Temp-1/203.part.met to shares
2026-06-18 20:25:28: Adding file /ZFSMEDIA/AMULE/Mule-1/Temp/Temp-1/206.part.met to shares
2026-06-18 20:25:32: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
!2026-06-18 20:25:32: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
!2026-06-18 20:25:32: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
!2026-06-18 20:25:32: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
!2026-06-18 20:25:32: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
!2026-06-18 20:25:32: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
!2026-06-18 20:25:32: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
!2026-06-18 20:25:32: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
!2026-06-18 20:25:33: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
!2026-06-18 20:25:33: Parallel Interceptor: Accumulated cold-start batch of 500 unknown files. Offloading to workers pool.
2026-06-18 20:25:33: Found 61752 known shared files, 32609 unknown
!2026-06-18 20:25:33: Start downloading server list from https://upd.emule-security.org/server.met
2026-06-18 20:25:33: amuled: OnInit - starting timer
2026-06-18 20:25:33: Asio thread 1 started
2026-06-18 20:25:33: Asio thread 2 started
2026-06-18 20:25:33: Asio thread 4 started
2026-06-18 20:25:33: Loading IP filters 'ipfilter.dat' and 'ipfilter_static.dat'.
2026-06-18 20:25:33: Asio thread 3 started
2026-06-18 20:25:33: Loaded 171396 IP-ranges from '.aMule/ipfilter.dat'.
2026-06-18 20:25:33: Loaded 0 IP-ranges from '.aMule/ipfilter_static.dat'.
!2026-06-18 20:25:33: Universal Dynamic I/O Scheduler initialized. Absolute hardware limit: 72 threads

2026-06-18 20:25:35: Successfully updated ipfilter.dat
2026-06-18 20:25:35: Saving of server-list completed.
2026-06-18 20:25:36: Loading IP filters 'ipfilter.dat' and 'ipfilter_static.dat'.
2026-06-18 20:25:36: Loaded 171396 IP-ranges from '.aMule/ipfilter.dat'.
2026-06-18 20:25:36: Loaded 0 IP-ranges from '.aMule/ipfilter_static.dat'.
2026-06-18 20:25:36: IP filter is ready
!2026-06-18 20:25:38: Hardware performance ceiling locked. Stabilising allocation at 3 threads (Peak: 621.71 MB/s)
!2026-06-18 20:25:39: ZFS Disk Thrashing Flag. Scaling down active threads from 3 to 2 (Throughput: 484.90 MB/s)

2026-06-18 20:25:33: IP filter is ready
!2026-06-18 20:25:33: Connecting
2026-06-18 20:25:33: Connecting to eMule Security (45.82.80.155 - 45.82.80.155:5687) using protocol obfuscation.
2026-06-18 20:25:34: Read 200 Kad contacts
!2026-06-18 20:25:34: Kad started.
!2026-06-18 20:25:34: Kad stopped.
!2026-06-18 20:25:34: Connected to Kad (firewalled)
2026-06-18 20:25:34: Connected to eMule Security (45.82.80.155:5687)
...
2026-06-18 20:25:35: ServerMessage: copyrighted material, DRM protected media, viruses or other harmful files!
2026-06-18 20:25:35: Downloaded https://upd.emule-security.org/ipfilter.zip (2933633 bytes)
2026-06-18 20:25:35: Successfully updated ipfilter.dat
2026-06-18 20:25:35: Saving of server-list completed.
2026-06-18 20:25:36: Loading IP filters 'ipfilter.dat' and 'ipfilter_static.dat'.
2026-06-18 20:25:36: Loaded 171396 IP-ranges from '/home/alan/.aMule/ipfilter.dat'.
2026-06-18 20:25:36: Loaded 0 IP-ranges from '/home/alan/.aMule/ipfilter_static.dat'.
2026-06-18 20:25:36: IP filter is ready
!2026-06-18 20:25:38: Hardware performance ceiling locked. Stabilising allocation at 3 threads (Peak: 621.71 MB/s)
!2026-06-18 20:25:39: ZFS Disk Thrashing Flag. Scaling down active threads from 3 to 2 (Throughput: 484.90 MB/s)

!2026-06-18 20:25:46: Connected to Kad (ok)
!2026-06-18 20:25:49: ZFS Disk Thrashing Flag. Scaling down active threads from 2 to 1 (Throughput: 367.57 MB/s)
!2026-06-18 20:25:50: ZFS Disk Thrashing Flag. Scaling down active threads from 1 to 1 (Throughput: 288.62 MB/s)
!2026-06-18 20:25:51: ZFS Disk Thrashing Flag. Scaling down active threads from 1 to 1 (Throughput: 254.22 MB/s)
!2026-06-18 20:25:53: ZFS Disk Thrashing Flag. Scaling down active threads from 1 to 1 (Throughput: 208.24 MB/s)
!2026-06-18 20:25:54: ZFS Disk Thrashing Flag. Scaling down active threads from 1 to 1 (Throughput: 165.71 MB/s)
!2026-06-18 20:26:08: Hardware performance ceiling locked. Stabilising allocation at 5 threads (Peak: 789.68 MB/s)
!2026-06-18 20:26:10: ZFS Disk Thrashing Flag. Scaling down active threads from 5 to 4 (Throughput: 653.81 MB/s)
!2026-06-18 20:26:16: ZFS Disk Thrashing Flag. Scaling down active threads from 4 to 3 (Throughput: 511.59 M

2026-06-18 20:26:17: External connection closed.
!2026-06-18 20:26:22: ZFS Disk Thrashing Flag. Scaling down active threads from 3 to 2 (Throughput: 404.04 MB/s)
!2026-06-18 20:26:27: ZFS Disk Thrashing Flag. Scaling down active threads from 2 to 1 (Throughput: 281.13 MB/s)
!2026-06-18 20:26:29: ZFS Disk Thrashing Flag. Scaling down active threads from 1 to 1 (Throughput: 218.21 MB/s)
!2026-06-18 20:26:42: Hardware performance ceiling locked. Stabilising allocation at 5 threads (Peak: 774.84 MB/s)
!2026-06-18 20:26:47: ZFS Disk Thrashing Flag. Scaling down active threads from 5 to 4 (Throughput: 643.38 MB/s)
!2026-06-18 20:27:08: ZFS Disk Thrashing Flag. Scaling down active threads from 4 to 3 (Throughput: 520.20 MB/s)

Disk is extremely comfortable, and if anything else fires up, the thrash detector turns the thread count down. Right now Jellyfin is doing its thing in the same pool
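For readers trying to follow the log, the scale-down behaviour can be approximated by a small governor like the one below. This is a guess at the general shape, not the logic in the actual patch, which is not shown in this thread. The class name, the 0.8 drop ratio and the re-baselining step are all assumptions.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical throughput governor: remember the best throughput seen at the
// current allocation and shed one worker when a sample falls well below it.
class ThrashGovernor {
public:
    ThrashGovernor(unsigned startThreads, double dropRatio)
        : threads_(std::max(1u, startThreads)), dropRatio_(dropRatio) {}

    // Feed one throughput sample in MB/s.
    // Returns true only if the thread count actually changed.
    bool onSample(double mbPerSec) {
        if (mbPerSec > peak_) {
            peak_ = mbPerSec;  // new best for this allocation
            return false;
        }
        if (mbPerSec < peak_ * dropRatio_ && threads_ > 1) {
            --threads_;
            peak_ = mbPerSec;  // re-baseline against the reduced allocation
            return true;
        }
        return false;  // within tolerance, or already at the floor of 1
    }

    unsigned threads() const { return threads_; }

private:
    unsigned threads_;
    double dropRatio_;
    double peak_ = 0.0;
};
```

Returning a "changed" flag lets the caller log only real transitions, which would avoid lines such as "from 1 to 1" once the floor has been reached.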

0 replies

Stoatwblr · 2026-06-18T20:00:01Z

Stoatwblr
Jun 18, 2026
Author

Parallelisation_git_report.docx
Parrallelisation_summary.docx

3 replies

Stoatwblr Jun 19, 2026
Author

FWIW, processing speedup is around 5-8x. The bigger win is getting hashing out of the primary thread.
There are a couple of regressions that have turned up so this needs more work

got3nks Jun 19, 2026
Maintainer

Hashing was already out of main thread but limited to 1 thread only. Would be interesting to see what share of the speedup comes from the Crypto++ hashing functions replacement vs multiple threads.

mrjimenez Jun 19, 2026
Maintainer

I think this is amazing work!

Reports like these should go into a special doc folder under docs; that is how architectural changes should be documented. Also, not everyone can read code, and it makes it a lot easier for other people to understand how things work.

I see we have gotten a lot better with respect to documentation with the AI generated PR texts, but PRs are usually read when they are active. Docs remain "forever". A lot of information in our latest PRs should be preserved.

Another interesting possibility is to document on the top part of code files, that one I am more of a fan, but I am one of those who can read code.

Stoatwblr · 2026-06-20T21:20:04Z

Stoatwblr
Jun 20, 2026
Author

I'd say the gains from farming out to crypto++ are modest, but the important part is that it's less ongoing overhead and more standardisation for parts that can simply be plugged in. Why have a dog if you're going to do all the barking yourself?

I've got 2 regressions that affect the patches in the doc. The first (fixed) was that amulegui's "shared files" counter wasn't seeing the new shares. The other (more serious, and I haven't looked at it yet) is that, having rehashed everything, I renamed/moved some newly downloaded files, only to find that whilst the directories show in sharefiles.dat, the scanner says those directories are empty.

Zlib will have its biggest gains from threadsafe parallelisation. The existing inline method is a serious bottleneck and simply calling external libraries won't gain much - not helped by the issue that for best gains you really need zlib2 and that's simply not widely deployed enough to be viable. I'd rather see that hook into the back of the existing compression libraries like pzlib did 15+ years ago (pigz, pbzip, etc), as it means everything just gains without needing masses of rewrites. In any case we're not pushing large amounts of data through zlib even on wide-bandwidth connections - it will become a bottleneck per thread at about 30MB/s but there are still gains to be had in the meantime simply by getting away from running synchronously

I'm not touching zlib until this regression is nailed down, for obvious reasons - I've made a start on it but shelved things until hashing is fully sorted.

@mrjimenez What's your opinion of the code quality? It seems fairly reasonable to me and nothing like the utter garbage I was seeing AI produce 5 years ago.

As I said, my only issue is that I have to keep nudging it to stay on the right path and Gemini's interface rapidly bloats a browser until it OOMs - but you can get around that by dumping the conversation to PDF, starting a new session and then feeding the previous chat back in. It seems to result in a more focussed AI too. What's really impressed me is the documentation.

I predicted 15 years ago that AI would take the white collar jobs first, based on what had already happened in offices since the 1970s (when was the last time you saw a room full of ledger clerks scratching away?) and still think that the 21st century Luddites tossing shoes into the punch cards will be newly unemployed paralegals and conveyancers as the mundane stuff becomes AI-manageable and what's left needs better expertise to deal with. You can probably add web developers to that crew too.

15 replies

mrjimenez Jun 23, 2026
Maintainer

Added.

got3nks Jun 23, 2026
Maintainer

I see the admin panel now @mrjimenez but after 5 minutes I gave up on finding where the configuration for the per-platform default download is buried. HELP!

mrjimenez Jun 23, 2026
Maintainer

Oh, ok, but don't worry too much, the main thing is that now you have access.

You go in the files tab and enter the folder, currently 3.0.0, as if you were going to download the file, and then click the circled "i" letter on the right side of the file name.

Very intuitive, I have spent more time than I would like to admit to find it.

got3nks Jun 23, 2026
Maintainer

You go in the files tab and enter the folder, currently 3.0.0, as if you were going to download the file, and then click the circled "i" letter on the right side of the file name.

Very intuitive, I have spent more time than I would like to admit to find it.

AAAAAAAAAH, yeah I gave up on that, lol. Now I see it, thanks.

Stoatwblr Jun 25, 2026
Author

AAAAAAAAAH, yeah I gave up on that, lol. Now I see it, thanks.

I've found in such situations that using chrome's inspector and walking the list of links has been educational (or in the old days, reading page source manually). There are a lot of awful pages out there

Replies: 12 comments · 23 replies
