Issue Summarization — examples/cifar100
Background
The cifar100 example suite is the primary benchmark in KubeEdge Ianvs for evaluating Federated Class Incremental Learning (FCIL) combined with Semi-Supervised Learning (SSL). It uses the standard CIFAR-100 dataset (60,000 images across 100 classes, split into distributed text index files to simulate edge clients) and is currently the only suite in the repository that exercises the federatedclassincrementallearning core paradigm. It spans 7 runnable sub-examples across 4 sub-directories, ranging from a simple FedAvg baseline all the way to advanced semi-supervised algorithms like FediCarl and GLFC.
This suite sits at a critical intersection in the repository: it is simultaneously the most technically ambitious example and the most broken one. Multiple contributors — LFX applicants, independent researchers, and developers new to the project — have independently attempted to run it and failed, each at a different stage of the same cascading chain of failures. The issues documented here are not theoretical concerns raised in code review. They are concrete, reproducible bugs discovered by real people on real machines, following the official instructions exactly and hitting walls that the documentation gives no guidance on.
The fact that so few contributors have successfully run any cifar100 sub-example end-to-end is a direct consequence of the layered failures described below. Each layer of breakage acts as a filter: those who make it past the dependency errors hit the hardcoded path errors; those who fix the paths hit the YAML key mismatches; those who get past that hit runtime crashes in the model code itself. Because the example has almost never been run by anyone outside the original author, the codebase has not received the kind of hands-on validation that would surface and fix these issues organically, and they have quietly accumulated, undetected, across releases.
The 4 Sub-Directories at a Glance
Think of a Smart City Traffic Camera System to understand the increasing complexity of each sub-directory:
1. federated_learning/ — Baseline FedAvg
Cameras train a local CNN on manually labeled images (cars, trucks) and sync only their updated model weights — not raw video — to a central aggregation server each night. The server averages all client updates using standard FedAvg and pushes back a smarter master model. No incremental learning, no unlabeled data, no SSL. This is the intended starting point for any new contributor trying to understand how the cifar100 suite works.
2. federated_class_incremental_learning/ — CIL Upgrade
Six months later, new object classes (scooters, delivery robots) appear in the city. The cameras have deleted their original training data to save storage space on their constrained hardware. A standard AI retrained on scooters would overwrite its memory of cars — the catastrophic forgetting problem. CIL solves this by expanding the classifier head (tf.pad) with new output neurons dedicated to the new classes, while freezing the existing neural pathways for cars and trucks. The model grows without forgetting, and it never needs the deleted original data again.
3. fci_ssl/ — Cutting Edge Semi-Supervised Learning
Engineers now want to track food trucks, but only have time to manually label 50 images. Meanwhile, the cameras capture tens of thousands of unlabeled frames every day. This sub-directory applies SSL algorithms (FediCarl, GLFC) to generate pseudo-labels — confident guesses — for the unlabeled footage, using those 50 labeled images as a seed. The camera teaches itself how food trucks look from different angles, in rain, and partially obscured, without waiting for more human annotations. This is the most algorithmically complex sub-directory and the one most actively under research.
4. sedna_federated_learning/ — Production Deployment
The first three sub-directories simulate the entire FL pipeline on a single developer machine. This one is different. The train_worker process runs on a low-power hardware chip physically bolted to a traffic pole in outdoor conditions. The aggregation_worker runs on a Kubernetes cluster in a data center miles away. They communicate asynchronously over a real 5G network using Sedna's production interfaces. This is not a benchmark simulation — it is a live end-to-end test of the entire KubeEdge-Sedna production stack under real distributed hardware constraints.
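The three simulated scenarios above each reduce to one small operation. The sketch below illustrates them in plain Python with no TensorFlow dependency: sample-weighted averaging (FedAvg), appending zero-initialised output columns to a classifier head (the role tf.pad plays in the example), and keeping only confident predictions as pseudo-labels. The function names and the 0.8 threshold are illustrative assumptions, not code from the repository, and the pseudo-labeling shown is generic confidence thresholding rather than the specific FediCarl or GLFC procedure.

```python
def fedavg(client_weights, client_sizes):
    """Average client weight vectors, weighted by each client's sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def expand_head(head, num_new_classes):
    """Append zero-initialised output columns; existing columns are untouched."""
    return [row + [0.0] * num_new_classes for row in head]


def pseudo_label(probabilities, threshold=0.8):
    """Return (sample_index, class_index) for predictions at or above threshold."""
    selected = []
    for idx, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((idx, best))
    return selected


if __name__ == "__main__":
    print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
    print(expand_head([[0.5, -0.2]], 2))
    print(pseudo_label([[0.1, 0.9], [0.55, 0.45]]))
```

In the real examples these operations act on tensors and full Keras models; the list-based version only shows the arithmetic.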

The image above shows the full ianvs benchmark architecture. The cifar100 example suite sits at the top example layer. Running a benchmark job invokes the federatedclassincrementallearning paradigm in the core layer, which in turn calls Sedna's aggregation and dataset interfaces at the infrastructure layer. Every issue documented in this summarization blocks some stage of this vertical call chain — either preventing the tool from starting at all, preventing the example config from loading, or crashing the training loop mid-execution.
Table of Contents
| Category | Issues |
|---|---|
| A: Missing Dependencies | #351, #327, #363 |
| B: Hardcoded Paths | #252 |
| C: FL Core Implementation | #306 |
| D: CI/CD Validation | #411 |
Category A — Missing Dependencies & Broken Requirements
Issues where incomplete requirements.txt files block cifar100 from being set up or executed.
These three issues are closely related and were filed independently by different contributors who each hit the same wall without any coordination between them. They represent three separate attempts, by three separate people, to follow the official setup instructions — and three separate failures at the exact same point. The existence of three independent reports within a relatively short window is itself meaningful: it is not possible to dismiss this as a user error or a platform quirk when unrelated contributors on different machines reproduce the exact same sequence of crashes.
What makes this category particularly frustrating from a contributor's perspective is that pip install -r requirements.txt completes successfully and exits with code 0. There are no warnings, no errors, no indication that anything is wrong. A contributor who follows the documentation step-by-step has every reason to believe their environment is ready — until the moment they try to run something and are met with a crash that gives no hint that the installation step was the problem.
| # | Title | State |
|---|---|---|
| #351 | Missing requirements.txt in examples/cifar100/: tensorflow, keras, sedna absent | Open |
| #327 | Critical: Incomplete root requirements.txt blocks installation | Open |
| #363 | Independent reproduction of #327, confirms universal blocker | Open |
Issue #351 — Missing examples/cifar100/requirements.txt
Introduction
examples/cifar100/ has no requirements.txt file of its own. The three packages it critically depends on — tensorflow, keras, and sedna — are completely absent from the root requirements.txt as well. Any contributor who clones the repository and follows the standard setup instructions will encounter ModuleNotFoundError before a single line of cifar100 training code runs.
This issue is not about a misconfigured environment or an unusual machine setup. It is a structural gap: the file that should declare what cifar100 needs simply does not exist. Every contributor who has attempted to run cifar100 — regardless of their system, operating system, or Python version — has hit this exact error at the exact same point. There is no workaround short of manually discovering and installing the missing packages, which requires understanding the codebase well enough to identify what is needed — knowledge a new contributor cannot reasonably be expected to have before running the example for the first time.

The screenshot above confirms the exact crash: Python cannot find tensorflow, keras, or sedna because none of them are installed anywhere in the environment. This error occurs at the very first import statement, before any cifar100 training logic, dataset loading, or paradigm initialization begins. The crash is deterministic and identical regardless of the machine or operating system.
Challenge
1. Silent failure — The root requirements.txt installs successfully — it only lists prettytable, scikit-learn, numpy, pandas, tqdm, matplotlib, and onnx. The install exits with code 0 and no warnings. A contributor who has just run pip install -r requirements.txt and seen it complete cleanly has no reason to suspect that critical packages are missing. The problem only reveals itself at runtime when Python throws the error, by which point the contributor may have spent significant time on other parts of the setup process.
2. Version sensitivity: cifar100 does not just need tensorflow to be installed. It needs a specific, mutually compatible ecosystem: TensorFlow 2.x, the Keras release that matches that TensorFlow version, and Sedna. These three must be version-compatible with each other and with the Python version being used. A plain pip install tensorflow keras sedna without pinned versions will install whatever PyPI considers the latest, which risks incompatible releases that break in subtle and hard-to-diagnose ways: protobuf conflicts, missing class attributes, changed API signatures.
3. Cross-platform complexity: TensorFlow has different installation paths for CPU-only machines versus GPU machines, and native Windows GPU support ended with TensorFlow 2.10 (later releases need WSL2 for GPU). A single generic requirements.txt entry for tensorflow will not work correctly on all platforms. Properly handling this requires platform-specific installation guidance that does not currently exist anywhere in the repository.
4. Scope — All 7 cifar100 sub-examples share this one missing file. Creating it fixes all of them in a single change. However, at least one sub-example has an additional dependency: glfc requires h5py for reading and writing .weights.h5 model checkpoint files. A blanket fix must account for these sub-example-specific needs as well, otherwise some sub-examples will still fail at a later stage even after the shared file is created.
Impact
| Scope | Impact |
|---|---|
| cifar100 | All 7 sub-examples are completely unrunnable. The federatedclassincrementallearning paradigm cannot be tested at any level: not the simplest baseline, not the most advanced SSL variant. |
| ianvs core | The FL paradigm code in core is exercised exclusively by these examples. With none of them runnable, any regression in the core paradigm logic (a broken aggregation step, a changed interface) goes completely undetected and can silently reach production. |
| Other examples | sedna is a shared dependency used by sedna_federated_learning/ and any other example that uses Sedna's ClassFactory or aggregation interfaces. Its absence from the root requirements.txt means those examples are also broken for a fresh install, not just cifar100. |
Pull Requests
| PR | Title | State | Fixes #351? |
|---|---|---|---|
| #354 | fix(examples): add missing requirements.txt for robot and cifar100 | Open | Yes: explicitly states Fixes #351, but does not pin versions, so version compatibility issues may still surface after install |
| #421 | fix: add missing core dependencies and macOS troubleshooting guide | Open | Partially: adds core-level missing packages but does not create the cifar100-specific requirements.txt |
Suggested Complete Fix
tensorflow>=2.12.0,<2.16.0
keras>=2.12.0,<3.0.0
sedna>=0.6.0
numpy>=1.23.0
h5py>=3.7.0
Issue #413 — Missing colorlog and PyYAML in Root requirements.txt
Introduction
The root requirements.txt is missing colorlog and PyYAML, both of which are imported unconditionally at module level in ianvs core:
core/common/log.py line 18 → import colorlog
core/common/utils.py line 24 → import yaml
A clean install following the official guide — pip install -r requirements.txt followed by pip install -e . — produces no errors and no warnings. But every single ianvs command then crashes immediately, before any example code runs, before any paradigm is loaded, before any configuration is parsed. The process exits in the core logging initialization step.
This issue is importantly distinct from #351. Issue #351 blocks cifar100 specifically — a contributor trying a different example would not hit it. This issue blocks the entire ianvs framework from starting. It does not matter which example a contributor wants to run, or how correctly they have configured everything else. The tool crashes before getting anywhere near example-level code, and the stack trace shows no connection to any example, making it genuinely difficult to trace the cause without reading the source files directly.

*`core/common/log.py` line 18: a bare `import colorlog` statement with no `try/except` guard and no conditional check. If the package is absent from the environment, the Python interpreter raises `ModuleNotFoundError` at this exact line during module initialization. The entire ianvs process exits here, before the logging system is set up, before any example or paradigm code is reached.*

*`pip install -r requirements.txt` exits cleanly with exit code `0` and no output indicating a problem. A contributor who follows the documentation exactly has every reason to believe their environment is fully configured. Nothing in this output hints that two critical packages are missing.*

The crash as it appears to the contributor: ModuleNotFoundError: No module named 'colorlog', with a traceback pointing to core/common/log.py line 18. There is no mention of cifar100, no mention of any example, and no suggestion that installing a package will fix it. Without reading log.py directly, this error gives no useful signal about what went wrong or how to fix it.
Challenge
1. Deceptive silent install — pip exits with code 0 and produces zero output suggesting a problem. A contributor who has spent time carefully following the documentation, setting up a virtual environment, and running the install steps has no mechanism to discover that two packages were not installed until the moment they try to run ianvs and see it crash. By that point, the assumption is typically that the crash is caused by something in the example or the paradigm — not by the install step that appeared to complete successfully.
2. Core-level crash, not example-level — Unlike #351, which is scoped to cifar100, this bug sits in core/common/ — the very first module ianvs loads on startup. It blocks every single example in the repository without exception. A contributor trying to run bdd, pcb-aoi, robot, or any other example would hit the exact same crash before getting anywhere near their target example. This makes it arguably the highest-priority dependency bug in the entire codebase — it is not an example-specific issue, it is a framework startup issue.
3. No guard in the source — Both failing imports are bare, unconditional module-level statements:
import colorlog # core/common/log.py line 18 — crashes if not installed
import yaml # core/common/utils.py line 24 — crashes if not installed
Note: Adding a try/except guard around these imports would only mask the underlying problem by silently swallowing the error. The correct and complete fix is ensuring these packages are declared in requirements.txt so they are guaranteed to be present in any environment that follows the official setup instructions.
Impact
| Scope | Impact |
|---|---|
| cifar100 | ianvs never reaches the FL paradigm or any cifar100 logic at all; the process exits in log.py first. Even if a contributor has correctly installed all cifar100-specific packages, this bug blocks execution before anything cifar100-related runs. |
| All examples | This is a repository-wide blocker. Every example (bdd, pcb-aoi, MOT17, robot, cifar100) is broken for any contributor doing a fresh install following the documentation. This is not an edge case; it affects 100% of new contributors on their first day. |
| CI | main.yaml only verifies that pip install -e . exits with code 0. It never runs a single ianvs command after installation. This means the bug is completely invisible to automated testing and will persist across every release until it is explicitly fixed. |
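The CI gap described in the last row could be narrowed with a post-install smoke test that actually imports the core modules, since an import executes their module-level statements and would surface the missing colorlog and yaml packages. The following is a minimal sketch under stated assumptions: the helper name smoke_import is hypothetical, and the dotted module paths are inferred from the file paths cited above rather than confirmed against the package layout.

```python
import importlib


def smoke_import(module_names):
    """Import each module; return {name: error message} for the ones that fail."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
            failures[name] = str(exc)
    return failures


if __name__ == "__main__":
    # Assumed module paths, derived from core/common/log.py and core/common/utils.py.
    failed = smoke_import(["core.common.log", "core.common.utils"])
    for name, error in failed.items():
        print(f"FAILED {name}: {error}")
    print("smoke test passed" if not failed else f"{len(failed)} module(s) failed to import")
```

A CI job would turn a non-empty result into a non-zero exit code so that the pipeline fails instead of reporting a clean install.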
Pull Requests
| PR | Title | State | Fixes #413? |
|---|---|---|---|
| #414 | deps: add missing colorlog and PyYAML to requirements.txt | Open | Yes: the exact targeted fix, also alphabetically sorts all entries in requirements.txt for long-term maintainability |
| #421 | fix: add missing core dependencies and macOS troubleshooting guide | Open | Partially: overlapping scope, addresses colorlog and PyYAML among a broader set of fixes |
What PR #414 Changes
+colorlog # ADDED
+matplotlib
numpy
+onnx
pandas
+prettytable~=2.5.0
+PyYAML # ADDED
+scikit-learn
+tqdm
Issue #363 — Independent Reproduction of Cascading Install Failure
Introduction
Filed by Adityakk9031 11 days after #327, independently confirming the same cascading failure chain on a different machine. These two contributors did not coordinate with each other. They did not reference each other's issues. They simply both tried to follow the official install guide and encountered the exact same sequence of failures in the exact same order.
The significance of this independent reproduction goes beyond just confirming the bugs exist. It rules out the possibility that #327 was caused by something unusual about that specific machine or environment. When two unrelated contributors on separate machines hit the identical 4-step error chain within 11 days of each other — without any communication — the only reasonable interpretation is that this failure is deterministic and affects every new install. There is no way to follow the official setup instructions and avoid these errors. Following the official guide produces 4 sequential ModuleNotFoundError crashes, each one only visible after manually fixing the previous:
setuptools → colorlog → PyYAML → sedna
Challenge
1. Confirms 100% fresh-install failure rate — Two uncoordinated contributors, two different machines, the same 4-step error chain, 11 days apart. This rules out environment-specific quirks and confirms that the broken requirements.txt is a universal, deterministic problem. Any contributor who follows the documented setup process will hit these errors. There are no exceptions.
2. One error at a time — no way to see ahead — Each missing package only surfaces after the previous one is fixed manually. A contributor cannot discover all missing packages in a single pass. They fix the first error, re-run, get the second error, fix that, re-run, get the third, and so on. At no point does the tooling tell them how many more errors are waiting. This one-at-a-time discovery loop is slow, frustrating, and provides no indication of how far from a working environment the contributor is at any given moment.
3. Version pin too loose in the proposed fix — Both #327 and #363 propose adding sedna>=0.4.0 to requirements.txt as the fix for the fourth error. However, the actual ianvs codebase requires sedna to expose the JsonlDataParse class in sedna.datasources, which is absent from older PyPI releases. A plain pip install sedna with >=0.4.0 may install a version that does not include JsonlDataParse, causing another failure downstream. The proposed fix from both issues is therefore incomplete, and a tighter version pin is needed.
4. Potentially more undiscovered dependencies — Both issues include the note that there may be more missing packages beyond the four documented. Further packages required by Sedna's internal communication layer — such as requests, grpcio, or protobuf — could surface once the first four errors are resolved. The full list of missing dependencies may not yet be known.
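The one-error-at-a-time loop described in challenge 2 can be avoided with a preflight check that reports every missing module in a single pass. The sketch below uses importlib.util.find_spec, which locates a top-level module without importing it. The module list is taken from the failure chain above (the PyYAML distribution is imported as yaml); the helper name find_missing is hypothetical, and the list is not claimed to be complete.

```python
from importlib.util import find_spec

# Import names from the documented failure chain; PyYAML is imported as "yaml".
REQUIRED_MODULES = ["setuptools", "colorlog", "yaml", "sedna"]


def find_missing(module_names):
    """Return the top-level modules that cannot be located, without importing them."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All listed modules are installed.")
```

Because find_spec does not execute the module, this check will not detect a package that is installed but lacks a required class such as JsonlDataParse; that needs an actual import.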
Impact
| Scope | Impact |
|---|---|
| cifar100 | sedna, colorlog, and yaml all fail before any FCIL logic is reached. Even a contributor who is determined to push through every error manually will not reach the first line of cifar100 training code without multiple rounds of manual package discovery and installation. |
| Repository | Two independent reports from two unrelated contributors prove this is not an obscure or rarely-triggered problem. It is the first thing every new contributor hits, and it hits them before they can do anything productive with the repository. |
| Issue tracking | Both #327 and #363 describe the identical failure with the same proposed fix. Having two open, unresolved issues for the same root cause adds confusion for maintainers about which PR is responsible for closing which issue. Merging PR #421 would close both simultaneously. |
#363 vs #327 — Key Differences
| Aspect | #327 (shivam8415) | #363 (Adityakk9031) |
|---|---|---|
| Error chain | Same 4 steps | Same 4 steps |
| Platform | Windows | Windows |
| Filed date | 2026-02-06 | 2026-02-17 |
| Unique addition | Roadmap + validation script proposal | Independent cross-machine confirmation |
| Closed by | PR #421 | PR #421 |
Key insight: pip install exits with code 0 and no warnings despite being completely insufficient. Contributors trust the clean install output and are blindsided when ianvs crashes immediately on first use. There is no indication in the install output that 4 more failures are waiting, and no guidance in the documentation about how to resolve them.
Category B — Hardcoded Paths & Configuration Failures
Issues caused by developer-specific absolute paths baked directly into YAML config files, making all examples non-portable on any machine other than the original author's.
A contributor who successfully resolves all Category A dependency issues, gets ianvs installed correctly, and finally runs a cifar100 benchmark job will immediately encounter a completely different class of failure. The YAML configuration files that define each benchmark job contain file paths that are hardcoded to the original developer's personal machine — specifically, paths beginning with /home/wyd/ianvs/... which point to directories and files that exist on one specific laptop and nowhere else.
This is not a subtle misconfiguration. It is an immediate, hard crash that fires before any training logic executes, with a FileNotFoundError that clearly identifies the non-portable path. The fact that these paths reached main and remained there through multiple releases indicates that the YAML config files were written once on a single machine and never tested anywhere else. No other contributor successfully ran them, which means no one had the opportunity to catch the issue through normal use.
| # | Title | State |
|---|---|---|
| #252 | Hardcoded /home/wyd/ paths + 4 additional runtime bugs in cifar100 | Open |
Issue #252 — Hardcoded Paths + 4 Runtime Bugs
Introduction
Issue #252 was filed by an LFX 2025 Term-3 applicant who pushed through all the Category A dependency errors and attempted to actually run examples/cifar100/federated_learning/fedavg. What they found was not a single bug but 5 compounding bugs, each one layered on top of the previous, each only visible after the prior one is resolved. This is the most thorough and detailed bug report in the cifar100 suite. It documents exactly how far the example is from being runnable end-to-end, even after all dependency issues are resolved.
The fact that this level of analysis was produced by an LFX applicant — someone applying for a mentorship, not a maintainer — underscores how broken the onboarding experience is. A contributor new to the project should be able to follow the README, run a baseline example, and understand what it does. Instead, they encountered a series of compounding failures that required reading source code, understanding the relationship between YAML configs and the Dataset class, and debugging a runtime crash in the model's predict loop. Even after all that work, the example still could not be run to completion.
Even after resolving all dependency issues from Category A, this cascade of bugs blocks execution completely:
Hardcoded paths → TF version conflict → Wrong YAML keys → AttributeError in basemodel.py → No README

Running benchmarking.py with a cifar100 config immediately throws FileNotFoundError. All 18 YAML files across all 6 sub-examples contain hardcoded references to /home/wyd/ianvs/... — a path that only exists on the original developer's machine. This is the first error a contributor sees after successfully completing all dependency installation steps.
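A quick way to measure the extent of the problem, and to confirm that a fix is complete, is to count occurrences of the developer path in every YAML file under the example directory. This is a minimal sketch, assuming the configs use the .yaml extension and are UTF-8 encoded; count_hardcoded_paths is a hypothetical helper, not part of ianvs.

```python
from pathlib import Path


def count_hardcoded_paths(root, needle="/home/wyd/"):
    """Map each YAML file under root to the number of times needle occurs in it."""
    hits = {}
    root = Path(root)
    if not root.is_dir():
        return hits
    for path in sorted(root.rglob("*.yaml")):
        count = path.read_text(encoding="utf-8").count(needle)
        if count:
            hits[str(path)] = count
    return hits


if __name__ == "__main__":
    results = count_hardcoded_paths("examples/cifar100")
    for file_name, count in results.items():
        print(f"{count:3d}  {file_name}")
    print(f"{sum(results.values())} occurrences in {len(results)} files")
```

After a complete fix the function should return an empty dictionary, which makes it usable as a regression check.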

testenv.yaml specifies dataset paths using train_url and test_url keys. The core Dataset class in dataset.py reads train_index and test_index to resolve the actual file paths — train_url and test_url exist in the class but serve a different purpose (they hold resolved URLs after processing). Using the wrong keys means the dataset is silently never loaded. No error is raised; training proceeds as if the dataset is empty.
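Because the wrong keys produce no error, a small validation step over the parsed config can make the mismatch visible. The sketch below works on an already parsed dictionary (for example the result of yaml.safe_load) and searches it recursively, so it does not depend on the exact nesting. The helper names and the nested layout used in the usage lines are assumptions for illustration.

```python
def find_keys(config, wanted):
    """Recursively collect which of the wanted keys appear anywhere in config."""
    found = set()
    if isinstance(config, dict):
        for key, value in config.items():
            if key in wanted:
                found.add(key)
            found |= find_keys(value, wanted)
    elif isinstance(config, list):
        for item in config:
            found |= find_keys(item, wanted)
    return found


def check_dataset_keys(testenv):
    """Return a list of problems; an empty list means the dataset keys look right."""
    expected = {"train_index", "test_index"}
    suspicious = {"train_url", "test_url"}
    problems = []
    missing = expected - find_keys(testenv, expected)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    misused = find_keys(testenv, suspicious)
    if misused:
        problems.append("unexpected keys: " + ", ".join(sorted(misused)))
    return problems


if __name__ == "__main__":
    # Hypothetical layout mirroring the keys described above.
    config = {"testenv": {"dataset": {"train_url": "train.txt", "test_url": "test.txt"}}}
    for problem in check_dataset_keys(config):
        print(problem)
```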

The predict() method in basemodel.py calls data.x, assuming the input is a Dataset object with attribute access. In the federated learning paradigm, ianvs actually passes a plain Python list. This causes an AttributeError in the prediction loop, crashing execution even after the path errors and YAML key mismatches are corrected. It is a code-level bug independent of configuration.
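One possible way to make predict() tolerate both input shapes is to normalise the argument before use. The sketch below is an illustration of that idea only: the real predict() body is not shown in the issue, extract_features is a hypothetical helper, and FakeDataset stands in for whatever Dataset type exposes features on an x attribute.

```python
def extract_features(data):
    """Return the features whether data is a Dataset-like object or a plain list."""
    if hasattr(data, "x"):
        return data.x
    return data


class FakeDataset:
    """Hypothetical stand-in for a Dataset object exposing features on `.x`."""

    def __init__(self, x):
        self.x = x


if __name__ == "__main__":
    print(extract_features([1, 2, 3]))
    print(extract_features(FakeDataset([4, 5])))
```

Whether the example should accept both forms, or the paradigm should always pass one of them, is a design decision for the maintainers.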

TF 2.10.0 triggers a protobuf version conflict during package installation. No TensorFlow version is pinned anywhere in the cifar100 example, and there is no documentation in the repository about which TF version, Keras version, and Python version are known to be compatible with each other. Contributors are left to discover compatible combinations through trial and error.

BACKEND_TYPE = "KEARS" in basemodel.py — a clear typo for "KERAS". This string is used to select the model backend at initialization time. The wrong string means the backend is never correctly identified, causing a silent mismatch that affects how the model is loaded and run. No explicit error is thrown at this line, making it difficult to trace.
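A typo like this stays silent because the string is never validated. A fail-fast check with a spelling suggestion would turn it into an immediate, readable error. The sketch below is an assumption-laden illustration: the set of accepted backend names is a guess that should be checked against the Sedna source, and validate_backend is a hypothetical helper.

```python
from difflib import get_close_matches

# Assumed set of accepted backend names; verify against the Sedna source.
KNOWN_BACKENDS = ("TENSORFLOW", "KERAS", "TORCH", "MINDSPORE")


def validate_backend(name):
    """Return name if recognised; otherwise raise with a spelling suggestion."""
    if name in KNOWN_BACKENDS:
        return name
    hint = get_close_matches(name, KNOWN_BACKENDS, n=1)
    suggestion = f" Did you mean {hint[0]!r}?" if hint else ""
    raise ValueError(f"Unknown backend {name!r}.{suggestion}")


if __name__ == "__main__":
    try:
        validate_backend("KEARS")
    except ValueError as exc:
        print(exc)
```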
Challenge
1. Scale of the path problem — /home/wyd/ appears 36 times across 18 YAML files spanning all 6 cifar100 sub-examples. Every benchmarkingjob.yaml, testenv.yaml, and algorithm.yaml in the suite is affected. This is not a one-line fix. Any partial fix that addresses only one sub-example leaves the remaining five broken in exactly the same way and gives a false impression that progress has been made.
2. Wrong YAML keys — All testenv.yaml files use train_url/test_url as dataset configuration keys. The core Dataset class in dataset.py (lines 57–60, 157–158) reads train_index and test_index to resolve the actual data file paths. The _url fields do exist in the class, but they hold the resolved output URLs after path processing — not the input index file paths that the config should be providing. Using the wrong keys means the dataset indexing step is silently skipped and training proceeds with no data loaded, producing meaningless results with no error message to indicate what went wrong.
3. AttributeError in basemodel.py — The predict() method calls data.x, written with the assumption that the input is a Dataset object that supports attribute access. In practice, the federated learning paradigm in ianvs passes a plain Python list to predict(). This causes an AttributeError every time the prediction loop runs, regardless of how correctly the YAML configuration is set up. It is a fundamental code-level incompatibility between the example and the paradigm it is designed to use.
4. TensorFlow version sensitivity — No TensorFlow version is pinned anywhere in the example. TF 2.10.0 triggers a known protobuf version conflict that prevents installation from completing cleanly. TF 2.x on Python 3.10+ has further API compatibility issues. There is no documentation in the repository — not in the README, not in a comment, not in a requirements file — about which combination of TF version, Keras version, and Python version is known to produce a working environment for cifar100.
5. No README — The federated_learning/fedavg/ directory has no README.md file. There are no setup instructions, no description of what the example does, no list of prerequisites, and no description of expected output that a contributor could use to verify whether a run completed correctly. Without a README, a new contributor has no starting point and no way to validate their progress.
Impact
| Scope | Impact |
|---|---|
| cifar100 | The hardcoded paths affect all 18 YAML files across 6 sub-examples. Fixing only fedavg still leaves fci_ssl/, federated_class_incremental_learning/, and glfc/ with the same /home/wyd/ paths baked in. Any fix must be applied comprehensively across all 18 YAML files, or it is incomplete and the sub-examples that were not fixed remain broken in exactly the same way. |
| ianvs core | The wrong YAML keys (train_url/test_url vs train_index/test_index) reveal a gap between what the core Dataset class expects as input and what the example configuration templates actually provide. Any future contributor who uses these YAML files as a template for a new example will unknowingly inherit this silent bug and reproduce it in their own work. |
| New contributors | This issue was filed by an LFX applicant who invested significant debugging time just to get the example to partially run, and still could not get it to complete. The cifar100 suite is described as the primary benchmark for FCIL in ianvs, which means contributors expect it to be a working reference implementation. In its current state it provides zero onboarding value and leaves contributors with the impression that the project is unmaintained. |
Pull Requests
| PR | Title | State | Fixes #252? |
|---|---|---|---|
| #420 | fix(examples): replace hardcoded developer paths across all cifar100 configs | Open | Yes: replaces all /home/wyd/ occurrences across all 18 YAML files in all 6 sub-examples with portable relative or environment-variable-based paths |
| #354 | fix(examples): add missing requirements.txt for robot and cifar100 | Open | Partially: addresses the missing dependency layer only; does not touch path bugs or code-level crashes |
| #421 | fix: add missing core dependencies and macOS troubleshooting guide | Open | Partially: fixes the core install chain that blocks reaching this example at all, but does not address the YAML or code bugs within it |
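To illustrate what "portable relative or environment-variable-based paths" can mean in practice, the sketch below expands environment variables in a configured path and resolves relative paths against the directory that holds the config file. It is not taken from PR #420, and whether the ianvs config loader performs such expansion is not stated in the issue; resolve_path and the variable name IANVS_DATA_ROOT are hypothetical.

```python
import os
from pathlib import Path


def resolve_path(raw, config_dir):
    """Expand environment variables, then resolve relative paths against config_dir."""
    expanded = Path(os.path.expandvars(raw))
    if expanded.is_absolute():
        return expanded
    return Path(config_dir) / expanded


if __name__ == "__main__":
    # Hypothetical variable name used only for this demonstration.
    os.environ.setdefault("IANVS_DATA_ROOT", "/tmp/ianvs-data")
    print(resolve_path("$IANVS_DATA_ROOT/cifar100/train.txt", "examples/cifar100"))
    print(resolve_path("dataset/train.txt", "examples/cifar100"))
```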
Remaining Bugs With No Open PR
The following 4 bugs exist in the codebase today and will remain after all current open PRs are merged, unless they are explicitly addressed by new PRs:
| Bug | Location | Description |
|---|---|---|
| Wrong YAML dataset keys | All testenv.yaml files | train_url/test_url used instead of train_index/test_index; the dataset index is never resolved, and training silently proceeds with no data |
| AttributeError in predict() | federated_learning/fedavg/algorithm/basemodel.py | data.x assumes a Dataset object but receives a plain Python list; crashes the prediction loop every time |
| BACKEND_TYPE typo | federated_learning/fedavg/algorithm/basemodel.py | "KEARS" should be "KERAS"; causes a silent backend mismatch at model initialization |
| kwargs key mismatch | federated_class_incremental_learning | kwargs.get("lr") does not match the learning_rate key name defined in the corresponding YAML config |
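For the kwargs key mismatch in the last row, one low-risk option is to read the hyperparameter under the name the YAML config uses and fall back to the short name for backward compatibility. This is a minimal sketch; read_learning_rate is a hypothetical helper and the 0.01 default is an assumption, not a value from the example.

```python
def read_learning_rate(kwargs, default=0.01):
    """Prefer the YAML key name `learning_rate`; fall back to the short name `lr`."""
    for key in ("learning_rate", "lr"):
        if kwargs.get(key) is not None:
            return float(kwargs[key])
    return default


if __name__ == "__main__":
    print(read_learning_rate({"learning_rate": "0.1"}))
    print(read_learning_rate({"lr": 0.5}))
    print(read_learning_rate({}))
```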
Category C — FL Paradigm Core Implementation
This category contains a single closed issue that is worth examining despite being closed. It is not directly related to the runtime bugs documented in Categories A and B, but it provides important context about the underlying FL paradigm implementation that the entire cifar100 suite depends on to function.
The issue is notable because it was opened and closed by the same person within 24 hours, with no explanation and no linked PR. This means the community has no record of what was actually done, when the FL paradigm implementation was written, or whether it has ever been tested against a successful end-to-end run. Given that the cifar100 examples — the only code that exercises the FL paradigm — have been broken by the bugs above and have therefore rarely if ever been run successfully, the practical correctness of the FL paradigm implementation remains unverified.
| # | Title | State |
| --- | --- | --- |
| #306 | Implement Federated Learning Paradigm Support | Closed |
Issue #306 — FL Paradigm Implementation (Closed)
Introduction
Filed on 2026-01-28, claiming that Federated Learning was documented as a supported ianvs paradigm but had no actual implementation in core/testcasecontroller/algorithm/paradigm/. The issue was closed by the original author 24 hours later on 2026-01-29. There is no comment in the thread explaining why it was closed. There is no linked PR, no commit reference, and no note saying "already implemented" or "fixed in commit X." The closure reason is completely opaque.
The FL paradigm does exist in the codebase today, which suggests the implementation was either already present when the issue was filed (making the issue incorrect from the start), or was added and the issue was closed silently without documentation. Either way, the lack of any record leaves the community without clarity on when the implementation was written or by whom.
| Class | File | Lines |
| --- | --- | --- |
| FederatedLearning | federated_learning.py | 353 |
| FederatedClassIncrementalLearning | federated_class_incremental_learning.py | 295 |

The FederatedClassIncrementalLearning class exists in core/testcasecontroller/algorithm/paradigm/federated_learning/ and is registered in core/__init__.py. The paradigm implementation is present in the codebase and will be loaded when ianvs initializes.

core/common/constant.py defines FEDERATED_CLASS_INCREMENTAL_LEARNING = "federatedclassincrementallearning". This is the string that all cifar100 YAML config files reference under paradigm_type. The mapping between the config string and the implementation class is in place — the wiring exists on paper.
However, the original broader vision of #306 remains largely unmet. More practically, the FL paradigm implementation has not been validated end-to-end. Because all the cifar100 examples that exercise it have been blocked by the bugs documented in this issue summarization, no contributor is known to have run them successfully. The code exists, but its correctness under real execution conditions is unverified.
What #306 Proposed vs What Exists Today
| Feature | Proposed in #306 | Currently Implemented |
| --- | --- | --- |
| FedAvg aggregation | Yes | Yes |
| FedProx aggregation | Yes | No |
| FedOpt aggregation | Yes | No |
| SCAFFOLD aggregation | Yes | No |
| Fairness metrics | Yes | No |
| Communication cost modeling | Yes | No |
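For reference, FedAvg, the only aggregation rule in the table marked as implemented, is a sample-size-weighted mean of the client models. The sketch below illustrates that rule on flat lists of floats. It is not the ianvs or Sedna implementation, and the function name fedavg and its argument names are assumptions made for this example.

```python
def fedavg(client_weights, client_sizes):
    """Aggregate client weight vectors by a sample-size-weighted average.

    client_weights: one flat list of floats per client, all the same length.
    client_sizes:   number of local training samples held by each client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need exactly one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    n_params = len(client_weights[0])
    aggregated = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        if len(weights) != n_params:
            raise ValueError("all clients must report the same number of parameters")
        # Each client contributes in proportion to its share of the data.
        for i, w in enumerate(weights):
            aggregated[i] += w * size / total
    return aggregated
```

A client holding three times as many samples as another pulls the aggregate three times as hard. FedProx, FedOpt, and SCAFFOLD change what clients optimize or how the server applies updates, so they would need more than a different averaging formula.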
Impact
| Scope | Impact |
| --- | --- |
| cifar100 | All 7 sub-examples use paradigm_type: federatedclassincrementallearning in their YAML configs. The FederatedClassIncrementalLearning class is the load-bearing foundation of the entire cifar100 benchmark. Without it, every sub-example fails at paradigm lookup before any training logic is reached. Its existence is necessary but not sufficient: it must also be correct. |
| ianvs core | The FL paradigm classes are registered in core/__init__.py alongside SingleTaskLearning, IncrementalLearning, and LifelongLearning. Any regression introduced into these classes would silently break all cifar100 examples. The current CI pipeline has no validation that exercises any FL paradigm logic, so such a regression could persist across many releases without detection. |
| KubeEdge ecosystem | Ianvs's FL paradigm is the connection point between Sedna's infrastructure-level aggregation interfaces and the ianvs benchmarking layer. Without a verified, correctly functioning FL paradigm in ianvs, it is not possible to benchmark or compare FL approaches on the KubeEdge-Sedna stack, which undermines the stated value proposition of the combined architecture. |
Key insight: The question is not whether the FL paradigm class exists in the codebase. It does. The question is whether it works correctly end-to-end once the blocking bugs in Categories A and B are resolved and the cifar100 examples can actually be run. Because no contributor is known to have successfully run a cifar100 sub-example, the FL paradigm implementation has not been validated against real execution. The silent 24-hour closure of #306 means this uncertainty has existed since the issue was first filed, and no one has had reason to re-examine it since.
Category D — CI/CD & Automated Validation
This category addresses the underlying structural reason why all the bugs across Categories A, B, and C were able to reach main and remain there undetected across multiple releases. The issues in the previous categories are symptoms. The absence of example validation in CI is the root condition that allowed those symptoms to accumulate silently.
Every bug documented in this issue summarization — the missing requirements.txt, the absent colorlog and PyYAML, the hardcoded /home/wyd/ paths, the wrong YAML dataset keys, the KEARS typo, the algorithms:conda syntax error — passed through the CI pipeline with a green passing checkmark. CI never flagged any of them, because CI never looked at any example file. It only linted core/ Python code. From CI's perspective, every PR that introduced or left in place these bugs was perfectly valid and safe to merge.
| # | Title | State |
| --- | --- | --- |
| #411 | Modernize CI Pipeline: Update Python Matrix, Actions, Add Example Validation | Open |
Issue #411 — Broken CI Pipeline Allows All Bugs to Merge Silently
Introduction
Issue #411 identifies 4 structural problems in .github/workflows/main.yaml that together mean the CI pipeline provides no meaningful validation of whether the repository is in a working state for anyone who wants to run an example. These are not minor oversights or outdated configurations that happen to still work. They are gaps that actively prevent CI from catching the exact category of bugs that have accumulated in the cifar100 suite:
| Problem | Detail |
| --- | --- |
| EOL Python versions | Tests against 3.7, 3.8, 3.9; all three are end-of-life and no longer receive security updates |
| Deprecated Actions | actions/checkout@v3 and actions/setup-python@v3 use a deprecated Node.js 16 runtime that GitHub has formally retired |
| Permanently pinned pip | pip==24.0 is hardcoded in the workflow with no mechanism to update it |
| Zero example validation | CI never parses, lints, imports, or runs any file under examples/; the entire example layer is invisible to CI |
As concrete, directly verifiable evidence of the last point: a confirmed YAML syntax error has existed on line 17 of examples/cifar100/fci_ssl/fedavg/benchmarkingjob.yaml on the main branch for an unknown period of time. The error would be caught by any YAML parser in a single line of CI script. It has never been flagged, because that single line of script was never added.

Every CI run currently produces a Node.js 16 deprecation warning from actions/checkout@v3, regardless of what code was changed in the PR. This creates a persistent background level of warning noise on every single run. When warnings appear on every run unconditionally, they become invisible — reviewers learn to ignore them, and genuine new warnings get lost in the noise.

The current main.yaml tests against Python 3.7, 3.8, and 3.9 — all of which are end-of-life — and runs only pylint on the core/ directory. There is no step that touches any file under examples/. The entire example layer of the repository has zero automated validation coverage.

Line 17 of examples/cifar100/fci_ssl/fedavg/benchmarkingjob.yaml currently reads algorithms:conda — the word conda was accidentally appended to the YAML key name, making the key invalid and the entire YAML file unparseable. This error exists on main today. Running yaml.safe_load() on this file throws a ScannerError immediately. It has never been caught by CI because CI has never run a YAML parser on any example config file.
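The kind of check that would have caught this error can be sketched as a small script that tries to parse every YAML file under a directory and reports the ones that fail. This is an illustration under stated assumptions, not the validation job from any PR: the function name find_unparseable_yaml and its parse parameter are hypothetical, and the default parser assumes PyYAML is installed.

```python
from pathlib import Path


def find_unparseable_yaml(root, parse=None):
    """Return {path: first line of the error} for YAML files that fail to parse.

    `parse` is any callable that takes the file text and raises on invalid
    input. It defaults to PyYAML's yaml.safe_load, imported lazily so that a
    different parser can be injected.
    """
    if parse is None:
        import yaml  # PyYAML
        parse = yaml.safe_load

    failures = {}
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in {".yaml", ".yml"} or not path.is_file():
            continue
        try:
            parse(path.read_text(encoding="utf-8"))
        except Exception as exc:  # collect every failure instead of stopping at the first
            message = str(exc) or type(exc).__name__
            failures[str(path)] = message.splitlines()[0]
    return failures
```

A CI step could call find_unparseable_yaml("examples") and fail the job whenever the returned mapping is non-empty, so that one run lists every broken config rather than only the first.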
Challenge
1. Confirmed YAML syntax error on main
- algorithms:conda # 'conda' appended by mistake — entire config unparseable by any YAML parser
+ algorithms:
A single line of CI script would catch this: python -c "import yaml; yaml.safe_load(open('benchmarkingjob.yaml'))". That line has never been added to the pipeline, so the error has sat undetected on main for an unknown period of time.
2. EOL Python matrix conflicts with cifar100 requirements
Keras 3.x dropped support for Python versions below 3.9, and TensorFlow dropped Python 3.7 support after TF 2.11. Of the three Python versions currently tested, only 3.9 meets these minimums. If a maintainer added cifar100 to the CI test matrix today, the 3.7 and 3.8 jobs would fail on every run, producing a permanently red CI for a reason unrelated to the actual correctness of the code. The Python matrix must be updated before example validation can be added without introducing permanent false failures.
3. Zero example validation across the entire examples layer
CI runs pylint on core/. That is the only validation step. It does not check whether import ianvs succeeds after installation, does not parse any YAML config file in examples/, does not check Python syntax in any example basemodel.py, and does not run any cifar100 sub-example even in a minimal dry-run mode. Every bug from #351, #413, #327, and #252 — missing packages, hardcoded paths, wrong YAML keys, the KEARS typo — merged through CI with a passing green checkmark and reached main without any automated system raising a concern.
4. Deprecated Actions noise on every run
The Node.js 16 deprecation warnings generated by actions/checkout@v3 on every single CI run — whether or not any Actions-related code changed — create a constant background level of noise that desensitizes reviewers to CI warnings in general. When warnings are always present, they stop being informative.
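Two of the gaps listed in item 3, unchecked Python syntax in example code and hardcoded machine-specific paths in configs, can be covered with the standard library alone. The sketch below is one possible shape for such checks. The function names are hypothetical, and the rule that any /home/<user>/ path in a YAML file is suspicious is an assumption made for this example.

```python
import ast
import re
from pathlib import Path

# Assumption: an absolute path under /home/<user>/ in a config is machine-specific.
HOME_PATH = re.compile(r"/home/[^/\s]+/")


def check_python_syntax(root):
    """Return {path: message} for every .py file under `root` that does not parse."""
    errors = {}
    for path in sorted(Path(root).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors


def find_hardcoded_home_paths(root):
    """Return (path, line number, stripped line) for YAML lines containing /home/<user>/."""
    hits = []
    for path in sorted(Path(root).rglob("*.yaml")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if HOME_PATH.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

Neither check imports TensorFlow, Keras, or Sedna, so both could run on any Python version in the matrix. They only parse files; they would not catch runtime bugs such as the KEARS typo or the wrong dataset keys, which need an actual dry run.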
Impact
| Scope | Impact |
| --- | --- |
| cifar100 | fci_ssl/fedavg is completely unparseable today due to the algorithms:conda typo; it cannot even load its benchmark job configuration. Every other cifar100 sub-example carries the same undetected risk. Any YAML-level mistake introduced by any PR would merge silently and stay on main indefinitely. |
| All examples | Every PR that modifies any file under examples/ merges without any automated validation. CI's green checkmark communicates false confidence: it means only that the core/ Python code passed a linter, not that any example in the repository can actually be run. |
| LFX contributors and new developers | Every applicant and new developer who attempts to run cifar100 discovers these bugs manually through trial and error, spending time and effort that could be spent on actual contributions. A CI pipeline that validates examples would have surfaced all of these bugs automatically at the PR stage, kept the examples in a verifiably runnable state, and made the entire manual restoration effort described in this issue summarization unnecessary. |
Pull Requests
| PR | Title | State | Fixes #411? |
| --- | --- | --- | --- |
| #412 | ci: modernize CI pipeline and add example validation | Open | Yes. Direct fix: updates Python matrix to 3.9–3.12, upgrades all actions to v4/v5, adds an example YAML validation job, and fixes the algorithms:conda typo in fci_ssl/fedavg/benchmarkingjob.yaml |
What PR #412 Changes
| File | Change | Description |
| --- | --- | --- |
| .github/workflows/main.yaml | +79 / -11 | New supported Python version matrix, updated Actions versions, new example YAML validation job added as a separate CI step |
| examples/cifar100/fci_ssl/fedavg/benchmarkingjob.yaml | +1 / -1 | Fixes the algorithms:conda key typo to algorithms: |
| setup.py | +3 / -2 | Updates python_requires from >=3.6 to >=3.9 to match the new CI matrix and the actual runtime requirements of the examples |
Current: ["3.7", "3.8", "3.9"] (all end-of-life, no longer maintained)
Proposed: ["3.9", "3.10", "3.11", "3.12"] (3.10, 3.11, 3.12 actively maintained)
Key insight: Merging the fixes from Categories A and B without also merging #412 treats the symptoms without addressing the cause. The same class of bugs — missing dependencies, broken configs, wrong keys, typos in source code — will re-accumulate on main over time as new PRs are merged, because nothing in the pipeline will catch them. CI validation is the only change in this entire list that makes the repository self-correcting going forward. It is the fix that prevents all the other fixes from needing to be made again.
Issue Summarization —
examples/cifar100Background
The cifar100 example suite is the primary benchmark in KubeEdge Ianvs for evaluating Federated Class Incremental Learning (FCIL) combined with Semi-Supervised Learning (SSL). It uses the standard CIFAR-100 dataset (60,000 images across 100 classes, split into distributed text index files to simulate edge clients) and is currently the only suite in the repository that exercises the
federatedclassincrementallearningcore paradigm. It spans 7 runnable sub-examples across 4 sub-directories, ranging from a simple FedAvg baseline all the way to advanced semi-supervised algorithms like FediCarl and GLFC.This suite sits at a critical intersection in the repository: it is simultaneously the most technically ambitious example and the most broken one. Multiple contributors — LFX applicants, independent researchers, and developers new to the project — have independently attempted to run it and failed, each at a different stage of the same cascading chain of failures. The issues documented here are not theoretical concerns raised in code review. They are concrete, reproducible bugs discovered by real people on real machines, following the official instructions exactly and hitting walls that the documentation gives no guidance on.
The fact that so few contributors have successfully run any cifar100 sub-example end-to-end is a direct consequence of the layered failures described below. Each layer of breakage acts as a filter: those who make it past the dependency errors hit the hardcoded path errors; those who fix the paths hit the wrong YAML key mismatches; those who get past that hit runtime crashes in the model code itself. Because the example has almost never been run by anyone outside the original author, the codebase has not received the kind of hands-on validation that would surface and fix these issues organically — and they have quietly accumulated, undetected, across releases.
The 4 Sub-Directories at a Glance
1.
federated_learning/— Baseline FedAvgCameras train a local CNN on manually labeled images (cars, trucks) and sync only their updated model weights — not raw video — to a central aggregation server each night. The server averages all client updates using standard FedAvg and pushes back a smarter master model. No incremental learning, no unlabeled data, no SSL. This is the intended starting point for any new contributor trying to understand how the cifar100 suite works.
2.
federated_class_incremental_learning/— CIL UpgradeSix months later, new object classes (scooters, delivery robots) appear in the city. The cameras have deleted their original training data to save storage space on their constrained hardware. A standard AI retrained on scooters would overwrite its memory of cars — the catastrophic forgetting problem. CIL solves this by expanding the classifier head (
tf.pad) with new output neurons dedicated to the new classes, while freezing the existing neural pathways for cars and trucks. The model grows without forgetting, and it never needs the deleted original data again.3.
fci_ssl/— Cutting Edge Semi-Supervised LearningEngineers now want to track food trucks, but only have time to manually label 50 images. Meanwhile, the cameras capture tens of thousands of unlabeled frames every day. This sub-directory applies SSL algorithms (FediCarl, GLFC) to generate pseudo-labels — confident guesses — for the unlabeled footage, using those 50 labeled images as a seed. The camera teaches itself how food trucks look from different angles, in rain, and partially obscured, without waiting for more human annotations. This is the most algorithmically complex sub-directory and the one most actively under research.
4.
sedna_federated_learning/— Production DeploymentThe first three sub-directories simulate the entire FL pipeline on a single developer machine. This one is different. The
train_workerprocess runs on a low-power hardware chip physically bolted to a traffic pole in outdoor conditions. Theaggregation_workerruns on a Kubernetes cluster in a data center miles away. They communicate asynchronously over a real 5G network using Sedna's production interfaces. This is not a benchmark simulation — it is a live end-to-end test of the entire KubeEdge-Sedna production stack under real distributed hardware constraints.The image above shows the full ianvs benchmark architecture. The cifar100 example suite sits at the top example layer. Running a benchmark job invokes the
federatedclassincrementallearningparadigm in the core layer, which in turn calls Sedna's aggregation and dataset interfaces at the infrastructure layer. Every issue documented in this summarization blocks some stage of this vertical call chain — either preventing the tool from starting at all, preventing the example config from loading, or crashing the training loop mid-execution.Table of Contents
Category A — Missing Dependencies & Broken Requirements
These three issues are closely related and were filed independently by different contributors who each hit the same wall without any coordination between them. They represent three separate attempts, by three separate people, to follow the official setup instructions — and three separate failures at the exact same point. The existence of three independent reports within a relatively short window is itself meaningful: it is not possible to dismiss this as a user error or a platform quirk when unrelated contributors on different machines reproduce the exact same sequence of crashes.
What makes this category particularly frustrating from a contributor's perspective is that
pip install -r requirements.txtcompletes successfully and exits with code0. There are no warnings, no errors, no indication that anything is wrong. A contributor who follows the documentation step-by-step has every reason to believe their environment is ready — until the moment they try to run something and are met with a crash that gives no hint that the installation step was the problem.requirements.txtinexamples/cifar100/—tensorflow,keras,sednaabsentrequirements.txtblocks installationIssue #351 — Missing
examples/cifar100/requirements.txtIntroduction
examples/cifar100/has norequirements.txtfile of its own. The three packages it critically depends on —tensorflow,keras, andsedna— are completely absent from the rootrequirements.txtas well. Any contributor who clones the repository and follows the standard setup instructions will encounterModuleNotFoundErrorbefore a single line of cifar100 training code runs.This issue is not about a misconfigured environment or an unusual machine setup. It is a structural gap: the file that should declare what cifar100 needs simply does not exist. Every contributor who has attempted to run cifar100 — regardless of their system, operating system, or Python version — has hit this exact error at the exact same point. There is no workaround short of manually discovering and installing the missing packages, which requires understanding the codebase well enough to identify what is needed — knowledge a new contributor cannot reasonably be expected to have before running the example for the first time.
The screenshot above confirms the exact crash: Python cannot find
tensorflow,keras, orsednabecause none of them are installed anywhere in the environment. This error occurs at the very first import statement, before any cifar100 training logic, dataset loading, or paradigm initialization begins. The crash is deterministic and identical regardless of the machine or operating system.Challenge
1. Silent failure — The root
requirements.txtinstalls successfully — it only listsprettytable,scikit-learn,numpy,pandas,tqdm,matplotlib, andonnx. The install exits with code0and no warnings. A contributor who has just runpip install -r requirements.txtand seen it complete cleanly has no reason to suspect that critical packages are missing. The problem only reveals itself at runtime when Python throws the error, by which point the contributor may have spent significant time on other parts of the setup process.2. Version sensitivity — cifar100 does not just need
tensorflowto be installed. It needs a specific, mutually compatible ecosystem: TensorFlow 2.x + Keras 3.x + Sedna. These three must be version-compatible with each other and with the Python version being used. A plainpip install tensorflow keras sednawithout pinned versions will install whatever PyPI considers the latest, which risks incompatible releases that break in subtle and hard-to-diagnose ways — protobuf conflicts, missing class attributes, changed API signatures.3. Cross-platform complexity — TensorFlow has different installation paths for CPU-only machines versus GPU machines, and it does not support Windows natively. A single generic
requirements.txtentry fortensorflowwill not work correctly on all platforms. Properly handling this requires platform-specific installation guidance that does not currently exist anywhere in the repository.4. Scope — All 7 cifar100 sub-examples share this one missing file. Creating it fixes all of them in a single change. However, at least one sub-example has an additional dependency:
glfcrequiresh5pyfor reading and writing.weights.h5model checkpoint files. A blanket fix must account for these sub-example-specific needs as well, otherwise some sub-examples will still fail at a later stage even after the shared file is created.Impact
federatedclassincrementallearningparadigm cannot be tested at any level — not the simplest baseline, not the most advanced SSL variant.sednais a shared dependency used bysedna_federated_learning/and any other example that uses Sedna'sClassFactoryor aggregation interfaces. Its absence from the rootrequirements.txtmeans those examples are also broken for a fresh install, not just cifar100.Pull Requests
fix(examples): add missing requirements.txt for robot and cifar100Fixes #351, but does not pin versions, so version compatibility issues may still surface after installfix: add missing core dependencies and macOS troubleshooting guiderequirements.txtSuggested Complete Fix
Issue #413 — Missing
colorlogandPyYAMLin Rootrequirements.txtIntroduction
The root
requirements.txtis missingcolorlogandPyYAML, both of which are imported unconditionally at module level in ianvs core:core/common/log.pyline 18 →import colorlogcore/common/utils.pyline 24 →import yamlA clean install following the official guide —
pip install -r requirements.txtfollowed bypip install -e .— produces no errors and no warnings. But every single ianvs command then crashes immediately, before any example code runs, before any paradigm is loaded, before any configuration is parsed. The process exits in the core logging initialization step.This issue is importantly distinct from #351. Issue #351 blocks cifar100 specifically — a contributor trying a different example would not hit it. This issue blocks the entire ianvs framework from starting. It does not matter which example a contributor wants to run, or how correctly they have configured everything else. The tool crashes before getting anywhere near example-level code, and the stack trace shows no connection to any example, making it genuinely difficult to trace the cause without reading the source files directly.
The crash as it appears to the contributor:
ModuleNotFoundError: No module named 'colorlog', with a traceback pointing tocore/common/log.pyline 18. There is no mention of cifar100, no mention of any example, and no suggestion that installing a package will fix it. Without readinglog.pydirectly, this error gives no useful signal about what went wrong or how to fix it.Challenge
1. Deceptive silent install —
pipexits with code0and produces zero output suggesting a problem. A contributor who has spent time carefully following the documentation, setting up a virtual environment, and running the install steps has no mechanism to discover that two packages were not installed until the moment they try to run ianvs and see it crash. By that point, the assumption is typically that the crash is caused by something in the example or the paradigm — not by the install step that appeared to complete successfully.2. Core-level crash, not example-level — Unlike #351, which is scoped to cifar100, this bug sits in
core/common/— the very first module ianvs loads on startup. It blocks every single example in the repository without exception. A contributor trying to runbdd,pcb-aoi,robot, or any other example would hit the exact same crash before getting anywhere near their target example. This makes it arguably the highest-priority dependency bug in the entire codebase — it is not an example-specific issue, it is a framework startup issue.3. No guard in the source — Both failing imports are bare, unconditional module-level statements:
Impact
log.pyfirst. Even if a contributor has correctly installed all cifar100-specific packages, this bug blocks execution before anything cifar100-related runs.bdd,pcb-aoi,MOT17,robot,cifar100— is broken for any contributor doing a fresh install following the documentation. This is not an edge case; it affects 100% of new contributors on their first day.main.yamlonly verifies thatpip install -e .exits with code0. It never runs a single ianvs command after installation. This means the bug is completely invisible to automated testing and will persist across every release until it is explicitly fixed.Pull Requests
deps: add missing colorlog and PyYAML to requirements.txtrequirements.txtfor long-term maintainabilityfix: add missing core dependencies and macOS troubleshooting guidecolorlogandPyYAMLamong a broader set of fixesWhat PR #414 Changes
Issue #363 — Independent Reproduction of Cascading Install Failure
Introduction
Filed by
Adityakk903111 days after #327, independently confirming the same cascading failure chain on a different machine. These two contributors did not coordinate with each other. They did not reference each other's issues. They simply both tried to follow the official install guide and encountered the exact same sequence of failures in the exact same order.The significance of this independent reproduction goes beyond just confirming the bugs exist. It rules out the possibility that #327 was caused by something unusual about that specific machine or environment. When two unrelated contributors on separate machines hit the identical 4-step error chain within 11 days of each other — without any communication — the only reasonable interpretation is that this failure is deterministic and affects every new install. There is no way to follow the official setup instructions and avoid these errors. Following the official guide produces 4 sequential
ModuleNotFoundErrorcrashes, each one only visible after manually fixing the previous:Challenge
1. Confirms 100% fresh-install failure rate — Two uncoordinated contributors, two different machines, the same 4-step error chain, 11 days apart. This rules out environment-specific quirks and confirms that the broken
requirements.txtis a universal, deterministic problem. Any contributor who follows the documented setup process will hit these errors. There are no exceptions.2. One error at a time — no way to see ahead — Each missing package only surfaces after the previous one is fixed manually. A contributor cannot discover all missing packages in a single pass. They fix the first error, re-run, get the second error, fix that, re-run, get the third, and so on. At no point does the tooling tell them how many more errors are waiting. This one-at-a-time discovery loop is slow, frustrating, and provides no indication of how far from a working environment the contributor is at any given moment.
3. Version pin too loose in the proposed fix — Both #327 and #363 propose adding
sedna>=0.4.0torequirements.txtas the fix for the fourth error. However, the actual ianvs codebase requiressednato expose theJsonlDataParseclass insedna.datasources, which is absent from older PyPI releases. A plainpip install sednawith>=0.4.0may install a version that does not includeJsonlDataParse, causing another failure downstream. The proposed fix from both issues is therefore incomplete, and a tighter version pin is needed.4. Potentially more undiscovered dependencies — Both issues include the note that there may be more missing packages beyond the four documented. Further packages required by Sedna's internal communication layer — such as
requests,grpcio, orprotobuf— could surface once the first four errors are resolved. The full list of missing dependencies may not yet be known.Impact
sedna,colorlog, andyamlall fail before any FCIL logic is reached. Even a contributor who is determined to push through every error manually will not reach the first line of cifar100 training code without multiple rounds of manual package discovery and installation.#363 vs #327 — Key Differences
shivam8415)Adityakk9031)Category B — Hardcoded Paths & Configuration Failures
A contributor who successfully resolves all Category A dependency issues, gets ianvs installed correctly, and finally runs a cifar100 benchmark job will immediately encounter a completely different class of failure. The YAML configuration files that define each benchmark job contain file paths that are hardcoded to the original developer's personal machine — specifically, paths beginning with
/home/wyd/ianvs/...which point to directories and files that exist on one specific laptop and nowhere else.This is not a subtle misconfiguration. It is an immediate, hard crash that fires before any training logic executes, with a
FileNotFoundErrorthat clearly identifies the non-portable path. The fact that these paths reachedmainand remained there through multiple releases indicates that the YAML config files were written once on a single machine and never tested anywhere else. No other contributor successfully ran them, which means no one had the opportunity to catch the issue through normal use./home/wyd/paths + 4 additional runtime bugs in cifar100Issue #252 — Hardcoded Paths + 4 Runtime Bugs
Introduction
Issue #252 was filed by an LFX 2025 Term-3 applicant who pushed through all the Category A dependency errors and attempted to actually run
examples/cifar100/federated_learning/fedavg. What they found was not a single bug but 5 compounding bugs, each one layered on top of the previous, each only visible after the prior one is resolved. This is the most thorough and detailed bug report in the cifar100 suite. It documents exactly how far the example is from being runnable end-to-end, even after all dependency issues are resolved.The fact that this level of analysis was produced by an LFX applicant — someone applying for a mentorship, not a maintainer — underscores how broken the onboarding experience is. A contributor new to the project should be able to follow the README, run a baseline example, and understand what it does. Instead, they encountered a series of compounding failures that required reading source code, understanding the relationship between YAML configs and the Dataset class, and debugging a runtime crash in the model's predict loop. Even after all that work, the example still could not be run to completion.
Even after resolving all dependency issues from Category A, this cascade of bugs blocks execution completely:
- Running benchmarking.py with a cifar100 config immediately throws FileNotFoundError. All 18 YAML files across all 6 sub-examples contain hardcoded references to /home/wyd/ianvs/..., a path that exists only on the original developer's machine. This is the first error a contributor sees after successfully completing all dependency installation steps.
- testenv.yaml specifies dataset paths using train_url and test_url keys. The core Dataset class in dataset.py reads train_index and test_index to resolve the actual file paths; train_url and test_url exist in the class but serve a different purpose (they hold resolved URLs after processing). Using the wrong keys means the dataset is silently never loaded. No error is raised; training proceeds as if the dataset is empty.
- The predict() method in basemodel.py calls data.x, assuming the input is a Dataset object with attribute access. In the federated learning paradigm, ianvs actually passes a plain Python list. This causes an AttributeError in the prediction loop, crashing execution even after the path errors and YAML key mismatches are corrected. It is a code-level bug independent of configuration.
- TF 2.10.0 triggers a protobuf version conflict during package installation. No TensorFlow version is pinned anywhere in the cifar100 example, and there is no documentation in the repository about which TF version, Keras version, and Python version are known to be compatible with each other. Contributors are left to discover compatible combinations through trial and error.
- BACKEND_TYPE = "KEARS" in basemodel.py is a clear typo for "KERAS". This string is used to select the model backend at initialization time. The wrong string means the backend is never correctly identified, causing a silent mismatch that affects how the model is loaded and run. No explicit error is thrown at this line, making it difficult to trace.
Challenge
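For the data.x failure in the bug list above, one defensive option is to normalize the input before the prediction loop touches it. The helper below is a minimal sketch, not the fix adopted in the repository: extract_inputs is a hypothetical name, and the only assumption taken from the bug report is that predict() may receive either an object exposing .x or a plain Python list.

```python
from types import SimpleNamespace


def extract_inputs(data):
    """Return the samples from a Dataset-like object or from a plain sequence.

    Hypothetical helper: objects exposing an ``x`` attribute are unwrapped,
    anything else (for example a plain list) is returned unchanged.
    """
    if hasattr(data, "x"):
        return data.x
    return data


# Both call styles described in the bug report yield the same samples.
as_object = SimpleNamespace(x=["img_0.png", "img_1.png"])
as_list = ["img_0.png", "img_1.png"]
same_samples = extract_inputs(as_object) == extract_inputs(as_list)
```

A predict() implementation could call such a helper on its first line, so the rest of the loop no longer depends on which type the paradigm passes in.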
1. Scale of the path problem: /home/wyd/ appears 36 times across 18 YAML files spanning all 6 cifar100 sub-examples. Every benchmarkingjob.yaml, testenv.yaml, and algorithm.yaml in the suite is affected. This is not a one-line fix. Any partial fix that addresses only one sub-example leaves the remaining five broken in exactly the same way and gives a false impression that progress has been made.
2. Wrong YAML keys: All testenv.yaml files use train_url/test_url as dataset configuration keys. The core Dataset class in dataset.py (lines 57–60, 157–158) reads train_index and test_index to resolve the actual data file paths. The _url fields do exist in the class, but they hold the resolved output URLs after path processing, not the input index file paths that the config should be providing. Using the wrong keys means the dataset indexing step is silently skipped and training proceeds with no data loaded, producing meaningless results with no error message to indicate what went wrong.
3. AttributeError in basemodel.py: The predict() method calls data.x, written with the assumption that the input is a Dataset object that supports attribute access. In practice, the federated learning paradigm in ianvs passes a plain Python list to predict(). This causes an AttributeError every time the prediction loop runs, regardless of how correctly the YAML configuration is set up. It is a fundamental code-level incompatibility between the example and the paradigm it is designed to use.
4. TensorFlow version sensitivity: No TensorFlow version is pinned anywhere in the example. TF 2.10.0 triggers a known protobuf version conflict that prevents installation from completing cleanly. TF 2.x on Python 3.10+ has further API compatibility issues. There is no documentation in the repository (not in the README, not in a comment, not in a requirements file) about which combination of TF version, Keras version, and Python version is known to produce a working environment for cifar100.
5. No README: The federated_learning/fedavg/ directory has no README.md file. There are no setup instructions, no description of what the example does, no list of prerequisites, and no description of expected output that a contributor could use to verify whether a run completed correctly. Without a README, a new contributor has no starting point and no way to validate their progress.
Impact
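Whether a path fix really covers the whole suite can be checked mechanically. The following is a minimal sketch under stated assumptions: find_hardcoded_paths is a hypothetical helper, the prefix is the one reported in Issue #252, and the configs are assumed to carry a .yaml extension.

```python
from pathlib import Path

HARDCODED_PREFIX = "/home/wyd/"  # developer-specific prefix reported in the issue


def find_hardcoded_paths(root, prefix=HARDCODED_PREFIX):
    """Map each YAML file under ``root`` to the number of times ``prefix`` occurs in it.

    Files without any occurrence are left out of the result.
    """
    hits = {}
    for path in sorted(Path(root).rglob("*.yaml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        count = text.count(prefix)
        if count:
            hits[str(path)] = count
    return hits
```

Calling find_hardcoded_paths("examples/cifar100") from the repository root and summing the values gives a single number to compare before and after a fix; an empty result means no occurrence of the prefix remains in any sub-example.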
- A fix confined to fedavg still leaves fci_ssl/, federated_class_incremental_learning/, and glfc/ with the same /home/wyd/ paths baked in. Any fix must be applied comprehensively across all 18 YAML files, or it is incomplete and the sub-examples that were not fixed remain broken in exactly the same way.
- The mismatched YAML keys (train_url/test_url vs train_index/test_index) reveal a gap between what the core Dataset class expects as input and what the example configuration templates actually provide. Any future contributor who uses these YAML files as a template for a new example will unknowingly inherit this silent bug and reproduce it in their own work.
Pull Requests
- fix(examples): replace hardcoded developer paths across all cifar100 configs. Replaces the /home/wyd/ occurrences across all 18 YAML files in all 6 sub-examples with portable relative or environment-variable-based paths.
- fix(examples): add missing requirements.txt for robot and cifar100
- fix: add missing core dependencies and macOS troubleshooting guide
Remaining Bugs With No Open PR
The following 4 bugs exist in the codebase today and will remain after all current open PRs are merged, unless they are explicitly addressed by new PRs:
- testenv.yaml files: train_url/test_url used instead of train_index/test_index; the dataset index is never resolved, and training silently proceeds with no data
- AttributeError in predict() (federated_learning/fedavg/algorithm/basemodel.py): data.x assumes a Dataset object but receives a plain Python list; crashes the prediction loop every time
- BACKEND_TYPE typo (federated_learning/fedavg/algorithm/basemodel.py): "KEARS" should be "KERAS"; causes a silent backend mismatch at model initialization
- kwargs key mismatch (federated_class_incremental_learning): kwargs.get("lr") does not match the learning_rate key name defined in the corresponding YAML config
Category C — FL Paradigm Core Implementation
This category contains a single closed issue that is worth examining despite being closed. It is not directly related to the runtime bugs documented in Categories A and B, but it provides important context about the underlying FL paradigm implementation that the entire cifar100 suite depends on to function.
The issue is notable because it was opened and closed by the same person within 24 hours, with no explanation and no linked PR. This means the community has no record of what was actually done, when the FL paradigm implementation was written, or whether it has ever been tested against a successful end-to-end run. Given that the cifar100 examples — the only code that exercises the FL paradigm — have been broken by the bugs above and have therefore rarely if ever been run successfully, the practical correctness of the FL paradigm implementation remains unverified.
Issue #306 — FL Paradigm Implementation (Closed)
Introduction
Issue #306 was filed on 2026-01-28. It claimed that Federated Learning was documented as a supported ianvs paradigm but had no actual implementation in core/testcasecontroller/algorithm/paradigm/. The original author closed the issue 24 hours later, on 2026-01-29. There is no comment in the thread explaining why it was closed, no linked PR, no commit reference, and no note saying "already implemented" or "fixed in commit X." The closure reason is completely opaque.
The FL paradigm does exist in the codebase today, which suggests the implementation was either already present when the issue was filed (making the issue incorrect from the start), or was added and the issue was closed silently without documentation. Either way, the lack of any record leaves the community without clarity on when the implementation was written or by whom.
- FederatedLearning: federated_learning.py
- FederatedClassIncrementalLearning: federated_class_incremental_learning.py
The FederatedClassIncrementalLearning class exists in core/testcasecontroller/algorithm/paradigm/federated_learning/ and is registered in core/__init__.py. The paradigm implementation is present in the codebase and will be loaded when ianvs initializes.
core/common/constant.py defines FEDERATED_CLASS_INCREMENTAL_LEARNING = "federatedclassincrementallearning". This is the string that all cifar100 YAML config files reference under paradigm_type. The mapping between the config string and the implementation class is in place: the wiring exists on paper.
However, the original broader vision of #306 remains largely unmet. More practically, the FL paradigm implementation has never been successfully validated end-to-end. Because all the cifar100 examples that exercise it have been blocked by the bugs documented in this issue summarization, no contributor has run them successfully. The code exists, but its correctness under real execution conditions is unverified.
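The wiring between the paradigm_type string and the implementation class amounts to a string-to-class lookup. The sketch below is not the ianvs implementation: PARADIGM_REGISTRY and resolve_paradigm are hypothetical names, and the class is an empty placeholder. Only the constant's value is taken from the text. It illustrates what the lookup has to guarantee, including a loud failure when a config misspells the string.

```python
# Value quoted from core/common/constant.py in the text above.
FEDERATED_CLASS_INCREMENTAL_LEARNING = "federatedclassincrementallearning"


class FederatedClassIncrementalLearning:
    """Empty placeholder standing in for the real paradigm class."""


# Hypothetical registry; the real lookup lives in ianvs core and may differ.
PARADIGM_REGISTRY = {
    FEDERATED_CLASS_INCREMENTAL_LEARNING: FederatedClassIncrementalLearning,
}


def resolve_paradigm(paradigm_type):
    """Return the class registered for ``paradigm_type`` or raise a clear error."""
    try:
        return PARADIGM_REGISTRY[paradigm_type]
    except KeyError:
        known = ", ".join(sorted(PARADIGM_REGISTRY))
        raise ValueError(
            f"unknown paradigm_type {paradigm_type!r}; expected one of: {known}"
        ) from None
```

A lookup that raises on an unknown key turns a misspelled identifier into an immediate, traceable error instead of a silent mismatch like the one described for BACKEND_TYPE.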
What #306 Proposed vs What Exists Today
- FedAvg aggregation
- FedProx aggregation
- FedOpt aggregation
- SCAFFOLD aggregation
Impact
- All cifar100 sub-examples declare paradigm_type: federatedclassincrementallearning in their YAML configs. The FederatedClassIncrementalLearning class is the load-bearing foundation of the entire cifar100 benchmark. Without it, every sub-example fails at paradigm lookup before any training logic is reached. Its existence is necessary but not sufficient; it must also be correct.
- The FL paradigm classes are registered in core/__init__.py alongside SingleTaskLearning, IncrementalLearning, and LifelongLearning. Any regression introduced into these classes would silently break all cifar100 examples. The current CI pipeline has no validation that exercises any FL paradigm logic, so such a regression could persist across many releases without detection.
Category D — CI/CD & Automated Validation
This category addresses the underlying structural reason why all the bugs across Categories A, B, and C were able to reach main and remain there undetected across multiple releases. The issues in the previous categories are symptoms. The absence of example validation in CI is the root condition that allowed those symptoms to accumulate silently.
Every bug documented in this issue summarization (the missing requirements.txt, the absent colorlog and PyYAML, the hardcoded /home/wyd/ paths, the wrong YAML dataset keys, the KEARS typo, the algorithms:conda syntax error) passed through the CI pipeline with a green checkmark. CI never flagged any of them, because CI never looked at any example file. It only linted core/ Python code. From CI's perspective, every PR that introduced or left in place these bugs was perfectly valid and safe to merge.
Issue #411 — Broken CI Pipeline Allows All Bugs to Merge Silently
Introduction
Issue #411 identifies 4 structural problems in .github/workflows/main.yaml that together mean the CI pipeline provides no meaningful validation of whether the repository is in a working state for anyone who wants to run an example. These are not minor oversights or outdated configurations that happen to still work. They are gaps that actively prevent CI from catching the exact category of bugs that have accumulated in the cifar100 suite:
- The Python test matrix covers 3.7, 3.8, and 3.9; all three are end-of-life and no longer receive security updates
- actions/checkout@v3 and actions/setup-python@v3 use a deprecated Node.js 16 runtime that GitHub has formally retired
- pip==24.0 is hardcoded in the workflow with no mechanism to update it
- No step touches examples/; the entire example layer is invisible to CI
As concrete, directly verifiable evidence of the last point: a confirmed YAML syntax error has existed on line 17 of examples/cifar100/fci_ssl/fedavg/benchmarkingjob.yaml on the main branch for an unknown period of time. The error would be caught by any YAML parser in a single line of CI script. It has never been flagged, because that single line of script was never added.
Every CI run currently produces a Node.js 16 deprecation warning from actions/checkout@v3, regardless of what code was changed in the PR. This creates a persistent background level of warning noise on every single run. When warnings appear on every run unconditionally, they become invisible: reviewers learn to ignore them, and genuine new warnings get lost in the noise.
The current main.yaml tests against Python 3.7, 3.8, and 3.9, all of which are end-of-life, and runs only pylint on the core/ directory. There is no step that touches any file under examples/. The entire example layer of the repository has zero automated validation coverage.
Line 17 of examples/cifar100/fci_ssl/fedavg/benchmarkingjob.yaml currently reads algorithms:conda. The word conda was accidentally appended to the YAML key name, making the key invalid and the entire YAML file unparseable. This error exists on main today. Running yaml.safe_load() on this file throws a ScannerError immediately. It has never been caught by CI because CI has never run a YAML parser on any example config file.
Challenge
1. Confirmed YAML syntax error on main
A single line of CI script would catch this: python -c "import yaml; yaml.safe_load(open('benchmarkingjob.yaml'))". That line has never been added to the pipeline, so the error has sat undetected on main for an unknown period of time.
2. EOL Python matrix conflicts with cifar100 requirements
Keras 3.x dropped support for Python versions below 3.9. TensorFlow 2.x dropped Python 3.7 support after TF 2.11. This means that even if a maintainer wanted to add cifar100 to the CI test matrix today, runs on the interpreter versions those libraries no longer support would fail no matter how correct the example code is, leaving CI permanently red for a reason unrelated to the code under test. The Python matrix must be updated before example validation can be added without introducing permanent false failures.
3. Zero example validation across the entire examples layer
CI runs pylint on core/. That is the only validation step. It does not check whether import ianvs succeeds after installation, does not parse any YAML config file in examples/, does not check Python syntax in any example basemodel.py, and does not run any cifar100 sub-example even in a minimal dry-run mode. Every bug from #351, #413, #327, and #252 (missing packages, hardcoded paths, wrong YAML keys, the KEARS typo) merged through CI with a green checkmark and reached main without any automated system raising a concern.
4. Deprecated Actions noise on every run
The Node.js 16 deprecation warnings generated by actions/checkout@v3 on every single CI run, whether or not any Actions-related code changed, create a constant background level of noise that desensitizes reviewers to CI warnings in general. When warnings are always present, they stop being informative.
Impact
- fci_ssl/fedavg is completely unparseable today due to the algorithms:conda typo; it cannot even load its benchmark job configuration. Every other cifar100 sub-example carries the same undetected risk. Any YAML-level mistake introduced by any PR would merge silently and stay on main indefinitely.
- Every change to examples/ merges without any automated validation. CI's green checkmark communicates false confidence: it means only that the core/ Python code passed a linter, not that any example in the repository can actually be run.
Pull Requests
- ci: modernize CI pipeline and add example validation. Moves the Python matrix to 3.9–3.12, upgrades all actions to v4/v5, adds an example YAML validation job, and fixes the algorithms:conda typo in fci_ssl/fedavg/benchmarkingjob.yaml.
What PR #412 Changes
- .github/workflows/main.yaml (+79 / -11)
- examples/cifar100/fci_ssl/fedavg/benchmarkingjob.yaml (+1 / -1): corrects the algorithms:conda key typo to algorithms:
- setup.py (+3 / -2): raises python_requires from >=3.6 to >=3.9 to match the new CI matrix and the actual runtime requirements of the examples
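An example YAML validation job of the kind described above can be small. The script below is a minimal sketch and not the code in PR #412: validate_yaml_tree and main are hypothetical names, PyYAML is assumed to be installed in the CI environment, and the configs are assumed to be plain single-document YAML without custom tags.

```python
from pathlib import Path

import yaml  # PyYAML, assumed to be installed


def validate_yaml_tree(root):
    """Return (path, message) pairs for every .yaml file under ``root`` that fails to parse."""
    failures = []
    for path in sorted(Path(root).rglob("*.yaml")):
        try:
            with path.open(encoding="utf-8") as handle:
                yaml.safe_load(handle)
        except yaml.YAMLError as exc:
            # ScannerError and ParserError are both subclasses of YAMLError.
            failures.append((str(path), str(exc)))
    return failures


def main(root="examples"):
    """Print each parse failure and return a process exit code (0 means all files parsed)."""
    failures = validate_yaml_tree(root)
    for path, message in failures:
        print(f"{path}: {message}")
    return 1 if failures else 0
```

A workflow step could run main() from the repository root and pass its return value to sys.exit, so that a single unparseable config under examples/ fails the job.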