feat(core): support offline validation and HTTP cache management#165

Open
kikkomep wants to merge 25 commits into crs4:develop from kikkomep:feat/offline-mode

@kikkomep
Member

Summary

Introduce an offline validation mode that lets users validate RO-Crates without network access, relying on a persistent local HTTP cache for every resource the validator needs to resolve.
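As a rough illustration of the behavior summarized above, the sketch below resolves a resource from a cache and fails clearly when offline and the resource was never fetched. It is not the PR's implementation: the `fetch` function, the dict-based cache and the `download` callback are hypothetical, and `OfflineCacheMissError` here is a local stand-in that only borrows the name used in the commits.

```python
from typing import Callable, Dict


class OfflineCacheMissError(RuntimeError):
    """Local stand-in for the error named in the PR; not the project's class."""


def fetch(url: str, cache: Dict[str, bytes], offline: bool,
          download: Callable[[str], bytes]) -> bytes:
    """Return the body for `url`, consulting the cache before the network."""
    if url in cache:
        # Cache hit: served locally in both online and offline mode.
        return cache[url]
    if offline:
        # Offline cache miss: fail clearly instead of touching the network.
        raise OfflineCacheMissError(
            f"Resource '{url}' is not available in the HTTP cache "
            "and the validator is running in offline mode."
        )
    # Online cache miss: download once and store for later offline runs.
    body = download(url)
    cache[url] = body
    return body
```

A resource fetched once online is then served from the cache in offline mode, while a resource that was never fetched raises the error.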

What changes

  • Offline mode — a new option on the validator (exposed as --offline on the CLI) serves every HTTP-backed resource from a local cache, failing clearly when a resource has never been fetched.
  • Persistent HTTP cache — the validator now uses a stable, user-level cache location (XDG-aware) so online and offline runs share the same store.
  • Transparent caching of remote contexts — remote JSON-LD @context fetches performed by rdflib go through the same cache, which previously bypassed it.
  • Cache pre-population (warm-up) — the validator can discover the external resources declared by the active profiles and prefetch them, both automatically (when online) and on demand.
  • Cache management CLI — a new rocrate-validator cache command group lets users inspect, reset and pre-populate the cache.
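For the XDG-aware cache location mentioned in the list above, a minimal sketch of the lookup might look like the following. The commit messages name `get_user_cache_dir()`, `XDG_CACHE_HOME` and the `~/.cache/rocrate-validator` fallback; everything else is an assumption, including the `env` parameter (added here only to make the function easy to test), the constant's value and the choice to append the directory name under `XDG_CACHE_HOME`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumed value, inferred from the `~/.cache/rocrate-validator` fallback.
USER_CACHE_DIR_NAME = "rocrate-validator"


def get_user_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve a user-level cache directory, preferring XDG_CACHE_HOME."""
    env = os.environ if env is None else env
    xdg = env.get("XDG_CACHE_HOME", "").strip()
    # An unset or empty XDG_CACHE_HOME falls back to ~/.cache.
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / USER_CACHE_DIR_NAME
```

Because the path depends only on the environment, online and offline runs by the same user resolve to the same store.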

User-facing changes

  • validate --offline / validate --no-cache flags.
  • new rocrate-validator cache info|reset|warm subcommand.
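The two validate flags cannot be combined: offline mode depends on the cache, while no-cache bypasses it, and the commits state that the settings are validated as mutually exclusive. One way such a check could be written is sketched below. `SettingsSketch` is a hypothetical stand-in, not the project's `ValidationSettings`, and the error type and message are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SettingsSketch:
    """Hypothetical stand-in holding only the two flags discussed here."""

    offline: bool = False
    no_cache: bool = False

    def __post_init__(self) -> None:
        # Offline mode needs the cache; no-cache disables it. Reject both.
        if self.offline and self.no_cache:
            raise ValueError("'offline' and 'no_cache' are mutually exclusive")
```

Rejecting the combination at construction time surfaces the conflict before any resource is resolved.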

[fix #114 #115]

kikkomep added 13 commits April 23, 2026 11:28
Extend HttpRequester with `offline` and `no_cache` options, plus
fetch_fresh/has_cached/clear_cache/cache_info/reset helpers. Add an
OfflineCacheMissError and an _OfflineFallbackSession that yields a 504
when requests_cache is unavailable, mirroring the only-if-cached
behavior. Emit standardized `CachedHttpRequester:` log lines describing
each request's cache outcome.

Introduce `get_user_cache_dir()` (honoring `XDG_CACHE_HOME`,
falling back to `~/.cache/rocrate-validator`) and `get_default_http_cache_path()`,
plus the `USER_CACHE_DIR_NAME` / `USER_CACHE_FILE_NAME` constants, so the HTTP cache
can be located under a stable, user-level directory instead of the previous `/tmp` prefix.

Cover the offline/no-cache flags, fetch_fresh, has_cached, clear_cache,
cache_info, reset and the _OfflineFallbackSession 504 behavior, plus
the standardized cache-outcome log messages.

Introduce `cache_warmup` helpers that discover external artifacts declared
by profile descriptors via `prof:hasResource`/`prof:hasArtifact` and
prefetch them so subsequent offline runs resolve every required resource
from the local HTTP cache. Add the `ROCRATE_VALIDATOR_AUTO_WARM`
environment variable to toggle automatic warm-up.

Introduce `install_document_loader()` that patches rdflib's
`source_to_json` so remote `@context` resolution goes through
HttpRequester, benefiting from the HTTP cache and honoring offline mode
(raising OfflineCacheMissError on offline cache misses). The install
is idempotent and reversible via `uninstall_document_loader()` for tests.

Expose `offline` and `no_cache` flags on ValidationSettings and default
`cache_path` to the persistent user HTTP cache so consecutive online/
offline runs share the same store.

Validate that `offline` and `no_cache` are mutually exclusive.

Install the JSON-LD document loader so rdflib's remote `@context`
resolution goes through the cache.

Introduce `rocrate-validator cache` with `info`, `reset` and `warm`
subcommands to inspect, clear and pre-populate the persistent HTTP
cache used by offline validation. `warm` discovers cacheable URLs from
profile descriptors and can also prefetch remote RO-Crates.

Redirect XDG_CACHE_HOME to a session-scoped tmp dir so tests never touch the
developer's real ~/.cache, and default ROCRATE_VALIDATOR_AUTO_WARM=0 per test
to prevent unintended network calls. Tests that need warm-up opt in explicitly.

Switch DEFAULT_HTTP_CACHE_MAX_AGE from 300s to -1 so cached HTTP
resources (JSON-LD contexts, profile artifacts, etc.) persist
indefinitely by default. The `-1` sentinel is already supported
throughout the cache stack and is the value used by `cache warm`.
Users can still opt into a finite TTL via `--cache-max-age`.

Note: this does not affect remote RO-Crates downloaded for
validation, which are always re-fetched online via `fetch_fresh`
(and the cached copy overwritten) so that subsequent offline runs
validate against the latest known remote state. The `max_age`
setting only governs the regular cached session used for other
HTTP-backed resources.
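
To make the `-1` sentinel concrete, here is a small sketch of a freshness check in which `-1` means "never expires" and any other value is a TTL in seconds. The `is_fresh` helper and the `NEVER_EXPIRE` name are hypothetical and only illustrate the semantics described in the commit message; the actual expiry logic lives in the cache stack and may differ.

```python
import time
from typing import Optional

# Sentinel described in the commit message: cached entries persist indefinitely.
NEVER_EXPIRE = -1


def is_fresh(stored_at: float, max_age: int, now: Optional[float] = None) -> bool:
    """Return True if an entry stored at `stored_at` is still usable."""
    if max_age == NEVER_EXPIRE:
        return True
    now = time.time() if now is None else now
    # Finite TTL: usable while its age does not exceed max_age seconds.
    return (now - stored_at) <= max_age
```

Under the previous 300-second default, an entry older than five minutes would be treated as stale; with `-1` it stays usable until the cache is reset.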
@kikkomep kikkomep force-pushed the feat/offline-mode branch from 1e122d2 to 419fece Compare May 13, 2026 17:04
@elichad
Contributor

elichad commented May 14, 2026

I've just tested this PR and it works well offline! The rocrate-validator cache commands look really helpful for our needs - will allow us to build the cache easily when we build our docker image before running it offline.

A few comments after my experiments - none are blocking, just quality of life stuff:

  1. when I ran validation in offline mode against an un-cached profile, it emitted a useful warning telling me how to update the cache, but it emitted it a LOT of times (below I show just the last few instances, but there were many more) - could this be reduced so the error is only shown once?
 [2026-05-14 13:25:17,889] WARNING in models: Consider reporting this as a bug.                      
 [2026-05-14 13:25:17,928] WARNING in models: Unexpected error during check                             
 <rocrate_validator.requirements.shacl.checks.SHACLCheck object at 0x748a66147850>.  Exception: Resource                  
 'https://w3id.org/ro/crate/1.2/context' is not available in the HTTP cache and the validator is running in offline mode. 
 Run online once, or use `rocrate-validator cache warm` to pre-populate the cache.                                     
 [2026-05-14 13:25:17,928] WARNING in models: Consider reporting this as a bug.                      
 [2026-05-14 13:25:17,957] WARNING in models: Unexpected error during check                             
 <rocrate_validator.requirements.shacl.checks.SHACLCheck object at 0x748a6611cbd0>.  Exception: Resource                  
 'https://w3id.org/ro/crate/1.2/context' is not available in the HTTP cache and the validator is running in offline mode. 
 Run online once, or use `rocrate-validator cache warm` to pre-populate the cache.                                     
 [2026-05-14 13:25:17,957] WARNING in models: Consider reporting this as a bug.                      
 [2026-05-14 13:25:17,994] WARNING in requirements: Forced SHACL run for zero-shape target profile      
 five-safes-crate-0.4 failed: Resource 'https://w3id.org/ro/crate/1.2/context' is not available in the HTTP cache and the 
 validator is running in offline mode. Run online once, or use `rocrate-validator cache warm` to pre-populate the         
 cache.     
  2. Normally I can run rocrate-validator validate <path> -p process-run-crate and it will find the process-run-crate-0.5 profile automatically. But if I run rocrate-validator cache warm -p process-run-crate I get the following output:
Profile(s) not found and skipped: process-run-crate
Nothing to warm up.

It would be useful if either cache warm could automatically find the versioned profile, or if it was more clearly documented that the full profile identifier must be used with cache warm.

  3. I tried to manually cache https://w3id.org/ro/crate/1.2/context but couldn't achieve this. I tried:
    • rocrate-validator cache warm https://w3id.org/ro/crate/1.2/context (I don't think this is expected to work)
    • rocrate-validator cache warm --crate https://w3id.org/ro/crate/1.2 to hopefully cache the context from the RO-Crate 1.2 crate (error "Calling read(decode_content=False) is not supported after read(decode_content=True) was called")
    • rocrate-validator cache warm --crate https://www.researchobject.org/ro-crate/specification/1.2/ro-crate-metadata.json (same error)

It would be great if there was an easy way to provide a non-crate URL to cache.

@kikkomep
Member Author

Changes in the latest few commits:

  1. Noisy repeated warning — fixed in 757b86a.

  2. cache warm -p with non-fully-qualified profile — now automatically uses the latest profile version when a non-fully-qualified identifier is passed. A warning is emitted when more than one version is available, notifying the user that the latest is being selected automatically.

  3. cache warm --url/-u: warm now supports caching of specific URLs via the new option.

Also added: cache list command (aliased ls) to list the cache contents.
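
The version fallback described in item 2 could work roughly as sketched below: an exact identifier wins, otherwise the highest `<name>-<version>` match is chosen, with a warning when several versions exist. This is not the code from the PR; `resolve_profile`, `_version_key`, the warning text and the assumption that identifiers end in a dotted numeric version are all hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple


def _version_key(identifier: str, base: str) -> Optional[Tuple[int, ...]]:
    """Parse `<base>-<dotted version>` into a sortable tuple, else None."""
    match = re.fullmatch(re.escape(base) + r"-(\d+(?:\.\d+)*)", identifier)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def resolve_profile(requested: str,
                    available: Iterable[str]) -> Tuple[Optional[str], List[str]]:
    """Return (resolved identifier or None, warnings)."""
    available = list(available)
    if requested in available:
        # Fully qualified identifier: use it as-is.
        return requested, []
    candidates = [(key, ident) for ident in available
                  if (key := _version_key(ident, requested)) is not None]
    if not candidates:
        return None, []
    # Numeric tuples sort correctly, so 0.10 ranks above 0.5.
    candidates.sort()
    chosen = candidates[-1][1]
    warnings: List[str] = []
    if len(candidates) > 1:
        warnings.append(
            f"Multiple versions of '{requested}' found; using latest: {chosen}"
        )
    return chosen, warnings
```

With hypothetical identifiers `process-run-crate-0.4` and `process-run-crate-0.5` available, requesting `process-run-crate` would select the latter and produce one warning.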

@elichad let me know if it all looks good

Development

Successfully merging this pull request may close these issues.

Feature request: cache downloaded contexts/vocabularies between runs