Decouple cache tiers: key the venv on dependency content, not directory existence #24

@rahlk

Is your feature request related to a problem? Please describe.

Yes. The virtualenv cache in Codeanalyzer is keyed on directory
existence, not on dependency-manifest content, and a single
rebuild_analysis flag conflates three independent caches (venv /
CodeQL DB / analysis.json). This causes one correctness bug and one
performance bug:

  • Stale-deps correctness bug (core.py:231-235):

    venv_path = self.cache_dir / self.project_dir.name / "virtualenv"
    if not venv_path.exists() or self.rebuild_analysis:
        # python -m venv ...; pip install -r requirements*.txt; pip install -e .

    Edit requirements.txt/pyproject.toml/a lockfile while the venv
    dir already exists and rebuild_analysis=False → the venv is not
    recreated → analysis silently runs against stale dependency versions
    (wrong Jedi/CodeQL resolution).

  • Wasted rebuilds. rebuild_analysis=True (--eager) with
    byte-identical dependencies still tears down the venv and re-runs
    pip install (~30s), even though --eager is meant to invalidate the
    analysis, not the environment.

  • One flag, three caches. rebuild_analysis (read at
    core.py:62) gates the venv (:235), the CodeQL DB (:326), and
    the symbol-table cache (:370,502,552,582). There is no way to
    rebuild the analysis without also paying for a full venv rebuild.

These artifacts have different invalidation triggers:

Artifact                     Correct rebuild trigger               Granularity
---------------------------  ------------------------------------  -----------------------------
virtualenv                   dependency-manifest content changes   all-or-nothing
CodeQL DB                    any *.py content changes              all-or-nothing
symbol table                 per-file content changes              incremental (already correct)
call graph / analysis.json   any *.py content changes              all-or-nothing
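
The table above can be read as a mapping from changed files to the tiers that need rebuilding. The following is a minimal sketch of that mapping, not code from Codeanalyzer: the function name artifacts_to_rebuild and the tier labels are hypothetical, and the manifest list is the one proposed later in this issue.

```python
from pathlib import PurePosixPath

# Manifest file names proposed in this issue as the venv cache key inputs.
MANIFESTS = {
    "requirements.txt", "requirements-dev.txt", "dev-requirements.txt",
    "test-requirements.txt", "pyproject.toml", "setup.py", "setup.cfg",
    "Pipfile", "Pipfile.lock", "poetry.lock", "uv.lock",
}


def artifacts_to_rebuild(changed_files):
    """Return the set of cache tiers invalidated by the given changed paths.

    Hypothetical helper: tier names are illustrative labels, not
    identifiers from the Codeanalyzer code base.
    """
    rebuild = set()
    for name in changed_files:
        path = PurePosixPath(name)
        if path.name in MANIFESTS:
            # Dependency manifests invalidate only the environment.
            rebuild.add("virtualenv")
        if path.suffix == ".py":
            # Source edits invalidate the analysis tiers; the symbol table
            # is refreshed per file, the other two are all-or-nothing.
            rebuild.update({"codeql_db", "symbol_table", "call_graph"})
    return rebuild
```

Note that setup.py is both a manifest and a Python source file, so under this reading an edit to it invalidates every tier.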

Describe the solution you'd like

Cache each tier independently, keyed by its own content hash:

  1. venv keyed on a dependency-manifest hash. SHA256 over the
    manifests that exist (requirements.txt, requirements-dev.txt,
    dev-requirements.txt, test-requirements.txt, pyproject.toml,
    setup.py, setup.cfg, Pipfile, Pipfile.lock, poetry.lock,
    uv.lock). Persist as <venv>/.deps_hash; rebuild iff
    venv missing OR stored_hash != current_hash OR rebuild_venv.
    Editing source must not rebuild the venv; changing a dependency
    must, even when rebuild_analysis=False.

  2. Separate, content-addressed cache roots:

    • cache_dir/venv/<dep_hash>/ — virtualenv
    • cache_dir/codeql/<src_hash>/ — CodeQL DB. The source-checksum
      invalidation at core.py:313-352 is already correct; the change
      is to key the DB directory by <src_hash> and retain prior
      DBs instead of overwriting them in place with --overwrite, so
      revisiting an earlier source state (git bisect, branch switch, CI
      re-run of a prior SHA) reuses an existing DB.
  3. Split rebuild_analysis into independent controls
    (rebuild_venv, rebuild_db, rebuild_analysis), each defaulting
    to its own content-key check. Keep --eager (== rebuild all) as
    back-compat sugar.

  4. Expose the resolved venv/DB/analysis paths via AnalysisOptions
    so embedders (e.g. CLDK) can select the cache root without
    re-deriving keys.
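
As a rough illustration of item 1, the manifest hash and the .deps_hash marker check could look like the sketch below. The function names (dependency_hash, venv_needs_rebuild, write_deps_marker) are hypothetical and not part of Codeanalyzer today. Two details are assumptions on my part: the file name is mixed into the digest so that moving content between manifests changes the key, and a venv directory without a marker is treated as stale.

```python
import hashlib
from pathlib import Path

# Manifest names from the proposal; only the ones that exist are hashed.
MANIFEST_NAMES = (
    "requirements.txt", "requirements-dev.txt", "dev-requirements.txt",
    "test-requirements.txt", "pyproject.toml", "setup.py", "setup.cfg",
    "Pipfile", "Pipfile.lock", "poetry.lock", "uv.lock",
)


def dependency_hash(project_dir: Path) -> str:
    """SHA-256 over the names and contents of the manifests that exist."""
    digest = hashlib.sha256()
    for name in sorted(MANIFEST_NAMES):  # fixed order keeps the key stable
        manifest = project_dir / name
        if manifest.is_file():
            digest.update(name.encode("utf-8") + b"\0")
            digest.update(manifest.read_bytes() + b"\0")
    return digest.hexdigest()


def venv_needs_rebuild(venv_path: Path, current_hash: str,
                       rebuild_venv: bool = False) -> bool:
    """Rebuild iff venv missing OR stored hash differs OR rebuild_venv."""
    marker = venv_path / ".deps_hash"
    if rebuild_venv or not venv_path.exists() or not marker.is_file():
        # Assumption: a venv without a marker predates this scheme
        # and cannot be trusted.
        return True
    return marker.read_text(encoding="utf-8").strip() != current_hash


def write_deps_marker(venv_path: Path, current_hash: str) -> None:
    """Record the key; call only after pip install has succeeded."""
    (venv_path / ".deps_hash").write_text(current_hash, encoding="utf-8")
```

Writing the marker only after a successful install means an interrupted pip install leaves no marker behind, so the next run rebuilds rather than trusting a half-populated environment.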

Acceptance criteria:

  • Editing a .py file rebuilds the CodeQL DB + analysis.json
    but reuses the existing venv (no pip install).
  • Editing a dependency manifest rebuilds the venv even with
    rebuild_analysis=False.
  • --eager with unchanged deps does not rebuild the venv.
  • Switching to a previously-analyzed source state reuses the
    cached CodeQL DB (no database create).
  • AnalysisOptions exposes the resolved venv/DB/analysis paths.
  • Existing single-cache_dir callers keep working (back-compat
    default for the new roots).

Implementation sketch:

  • Add Codeanalyzer._dependency_hash(project_dir) -> str (mirror of
    _compute_checksum but over the manifest list above).
  • In __enter__, replace the not venv_path.exists() or rebuild
    predicate with the .deps_hash marker comparison; write the marker
    after a successful pip install.
  • Relocate db_path under cache_dir/codeql/<src_hash>/ keyed by the
    existing _compute_checksum; drop --overwrite in favor of
    per-hash dirs.
  • Extend AnalysisOptions with rebuild_venv/rebuild_db fields and
    read-back properties for the resolved paths.
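
To make the content-addressed layout concrete, here is a minimal sketch of how the resolved paths might be derived and read back. CacheLayout is a hypothetical stand-in, not the real AnalysisOptions, whose fields I have not reproduced; the directory names follow the cache_dir/venv/<dep_hash>/ and cache_dir/codeql/<src_hash>/ roots proposed above, and the two hashes are assumed to be computed elsewhere.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CacheLayout:
    """Hypothetical read-back view of the resolved cache paths."""

    cache_dir: Path
    dep_hash: str  # key from the dependency-manifest hash
    src_hash: str  # key from the existing source checksum

    @property
    def venv_path(self) -> Path:
        # One virtualenv per distinct dependency set.
        return self.cache_dir / "venv" / self.dep_hash

    @property
    def db_path(self) -> Path:
        # One CodeQL DB per distinct source state; earlier DBs are kept.
        return self.cache_dir / "codeql" / self.src_hash

    def db_is_cached(self) -> bool:
        """True when a DB directory for this source state already exists."""
        return self.db_path.is_dir()
```

With per-hash directories, returning to an earlier commit resolves to a db_path that already exists, so database creation can be skipped without any in-place overwrite. The trade-off is that old directories accumulate, so some eviction policy would eventually be needed.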

Describe alternatives you've considered

  • Downstream workaround (current state in CLDK). python-sdk
    passes a dependency-hash-keyed cache_dir from
    cldk/analysis/python/codeanalyzer/cache.py. This keeps the venv
    stable but cannot prevent the in-place CodeQL DB rebuild,
    because venv and DB share one cache_dir. It also duplicates key
    logic that rightfully belongs upstream. Rejected as a permanent
    fix; it only masks the venv bug for one consumer.

  • Hash the whole project tree for everything (single key). Simple,
    but any one-character source edit would then invalidate the venv
    too — the exact bug being fixed, inverted. Rejected.

  • Always rebuild the venv (drop caching). Correct but defeats the
    purpose; ~30s pip install on every run. Rejected.

  • mtime/size-based venv invalidation instead of content hash.
    Cheaper to compute, but unreliable across clones/checkouts/CI where
    mtimes differ for identical content. Content hash is the robust
    choice; manifest files are small so the cost is negligible.
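
The mtime argument can be shown with a small experiment. The sketch below uses hypothetical helper names: it copies a manifest the way a fresh checkout would (same bytes, different timestamp) and compares the two candidate keys.

```python
import hashlib
import os
from pathlib import Path


def mtime_size_key(path: Path) -> tuple[int, int]:
    """Cheap key: modification time and size only."""
    stat = path.stat()
    return (stat.st_mtime_ns, stat.st_size)


def content_key(path: Path) -> str:
    """Robust key: SHA-256 of the file contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def simulate_fresh_checkout(src: Path, dst: Path) -> None:
    """Copy the bytes of src to dst and give dst a later mtime,
    as a new clone or CI checkout would."""
    dst.write_bytes(src.read_bytes())
    src_stat = src.stat()
    later_ns = src_stat.st_mtime_ns + 60 * 1_000_000_000  # 60 seconds later
    os.utime(dst, ns=(src_stat.st_atime_ns, later_ns))
```

For a file and its simulated checkout, mtime_size_key differs while content_key matches, so an mtime-keyed cache would rebuild the venv on every fresh checkout even though nothing about the dependencies changed.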

Additional context

  • Discovered while building feature-parity Python analysis in CLDK
    (python-sdk), where the venv rebuild dominated repeated-run
    latency (~30s cold vs ~3s warm once the venv was stabilized).
  • Line references are against the installed version, 0.1.14.
  • Once this lands, the CLDK-side helper
    (cldk/analysis/python/codeanalyzer/cache.py) should collapse to
    "pick a root," and its workaround comments referencing this issue
    should be removed.

Labels: enhancement (New feature or request)
