add ability to download cached workspace (#520)
* create "stale" field on workspace state

A provider that downloads its workspace state directly cannot assume
that this state is a valid basis for a future incremental update, and
should mark the downloaded workspace as stale.

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* WIP add configs

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* lint fix

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* [wip] working on vunnel results db listing

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* update and tests for safe_extract_tar

Now that we're using it for more than one thing, make an extractor that
generally prevents path traversal.

Signed-off-by: Will Murphy <will.murphy@anchore.com>
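A traversal-preventing extractor of the kind this commit describes can be sketched as follows (illustrative only, not the actual `safe_extract_tar` from this change; the function name and error type are assumptions):

```python
import os
import tarfile


def safe_extract_tar(tar_path: str, dest: str) -> None:
    # Reject any member whose resolved path would escape the destination
    # directory -- the classic "../" tar path-traversal attack.
    dest = os.path.realpath(dest)
    with tarfile.open(tar_path) as tf:
        for member in tf.getmembers():
            target = os.path.realpath(os.path.join(dest, member.name))
            if os.path.commonpath([dest, target]) != dest:
                msg = f"refusing to extract {member.name!r}: path traversal"
                raise RuntimeError(msg)
        tf.extractall(dest)
```

Validating every member before extracting anything means a malicious archive fails closed rather than partially extracting.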

* [wip] adding tests for fetching listing and archives

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* [wip] add more negative tests for provider tests

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* unit test for new workspace changes

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* replace the workspace results instead of overlaying

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* clean up hasher implementation

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add tests for prep workspace from listing entry

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* do not include inputs in tar test fixture

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* vunnel fetch existing workspace working

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* add unit test for full update flow

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* update existing unit tests for new config values

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* add unit test for default behavior of new configs

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* lint fix

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* add missing annotations import

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* Use 3.9 compatible annotations

Relying on "from __future__ import annotations" doesn't work with
mashumaro.

Signed-off-by: Will Murphy <will.murphy@anchore.com>
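The 3.9-compatible spelling avoids PEP 604 unions (`str | None`), which on 3.9 are only legal as deferred string annotations that runtime introspection (as mashumaro performs) may fail to resolve. A minimal sketch of the compatible style (plain dataclass, mashumaro mixin omitted; field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional


# Optional[str] evaluates to a real typing object on Python 3.9+,
# so libraries that inspect annotations at runtime can resolve it
# without relying on "from __future__ import annotations".
@dataclass
class ImportConfig:
    host: Optional[str] = None
    path: Optional[str] = None
```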

* validate that enabling import results requires host and path

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* rename listing field and add schema

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* only require github token when downloading

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add zstd support

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add tests for zstd support

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add tests for _has_newer_archive

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* fix tests for zstd

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* show stderr to log when git commands fail

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* move import_results to common field on provider

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* add concept for distribution version

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* single source of truth for provider schemas

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add distribution-version to schema, provider state, and listing entry

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* clear workspace on different dist version

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* fix defaulting logic and update tests

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* default distribution version and path

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* make "" and None both use default path

Signed-off-by: Will Murphy <will.murphy@anchore.com>

---------

Signed-off-by: Will Murphy <will.murphy@anchore.com>
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
Co-authored-by: Alex Goodman <wagoodman@users.noreply.github.com>
willmurphyscode and wagoodman committed Mar 27, 2024
1 parent 6b4fa38 commit 90b176c
Showing 41 changed files with 1,967 additions and 127 deletions.
150 changes: 148 additions & 2 deletions poetry.lock


2 changes: 2 additions & 0 deletions pyproject.toml
@@ -57,6 +57,8 @@ importlib-metadata = "^7.0.1"
xsdata = {extras = ["cli", "lxml", "soap"], version = ">=22.12,<25.0"}
pytest-snapshot = "^0.9.0"
mashumaro = "^3.10"
iso8601 = "^2.1.0"
zstandard = "^0.22.0"

[tool.poetry.group.dev.dependencies]
pytest = ">=7.2.2,<9.0.0"
17 changes: 17 additions & 0 deletions schema/provider-archive-listing/README.md
@@ -0,0 +1,17 @@
# `ListingDocument` JSON Schema

This schema governs the `listing.json` file used when providers are configured to fetch pre-computed results (by using `import_results_enabled`). The listing file is how the provider knows what results are available, where to fetch them from, and how to validate them.

See `vunnel.distribution.ListingDocument` for the root object that represents this schema.

## Updating the schema

Versioning the JSON schema is done manually: copy the existing schema into a new `schema-x.y.z.json` file and make the necessary updates by hand (or use an online tool such as https://www.liquid-technologies.com/online-json-to-schema-converter).

This schema is versioned based on the "SchemaVer" guidelines, which diverge slightly from Semantic Versioning to suit the needs of data models.

Given a version number format `MODEL.REVISION.ADDITION`:

- `MODEL`: increment when you make a breaking schema change which will prevent interaction with any historical data
- `REVISION`: increment when you make a schema change which may prevent interaction with some historical data
- `ADDITION`: increment when you make a schema change that is compatible with all historical data
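For orientation, an illustrative `listing.json` conforming to the 1.0.0 schema below might look like this (the provider name, URLs, timestamps, and digests are all made up):

```json
{
  "schema": {
    "version": "1.0.0",
    "url": "https://example.invalid/schema/provider-archive-listing/schema-1.0.0.json"
  },
  "provider": "wolfi",
  "available": {
    "1": [
      {
        "built": "2024-03-27T00:00:00+00:00",
        "checksum": "xxh64:1234567890abcdef",
        "distribution_checksum": "sha256:1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef",
        "url": "https://example.invalid/vunnel/wolfi/results.tar.zst",
        "version": 1
      }
    ]
  }
}
```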
66 changes: 66 additions & 0 deletions schema/provider-archive-listing/schema-1.0.0.json
@@ -0,0 +1,66 @@
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"schema": {
"type": "object",
"properties": {
"version": {
"type": "string"
},
"url": {
"type": "string"
}
},
"required": [
"version",
"url"
]
},
"provider": {
"type": "string"
},
"available": {
"type": "object",
"properties": {
"1": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"distribution_checksum": {
"type": "string"
},
"built": {
"type": "string"
},
"checksum": {
"type": "string"
},
"url": {
"type": "string"
},
"version": {
"type": "integer"
}
},
"required": [
"built",
"checksum",
"distribution_checksum",
"url",
"version"
]
}
]
}
}
}
},
"required": [
"schema",
"available",
"provider"
]
}
80 changes: 80 additions & 0 deletions schema/provider-workspace-state/schema-1.0.2.json
@@ -0,0 +1,80 @@
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"title": "provider-workspace-state",
"description": "describes the filesystem state of a provider workspace directory",
"properties": {
"provider": {
"type": "string"
},
"urls": {
"type": "array",
"items": [
{
"type": "string"
}
]
},
"store": {
"type": "string"
},
"timestamp": {
"type": "string"
},
"listing": {
"type": "object",
"properties": {
"digest": {
"type": "string"
},
"path": {
"type": "string"
},
"algorithm": {
"type": "string"
}
},
"required": [
"digest",
"path",
"algorithm"
]
},
"version": {
"type": "integer",
"description": "version describing the result data shape + the provider processing behavior semantics"
},
"distribution_version": {
"type": "integer",
"description": "version describing purely the result data shape"
},
"schema": {
"type": "object",
"properties": {
"version": {
"type": "string"
},
"url": {
"type": "string"
}
},
"required": [
"version",
"url"
]
},
"stale": {
"type": "boolean",
"description": "set to true if the workspace is stale and cannot be used for an incremental update"
}
},
"required": [
"provider",
"urls",
"store",
"timestamp",
"listing",
"version",
"schema"
]
}
58 changes: 53 additions & 5 deletions src/vunnel/cli/config.py
@@ -2,13 +2,41 @@

import os
from dataclasses import dataclass, field, fields
from typing import Any
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
from collections.abc import Generator

import mergedeep
import yaml
from mashumaro.mixins.dict import DataClassDictMixin

from vunnel import providers
from vunnel import provider, providers


@dataclass
class ImportResults:
"""
These are the defaults for all providers. Corresponding
fields on specific providers override these values.
If a path is "" or None, path will be set to "providers/{provider_name}/listing.json".
If an empty path is needed, specify "/".
"""

__default_path__ = "providers/{provider_name}/listing.json"
host: str = ""
path: str = __default_path__
enabled: bool = False

def __post_init__(self) -> None:
if not self.path:
self.path = self.__default_path__


@dataclass
class CommonProviderConfig:
import_results: ImportResults = field(default_factory=ImportResults)


@dataclass
@@ -26,12 +54,32 @@ class Providers:
ubuntu: providers.ubuntu.Config = field(default_factory=providers.ubuntu.Config)
wolfi: providers.wolfi.Config = field(default_factory=providers.wolfi.Config)

common: CommonProviderConfig = field(default_factory=CommonProviderConfig)

def __post_init__(self) -> None:
for name in self.provider_names():
runtime_cfg = getattr(self, name).runtime
if runtime_cfg and isinstance(runtime_cfg, provider.RuntimeConfig):
if runtime_cfg.import_results_enabled is None:
runtime_cfg.import_results_enabled = self.common.import_results.enabled
if not runtime_cfg.import_results_host:
runtime_cfg.import_results_host = self.common.import_results.host
if not runtime_cfg.import_results_path:
runtime_cfg.import_results_path = self.common.import_results.path

def get(self, name: str) -> Any | None:
for f in fields(Providers):
if self._normalize_name(f.name) == self._normalize_name(name):
return getattr(self, f.name)
for candidate in self.provider_names():
if self._normalize_name(candidate) == self._normalize_name(name):
return getattr(self, candidate)
return None

@staticmethod
def provider_names() -> Generator[str, None, None]:
for f in fields(Providers):
if f.name == "common":
continue
yield f.name

@staticmethod
def _normalize_name(name: str) -> str:
return name.lower().replace("-", "_")
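The defaulting rules in `ImportResults` above (both `""` and `None` fall back to the default path template; a truly empty path must be given explicitly as `"/"`) can be exercised with a standalone sketch of the same pattern (mashumaro mixin omitted):

```python
from dataclasses import dataclass


@dataclass
class ImportResults:
    # Both "" and None fall back to the default path template;
    # an intentionally empty path must be specified as "/".
    __default_path__ = "providers/{provider_name}/listing.json"
    host: str = ""
    path: str = __default_path__
    enabled: bool = False

    def __post_init__(self) -> None:
        if not self.path:
            self.path = self.__default_path__
```

Per-provider runtime configs then pick up these common values in `Providers.__post_init__` only where the provider has not set its own.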
89 changes: 89 additions & 0 deletions src/vunnel/distribution.py
@@ -0,0 +1,89 @@
from __future__ import annotations

import datetime
import os
from dataclasses import dataclass, field
from urllib.parse import urlparse

import iso8601
from mashumaro.mixins.dict import DataClassDictMixin

from vunnel import schema as schema_def

DB_SUFFIXES = {".tar.gz", ".tar.zst"}


@dataclass
class ListingEntry(DataClassDictMixin):
# the date this archive was built relative to the data enclosed in the archive
built: str

# the URL where the vunnel provider archive is located
url: str

# the digest of the archive referenced at the URL.
# Note: all checksums are labeled with "algorithm:value" (e.g. sha256:1234567890abcdef1234567890abcdef)
distribution_checksum: str

# the digest of the checksums file within the archive referenced at the URL
# Note: all checksums are labeled with "algorithm:value" (e.g. xxhash64:1234567890abcdef)
enclosed_checksum: str

# the provider distribution version this archive was built with (different than the provider version)
distribution_version: int = 1

def basename(self) -> str:
basename = os.path.basename(urlparse(self.url, allow_fragments=False).path)
if not _has_suffix(basename, suffixes=DB_SUFFIXES):
msg = f"entry url is not a db archive: {basename}"
raise RuntimeError(msg)

return basename

def age_in_days(self, now: datetime.datetime | None = None) -> int:
if not now:
now = datetime.datetime.now(tz=datetime.timezone.utc)
return (now - iso8601.parse_date(self.built)).days


@dataclass
class ListingDocument(DataClassDictMixin):
# mapping of provider versions to a list of ListingEntry objects denoting archives available for download
available: dict[int, list[ListingEntry]]

# the provider name this document is associated with
provider: str

# the schema information for this document
schema: schema_def.Schema = field(default_factory=schema_def.ProviderListingSchema)

@classmethod
def new(cls, provider: str) -> ListingDocument:
return cls(available={}, provider=provider)

def latest_entry(self, schema_version: int) -> ListingEntry | None:
if schema_version not in self.available:
return None

if not self.available[schema_version]:
return None

return self.available[schema_version][0]

def add(self, entry: ListingEntry) -> None:
if not self.available.get(entry.distribution_version):
self.available[entry.distribution_version] = []

self.available[entry.distribution_version].append(entry)

# keep listing entries sorted by date (rfc3339 formatted entries, which iso8601 is a superset of)
self.available[entry.distribution_version].sort(
key=lambda x: iso8601.parse_date(x.built),
reverse=True,
)


def _has_suffix(el: str, suffixes: set[str] | None) -> bool:
if not suffixes:
return True
return any(el.endswith(s) for s in suffixes)
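`ListingEntry.age_in_days` above parses the RFC 3339 `built` timestamp with the iso8601 package; for timestamps in the `+00:00` offset form, a stdlib-only equivalent can be sketched like this (illustrative, not the code above; note that the bare `Z` suffix is only accepted by `fromisoformat` on Python 3.11+):

```python
import datetime
from typing import Optional


def age_in_days(built: str, now: Optional[datetime.datetime] = None) -> int:
    # built is an RFC 3339 timestamp such as "2024-03-20T00:00:00+00:00";
    # datetime.fromisoformat handles this offset form directly.
    if now is None:
        now = datetime.datetime.now(tz=datetime.timezone.utc)
    return (now - datetime.datetime.fromisoformat(built)).days
```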
