add ability to download cached workspace (#520)
* create "stale" field on workspace state

A provider that downloads its workspace state directly cannot assume
that this state is a valid basis for a future incremental update, and
should mark the downloaded workspace as stale.

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* WIP add configs

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* lint fix

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* [wip] working on vunnel results db listing

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* update and tests for safe_extract_tar

Now that we're using it for more than one thing, make an extractor that
generally prevents path traversal.

Signed-off-by: Will Murphy <will.murphy@anchore.com>
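A traversal-preventing extractor of the kind this commit describes can be sketched as follows (illustrative only, not the actual `safe_extract_tar` from this change; the function name and error type are assumptions):

```python
import os
import tarfile


def safe_extract_tar(tar_path: str, dest: str) -> None:
    # Reject any member whose resolved path would escape the destination
    # directory -- the classic "../" tar path-traversal attack.
    dest = os.path.realpath(dest)
    with tarfile.open(tar_path) as tf:
        for member in tf.getmembers():
            target = os.path.realpath(os.path.join(dest, member.name))
            if os.path.commonpath([dest, target]) != dest:
                msg = f"refusing to extract {member.name!r}: path traversal"
                raise RuntimeError(msg)
        tf.extractall(dest)
```

Validating every member before extracting anything means a malicious archive fails closed rather than partially extracting.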

* [wip] adding tests for fetching listing and archives

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* [wip] add more negative tests for provider tests

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* unit test for new workspace changes

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* replace the workspace results instead of overlaying

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* clean up hasher implementation

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add tests for prep workspace from listing entry

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* do not include inputs in tar test fixture

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* vunnel fetch existing workspace working

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* add unit test for full update flow

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* update existing unit tests for new config values

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* add unit test for default behavior of new configs

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* lint fix

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* add missing annotations import

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* Use 3.9 compatible annotations

Relying on "from __future__ import annotations" doesn't work with
mashumaro.

Signed-off-by: Will Murphy <will.murphy@anchore.com>
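The 3.9-compatible spelling avoids PEP 604 unions (`str | None`), which on 3.9 are only legal as deferred string annotations that runtime introspection (as mashumaro performs) may fail to resolve. A minimal sketch of the compatible style (plain dataclass, mashumaro mixin omitted; field names are hypothetical):

```python
from dataclasses import dataclass
from typing import Optional


# Optional[str] evaluates to a real typing object on Python 3.9+,
# so libraries that inspect annotations at runtime can resolve it
# without relying on "from __future__ import annotations".
@dataclass
class ImportConfig:
    host: Optional[str] = None
    path: Optional[str] = None
```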

* validate that enabling import results requires host and path

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* rename listing field and add schema

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* only require github token when downloading

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add zstd support

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add tests for zstd support

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add tests for _has_newer_archive

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* fix tests for zstd

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* show stderr to log when git commands fail

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* move import_results to common field on provider

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* add concept for distribution version

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* single source of truth for provider schemas

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* add distribution-version to schema, provider state, and listing entry

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* clear workspace on different dist version

Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>

* fix defaulting logic and update tests

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* default distribution version and path

Signed-off-by: Will Murphy <will.murphy@anchore.com>

* make "" and None both use default path

Signed-off-by: Will Murphy <will.murphy@anchore.com>

---------

Signed-off-by: Will Murphy <will.murphy@anchore.com>
Signed-off-by: Alex Goodman <wagoodman@users.noreply.github.com>
Co-authored-by: Alex Goodman <wagoodman@users.noreply.github.com>
willmurphyscode and wagoodman committed Mar 27, 2024
1 parent 6b4fa38 commit 90b176c
Showing 41 changed files with 1,967 additions and 127 deletions.
150 changes: 148 additions & 2 deletions poetry.lock


2 changes: 2 additions & 0 deletions pyproject.toml
@@ -57,6 +57,8 @@ importlib-metadata = "^7.0.1"
xsdata = {extras = ["cli", "lxml", "soap"], version = ">=22.12,<25.0"}
pytest-snapshot = "^0.9.0"
mashumaro = "^3.10"
iso8601 = "^2.1.0"
zstandard = "^0.22.0"

[tool.poetry.group.dev.dependencies]
pytest = ">=7.2.2,<9.0.0"
17 changes: 17 additions & 0 deletions schema/provider-archive-listing/README.md
@@ -0,0 +1,17 @@
# `ListingDocument` JSON Schema

This schema governs the `listing.json` file used when providers are configured to fetch pre-computed results (by using `import_results_enabled`). The listing file is how the provider knows what results are available, where to fetch them from, and how to validate them.

See `vunnel.distribution.ListingDocument` for the root object that represents this schema.

## Updating the schema

Versioning the JSON schema is done manually: copy the existing schema into a new `schema-x.y.z.json` file and make the necessary updates by hand (or use an online tool such as https://www.liquid-technologies.com/online-json-to-schema-converter).

This schema is versioned based on the "SchemaVer" guidelines, which diverge slightly from Semantic Versioning to suit the needs of data models.

Given a version number format `MODEL.REVISION.ADDITION`:

- `MODEL`: increment when you make a breaking schema change which will prevent interaction with any historical data
- `REVISION`: increment when you make a schema change which may prevent interaction with some historical data
- `ADDITION`: increment when you make a schema change that is compatible with all historical data
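For orientation, an illustrative `listing.json` conforming to the 1.0.0 schema below might look like this (the provider name, URLs, timestamps, and digests are all made up):

```json
{
  "schema": {
    "version": "1.0.0",
    "url": "https://example.invalid/schema/provider-archive-listing/schema-1.0.0.json"
  },
  "provider": "wolfi",
  "available": {
    "1": [
      {
        "built": "2024-03-27T00:00:00+00:00",
        "checksum": "xxh64:1234567890abcdef",
        "distribution_checksum": "sha256:1234567890abcdef1234567890abcdef1234567890abcdef1234567890abcdef",
        "url": "https://example.invalid/vunnel/wolfi/results.tar.zst",
        "version": 1
      }
    ]
  }
}
```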
66 changes: 66 additions & 0 deletions schema/provider-archive-listing/schema-1.0.0.json
@@ -0,0 +1,66 @@
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"schema": {
"type": "object",
"properties": {
"version": {
"type": "string"
},
"url": {
"type": "string"
}
},
"required": [
"version",
"url"
]
},
"provider": {
"type": "string"
},
"available": {
"type": "object",
"properties": {
"1": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"distribution_checksum": {
"type": "string"
},
"built": {
"type": "string"
},
"checksum": {
"type": "string"
},
"url": {
"type": "string"
},
"version": {
"type": "integer"
}
},
"required": [
"built",
"checksum",
"distribution_checksum",
"url",
"version"
]
}
]
}
}
}
},
"required": [
"schema",
"available",
"provider"
]
}
80 changes: 80 additions & 0 deletions schema/provider-workspace-state/schema-1.0.2.json
@@ -0,0 +1,80 @@
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"title": "provider-workspace-state",
"description": "describes the filesystem state of a provider workspace directory",
"properties": {
"provider": {
"type": "string"
},
"urls": {
"type": "array",
"items": [
{
"type": "string"
}
]
},
"store": {
"type": "string"
},
"timestamp": {
"type": "string"
},
"listing": {
"type": "object",
"properties": {
"digest": {
"type": "string"
},
"path": {
"type": "string"
},
"algorithm": {
"type": "string"
}
},
"required": [
"digest",
"path",
"algorithm"
]
},
"version": {
"type": "integer",
"description": "version describing the result data shape + the provider processing behavior semantics"
},
"distribution_version": {
"type": "integer",
"description": "version describing purely the result data shape"
},
"schema": {
"type": "object",
"properties": {
"version": {
"type": "string"
},
"url": {
"type": "string"
}
},
"required": [
"version",
"url"
]
},
"stale": {
"type": "boolean",
"description": "set to true if the workspace is stale and cannot be used for an incremental update"
}
},
"required": [
"provider",
"urls",
"store",
"timestamp",
"listing",
"version",
"schema"
]
}
58 changes: 53 additions & 5 deletions src/vunnel/cli/config.py
@@ -2,13 +2,41 @@

import os
from dataclasses import dataclass, field, fields
from typing import Any
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
from collections.abc import Generator

import mergedeep
import yaml
from mashumaro.mixins.dict import DataClassDictMixin

from vunnel import providers
from vunnel import provider, providers


@dataclass
class ImportResults:
"""
These are the defaults for all providers. Corresponding
fields on specific providers override these values.
If a path is "" or None, path will be set to "providers/{provider_name}/listing.json".
If an empty path is needed, specify "/".
"""

__default_path__ = "providers/{provider_name}/listing.json"
host: str = ""
path: str = __default_path__
enabled: bool = False

def __post_init__(self) -> None:
if not self.path:
self.path = self.__default_path__


@dataclass
class CommonProviderConfig:
import_results: ImportResults = field(default_factory=ImportResults)


@dataclass
@@ -26,12 +54,32 @@ class Providers:
ubuntu: providers.ubuntu.Config = field(default_factory=providers.ubuntu.Config)
wolfi: providers.wolfi.Config = field(default_factory=providers.wolfi.Config)

common: CommonProviderConfig = field(default_factory=CommonProviderConfig)

def __post_init__(self) -> None:
for name in self.provider_names():
runtime_cfg = getattr(self, name).runtime
if runtime_cfg and isinstance(runtime_cfg, provider.RuntimeConfig):
if runtime_cfg.import_results_enabled is None:
runtime_cfg.import_results_enabled = self.common.import_results.enabled
if not runtime_cfg.import_results_host:
runtime_cfg.import_results_host = self.common.import_results.host
if not runtime_cfg.import_results_path:
runtime_cfg.import_results_path = self.common.import_results.path

def get(self, name: str) -> Any | None:
for f in fields(Providers):
if self._normalize_name(f.name) == self._normalize_name(name):
return getattr(self, f.name)
for candidate in self.provider_names():
if self._normalize_name(candidate) == self._normalize_name(name):
return getattr(self, candidate)
return None

@staticmethod
def provider_names() -> Generator[str, None, None]:
for f in fields(Providers):
if f.name == "common":
continue
yield f.name

@staticmethod
def _normalize_name(name: str) -> str:
return name.lower().replace("-", "_")
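The defaulting rules in `ImportResults` above (both `""` and `None` fall back to the default path template; a truly empty path must be given explicitly as `"/"`) can be exercised with a standalone sketch of the same pattern (mashumaro mixin omitted):

```python
from dataclasses import dataclass


@dataclass
class ImportResults:
    # Both "" and None fall back to the default path template;
    # an intentionally empty path must be specified as "/".
    __default_path__ = "providers/{provider_name}/listing.json"
    host: str = ""
    path: str = __default_path__
    enabled: bool = False

    def __post_init__(self) -> None:
        if not self.path:
            self.path = self.__default_path__
```

Per-provider runtime configs then pick up these common values in `Providers.__post_init__` only where the provider has not set its own.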
89 changes: 89 additions & 0 deletions src/vunnel/distribution.py
@@ -0,0 +1,89 @@
from __future__ import annotations

import datetime
import os
from dataclasses import dataclass, field
from urllib.parse import urlparse

import iso8601
from mashumaro.mixins.dict import DataClassDictMixin

from vunnel import schema as schema_def

DB_SUFFIXES = {".tar.gz", ".tar.zst"}


@dataclass
class ListingEntry(DataClassDictMixin):
# the date this archive was built relative to the data enclosed in the archive
built: str

# the URL where the vunnel provider archive is located
url: str

# the digest of the archive referenced at the URL.
# Note: all checksums are labeled with "algorithm:value" (e.g. sha256:1234567890abcdef1234567890abcdef)
distribution_checksum: str

# the digest of the checksums file within the archive referenced at the URL
# Note: all checksums are labeled with "algorithm:value" (e.g. xxhash64:1234567890abcdef)
enclosed_checksum: str

# the provider distribution version this archive was built with (different than the provider version)
distribution_version: int = 1

def basename(self) -> str:
basename = os.path.basename(urlparse(self.url, allow_fragments=False).path)
if not _has_suffix(basename, suffixes=DB_SUFFIXES):
msg = f"entry url is not a db archive: {basename}"
raise RuntimeError(msg)

return basename

def age_in_days(self, now: datetime.datetime | None = None) -> int:
if not now:
now = datetime.datetime.now(tz=datetime.timezone.utc)
return (now - iso8601.parse_date(self.built)).days


@dataclass
class ListingDocument(DataClassDictMixin):
# mapping of provider versions to a list of ListingEntry objects denoting archives available for download
available: dict[int, list[ListingEntry]]

# the provider name this document is associated with
provider: str

# the schema information for this document
schema: schema_def.Schema = field(default_factory=schema_def.ProviderListingSchema)

@classmethod
def new(cls, provider: str) -> ListingDocument:
return cls(available={}, provider=provider)

def latest_entry(self, schema_version: int) -> ListingEntry | None:
if schema_version not in self.available:
return None

if not self.available[schema_version]:
return None

return self.available[schema_version][0]

def add(self, entry: ListingEntry) -> None:
if not self.available.get(entry.distribution_version):
self.available[entry.distribution_version] = []

self.available[entry.distribution_version].append(entry)

# keep listing entries sorted by date (rfc3339 formatted entries, which iso8601 is a superset of)
self.available[entry.distribution_version].sort(
key=lambda x: iso8601.parse_date(x.built),
reverse=True,
)


def _has_suffix(el: str, suffixes: set[str] | None) -> bool:
if not suffixes:
return True
return any(el.endswith(s) for s in suffixes)
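`ListingEntry.age_in_days` above parses the RFC 3339 `built` timestamp with the iso8601 package; for timestamps in the `+00:00` offset form, a stdlib-only equivalent can be sketched like this (illustrative, not the code above; note that the bare `Z` suffix is only accepted by `fromisoformat` on Python 3.11+):

```python
import datetime
from typing import Optional


def age_in_days(built: str, now: Optional[datetime.datetime] = None) -> int:
    # built is an RFC 3339 timestamp such as "2024-03-20T00:00:00+00:00";
    # datetime.fromisoformat handles this offset form directly.
    if now is None:
        now = datetime.datetime.now(tz=datetime.timezone.utc)
    return (now - datetime.datetime.fromisoformat(built)).days
```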
