
chore(ci): add binary-license manifest check and collector script#4451

Closed
bobbai00 wants to merge 4 commits into apache:main from bobbai00:chore/binary-license-ci


@bobbai00 bobbai00 commented Apr 22, 2026

Contributor

What changes were proposed in this PR?

Add a CI workflow and two helper scripts that diff LICENSE-binary against the actually-bundled dependencies for three ecosystems.

  • .github/workflows/check-binary-licenses.yml — runs on PRs touching **/build.sbt, project/plugins.sbt, project/AddMetaInfLicenseFiles.scala, frontend/package.json, frontend/yarn.lock, amber/requirements.txt, amber/operator-requirements.txt, LICENSE-binary, NOTICE-binary, bin/licensing/**, or the workflow itself. Three jobs:

    • check-jvm-deps: sbt dist each module, unzip lib/, run the checker.
    • check-npm-deps: frontend production build (emits 3rdpartylicenses.txt), run the checker.
    • check-python-deps: install Python requirements, pip-licenses → CSV, run the checker.
  • bin/licensing/check_binary_deps.py — parses bullets under each ecosystem heading in LICENSE-binary, reports ADDED (bundled but not claimed) and STALE (claimed but not bundled). Skips jars whose stripped stem starts with org.apache.texera. so Texera's own jars aren't flagged.

  • bin/licensing/collect_binary_licenses.sh — maintainer helper that enumerates the currently-bundled dependencies to seed or refresh LICENSE-binary.
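The ADDED / STALE report described above amounts to two set differences. A minimal sketch, assuming both sides have already been reduced to plain name strings; `diff_claims` and the sample names are hypothetical and are not taken from check_binary_deps.py:

```python
def diff_claims(claimed, bundled):
    """Compare claimed names against bundled names.

    Returns (added, stale):
      added: bundled but not claimed in LICENSE-binary
      stale: claimed in LICENSE-binary but no longer bundled
    """
    claimed, bundled = set(claimed), set(bundled)
    added = sorted(bundled - claimed)
    stale = sorted(claimed - bundled)
    return added, stale


if __name__ == "__main__":
    # Illustrative names only.
    added, stale = diff_claims(
        claimed={"guava", "jackson-core", "old-lib"},
        bundled={"guava", "jackson-core", "new-lib"},
    )
    for name in added:
        print(f"ADDED: {name}")
    for name in stale:
        print(f"STALE: {name}")
```

Sorting both lists keeps the CI output stable across runs, which makes failures easier to compare between PRs.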

Any related issues, documentation, discussions?

Closes #4450. Depends on #4387 (the checker reads LICENSE-binary), #4449 (the JVM job needs the dist zips), and #4447 (the self-skip relies on the org.apache.texera. groupId).

How was this PR tested?

Checker run against the reviewer spreadsheet that informed LICENSE-binary, confirming the expected ADDED / STALE sets. Self-skip verified on a post-#4447 dist tree — org.apache.texera.<artifact>-*.jar is filtered out.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)

Adds a CI workflow and two helper scripts that keep LICENSE-binary
and NOTICE-binary honest against the actually-bundled dependencies:

- .github/workflows/check-binary-licenses.yml: runs on PRs touching
  build.sbt, plugins.sbt, the licensing helpers, the frontend package
  manifests, the Python requirements, or LICENSE-binary/NOTICE-binary.
  Three jobs:
    * check-jvm-deps: builds every sbt-native-packager dist, unzips
      the lib/ directories, and runs the JVM checker against them.
    * check-npm-deps: runs the frontend production build to emit
      3rdpartylicenses.txt and runs the npm checker against it.
    * check-python-deps: installs the Python requirements, runs
      pip-licenses to produce a CSV manifest, and runs the Python
      checker against it.

- bin/licensing/check_binary_deps.py: the checker. Parses bullets in
  LICENSE-binary per ecosystem and reports ADDED (bundled but not
  claimed) and STALE (claimed but not bundled) with remediation hints.

- bin/licensing/collect_binary_licenses.sh: a maintainer helper that
  enumerates the currently-bundled jars/packages across the three
  ecosystems to seed or refresh LICENSE-binary.
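For the Python job, the bundled side of the comparison comes from the pip-licenses CSV manifest. A rough sketch of reading it, assuming the default `--format=csv` header with a `Name` column; `read_pip_licenses_csv` is a hypothetical helper, and the real checker may read the file differently:

```python
import csv
import io


def read_pip_licenses_csv(text):
    """Return the set of package names listed in a pip-licenses CSV dump.

    Assumes the first row is a header that contains a "Name" column.
    """
    reader = csv.DictReader(io.StringIO(text))
    return {row["Name"] for row in reader}


if __name__ == "__main__":
    # Illustrative manifest content only.
    sample = (
        '"Name","Version","License"\n'
        '"pytz","2017.3","MIT"\n'
        '"Django","2.0.2","BSD"\n'
    )
    print(sorted(read_pip_licenses_csv(sample)))
```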

This PR depends on LICENSE-binary / NOTICE-binary being present at
the repo root (see the companion content PR); until both land, the
workflow will no-op or fail with a clear message.

Generated-by: Claude Code (Claude Opus 4.7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the pyamber, ci (changes related to CI), and dev labels Apr 22, 2026
sbt dist zips contain both third-party jars and Texera-produced jars.
The third-party jars must be claimed by LICENSE-binary; Texera's own
jars should not be, since they are the subject of the binary distribution,
not a dependency of it.

Filter out any jar whose stripped stem starts with "org.apache.texera."
before comparing against LICENSE-binary's claims.

Depends on the groupId rename (apache#4447) having landed so Texera jars
actually carry the org.apache.texera.* prefix.
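The self-skip can be sketched as a prefix test on each jar's stripped stem. The version-stripping regex and the helper names below are assumptions made for illustration; the actual rule in check_binary_deps.py may strip versions differently:

```python
import re
from pathlib import PurePath

OWN_PREFIX = "org.apache.texera."

# Assumption: the version is everything from the first "-<digit>" onward.
_VERSION_SUFFIX = re.compile(r"-\d[\w.\-+]*$")


def stripped_stem(jar_name):
    """Drop the .jar extension and a trailing -<version> suffix."""
    stem = PurePath(jar_name).stem  # removes ".jar"
    return _VERSION_SUFFIX.sub("", stem)


def third_party_stems(jar_names):
    """Return sorted stems of jars that are not Texera's own."""
    stems = (stripped_stem(name) for name in jar_names)
    return sorted(s for s in stems if not s.startswith(OWN_PREFIX))


if __name__ == "__main__":
    # Illustrative file names only.
    jars = [
        "org.apache.texera.amber-1.0.0.jar",
        "com.google.guava.guava-33.0.0-jre.jar",
    ]
    print(third_party_stems(jars))
```

Keeping the trailing dot in the prefix avoids skipping a third-party artifact whose group merely begins with the same characters, such as a hypothetical `org.apache.texerafoo`.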

Generated-by: Claude Code (Claude Opus 4.7)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@chenlica

Contributor

@bobbai00 Thanks. Please suggest a reviewer.

bobbai00 added a commit to bobbai00/texera that referenced this pull request Apr 24, 2026
Hyphenate three Python entries so they match the canonical name
(lowercase, [-_.] collapsed to -):
  - huggingface_hub   -> huggingface-hub
  - pydantic_core     -> pydantic-core
  - typing_extensions -> typing-extensions

Paired with a checker change on apache#4451 that applies the same
normalization to both claimed and real names before comparison,
making the verification robust against upstream hyphen/underscore
variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PyPI distributions inconsistently use hyphens, underscores, or dots
in their canonical names (huggingface_hub, pydantic_core,
python-dateutil, scikit-image, ...). Verbatim string comparison
between LICENSE-binary entries and pip-licenses output breaks the
moment an upstream bump changes the preferred form.

Apply `canonicalize_name` (lowercase, collapse [-_.]+ to -) to both
sides so the checker is indifferent. Paired with the matching
canonicalization of Python entries in LICENSE-binary on apache#4387.
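The normalization described here matches the PEP 503 rule, which `packaging.utils.canonicalize_name` also implements. A stdlib-only sketch follows; whether the checker imports `packaging` or defines its own helper is not stated in this PR, so treat the function below as illustrative:

```python
import re


def canonicalize_name(name):
    """Lowercase and collapse runs of '-', '_' and '.' into a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


if __name__ == "__main__":
    for raw in ("huggingface_hub", "pydantic_core", "typing_extensions"):
        print(f"{raw} -> {canonicalize_name(raw)}")
```

Applying the function to both the claimed and the bundled names before comparing means a `huggingface_hub` entry and a `huggingface-hub` manifest row compare equal.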

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bobbai00

Contributor Author

#4387 includes this, so closing it.

@bobbai00 bobbai00 closed this Apr 29, 2026

Labels

ci (changes related to CI), dev, pyamber


Development

Successfully merging this pull request may close these issues.

Binary license manifests lack CI enforcement

2 participants