chore(ci): add binary-license manifest check and collector script #4451
Closed
bobbai00 wants to merge 4 commits into
Adds a CI workflow and two helper scripts that keep LICENSE-binary
and NOTICE-binary honest against the actually-bundled dependencies:

- .github/workflows/check-binary-licenses.yml: runs on PRs touching
  build.sbt, plugins.sbt, the licensing helpers, the frontend package
  manifests, the Python requirements, or LICENSE-binary/NOTICE-binary.
  Three jobs:
  * check-jvm-deps: builds every sbt-native-packager dist, unzips
    the lib/ directories, and runs the JVM checker against them.
  * check-npm-deps: runs the frontend production build to emit
    3rdpartylicenses.txt and runs the npm checker against it.
  * check-python-deps: installs the Python requirements, runs
    pip-licenses to produce a CSV manifest, and runs the Python
    checker against it.
- bin/licensing/check_binary_deps.py: the checker. Parses bullets in
  LICENSE-binary per ecosystem and reports ADDED (bundled but not
  claimed) and STALE (claimed but not bundled) with remediation hints.
- bin/licensing/collect_binary_licenses.sh: a maintainer helper that
  enumerates the currently-bundled jars/packages across the three
  ecosystems to seed or refresh LICENSE-binary.

This PR depends on LICENSE-binary / NOTICE-binary being present at
the repo root (see the companion content PR); until both land, the
workflow will no-op or fail with a clear message.

Generated-by: Claude Code (Claude Opus 4.7)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
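The ADDED / STALE report described in this commit message amounts to two set differences. A minimal sketch, assuming the claimed and bundled names have already been extracted into plain sets; `diff_deps`, `DepDiff`, and the sample names are hypothetical and are not taken from check_binary_deps.py:

```python
from typing import NamedTuple


class DepDiff(NamedTuple):
    added: list[str]  # bundled but not claimed in LICENSE-binary
    stale: list[str]  # claimed in LICENSE-binary but not bundled


def diff_deps(claimed: set[str], bundled: set[str]) -> DepDiff:
    """Compare claimed names against bundled names (hypothetical helper)."""
    return DepDiff(
        added=sorted(bundled - claimed),
        stale=sorted(claimed - bundled),
    )


if __name__ == "__main__":
    # Illustrative names only; not the real manifest contents.
    claimed = {"guava", "jackson-core", "commons-io"}
    bundled = {"guava", "jackson-core", "snakeyaml"}
    result = diff_deps(claimed, bundled)
    for name in result.added:
        print(f"ADDED  {name}")
    for name in result.stale:
        print(f"STALE  {name}")
```

Sorting both lists keeps the report stable between runs, which makes CI logs easier to compare.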
sbt dist zips contain both third-party jars and Texera-produced jars. The third-party jars must be claimed by LICENSE-binary; Texera's own jars should not be, since they are the subject of the binary distribution, not a dependency of it.

Filter out any jar whose stripped stem starts with "org.apache.texera." before comparing against LICENSE-binary's claims.

Depends on the groupId rename (apache#4447) having landed so Texera jars actually carry the org.apache.texera.* prefix.

Generated-by: Claude Code (Claude Opus 4.7)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
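The self-skip described in this commit message can be sketched as a prefix filter over jar file names. This is an illustration only: `third_party_stems` and the sample jar names are hypothetical, and the real checker's "stripped stem" may remove more than the `.jar` suffix (for example a version), which does not affect a prefix test.

```python
from pathlib import PurePath
from typing import Iterable

# Prefix named in the commit message; Texera's own jars carry it after apache#4447.
OWN_PREFIX = "org.apache.texera."


def third_party_stems(jar_names: Iterable[str]) -> list[str]:
    """Drop Texera-produced jars, keep the stems of everything else."""
    stems = (PurePath(name).stem for name in jar_names)  # strips ".jar" only
    return sorted(s for s in stems if not s.startswith(OWN_PREFIX))


if __name__ == "__main__":
    # Hypothetical lib/ listing for illustration.
    names = [
        "org.apache.texera.example-artifact-1.0.0.jar",
        "com.example.some-library-2.3.4.jar",
    ]
    print(third_party_stems(names))
```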
Contributor
@bobbai00 Thanks. Please suggest a reviewer.
bobbai00 added a commit to bobbai00/texera that referenced this pull request Apr 24, 2026
Hyphenate three Python entries so they match the canonical name (lowercase, [-_.] collapsed to -):

- huggingface_hub -> huggingface-hub
- pydantic_core -> pydantic-core
- typing_extensions -> typing-extensions

Paired with a checker change on apache#4451 that applies the same normalization to both claimed and real names before comparison, making the verification robust against upstream hyphen/underscore variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PyPI distributions inconsistently use hyphens, underscores, or dots in their published names (huggingface_hub, pydantic_core, python-dateutil, scikit-image, ...). Verbatim string comparison between LICENSE-binary entries and pip-licenses output breaks the moment an upstream bump changes the preferred form.

Apply `canonicalize_name` (lowercase, collapse [-_.]+ to -) to both sides so the checker is indifferent to the separator style.

Paired with the matching canonicalization of Python entries in LICENSE-binary on apache#4387.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
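The normalization rule stated in this commit message (lowercase, collapse runs of `-`, `_`, `.` to a single `-`) can be sketched with the standard library alone. This is an assumption about the behaviour, not a copy of the checker's code; the `packaging` library ships a `packaging.utils.canonicalize_name` that follows the same rule, and the checker may use that instead.

```python
import re

# Runs of hyphen, underscore, or dot are treated as one separator.
_SEPARATORS = re.compile(r"[-_.]+")


def canonicalize_name(name: str) -> str:
    """Normalize a distribution name: collapse separators to '-', lowercase."""
    return _SEPARATORS.sub("-", name).lower()


def same_distribution(claimed: str, real: str) -> bool:
    """Compare a LICENSE-binary entry with a pip-licenses name (hypothetical helper)."""
    return canonicalize_name(claimed) == canonicalize_name(real)


if __name__ == "__main__":
    for raw in ("huggingface_hub", "pydantic_core", "typing_extensions"):
        print(raw, "->", canonicalize_name(raw))
```

Applying the function to both sides, rather than only to the claimed names, is what keeps the comparison stable when an upstream release switches separators.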
Contributor
Author
#4387 includes this, so closing it.
What changes were proposed in this PR?

Add a CI workflow and two helper scripts that diff LICENSE-binary against the actually-bundled dependencies for three ecosystems.

- .github/workflows/check-binary-licenses.yml: runs on PRs touching **/build.sbt, project/plugins.sbt, project/AddMetaInfLicenseFiles.scala, frontend/package.json, frontend/yarn.lock, amber/requirements.txt, amber/operator-requirements.txt, LICENSE-binary, NOTICE-binary, bin/licensing/**, or the workflow itself. Three jobs:
  * check-jvm-deps: sbt dist each module, unzip lib/, run the checker.
  * check-npm-deps: run the frontend production build (emits 3rdpartylicenses.txt), run the checker.
  * check-python-deps: pip-licenses → CSV, run the checker.
- bin/licensing/check_binary_deps.py: parses bullets under each ecosystem heading in LICENSE-binary, reports ADDED (bundled but not claimed) and STALE (claimed but not bundled). Skips jars whose stripped stem starts with org.apache.texera. so Texera's own jars aren't flagged.
- bin/licensing/collect_binary_licenses.sh: maintainer helper that enumerates the currently-bundled dependencies to seed or refresh LICENSE-binary.

Any related issues, documentation, discussions?

Closes #4450. Depends on #4387 (the checker reads LICENSE-binary), #4449 (the JVM job needs the dist zips), and #4447 (the self-skip relies on the org.apache.texera. groupId).

How was this PR tested?

Checker run against the reviewer spreadsheet that informed LICENSE-binary, confirming the expected ADDED / STALE sets. Self-skip verified on a post-#4447 dist tree: org.apache.texera.<artifact>-*.jar is filtered out.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.7)