Set the package_content values for the PyPI collector #619

@tdruez

The PackageContentType value is set in various collectors, but it is missing from the PyPI one.

Most Python packages have one or more binary wheels and a source archive (tar.gz, zip, ...) available on PyPI.

Having the package_content value for PyPI packages would be very useful on the PurlDB API consumer side (DejaCode, for example): it would make it easy to select the source as the primary package when multiple records are available (usually two or more per PURL).


For example: purl=pkg:pypi/boto3@1.37.26

Two records are available in PurlDB:

  • pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26-py3-none-any.whl
  • pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26.tar.gz

Both are correctly part of the same PackageSet, but neither has a package_content value.
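
To illustrate the consumer-side benefit, here is a minimal sketch of how a client could pick the source record once package_content is populated. It assumes each record is a plain dict with `purl` and `package_content` keys and that the value is serialized as the strings "source_archive" and "binary"; the helper name `select_primary_package` and this record shape are hypothetical, and the actual PurlDB API response may differ.

```python
def select_primary_package(records, preferred="source_archive"):
    """Return the first record whose package_content equals `preferred`.

    Falls back to the first record when none matches, and returns None
    for an empty list. The record shape is an assumption for this sketch.
    """
    if not records:
        return None
    for record in records:
        if record.get("package_content") == preferred:
            return record
    return records[0]


# The two boto3 records from this issue, with hypothetical package_content values.
records = [
    {
        "purl": "pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26-py3-none-any.whl",
        "package_content": "binary",
    },
    {
        "purl": "pkg:pypi/boto3@1.37.26?file_name=boto3-1.37.26.tar.gz",
        "package_content": "source_archive",
    },
]

primary = select_primary_package(records)
print(primary["purl"])
```

Without a package_content value, a consumer would have to fall back on guessing from the file_name qualifier, which is the logic this issue proposes to move into the collector.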

Something along these lines could be implemented (it would need to be adapted and tested):

from pathlib import Path
from urllib.parse import urlparse

from packagedb.models import PackageContentType

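def get_pypi_package_content_type(download_url):
    # Source distributions are published as archives.
    source_extensions = (".tar.gz", ".zip", ".tar.bz2", ".tar.xz", ".tar.Z", ".tgz", ".tbz")
    # Built distributions: wheels and legacy eggs.
    binary_extensions = (".whl", ".egg")

    # Keep only the file name from the URL path, ignoring any query or fragment.
    filename = Path(urlparse(download_url).path).name

    if filename.endswith(source_extensions):
        return PackageContentType.SOURCE_ARCHIVE
    if filename.endswith(binary_extensions):
        return PackageContentType.BINARY
    # Any other extension falls through and returns None.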
