Harvesting not working #561

Open
glogowski-wojciech-MSFT opened this issue Mar 5, 2024 · 5 comments

Comments

@glogowski-wojciech-MSFT

I submitted harvesting requests for the following Python packages through the ClearlyDefined website: first on 2024-02-28 in a single query, and then on 2024-02-29 with each package in its own query, in a separate browser tab. As of today, 2024-03-04, none of these packages has been harvested:

package name 2024-02-28 query 2024-02-29 query
pypi/pypi/-/nvidia-cublas-cu12/12.1.3.1 1 1
pypi/pypi/-/nvidia-cuda-cupti-cu12/12.1.105 1 1
pypi/pypi/-/nvidia-cuda-nvrtc-cu12/12.1.105 1
pypi/pypi/-/nvidia-cuda-runtime-cu12/12.1.105 1 1
pypi/pypi/-/nvidia-cudnn-cu12/8.9.2.26 1 1
pypi/pypi/-/nvidia-cufft-cu12/11.0.2.54 1 1
pypi/pypi/-/nvidia-curand-cu12/10.3.2.106 1
pypi/pypi/-/nvidia-cusolver-cu12/11.4.5.107 1 1
pypi/pypi/-/nvidia-cusparse-cu12/12.1.0.106 1 1
pypi/pypi/-/nvidia-nccl-cu12/2.19.3 1 1
pypi/pypi/-/nvidia-nvjitlink-cu12/12.3.101 1 1
pypi/pypi/-/nvidia-nvtx-cu12/12.1.105 1 1
pypi/pypi/-/onnxruntime/1.17.1 1
pypi/pypi/-/tensorboard-data-server/0.7.2 1
pypi/pypi/-/tensorboard/2.16.2 1
pypi/pypi/-/thop/0.1.1.post2207130030 1 1
pypi/pypi/-/torch/2.2.1 1 1
pypi/pypi/-/torchvision/0.17.1 1 1
pypi/pypi/-/triton/2.2.0 1

Harvesting either does not work reliably or takes a very long time (5 days and counting). Either way, I believe this requires a fix or at least extra documentation. I would also appreciate help with harvesting these specific packages.

glogowski-wojciech-MSFT changed the title from "Harvesting not working or taking very long" to "Harvesting not working" on Mar 11, 2024
@glogowski-wojciech-MSFT
Author

As of today, 2024-03-11, none of these packages has been harvested. I requested harvesting again programmatically on 2024-03-06 and received HTTP 201 responses. Given that it has been 12 days since the original harvesting requests, I am changing the issue title from "Harvesting not working or taking very long" to "Harvesting not working".
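
For readers who want to reproduce a programmatic request like the one described above, here is a minimal sketch. It is not the script used in this thread. It assumes the public ClearlyDefined service accepts a POST to `https://api.clearlydefined.io/harvest` with a JSON list of `{"tool", "coordinates"}` entries; the endpoint URL, the body shape, and the function names `build_harvest_payload` and `request_harvest` are assumptions made for illustration.

```python
import json
import urllib.request

# Assumed endpoint of the public ClearlyDefined service; not quoted in this thread.
HARVEST_URL = "https://api.clearlydefined.io/harvest"


def build_harvest_payload(coordinates):
    """Return the JSON-serializable request body for a list of coordinates."""
    # Assumed body shape: one {"tool", "coordinates"} entry per component.
    return [{"tool": "package", "coordinates": c} for c in coordinates]


def request_harvest(coordinates, url=HARVEST_URL, timeout=30):
    """POST a harvest request and return the HTTP status code."""
    body = json.dumps(build_harvest_payload(coordinates)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Example call (performs a real network request, so it is left commented out):
# request_harvest(["pypi/pypi/-/torch/2.2.1", "pypi/pypi/-/triton/2.2.0"])
```

As this thread illustrates, a 201 status appears to mean only that the request was accepted, not that a definition will eventually be produced.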

@qtomlinson
Collaborator

@glogowski-wojciech-MSFT Thanks for reporting the issue! In ClearlyDefined, we typically download source distributions (*.tar.gz or *.zip) for Python packages. However, when I checked the first three packages on PyPI, I found that they do not have source distributions available. You can find the package information here:
https://pypi.org/project/nvidia-cublas-cu12/12.1.3.1/#files
https://pypi.org/project/nvidia-cuda-cupti-cu12/12.1.105/#files
https://pypi.org/project/nvidia-cuda-runtime-cu12/12.1.105/#files

The absence of source distributions may be the reason why the harvesting process failed for the listed packages.
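
The manual check above can be scripted. The sketch below assumes the PyPI JSON API at `https://pypi.org/pypi/<name>/<version>/json`, whose `urls` list carries a `packagetype` field for each uploaded file (`sdist` for source distributions, `bdist_wheel` for wheels). The helper names are hypothetical, and this is not necessarily how the ClearlyDefined crawler decides that a package is missing.

```python
import json
import urllib.request


def parse_pypi_coordinates(coordinates):
    """Split 'pypi/pypi/-/<name>/<version>' into (name, version)."""
    parts = coordinates.split("/")
    if len(parts) != 5 or parts[0] != "pypi":
        raise ValueError(f"not a PyPI coordinate: {coordinates!r}")
    return parts[3], parts[4]


def has_sdist(release_json):
    """Return True if the release metadata lists at least one source distribution."""
    files = release_json.get("urls", [])
    return any(f.get("packagetype") == "sdist" for f in files)


def fetch_release_json(name, version, timeout=30):
    """Download the release metadata for one package version from PyPI."""
    url = f"https://pypi.org/pypi/{name}/{version}/json"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example (performs a real network request, so it is left commented out):
# name, version = parse_pypi_coordinates("pypi/pypi/-/nvidia-cublas-cu12/12.1.3.1")
# print(has_sdist(fetch_release_json(name, version)))
```

Running this over the coordinates in the table would show which of the requested packages publish only wheels.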

@qtomlinson
Collaborator

qtomlinson commented May 6, 2024

During the harvesting process, we download a source distribution from PyPI to perform further analysis, such as running the licensee, reuse, and ScanCode tools. If a source package is not available, the package is currently marked as missing. This behavior was introduced in this PR to address this issue.

When a package is marked as missing during the harvest, there is no information stored regarding the downloaded registry information for that PyPI package. In addition, curation can only be created through a pull request against https://github.com/clearlydefined/curated-data rather than through the user interface.

Due to recent questions about harvesting PyPI packages without source distributions, it may be worthwhile to discuss the matter further on the original issue. Should we allow the harvest to succeed even if the source PyPI package cannot be downloaded? Could it be considered the intended behavior for those PyPI packages where no files are displayed on the components details page due to the unavailability of the source package?

@capfei @bduranc @jeffwilcox @elrayle Any thoughts?

@elrayle
Collaborator

elrayle commented May 6, 2024

@Jeffrey-Luszcz ☝ See comment responding to issue raised in the community meeting today.

@bduranc

bduranc commented May 9, 2024

> During the harvesting process, we download a source distribution from PyPI to perform further analysis, such as running the licensee, reuse, and ScanCode tools. If a source package is not available, the package is currently marked as missing. This behavior was introduced in this PR to address this issue.
>
> When a package is marked as missing during the harvest, there is no information stored regarding the downloaded registry information for that PyPI package. In addition, curation can only be created through a pull request against https://github.com/clearlydefined/curated-data rather than through the user interface.
>
> Due to recent questions about harvesting PyPI packages without source distributions, it may be worthwhile to discuss the matter further on the original issue. Should we allow the harvest to succeed even if the source PyPI package cannot be downloaded? Could it be considered the intended behavior for those PyPI packages where no files are displayed on the components details page due to the unavailability of the source package?
>
> @capfei @bduranc @jeffwilcox @elrayle Any thoughts?

Can I assume, in this context, that the "normal" package files (i.e. binary/deployable code) are still being retrieved and scanned?
Or is this what you are referring to by "source distributions"? I ask because for other types, like Maven and Debian, we do harvest the source separately from the binary artifact, as its own (source archive) definition, and neither holds the other up.

In either case, I think what's important is that we have a clear way of telling end users why they can't see the files. And if it's due to a tool error (as was discussed in clearlydefined/website#964), then I consider this different from the files just "not being available". We of course cannot consider the harvest "succeeded" in such cases.
