
PyTorch and Torchvision packages #941

Closed · wants to merge 13 commits

Conversation

aclex (Contributor) commented Feb 10, 2020

Here's a merge of the current state of the PyTorch, Torchvision and some other required packages' ebuilds, as developed at https://github.com/aclex/pytorch-ebuild. It was suggested that I open a pull request to merge them here and join development efforts, so I've decided to try.

Could you please take a look at the ebuilds and consider merging them into your repository? Please let me know if anything should be fixed for that. Thanks in advance!

A merge of the current state of the PyTorch, Torchvision and
some other required packages' ebuilds, as developed at
https://github.com/aclex/pytorch-ebuild, to join development
efforts.
heroxbd (Contributor) left a comment:

Hi, thank you so much for your PR.
Please refer to the Python project recommendations for improvements: https://wiki.gentoo.org/wiki/Project:Python

app-doc/pytorch-docs/pytorch-docs-1.3.1.ebuild (outdated)

src_unpack() {
	unpack ${A}
	mv "${WORKDIR}/${PYTORCH_NAME}-${PV}" "${S}"
heroxbd (Contributor):

mv is an external command; it should be followed by || die.

aclex (Contributor Author):

Thanks for pointing this out! Fixed.
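
For illustration, the corrected hunk would presumably look something like this (a sketch, not the exact committed code):

src_unpack() {
	unpack ${A}
	# mv is an external command, so abort explicitly if it fails
	mv "${WORKDIR}/${PYTORCH_NAME}-${PV}" "${S}" || die "failed to rename the unpacked source directory"
}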

app-doc/pytorch-docs/pytorch-docs-1.3.1.ebuild (outdated)
src_install() {
	local doc_build_dir="${S}/docs"

	mkdir -p "${D}/usr/share/doc/${P}"
heroxbd (Contributor):

|| die

aclex (Contributor Author):

Fixed.
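
A sketch of the same step using the standard install helpers instead, which already die on failure themselves (the docs/build/html path is an assumption about where the generated HTML ends up):

src_install() {
	local doc_build_dir="${S}/docs"

	# docinto/dodoc abort on failure by themselves, so no explicit || die is needed
	docinto html
	dodoc -r "${doc_build_dir}/build/html/."
}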

app-doc/pytorch-docs/pytorch-docs-1.3.1.ebuild (outdated)

RDEPEND="
>=dev-python/javalang-0.10.1
dev-python/lxml
heroxbd (Contributor):

All the Python dependencies should have the same Python version USE flags. cf. https://wiki.gentoo.org/wiki/Project:Python/distutils-r1#PYTHON_USEDEP

The same applies to all others.
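
Roughly what the linked wiki page recommends, applied to the quoted hunk (a sketch; the real dependency list is longer):

# ${PYTHON_USEDEP} forwards this ebuild's enabled Python targets to each Python dependency
RDEPEND="
	>=dev-python/javalang-0.10.1[${PYTHON_USEDEP}]
	dev-python/lxml[${PYTHON_USEDEP}]
"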

aclex (Contributor Author):

I'd like to run some more thorough tests on this; as I recall, I removed it after it led to some failures, though I'm not sure why. I'll test it and sort it out.

aclex (Contributor Author):

Tested and fixed it as well in the affected ebuilds.

sci-libs/pytorch/pytorch-1.3.1.ebuild (outdated)

for file in ${multilib_failing_files[@]}; do
	mv -f "${D}/usr/lib/$file" "${D}/usr/lib64"
done
heroxbd (Contributor):

This works. But we'd better fix the build system.

aclex (Contributor Author):

Unfortunately, this needs some more experiments and tests, as we're somewhat going against upstream here.
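
One way the build system itself could be steered instead of moving files afterwards (a sketch, assuming the ebuild drives the build through cmake-utils and that the package's CMake honours the standard CMAKE_INSTALL_LIBDIR variable):

src_configure() {
	local mycmakeargs=(
		# install libraries straight into the multilib-correct directory
		-DCMAKE_INSTALL_LIBDIR="$(get_libdir)"
	)
	cmake-utils_src_configure
}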

rm -rfv "${D}/usr/lib64/cmake"

rm -rfv "${D}/usr/share/doc/mkldnn"

heroxbd (Contributor):

Patching the build system would be better and cleaner.
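
The usual mechanism for carrying such a build-system fix, should someone write the patch (the file name here is hypothetical):

# applied automatically by the default src_prepare in EAPI 6/7
PATCHES=(
	# hypothetical patch stopping the build system from installing
	# bundled mkldnn docs and stray CMake files
	"${FILESDIR}/${P}-no-bundled-doc-install.patch"
)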

heroxbd (Contributor):

But we can improve it gradually if your time is limited.

heroxbd (Contributor):

Same applies to 1.4.0.

aclex (Contributor Author):

Agreed, but not changed yet.


if use python; then
	FORCE_CUDA=$(usex cuda 1 0) \
	CUDA_HOME=$(usex cuda ${CUDA_HOME} "") \
heroxbd (Contributor):

Where is CUDA_HOME set?

aclex (Contributor Author):

I've just put it there following what setup.py expects, but it seems to require some more environment preparation, as done here: https://github.com/pytorch/vision/blob/master/packaging/pkg_helpers.bash

heroxbd (Contributor):

I see.
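
For reference, on Gentoo the CUDA toolkit (dev-util/nvidia-cuda-toolkit) installs under /opt/cuda, so one hedged way to provide CUDA_HOME without relying on the caller's environment could be the following (the trailing phase call is only illustrative of where the variables would be consumed):

if use python; then
	# FORCE_CUDA and CUDA_HOME are read by torchvision's setup.py;
	# /opt/cuda is where dev-util/nvidia-cuda-toolkit installs on Gentoo
	FORCE_CUDA=$(usex cuda 1 0) \
	CUDA_HOME=$(usex cuda /opt/cuda "") \
	distutils-r1_src_compile
fi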

aclex (Contributor Author) commented Feb 14, 2020

@heroxbd Thank you very much for the hard work of reviewing all these ebuilds, Benda! Many thanks for the detailed comments and references! I'll try to fix them all and update the request as soon as possible.

heroxbd (Contributor) commented Feb 18, 2020

> @heroxbd Thank you very much for the hard work of reviewing all these ebuilds, Benda! Many thanks for the detailed comments and references! I'll try to fix them all and update the request as soon as possible.

Hi Alex, are you working on this? If you are busy, I can implement what I have commented.

aclex (Contributor Author) commented Feb 18, 2020

@heroxbd Hi, Benda! Yes, unfortunately I didn't have a chance to work on it last weekend, so it's going slower than I expected. I hope to submit the fixes soon, maybe this evening, except for the things related to the build system (those artifacts which are moved or removed manually). As for those, I tried to do it the proper way, but they seem to be quite tricky both to implement and test, especially given that upstream strongly discourages any changes to the build configuration.

Sorry again for the delay; I hope to return with the results soon.

heroxbd (Contributor) commented Feb 18, 2020

> @heroxbd Hi, Benda! Yes, unfortunately I didn't have a chance to work on it last weekend, so it's going slower than I expected. I hope to submit the fixes soon, maybe this evening, except for the things related to the build system (those artifacts which are moved or removed manually). As for those, I tried to do it the proper way, but they seem to be quite tricky both to implement and test, especially given that upstream strongly discourages any changes to the build configuration.

> Sorry again for the delay; I hope to return with the results soon.

No problem, we can work together on this. You can update what you have and I can continue from there to make a working version. Then we can keep improving it.

aclex (Contributor Author) commented Feb 18, 2020

@heroxbd Benda, I've pushed the most trivial portion of the fixes for now and will continue to look at the remaining ones, though it's going to take some more time. Please feel free to make any changes you like, no problem at all. I don't really know how that can be done within the pull request, but I think I can share the request branch with you if needed.

aclex (Contributor Author) commented Feb 18, 2020

I would also like to apologize for my commit messages, which, as I see, don't tally well with the history of the repository. I tried to follow the rules in recent messages, though it's not always easy when the commit affects several ebuilds at once. Anyway, feel free to correct me.

gentoo-bot pushed a commit that referenced this pull request Feb 19, 2020
  This is CPU version only. ROCm and CUDA options will be added later.

Reference: https://github.com/aclex/pytorch-ebuild
Bug: #941
Package-Manager: Portage-2.3.88, Repoman-2.3.18
Signed-off-by: Benda Xu <heroxbd@gentoo.org>
heroxbd (Contributor) commented Feb 19, 2020

Hi, I have committed only the pytorch CPU version to the tree. We still need to add more features like ROCm and CUDA, besides torchvision, etc. Please rebase against the newest tree and send atomic pull requests.

This one is too big to handle smoothly. And please use repoman for the ebuild commits.
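
For reference, the usual repoman workflow at the time was something along these lines (the commit message is only an example):

# run from the package directory before committing
repoman full
repoman commit -m "sci-libs/pytorch: version bump to 1.4.0"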

aclex (Contributor Author) commented Feb 19, 2020

OK, thank you. At first glance it looks quite deeply modified, so I'll try to take a closer look at it later.

gentoo-bot pushed a commit that referenced this pull request Feb 20, 2020
Bug: #941
Package-Manager: Portage-2.3.88, Repoman-2.3.18
Signed-off-by: Benda Xu <heroxbd@gentoo.org>
aclex (Contributor Author) commented Feb 21, 2020

Benda, I've looked through the commit you mentioned and, to be honest, I'm quite disappointed, as it's silently pretty much another version of what I put in this request. The commit is written as if authored by me, but there are hardly a few lines in it I actually wrote :) To me it looks somewhat contradictory to your suggestion of adding things atomically. Perhaps it would have made more sense to elaborate a joint version together somewhere externally, with all the little changes and fixes gradually accumulated from both versions. But OK, that's the way it is.

I'm actually quite confused about the next steps here, as I don't know your plan: what you left out as wasteful or unnecessary, and what was just postponed for the future. As you wrote, ROCm, torchvision and other things were excluded for now, so apparently that's pretty much it. Anyway, I'll backport the fixes from your suggestions, so please feel free to pick any other parts you consider worth merging; I'd be glad to help you with that.

My initial plan after your suggestion was to merge the whole tree here and make my branch read-only, but I think I just underestimated the amount of effort and pain involved in following all the packaging rules, given there's a lot of work to be done in the ebuild itself to keep it stable enough after each version bump, while my free time for this is very limited. So I think the best way is to continue it separately. I'm going to concentrate on ROCm support now, as there are major issues there. Again, I'm open to any interaction, so we don't do the same work twice, etc.

Anyway, thank you very much again for your help, suggestions and guidance! It was a useful experience for me.

heroxbd (Contributor) commented Feb 22, 2020

> Benda, I've looked through the commit you mentioned and, to be honest, I'm quite disappointed, as it's silently pretty much another version of what I put in this request. The commit is written as if authored by me, but there are hardly a few lines in it I actually wrote :)

Not really: the CPU version is based on your ebuild, with the exception of the ROCm and SRC_URI parts. The biggest point of your contribution is the cmake-driven build and related patches, which is buried underneath setup.py. I put you as the author on purpose, because you have more code in it than me.

I am sorry if that confused you.

> To me it looks somewhat contradictory to your suggestion of adding things atomically.

By atomically, I mean putting in the CPU, AMD and nVidia versions one at a time. When it comes to AMD, land the ROCm ebuilds first, then an update to the pytorch ebuild, and finally the reverse dependencies of pytorch.

> Perhaps it would have made more sense to elaborate a joint version together somewhere externally, with all the little changes and fixes gradually accumulated from both versions. But OK, that's the way it is.

The science overlay itself is a working place for the Gentoo main tree. Ultimately every ebuild that matures in the science overlay should land in Gentoo. So let's polish the joint version in the science overlay.

> I'm actually quite confused about the next steps here, as I don't know your plan: what you left out as wasteful or unnecessary, and what was just postponed for the future. As you wrote, ROCm, torchvision and other things were excluded for now, so apparently that's pretty much it. Anyway, I'll backport the fixes from your suggestions, so please feel free to pick any other parts you consider worth merging; I'd be glad to help you with that.

Please see the atomic comment above.

> My initial plan after your suggestion was to merge the whole tree here and make my branch read-only, but I think I just underestimated the amount of effort and pain involved in following all the packaging rules, given there's a lot of work to be done in the ebuild itself to keep it stable enough after each version bump, while my free time for this is very limited. So I think the best way is to continue it separately. I'm going to concentrate on ROCm support now, as there are major issues there. Again, I'm open to any interaction, so we don't do the same work twice, etc.

Thank you! I am interested in the ROCm ebuilds as well. If you could revise them according to the standards, I am more than happy to land them in the science overlay.

I understand there is a steep learning curve for packaging. Requiring a quality standard may not seem to make much sense as long as things just work, but the quality standards are accumulated wisdom and proven best practice from history. They will pay off in the long run.

> Anyway, thank you very much again for your help, suggestions and guidance! It was a useful experience for me.

You are welcome. I apologize again for not synchronizing the big picture of pytorch/ROCm with you promptly.

aclex (Contributor Author) commented Feb 22, 2020

Thank you for the reply and clarification of your view, Benda!

> The science overlay itself is a working place for the Gentoo main tree. Ultimately every ebuild that matures in the science overlay should land in Gentoo. So let's polish the joint version in the science overlay.

This is a very attractive aim, especially in the case of PyTorch, as, to my knowledge, it isn't packaged in the repositories of many other distributions.

The main problem is that the recommended build process simply ignores system-wide installation, the FHS, separating the build from the installation and all that, and very clearly discourages anyone from trying to do these things. I don't think that's out of bad will, but rather because of a complex build configuration historically fused together from several different packages. To make it system compliant we actually have to violate its rules, which usually leads to quite subtle problems. In that case, incremental steps are very useful and valuable, for the same reasons we do separate commits instead of one squashed one. For example, I carefully selected all the header files in the installation image, checking whether or not they are included from some Torch public headers, so as not to break compilation of the C++ part and of packages which depend on it (like, as I remember, torchvision). As I can see, at least the pybind11 headers are commented out from installation in the submitted version, so it's unclear whether it works in those scenarios at all. But it's already there. Such things could easily be avoided if we went over all the proposals together, discussing each one separately. Two heads are better than one, as they say :)

> …when it comes to AMD, land the ROCm ebuilds first, then an update to the pytorch ebuild, and finally the reverse dependencies of pytorch.

The ROCm-enabled build relies on the ebuilds in a separate repository: https://github.com/justxi/rocm/. As far as I know, packages from there are being accepted into the main Gentoo repository directly, but that process isn't finished; about half of the packages have not yet been moved. My PyTorch ebuild just references the dependencies from there and assumes that repository has been registered beforehand when the rocm USE flag is enabled. Effectively this means it depends on packages that don't exist in the tree, so if that isn't acceptable for the sci overlay, the ROCm-enabled configuration would have to wait until all the ROCm packages are added to the main Gentoo repository.
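
For anyone wanting to try the ROCm-enabled configuration, that external overlay would have to be registered first; one way to do it (the local repository name "rocm" chosen here is arbitrary):

# register and sync the unofficial ROCm overlay before enabling USE=rocm
emerge --ask app-eselect/eselect-repository
eselect repository add rocm git https://github.com/justxi/rocm.git
emerge --sync rocm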

> I understand there is a steep learning curve for packaging. Requiring a quality standard may not seem to make much sense as long as things just work, but the quality standards are accumulated wisdom and proven best practice from history. They will pay off in the long run.

I certainly agree with you on the importance of the rules and always try to respect and follow the ones I'm aware of. The unfortunate part is just that they are quite spread out, and there are plenty of moments when all the checks pass and you think it's all good, while many more requirements are still being violated. I mean, I just want to share my results with everyone, to save them from doing the same work, and make them as easy to use as possible. I don't have the power to keep everyone happy with every single one of my commit messages :)

heroxbd (Contributor) commented Feb 23, 2020

> Thank you for the reply and clarification of your view, Benda!

>> The science overlay itself is a working place for the Gentoo main tree. Ultimately every ebuild that matures in the science overlay should land in Gentoo. So let's polish the joint version in the science overlay.

> This is a very attractive aim, especially in the case of PyTorch, as, to my knowledge, it isn't packaged in the repositories of many other distributions.

Indeed.

> The main problem is that the recommended build process simply ignores system-wide installation, the FHS, separating the build from the installation and all that, and very clearly discourages anyone from trying to do these things. I don't think that's out of bad will, but rather because of a complex build configuration historically fused together from several different packages. To make it system compliant we actually have to violate its rules, which usually leads to quite subtle problems. In that case, incremental steps are very useful and valuable, for the same reasons we do separate commits instead of one squashed one. For example, I carefully selected all the header files in the installation image, checking whether or not they are included from some Torch public headers, so as not to break compilation of the C++ part and of packages which depend on it (like, as I remember, torchvision). As I can see, at least the pybind11 headers are commented out from installation in the submitted version, so it's unclear whether it works in those scenarios at all. But it's already there. Such things could easily be avoided if we went over all the proposals together, discussing each one separately. Two heads are better than one, as they say :)

That's a very smart move to use the PyPI package to figure out which headers to keep.

We are now using pybind11 from the system, not the bundled copy, as far as I remember.
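
A minimal sketch of how the system copy could be wired in on the ebuild side (the -DUSE_SYSTEM_PYBIND11 switch is an assumption about the upstream CMake options and should be verified for the packaged version):

# build-time dependency on the system pybind11
DEPEND="dev-python/pybind11[${PYTHON_USEDEP}]"

src_configure() {
	local mycmakeargs=(
		# hypothetical upstream switch; check the exact option name for this release
		-DUSE_SYSTEM_PYBIND11=ON
	)
	cmake-utils_src_configure
}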

>> …when it comes to AMD, land the ROCm ebuilds first, then an update to the pytorch ebuild, and finally the reverse dependencies of pytorch.

> The ROCm-enabled build relies on the ebuilds in a separate repository: https://github.com/justxi/rocm/. As far as I know, packages from there are being accepted into the main Gentoo repository directly, but that process isn't finished; about half of the packages have not yet been moved. My PyTorch ebuild just references the dependencies from there and assumes that repository has been registered beforehand when the rocm USE flag is enabled. Effectively this means it depends on packages that don't exist in the tree, so if that isn't acceptable for the sci overlay, the ROCm-enabled configuration would have to wait until all the ROCm packages are added to the main Gentoo repository.

Good to know of the ongoing ROCm efforts.

>> I understand there is a steep learning curve for packaging. Requiring a quality standard may not seem to make much sense as long as things just work, but the quality standards are accumulated wisdom and proven best practice from history. They will pay off in the long run.

> I certainly agree with you on the importance of the rules and always try to respect and follow the ones I'm aware of. The unfortunate part is just that they are quite spread out, and there are plenty of moments when all the checks pass and you think it's all good, while many more requirements are still being violated. I mean, I just want to share my results with everyone, to save them from doing the same work, and make them as easy to use as possible. I don't have the power to keep everyone happy with every single one of my commit messages :)

Sure, no problem at all.

aclex (Contributor Author) commented Feb 24, 2020

> We are now using pybind11 from the system, not the bundled copy, as far as I remember.

Ah, indeed, I hadn't noticed it's available in the system. Then that's obviously the best option, and I'm probably wrong here; hopefully it's going to work then.

> That's a very smart move to use the PyPI package to figure out which headers to keep.

Unfortunately, it's not a reliable enough source: for C++ development, for example, it isn't even mentioned, the preference being the official ZIP binary archive unpacked locally. Another problem is that the PyPI packages don't always prove to be absolutely consistent; for example, the CPU version of Torchvision has shipped its box operations linked against a non-existent CUDA library, leading to a runtime linker failure, for several versions in a row already. Nobody seems to care yet :) So yes, I try to check all such cases known to me manually.

heroxbd (Contributor) commented Feb 25, 2020 via email

epsilon-0 (Contributor):
@heroxbd @aclex, is this still being worked on?
I wanted to get pytorch from the central system in my cluster as opposed to letting my lab users install it manually for themselves, mostly for performance and consistency reasons.
I'm hoping this can be added soonish, thanks a lot for your work :)

TheChymera (Collaborator) commented Sep 29, 2020

@epsilon-0 the package has kindly been provided by @heroxbd and @aclex, with the current version in the overlay superseding the one in this PR.

The package works well, other than that it will not build with the cuda USE flag in any of the labs I am getting feedback from. Help would be very much appreciated on that front: https://bugs.gentoo.org/742617
