
PyTorch and Torchvision packages #941

Closed · wants to merge 13 commits

Conversation

aclex (Contributor) commented Feb 10, 2020

Here's a merge of the current state of the PyTorch, Torchvision and some other required packages' ebuilds, as developed at https://github.com/aclex/pytorch-ebuild. It was suggested that I open a pull request to merge them here and join development efforts, so I've decided to try.

Could you please take a look at the ebuilds and consider merging them into your repository? Please let me know if anything should be fixed for that. Thanks in advance!

A merge of the current state of the PyTorch, Torchvision and
some other required packages' ebuilds, as developed at
https://github.com/aclex/pytorch-ebuild, to join development
efforts.
heroxbd (Contributor) left a comment:

Hi, thank you so much for your PR.
Please refer to the Python project recommendations for improvements: https://wiki.gentoo.org/wiki/Project:Python

app-doc/pytorch-docs/pytorch-docs-1.3.1.ebuild (outdated)

src_unpack() {
	unpack ${A}
	mv "${WORKDIR}/${PYTORCH_NAME}-${PV}" "${S}"
heroxbd (Contributor):

mv is an external command; it should be followed by || die.

aclex (Contributor Author):

Thanks for pointing this out! Fixed.
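
For illustration, the corrected hunk would presumably look something like this (a sketch, not the exact committed code):

src_unpack() {
	unpack ${A}
	# mv is an external command, so abort explicitly if it fails
	mv "${WORKDIR}/${PYTORCH_NAME}-${PV}" "${S}" || die "failed to rename the unpacked source directory"
}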

app-doc/pytorch-docs/pytorch-docs-1.3.1.ebuild (outdated)
src_install() {
	local doc_build_dir="${S}/docs"

	mkdir -p "${D}/usr/share/doc/${P}"
heroxbd (Contributor):

|| die

aclex (Contributor Author):

Fixed.
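
A sketch of the same step using the standard install helpers instead, which already die on failure themselves (the docs/build/html path is an assumption about where the generated HTML ends up):

src_install() {
	local doc_build_dir="${S}/docs"

	# docinto/dodoc abort on failure by themselves, so no explicit || die is needed
	docinto html
	dodoc -r "${doc_build_dir}/build/html/."
}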

app-doc/pytorch-docs/pytorch-docs-1.3.1.ebuild (outdated)

RDEPEND="
>=dev-python/javalang-0.10.1
dev-python/lxml
heroxbd (Contributor):

All the Python dependencies should have the same Python version USE flags. cf. https://wiki.gentoo.org/wiki/Project:Python/distutils-r1#PYTHON_USEDEP

The same applies to all others.
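
Roughly what the linked wiki page recommends, applied to the quoted hunk (a sketch; the real dependency list is longer):

# ${PYTHON_USEDEP} forwards this ebuild's enabled Python targets to each Python dependency
RDEPEND="
	>=dev-python/javalang-0.10.1[${PYTHON_USEDEP}]
	dev-python/lxml[${PYTHON_USEDEP}]
"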

aclex (Contributor Author):

I'd like to run some more thorough tests on this; as I recall, I removed it after it led to some failures, though I'm not sure why. I'll test it and sort it out.

aclex (Contributor Author):

Tested and fixed it as well in the affected ebuilds.

sci-libs/pytorch/pytorch-1.3.1.ebuild (outdated)

for file in ${multilib_failing_files[@]}; do
	mv -f "${D}/usr/lib/$file" "${D}/usr/lib64"
done
heroxbd (Contributor):

This works. But we'd better fix the build system.

aclex (Contributor Author):

Unfortunately, this needs some more experiments and tests, as we're somewhat going against upstream here.
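
One way the build system itself could be steered instead of moving files afterwards (a sketch, assuming the ebuild drives the build through cmake-utils and that the package's CMake honours the standard CMAKE_INSTALL_LIBDIR variable):

src_configure() {
	local mycmakeargs=(
		# install libraries straight into the multilib-correct directory
		-DCMAKE_INSTALL_LIBDIR="$(get_libdir)"
	)
	cmake-utils_src_configure
}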

rm -rfv "${D}/usr/lib64/cmake"

rm -rfv "${D}/usr/share/doc/mkldnn"

heroxbd (Contributor):

Patching the build system would be better and cleaner.
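
The usual mechanism for carrying such a build-system fix, should someone write the patch (the file name here is hypothetical):

# applied automatically by the default src_prepare in EAPI 6/7
PATCHES=(
	# hypothetical patch stopping the build system from installing
	# bundled mkldnn docs and stray CMake files
	"${FILESDIR}/${P}-no-bundled-doc-install.patch"
)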

heroxbd (Contributor):

But we can improve it gradually if your time is limited.

heroxbd (Contributor):

Same applies to 1.4.0.

aclex (Contributor Author):

Agreed, but not changed yet.


if use python; then
	FORCE_CUDA=$(usex cuda 1 0) \
	CUDA_HOME=$(usex cuda ${CUDA_HOME} "") \
heroxbd (Contributor):

Where is CUDA_HOME set?

aclex (Contributor Author):

I've just put it there following what setup.py expects, but it seems to require some more environment preparation, as done here: https://github.com/pytorch/vision/blob/master/packaging/pkg_helpers.bash

heroxbd (Contributor):

I see.
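
For reference, on Gentoo the CUDA toolkit (dev-util/nvidia-cuda-toolkit) installs under /opt/cuda, so one hedged way to provide CUDA_HOME without relying on the caller's environment could be the following (the trailing phase call is only illustrative of where the variables would be consumed):

if use python; then
	# FORCE_CUDA and CUDA_HOME are read by torchvision's setup.py;
	# /opt/cuda is where dev-util/nvidia-cuda-toolkit installs on Gentoo
	FORCE_CUDA=$(usex cuda 1 0) \
	CUDA_HOME=$(usex cuda /opt/cuda "") \
	distutils-r1_src_compile
fi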

aclex (Contributor Author) commented Feb 14, 2020

@heroxbd Thank you very much for the hard work of reviewing all these ebuilds, Benda! Many thanks for the detailed comments and references! I'll try to fix them all and update the request as soon as possible.

heroxbd (Contributor) commented Feb 18, 2020

> @heroxbd Thank you very much for the hard work of reviewing all these ebuilds, Benda! Many thanks for the detailed comments and references! I'll try to fix them all and update the request as soon as possible.

Hi Alex, are you working on this? If you are busy, I can implement what I have commented.

aclex (Contributor Author) commented Feb 18, 2020

@heroxbd Hi, Benda! Yes, unfortunately I didn't have a chance to work on it last weekend, so it's going slower than I expected. I hope to submit the fixes soon, maybe this evening, except for the things related to the build system (those artifacts which are moved or removed manually). As for those, I tried to do it the proper way, but they seem to be quite tricky both to implement and test, especially given that upstream strongly discourages any changes to the build configuration.

Sorry again for the delay; I hope to return with the results soon.

heroxbd (Contributor) commented Feb 18, 2020

> @heroxbd Hi, Benda! Yes, unfortunately I didn't have a chance to work on it last weekend, so it's going slower than I expected. I hope to submit the fixes soon, maybe this evening, except for the things related to the build system (those artifacts which are moved or removed manually). As for those, I tried to do it the proper way, but they seem to be quite tricky both to implement and test, especially given that upstream strongly discourages any changes to the build configuration.

> Sorry again for the delay; I hope to return with the results soon.

No problem, we can work together on this. You can update what you have and I can continue from there to make a working version. Then we can keep improving it.

aclex (Contributor Author) commented Feb 18, 2020

@heroxbd Benda, I've pushed the most trivial portion of the fixes for now and will continue to look at the remaining ones, though it's going to take some more time. Please feel free to make any changes you like, no problem at all. I don't really know how that can be done within the pull request, but I think I can share the request branch with you if needed.

aclex (Contributor Author) commented Feb 18, 2020

I would also like to apologize for my commit messages, which, as I see, don't tally well with the history of the repository. I tried to follow the rules in recent messages, though it's not always easy when the commit affects several ebuilds at once. Anyway, feel free to correct me.

gentoo-bot pushed a commit that referenced this pull request Feb 19, 2020
  This is CPU version only. ROCm and CUDA options will be added later.

Reference: https://github.com/aclex/pytorch-ebuild
Bug: #941
Package-Manager: Portage-2.3.88, Repoman-2.3.18
Signed-off-by: Benda Xu <heroxbd@gentoo.org>
heroxbd (Contributor) commented Feb 19, 2020

Hi, I have committed only the pytorch CPU version to the tree. We still need to add more features like ROCm and CUDA, besides torchvision, etc. Please rebase against the newest tree and send atomic pull requests.

This one is too big to handle smoothly. And please use repoman for the ebuild commits.
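
For reference, the usual repoman workflow at the time was something along these lines (the commit message is only an example):

# run from the package directory before committing
repoman full
repoman commit -m "sci-libs/pytorch: version bump to 1.4.0"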

aclex (Contributor Author) commented Feb 19, 2020

OK, thank you. At first glance it looks quite deeply modified, so I'll try to take a closer look at it later.

gentoo-bot pushed a commit that referenced this pull request Feb 20, 2020
Bug: #941
Package-Manager: Portage-2.3.88, Repoman-2.3.18
Signed-off-by: Benda Xu <heroxbd@gentoo.org>
aclex (Contributor Author) commented Feb 21, 2020

Benda, I've looked through the commit you mentioned and, to be honest, I'm quite disappointed, as it's silently pretty much another version of what I put in this request. The commit is written as if authored by me, but there are hardly a few lines in it I actually wrote :) To me it looks somewhat contradictory to your suggestion of adding things atomically. Perhaps it would have made more sense to elaborate a joint version together somewhere externally, with all the little changes and fixes gradually accumulated from both versions. But OK, that's the way it is.

I'm actually quite confused about the next steps here, as I don't know your plan: what you left out as wasteful or unnecessary, and what was just postponed for the future. As you wrote, ROCm, torchvision and other things were excluded for now, so apparently that's pretty much it. Anyway, I'll backport the fixes from your suggestions, so please feel free to pick any other parts you consider worth merging; I'd be glad to help you with that.

My initial plan after your suggestion was to merge the whole tree here and make my branch read-only, but I think I just underestimated the amount of effort and pain involved in following all the packaging rules, given there's a lot of work to be done in the ebuild itself to keep it stable enough after each version bump, while my free time for this is very limited. So I think the best way is to continue it separately. I'm going to concentrate on ROCm support now, as there are major issues there. Again, I'm open to any interaction, so we don't do the same work twice, etc.

Anyway, thank you very much again for your help, suggestions and guidance! It was a useful experience for me.

heroxbd (Contributor) commented Feb 22, 2020

> Benda, I've looked through the commit you mentioned and, to be honest, I'm quite disappointed, as it's silently pretty much another version of what I put in this request. The commit is written as if authored by me, but there are hardly a few lines in it I actually wrote :)

Not really: the CPU version is based on your ebuild, with the exception of the ROCm and SRC_URI parts. The biggest point of your contribution is the cmake-driven build and related patches, which is buried underneath setup.py. I put you as the author on purpose, because you have more code in it than me.

I am sorry if that confused you.

> To me it looks somewhat contradictory to your suggestion of adding things atomically.

By atomically, I mean putting in the CPU, AMD and nVidia versions one at a time. When it comes to AMD, land the ROCm ebuilds first, then an update to the pytorch ebuild, and finally the reverse dependencies of pytorch.

> Perhaps it would have made more sense to elaborate a joint version together somewhere externally, with all the little changes and fixes gradually accumulated from both versions. But OK, that's the way it is.

The science overlay itself is a working place for the Gentoo main tree. Ultimately every ebuild that matures in the science overlay should land in Gentoo. So let's polish the joint version in the science overlay.

> I'm actually quite confused about the next steps here, as I don't know your plan: what you left out as wasteful or unnecessary, and what was just postponed for the future. As you wrote, ROCm, torchvision and other things were excluded for now, so apparently that's pretty much it. Anyway, I'll backport the fixes from your suggestions, so please feel free to pick any other parts you consider worth merging; I'd be glad to help you with that.

Please see the atomic comment above.

> My initial plan after your suggestion was to merge the whole tree here and make my branch read-only, but I think I just underestimated the amount of effort and pain involved in following all the packaging rules, given there's a lot of work to be done in the ebuild itself to keep it stable enough after each version bump, while my free time for this is very limited. So I think the best way is to continue it separately. I'm going to concentrate on ROCm support now, as there are major issues there. Again, I'm open to any interaction, so we don't do the same work twice, etc.

Thank you! I am interested in the ROCm ebuilds as well. If you could revise them according to the standards, I am more than happy to land them in the science overlay.

I understand there is a steep learning curve for packaging. Requiring a quality standard may not seem to make much sense as long as things just work, but the quality standards are accumulated wisdom and proven best practice from history. They will pay off in the long run.

> Anyway, thank you very much again for your help, suggestions and guidance! It was a useful experience for me.

You are welcome. I apologize again for not synchronizing the big picture of pytorch/ROCm with you promptly.

aclex (Contributor Author) commented Feb 22, 2020

Thank you for the reply and clarification of your view, Benda!

> The science overlay itself is a working place for the Gentoo main tree. Ultimately every ebuild that matures in the science overlay should land in Gentoo. So let's polish the joint version in the science overlay.

This is a very attractive aim, especially in the case of PyTorch, as, to my knowledge, it isn't packaged in the repositories of many other distributions.

The main problem is that the recommended build process simply ignores system-wide installation, the FHS, separating the build from the installation and all that, and very clearly discourages anyone from trying to do these things. I don't think that's out of bad will, but rather because of a complex build configuration historically fused together from several different packages. To make it system compliant we actually have to violate its rules, which usually leads to quite subtle problems. In that case, incremental steps are very useful and valuable, for the same reasons we do separate commits instead of one squashed one. For example, I carefully selected all the header files in the installation image, checking whether or not they are included from some Torch public headers, so as not to break compilation of the C++ part and of packages which depend on it (like, as I remember, torchvision). As I can see, at least the pybind11 headers are commented out from installation in the submitted version, so it's unclear whether it works in those scenarios at all. But it's already there. Such things could easily be avoided if we went over all the proposals together, discussing each one separately. Two heads are better than one, as they say :)

> …when it comes to AMD, land the ROCm ebuilds first, then an update to the pytorch ebuild, and finally the reverse dependencies of pytorch.

The ROCm-enabled build relies on the ebuilds in a separate repository: https://github.com/justxi/rocm/. As far as I know, packages from there are being accepted into the main Gentoo repository directly, but that process isn't finished; about half of the packages have not yet been moved. My PyTorch ebuild just references the dependencies from there and assumes that repository has been registered beforehand when the rocm USE flag is enabled. Effectively this means it depends on packages that don't exist in the tree, so if that isn't acceptable for the sci overlay, the ROCm-enabled configuration would have to wait until all the ROCm packages are added to the main Gentoo repository.
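
For anyone wanting to try the ROCm-enabled configuration, that external overlay would have to be registered first; one way to do it (the local repository name "rocm" chosen here is arbitrary):

# register and sync the unofficial ROCm overlay before enabling USE=rocm
emerge --ask app-eselect/eselect-repository
eselect repository add rocm git https://github.com/justxi/rocm.git
emerge --sync rocm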

> I understand there is a steep learning curve for packaging. Requiring a quality standard may not seem to make much sense as long as things just work, but the quality standards are accumulated wisdom and proven best practice from history. They will pay off in the long run.

I certainly agree with you on the importance of the rules and always try to respect and follow the ones I'm aware of. The unfortunate part is just that they are quite spread out, and there are plenty of moments when all the checks pass and you think it's all good, while many more requirements are still being violated. I mean, I just want to share my results with everyone, to save them from doing the same work, and make them as easy to use as possible. I don't have the power to keep everyone happy with every single one of my commit messages :)

heroxbd (Contributor) commented Feb 23, 2020

> Thank you for the reply and clarification of your view, Benda!

>> The science overlay itself is a working place for the Gentoo main tree. Ultimately every ebuild that matures in the science overlay should land in Gentoo. So let's polish the joint version in the science overlay.

> This is a very attractive aim, especially in the case of PyTorch, as, to my knowledge, it isn't packaged in the repositories of many other distributions.

Indeed.

> The main problem is that the recommended build process simply ignores system-wide installation, the FHS, separating the build from the installation and all that, and very clearly discourages anyone from trying to do these things. I don't think that's out of bad will, but rather because of a complex build configuration historically fused together from several different packages. To make it system compliant we actually have to violate its rules, which usually leads to quite subtle problems. In that case, incremental steps are very useful and valuable, for the same reasons we do separate commits instead of one squashed one. For example, I carefully selected all the header files in the installation image, checking whether or not they are included from some Torch public headers, so as not to break compilation of the C++ part and of packages which depend on it (like, as I remember, torchvision). As I can see, at least the pybind11 headers are commented out from installation in the submitted version, so it's unclear whether it works in those scenarios at all. But it's already there. Such things could easily be avoided if we went over all the proposals together, discussing each one separately. Two heads are better than one, as they say :)

That's a very smart move to use the PyPI package to figure out which headers to keep.

We are now using pybind11 from the system, not the bundled copy, as far as I remember.
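
A minimal sketch of how the system copy could be wired in on the ebuild side (the -DUSE_SYSTEM_PYBIND11 switch is an assumption about the upstream CMake options and should be verified for the packaged version):

# build-time dependency on the system pybind11
DEPEND="dev-python/pybind11[${PYTHON_USEDEP}]"

src_configure() {
	local mycmakeargs=(
		# hypothetical upstream switch; check the exact option name for this release
		-DUSE_SYSTEM_PYBIND11=ON
	)
	cmake-utils_src_configure
}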

>> …when it comes to AMD, land the ROCm ebuilds first, then an update to the pytorch ebuild, and finally the reverse dependencies of pytorch.

> The ROCm-enabled build relies on the ebuilds in a separate repository: https://github.com/justxi/rocm/. As far as I know, packages from there are being accepted into the main Gentoo repository directly, but that process isn't finished; about half of the packages have not yet been moved. My PyTorch ebuild just references the dependencies from there and assumes that repository has been registered beforehand when the rocm USE flag is enabled. Effectively this means it depends on packages that don't exist in the tree, so if that isn't acceptable for the sci overlay, the ROCm-enabled configuration would have to wait until all the ROCm packages are added to the main Gentoo repository.

Good to know of the ongoing ROCm efforts.

>> I understand there is a steep learning curve for packaging. Requiring a quality standard may not seem to make much sense as long as things just work, but the quality standards are accumulated wisdom and proven best practice from history. They will pay off in the long run.

> I certainly agree with you on the importance of the rules and always try to respect and follow the ones I'm aware of. The unfortunate part is just that they are quite spread out, and there are plenty of moments when all the checks pass and you think it's all good, while many more requirements are still being violated. I mean, I just want to share my results with everyone, to save them from doing the same work, and make them as easy to use as possible. I don't have the power to keep everyone happy with every single one of my commit messages :)

Sure, no problem at all.

aclex (Contributor Author) commented Feb 24, 2020

> We are now using pybind11 from the system, not the bundled copy, as far as I remember.

Ah, indeed, I hadn't noticed it's available in the system. Then that's obviously the best option, and I'm probably wrong here; hopefully it's going to work then.

> That's a very smart move to use the PyPI package to figure out which headers to keep.

Unfortunately, it's not a reliable enough source: for C++ development, for example, it isn't even mentioned, the preference being the official ZIP binary archive unpacked locally. Another problem is that the PyPI packages don't always prove to be absolutely consistent; for example, the CPU version of Torchvision has shipped its box operations linked against a non-existent CUDA library, leading to a runtime linker failure, for several versions in a row already. Nobody seems to care yet :) So yes, I try to check all such cases known to me manually.

heroxbd (Contributor) commented Feb 25, 2020 via email

epsilon-0 (Contributor):
@heroxbd @aclex, is this still being worked on?
I wanted to get pytorch from the central system in my cluster as opposed to letting my lab users install it manually for themselves, mostly for performance and consistency reasons.
I'm hoping this can be added soonish, thanks a lot for your work :)

TheChymera (Collaborator) commented Sep 29, 2020

@epsilon-0 the package has kindly been provided by @heroxbd and @aclex, with the current version in the overlay superseding the one in this PR.

The package works well, other than that it will not build with the cuda USE flag in any of the labs I am getting feedback from. Help would be very much appreciated on that front: https://bugs.gentoo.org/742617
