PyTorch and Torchvision packages #941
Conversation
A merge of the current state of the PyTorch, Torchvision and some other required package ebuilds, as in development at https://github.com/aclex/pytorch-ebuild, to join development efforts.
Hi, thank you so much for your PR.
Please refer to the Python project recommendations for improvements. https://wiki.gentoo.org/wiki/Project:Python
src_unpack() {
	unpack ${A}
	mv "${WORKDIR}/${PYTORCH_NAME}-${PV}" "${S}"
`mv` is an external command, so the call should be followed by `|| die`.
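Applied to the hunk under review, the fix is a one-token change (ebuild fragment; `die` is provided by Portage, so this is not standalone shell):

	src_unpack() {
		unpack ${A}
		# Abort the build with a clear message if the external mv fails.
		mv "${WORKDIR}/${PYTORCH_NAME}-${PV}" "${S}" || die "failed to rename source directory"
	}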
Thanks for pointing this out! Fixed.
src_install() {
	local doc_build_dir="${S}/docs"

	mkdir -p "${D}/usr/share/doc/${P}"
`|| die` is missing here as well.
Fixed.
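A more idiomatic alternative, sketched here as a suggestion rather than what the PR actually does, is to let a Portage helper create the directory: `dodir` creates the path under `${D}` and dies on failure by itself, avoiding the raw `mkdir -p ... || die` pattern entirely (ebuild fragment):

	src_install() {
		local doc_build_dir="${S}/docs"

		# dodir creates the directory relative to ${D} and aborts on failure.
		dodir "/usr/share/doc/${P}"
	}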
RDEPEND="
	>=dev-python/javalang-0.10.1
	dev-python/lxml
All the Python dependencies should have the same Python version USE flags. cf. https://wiki.gentoo.org/wiki/Project:Python/distutils-r1#PYTHON_USEDEP
The same applies to all others.
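Following the linked guideline, the dependency block above would gain the `PYTHON_USEDEP` suffix on each Python package, roughly like this (fragment; versions taken from the snippet under review):

	RDEPEND="
		>=dev-python/javalang-0.10.1[${PYTHON_USEDEP}]
		dev-python/lxml[${PYTHON_USEDEP}]
	"

This forces every Python dependency to be built for the same Python implementations as the package itself.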
I'd like to run some more thorough tests on this, as I seem to have removed it after it led to some failures, though I'm not sure why. I'll test it and sort it out.
Tested and fixed it as well in the affected ebuilds.
for file in ${multilib_failing_files[@]}; do
	mv -f "${D}/usr/lib/$file" "${D}/usr/lib64"
done
This works. But we'd better fix the build system.
Unfortunately, this needs some more experiments and tests, as we are somewhat going against upstream here.
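One possible way to avoid moving files after installation, sketched here as an assumption about what "fixing the build system" could mean rather than the approach the PR takes, is to pass the multilib-aware libdir to CMake via `$(get_libdir)` (from multilib.eclass) so artifacts land in the right place directly (ebuild fragment):

	src_configure() {
		local mycmakeargs=(
			# Install libraries into lib64 (or lib) as appropriate for the ABI,
			# instead of relocating them in src_install.
			-DCMAKE_INSTALL_LIBDIR="$(get_libdir)"
		)
		cmake_src_configure
	}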
rm -rfv "${D}/usr/lib64/cmake"

rm -rfv "${D}/usr/share/doc/mkldnn"
Patching the build system would be better and cleaner.
But we can improve it gradually if your time is limited.
Same applies to 1.4.0.
Agreed, but I haven't changed it yet.
if use python; then
	FORCE_CUDA=$(usex cuda 1 0) \
	CUDA_HOME=$(usex cuda ${CUDA_HOME} "") \
Where is `CUDA_HOME` set?
I've just set it following the expectations in setup.py, but it seems to require some more environment preparation, like here: https://github.com/pytorch/vision/blob/master/packaging/pkg_helpers.bash
I see.
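For readers unfamiliar with `usex` (used in the hunk above as `$(usex cuda 1 0)`): it echoes its second argument when the USE flag is enabled and its third otherwise. A plain-shell sketch of that behaviour; the `use` and `usex` definitions below are hypothetical stand-ins for the Portage helpers, with a `WITH_CUDA` variable standing in for the real USE-flag state:

```shell
# Hypothetical stand-ins for Portage's use/usex helpers.
use() { [ "$1" = "cuda" ] && [ "${WITH_CUDA:-0}" = "1" ]; }
usex() { if use "$1"; then echo "$2"; else echo "$3"; fi; }

WITH_CUDA=1
echo "FORCE_CUDA=$(usex cuda 1 0)"   # prints FORCE_CUDA=1
WITH_CUDA=0
echo "FORCE_CUDA=$(usex cuda 1 0)"   # prints FORCE_CUDA=0
```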
@heroxbd thank you very much for the hard work of reviewing all these ebuilds, Benda! Many thanks for the detailed comments and references! I'll try to fix them all and update the request as soon as possible.
Hi Alex, are you working on this? If you are busy, I can implement what I have commented.
@heroxbd Hi, Benda! Yes, unfortunately I didn't have a chance to work on it last weekend, so it's going slower than I expected. I hope to submit the fixes soon, maybe this evening, except for the things related to the build system (those artifacts which are moved or removed manually). As for them, I tried to do it the proper way, but they seem to be quite tricky both to implement and test, especially given that upstream strongly discourages any changes to the build configuration. Sorry again for the delay, I hope to return with the results soon.
No problem, we can work together on this. You can update what you have and I can continue from there to make a working version. Then we can keep improving it.
@heroxbd Benda, so I've pushed the most trivial portion of fixes for now, and I'll continue to look at the remaining ones, though it's going to take some more time. Please feel free to make any changes you like, no problem with it at all. Still, I don't really know how that can be done within the pull request, but I think I can share the request branch with you if needed.
I would also like to apologize for my commit messages, which, as I see, don't tally well with the history of the repository. I tried to follow the rules in recent messages, though it's not always easy when a commit affects several ebuilds at once. Anyway, feel free to correct me.
This is CPU version only. ROCm and CUDA options will be added later. Reference: https://github.com/aclex/pytorch-ebuild Bug: #941 Package-Manager: Portage-2.3.88, Repoman-2.3.18 Signed-off-by: Benda Xu <heroxbd@gentoo.org>
Hi, I have committed only the pytorch CPU version to the tree. We still need to add more features like ROCm and CUDA, besides torchvision, etc. Please rebase against the newest tree and send atomic pull requests. This one is too big to handle smoothly. And please use
OK, thank you. At first glance it looks quite deeply modified, so I'll try to look closer at it later.
Bug: #941 Package-Manager: Portage-2.3.88, Repoman-2.3.18 Signed-off-by: Benda Xu <heroxbd@gentoo.org>
Benda, I've looked through the commit you mentioned and, to be honest, I'm quite disappointed, as it's silently pretty much another version of what I've put in this request. The commit is written as if authored by me, but there are hardly a few lines in it that I actually wrote :) To me it looks somewhat contradictory to your suggestion of adding things atomically. Perhaps it would have made more sense if we had elaborated a joint version together somewhere externally, with all the little changes and fixes gradually accumulated from both versions. But OK, that's the way it is.

I'm actually quite confused about the next steps, as I don't know your plan: what you left out as wasteful or unnecessary, and what was just postponed for the future. As you wrote, ROCm, torchvision and other things were excluded for now, so apparently that's pretty much it. Anyway, I'll backport the fixes based on your suggestions, so please feel free to pick any other parts you consider worth merging; I'd be glad to help you with that.

My initial plan after your suggestion was to merge the whole tree here and make my branch read-only, but I think I just underestimated the amount of effort and pain involved in following all the packaging rules, given there's a lot of work to be done in the ebuild itself to keep it stable enough after each version bump, while my free time for this is very limited. So I think it would be best to continue it separately. I'm going to concentrate on ROCm support now, as there are major issues there. Again, I'm open to any interaction, to avoid doing the same work twice, etc. Anyway, thank you very much again for your help, suggestions and guidance! It was a useful experience for me.
Not really, the CPU version is based on yours. I am sorry if that confused you.
By atomically, I mean putting in the CPU, AMD and nVidia versions one at a time. And when it comes to AMD, land the ROCm ebuilds first, then an update to the pytorch ebuild, and finally the reverse dependencies of pytorch.
The science overlay itself is a working place for the Gentoo main tree. Ultimately, every ebuild that matures in the science overlay should land in Gentoo. So let's polish the joint version in the science overlay.
Please see the atomic comment above.
Thank you! I am interested in the ROCm ebuilds as well. If you could revise them according to the standards, I am more than happy to land them in the science overlay. I understand there is a steep learning curve for packaging. Requiring a quality standard does not seem to make much sense as long as it
You are welcome. I apologize again for not synchronizing the big picture of pytorch/ROCm with you promptly.
Thank you for the reply and clarification of your view, Benda!
This is such an attractive aim, especially in the case of PyTorch, as, to my knowledge, there's no version of it in the repositories of many other distributions. The main problem is that the recommended building process just ignores system-wide installation, the FHS, building separated from installation and all that stuff, and very clearly discourages anyone from trying these things. I don't think that's out of bad will, but rather because of a complex build configuration historically fused from several different packages. To make it system compliant we actually need to violate its rules, which usually leads to quite subtle problems. In that case, incremental steps are very useful and valuable, for the same reasons we do separate commits instead of one squashed one. For example, I carefully selected all the header files in the image, checking whether they are included in some Torch public headers or not, so as not to break the compilation of the C++ part and packages which depend on it (like, as I remember,
The ROCm-enabled build relies on the ebuilds in a separate repository: https://github.com/justxi/rocm/. As far as I know, packages from there are being accepted into the main Gentoo repository directly, but that's not a finished process; about half of the packages have not yet been moved. My PyTorch ebuild just mentions the dependencies from there and implies this repository has been registered beforehand, when
I surely agree with you on the importance of the rules, and I always try to respect and follow the rules I'm aware of. The trouble is just that they are quite spread out, and there are many moments when all checks pass and you think it's all good, while in fact many more requirements are violated. I mean, I just want to share my results with everyone, to save them from doing the same work, making it as easy to use as possible. I have no power to keep everyone happy with every single one of my commit messages :)
Indeed.
That's a very smart move to use the PyPI package to figure out which headers to keep. We are now using pybind11 from the system, not bundled as far as I remember.
Good to know of the ongoing ROCm efforts.
Sure, no problem at all. |
Alexey Chernov <notifications@github.com> writes:

> We are now using pybind11 from the system, not bundled as far as I remember.

Ah, indeed, I hadn't noticed it's available in the system. Then that's obviously the best option, and I am probably wrong here; hopefully it's going to work then.

It works as far as I can remember.

> That's a very smart move to use the PyPI package to figure out which headers to keep.

Unfortunately, it's not a reliable enough source, as e.g. for C++ development it's not even mentioned, with preference given to the official ZIP binary archive unpacked locally. Another problem is that PyPI packages have not always proved to be absolutely consistent: for example, the CPU version of Torchvision has shipped with box operations linked to a non-existent CUDA library, leading to a runtime linker failure, for several versions in a row already. Nobody seems to care yet :) So yes, I try to check all the cases known to me manually.

I can see it is a great endeavour to do so. Thank you!
@epsilon-0 the package has kindly been provided by @heroxbd and @aclex, with the current version in the overlay superseding the one in this PR. The package works well, other than it will not build with the
Here's a merge of the current state of the PyTorch, Torchvision and some other required package ebuilds, as in development at https://github.com/aclex/pytorch-ebuild. It was suggested that I make a pull request to merge it here, to join development efforts, so I've decided to try.
Could you please take a look at the ebuilds and consider merging them into your repository? Please let me know if something should be fixed for that. Thanks in advance!