
Example does not run due to missing cutlass lib #2

Closed
catid opened this issue Oct 3, 2022 · 4 comments

Comments

@catid

catid commented Oct 3, 2022

To reproduce the error, I started a fresh install with these commands, following the README guides:

```shell
cd python
python setup.py bdist_wheel
pip install dist/*.whl
cd ..
python3 examples/05_stable_diffusion/compile.py
```

```
ModuleNotFoundError: No module named 'cutlass_lib'
```
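For anyone hitting the same error, a quick way to check from the failing environment whether the module is visible to the interpreter (a small sketch; the `module_visible` helper is mine, not part of AITemplate):

```python
import importlib.util

def module_visible(name: str) -> bool:
    """Return True if `name` can be imported by the current interpreter."""
    return importlib.util.find_spec(name) is not None

# If this prints False, the installed wheel was likely built before the
# cutlass scripts were in place, and a rebuild plus a forced reinstall
# of the wheel is needed.
print("cutlass_lib importable:", module_visible("cutlass_lib"))
```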

@antinucleon
Contributor

antinucleon commented Oct 3, 2022 via email

@catid
Author

catid commented Oct 3, 2022

This fixed it for me. To repair my existing install, I also had to run `pip install dist/*.whl --force-reinstall`, so that's probably worth adding to the README as well if you want to make it more fool-proof.

@antinucleon
Contributor

antinucleon commented Oct 3, 2022 via email

@antinucleon
Contributor

#3

asroy added a commit to shaojiewang/AITemplate that referenced this issue Nov 10, 2022
* upgrade compiler to ROCM 5.3 version

* remove unnecessary build fixes

Co-authored-by: illsilin <Illia.Silin@amd.com>
tissue3 pushed a commit to tissue3/AITemplate-1 that referenced this issue Feb 7, 2023
Summary:
Pull Request resolved: fairinternal/AITemplate#1100

With this diff, the ops from the `conv` family are getting `float32` support. Namely:

- `conv2d`
- `conv2d_bias`
- `conv2d_bias_relu`
- `conv2d_bias_hardswish`
- `conv2d_bias_sigmoid`
- `conv2d_bias_add`
- `conv2d_bias_add_relu`
- `conv2d_bias_add_hardswish`
- `conv2d_bias_few_channels`
- `conv2d_bias_relu_few_channels`
- `conv2d_bias_hardswish_few_channels`
- `transposed_conv2d`
- `transposed_conv2d_bias`
- `transposed_conv2d_bias_relu`
- `depthwise_conv3d`

**A few points worth the reviewers' attention:**

**facebookincubator#1**. For the ops relying on the `cutlass` kernels, the assertion tolerance in the respective unit tests had to be increased from `1e-2` to `5e-2` to make the tests pass for the `float32` versions of the ops. If I've missed anything and the ops' output can be brought closer to that of `pytorch`, please let me know.
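For context, the widened tolerance corresponds to a comparison along these lines (a plain-Python sketch standing in for the actual torch-based test assertions; the values are illustrative):

```python
import math

# fp32 accumulation order can differ between cutlass kernels and the
# pytorch reference, so elementwise outputs may diverge by more than 1e-2.
reference = [1.000, 2.000, 3.000]
candidate = [1.030, 2.010, 2.980]

def allclose(xs, ys, atol):
    """Elementwise absolute-tolerance comparison, like torch.allclose with rtol=0."""
    return all(math.isclose(x, y, abs_tol=atol, rel_tol=0.0) for x, y in zip(xs, ys))

# The old tolerance (1e-2) rejects the first element (off by 0.03)...
assert not allclose(candidate, reference, atol=1e-2)
# ...while the widened tolerance (5e-2) accepts all three.
assert allclose(candidate, reference, atol=5e-2)
```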

**facebookincubator#2.** `cutlass`'s SIMT kernels had to be excluded from selection for the `conv2d_bias_add_*` and `conv2d_*_bias_few_channels` kernels. Otherwise, generated CUDA code for the ops runs into template instantiation errors during compilation. Disabling SIMT kernels was inspired by the existing code here:

https://www.internalfb.com/code/fbsource/[0f1fbb522f6ec10b23a6331da4adfdf2c9fe5908]/fbcode/aitemplate/AITemplate/python/aitemplate/backend/cuda/gemm_universal/common.py?lines=1072-1077
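The exclusion described above amounts to filtering the candidate kernel list by opcode class. A minimal illustrative sketch (the `OpcodeClass` and `Op` types here are stand-ins, not AITemplate's or cutlass's actual API):

```python
from enum import Enum

class OpcodeClass(Enum):
    SIMT = "simt"
    TENSOR_OP = "tensor_op"

class Op:
    def __init__(self, name, opcode_class):
        self.name = name
        self.opcode_class = opcode_class

candidates = [
    Op("simt_sgemm_like_kernel", OpcodeClass.SIMT),
    Op("tensorop_f32_kernel", OpcodeClass.TENSOR_OP),
]

# Keep only non-SIMT kernels, mirroring the exclusion for the
# conv2d_bias_add_* and conv2d_*_bias_few_channels ops.
selected = [op for op in candidates if op.opcode_class != OpcodeClass.SIMT]
```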

**facebookincubator#3.** There don't seem to be any kernels with `cutlass_lib.library.DataType.f32` inputs / outputs (`op.A.element`, `op.B.element`, etc.) in the `Target.current()._operators[Conv3d]` dict. As a result, even though the `conv3d` op's code is extended to support `fp32`, technically it doesn't work with `fp32` inputs, because the list of selected kernels returned from here ends up being empty (profiler fails first):

https://www.internalfb.com/code/fbsource/[D41423689-V1]/fbcode/aitemplate/AITemplate/python/aitemplate/backend/cuda/conv3d/common.py?lines=235
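A minimal sketch (with hypothetical names) of why the selected-kernel list comes back empty: filtering a registry of ops by element dtype yields nothing when no op was ever generated for that dtype.

```python
f16, f32 = "f16", "f32"

class KernelOp:
    def __init__(self, name, element):
        self.name = name
        self.element = element  # stand-in for op.A.element / op.B.element

# Suppose the conv3d registry only ever contains f16 kernels, as described:
registry = [KernelOp("conv3d_f16_a", f16), KernelOp("conv3d_f16_b", f16)]

# Selecting f32 kernels then returns an empty list, so the profiler
# has nothing to run and fails first.
selected = [op for op in registry if op.element == f32]
```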

My guess is that `conv3d`'s current limitation to `fp16` comes from the current content of the [`generator.py`](https://www.internalfb.com/code/fbsource/[dc7b8ee10f0c]/fbcode/aitemplate/AITemplate/fb/3rdparty/cutlass/tools/library/scripts/generator.py) in the `cutlass` library. Currently, `conv3d` operators are only created with the `fp16` arguments here:

https://www.internalfb.com/code/fbsource/[dc7b8ee10f0c31078f1e1a2fbd703c91441ccd2a]/fbcode/aitemplate/AITemplate/fb/3rdparty/cutlass/tools/library/scripts/generator.py?lines=1663%2C1668%2C1673%2C1722-1724

`conv2d` operators, on the other hand, are also created with `fp32` arguments:

https://www.internalfb.com/code/fbsource/[dc7b8ee10f0c]/fbcode/aitemplate/AITemplate/fb/3rdparty/cutlass/tools/library/scripts/generator.py?lines=2472%2C2505

Maybe inserting a `CreateConv3dOperator` call after line 2505 could add `fp32` versions of the `conv3d` op, too? Is this feasible? (A quick attempt at doing so ran into some `KeyError`s downstream in `emit_instance` calls on the created ops, so I guess it's not that trivial.)

The `fp32` test for `conv3d` is written but disabled for now by a `unittest.skip` with a message. Importantly, `depthwise_conv3d` *does* support `fp32` now: its code is hand-written, so it was possible to extend it to `fp32`.

**facebookincubator#4.** In `V1` the newly added `fp32` tests passed Sandcastle but failed Circle CI. Looking into similar diffs for gemm / bmm --- D41168398 (fairinternal/AITemplate@1549112) and D41246673 (fairinternal/AITemplate@e81b808) --- I noticed that the `fp32` tests added there were guarded against CUDA arch < 80. As the CUDA arch in Circle CI seems to be 75, this probably explains the failures there. So in `V2` I've added the same guard here, too.
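The guard amounts to a `skipIf` on the detected compute capability. A hedged sketch (the `detect_cuda_arch` helper is a stand-in for however the test suite actually queries the GPU):

```python
import unittest

def detect_cuda_arch():
    """Stand-in GPU query; returns the compute capability as an int."""
    return 75  # e.g. sm_75, as on the Circle CI runner described above

class ConvFp32Test(unittest.TestCase):
    # The decorator is evaluated at class-creation time, so on an
    # arch-75 machine this test is skipped rather than failed.
    @unittest.skipIf(detect_cuda_arch() < 80, "fp32 conv requires CUDA arch >= 80")
    def test_conv2d_fp32(self):
        ...  # would build and run the fp32 op here
```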

**facebookincubator#5.** As currently written, the alignment-based filtering of the `conv2d` and `conv3d` ops won't allow any `fp32` cutlass kernels when the number of channels is divisible by `8` (as the maximum possible `ab_alignment` for `fp32` is `4`). E.g., for `conv2d`:

https://www.internalfb.com/code/fbsource/[427a647ecb904df6e6b8556f524ebf1a7017e755]/fbcode/aitemplate/AITemplate/python/aitemplate/backend/cuda/conv2d/common.py?lines=217-226%2C246-254%2C229

Apparently, alignment-based filtering needs to become `dtype`-aware. To this end, the code above (and likewise for `conv3d`) has been refactored in terms of the following function from `utils.alignment`:

https://www.internalfb.com/code/fbsource/[bf9d94d11f61]/fbcode/aitemplate/AITemplate/python/aitemplate/utils/alignment.py?lines=39-48
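The dtype-aware cap can be sketched as follows, assuming (as cutlass's vectorized loads suggest) that the widest access is 16 bytes, so the maximum element alignment is `16 // sizeof(dtype)`; function names here are illustrative, not the actual `utils.alignment` API:

```python
DTYPE_SIZE = {"float16": 2, "float32": 4}

def max_alignment(dtype):
    # 16-byte (128-bit) vector access, measured in elements of `dtype`.
    return 16 // DTYPE_SIZE[dtype]

def valid_alignments(channels, dtype):
    # Alignments that divide the channel count, capped by the dtype max.
    return [a for a in (8, 4, 2, 1)
            if a <= max_alignment(dtype) and channels % a == 0]

# For channels divisible by 8, fp16 can use alignment 8, but fp32 tops
# out at 4 -- so an alignment-8-only filter would reject every fp32 kernel.
assert valid_alignments(64, "float16") == [8, 4, 2, 1]
assert valid_alignments(64, "float32") == [4, 2, 1]
```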

Reviewed By: chenyang78

Differential Revision: D41423689

fbshipit-source-id: 09c63e96238b3a9c6085b4bc3e4c0a49fde4b924
evshiron pushed a commit to are-we-gfx1100-yet/AITemplate that referenced this issue Jun 21, 2023
* updated to 5th stable diffusion checkpoint (facebookincubator#57)

* updated to 5th stable diffusion checkpoint

* updated all stable diffusion example files to checkpoint v1.5

* Support different sizes via recompilation (StableDiff demo) (facebookincubator#71)

Mostly, this commit is just re-establishing the relationship
between various previously-hardcoded constants and the target
image size (since the latent size is 1/8 of the image size,
hardcoding the latent sizes is inconvenient).

This adds `--width` and `--height` options to both compile.py
and demo.py, and provided these both match you can process
different sizes. For img2img mode, the size options passed at
compile time must match the size of the actual input image.
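The size relationship this commit relies on can be sketched in a few lines (a hedged illustration, not code from the repo; stable diffusion's VAE downsamples by a factor of 8, so latent dims are image dims divided by 8):

```python
def latent_size(width, height, factor=8):
    """Latent dims for a given image size; dims must be multiples of `factor`."""
    if width % factor or height % factor:
        raise ValueError("image dimensions must be multiples of 8")
    return width // factor, height // factor

# The default 512x512 image maps to a 64x64 latent, which is what the
# previously-hardcoded constants encoded.
assert latent_size(512, 512) == (64, 64)
```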

Consequently, the `--img2img` flag for `compile.py` no longer
exists: all this ever did was change the hardcoded size to
match the default input image used by `demo_img2img.py`. Yikes.

Sooo it's slightly more flexible than before, but still has no
support for a single binary to handle different image sizes. It
isn't super clear that compiling a generic binary is useful: the
upstream project can do that just fine: isn't the whole point
of AITemplates to achieve performance gains via aggressive
constant propagation and benchmarking to select the optimal
kernels?

* v0.1.1 (facebookincubator#74)

* v0.11

* update cutlass

* fix

* add missing files

* patch cutlass

Co-authored-by: Bing Xu <bingxu@fb.com>

* fix sm86 conv (facebookincubator#81)

Co-authored-by: Bing Xu <bingxu@fb.com>

* fix README.md of bert example (facebookincubator#82)

* Add negative prompts feature for txt2img pipeline (facebookincubator#75)

Add optional negative prompt option for txt2img pipeline

* add missing copyright headers (facebookincubator#86)

* Conv2d group (facebookincubator#73)

* group conv

* add conv_groups op compiler

* Conv2d groups

* Conv2d depthwise

* wip

* wip

* wip

* wip

* only one ops to get feedback

* only one ops to get feedback

* Fix layout, now test passes

* Fix docstring

* Add conv2d_depthwise_bias and test

* Add conv2d_depthwise_bias and test and frontends

* doc

* frontend import depthwise

* Fix lint

* Fix lint

* Fix after rebase UTs pass

* fix lint

* fix more lint

* add more tile size for GN + update CK to main  (facebookincubator#40) (facebookincubator#3)

* add more tile size for gn

* update ck

Co-authored-by: Terry Chen <terrychen@meta.com>

Co-authored-by: Terry Chen <hahakuku@hotmail.com>
Co-authored-by: Terry Chen <terrychen@meta.com>

* Ck remove unnecessary compile include directories (facebookincubator#4)

* remove unnecessary include directory while compiling ck code

* refactor data_type.hpp under ck/utility/data_type.hpp

* Update docker to ROCm5.3 (facebookincubator#2)

* upgrade compiler to ROCM 5.3 version

* remove unnecessary build fixes

Co-authored-by: illsilin <Illia.Silin@amd.com>

* Fix BERT benchmark for 2 gcd (facebookincubator#6)

* fixed batch_size > 1

* load so file for benchmark

* Ci setup (facebookincubator#11)

* add script for ci and testing

* fix syntax

* fix syntax again

* get rid of the drun alias

* get rid of interactive flag for docker

* fix syntax

* run docker without sudo

* run some sanity checks before docker

* change the run directive

* fix syntax

* merge build and test steps into one

* fix the path to examples

* add pytorch

* fix syntax

* install timm module

* set paths in the docker

* change the version of the pytorch

* try running bert and vit models

* add modules for bert

* test if examples work with FB repo

* try building the docker from the ait source

* try building the docker from the rocm/ait repo

* get rid of unnecessary changing paths

* try running examples 1 and 4

* update docker arguments

* fix syntax

* try skipping the rebuilding steps

* try using the same commits as Jing

* check the pytorch version

* force replacing pytorch

* update the examples

* remove the foreground commands

* skip the BERT tests while using mi100

* clean up and add logfiles

* archive the logfiles

* fix path to log files, refine steps

* fix paths

* fix path to logfiles

* specify exact paths to logs

* fix syntax

* fix syntax

* get rid of workspace path in artifact paths

* write log headers and archive them in one step

* set git branch name as global env var

* fix syntax

* set the branch name value in each necessary step

* test posting test results to db

* add missing python packages

* do not install glob module

* do not convert dbsshport to int type

* check the port value

* hardcode ssh port

* try re-running with new action secrets

* skip the ssh tunnel

* apply changes to all branches and use tunnel if not running on db host

* change the syntax to check hostname

* fix syntax

* move the python script for processing the results

* only run ci for the push branch

* add BERT tests

* modify the script to parse and store BERT test results

* post-merge fix of pr 6 (facebookincubator#13)

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

* Add stable diffusion benchmark to the CI. (facebookincubator#16)

* add compilation of stable diffusion

* add missing python modules and new demos

* add accelerate module and fix the parsing script

* only use batch size 1 for stable diffusion

* add stable diffusion benchmark result to the table

* sync upstream v0.1.1 (facebookincubator#15)

* updated to 5th stable diffusion checkpoint (facebookincubator#57)

* updated to 5th stable diffusion checkpoint

* updated all stable diffusion example files to checkpoint v1.5

* Support different sizes via recompilation (StableDiff demo) (facebookincubator#71)

Mostly, this commit is just re-establishing the relationship
between various previously-hardcoded constants and the target
image size (since the latent size is 1/8 of the image size,
hardcoding the latent sizes is inconvenient).

This adds `--width` and `--height` options to both compile.py
and demo.py, and provided these both match you can process
different sizes. For img2img mode, the size options passed at
compile time must match the size of the actual input image.

Consequently, the `--img2img` flag for `compile.py` no longer
exists: all this ever did was change the hardcoded size to
match the default input image used by `demo_img2img.py`. Yikes.

Sooo it's slightly more flexible than before, but still has no
support for a single binary to handle different image sizes. It
isn't super clear that compiling a generic binary is useful: the
upstream project can do that just fine: isn't the whole point
of AITemplates to achieve performance gains via aggressive
constant propagation and benchmarking to select the optimal
kernels?

* v0.1.1 (facebookincubator#74)

* v0.11

* update cutlass

* fix

* add missing files

* patch cutlass

Co-authored-by: Bing Xu <bingxu@fb.com>

* fix profile

* fix profile bugs

* update ck commit

* fix format

* fix format

* update timeout

* add rocm unittest case

Co-authored-by: Ivan Mikhnenkov <39604625+ivanmikhnenkov@users.noreply.github.com>
Co-authored-by: Chris Kitching <chriskitching@linux.com>
Co-authored-by: Bing Xu <antinucleon@gmail.com>
Co-authored-by: Bing Xu <bingxu@fb.com>

* merge amd-develop

Co-authored-by: Ivan Mikhnenkov <39604625+ivanmikhnenkov@users.noreply.github.com>
Co-authored-by: Chris Kitching <chriskitching@linux.com>
Co-authored-by: Bing Xu <antinucleon@gmail.com>
Co-authored-by: Bing Xu <bingxu@fb.com>
Co-authored-by: Zhang Jun <ewalker@live.cn>
Co-authored-by: Bozhao <yubz86@gmail.com>
Co-authored-by: Max Podkorytov <maxdp@meta.com>
Co-authored-by: Ehsan Azar <dashesy@gmail.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: Terry Chen <hahakuku@hotmail.com>
Co-authored-by: Terry Chen <terrychen@meta.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>