
Example does not run due to missing cutlass lib #2

Closed
catid opened this issue Oct 3, 2022 · 4 comments

Comments

@catid

catid commented Oct 3, 2022

To reproduce the error, I started a fresh install with these commands, following the README guides:

```shell
cd python
python setup.py bdist_wheel
pip install dist/*.whl
cd ..
python3 examples/05_stable_diffusion/compile.py
```

```
ModuleNotFoundError: No module named 'cutlass_lib'
```
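For anyone hitting the same error, a quick way to check from the failing environment whether the module is visible to the interpreter (a small sketch; the `module_visible` helper is mine, not part of AITemplate):

```python
import importlib.util

def module_visible(name: str) -> bool:
    """Return True if `name` can be imported by the current interpreter."""
    return importlib.util.find_spec(name) is not None

# If this prints False, the installed wheel was likely built before the
# cutlass scripts were in place, and a rebuild plus a forced reinstall
# of the wheel is needed.
print("cutlass_lib importable:", module_visible("cutlass_lib"))
```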

@antinucleon
Contributor

antinucleon commented Oct 3, 2022 via email

@catid
Author

catid commented Oct 3, 2022

This fixed it for me. To repair my existing install, I also had to run `pip install dist/*.whl --force-reinstall`, so that's probably worth adding to the README as well if you want to make it more fool-proof.

@antinucleon
Contributor

antinucleon commented Oct 3, 2022 via email

@antinucleon
Contributor

#3

asroy added a commit to shaojiewang/AITemplate that referenced this issue Nov 10, 2022
* upgrade compiler to ROCM 5.3 version

* remove unnecessary build fixes

Co-authored-by: illsilin <Illia.Silin@amd.com>
tissue3 pushed a commit to tissue3/AITemplate-1 that referenced this issue Feb 7, 2023
Summary:
Pull Request resolved: fairinternal/AITemplate#1100

With this diff, the ops from the `conv` family are getting `float32` support. Namely:

- `conv2d`
- `conv2d_bias`
- `conv2d_bias_relu`
- `conv2d_bias_hardswish`
- `conv2d_bias_sigmoid`
- `conv2d_bias_add`
- `conv2d_bias_add_relu`
- `conv2d_bias_add_hardswish`
- `conv2d_bias_few_channels`
- `conv2d_bias_relu_few_channels`
- `conv2d_bias_hardswish_few_channels`
- `transposed_conv2d`
- `transposed_conv2d_bias`
- `transposed_conv2d_bias_relu`
- `depthwise_conv3d`

**A few points worth the reviewers' attention:**

**facebookincubator#1**. For the ops relying on the `cutlass` kernels, the assertion tolerance in the respective unit tests had to be increased from `1e-2` to `5e-2` to make the tests pass for the `float32` versions of the ops. If I've missed anything and the ops' output can be brought closer to that of `pytorch`, please let me know.
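For context, the widened tolerance corresponds to a comparison along these lines (a plain-Python sketch standing in for the actual torch-based test assertions; the values are illustrative):

```python
import math

# fp32 accumulation order can differ between cutlass kernels and the
# pytorch reference, so elementwise outputs may diverge by more than 1e-2.
reference = [1.000, 2.000, 3.000]
candidate = [1.030, 2.010, 2.980]

def allclose(xs, ys, atol):
    """Elementwise absolute-tolerance comparison, like torch.allclose with rtol=0."""
    return all(math.isclose(x, y, abs_tol=atol, rel_tol=0.0) for x, y in zip(xs, ys))

# The old tolerance (1e-2) rejects the first element (off by 0.03)...
assert not allclose(candidate, reference, atol=1e-2)
# ...while the widened tolerance (5e-2) accepts all three.
assert allclose(candidate, reference, atol=5e-2)
```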

**facebookincubator#2.** `cutlass`'s SIMT kernels had to be excluded from selection for the `conv2d_bias_add_*` and `conv2d_*_bias_few_channels` kernels. Otherwise, generated CUDA code for the ops runs into template instantiation errors during compilation. Disabling SIMT kernels was inspired by the existing code here:

https://www.internalfb.com/code/fbsource/[0f1fbb522f6ec10b23a6331da4adfdf2c9fe5908]/fbcode/aitemplate/AITemplate/python/aitemplate/backend/cuda/gemm_universal/common.py?lines=1072-1077
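The exclusion described above amounts to filtering the candidate kernel list by opcode class. A minimal illustrative sketch (the `OpcodeClass` and `Op` types here are stand-ins, not AITemplate's or cutlass's actual API):

```python
from enum import Enum

class OpcodeClass(Enum):
    SIMT = "simt"
    TENSOR_OP = "tensor_op"

class Op:
    def __init__(self, name, opcode_class):
        self.name = name
        self.opcode_class = opcode_class

candidates = [
    Op("simt_sgemm_like_kernel", OpcodeClass.SIMT),
    Op("tensorop_f32_kernel", OpcodeClass.TENSOR_OP),
]

# Keep only non-SIMT kernels, mirroring the exclusion for the
# conv2d_bias_add_* and conv2d_*_bias_few_channels ops.
selected = [op for op in candidates if op.opcode_class != OpcodeClass.SIMT]
```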

**facebookincubator#3.** There don't seem to be any kernels with `cutlass_lib.library.DataType.f32` inputs / outputs (`op.A.element`, `op.B.element`, etc.) in the `Target.current()._operators[Conv3d]` dict. As a result, even though the `conv3d` op's code is extended to support `fp32`, technically it doesn't work with `fp32` inputs, because the list of selected kernels returned from here ends up being empty (profiler fails first):

https://www.internalfb.com/code/fbsource/[D41423689-V1]/fbcode/aitemplate/AITemplate/python/aitemplate/backend/cuda/conv3d/common.py?lines=235
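A minimal sketch (with hypothetical names) of why the selected-kernel list comes back empty: filtering a registry of ops by element dtype yields nothing when no op was ever generated for that dtype.

```python
f16, f32 = "f16", "f32"

class KernelOp:
    def __init__(self, name, element):
        self.name = name
        self.element = element  # stand-in for op.A.element / op.B.element

# Suppose the conv3d registry only ever contains f16 kernels, as described:
registry = [KernelOp("conv3d_f16_a", f16), KernelOp("conv3d_f16_b", f16)]

# Selecting f32 kernels then returns an empty list, so the profiler
# has nothing to run and fails first.
selected = [op for op in registry if op.element == f32]
```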

My guess is that `conv3d`'s current limitation to `fp16` comes from the current content of the [`generator.py`](https://www.internalfb.com/code/fbsource/[dc7b8ee10f0c]/fbcode/aitemplate/AITemplate/fb/3rdparty/cutlass/tools/library/scripts/generator.py) in the `cutlass` library. Currently, `conv3d` operators are only created with the `fp16` arguments here:

https://www.internalfb.com/code/fbsource/[dc7b8ee10f0c31078f1e1a2fbd703c91441ccd2a]/fbcode/aitemplate/AITemplate/fb/3rdparty/cutlass/tools/library/scripts/generator.py?lines=1663%2C1668%2C1673%2C1722-1724

`conv2d` operators, on the other hand, are also created with `fp32` arguments:

https://www.internalfb.com/code/fbsource/[dc7b8ee10f0c]/fbcode/aitemplate/AITemplate/fb/3rdparty/cutlass/tools/library/scripts/generator.py?lines=2472%2C2505

Maybe inserting a `CreateConv3dOperator` call after line 2505 could add `fp32` versions of the `conv3d` op, too? Is this feasible? (A quick attempt at doing so ran into some `KeyError`s downstream in `emit_instance` calls on the created ops, so I guess it's not that trivial.)

The `fp32` test for `conv3d` is written but disabled for now by a `unittest.skip` with a message. Importantly, `depthwise_conv3d` *does* support `fp32` now: its code is hand-written, so it was possible to extend it to `fp32`.

**facebookincubator#4.** In `V1` the newly added `fp32` tests passed Sandcastle but failed Circle CI. Looking into similar diffs for gemm / bmm --- D41168398 (fairinternal/AITemplate@1549112) and D41246673 (fairinternal/AITemplate@e81b808) --- I noticed that the `fp32` tests added there were guarded against CUDA arch < 80. As the CUDA arch in Circle CI seems to be 75, this probably explains the failures there. So in `V2` I've added the same guard here, too.
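The guard amounts to a `skipIf` on the detected compute capability. A hedged sketch (the `detect_cuda_arch` helper is a stand-in for however the test suite actually queries the GPU):

```python
import unittest

def detect_cuda_arch():
    """Stand-in GPU query; returns the compute capability as an int."""
    return 75  # e.g. sm_75, as on the Circle CI runner described above

class ConvFp32Test(unittest.TestCase):
    # The decorator is evaluated at class-creation time, so on an
    # arch-75 machine this test is skipped rather than failed.
    @unittest.skipIf(detect_cuda_arch() < 80, "fp32 conv requires CUDA arch >= 80")
    def test_conv2d_fp32(self):
        ...  # would build and run the fp32 op here
```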

**facebookincubator#5.** As currently written, the alignment-based filtering of the `conv2d` and `conv3d` ops won't allow any `fp32` cutlass kernels when the number of channels is divisible by `8` (as the maximum possible `ab_alignment` for `fp32` is `4`). E.g., for `conv2d`:

https://www.internalfb.com/code/fbsource/[427a647ecb904df6e6b8556f524ebf1a7017e755]/fbcode/aitemplate/AITemplate/python/aitemplate/backend/cuda/conv2d/common.py?lines=217-226%2C246-254%2C229

Apparently, alignment-based filtering needs to become `dtype`-aware. To this end, the code above (and likewise for `conv3d`) has been refactored in terms of the following function from `utils.alignment`:

https://www.internalfb.com/code/fbsource/[bf9d94d11f61]/fbcode/aitemplate/AITemplate/python/aitemplate/utils/alignment.py?lines=39-48
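The dtype-aware cap can be sketched as follows, assuming (as cutlass's vectorized loads suggest) that the widest access is 16 bytes, so the maximum element alignment is `16 // sizeof(dtype)`; function names here are illustrative, not the actual `utils.alignment` API:

```python
DTYPE_SIZE = {"float16": 2, "float32": 4}

def max_alignment(dtype):
    # 16-byte (128-bit) vector access, measured in elements of `dtype`.
    return 16 // DTYPE_SIZE[dtype]

def valid_alignments(channels, dtype):
    # Alignments that divide the channel count, capped by the dtype max.
    return [a for a in (8, 4, 2, 1)
            if a <= max_alignment(dtype) and channels % a == 0]

# For channels divisible by 8, fp16 can use alignment 8, but fp32 tops
# out at 4 -- so an alignment-8-only filter would reject every fp32 kernel.
assert valid_alignments(64, "float16") == [8, 4, 2, 1]
assert valid_alignments(64, "float32") == [4, 2, 1]
```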

Reviewed By: chenyang78

Differential Revision: D41423689

fbshipit-source-id: 09c63e96238b3a9c6085b4bc3e4c0a49fde4b924
evshiron pushed a commit to are-we-gfx1100-yet/AITemplate that referenced this issue Jun 21, 2023
* updated to 5th stable diffusion checkpoint (facebookincubator#57)

* updated to 5th stable diffusion checkpoint

* updated all stable diffusion example files to checkpoint v1.5

* Support different sizes via recompilation (StableDiff demo) (facebookincubator#71)

Mostly, this commit is just re-establishing the relationship
between various previously-hardcoded constants and the target
image size (since the latent size is 1/8 of the image size,
hardcoding the latent sizes is inconvenient).

This adds `--width` and `--height` options to both compile.py
and demo.py, and provided these both match you can process
different sizes. For img2img mode, the size options passed at
compile time must match the size of the actual input image.
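The size relationship this commit relies on can be sketched in a few lines (a hedged illustration, not code from the repo; stable diffusion's VAE downsamples by a factor of 8, so latent dims are image dims divided by 8):

```python
def latent_size(width, height, factor=8):
    """Latent dims for a given image size; dims must be multiples of `factor`."""
    if width % factor or height % factor:
        raise ValueError("image dimensions must be multiples of 8")
    return width // factor, height // factor

# The default 512x512 image maps to a 64x64 latent, which is what the
# previously-hardcoded constants encoded.
assert latent_size(512, 512) == (64, 64)
```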

Consequently, the `--img2img` flag for `compile.py` no longer
exists: all this ever did was change the hardcoded size to
match the default input image used by `demo_img2img.py`. Yikes.

Sooo it's slightly more flexible than before, but still has no
support for a single binary to handle different image sizes. It
isn't super clear that compiling a generic binary is useful: the
upstream project can do that just fine: isn't the whole point
of AITemplates to achieve performance gains via aggressive
constant propagation and benchmarking to select the optimal
kernels?

* v0.1.1 (facebookincubator#74)

* v0.11

* update cutlass

* fix

* add missing files

* patch cutlass

Co-authored-by: Bing Xu <bingxu@fb.com>

* fix sm86 conv (facebookincubator#81)

Co-authored-by: Bing Xu <bingxu@fb.com>

* fix README.md of bert example (facebookincubator#82)

* Add negative prompts feature for txt2img pipeline (facebookincubator#75)

Add optional negative prompt option for txt2img pipeline

* add missing copyright headers (facebookincubator#86)

* Conv2d group (facebookincubator#73)

* group conv

* add conv_groups op compiler

* Conv2d groups

* Conv2d depthwise

* wip

* wip

* wip

* wip

* only one ops to get feedback

* only one ops to get feedback

* Fix layout, now test passes

* Fix docstring

* Add conv2d_depthwise_bias and test

* Add conv2d_depthwise_bias and test and frontends

* doc

* frontend import depthwise

* Fix lint

* Fix lint

* Fix after rebase UTs pass

* fix lint

* fix more lint

* add more tile size for GN + update CK to main  (facebookincubator#40) (facebookincubator#3)

* add more tile size for gn

* update ck

Co-authored-by: Terry Chen <terrychen@meta.com>

Co-authored-by: Terry Chen <hahakuku@hotmail.com>
Co-authored-by: Terry Chen <terrychen@meta.com>

* Ck remove unnecessary compile include directories (facebookincubator#4)

* remove unnecessary include directory while compiling ck code

* refactor data_type.hpp under ck/utility/data_type.hpp

* Update docker to ROCm5.3 (facebookincubator#2)

* upgrade compiler to ROCM 5.3 version

* remove unnecessary build fixes

Co-authored-by: illsilin <Illia.Silin@amd.com>

* Fix BERT benchmark for 2 gcd (facebookincubator#6)

* fixed batch_size > 1

* load so file for benchmark

* Ci setup (facebookincubator#11)

* add script for ci and testing

* fix syntax

* fix syntax again

* get rid of the drun alias

* get rid of interactive flag for docker

* fix syntax

* run docker without sudo

* run some sanity checks before docker

* change the run directive

* fix syntax

* merge build and test steps into one

* fix the path to examples

* add pytorch

* fix syntax

* install timm module

* set paths in the docker

* change the version of the pytorch

* try running bert and vit models

* add modules for bert

* test if examples work with FB repo

* try building the docker from the ait source

* try building the docker from the rocm/ait repo

* get rid of unnecessary changing paths

* try running examples 1 and 4

* update docker arguments

* fix syntax

* try skipping the rebuilding steps

* try using the same commits as Jing

* check the pytorch version

* force replacing pytorch

* update the examples

* remove the foreground commands

* skip the BERT tests while using mi100

* clean up and add logfiles

* archive the logfiles

* fix path to log files, refine steps

* fix paths

* fix path to logfiles

* specify exact paths to logs

* fix syntax

* fix syntax

* get rid of workspace path in artifact paths

* write log headers and archive them in one step

* set git branch name as global env var

* fix syntax

* set the branch name value in each necessary step

* test posting test results to db

* add missing python packages

* do not install glob module

* do not convert dbsshport to int type

* check the port value

* hardcode ssh port

* try re-running with new action secrets

* skip the ssh tunnel

* apply changes to all branches and use tunnel if not running on db host

* change the syntax to check hostname

* fix syntax

* move the python script for processing the results

* only run ci for the push branch

* add BERT tests

* modify the script to parse and store BERT test results

* post-merge fix of pr 6 (facebookincubator#13)

Co-authored-by: root <root@ctr-ubbsmc15.amd.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>

* Add stable diffusion benchmark to the CI. (facebookincubator#16)

* add compilation of stable diffusion

* add missing python modules and new demos

* add accelerate module and fix the parsing script

* only use batch size 1 for stable diffusion

* add stable diffusion benchmark result to the table

* sync upstream v0.1.1 (facebookincubator#15)

* updated to 5th stable diffusion checkpoint (facebookincubator#57)

* updated to 5th stable diffusion checkpoint

* updated all stable diffusion example files to checkpoint v1.5

* Support different sizes via recompilation (StableDiff demo) (facebookincubator#71)

Mostly, this commit is just re-establishing the relationship
between various previously-hardcoded constants and the target
image size (since the latent size is 1/8 of the image size,
hardcoding the latent sizes is inconvenient).

This adds `--width` and `--height` options to both compile.py
and demo.py, and provided these both match you can process
different sizes. For img2img mode, the size options passed at
compile time must match the size of the actual input image.

Consequently, the `--img2img` flag for `compile.py` no longer
exists: all this ever did was change the hardcoded size to
match the default input image used by `demo_img2img.py`. Yikes.

Sooo it's slightly more flexible than before, but still has no
support for a single binary to handle different image sizes. It
isn't super clear that compiling a generic binary is useful: the
upstream project can do that just fine: isn't the whole point
of AITemplates to achieve performance gains via aggressive
constant propagation and benchmarking to select the optimal
kernels?

* v0.1.1 (facebookincubator#74)

* v0.11

* update cutlass

* fix

* add missing files

* patch cutlass

Co-authored-by: Bing Xu <bingxu@fb.com>

* fix profile

* fix profile bugs

* update ck commit

* fix format

* fix format

* update timeout

* add rocm unittest case

Co-authored-by: Ivan Mikhnenkov <39604625+ivanmikhnenkov@users.noreply.github.com>
Co-authored-by: Chris Kitching <chriskitching@linux.com>
Co-authored-by: Bing Xu <antinucleon@gmail.com>
Co-authored-by: Bing Xu <bingxu@fb.com>

* merge amd-develop

Co-authored-by: Ivan Mikhnenkov <39604625+ivanmikhnenkov@users.noreply.github.com>
Co-authored-by: Chris Kitching <chriskitching@linux.com>
Co-authored-by: Bing Xu <antinucleon@gmail.com>
Co-authored-by: Bing Xu <bingxu@fb.com>
Co-authored-by: Zhang Jun <ewalker@live.cn>
Co-authored-by: Bozhao <yubz86@gmail.com>
Co-authored-by: Max Podkorytov <maxdp@meta.com>
Co-authored-by: Ehsan Azar <dashesy@gmail.com>
Co-authored-by: Chao Liu <lc.roy86@gmail.com>
Co-authored-by: Terry Chen <hahakuku@hotmail.com>
Co-authored-by: Terry Chen <terrychen@meta.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: root <root@ctr-ubbsmc15.amd.com>