P40 int8 inference is too slow #45

Closed

luanshaotong opened this issue Jul 13, 2023 · 12 comments

luanshaotong commented Jul 13, 2023

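The sample code runs at roughly 1 it/s.
The A100 has about 300 TFLOPS of fp16 compute, and the official speed is 25.4 it/s.
The P40 has 47 TOPS of int8 compute, so the speed should be roughly 4 it/s.
Right now it feels about the same as fp16 on the P40. Is this a quantization issue, or are some libraries not installed properly?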
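The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the reasoning in the post: it assumes throughput scales linearly with peak compute and ignores memory bandwidth and kernel efficiency. The function name `scale_throughput` is made up for illustration.

```python
# Figures quoted in the post (vendor peak numbers, not measurements).
A100_FP16_TFLOPS = 300.0
A100_ITERS_PER_SEC = 25.4
P40_INT8_TOPS = 47.0


def scale_throughput(ref_speed, ref_peak, target_peak):
    """Scale a reference speed linearly by the ratio of peak compute."""
    return ref_speed * target_peak / ref_peak


expected_p40 = scale_throughput(A100_ITERS_PER_SEC, A100_FP16_TFLOPS, P40_INT8_TOPS)
print(f"{expected_p40:.2f} it/s")  # prints 3.98 it/s
```

The observed 1 it/s is about a quarter of this figure, which is the gap the question is about.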

Package Version

accelerate 0.20.3
aiofiles 23.1.0
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.0.1
anyio 3.7.1
async-timeout 4.0.2
attrs 23.1.0
certifi 2023.5.7
charset-normalizer 3.2.0
click 8.1.4
cmake 3.26.4
contourpy 1.1.0
cpm-kernels 1.0.11
cycler 0.11.0
exceptiongroup 1.1.2
fastapi 0.99.1
ffmpy 0.3.0
filelock 3.12.2
fonttools 4.40.0
frozenlist 1.3.3
fsspec 2023.6.0
gradio 3.36.1
gradio_client 0.2.8
h11 0.14.0
httpcore 0.17.3
httpx 0.24.1
huggingface-hub 0.16.4
idna 3.4
importlib-metadata 6.8.0
importlib-resources 6.0.0
install 1.3.5
Jinja2 3.1.2
jsonschema 4.18.0
jsonschema-specifications 2023.6.1
kiwisolver 1.4.4
latex2mathml 3.76.0
linkify-it-py 2.0.2
lit 16.0.6
Markdown 3.4.3
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.7.2
mdit-py-plugins 0.3.3
mdtex2html 1.2.0
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.4
networkx 3.1
numpy 1.24.4
orjson 3.9.2
packaging 23.1
pandas 2.0.3
Pillow 10.0.0
pip 23.1.2
pkgutil_resolve_name 1.3.10
protobuf 4.23.4
psutil 5.9.5
pydantic 1.10.7
pydub 0.25.1
Pygments 2.15.1
pyparsing 3.0.9
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0
referencing 0.29.1
regex 2023.6.3
requests 2.31.0
rpds-py 0.8.10
safetensors 0.3.1
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 41.6.0
six 1.16.0
sniffio 1.3.0
sse-starlette 1.6.1
starlette 0.28.0
sympy 1.12
tokenizers 0.13.3
toolz 0.12.0
torch 2.0.1+cu118
torchvision 0.15.2+cu118
tqdm 4.65.0
transformers 4.30.2
transformers-stream-generator 0.0.4
triton 2.0.0
typing_extensions 4.5.0
tzdata 2023.3
uc-micro-py 1.0.2
urllib3 2.0.3
uvicorn 0.22.0
websockets 11.0.3
yarl 1.9.2
zipp 3.16.0

Author

luanshaotong commented Jul 13, 2023 •

edited

~~Also, it looks like the cuDNN library is not installed. Does that matter?~~ That turned out to be a misunderstanding; cuDNN is installed.

jianghaiqun commented Jul 14, 2023

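How much physical memory do you have? With a single P40 and 32 GB of RAM I run out of memory every time and have never managed to start it.
Also, the P40 shouldn't support fp16, should it?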

Author

luanshaotong commented Jul 14, 2023 •

edited

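> How much physical memory do you have? With a single P40 and 32 GB of RAM I run out of memory every time and have never managed to start it. Also, the P40 shouldn't support fp16, should it?

The host has 96 GB of RAM. Loading the model during testing uses roughly 40 GB of memory.

The P40 is compatible with fp16, at the same speed as fp32 (memory use may also be the same as fp32).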

Collaborator

jameswu2014 commented Jul 14, 2023

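The slow speed is probably expected. The current implementation uses mixed precision, and its main purpose is to save GPU memory. If you are running out of host memory, try adjusting the swap space and see whether that helps.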

Author

luanshaotong commented Jul 14, 2023 •

edited by jameswu2014

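> The slow speed is probably expected. The current implementation uses mixed precision, and its main purpose is to save GPU memory. If you are running out of host memory, try adjusting the swap space and see whether that helps.

@jameswu2014 Thank you very much, that makes sense now. Are there plans to compute directly in int8 later, or to adopt another acceleration approach such as FasterTransformer?

We are iterating on this. Please stay tuned, thanks.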

GradientGuru assigned jameswu2014

mynewstart commented Jul 17, 2023 •

edited by jameswu2014

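> The slow speed is probably expected. The current implementation uses mixed precision, and its main purpose is to save GPU memory. If you are running out of host memory, try adjusting the swap space and see whether that helps.

> @jameswu2014 Thank you very much, that makes sense now. Are there plans to compute directly in int8 later, or to adopt another acceleration approach such as FasterTransformer?

Is it slow because the intermediate computation is still stored in fp16 and only the model parameters have become int8? And if the intermediate results are kept in fp16, why isn't the speed about the same as the unquantized model? Where exactly is the time lost?

The time is lost in the int8 -> fp16 dequantization. We will iterate on this later. Please stay tuned, thanks.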

jameswu2014 closed this as completed

shesung commented Jul 27, 2023

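int8 compute is simply not used. The quantization here only compresses the storage size of the parameters; the computation is still done in fp16/fp32. Most acceleration libraries today, such as LLM.int8(), are built on tensor cores. The P40's int8 acceleration uses the DP4A instruction, which is a completely different instruction family from tensor cores, so I doubt these libraries will support Pascal GPUs in the future either. Better to switch to a 20-series or later card sooner rather than later...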
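To make the point above concrete, here is a minimal NumPy sketch of weight-only int8 quantization. It is not this project's implementation; the helper names `quantize_per_row` and `linear_weight_only` and the symmetric per-row scheme are assumptions chosen for illustration. The weights are stored as int8, but every forward pass first converts them back to fp16 and then runs an ordinary fp16 matmul, so storage shrinks while the arithmetic does not get cheaper.

```python
import numpy as np


def quantize_per_row(w):
    """Symmetric per-row int8 quantization so that w is roughly q * scale."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero for all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)


def linear_weight_only(x, q, scale):
    """Dequantize int8 weights to fp16, then do a plain fp16 matmul."""
    w_fp16 = q.astype(np.float16) * scale  # extra dequantization work on every call
    return x @ w_fp16.T


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)  # hypothetical layer weights
x = rng.standard_normal((2, 16)).astype(np.float16)  # hypothetical activations

q, scale = quantize_per_row(w)
y = linear_weight_only(x, q, scale)

print(q.nbytes, w.nbytes)  # int8 storage is a quarter of the fp32 storage
print(y.dtype)             # the matmul itself still runs in float16
```

In this sketch the only saving is in `q.nbytes`; the multiply-accumulate work is unchanged and the dequantization is added on top of it.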

Qbuer commented Oct 16, 2023

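> int8 compute is simply not used. The quantization here only compresses the storage size of the parameters; the computation is still done in fp16/fp32. Most acceleration libraries today, such as LLM.int8(), are built on tensor cores. The P40's int8 acceleration uses the DP4A instruction, which is a completely different instruction family from tensor cores, so I doubt these libraries will support Pascal GPUs in the future either. Better to switch to a 20-series or later card sooner rather than later...

@shesung What do you mean by "int8 compute is not used"? Do you mean int8 compute optimization at the GPU instruction-set level?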

shesung commented Oct 17, 2023

@Qbuer Yes. The int8 acceleration instruction on the 10 series is DP4A, and most LLM acceleration libraries do not support it.
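For readers unfamiliar with DP4A: it computes a dot product of four 8-bit values from each of two operands and adds the result to a 32-bit accumulator (CUDA exposes it as the `__dp4a` intrinsic). The pure-Python emulation below only sketches those semantics for the signed case; it is not a performance example, and the helper names `dp4a` and `int8_dot` are invented for illustration.

```python
def dp4a(a, b, c):
    """Emulate signed DP4A: c plus the sum of four int8 * int8 products."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("DP4A operates on exactly four 8-bit lanes per operand")
    for v in (*a, *b):
        if not -128 <= v <= 127:
            raise ValueError("lane value does not fit in int8")
    acc = c + sum(x * y for x, y in zip(a, b))
    # Wrap to a signed 32-bit integer, like a hardware register would.
    acc &= 0xFFFFFFFF
    return acc - (1 << 32) if acc & 0x80000000 else acc


def int8_dot(a, b):
    """Dot product of two int8 vectors (length a multiple of 4) via repeated DP4A."""
    acc = 0
    for i in range(0, len(a), 4):
        acc = dp4a(a[i:i + 4], b[i:i + 4], acc)
    return acc


print(dp4a([1, 2, 3, 4], [5, 6, 7, 8], 10))  # prints 80
```

A library would have to emit this instruction in its int8 kernels to benefit on a P40, which is a different code path from the tensor core kernels discussed in this thread.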

mynewstart commented Oct 25, 2023

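> int8 compute is simply not used. The quantization here only compresses the storage size of the parameters; the computation is still done in fp16/fp32. Most acceleration libraries today, such as LLM.int8(), are built on tensor cores. The P40's int8 acceleration uses the DP4A instruction, which is a completely different instruction family from tensor cores, so I doubt these libraries will support Pascal GPUs in the future either. Better to switch to a 20-series or later card sooner rather than later...

@shesung Could you explain a bit more? Does the current A100 support int8 compute? Is the problem that the instructions the acceleration libraries rely on differ from what a given GPU supports?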

shesung commented Oct 30, 2023

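@mynewstart The A100 supports it. Mainstream LLM acceleration libraries are almost all built on tensor cores, so nearly every card from the V100 onward supports int8 acceleration.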

mynewstart commented Oct 31, 2023

@shesung Thanks for the answer! I previously ran inference on an A100 with AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_8bit=True), and it did not seem any faster. It was actually slower. What is the reason for that?
