2x performance drop using pytorch depending on how input data is fed into model #198
Comments
Thanks for reaching out! You will need to play with the AIO_SKIP_MASTER_THREAD env variable (possible values are 0 and 1, default is 0) to get the best performance, i.e.:
As you can see, ~100 ms latency is possible on your 2-threaded Ampere Altra VM in each case listed in your example, provided the proper value of the env variable is set. FYI, we are working on a solution relieving the user of the need to adjust this parameter. Btw, you will get even better performance by auto-casting to fp16:
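The exact snippet from the thread is not shown in this excerpt; below is a minimal sketch of the idea using stock PyTorch. The env variable is typically set in the shell before launching (e.g. AIO_SKIP_MASTER_THREAD=1 python script.py); the resnet50 model and the 'aio' compile options are just assumed for illustration, and float16 CPU autocast requires a reasonably recent PyTorch build.

```python
import os
os.environ.setdefault("AIO_SKIP_MASTER_THREAD", "1")  # try 0 vs 1; must be set before the AIO runtime starts

import torch
import torchvision

# Assumed example model; any model compiled with the 'aio' backend applies.
model = torchvision.models.resnet50().eval()
model = torch.compile(model, backend='aio', options={'modelname': 'resnet50'})

data = torch.rand(1, 3, 224, 224)
with torch.no_grad(), torch.autocast(device_type='cpu', dtype=torch.float16):
    out = model(data)  # ops that support it run in fp16 on CPU
```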
:)
Thanks a lot for the fast and helpful reply ) Indeed, that works. So the documentation says:

Does it mean that

As for auto-casting to fp16 - it looks like magic ) Indeed, it works 2x faster. I supposed PyTorch did not support fp16 on CPU, as mentioned in #152; I actually tried it myself before I found that issue. So setting

I'm closing the issue, as you've already answered, but would appreciate another reply ) Not related, but do you have plans to release your packages for use outside docker, or so that they can be used in my own custom container?
Yes
Since x86 CPUs didn't support FP16 before a very recent AVX-512 extension, it seems that nobody really cared about CPU support for FP16 in PyTorch. Currently PyTorch's support for FP16 on CPU is very limited, but there is some work going on in the master branch of PyTorch, e.g.:

And yes
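As a quick, generic way to see whether a given op has FP16 coverage on CPU (an illustration, not from the thread), one can probe it directly:

```python
import torch

x = torch.rand(8, 8, dtype=torch.float16)
w = torch.rand(8, 8, dtype=torch.float16)

try:
    y = x @ w  # older CPU builds raise: "addmm_impl_cpu_" not implemented for 'Half'
    print("fp16 matmul on CPU works, result dtype:", y.dtype)
except RuntimeError as err:
    print("fp16 matmul on CPU is not supported here:", err)
```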
Please contact us at ai-support@amperecomputing.com and we should be able to get you a working .deb installer.
Hi, it's me again 🙈. It seems like a similar issue, though a bit different. This time it depends on the input size. For some (smaller) inputs it works fast, but after some threshold it suddenly slows down 2-3x. Here I'm using the vision transformer from the timm library. It basically reshapes an image from 2D into a 1D sequence and runs a fairly standard transformer on it. So for img_size=110 and patch_size=10 the sequence length will be 11 * 11 = 121, and if you increase img_size to 120, the sequence length will be 12 * 12 = 144.

```python
import torch
import timm
import time

# img_size = 110  # latency: 10ms
img_size = 120    # latency: 33ms

model = timm.models.VisionTransformer(
    img_size=img_size,
    patch_size=10,
    embed_dim=128,
    num_heads=8,
    depth=12
)
model.eval()
model = torch.compile(model, backend='aio', options={'modelname': 'vit'})

data = torch.rand(1, 3, img_size, img_size)

n_warmup = 5
n = 100
with torch.no_grad():
    for i in range(n + n_warmup):
        if i == n_warmup:
            start = time.time()
        model(data)
duration = time.time() - start
latency = duration / n * 1000
cps = 1000 / latency
print(f'Latency: {round(latency)}ms, rate: {round(cps)} per second')
```

With AIO_SKIP_MASTER_THREAD=1 it works a bit faster, though there is still the same slowdown when changing the input size. Should I provide logs, or maybe you have ideas what could be wrong without them?
Hi, I've tried to run your script with:

There is some difference, but not that big. How do you run the script? What number of threads are you using?
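For reference, one generic way to make the thread count explicit when comparing such numbers (this uses stock PyTorch/OpenMP knobs only; the Ampere runtime may have its own thread settings, which are not assumed here):

```python
import os
os.environ.setdefault("OMP_NUM_THREADS", "2")  # set before libraries spin up their thread pools

import torch
torch.set_num_threads(2)                        # intra-op parallelism in stock PyTorch
print("intra-op threads:", torch.get_num_threads())
```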
Hmm... looks like I found the reason. There are the following lines in the Attention module implementation:

```python
if self.fused_attn:
    x = F.scaled_dot_product_attention(
        q, k, v,
        dropout_p=self.attn_drop.p if self.training else 0.,
    )
else:
    q = q * self.scale
    attn = q @ k.transpose(-2, -1)
    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)
    x = attn @ v
```

I actually wanted to ask if you have optimizations for transformers/attention. PyTorch has this

If I disable fused attention explicitly by setting
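The exact attribute being set is cut off above; since the quoted code branches on self.fused_attn, one generic way to force the unfused path in timm is to flip that flag on every module exposing it (illustration only):

```python
import timm

model = timm.models.VisionTransformer(
    img_size=120, patch_size=10, embed_dim=128, num_heads=8, depth=12
).eval()

# Force the manual softmax(q @ k^T) @ v path instead of F.scaled_dot_product_attention.
for module in model.modules():
    if hasattr(module, "fused_attn"):
        module.fused_attn = False
```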
Hi,
Hi!
I'm not sure if this is a good place to get help/report an issue; if not, let me know what would be a better way.
First of all, thanks for the great library! However, trying it out, I faced some performance issues that I can't fully understand. Here is an example script. It is quite simple: I'm using the standard resnet50 model and feeding it with some data in a loop. However, what I noticed is that small changes in how I'm feeding the data can cause a quite drastic 2x performance drop.

So, if you use the same data or create random data on each iteration, it works fast. However, if you use torch.stack or just .clone, performance drops. TBH, I don't understand why these quite small changes would matter.

I tried to use torch.jit.script and torch.jit.trace instead of torch.compile, but results weren't any better; in fact, using torch.jit.script latency was ~200ms even in the 1st case (same data on each iteration), but torch.jit.trace and torch.compile were very similar. For comparison, without compilation/scripting/tracing, latency is ~290ms no matter how you feed the data.
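The example script itself is not included in this excerpt; a rough sketch of the kind of loop described (resnet50 compiled with the 'aio' backend, as in the ViT script elsewhere in this thread, with the different feeding strategies shown as comments) might look like:

```python
import time
import torch
import torchvision

model = torchvision.models.resnet50().eval()
model = torch.compile(model, backend='aio', options={'modelname': 'resnet50'})

base = torch.rand(1, 3, 224, 224)

def feed():
    # return base                    # "same data every iteration" - reported fast
    # return torch.rand_like(base)   # "new random data" - reported fast
    # return torch.stack([base[0]])  # "torch.stack" - reported ~2x slower
    return base.clone()              # ".clone" - reported ~2x slower

n_warmup, n = 5, 100
with torch.no_grad():
    for i in range(n + n_warmup):
        if i == n_warmup:
            start = time.time()
        model(feed())
latency = (time.time() - start) / n * 1000
print(f'Latency: {round(latency)} ms')
```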
What I also noticed is a difference in CPU usage. In both cases both CPU cores are 100% utilized; however, when it is slow, more time is spent in kernel threads (red portion):
Fast:
Slow:
I'm also attaching logs obtained with AIO_DEBUG_MODE=5: log_fast.txt, log_slow.txt
I'm using your latest docker image amperecomputingai/pytorch:1.7.0.
Is there something obvious that I'm missing?