Regression: EngineException: default_program(22): error: extra text after expected end of number with DJL 0.28.0 + intfloat/multilingual-e5-small on machine with GPU #3089
Comments
You can still use DJL 0.28.0 with PyTorch 2.0.1:
|
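Roughly like this - a sketch only, using the PYTORCH_VERSION/PYTORCH_FLAVOR properties from the pytorch-engine README (they can also be set as environment variables or -D JVM flags):

```java
// Sketch: pin the native libtorch build before the PyTorch engine loads its native library.
import ai.djl.engine.Engine;

public final class PinPyTorchVersion {
    public static void main(String[] args) {
        // Must be set before the first Engine/Criteria call initializes PyTorch.
        System.setProperty("PYTORCH_VERSION", "2.0.1");
        System.setProperty("PYTORCH_FLAVOR", "cu118"); // assumed flavor for CUDA 11.8
        Engine engine = Engine.getEngine("PyTorch");
        System.out.println("PyTorch engine version: " + engine.getVersion());
    }
}
```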
I tried, but as I mentioned above, it fails with this:
|
Please try it again. I added 2.0.1 support to 0.28.0-SNAPSHOT.
Many thanks @frankfliu for your quick reply - that works. Hopefully we can somehow convince PyTorch to fix this issue, as staying pegged to 2.0.1 is not a great long-term solution. They claim libtorch is in maintenance mode, but this is a regression.
|
@frankfliu - this worked fine on Linux, but on Windows (I have to support this platform too, sadly), it fails with what looks like the same error, even when using PyTorch 2.0.1. The logs below confirm that version of PyTorch is being used. I installed NVIDIA Toolkit 11.8 and cuDNN 8.9.7, which I read are compatible with each other and with that version of PyTorch. Is there something else I have done wrong here? Thanks again for any advice.
|
Can you convert the model to onnx?
I can try that route, but I was hoping to avoid it, as the current code will dynamically download the HuggingFace models to the user's machine, which is super convenient; I was hoping to avoid having to ship model files explicitly. Any ideas why Windows is affected like this? I thought this bug was PyTorch specific, so it seems really odd that 2.0.1 is showing this issue even on Windows.
@frankfliu, FWIW using onnx seemed to work, which is great! I saw the recent commit adding support for converting HuggingFace models to onnx: https://github.com/deepjavalibrary/djl/pull/3093/files. @xyang16, as an FYI, while this worked, I got an error from model_zoo_importer.py you might want to fix:
Strangely, after changing things to use onnx for E5, this PyTorch model djl://ai.djl.huggingface.pytorch/sentence-transformers/clip-ViT-B-32-multilingual-v1 is now failing to load when it used to work. Perhaps this is related to the "text embedding translator regression" bug you just fixed here? #3095. In any case, I might convert sentence-transformers/clip-ViT-B-32-multilingual-v1 to onnx so I can avoid using PyTorch completely. Any ideas on what might be causing this?
|
Fix model import issue: #3098. The PyTorch issue should not be related to the Translator changes. It might be caused by the jit trace and DJL PyTorch using different versions.
@frankfliu - while OnnxRuntime works, I have found it to be about three times slower compared to PyTorch, even when using

Re: the Windows issue when using PyTorch 2.0.1 ("error: extra text after expected end of number") that I mentioned here: #3089 (comment) - I don't believe "It might be caused by jit trace and DJL pytorch are using different version" applies, since I am dynamically downloading the models. I have zapped the .djl.ai directories before my runs and it doesn't seem to help. Any ideas on how to resolve the issue for Windows? It seems really odd that it behaves differently to Linux here.
OnnxRuntime should be much faster than PyTorch. Are you sure you are using the GPU? Which CUDA are you using? OnnxRuntime currently doesn't support CUDA 12. Did you install TensorRT? Which version are you using?
@frankfliu - the GPU is definitely being used as confirmed by nvidia-smi. I can also see what looks like the appropriate versions of the key libraries loaded, checked via
I did see this warning message which is a concern (I didn't see it initially as it went to stderr rather than our log files):
I'm not sure if this could be the cause of the performance issue. NVIDIA/TensorRT#2542 has some interesting insights here, but it is not clear to me. Are there some tweaks we have to make to the DJL ONNX exporter to handle this? As a reminder, I created this ONNX model using this command:
Also - as a poor man's profiler, I ran jstack against one of my worker processes many times to get a sense of common stacks. When using PyTorch, I very rarely see code interacting with the GPU. For OnnxRuntime it is more common, but curiously I see these kinds of stacks fairly often (almost never with PyTorch):
and
This is curious because this code is executed for processing the "mean pool" of the output from the model. Effectively this code (note I can switch between PyTorch and OnnxRuntime via a system property):

```java
protected NDArray processEmbedding(TranslatorContext ctx, NDList list)
{
    NDArray embedding;
    if ("ortModel".equals(ctx.getModel().getNDManager().getName()))
    {
        embedding = list.get(0);
    }
    else
    {
        embedding = list.get("last_hidden_state");
    }
    Encoding encoding = (Encoding) ctx.getAttachment("Encoding");
    NDArray inputAttentionMask = ctx.getNDManager().create(encoding.getAttentionMask()).toType(DataType.FLOAT32, true);
    return meanPool(embedding, inputAttentionMask);
}

private static NDArray meanPool(NDArray embeddings, NDArray attentionMask)
{
    long[] shape = embeddings.getShape().getShape();
    attentionMask = attentionMask.expandDims(-1).broadcast(shape);
    NDArray inputAttentionMaskSum = attentionMask.sum(AXIS);
    NDArray clamp = inputAttentionMaskSum.clip(1e-9f, 1e12f);
    NDArray prod = embeddings.mul(attentionMask);
    NDArray sum = prod.sum(AXIS);
    return sum.div(clamp);
}
```

I am using OnnxRuntime with PyTorch as a "hybrid engine" as described by https://djl.ai/docs/hybrid_engine.html. Could this somehow be the cause of the slowdown?
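For reference, the hybrid setup is wired up roughly like this - a minimal sketch, where the djl:// model URL is illustrative rather than my exact configuration:

```java
// Sketch: inference runs on OnnxRuntime, while PyTorch (also on the classpath)
// serves as the alternative engine for the NDArray math in the Translator.
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.inference.Predictor;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public final class HybridEngineExample {
    public static void main(String[] args) throws Exception {
        Criteria<Encoding, float[]> criteria =
                Criteria.builder()
                        .setTypes(Encoding.class, float[].class)
                        // Assumed URL; a local ONNX model path works as well.
                        .optModelUrls("djl://ai.djl.huggingface.onnxruntime/intfloat/multilingual-e5-small")
                        .optEngine("OnnxRuntime")
                        .optTranslator(new EncodingEmbeddingTranslator()) // shown in full later in this thread
                        .build();

        try (ZooModel<Encoding, float[]> model = criteria.loadModel();
                Predictor<Encoding, float[]> predictor = model.newPredictor()) {
            // predictor.predict(encoding) runs OnnxRuntime inference; the post-processing
            // NDArray operations are delegated to PyTorch via NDArrayAdapter.
        }
    }
}
```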
By the way, with the latest DJL, you can use this model with
Which GPU arch are you using? I ran the benchmark for this model on EC2
You can see the model inference latency is only 2.3 ms, and I can get 820 TPS.
@frankfliu - I am running on G5 instances, so I believe they are A10s. Great to hear you have added support for dynamically loading ONNX models via HuggingFace. I'm curious whether you can run your benchmarks for a longer period of time, and compare them to PyTorch directly, to see if you can reproduce my issue. Any ideas about the "Your ONNX model has been generated with INT64 weights..." warning message I received? Could this be the issue with my performance? Curious that you didn't see it.
I ran your benchmarks with both PyTorch and OnnxRuntime, and OnnxRuntime seems way faster. I didn't see the warning message either, so maybe I'll change my code to use the dynamic model download to see if that helps. Here are my benchmark results. I changed the -c parameter to 10000, not that I think that was needed:
|
Hmmm, I can see the PyTorch benchmark didn't use the GPU for some reason...
For PyTorch, you need to set
Ah, thanks. Now we see better results for PyTorch, but ONNX is still quite a bit faster (3-4x):
|
@frankfliu - for my program, despite using the downloaded version of the ONNX models, performance is sadly the same. Overall my program (which does a lot more than just inferencing) is 3x slower compared to using the PyTorch engine.

Returning to the jstacks, I believe the time difference is due to post-processing in the Translator after inferencing has happened. I think the issue is that ONNX has to convert/allocate new PyTorch NDArrays and copy the data before it can run the operations, due to the "hybrid engine" approach via NDArrayAdapter. Whereas with PyTorch, it can very quickly compute split(), sum() and prod() on the existing NDArray by calling JniUtils.sum() and JniUtils.mul() immediately. So I don't think ONNX inferencing is the problem; it is the post-processing code, and the need to copy/allocate PyTorch arrays to do that work, that seems to be very slow.

As a reminder, here are some example jstacks when using ONNX which are not seen with PyTorch, as these post-processing sections run very quickly. Any ideas on what can be done to speed up ONNX Translator post-processing?
|
You can use
If post-processing is the bottleneck, you can make pooling and normalization part of the model, and convert to onnx with the post-processing included. It's possible to avoid the memory copy for the hybrid engine. There is a private method in OnnxRuntime; we could use reflection to invoke it (but I would rather avoid that if possible).
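A rough sketch of collecting per-stage timings with DJL's Metrics (the metric names here are assumptions; predictor and encodings as set up earlier):

```java
// Sketch: attach Metrics to the predictor so each predict() call records
// preprocess/inference/postprocess timings that can be compared per engine.
import ai.djl.metric.Metrics;

Metrics metrics = new Metrics();
predictor.setMetrics(metrics);

for (Encoding encoding : encodings) {
    predictor.predict(encoding);
}

// Metric names are assumed; inspect the collected metrics to see the exact keys.
System.out.println("mean Inference:   " + metrics.mean("Inference"));
System.out.println("mean Postprocess: " + metrics.mean("Postprocess"));
```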
@frankfliu - here are the results of the Metrics run. As suspected, it looks like postprocess is the culprit for OnnxRuntime. I'd like to keep (ideally) the same translator post-processing code regardless of the engine I use. Any ideas on how we can speed up OnnxRuntime post-processing? I understand it is not ideal to use the private method, but the overheads seem very large at the moment. Thanks for all your help with this.

PyTorch:
OnnxRuntime:
|
For OnnxRuntime, here is what might have happened:
|
Thanks @frankfliu. Given what my jstacks show (albeit there are caveats with that), my guess is PyTorch CPU is being used in post-processing. Any suggestions on how to fix that? |
It seems some testing code got merged into master; please try again after this PR is merged: #3122
@frankfliu - that change of yours looks good, but I am not using TextEmbeddingTranslator. I use my own translator, which takes tokenised input directly (partly to handle large documents and to ensure the number of tokens passed doesn't exceed 512). So I don't think that will solve the issue, sadly:

```java
/**
* An embeddings translator which takes encoded input rather than
* the raw string.
*/
class EncodingEmbeddingTranslator implements Translator<Encoding, float[]>
{
private static final int[] AXIS = {0};
@Override
@CanIgnoreReturnValue
public NDList processInput(TranslatorContext ctx, Encoding encoding)
{
ctx.setAttachment("Encoding", encoding);
return encoding.toNDList(ctx.getNDManager(), false);
}
@Override
public NDList batchProcessInput(TranslatorContext ctx, List<Encoding> encodings)
{
NDManager manager = ctx.getNDManager();
ctx.setAttachment("Encodings", encodings);
NDList[] batch = new NDList[encodings.size()];
for (int i = 0; i < encodings.size(); i++)
{
batch[i] = encodings.get(i).toNDList(manager, false);
}
return getBatchifier().batchify(batch);
}
@Override
public float[] processOutput(TranslatorContext ctx, NDList list) {
Encoding encoding = (Encoding) ctx.getAttachment("Encoding");
NDArray embeddings = processEmbedding(ctx, list, encoding);
embeddings = embeddings.normalize(2, 0);
return embeddings.toFloatArray();
}
@Override
public List<float[]> batchProcessOutput(TranslatorContext ctx, NDList list) {
NDList[] batch = getBatchifier().unbatchify(list);
List<Encoding> encodings = (List<Encoding>) ctx.getAttachment("Encodings");
List<float[]> ret = new ArrayList<>(batch.length);
for (int i = 0; i < batch.length; ++i) {
NDArray array = processEmbedding(ctx, batch[i], encodings.get(i));
array = array.normalize(2, 0);
ret.add(array.toFloatArray());
}
return ret;
}
/**
* Process the embeddings.
*
* @param ctx the translator context.
* @param list the embeddings.
* @param encoding the encoding.
* @return the updated embeddings.
*/
protected NDArray processEmbedding(TranslatorContext ctx, NDList list, Encoding encoding)
{
NDArray embedding;
if ("ortModel".equals(ctx.getModel().getNDManager().getName()))
{
embedding = list.get(0);
}
else
{
embedding = list.get("last_hidden_state");
}
NDArray inputAttentionMask = ctx.getNDManager().create(encoding.getAttentionMask()).toType(DataType.FLOAT32, true);
return meanPool(embedding, inputAttentionMask);
}
/**
* Computes the mean pool.
*
* @param embeddings the embeddings.
* @param attentionMask the attention mask.
* @return the mean pool.
*/
private static NDArray meanPool(NDArray embeddings, NDArray attentionMask)
{
long[] shape = embeddings.getShape().getShape();
attentionMask = attentionMask.expandDims(-1).broadcast(shape);
NDArray inputAttentionMaskSum = attentionMask.sum(AXIS);
NDArray clamp = inputAttentionMaskSum.clip(1e-9f, 1e12f);
NDArray prod = embeddings.mul(attentionMask);
NDArray sum = prod.sum(AXIS);
return sum.div(clamp);
}
}
```
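For completeness, the Encoding inputs above are produced with DJL's HuggingFace tokenizer roughly along these lines (a sketch - the exact builder options may differ from what I actually use):

```java
// Sketch: tokenise with truncation so inputs never exceed the model's 512-token limit,
// then feed the Encoding straight into the translator above.
import ai.djl.huggingface.tokenizers.Encoding;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;

HuggingFaceTokenizer tokenizer =
        HuggingFaceTokenizer.builder()
                .optTokenizerName("intfloat/multilingual-e5-small")
                .optTruncation(true)
                .optMaxLength(512)
                .build();

Encoding encoding = tokenizer.encode("query: example passage to embed");
float[] embedding = predictor.predict(encoding); // predictor created as in the earlier sketch
```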
@frankfliu - I still believe the issue is that the alternative manager for OrtEngine (in my case PtEngine) is not using the GPU, hence the post operations are slow. I had a quick look in this area, and noticed this code:

```java
protected BaseNDManager(NDManager parent, Device device) {
    this.parent = parent;
    this.device = device == null ? defaultDevice() : device;
    resources = new ConcurrentHashMap<>();
    tempResources = new ConcurrentHashMap<>();
    uid = UUID.randomUUID().toString();
    Engine engine = getEngine().getAlternativeEngine();
    if (engine != null) {
        alternativeManager = engine.newBaseManager(Device.cpu());
    }
}
```

Won't this force the alternative manager PtEngine to use the CPU, and thus post-processing will be slow? Shouldn't this code just be
You are right, this is an issue. The problem is that the alternative engine may not support GPU; I think it should be:
|
@frankfliu - In the ideal case, where the alternative engine does support GPUs, wouldn't we want it to use the same GPU, so any downstream operations will avoid any potential copies? Is there a way we can catch an exception to handle those engines which don't support GPUs and then just use the version without arguments? In my case, I will sometimes run multiple processes, each dedicated to a specific GPU, and I'd ideally want the operations for each process to be pinned to the right GPU.
I'll create a PR with the appropriate exception handling so you can look at it. Indeed without it some tests will fail, although this is more likely a setup issue on my box.
|
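The fallback I have in mind is roughly this, replacing the alternativeManager assignment in the constructor above (a sketch only - not necessarily exactly what the PR ends up doing):

```java
// Sketch: prefer the same device as this manager; fall back to CPU when the
// alternative engine cannot create a manager on that device.
Engine engine = getEngine().getAlternativeEngine();
if (engine != null) {
    try {
        alternativeManager = engine.newBaseManager(device);
    } catch (RuntimeException e) {
        // e.g. the alternative engine was built without GPU support
        alternativeManager = engine.newBaseManager(Device.cpu());
    }
}
```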
My change now makes OnnxRuntime run in almost the same time as PyTorch. I've created a PR here: #3138.
Description
Trying to perform predictions using intfloat/multilingual-e5-small fails on a machine with a GPU. This used to work in DJL 0.26.0 using PY_TORCH 2.0.1 but now fails on 0.28.0 (and presumably 0.27.0).
Expected Behavior
It performs a prediction without error.
Error Message
This seems similar to what is reported here: #2962, but according to https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-engine/README.md, neither DJL 0.27.0 nor 0.28.0 supports PY_TORCH 2.0.1 any longer. For fun I tried it anyway, and it does indeed fail:
@frankfliu - many thanks for all your recent fixes, but it is not clear what can be done in this situation other than PyTorch fixing pytorch/pytorch#107503. Or is there a workaround? Many thanks in advance.