
[Neo][vLLM] Fix quantization failure caused by improperly loaded model. #2360

Merged 1 commit into deepjavalibrary:master on Sep 4, 2024

Conversation

a-ys (Contributor) commented on Sep 4, 2024

Description

Fixes an issue caused by casper-hansen/AutoAWQ#558, which surfaces as the following error:

OptimizationFatalError('Encountered an error during quantization: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)')

Fixed in Neo by always quantizing with device_map=auto, so the model's weights are placed on devices consistently before quantization (sketched below).
Relates to #2271. That PR fixes this issue by passing device_map=auto in serving.properties; for non-Neo quantization, that setting is still required.

@a-ys a-ys requested review from zachgk, frankfliu and a team as code owners September 4, 2024 21:51
@tosterberg tosterberg merged commit 452ff8e into deepjavalibrary:master Sep 4, 2024
9 checks passed
tosterberg pushed a commit to tosterberg/djl-serving that referenced this pull request Sep 4, 2024