Description
I want to use DeepSpeed for inference, but I am not able to load the model correctly with DeepSpeed. As I understand the theory, DeepSpeed should keep the model weights on the CPU or on NVMe. But whenever I run the script attached below, all of the model weights are first loaded onto the CPU, then transferred straight to the GPU, and the run fails with CUDA out of memory. This is the command I use to run the code below:
Run: deepspeed --num_gpus 1 deepspeed_test.py
System info:
I am using a GPU with 24 GB of memory and a CPU with 100 GB of RAM.
Model: llama-13b
This is the code I am using:
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

# Loads the full checkpoint onto the CPU first
generator = pipeline('text-generation', model='model_path/llama-13b/')

# Wraps the model for inference; this is where the weights get moved onto the GPU
generator.model = deepspeed.init_inference(generator.model,
                                           mp_size=world_size,
                                           dtype=torch.half,
                                           replace_method='auto',
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
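For reference, this is the kind of setup I expected to need, based on my reading of the ZeRO-Inference docs: ZeRO stage 3 with offload_param keeping the weights on the CPU instead of init_inference. I have not verified this works for llama-13b, and the config values are my own guesses:

import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig
# (on older transformers versions the import is:
#  from transformers.deepspeed import HfDeepSpeedConfig)

local_rank = int(os.getenv('LOCAL_RANK', '0'))

# ZeRO stage 3 with parameter offload to CPU; the exact values
# (pin_memory, micro batch size) are my guesses
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
}

# This object must be created (and kept alive) before from_pretrained
# so the weights are sharded/offloaded as they load, instead of being
# materialized in full on one device
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained('model_path/llama-13b/')
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

tokenizer = AutoTokenizer.from_pretrained('model_path/llama-13b/')
inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(f"cuda:{local_rank}")
with torch.no_grad():
    outputs = engine.module.generate(**inputs, do_sample=True, min_length=50)
print(tokenizer.decode(outputs[0]))

If this is the right direction, I would still like to understand why the init_inference path above ends up putting everything on the GPU.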
Please help me out with this, as you know these internals well.