
ZeRO CPU offloading is not working #3764

@devkaranjoshi

Description


I want to use DeepSpeed for inference, but I am not able to load the model correctly with it. As I understand the theory, DeepSpeed should keep the model weights on the CPU or NVMe. But whenever I run the script below (attached with this message), all the model weights are first loaded on the CPU and then transferred straight to the GPU, and I run out of CUDA memory. This is the command I am using to run the code:

RUN: deepspeed --num_gpus 1 deepspeed_test.py

System details:
I am using a 24 GB GPU and a CPU with 100 GB of RAM.
Model: llama-13b

This is the code I am using:

import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='model_path/llama-13b/')

generator.model = deepspeed.init_inference(
    generator.model,
    mp_size=world_size,
    dtype=torch.half,
    replace_method='auto',
    replace_with_kernel_inject=True,
)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)
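For context, this is the kind of ZeRO stage-3 offload config I expected to need, based on my reading of the DeepSpeed ZeRO docs. This is only a sketch of the config dictionary; I am not sure whether `init_inference` accepts a ZeRO config at all, or whether offloading requires a different entry point such as `deepspeed.initialize`:

```python
# Sketch of a ZeRO-3 config that offloads parameters to CPU.
# Assumption: this is the shape documented for DeepSpeed config JSON;
# whether init_inference honors it is exactly what I am unsure about.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",   # "nvme" (with an "nvme_path") should also be possible
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}
```

My expectation was that with such a config, the weights would stay on the CPU and be streamed to the GPU as needed, instead of being moved to the GPU wholesale.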

Please help me out with this; any guidance would be appreciated.
