
Memory Decreases! But Latency Increases.... #6

@mitchellgordon95

Description


Things seem to be working as intended! I went from using GPT-J-6B with

model = AutoModelForCausalLM.from_pretrained("/mnt/models", torch_dtype=torch.float16, low_cpu_mem_usage=True).to(torch.device("cuda", 0))

to

model = AutoModelForCausalLM.from_pretrained("/mnt/models", device_map="auto", load_in_8bit=True)

nvidia-smi reports a decrease in GPU memory consumption from ~15 GB to ~9 GB. Very nice!
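As a back-of-the-envelope sanity check of those numbers (an illustrative sketch, not a profiler): GPT-J-6B has roughly 6.05e9 parameters, so the weights alone should shrink by about half when going from 2 bytes per weight (fp16) to 1 byte (int8). The remainder of what nvidia-smi shows would be activations, the KV cache, and the CUDA context.

```python
# Rough weight-memory estimate for GPT-J-6B. These are approximations,
# not measurements; ~6.05e9 is the commonly cited parameter count.
params = 6.05e9

fp16_gb = params * 2 / 1024**3  # float16: 2 bytes per weight
int8_gb = params * 1 / 1024**3  # int8:    1 byte per weight

print(f"fp16 weights: ~{fp16_gb:.1f} GiB")  # ~11.3 GiB
print(f"int8 weights: ~{int8_gb:.1f} GiB")  # ~5.6 GiB
```

That lines up reasonably well with the observed ~15 GB → ~9 GB drop once non-weight overhead is added on top.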

However, I don't think we can use this in production, because the latency of generating 45 output tokens increases from ~3.5 s to ~12 s. I'm calling something like:

output_ids = self.model.generate(
    input_ids.cuda(),
    max_length=45,
    do_sample=True,
    top_p=request.get("top_p", 1.0),
    top_k=request.get("top_k", 50),
    ...
)
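For what it's worth, here is the kind of harness I'd use to make the latency comparison fair (a minimal sketch; `time_generation` and its parameters are my own names, not part of transformers). The warmup runs matter because the first call pays one-time CUDA and allocation costs, and on GPU you would also call torch.cuda.synchronize() before reading the clock, since kernel launches are asynchronous.

```python
import time

def time_generation(generate_fn, n_warmup=2, n_runs=5):
    """Time a zero-argument callable wrapping model.generate.

    Discards warmup runs, then returns (best, average) wall-clock
    seconds over n_runs timed calls. On CUDA, the callable should
    end with torch.cuda.synchronize() for accurate numbers.
    """
    for _ in range(n_warmup):
        generate_fn()
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn()
        times.append(time.perf_counter() - start)
    return min(times), sum(times) / len(times)
```

Usage would be something like `time_generation(lambda: model.generate(input_ids.cuda(), max_length=45, do_sample=True))`, run once for the fp16 model and once for the 8-bit one.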

Is this increase in latency known / expected? Or is it specific to my system? For reference, my reproducing Dockerfile is:

FROM nvidia/cuda:11.3.0-devel-ubuntu20.04

ARG DEBIAN_FRONTEND=noninteractive

ENV APP_HOME /app
WORKDIR $APP_HOME

# NVIDIA rotated their GPG keys, so we have to remove the old ones to do apt-get update
RUN rm /etc/apt/sources.list.d/cuda.list
RUN rm /etc/apt/sources.list.d/nvidia-ml.list
# Note: curl is needed for the liveness probe
RUN apt-get update && apt-get install --yes build-essential wget git curl vim

# Install miniconda
ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
     /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# Install conda dependencies.
RUN conda install python=3.8
RUN conda install pytorch=1.12.1 cudatoolkit=11.3 -c pytorch

# Install pip deps
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt

# Copy local code to container image
COPY *.py ./

CMD ["python", "model.py"]

with requirements.txt being

kserve==0.9.0
git+https://github.com/huggingface/transformers.git@4a51075a96d2049f368b5f3dd6c0e9f08f599b62
accelerate==0.12.0
bitsandbytes==0.31.8
