Things seem to be working as intended! I went from using GPT-J-6B with

```python
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/models",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(torch.device("cuda", 0))
```

to

```python
model = AutoModelForCausalLM.from_pretrained(
    "/mnt/models",
    device_map="auto",
    load_in_8bit=True,
)
```

with nvidia-smi reporting a decrease in GPU memory consumption from ~15 GB to ~9 GB. Very nice!
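As a sanity check, the drop from ~15 GB to ~9 GB is in the ballpark of what the parameter count alone predicts (a back-of-the-envelope sketch, not an exact accounting: `model_weight_gb` is a made-up helper, and nvidia-smi also counts the CUDA context, activations, and the modules that 8-bit loading keeps in higher precision):

```python
def model_weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (ignores activations and the KV cache)."""
    return n_params * bytes_per_param / 1024**3

# GPT-J has ~6e9 parameters: 2 bytes each in float16, 1 byte each in int8.
fp16 = model_weight_gb(6e9, 2)  # ~11.2 GB of weights
int8 = model_weight_gb(6e9, 1)  # ~5.6 GB of weights
```

The ~5.6 GB gap between these estimates roughly matches the ~6 GB difference reported by nvidia-smi.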
However, I don't think we can use this in production, because the latency of text generation increases from ~3.5 s to ~12 s to generate 45 output tokens. I'm using something like:

```python
output_ids = self.model.generate(
    input_ids.cuda(),
    max_length=45,
    do_sample=True,
    top_p=request.get("top_p", 1.0),
    top_k=request.get("top_k", 50),
    ...
)
```
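For reference, the ~3.5 s vs ~12 s numbers come from simple wall-clock timing along these lines (a minimal sketch; `time_generation` is my own helper, not a transformers API, and the warm-up runs are there so one-off CUDA initialization does not skew the average):

```python
import time

def time_generation(generate_fn, n_runs: int = 5, n_warmup: int = 1) -> float:
    """Average wall-clock latency of generate_fn over n_runs, after warm-up.

    generate_fn is any zero-argument callable that performs one generation
    (e.g. lambda: model.generate(input_ids, max_length=45)).
    """
    for _ in range(n_warmup):
        generate_fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn()
    return (time.perf_counter() - start) / n_runs
```

Since `model.generate` returns token ids to Python, it blocks until the GPU work is done, so a plain wall clock should be fine here.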
Is this increase in latency known / expected? Or is it specific to my system? For reference, my reproducing Dockerfile is:
```dockerfile
FROM nvidia/cuda:11.3.0-devel-ubuntu20.04
ARG DEBIAN_FRONTEND=noninteractive

ENV APP_HOME /app
WORKDIR $APP_HOME

# NVIDIA rotated their GPG keys, so we have to remove the old ones to do apt-get update
RUN rm /etc/apt/sources.list.d/cuda.list \
    && rm /etc/apt/sources.list.d/nvidia-ml.list

# Note: we need curl for the liveness probe
RUN apt-get update && apt-get install --yes build-essential wget vim git curl

# Install miniconda
ENV CONDA_DIR /opt/conda
RUN wget --quiet https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh && \
    /bin/bash ~/miniconda.sh -b -p /opt/conda
ENV PATH=$CONDA_DIR/bin:$PATH

# Install conda dependencies
RUN conda install python=3.8
RUN conda install pytorch=1.12.1 cudatoolkit=11.3 -c pytorch

# Install pip deps
COPY requirements.txt ./
RUN pip install --no-cache-dir -r ./requirements.txt

# Copy local code to the container image
COPY *.py ./
CMD ["python", "model.py"]
```
with requirements.txt being:

```
kserve==0.9.0
git+https://github.com/huggingface/transformers.git@4a51075a96d2049f368b5f3dd6c0e9f08f599b62
accelerate==0.12.0
bitsandbytes==0.31.8
```