<a href="https://colab.research.google.com/github/dlimeng/llmdemo/blob/main/Local_DeepSeekQwen15b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Run LLM locally with vLLM on Colab GPUs

Meng Li @Google AI

[vLLM](https://github.com/vllm-project/vllm) is a fast and user-friendly library for LLM inference and serving. vLLM optimizes LLM inference with mechanisms like PagedAttention for memory management and continuous batching for increasing throughput. For popular models, vLLM has been shown to increase throughput by a multiple of 2 to 4.

This notebook demonstrates how to run machine learning inference by using vLLM and GPUs.

## Requirements

This notebook assumes that a GPU is enabled in Colab. If this setting isn't enabled, the locally executed sections of this notebook might not work. To enable a GPU, in the Colab menu, click **Runtime** > **Change runtime type**. For **Hardware accelerator**, choose a GPU accelerator.
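
One way to confirm that the runtime has a GPU before going further is to call `nvidia-smi`, the NVIDIA command-line tool. The following is a minimal sketch that uses only the Python standard library. The helper name `gpu_available` is hypothetical, and the sketch assumes that a missing or failing `nvidia-smi` means no usable GPU.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi -L` runs successfully and lists a GPU.

    Assumption: a runtime without the nvidia-smi tool has no usable GPU.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `-L` prints one line per GPU, for example "GPU 0: <name> (UUID: ...)".
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU detected" if gpu_available() else "No GPU detected: check the runtime type")
```

If the check reports no GPU, change the runtime type as described above and run the cell again.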

## Install dependencies

Before starting the vLLM server, download and install the dependencies required to run vLLM.

In [None]:
# Quote the version specifiers so the shell does not treat ">" as output redirection.
!pip install "openai>=1.52.2"
!pip install "vllm>=0.6.3"
!pip install "triton>=3.1.0"
!pip install nest_asyncio  # only needed in Colab
!pip check

ipython 7.34.0 requires jedi, which is not installed.
pygobject 3.42.1 requires pycairo, which is not installed.
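
The `pip check` messages above concern `ipython` and `pygobject`, not the packages installed in this step. To see which versions of the packages from this step ended up in the environment, something like the following sketch can help. It relies on the standard `importlib.metadata` module; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI.
for name, version in installed_versions(
    ["openai", "vllm", "triton", "nest-asyncio"]
).items():
    print(f"{name}: {version or 'not installed'}")
```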


## Colab only: allow nested asyncio

Client code that uses asyncio to send requests to vLLM only works if it is not already running inside an asyncio event loop. Most of the time this is fine, but Colab already runs an event loop. To work around this, apply `nest_asyncio`, which allows event loops to be nested. Do not include this step outside of Colab.

In [None]:
# This should not be necessary outside of Colab.
import nest_asyncio
nest_asyncio.apply()
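
If the same code has to run both in Colab and elsewhere, one option is to apply the patch only when an event loop is already running. The following is a sketch under that assumption, not part of the original notebook. The helper names `loop_is_running` and `apply_nest_asyncio_if_needed` are hypothetical, and the sketch assumes that code in a Colab cell executes inside the kernel's running event loop.

```python
import asyncio


def loop_is_running() -> bool:
    """Return True if this code is executing inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


def apply_nest_asyncio_if_needed() -> bool:
    """Patch asyncio with nest_asyncio only when a loop is already running.

    Returns True if the patch was applied, False if it was not needed.
    """
    if not loop_is_running():
        return False
    import nest_asyncio  # imported lazily; only required in the nested case

    nest_asyncio.apply()
    return True


print("nest_asyncio applied:", apply_nest_asyncio_if_needed())
```

In a plain Python script no loop is running, so the function returns `False` and `nest_asyncio` is never imported.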


## Run locally with vLLM

In this section, you run a vLLM server. Use the `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` model. This model is small enough to fit in Colab memory and doesn't require any extra authentication.

First, start the vLLM server. This step might take a minute or two, because the model needs to download before vLLM starts running inference.

In [None]:
! python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B  --dtype half

2025-01-24 10:57:31.321653: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-01-24 10:57:31.354484: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-01-24 10:57:31.364483: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-01-24 10:57:31.386886: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO 01-24 10:57:36 api_server.py:712] vLLM API serve
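
Because the server needs time to download and load the model, a client that connects too early gets a connection error. A minimal sketch of a readiness check is shown here. It is meant to run from a second Python process, because the notebook cell stays busy while the server runs. It assumes the server listens on `http://localhost:8000` and exposes the `/health` route of the vLLM OpenAI-compatible server. The helper names `server_is_up` and `wait_for_server` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_server(url, timeout_s=300.0, interval_s=2.0, probe=server_is_up):
    """Poll `probe(url)` until it succeeds or `timeout_s` seconds have passed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example call, assuming the default address:
# wait_for_server("http://localhost:8000/health")
```

The `probe` parameter exists only so that the polling loop can be exercised without a live server.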

Next, while the vLLM server is running, open a separate terminal to communicate with the vLLM serving process. To open a terminal in Colab, in the sidebar, click **Terminal**. In the terminal, run the following command.

```
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        "prompt": "The meaning of life is to "
    }'
```

This command sends a completion request to the vLLM server that is running in the notebook cell. You can experiment with different prompts.
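
The same request can be sent from Python. The sketch below mirrors the `curl` command by using only the standard library, so it has no extra dependencies. The helper names `build_completion_request` and `complete` are hypothetical, and the `max_tokens` value is an illustrative assumption rather than a setting from this notebook.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"


def build_completion_request(prompt, model=MODEL, max_tokens=64):
    """Build the POST request for the /v1/completions endpoint."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to the running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        payload = json.load(response)
    # Completion responses carry the generated text in choices[0].text.
    return payload["choices"][0]["text"]


# Example call, to run only while the server is up:
# print(complete("The meaning of life is to "))
```

Like the `curl` command, this code has to run in a separate process, such as a Python session started in the Colab terminal, because the notebook cell is occupied by the server.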