<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qq07ZvtaLGVl94kN0SkWK6B4HkTN15jk?usp=sharing)

# 🚀 vLLM: Fast LLM Inference ⚡

vLLM is a fast 🏎️ and easy-to-use library for LLM (Large Language Model) inference and serving. 🧠💨 It is designed to run large models with higher throughput and less wasted GPU memory. ✨

## 🚀 Key Features

✅ **Lightning Fast:** ⚡️ Optimized for high-throughput inference.  
✅ **Memory Efficient:** 🧠💪 Manages the attention key-value cache with PagedAttention to reduce wasted GPU memory.  
✅ **Easy to Use:** 🛠️ Simple Python API for generating text.  
✅ **Batch Processing:** 📦 Processes multiple prompts together in a single call.  
✅ **Streaming Support:** 🌊 Can return tokens as they are generated.  
✅ **Open Source:** 💻 Released under the Apache 2.0 license and free to use.


### **🚀 Installation & Setup**  








In [None]:
!pip install vllm
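
After the install finishes, it can help to confirm which version ended up in the environment, since vLLM's API changes between releases. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("vllm")
print(f"vllm version: {version}" if version else "vllm is not installed")
```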

### **🧠🚀 Importing `LLM` and `SamplingParams`**  








In [None]:
from vllm import LLM, SamplingParams

### **🤖🚀 Loading vLLM Model (OPT-125M)**


In [None]:
llm = LLM(model="facebook/opt-125m")

### **🎯🔥 Setting Sampling Parameters**








In [None]:
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
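
To build intuition for these settings: `temperature` rescales the model's logits before sampling, `top_p` keeps only the most likely tokens whose cumulative probability reaches the threshold, and `max_tokens` caps how many tokens are generated per prompt. The toy function below is a rough pure-Python sketch of temperature plus top-p (nucleus) sampling over a hypothetical 4-token vocabulary. It is not vLLM's implementation, and `sample_top_p` and `toy_logits` are names invented for illustration.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Pick one token index using temperature scaling plus top-p filtering."""
    # Temperature: divide logits before the softmax; <1 sharpens, >1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the kept tokens; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.1, -3.0]  # hypothetical 4-token vocabulary
print(sample_top_p(toy_logits))
```

With these toy logits the least likely token falls outside the 0.95 nucleus, so it is never sampled. Lowering `top_p` or `temperature` narrows the choice further.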

### **📝🤖 Generating Text with vLLM**








In [None]:
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Processed prompts: 100%|██████████| 3/3 [00:01<00:00,  2.77it/s, est. speed input: 16.65 toks/s, output: 710.25 toks/s]

Prompt: 'Hello, my name is', Generated text: " Joel, my dad is my friend and we are in a relationship. I am from Gaviota, Greece and I am doing a video interview with a Greek TV show and you can watch this video about the relationship between me and my dad. I have lived in Greece for a few years now and I have seen many people have lived in Greece for a long time. I think the relationship between me and my father is very similar. The relationship between me and my father is very deep and deep. I am very happy to be the son of the Greek Prime Minister and I am very happy to live with my dad. My father and I are very close and my father is very happy to live with me. And I am very proud of my son. I am very happy about it.\nI love my family and I will always be happy. I always loved my family. I never wanted to have a daughter. I am very happy to be the father of a daughter and I am very happy for my life. I want to live with my family. I want my life to be a good one. That's all I can d




### **🧠📜 Creating a Large Batch of Prompts**








In [None]:
prompts = [
    "What is the meaning of life?",
    "Write a short story about a cat.",
    "Translate 'hello' to Spanish.",
    "What is the capital of Japan?",
    "Explain the theory of relativity.",
    "Write a poem about the ocean.",
    "What is the highest mountain in the world?",
    "Write a Python function to calculate the factorial of a number.",
] * 10
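
The `* 10` here is Python list repetition, so the 8 prompts become a batch of 80 (the same count the progress bar reports when the batch is generated). A small sketch with a shortened, illustrative prompt list:

```python
base_prompts = [
    "What is the meaning of life?",
    "What is the capital of Japan?",
]
repeats = 10

# List repetition concatenates copies of the list; it does not alter the strings.
batch = base_prompts * repeats

print(len(batch))             # 20
print(batch[0] == batch[2])   # True: the pattern repeats every len(base_prompts) items
```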

### **🚀📝 Generating and Displaying LLM Outputs**








In [None]:
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    print("Generated")

Processed prompts: 100%|██████████| 80/80 [00:02<00:00, 37.47it/s, est. speed input: 346.85 toks/s, output: 3708.88 toks/s]

Prompt: 'What is the meaning of life?', Generated text: '\n\nLife is a line in the vernacular. It means that it was chosen by a Creator. All of us have a purpose. That purpose is to be born. We must be able to live from the inside. We must be able to live from inside.\n\nLife is what God created us to do.\n\nEverything that we do, it’s God created us to do. There’s nothing else in this world.\n\nThere’s nothing to do.\n\nNothing to be thankful for.\n\nEverything we do, it’s God created us to do. There’s nothing to be grateful for.\n\nEverything we do, it’s God created us to do. There’s nothing to be thankful for.\n\nEverything we do, it’s God created us to do. There’s nothing to be thankful for.\n\nEverything we do, it’s God created us to do. There’s nothing to be thankful for.\n\nEverything we do, it’s God created us to do. There’s nothing to be thankful for.\n\nEverything we do, it’s God created us to do'
Generated
Prompt: 'Write a short story about a cat.', Generated text: '\nthats 
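
The progress bars in this notebook show roughly 2.77 it/s for the 3-prompt batch and 37.47 it/s for the 80-prompt batch, which suggests that larger batches use the GPU more efficiently. To measure this yourself, a small timing wrapper is enough. The sketch below assumes each result exposes `outputs[0].token_ids`, as vLLM request outputs do; `measure_throughput` and `fake_generate` are hypothetical names, and the fake generator only exists so the example runs without a GPU.

```python
import time
from types import SimpleNamespace


def measure_throughput(generate_fn, prompts):
    """Time one batched call and return (outputs, prompts/s, output tokens/s)."""
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    # Count generated tokens from the first completion of each request.
    n_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    return outputs, len(prompts) / elapsed, n_tokens / elapsed


def fake_generate(prompts):
    """Stand-in for a real model: returns 5 dummy tokens per prompt."""
    time.sleep(0.01)
    return [
        SimpleNamespace(outputs=[SimpleNamespace(token_ids=[0] * 5)])
        for _ in prompts
    ]


_, prompts_per_s, tokens_per_s = measure_throughput(fake_generate, ["a", "b", "c"])
print(f"{prompts_per_s:.1f} prompts/s, {tokens_per_s:.1f} tokens/s")
```

With the real model, the call would look like `measure_throughput(lambda p: llm.generate(p, sampling_params), prompts)`.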




### **🔢📊 Generating Embeddings with vLLM**








In [None]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

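model = LLM(
    model="facebook/opt-125m",
    task="embed",        # run the model as an embedding (pooling) model
    enforce_eager=True,  # disable CUDA graphs and run in eager mode
)

# One result per prompt, each holding a single embedding vector.
outputs = model.embed(prompts)

for prompt, output in zip(prompts, outputs):
    embeds = output.outputs.embedding
    # Show only the first 16 values of long vectors to keep the printout short.
    embeds_trimmed = ((str(embeds[:16])[:-1] +
                       ", ...]") if len(embeds) > 16 else embeds)
    print(f"Prompt: {prompt!r} | "
          f"Embeddings: {embeds_trimmed} (size={len(embeds)})")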

INFO 02-13 18:43:32 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_

Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-13 18:43:34 model_runner.py:1115] Loading model weights took 0.2404 GB


Processed prompts: 100%|██████████| 4/4 [00:00<00:00, 172.53it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Prompt: 'Hello, my name is' | Embeddings: [-0.00507354736328125, -0.00450897216796875, 0.01849365234375, -0.0009889602661132812, -0.041412353515625, 0.0150146484375, -0.00574493408203125, -0.08074951171875, -0.01068878173828125, -0.07305908203125, -0.0005578994750976562, 0.0017375946044921875, 0.04986572265625, -0.0155792236328125, 0.02618408203125, 0.022125244140625, ...] (size=768)
Prompt: 'The president of the United States is' | Embeddings: [-0.01168060302734375, 0.253173828125, 0.0169677734375, 0.052276611328125, 0.0198974609375, 0.0234222412109375, -0.028961181640625, -0.04693603515625, -0.040771484375, -0.004718780517578125, -0.0164031982421875, 0.0290069580078125, -0.0235137939453125, -0.0102386474609375, 0.012908935546875, -0.0138397216796875, ...] (size=768)
Prompt: 'The capital of France is' | Embeddings: [-0.035308837890625, 0.29638671875, 0.023223876953125, -0.0157318115234375, 0.0103607177734375, 0.032806396484375, -0.003387451171875, -0.00922393798828125, -0.081909179687
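
A common next step with embedding vectors like these is to compare them, for example with cosine similarity. The helper below is a minimal pure-Python sketch; `cosine_similarity` is a name chosen for this example, and the 3-dimensional toy vectors stand in for the 768-dimensional embeddings printed above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors: v2 points the same way as v1, v3 is orthogonal to both.
v1 = [1.0, 0.0, 1.0]
v2 = [2.0, 0.0, 2.0]
v3 = [0.0, 1.0, 0.0]

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # 0.0: orthogonal
```

Applied to the notebook's results, this would be called as `cosine_similarity(outputs[0].outputs.embedding, outputs[2].outputs.embedding)`.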




### **🧐📊 Text Classification with vLLM**








In [None]:
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

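model = LLM(
    model="facebook/opt-125m",
    task="classify",     # run the model with a classification head
    enforce_eager=True,  # disable CUDA graphs and run in eager mode
)

# One result per prompt, each holding a list of class probabilities.
outputs = model.classify(prompts)

for prompt, output in zip(prompts, outputs):
    probs = output.outputs.probs
    # Show only the first 16 values of long lists to keep the printout short.
    probs_trimmed = ((str(probs[:16])[:-1] +
                      ", ...]") if len(probs) > 16 else probs)
    print(f"Prompt: {prompt!r} | "
          f"Class Probabilities: {probs_trimmed} (size={len(probs)})")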

INFO 02-13 18:44:12 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_

Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-13 18:44:14 model_runner.py:1115] Loading model weights took 0.2398 GB


Processed prompts: 100%|██████████| 4/4 [00:00<00:00, 147.78it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Prompt: 'Hello, my name is' | Class Probabilities: [1.0, 0.0] (size=2)
Prompt: 'The president of the United States is' | Class Probabilities: [1.0, 1.1920928955078125e-07] (size=2)
Prompt: 'The capital of France is' | Class Probabilities: [0.9990234375, 0.0010175704956054688] (size=2)
Prompt: 'The future of AI is' | Class Probabilities: [1.0, 0.00011682510375976562] (size=2)
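
To turn a probability list into a prediction, take the index with the highest value. The helper below is a small sketch (`predict_class` is a hypothetical name). Note that `facebook/opt-125m` is a base language model and presumably has not been fine-tuned for any labeling task, so the class indices here carry no meaningful labels; the example only shows how to read the output format.

```python
def predict_class(probs):
    """Return (index, probability) of the most likely class."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs[best]


# Probabilities copied from the 'The capital of France is' output above.
print(predict_class([0.9990234375, 0.0010175704956054688]))  # (0, 0.9990234375)
```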



