Important: `bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
`IPEX-LLM` is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., a local PC with iGPU, or a discrete GPU such as Arc, Flex and Max) with very low latency[^1].
Note
- It is built on top of Intel Extension for PyTorch (`IPEX`), as well as the excellent work of `llama.cpp`, `bitsandbytes`, `vLLM`, `qlora`, `AutoGPTQ`, `AutoAWQ`, etc.
- It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
- 50+ models have been optimized/verified on `ipex-llm` (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.
See the demos below of running Text-Generation-WebUI, local RAG using LangChain-Chatchat, llama.cpp and Ollama with `ipex-llm` (on either an Intel Core Ultra laptop or an Arc GPU).
Platform | Demo | Video |
---|---|---|
Intel Core Ultra Laptop | Text-Generation-WebUI | webui.mp4 |
Intel Core Ultra Laptop | Local RAG using LangChain-Chatchat | rag.mp4 |
Intel Arc GPU | llama.cpp | llama-cpp.mp4 |
Intel Arc GPU | Ollama | ollama.mp4 |
- [2024/04] You can now run Llama 3 on Intel GPU using `llama.cpp` and `ollama`; see the quickstart here.
- [2024/04] `ipex-llm` now supports Llama 3 on both Intel GPU and CPU.
- [2024/04] `ipex-llm` now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- [2024/03] `bigdl-llm` has now become `ipex-llm` (see the migration guide here); you may find the original `BigDL` project here.
- [2024/02] `ipex-llm` now supports directly loading models from ModelScope (魔搭).
- [2024/02] `ipex-llm` added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB VRAM.
- [2024/02] Users can now use `ipex-llm` through the Text-Generation-WebUI GUI.
- [2024/02] `ipex-llm` now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
- [2024/02] `ipex-llm` now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- [2024/01] Using `ipex-llm` QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
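
The INT2 entry in this list says Mixtral-8x7B can run on an Intel GPU with 16GB VRAM. A rough back-of-envelope check of that claim is sketched below. It counts weight storage only and uses idealized bit widths; the parameter count is an assumption taken from the publicly reported figure for Mixtral-8x7B, and the helper name `weight_memory_gib` is hypothetical. Real low-bit formats also store scales, and inference needs extra memory for activations and the KV cache, so treat the numbers as lower bounds.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB (no scales, no KV cache)."""
    return num_params * bits_per_weight / 8 / 2**30


# Assumption: total parameter count publicly reported for Mixtral-8x7B.
MIXTRAL_8X7B_PARAMS = 46.7e9
VRAM_GIB = 16

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    size = weight_memory_gib(MIXTRAL_8X7B_PARAMS, bits)
    verdict = "fits" if size < VRAM_GIB else "does not fit"
    print(f"{label}: ~{size:.1f} GiB of weights, {verdict} in {VRAM_GIB} GiB")
```

Under these assumptions only the 2-bit row falls below 16 GiB, which is consistent with INT2 being what makes this model size feasible on such a GPU.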
More updates
- [2023/12] `ipex-llm` now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- [2023/12] `ipex-llm` now supports Mixtral-8x7B on both Intel GPU and CPU.
- [2023/12] `ipex-llm` now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- [2023/12] `ipex-llm` now supports FP8 and FP4 inference on Intel GPU.
- [2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into `ipex-llm` is available.
- [2023/11] `ipex-llm` now supports vLLM continuous batching on both Intel GPU and CPU.
- [2023/10] `ipex-llm` now supports QLoRA finetuning on both Intel GPU and CPU.
- [2023/10] `ipex-llm` now supports FastChat serving on both Intel CPU and GPU.
- [2023/09] `ipex-llm` now supports Intel GPU (including iGPU, Arc, Flex and Max).
- [2023/09] The `ipex-llm` tutorial is released.
- Windows GPU: installing `ipex-llm` on Windows with Intel GPU
- Linux GPU: installing `ipex-llm` on Linux with Intel GPU
- Docker: using `ipex-llm` Docker images on Intel CPU and GPU
- For more details, please refer to the installation guide

- llama.cpp: running llama.cpp (using the C++ interface of `ipex-llm` as an accelerated backend for `llama.cpp`) on Intel GPU
- ollama: running ollama (using the C++ interface of `ipex-llm` as an accelerated backend for `ollama`) on Intel GPU
- vLLM: running `ipex-llm` in `vLLM` on both Intel GPU and CPU
- FastChat: running `ipex-llm` in `FastChat` serving on both Intel GPU and CPU
- LangChain-Chatchat RAG: running `ipex-llm` in `LangChain-Chatchat` (Knowledge Base QA using RAG pipeline)
- Text-Generation-WebUI: running `ipex-llm` in `oobabooga` WebUI
- Benchmarking: running (latency and throughput) benchmarks for `ipex-llm` on Intel CPU and GPU
- Low bit inference
  - INT4 inference: INT4 LLM inference on Intel GPU and CPU
  - FP8/FP4 inference: FP8 and FP4 LLM inference on Intel GPU
  - INT8 inference: INT8 LLM inference on Intel GPU and CPU
  - INT2 inference: INT2 LLM inference (based on llama.cpp IQ2 mechanism) on Intel GPU
- FP16/BF16 inference
  - FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
  - BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
- Save and load
  - Low-bit models: saving and loading `ipex-llm` low-bit models
  - GGUF: directly loading GGUF models into `ipex-llm`
  - AWQ: directly loading AWQ models into `ipex-llm`
  - GPTQ: directly loading GPTQ models into `ipex-llm`
- Finetuning
- Integration with community libraries
- Tutorials
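
To give a feel for what the "Low bit inference" and "Save and load" examples listed above look like in practice, here is a minimal sketch. It assumes the `ipex_llm.transformers` drop-in `AutoModelForCausalLM` class, its `load_in_low_bit` argument, and the `save_low_bit` / `load_low_bit` methods as described in the library's documentation at the time of writing. The helper names and the `LOW_BIT_OPTIONS` mapping are hypothetical, so please check the linked examples for the authoritative usage and the full list of supported precisions.

```python
# Assumed mapping from the precision names used in this README to
# `load_in_low_bit` option strings; verify against the ipex-llm docs.
LOW_BIT_OPTIONS = {
    "INT4": "sym_int4",
    "INT8": "sym_int8",
    "FP4": "fp4",
    "FP8": "fp8",
}


def load_low_bit_model(model_path: str, precision: str = "INT4", device: str = "cpu"):
    """Load a Hugging Face checkpoint with low-bit optimization (hypothetical helper)."""
    # Imported lazily so this file can be imported without ipex-llm installed.
    from ipex_llm.transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        load_in_low_bit=LOW_BIT_OPTIONS[precision],
        trust_remote_code=True,
    )
    # "xpu" is the PyTorch device name used for Intel GPUs.
    return model if device == "cpu" else model.to(device)


def save_then_reload(model, save_dir: str):
    """Save an already-quantized model and load it back (hypothetical helper)."""
    from ipex_llm.transformers import AutoModelForCausalLM

    model.save_low_bit(save_dir)
    return AutoModelForCausalLM.load_low_bit(save_dir)
```

A call such as `load_low_bit_model("path/to/model", precision="INT4", device="xpu")` would then target an Intel GPU; the model path here is a placeholder.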
For more details, please refer to the `ipex-llm` documentation website.
Over 50 models have been optimized/verified on `ipex-llm`, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
Model | CPU Example | GPU Example |
---|---|---|
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) | link1, link2 | link |
LLaMA 2 | link1, link2 | link |
LLaMA 3 | link | link |
ChatGLM | link | |
ChatGLM2 | link | link |
ChatGLM3 | link | link |
Mistral | link | link |
Mixtral | link | link |
Falcon | link | link |
MPT | link | link |
Dolly-v1 | link | link |
Dolly-v2 | link | link |
Replit Code | link | link |
RedPajama | link1, link2 | |
Phoenix | link1, link2 | |
StarCoder | link1, link2 | link |
Baichuan | link | link |
Baichuan2 | link | link |
InternLM | link | link |
Qwen | link | link |
Qwen1.5 | link | link |
Qwen-VL | link | link |
Aquila | link | link |
Aquila2 | link | link |
MOSS | link | |
Whisper | link | link |
Phi-1_5 | link | link |
Flan-t5 | link | link |
LLaVA | link | link |
CodeLlama | link | link |
Skywork | link | |
InternLM-XComposer | link | |
WizardCoder-Python | link | |
CodeShell | link | |
Fuyu | link | |
Distil-Whisper | link | link |
Yi | link | link |
BlueLM | link | link |
Mamba | link | link |
SOLAR | link | link |
Phixtral | link | link |
InternLM2 | link | link |
RWKV4 | link | |
RWKV5 | link | |
Bark | link | link |
SpeechT5 | link | |
DeepSeek-MoE | link | |
Ziya-Coding-34B-v1.0 | link | |
Phi-2 | link | link |
Phi-3 | link | link |
Yuan2 | link | link |
Gemma | link | link |
DeciLM-7B | link | link |
Deepseek | link | link |
StableLM | link | link |
CodeGemma | link | link |
Command-R/cohere | link | link |
- Please report a bug or raise a feature request by opening a GitHub Issue
- Please report a vulnerability by opening a draft GitHub Security Advisory
Footnotes
[^1]: Performance varies by use, configuration and other factors. `ipex-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.