💫 IPEX-LLM

Important

bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.

IPEX-LLM is a PyTorch library for running LLMs on Intel CPUs and GPUs (e.g., a local PC with an iGPU, or a discrete GPU such as Arc, Flex and Max) with very low latency¹.

Note

It is built on top of Intel Extension for PyTorch (IPEX), as well as the excellent work of llama.cpp, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
It provides seamless integration with llama.cpp, ollama, Text-Generation-WebUI, HuggingFace transformers, HuggingFace PEFT, LangChain, LlamaIndex, DeepSpeed-AutoTP, vLLM, FastChat, HuggingFace TRL, AutoGen, ModelScope, etc.
50+ models have been optimized/verified on ipex-llm (including LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM, Baichuan, Qwen, RWKV, and more); see the complete list here.

`ipex-llm` Demo

See the demo of running Text-Generation-WebUI, local RAG using LangChain-Chatchat, llama.cpp and Ollama (on either Intel Core Ultra laptop or Arc GPU) with ipex-llm below.

Demo	Device	Video
Text-Generation-WebUI	Intel Core Ultra Laptop	webui.mp4
Local RAG using LangChain-Chatchat	Intel Core Ultra Laptop	rag.mp4
llama.cpp	Intel Arc GPU	llama-cpp.mp4
Ollama	Intel Arc GPU	ollama.mp4

Latest Update 🔥

[2024/04] You can now run Llama 3 on Intel GPU using llama.cpp and ollama; see the quickstart here.
[2024/04] ipex-llm now supports Llama 3 on both Intel GPU and CPU.
[2024/04] ipex-llm now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
[2024/03] bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
[2024/02] ipex-llm now supports loading models directly from ModelScope (魔搭).
[2024/02] ipex-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB of VRAM.
[2024/02] Users can now use ipex-llm through the Text-Generation-WebUI GUI.
[2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings a ~30% speedup for FP16 inference on Intel GPU and BF16 inference on Intel CPU.
[2024/02] ipex-llm now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
[2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
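
The INT2 entry above is easier to appreciate with a quick back-of-the-envelope calculation. The sketch below estimates the memory needed for model weights alone at different bit widths. The parameter count (about 46.7 billion for Mixtral-8x7B) and the figure of roughly 2.1 bits per weight for IQ2-style formats are assumptions that do not come from this README, and the estimate ignores quantization metadata, the KV cache and activations, so treat the numbers as rough lower bounds.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed figure (not from this README): total parameters of Mixtral-8x7B.
MIXTRAL_8X7B_PARAMS = 46.7e9

# Assumed bit widths; real low-bit formats add a small overhead for scales.
BIT_WIDTHS = [("FP16", 16.0), ("INT4", 4.0), ("INT2 (~2.1 bpw)", 2.1)]

if __name__ == "__main__":
    for label, bits in BIT_WIDTHS:
        size = weight_memory_gib(MIXTRAL_8X7B_PARAMS, bits)
        print(f"{label:>16}: {size:6.1f} GiB")
```

Under these assumptions the 4-bit weights alone exceed 16 GiB, while the roughly 2-bit weights leave some headroom on a 16GB card.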

More updates

[2023/12] ipex-llm now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
[2023/12] ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
[2023/12] ipex-llm now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
[2023/12] ipex-llm now supports FP8 and FP4 inference on Intel GPU.
[2023/11] Initial support for directly loading GGUF, AWQ and GPTQ models into ipex-llm is available.
[2023/11] ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
[2023/10] ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
[2023/10] ipex-llm now supports FastChat serving on both Intel CPU and GPU.
[2023/09] ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and Max).
[2023/09] ipex-llm tutorial is released.

`ipex-llm` Quickstart

Install `ipex-llm`

Windows GPU: installing ipex-llm on Windows with Intel GPU
Linux GPU: installing ipex-llm on Linux with Intel GPU
Docker: using ipex-llm dockers on Intel CPU and GPU
For more details, please refer to the installation guide.
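
After following one of the guides above, a quick sanity check of the Python environment can save some debugging time. The sketch below is a hypothetical helper rather than part of ipex-llm: it only checks whether the `ipex_llm` and `torch` modules can be located and, as an assumption, whether PyTorch reports an `xpu` device (the device name used for Intel GPUs). Depending on the installed PyTorch and IPEX versions, `torch.xpu` may not exist, in which case the check simply reports `False`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def describe_environment() -> dict:
    """Report which pieces of the stack are present in this Python environment."""
    info = {
        "ipex_llm": is_installed("ipex_llm"),
        "torch": is_installed("torch"),
        "xpu_available": False,
    }
    if info["torch"]:
        import torch

        # Assumption: Intel GPUs are exposed as the "xpu" device. On some
        # version combinations `torch.xpu` is missing, so treat its absence
        # (or any failure while probing it) as "no GPU" instead of crashing.
        xpu = getattr(torch, "xpu", None)
        try:
            info["xpu_available"] = bool(xpu is not None and xpu.is_available())
        except Exception:
            info["xpu_available"] = False
    return info


if __name__ == "__main__":
    for name, present in describe_environment().items():
        print(f"{name}: {present}")
```

A `False` for `ipex_llm` usually means the package was installed into a different Python environment than the one running the script.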

Run `ipex-llm`

llama.cpp: running llama.cpp (using C++ interface of ipex-llm as an accelerated backend for llama.cpp) on Intel GPU
ollama: running ollama (using C++ interface of ipex-llm as an accelerated backend for ollama) on Intel GPU
vLLM: running ipex-llm in vLLM on both Intel GPU and CPU
FastChat: running ipex-llm in FastChat serving on both Intel GPU and CPU
LangChain-Chatchat RAG: running ipex-llm in LangChain-Chatchat (Knowledge Base QA using RAG pipeline)
Text-Generation-WebUI: running ipex-llm in oobabooga WebUI
Benchmarking: running (latency and throughput) benchmarks for ipex-llm on Intel CPU and GPU
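
For readers new to the latency and throughput terminology used in the benchmarking entry above, here is a minimal sketch of how such numbers are commonly derived from per-token timings. It is not the ipex-llm benchmark tool; the function name and the returned keys are hypothetical, and the split into first-token and next-token latency is an assumption about what is typically reported for LLM inference.

```python
from statistics import mean


def summarize_latency(token_times_s: list) -> dict:
    """Summarize generation speed.

    token_times_s[i] is the wall-clock time (in seconds) spent producing
    the i-th output token; the first entry includes prompt processing.
    """
    if not token_times_s:
        raise ValueError("need at least one token time")
    first = token_times_s[0]
    rest = token_times_s[1:]
    return {
        # Time until the first token appears (dominated by the prompt).
        "first_token_ms": first * 1000,
        # Average time per token after the first; None if only one token.
        "next_token_ms": mean(rest) * 1000 if rest else None,
        # Overall throughput across the whole generation.
        "tokens_per_s": len(token_times_s) / sum(token_times_s),
    }


if __name__ == "__main__":
    # Made-up timings, purely for illustration.
    print(summarize_latency([0.5, 0.1, 0.1, 0.1]))
```

Separating the first token from the rest matters because prompt processing and incremental decoding stress the hardware differently, so a single average can hide a slow start.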

Code Examples

Low bit inference
- INT4 inference: INT4 LLM inference on Intel GPU and CPU
- FP8/FP4 inference: FP8 and FP4 LLM inference on Intel GPU
- INT8 inference: INT8 LLM inference on Intel GPU and CPU
- INT2 inference: INT2 LLM inference (based on llama.cpp IQ2 mechanism) on Intel GPU
FP16/BF16 inference
- FP16 LLM inference on Intel GPU, with possible self-speculative decoding optimization
- BF16 LLM inference on Intel CPU, with possible self-speculative decoding optimization
Save and load
- Low-bit models: saving and loading ipex-llm low-bit models
- GGUF: directly loading GGUF models into ipex-llm
- AWQ: directly loading AWQ models into ipex-llm
- GPTQ: directly loading GPTQ models into ipex-llm
Finetuning
- LLM finetuning on Intel GPU, including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA
- QLoRA finetuning on Intel CPU
Integration with community libraries
Tutorials
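
As a rough illustration of the low-bit inference and save/load items listed above, the sketch below wraps the transformers-style API of ipex-llm. The helper names (`low_bit_kwargs`, `load_model`, `save_and_reload`, `generate`) and the short precision labels are hypothetical and exist only for this example. The `load_in_low_bit` values, `save_low_bit` and `load_low_bit` are written from memory of the ipex-llm API, so please check them against the linked examples for your installed version. Imports of `ipex_llm`, `transformers` and `torch` are deferred so the file can be read and imported without them.

```python
# Hypothetical short labels mapped to `load_in_low_bit` values.
# Per the list above, FP8/FP4 inference targets Intel GPU.
LOW_BIT_FORMATS = {
    "int4": "sym_int4",
    "int8": "sym_int8",
    "fp8": "fp8",
    "fp4": "fp4",
}


def low_bit_kwargs(precision: str) -> dict:
    """Build the keyword arguments that select a low-bit format."""
    try:
        return {"load_in_low_bit": LOW_BIT_FORMATS[precision]}
    except KeyError:
        raise ValueError(f"unsupported precision: {precision!r}") from None


def load_model(model_path: str, precision: str = "int4", device: str = "cpu"):
    """Load a Hugging Face model and convert it to a low-bit format on load."""
    from ipex_llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(
        model_path, **low_bit_kwargs(precision)
    )
    if device != "cpu":
        # Assumption: Intel GPUs are addressed as the "xpu" device.
        model = model.to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    return model, tokenizer


def save_and_reload(model, save_dir: str):
    """Persist the converted weights, then load them back without reconverting."""
    from ipex_llm.transformers import AutoModelForCausalLM

    model.save_low_bit(save_dir)
    return AutoModelForCausalLM.load_low_bit(save_dir)


def generate(model, tokenizer, prompt: str, device: str = "cpu",
             max_new_tokens: int = 32) -> str:
    """Run a short greedy generation and return the decoded text."""
    import torch

    with torch.inference_mode():
        input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

A typical call would look like `model, tokenizer = load_model("path/to/model", "int4", device="xpu")` followed by `generate(model, tokenizer, "What is AI?", device="xpu")`, where the model path is a placeholder.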

For more details, please refer to the ipex-llm document website.

Verified Models

Over 50 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.

Model	CPU Example	GPU Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.)	link1, link2	link
LLaMA 2	link1, link2	link
LLaMA 3	link	link
ChatGLM	link
ChatGLM2	link	link
ChatGLM3	link	link
Mistral	link	link
Mixtral	link	link
Falcon	link	link
MPT	link	link
Dolly-v1	link	link
Dolly-v2	link	link
Replit Code	link	link
RedPajama	link1, link2
Phoenix	link1, link2
StarCoder	link1, link2	link
Baichuan	link	link
Baichuan2	link	link
InternLM	link	link
Qwen	link	link
Qwen1.5	link	link
Qwen-VL	link	link
Aquila	link	link
Aquila2	link	link
MOSS	link
Whisper	link	link
Phi-1_5	link	link
Flan-t5	link	link
LLaVA	link	link
CodeLlama	link	link
Skywork	link
InternLM-XComposer	link
WizardCoder-Python	link
CodeShell	link
Fuyu	link
Distil-Whisper	link	link
Yi	link	link
BlueLM	link	link
Mamba	link	link
SOLAR	link	link
Phixtral	link	link
InternLM2	link	link
RWKV4		link
RWKV5		link
Bark	link	link
SpeechT5		link
DeepSeek-MoE	link
Ziya-Coding-34B-v1.0	link
Phi-2	link	link
Phi-3	link	link
Yuan2	link	link
Gemma	link	link
DeciLM-7B	link	link
Deepseek	link	link
StableLM	link	link
CodeGemma	link	link
Command-R/cohere	link	link

Get Support

Please report a bug or raise a feature request by opening a GitHub Issue.
Please report a vulnerability by opening a draft GitHub Security Advisory.

¹ Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.
