PerroPastor

With all of these Latin American wild animals running around (Llamas, Alpacas, Guanacos, Vicuñas), we need a good Perro Pastor ("sheep dog") to get them running! Perro Pastor is a Unity package, written with just a few files of C# and compute shaders, that runs Llama based models on the GPU on any Unity compatible platform. I built this package primarily as an exercise to understand LLM inference at a deeper level, but I think it has the potential to be useful to game developers who want to run low latency small LLMs in their games without relying on server infrastructure.
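To give a taste of that architecture, here is a minimal, self-contained sketch of the pattern: C# allocates ComputeBuffers, binds them to an HLSL kernel, and dispatches it. The shader, kernel, and buffer names below are illustrative only, not the package's actual API; the package's RunTransformer function (mentioned under Getting Started) does the real version of this.

```csharp
using UnityEngine;

// Illustrative sketch: C# owns the buffers and dispatch order,
// an HLSL kernel does the math. Names here are made up.
public class MatVecExample : MonoBehaviour
{
    public ComputeShader shader; // assumed to contain a "MatVec" kernel
    const int Rows = 4096, Cols = 4096;

    void Start()
    {
        var weights = new ComputeBuffer(Rows * Cols, sizeof(float));
        var input   = new ComputeBuffer(Cols, sizeof(float));
        var output  = new ComputeBuffer(Rows, sizeof(float));
        // ... weights.SetData(...) / input.SetData(...) elided ...

        int kernel = shader.FindKernel("MatVec");
        shader.SetBuffer(kernel, "weights", weights);
        shader.SetBuffer(kernel, "input", input);
        shader.SetBuffer(kernel, "output", output);
        shader.SetInt("cols", Cols);

        // One thread per output row, 64 threads per group; this must
        // match the [numthreads(64,1,1)] declared in the HLSL kernel.
        shader.Dispatch(kernel, Rows / 64, 1, 1);

        var result = new float[Rows];
        output.GetData(result); // blocking readback forces the GPU work to finish

        weights.Release(); input.Release(); output.Release();
    }
}
```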

I also intend this repo to be a good resource for learning how LLMs work. Although compute shaders aren't the most readable form of source code, I have included lots of detailed comments in the C# code, which should help demystify what is going on.

Getting Started

  • Clone the repo.
  • Make a new Unity project.
  • Package Manager -> Add Package From Disk -> Choose "package.json" from the repo folder.
  • A "Perro" menu should appear in Unity; choose "Download Sample Model" to download the model and tokenizer files to StreamingAssets/Models.
  • Find PerroPastor in your Packages list in the Unity Project window and copy the sample scene into your project.
  • Run the scene!
  • Try entering different prompts in the Llama.Prompt field (they should be the first few words of a story).
  • Check out the RunTransformer function for lots of comments explaining in detail what is going on.
  • To run other models, you should be able to load any ggml compatible model in fp32, fp16, or q8_0 format. Q8_0 is definitely the preference as far as performance goes (I get 3-4 tokens per second on my XPS 15 with an RTX 3050 and 4 GB of VRAM); the sketch below shows what a q8_0 block stores.
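If you are curious why q8_0 is compact and fast, here is a sketch of the block layout as the ggml format defines it. This is not the package's actual loader code, and the class and names are mine: every 32 weights are stored as one fp16 scale followed by 32 signed bytes, so each weight costs about 8.5 bits.

```csharp
using UnityEngine;

// Sketch of ggml's q8_0 layout (from the ggml format spec, not this
// package's loader): 34 bytes encode 32 weights as
//   weight[i] = scale * q[i]
// where scale is an fp16 and q[i] is a signed byte.
static class Q80
{
    public const int BlockSize  = 32;             // weights per block
    public const int BlockBytes = 2 + BlockSize;  // fp16 scale + 32 x int8

    // Dequantize the block starting at src[offset] into dst[dstOffset..].
    public static void DequantizeBlock(byte[] src, int offset, float[] dst, int dstOffset)
    {
        // First two bytes are the little-endian fp16 scale for the block.
        ushort halfBits = (ushort)(src[offset] | (src[offset + 1] << 8));
        float scale = Mathf.HalfToFloat(halfBits);
        for (int i = 0; i < BlockSize; i++)
            dst[dstOffset + i] = scale * (sbyte)src[offset + 2 + i];
    }
}
```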

Thank you Andrej Karpathy!!!

This project was inspired by Andrej Karpathy's excellent llama2.c project, and owes a huge debt to Andrej and all the work he has done over the years to support LLM literacy. I find it so much easier to learn how something REALLY works when it is written in a simple language with no dependencies, rather than trying to peer through all the abstractions of higher level libraries like PyTorch. You will see that a lot of the naming and overall code flow still owes a great deal to llama2.c, although things have already started to diverge quite a bit, and will diverge a lot more in the next few weeks as I make things more conducive to real-time game execution.

If you are new to LLMs, I HIGHLY recommend his Neural Networks: Zero to Hero series on YouTube, and especially the "Let's build GPT: from scratch" video. If you already know C and aren't very familiar with compute shaders, you may find it easier to read through his llama2.c project first.

TODO

  • Support a few more ggml quantization formats (right now just fp16 and q8_0).
  • LOTS of easy optimizations in the existing kernels.
  • Add top-k filtering (see the sketch after this list).
  • Fuse compute kernels to reduce draw calls and intermediate memory usage.
  • Add support for Sentis when they support 8 bit quantization.
  • Implement asynchronous model loading to reduce memory usage during startup.
  • Add support for pipelined execution of multiple instances of the same model.
  • Add support for LoRA weights at inference time to enable multiple variations on one base model at runtime.
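For reference, here is a minimal sketch of what the top-k filtering item amounts to, written against plain arrays rather than this package's sampling code: keep only the k highest logits, softmax over just those, then sample from the renormalized distribution.

```csharp
using System;
using System.Linq;

// Minimal top-k sampling sketch (not the package's sampler).
static class Sampler
{
    public static int SampleTopK(float[] logits, int k, Random rng)
    {
        // Indices of the k largest logits.
        int[] top = Enumerable.Range(0, logits.Length)
                              .OrderByDescending(i => logits[i])
                              .Take(k)
                              .ToArray();

        // Softmax over the survivors; subtract the max for numerical stability.
        double max = logits[top[0]];
        double[] probs = top.Select(i => Math.Exp(logits[i] - max)).ToArray();
        double sum = probs.Sum();

        // Inverse-CDF sampling over the renormalized distribution.
        double r = rng.NextDouble() * sum, acc = 0;
        for (int j = 0; j < top.Length; j++)
        {
            acc += probs[j];
            if (r <= acc) return top[j];
        }
        return top[top.Length - 1]; // guard against floating point round-off
    }
}
```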
