Inference Llama2 with High-Level C++

This is an educational project demonstrating how to run inference on a Llama2 model with vanilla C++20.

The only dependency is SentencePiece, the tokenizer used by Llama2. While writing a tokenizer from scratch would help in understanding Llama2 better, implementing the details of SentencePiece felt off target for this project.
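
For reference, the SentencePiece C++ API is used roughly like this (a hedged sketch; the model path is an assumption based on the assets directory, not necessarily what this project uses):

#include <sentencepiece_processor.h>

#include <iostream>
#include <vector>

int main() {
  sentencepiece::SentencePieceProcessor sp;
  // Load the tokenizer model (the exact path here is an assumption).
  if (!sp.Load("assets/tokenizer.model").ok()) return 1;
  std::vector<int> ids;
  sp.Encode("Once upon a time", &ids);  // Text -> Llama2 token IDs.
  for (int id : ids) std::cout << id << ' ';
  return 0;
}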

Inspirations

This project is highly inspired by the llama2.c project and by simple PyTorch Llama2 implementations like hkproj/pytorch-llama and aju22/LLaMA2.

I decided to write a new C++ implementation because, when reading the C implementation, I struggled to understand the code due to its lack of abstractions, while the PyTorch implementations were written to use if and for as little as possible.

Highlights

One difficulty I had in understanding the existing Llama2 implementations was figuring out the dimensions of the tensors being operated on. In this project, tensors are stored in std::array and std::span with a custom Tensor wrapper implementing multi-dimensional operations, so you can read a tensor's dimensions immediately from its type, like Tensor<float, kEmbeddingSize, kHeadsSize, kHeadDimension>.
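
A minimal sketch of what such a wrapper could look like (the dimension names match the project's style, but the body here is an illustrative assumption, not the project's actual code):

#include <array>
#include <cstddef>

// Sketch: a tensor whose dimensions are part of its type.
template <typename T, std::size_t... Dims>
struct Tensor {
  static constexpr std::size_t kSize = (Dims * ...);
  std::array<T, kSize> data;  // Flat storage, no heap allocation.
};

// The shape is visible right in the declaration:
// Tensor<float, kEmbeddingSize, kHeadsSize, kHeadDimension> query;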

There is no weight-loading code: the weights are written to text files and then #included into the source code, which makes it much easier to abstract the model layers with minimal code.
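
For illustration, the pattern could look roughly like this (the file and variable names are assumptions):

// Sketch: the compiler parses the weights as a brace-initialized
// global, so no file I/O or parsing code is needed at runtime.
constexpr Tensor<float, kVocabSize, kEmbeddingSize> embedding = {{
#include "weights/embedding.inc"  // e.g. "0.0123f, -0.4567f, ..."
}};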

There are almost no heap allocations in the code (except in a few places where std containers allocate implicitly): weights are defined as globals, and temporary tensors are allocated on the stack.
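
In code, that split looks roughly like this (a hedged sketch with illustrative names):

// Weights live in static storage for the whole run...
Tensor<float, kEmbeddingSize, kEmbeddingSize> attention_weights;

void Forward() {
  // ...while activations are plain locals on the stack,
  // released automatically when the function returns.
  Tensor<float, kEmbeddingSize> hidden{};
  // ... compute with attention_weights and hidden ...
}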

These decisions come with the downside that the code only works with tiny models; larger ones will result in stack overflows. But I think they serve code readability very well.

How to use

# Check out the code.
git clone --recursive https://github.com/frost-beta/llama2-high-level-cpp.git
cd llama2-high-level-cpp

# Download dependencies.
./scripts/bootstrap.py

# Build.
./scripts/build.py

# Run the model.
./out/Release/frost_run

# You can also run the original llama2.c code for comparison.
# (Note that it does not work under Windows.)
./out/Release/original_llama2_run stories15M.bin

Files

  • src - The main code; start with the inference.cc file.
  • BUILD.gn - Build rules.
  • assets - Stores the tokenizer weights.
  • scripts - Scripts for building the project.
  • secondary - Build rules for SentencePiece.
  • third_party - Dependencies.