llama-go

Inference of Facebook's LLaMA model in Golang with embedded C/C++.

Description

This project embeds the work of llama.cpp in a Golang binary. The main goal is to run the model using 4-bit quantization using CPU on Consumer-Grade hardware.

At startup, the model is loaded and a prompt is offered to enter a prompt, after the results have been printed another prompt can be entered. The program can be quit using ctrl+c.

This project was tested on Linux but should be able to get to work on macOS as well.

Requirements

The memory requirements for the models are approximately:

7B  -> 4 GB (1 file)
13B -> 8 GB (2 files)
30B -> 16 GB (4 files)
65B -> 32 GB (8 files)

Installation

# build this repo
git clone https://github.com/cornelk/llama-go
cd llama-go
make

# install Python dependencies
python3 -m pip install torch numpy sentencepiece

Obtain the original LLaMA model weights and place them in ./models - for example by using the https://github.com/shawwn/llama-dl script to download them.

Use the following steps to convert the LLaMA-7B model to a format that is compatible:

ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# convert the 7B model to ggml FP16 format
python3 convert-pth-to-ggml.py models/7B/ 1

# quantize the model to 4-bits
./quantize.sh 7B

When running the larger models, make sure you have enough disk space to store all the intermediate files.

Usage

./llama-go -m ./models/13B/ggml-model-q4_0.bin -t 4 -n 128

Loading model ./models/13B/ggml-model-q4_0.bin...
Model loaded successfully.

>>> Some good pun names for a pet groomer:

Some good pun names for a pet groomer:
Rub-a-Dub, Scooby Doo
Hair Force One
Duck and Cover, Two Fleas, One Duck
...

>>>

The settings can be changed at runtime, multiple values are possible:

>>> seed=1234 threads=8
Settings: repeat_penalty=1.3 seed=1234 temp=0.8 threads=8 tokens=128 top_k=40 top_p=0.95

Name		Name	Last commit message	Last commit date
Latest commit History 169 Commits
.devops		.devops
.github		.github
examples		examples
models		models
prompts		prompts
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SHA256SUMS		SHA256SUMS
alpaca.sh		alpaca.sh
chat.sh		chat.sh
convert-gptq-to-ggml.py		convert-gptq-to-ggml.py
convert-pth-to-ggml.py		convert-pth-to-ggml.py
download-pth.py		download-pth.py
flake.lock		flake.lock
flake.nix		flake.nix
ggml.c		ggml.c
ggml.h		ggml.h
go.mod		go.mod
main.cpp		main.cpp
main.go		main.go
main.h		main.h
quantize.cpp		quantize.cpp
quantize.py		quantize.py
utils.cpp		utils.cpp
utils.h		utils.h

License

cornelk/llama-go

Folders and files

Latest commit

History

Repository files navigation

llama-go

Description

Requirements

Installation

Usage

About

Topics

Resources

License

Stars

Watchers

Forks

Languages