A native Capacitor plugin that embeds llama.cpp directly into mobile apps, enabling offline AI inference with comprehensive support for text generation, multimodal processing, TTS, LoRA adapters, and more.
llama.cpp: Inference of Meta's LLaMA model (and others) in pure C/C++
- Offline AI Inference: Run large language models completely offline on mobile devices
- Text Generation: Complete text completion with streaming support
- Chat Conversations: Multi-turn conversations with context management
- Multimodal Support: Process images and audio alongside text
- Text-to-Speech (TTS): Generate speech from text using vocoder models
- LoRA Adapters: Fine-tune models with LoRA adapters
- Embeddings: Generate vector embeddings for semantic search
- Reranking: Rank documents by relevance to queries
- Session Management: Save and load conversation states
- Benchmarking: Performance testing and optimization tools
- Structured Output: Generate JSON with schema validation
- Cross-Platform: iOS and Android support with native optimizations
This plugin is now FULLY IMPLEMENTED with complete native integration of llama.cpp for both iOS and Android platforms. The implementation includes:
- Complete C++ Integration: Full llama.cpp library integration with all core components
- Native Build System: CMake-based build system for both iOS and Android
- Platform Support: iOS (arm64, x86_64) and Android (arm64-v8a, armeabi-v7a, x86, x86_64)
- TypeScript API: Complete TypeScript interface matching llama.rn functionality
- Native Methods: All 30+ native methods implemented with proper error handling
- Event System: Capacitor event system for progress and token streaming
- Documentation: Comprehensive README and API documentation
- C++ Core: Complete llama.cpp library with GGML, GGUF, and all supporting components
- iOS Framework: Native iOS framework with Metal acceleration support
- Android JNI: Complete JNI implementation with multi-architecture support
- Build Scripts: Automated build system for both platforms
- Error Handling: Robust error handling and result types
llama-cpp/
├── cpp/                     # Complete llama.cpp C++ library
│   ├── ggml.c               # GGML core
│   ├── gguf.cpp             # GGUF format support
│   ├── llama.cpp            # Main llama.cpp implementation
│   ├── rn-llama.cpp         # React Native wrapper (adapted)
│   ├── rn-completion.cpp    # Completion handling
│   ├── rn-tts.cpp           # Text-to-speech
│   └── tools/mtmd/          # Multimodal support
├── ios/
│   ├── CMakeLists.txt       # iOS build configuration
│   └── Sources/             # Swift implementation
├── android/
│   ├── src/main/
│   │   ├── CMakeLists.txt   # Android build configuration
│   │   ├── jni.cpp          # JNI implementation
│   │   └── jni-utils.h      # JNI utilities
│   └── build.gradle         # Android build config
├── src/
│   ├── definitions.ts       # Complete TypeScript interfaces
│   ├── index.ts             # Main plugin implementation
│   └── web.ts               # Web fallback
└── build-native.sh          # Automated build script
npm install llama-cpp-capacitor
The plugin includes a complete native implementation of llama.cpp. To build the native libraries:
- CMake (3.16+ for iOS, 3.10+ for Android)
- Xcode (for iOS builds, macOS only)
- Android Studio with NDK (for Android builds)
- Make or Ninja build system
# Build for all platforms
npm run build:native
# Build for specific platforms
npm run build:ios # iOS only
npm run build:android # Android only
# Clean native builds
npm run clean:native
cd ios
cmake -B build -S .
cmake --build build --config Release
cd android
./gradlew assembleRelease
- iOS: ios/build/LlamaCpp.framework/
- Android: android/src/main/jniLibs/{arch}/libllama-cpp-{arch}.so
- Install the plugin:
npm install llama-cpp-capacitor
- Add to your iOS project:
npx cap add ios
npx cap sync ios
- Open the project in Xcode:
npx cap open ios
- Install the plugin:
npm install llama-cpp-capacitor
- Add to your Android project:
npx cap add android
npx cap sync android
- Open the project in Android Studio:
npx cap open android
import { initLlama } from 'llama-cpp-capacitor';
// Initialize a model
const context = await initLlama({
model: '/path/to/your/model.gguf',
n_ctx: 2048,
n_threads: 4,
n_gpu_layers: 0,
});
// Generate text
const result = await context.completion({
prompt: "Hello, how are you today?",
n_predict: 50,
temperature: 0.8,
});
console.log('Generated text:', result.text);
const result = await context.completion({
messages: [
{ role: "system", content: "You are a helpful AI assistant." },
{ role: "user", content: "What is the capital of France?" },
{ role: "assistant", content: "The capital of France is Paris." },
{ role: "user", content: "Tell me more about it." }
],
n_predict: 100,
temperature: 0.7,
});
console.log('Chat response:', result.content);
let fullText = '';
const result = await context.completion({
prompt: "Write a short story about a robot learning to paint:",
n_predict: 150,
temperature: 0.8,
}, (tokenData) => {
// Called for each token as it's generated
fullText += tokenData.token;
console.log('Token:', tokenData.token);
});
console.log('Final result:', result.text);
Initialize a new llama.cpp context with a model.
Parameters:
- params: Context initialization parameters
- onProgress: Optional progress callback (0-100)

Returns: Promise resolving to a LlamaContext instance
Release all contexts and free memory.
Enable or disable native logging.
Add a listener for native log messages.
completion(params: CompletionParams, callback?: (data: TokenData) => void): Promise<NativeCompletionResult>
Generate text completion.
Parameters:
- params: Completion parameters including prompt or messages
- callback: Optional callback for token-by-token streaming
Tokenize text or text with images.
Convert tokens back to text.
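For example, a minimal sketch of round-tripping text through the tokenizer (assuming, as in the llama.rn-style API this plugin mirrors, that tokenize resolves to an object with a tokens array and detokenize accepts that array):

const { tokens } = await context.tokenize('Hello, world!');
console.log('Token count:', tokens.length);

const text = await context.detokenize(tokens);
console.log('Round-tripped text:', text);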
Generate embeddings for text.
Rank documents by relevance to a query.
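A hedged sketch of how these two methods can back a simple semantic-search flow (assuming embedding returns an object with an embedding vector and rerank resolves to results carrying score and index fields; embeddings typically require a context initialized with embedding: true):

// Vector embedding for semantic search (context initialized with embedding: true)
const { embedding } = await context.embedding('What is the capital of France?');
console.log('Embedding dimensions:', embedding.length);

// Rank candidate documents against a query
const ranked = await context.rerank('What is the capital of France?', [
  'Paris is the capital and largest city of France.',
  'Berlin is the capital of Germany.',
  'The Eiffel Tower is located in Paris.'
]);
ranked.forEach((r) => console.log(`Document ${r.index} scored ${r.score}`));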
Benchmark model performance.
Initialize multimodal support with a projector file.
Check if multimodal support is enabled.
Get multimodal capabilities.
Release multimodal resources.
Initialize TTS with a vocoder model.
Check if TTS is enabled.
getFormattedAudioCompletion(speaker: object | null, textToSpeak: string): Promise<{ prompt: string; grammar?: string }>
Get formatted audio completion prompt.
Get guide tokens for audio completion.
Decode audio tokens to audio data.
Release TTS resources.
Apply LoRA adapters to the model.
Remove all LoRA adapters.
Get list of loaded LoRA adapters.
Save current session to a file.
Load session from a file.
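A brief sketch of persisting and restoring conversation state across app launches (the file path is illustrative, and the exact option and result shapes are assumptions based on the llama.rn-style API):

// Save the current conversation state (KV cache) to a file
await context.saveSession('/path/to/session.bin', { tokenSize: 1024 });

// ...later, restore it before continuing the conversation
const session = await context.loadSession('/path/to/session.bin');
console.log('Session restored:', session);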
interface ContextParams {
model: string; // Path to GGUF model file
n_ctx?: number; // Context size (default: 512)
n_threads?: number; // Number of threads (default: 4)
n_gpu_layers?: number; // GPU layers (iOS only)
use_mlock?: boolean; // Lock memory (default: false)
use_mmap?: boolean; // Use memory mapping (default: true)
embedding?: boolean; // Embedding mode (default: false)
cache_type_k?: string; // KV cache type for K
cache_type_v?: string; // KV cache type for V
pooling_type?: string; // Pooling type
// ... more parameters
}
interface CompletionParams {
prompt?: string; // Text prompt
messages?: Message[]; // Chat messages
n_predict?: number; // Max tokens to generate
temperature?: number; // Sampling temperature
top_p?: number; // Top-p sampling
top_k?: number; // Top-k sampling
stop?: string[]; // Stop sequences
// ... more parameters
}
Feature | iOS | Android | Web |
---|---|---|---|
Text Generation | ✅ | ✅ | ❌ |
Chat Conversations | ✅ | ✅ | ❌ |
Streaming | ✅ | ✅ | ❌ |
Multimodal | ✅ | ✅ | ❌ |
TTS | ✅ | ✅ | ❌ |
LoRA Adapters | ✅ | ✅ | ❌ |
Embeddings | ✅ | ✅ | ❌ |
Reranking | ✅ | ✅ | ❌ |
Session Management | ✅ | ✅ | ❌ |
Benchmarking | ✅ | ✅ | ❌ |
// Initialize multimodal support
await context.initMultimodal({
path: '/path/to/mmproj.gguf',
use_gpu: true,
});
// Process image with text
const result = await context.completion({
messages: [
{
role: "user",
content: [
{ type: "text", text: "What do you see in this image?" },
{ type: "image_url", image_url: { url: "file:///path/to/image.jpg" } }
]
}
],
n_predict: 100,
});
console.log('Image analysis:', result.content);
// Initialize TTS
await context.initVocoder({
path: '/path/to/vocoder.gguf',
n_batch: 512,
});
// Generate audio
const audioCompletion = await context.getFormattedAudioCompletion(
null, // Speaker configuration
"Hello, this is a test of text-to-speech functionality."
);
const guideTokens = await context.getAudioCompletionGuideTokens(
"Hello, this is a test of text-to-speech functionality."
);
const audioResult = await context.completion({
prompt: audioCompletion.prompt,
grammar: audioCompletion.grammar,
guide_tokens: guideTokens,
n_predict: 1000,
});
const audioData = await context.decodeAudioTokens(audioResult.audio_tokens);
// Apply LoRA adapters
await context.applyLoraAdapters([
{ path: '/path/to/adapter1.gguf', scaled: 1.0 },
{ path: '/path/to/adapter2.gguf', scaled: 0.5 }
]);
// Check loaded adapters
const adapters = await context.getLoadedLoraAdapters();
console.log('Loaded adapters:', adapters);
// Generate with adapters
const result = await context.completion({
prompt: "Test prompt with LoRA adapters:",
n_predict: 50,
});
// Remove adapters
await context.removeLoraAdapters();
const result = await context.completion({
prompt: "Generate a JSON object with a person's name, age, and favorite color:",
n_predict: 100,
response_format: {
type: 'json_schema',
json_schema: {
strict: true,
schema: {
type: 'object',
properties: {
name: { type: 'string' },
age: { type: 'number' },
favorite_color: { type: 'string' }
},
required: ['name', 'age', 'favorite_color']
}
}
}
});
console.log('Structured output:', result.content);
This plugin supports GGUF format models, which are compatible with llama.cpp. You can find GGUF models on Hugging Face by searching for the "GGUF" tag.
- Llama 2: Meta's open large language model
- Mistral: High-performance open model
- Code Llama: Specialized for code generation
- Phi-2: Microsoft's efficient model
- Gemma: Google's open model
For mobile devices, consider using quantized models (Q4_K_M, Q5_K_M, etc.) to reduce memory usage and improve performance.
- Use quantized models for better memory efficiency
- Adjust n_ctx based on your use case
- Monitor memory usage with use_mlock: false
- iOS: Set n_gpu_layers to use Metal GPU acceleration
- Android: GPU acceleration is automatically enabled when available
- Adjust n_threads based on device capabilities
- More threads may improve performance but increase memory usage
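Putting these tips together, a tuned mobile configuration might look like the sketch below (the values are illustrative starting points, not benchmarked recommendations):

const context = await initLlama({
  model: '/path/to/model-q4_k_m.gguf', // quantized model for lower memory use
  n_ctx: 1024,        // smaller context to fit mobile memory budgets
  n_threads: 4,       // match the device's performance cores
  n_gpu_layers: 99,   // iOS: offload layers to Metal GPU
  use_mlock: false,   // let the OS page memory as needed
  use_mmap: true,     // memory-map the model file
});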
- Model not found: Ensure the model path is correct and the file exists
- Out of memory: Try using a quantized model or reducing n_ctx
- Slow performance: Enable GPU acceleration or increase n_threads
- Multimodal not working: Ensure the mmproj file is compatible with your model
Enable native logging to see detailed information:
import { toggleNativeLog, addNativeLogListener } from 'llama-cpp-capacitor';
await toggleNativeLog(true);
const logListener = addNativeLogListener((level, text) => {
console.log(`[${level}] ${text}`);
});
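When logging is no longer needed, the returned handle can be removed (assuming it follows Capacitor's usual listener-handle pattern):

logListener.remove();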
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- llama.cpp - The core inference engine
- Capacitor - The cross-platform runtime
- llama.rn - Inspiration for the React Native implementation
- 📧 Email: support@arusatech.com
- 🐛 Issues: GitHub Issues
- 📖 Documentation: GitHub Wiki