A native Capacitor plugin that embeds llama.cpp directly into mobile apps, enabling offline AI inference with comprehensive support for text generation, multimodal processing, TTS, LoRA adapters, and more.
llama.cpp: Inference of Meta's LLaMA model (and others) in pure C/C++
- Offline AI Inference: Run large language models completely offline on mobile devices
- Text Generation: Complete text completion with streaming support
- Chat Conversations: Multi-turn conversations with context management
- Multimodal Support: Process images and audio alongside text
- Text-to-Speech (TTS): Generate speech from text using vocoder models
- LoRA Adapters: Fine-tune models with LoRA adapters
- Embeddings: Generate vector embeddings for semantic search
- Reranking: Rank documents by relevance to queries
- Session Management: Save and load conversation states
- Benchmarking: Performance testing and optimization tools
- Structured Output: Generate JSON with schema validation
- Cross-Platform: iOS and Android support with native optimizations
This plugin is now FULLY IMPLEMENTED with complete native integration of llama.cpp for both iOS and Android platforms. The implementation includes:
- Complete C++ Integration: Full llama.cpp library integration with all core components
- Native Build System: CMake-based build system for both iOS and Android
- Platform Support: iOS (arm64, x86_64) and Android (arm64-v8a, armeabi-v7a, x86, x86_64)
- TypeScript API: Complete TypeScript interface matching llama.rn functionality
- Native Methods: All 30+ native methods implemented with proper error handling
- Event System: Capacitor event system for progress and token streaming
- Documentation: Comprehensive README and API documentation
- C++ Core: Complete llama.cpp library with GGML, GGUF, and all supporting components
- iOS Framework: Native iOS framework with Metal acceleration support
- Android JNI: Complete JNI implementation with multi-architecture support
- Build Scripts: Automated build system for both platforms
- Error Handling: Robust error handling and result types
llama-cpp/
├── cpp/                     # Complete llama.cpp C++ library
│   ├── ggml.c               # GGML core
│   ├── gguf.cpp             # GGUF format support
│   ├── llama.cpp            # Main llama.cpp implementation
│   ├── rn-llama.cpp         # React Native wrapper (adapted)
│   ├── rn-completion.cpp    # Completion handling
│   ├── rn-tts.cpp           # Text-to-speech
│   └── tools/mtmd/          # Multimodal support
├── ios/
│   ├── CMakeLists.txt       # iOS build configuration
│   └── Sources/             # Swift implementation
├── android/
│   ├── src/main/
│   │   ├── CMakeLists.txt   # Android build configuration
│   │   ├── jni.cpp          # JNI implementation
│   │   └── jni-utils.h      # JNI utilities
│   └── build.gradle         # Android build config
├── src/
│   ├── definitions.ts       # Complete TypeScript interfaces
│   ├── index.ts             # Main plugin implementation
│   └── web.ts               # Web fallback
└── build-native.sh          # Automated build script
npm install llama-cpp-capacitor
The plugin includes a complete native implementation of llama.cpp. To build the native libraries:
- CMake (3.16+ for iOS, 3.10+ for Android)
- Xcode (for iOS builds, macOS only)
- Android Studio with NDK (for Android builds)
- Make or Ninja build system
# Build for all platforms
npm run build:native
# Build for specific platforms
npm run build:ios # iOS only
npm run build:android # Android only
# Clean native builds
npm run clean:native
cd ios
cmake -B build -S .
cmake --build build --config Release
cd android
./gradlew assembleRelease
- iOS: ios/build/LlamaCpp.framework/
- Android: android/src/main/jniLibs/{arch}/libllama-cpp-{arch}.so
- Install the plugin:
npm install llama-cpp-capacitor
- Add to your iOS project:
npx cap add ios
npx cap sync ios
- Open the project in Xcode:
npx cap open ios
- Install the plugin:
npm install llama-cpp-capacitor
- Add to your Android project:
npx cap add android
npx cap sync android
- Open the project in Android Studio:
npx cap open android
import { initLlama } from 'llama-cpp-capacitor';
// Initialize a model
const context = await initLlama({
model: '/path/to/your/model.gguf',
n_ctx: 2048,
n_threads: 4,
n_gpu_layers: 0,
});
// Generate text
const result = await context.completion({
prompt: "Hello, how are you today?",
n_predict: 50,
temperature: 0.8,
});
console.log('Generated text:', result.text);
const result = await context.completion({
messages: [
{ role: "system", content: "You are a helpful AI assistant." },
{ role: "user", content: "What is the capital of France?" },
{ role: "assistant", content: "The capital of France is Paris." },
{ role: "user", content: "Tell me more about it." }
],
n_predict: 100,
temperature: 0.7,
});
console.log('Chat response:', result.content);
let fullText = '';
const result = await context.completion({
prompt: "Write a short story about a robot learning to paint:",
n_predict: 150,
temperature: 0.8,
}, (tokenData) => {
// Called for each token as it's generated
fullText += tokenData.token;
console.log('Token:', tokenData.token);
});
console.log('Final result:', result.text);
Initialize a new llama.cpp context with a model.
Parameters:
- params: Context initialization parameters
- onProgress: Optional progress callback (0-100)

Returns: Promise resolving to a LlamaContext instance
Release all contexts and free memory.
Enable or disable native logging.
Add a listener for native log messages.
completion(params: CompletionParams, callback?: (data: TokenData) => void): Promise<NativeCompletionResult>
Generate text completion.
Parameters:
- params: Completion parameters including prompt or messages
- callback: Optional callback for token-by-token streaming
Tokenize text or text with images.
Convert tokens back to text.
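For example, a minimal sketch of round-tripping text through the tokenizer (assuming, as in the llama.rn-style API this plugin mirrors, that tokenize resolves to an object with a tokens array and detokenize accepts that array):

const { tokens } = await context.tokenize('Hello, world!');
console.log('Token count:', tokens.length);

const text = await context.detokenize(tokens);
console.log('Round-tripped text:', text);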
Generate embeddings for text.
Rank documents by relevance to a query.
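A hedged sketch of how these two methods can back a simple semantic-search flow (assuming embedding returns an object with an embedding vector and rerank resolves to results carrying score and index fields; embeddings typically require a context initialized with embedding: true):

// Vector embedding for semantic search (context initialized with embedding: true)
const { embedding } = await context.embedding('What is the capital of France?');
console.log('Embedding dimensions:', embedding.length);

// Rank candidate documents against a query
const ranked = await context.rerank('What is the capital of France?', [
  'Paris is the capital and largest city of France.',
  'Berlin is the capital of Germany.',
  'The Eiffel Tower is located in Paris.'
]);
ranked.forEach((r) => console.log(`Document ${r.index} scored ${r.score}`));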
Benchmark model performance.
Initialize multimodal support with a projector file.
Check if multimodal support is enabled.
Get multimodal capabilities.
Release multimodal resources.
Initialize TTS with a vocoder model.
Check if TTS is enabled.
getFormattedAudioCompletion(speaker: object | null, textToSpeak: string): Promise<{ prompt: string; grammar?: string }>
Get formatted audio completion prompt.
Get guide tokens for audio completion.
Decode audio tokens to audio data.
Release TTS resources.
Apply LoRA adapters to the model.
Remove all LoRA adapters.
Get list of loaded LoRA adapters.
Save current session to a file.
Load session from a file.
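A brief sketch of persisting and restoring conversation state across app launches (the file path is illustrative, and the exact option and result shapes are assumptions based on the llama.rn-style API):

// Save the current conversation state (KV cache) to a file
await context.saveSession('/path/to/session.bin', { tokenSize: 1024 });

// ...later, restore it before continuing the conversation
const session = await context.loadSession('/path/to/session.bin');
console.log('Session restored:', session);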
interface ContextParams {
model: string; // Path to GGUF model file
n_ctx?: number; // Context size (default: 512)
n_threads?: number; // Number of threads (default: 4)
n_gpu_layers?: number; // GPU layers (iOS only)
use_mlock?: boolean; // Lock memory (default: false)
use_mmap?: boolean; // Use memory mapping (default: true)
embedding?: boolean; // Embedding mode (default: false)
cache_type_k?: string; // KV cache type for K
cache_type_v?: string; // KV cache type for V
pooling_type?: string; // Pooling type
// ... more parameters
}
interface CompletionParams {
prompt?: string; // Text prompt
messages?: Message[]; // Chat messages
n_predict?: number; // Max tokens to generate
temperature?: number; // Sampling temperature
top_p?: number; // Top-p sampling
top_k?: number; // Top-k sampling
stop?: string[]; // Stop sequences
// ... more parameters
}
Feature | iOS | Android | Web |
---|---|---|---|
Text Generation | ✅ | ✅ | ❌ |
Chat Conversations | ✅ | ✅ | ❌ |
Streaming | ✅ | ✅ | ❌ |
Multimodal | ✅ | ✅ | ❌ |
TTS | ✅ | ✅ | ❌ |
LoRA Adapters | ✅ | ✅ | ❌ |
Embeddings | ✅ | ✅ | ❌ |
Reranking | ✅ | ✅ | ❌ |
Session Management | ✅ | ✅ | ❌ |
Benchmarking | ✅ | ✅ | ❌ |
// Initialize multimodal support
await context.initMultimodal({
path: '/path/to/mmproj.gguf',
use_gpu: true,
});
// Process image with text
const result = await context.completion({
messages: [
{
role: "user",
content: [
{ type: "text", text: "What do you see in this image?" },
{ type: "image_url", image_url: { url: "file:///path/to/image.jpg" } }
]
}
],
n_predict: 100,
});
console.log('Image analysis:', result.content);
// Initialize TTS
await context.initVocoder({
path: '/path/to/vocoder.gguf',
n_batch: 512,
});
// Generate audio
const audioCompletion = await context.getFormattedAudioCompletion(
null, // Speaker configuration
"Hello, this is a test of text-to-speech functionality."
);
const guideTokens = await context.getAudioCompletionGuideTokens(
"Hello, this is a test of text-to-speech functionality."
);
const audioResult = await context.completion({
prompt: audioCompletion.prompt,
grammar: audioCompletion.grammar,
guide_tokens: guideTokens,
n_predict: 1000,
});
const audioData = await context.decodeAudioTokens(audioResult.audio_tokens);
// Apply LoRA adapters
await context.applyLoraAdapters([
{ path: '/path/to/adapter1.gguf', scaled: 1.0 },
{ path: '/path/to/adapter2.gguf', scaled: 0.5 }
]);
// Check loaded adapters
const adapters = await context.getLoadedLoraAdapters();
console.log('Loaded adapters:', adapters);
// Generate with adapters
const result = await context.completion({
prompt: "Test prompt with LoRA adapters:",
n_predict: 50,
});
// Remove adapters
await context.removeLoraAdapters();
const result = await context.completion({
prompt: "Generate a JSON object with a person's name, age, and favorite color:",
n_predict: 100,
response_format: {
type: 'json_schema',
json_schema: {
strict: true,
schema: {
type: 'object',
properties: {
name: { type: 'string' },
age: { type: 'number' },
favorite_color: { type: 'string' }
},
required: ['name', 'age', 'favorite_color']
}
}
}
});
console.log('Structured output:', result.content);
This plugin supports GGUF format models, which are compatible with llama.cpp. You can find GGUF models on Hugging Face by searching for the "GGUF" tag.
- Llama 2: Meta's open large language model
- Mistral: High-performance open model
- Code Llama: Specialized for code generation
- Phi-2: Microsoft's efficient model
- Gemma: Google's open model
For mobile devices, consider using quantized models (Q4_K_M, Q5_K_M, etc.) to reduce memory usage and improve performance.
- Use quantized models for better memory efficiency
- Adjust n_ctx based on your use case
- Monitor memory usage with use_mlock: false
- iOS: Set n_gpu_layers to use Metal GPU acceleration
- Android: GPU acceleration is automatically enabled when available
- Adjust n_threads based on device capabilities
- More threads may improve performance but increase memory usage
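Putting these tips together, a tuned mobile configuration might look like the sketch below (the values are illustrative starting points, not benchmarked recommendations):

const context = await initLlama({
  model: '/path/to/model-q4_k_m.gguf', // quantized model for lower memory use
  n_ctx: 1024,        // smaller context to fit mobile memory budgets
  n_threads: 4,       // match the device's performance cores
  n_gpu_layers: 99,   // iOS: offload layers to Metal GPU
  use_mlock: false,   // let the OS page memory as needed
  use_mmap: true,     // memory-map the model file
});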
- Model not found: Ensure the model path is correct and the file exists
- Out of memory: Try using a quantized model or reducing n_ctx
- Slow performance: Enable GPU acceleration or increase n_threads
- Multimodal not working: Ensure the mmproj file is compatible with your model
Enable native logging to see detailed information:
import { toggleNativeLog, addNativeLogListener } from 'llama-cpp-capacitor';
await toggleNativeLog(true);
const logListener = addNativeLogListener((level, text) => {
console.log(`[${level}] ${text}`);
});
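When logging is no longer needed, the returned handle can be removed (assuming it follows Capacitor's usual listener-handle pattern):

logListener.remove();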
We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the MIT License - see the LICENSE file for details.
- llama.cpp - The core inference engine
- Capacitor - The cross-platform runtime
- llama.rn - Inspiration for the React Native implementation
- 📧 Email: support@arusatech.com
- 🐛 Issues: GitHub Issues
- 📖 Documentation: GitHub Wiki