
ElBruno.LocalLLMs


Run local LLMs in .NET through IChatClient — the same interface you'd use for Azure OpenAI, Ollama, or any other provider. Powered by ONNX Runtime GenAI.

Features

  • 🔌 IChatClient implementation — seamless integration with Microsoft.Extensions.AI
  • 📦 Automatic model download — models are fetched from HuggingFace on first use
  • 🚀 Zero friction — works out of the box with sensible defaults (Phi-3.5 mini)
  • 🖥️ Multi-hardware — CPU, CUDA, and DirectML execution providers
  • 💉 DI-friendly — register with AddLocalLLMs() in ASP.NET Core
  • 🔄 Streaming — token-by-token streaming via GetStreamingResponseAsync
  • 📊 Multi-model — switch between Phi-3.5, Phi-4, Qwen2.5, Llama 3.2, and more

Installation

dotnet add package ElBruno.LocalLLMs

The base package runs everywhere on CPU. To enable GPU acceleration, add one extra package:

# 🟢 NVIDIA GPU (CUDA):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda

# 🔵 Any Windows GPU — AMD, Intel, NVIDIA (DirectML):
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML

🚀 The library defaults to ExecutionProvider.Auto — it tries GPU first and falls back to CPU automatically. No code changes needed.
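If you want to pin a specific provider instead of relying on Auto, you can set it through the options object used elsewhere in this README. A minimal sketch; LocalLLMsOptions and the ExecutionProvider enum appear in the Streaming and Dependency Injection examples, but the Cuda member name shown here is an assumption based on the providers listed above — check the package's ExecutionProvider enum for the exact values:

```csharp
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Force a specific execution provider rather than ExecutionProvider.Auto.
// ExecutionProvider.Cuda is an assumed member name; verify it against the
// enum shipped in the ElBruno.LocalLLMs package.
using var client = await LocalChatClient.CreateAsync(new LocalLLMsOptions
{
    ExecutionProvider = ExecutionProvider.Cuda
});
```

Pinning a provider is mainly useful for benchmarking or for failing fast when the expected GPU package is missing; for everyday use the Auto default is the simpler choice.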

Quick Start

using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Create a local chat client (downloads Phi-3.5 mini on first run)
using var client = await LocalChatClient.CreateAsync();

var response = await client.GetResponseAsync([
    new(ChatRole.User, "What is the capital of France?")
]);

Console.WriteLine(response.Text);

Streaming

using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

using var client = await LocalChatClient.CreateAsync(new LocalLLMsOptions
{
    Model = KnownModels.Phi35MiniInstruct
});

await foreach (var update in client.GetStreamingResponseAsync([
    new(ChatRole.System, "You are a helpful assistant."),
    new(ChatRole.User, "Explain quantum computing in simple terms.")
]))
{
    Console.Write(update.Text);
}

Dependency Injection

builder.Services.AddLocalLLMs(options =>
{
    options.Model = KnownModels.Phi35MiniInstruct;
    options.ExecutionProvider = ExecutionProvider.DirectML;
});

// Inject IChatClient anywhere
public class MyService(IChatClient chatClient) { ... }

Supported Models

| Tier | Model | Parameters | ONNX | ID |
|------|-------|------------|------|----|
| ⚪ Tiny | TinyLlama-1.1B-Chat | 1.1B | ✅ Native | tinyllama-1.1b-chat |
| ⚪ Tiny | SmolLM2-1.7B-Instruct | 1.7B | ✅ Native | smollm2-1.7b-instruct |
| ⚪ Tiny | Qwen2.5-0.5B-Instruct | 0.5B | ✅ Native | qwen2.5-0.5b-instruct |
| ⚪ Tiny | Qwen2.5-1.5B-Instruct | 1.5B | ✅ Native | qwen2.5-1.5b-instruct |
| ⚪ Tiny | Gemma-2B-IT | 2B | ✅ Native | gemma-2b-it |
| ⚪ Tiny | StableLM-2-1.6B-Chat | 1.6B | 🔄 Convert | stablelm-2-1.6b-chat |
| 🟢 Small | Phi-3.5 mini instruct | 3.8B | ✅ Native | phi-3.5-mini-instruct |
| 🟢 Small | Qwen2.5-3B-Instruct | 3B | ✅ Native | qwen2.5-3b-instruct |
| 🟢 Small | Llama-3.2-3B-Instruct | 3B | ✅ Native | llama-3.2-3b-instruct |
| 🟢 Small | Gemma-2-2B-IT | 2B | ✅ Native | gemma-2-2b-it |
| 🟡 Medium | Qwen2.5-7B-Instruct | 7B | ✅ Native | qwen2.5-7b-instruct |
| 🟡 Medium | Llama-3.1-8B-Instruct | 8B | ✅ Native | llama-3.1-8b-instruct |
| 🟡 Medium | Mistral-7B-Instruct-v0.3 | 7B | ✅ Native | mistral-7b-instruct-v0.3 |
| 🟡 Medium | Gemma-2-9B-IT | 9B | ✅ Native | gemma-2-9b-it |
| 🟡 Medium | Phi-4 | 14B | ✅ Native | phi-4 |
| 🟡 Medium | DeepSeek-R1-Distill-Qwen-14B | 14B | ✅ Native | deepseek-r1-distill-qwen-14b |
| 🟡 Medium | Mistral-Small-24B-Instruct | 24B | ✅ Native | mistral-small-24b-instruct |
| 🔴 Large | Qwen2.5-14B-Instruct | 14B | ✅ Native | qwen2.5-14b-instruct |
| 🔴 Large | Qwen2.5-32B-Instruct | 32B | ✅ Native | qwen2.5-32b-instruct |
| 🔴 Large | Llama-3.3-70B-Instruct | 70B | ✅ ONNX | llama-3.3-70b-instruct |
| 🔴 Large | Mixtral-8x7B-Instruct-v0.1 | 8x7B | 🔄 Convert | mixtral-8x7b-instruct-v0.1 |
| 🔴 Large | DeepSeek-R1-Distill-Llama-70B | 70B | 🔄 Convert | deepseek-r1-distill-llama-70b |
| 🔴 Large | Command-R (35B) | 35B | 🔄 Convert | command-r-35b |

See the Supported Models Guide for detailed model cards, performance benchmarks, and selection guidance.
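Switching between models from the table is just a matter of changing the Model option. A minimal sketch; only KnownModels.Phi35MiniInstruct appears in the examples above, so KnownModels.Phi4 here is an assumed member name following the same naming convention — verify it against the package's KnownModels class:

```csharp
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Select a different model from the supported-models table.
// KnownModels.Phi4 is an assumed member name modeled on
// KnownModels.Phi35MiniInstruct; check the KnownModels class for
// the exact identifiers. The model downloads on first use.
using var client = await LocalChatClient.CreateAsync(new LocalLLMsOptions
{
    Model = KnownModels.Phi4
});

var response = await client.GetResponseAsync([
    new(ChatRole.User, "Summarize ONNX Runtime GenAI in one sentence.")
]);

Console.WriteLine(response.Text);
```

Larger models trade latency and disk space for quality, so start with a Tiny or Small tier model and move up only if the output quality falls short.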

Samples

| Sample | Description |
|--------|-------------|
| HelloChat | Minimal console chat |
| StreamingChat | Token-by-token streaming |
| MultiModelChat | Switch models at runtime |
| DependencyInjection | ASP.NET Core DI registration |

Requirements

  • .NET 8.0 or .NET 10.0
  • CPU (default), NVIDIA GPU (CUDA), or Windows GPU (DirectML)
  • ~2-8 GB disk space per model (depending on size and quantization)


🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License — see the LICENSE file for details.

👋 About the Author

Hi! I'm ElBruno 🧑, a passionate developer and content creator exploring AI, .NET, and modern development practices.

Made with ❤️ by ElBruno

If you like this project, consider following my work across platforms:

  • 📻 Podcast: No Tienen Nombre — Spanish-language episodes on AI, development, and tech culture
  • 💻 Blog: ElBruno.com — Deep dives on embeddings, RAG, .NET, and local AI
  • 📺 YouTube: youtube.com/elbruno — Demos, tutorials, and live coding
  • 🔗 LinkedIn: @elbruno — Professional updates and insights
  • 𝕏 Twitter: @elbruno — Quick tips, releases, and tech news
