5 changes: 4 additions & 1 deletion AGENTS.md
@@ -1,5 +1,8 @@
# xAI SDK implementation notes

- - `GrokClient` is primarily backed by generated gRPC protocol clients, but text to speech uses xAI's documented REST/WebSocket voice endpoints because there are no generated TTS protocol types in `src\xAI.Protocol`.
+ - `GrokClient` is primarily backed by generated gRPC protocol clients, but voice features use xAI's documented REST/WebSocket endpoints because there are no generated voice protocol types in `src\xAI.Protocol`.
+ - Voice REST calls use `GrokClient.HttpHandler` (backed by the `httpHandlers` cache), a plain `SocketsHttpHandler` + Polly pipeline separate from the gRPC channel. `ChannelHandler` returns `ChannelBase` only; there is no `.Handler` property on it.
  - `AsITextToSpeechClient` returns an `ITextToSpeechClient` implementation that uses `POST /v1/tts` for unary audio and `wss://.../v1/tts` for streaming audio.
+ - `AsISpeechToTextClient` returns an `ISpeechToTextClient` implementation that uses `POST /v1/stt` for file transcription and `wss://.../v1/stt` for raw-audio streaming transcription.
  - TTS defaults follow xAI docs: voice `eve`, language `en` when omitted by `TextToSpeechOptions`, and MP3 output when no codec is specified.
+ - STT streaming defaults follow xAI docs: encoding `pcm` and sample rate `16000` when omitted; WebSocket input must be raw encoded audio, not MP3/WAV container bytes.
89 changes: 89 additions & 0 deletions readme.md
@@ -51,6 +51,12 @@ var speech = new GrokClient(Environment.GetEnvironmentVariable("XAI_API_KEY")!)

var audio = await speech.GetAudioAsync("Hello! Welcome to xAI text to speech.",
    new TextToSpeechOptions { VoiceId = "eve", Language = "en" });

var transcription = new GrokClient(Environment.GetEnvironmentVariable("XAI_API_KEY")!)
    .AsISpeechToTextClient();

var text = await transcription.GetTextAsync(File.OpenRead("audio.mp3"),
    new SpeechToTextOptions { TextLanguage = "en" });
```

## File Attachments
@@ -402,6 +408,8 @@ Console.WriteLine($"Edited image URL: {editedImage.Uri}");
## Text to Speech

Grok supports text to speech via the `ITextToSpeechClient` abstraction from Microsoft.Extensions.AI.
See the [xAI text to speech docs](https://docs.x.ai/developers/model-capabilities/audio/text-to-speech)
for supported voices, formats, and streaming details.
Use `AsITextToSpeechClient` to get a TTS client:

```csharp
@@ -465,6 +473,87 @@ var options = new GrokTextToSpeechOptions
var response = await speech.GetAudioAsync("Streaming at 24 kHz, 128 kbps.", options);
```

## Speech to Text

Grok supports speech to text via the `ISpeechToTextClient` abstraction from Microsoft.Extensions.AI.
See the [xAI speech to text docs](https://docs.x.ai/developers/model-capabilities/audio/speech-to-text)
for supported languages, audio formats, diarization, multichannel audio, and streaming details.
Use `AsISpeechToTextClient` to get an STT client:

```csharp
var transcription = new GrokClient(Environment.GetEnvironmentVariable("XAI_API_KEY")!)
    .AsISpeechToTextClient();
```

### Unary (single response)

Call `GetTextAsync` to transcribe an audio file in a single request. The result contains transcript
text, timing information, and the raw xAI response:

```csharp
await using var audio = File.OpenRead("meeting.mp3");

var response = await transcription.GetTextAsync(audio,
    new GrokSpeechToTextOptions
    {
        TextLanguage = "en",
        Format = true,
    });

Console.WriteLine(response.Text);
```

Set `Format = true` with `TextLanguage` to enable xAI's inverse text normalization, such as converting
spoken numbers and currencies into written form.

### Streaming

Call `GetStreamingTextAsync` to stream raw audio and receive transcript updates as speech is processed.
The xAI streaming endpoint expects raw encoded audio such as PCM, µ-law, or A-law rather than MP3/WAV
container bytes:

```csharp
await using var audio = File.OpenRead("audio.pcm");

await foreach (var update in transcription.GetStreamingTextAsync(audio,
    new GrokSpeechToTextOptions
    {
        AudioFormat = "pcm",
        SpeechSampleRate = 16000,
        TextLanguage = "en",
        InterimResults = true,
    }))
{
    if (update.Kind is SpeechToTextResponseUpdateKind.TextUpdating or
        SpeechToTextResponseUpdateKind.TextUpdated)
    {
        Console.WriteLine(update.Text);
    }
}
```
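
Because the streaming endpoint takes raw samples rather than container bytes, a WAV file has to be unwrapped before it is streamed. The following is a minimal sketch of that step, not part of the xAI SDK: `WavAudio.ExtractData` is a hypothetical helper that walks the RIFF chunk list and returns the payload of the `data` chunk. It assumes the WAV holds uncompressed PCM (or µ-law/A-law) samples at the sample rate passed in `SpeechSampleRate`; it does not inspect the `fmt ` chunk or resample.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (not part of the xAI SDK): strips the RIFF/WAVE
// container and returns only the bytes stored in the "data" chunk.
public static class WavAudio
{
    public static byte[] ExtractData(Stream wav)
    {
        using var reader = new BinaryReader(wav, Encoding.ASCII, leaveOpen: true);

        // RIFF header: "RIFF", a 4-byte little-endian size, then "WAVE".
        if (ReadTag(reader) != "RIFF")
            throw new InvalidDataException("Not a RIFF file.");
        reader.ReadUInt32();
        if (ReadTag(reader) != "WAVE")
            throw new InvalidDataException("Not a WAVE file.");

        // Walk the chunks until "data" is found. Reading past the end of the
        // stream throws EndOfStreamException, which signals a missing data chunk.
        while (true)
        {
            var tag = ReadTag(reader);
            var size = reader.ReadUInt32();

            if (tag == "data")
                return reader.ReadBytes(checked((int)size));

            // Chunks are padded to an even length; read and discard the
            // payload so that non-seekable streams also work.
            reader.ReadBytes(checked((int)(size + (size & 1))));
        }
    }

    static string ReadTag(BinaryReader reader)
    {
        var bytes = reader.ReadBytes(4);
        if (bytes.Length < 4)
            throw new EndOfStreamException("Unexpected end of WAV data.");
        return Encoding.ASCII.GetString(bytes);
    }
}
```

The returned bytes can be wrapped in a `MemoryStream` and passed to `GetStreamingTextAsync` in place of the `audio.pcm` stream used above. MP3 input cannot be unwrapped this way because it is compressed and would need decoding first.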

### Grok-Specific Options

Use `GrokSpeechToTextOptions` to control xAI transcription behavior beyond the base
`SpeechToTextOptions`:

```csharp
var options = new GrokSpeechToTextOptions
{
    TextLanguage = "en",
    SpeechSampleRate = 16000,
    Format = true,          // normalize spoken numbers, currencies, and units
    AudioFormat = "pcm",    // pcm | mulaw | alaw for raw audio
    Diarize = true,         // include speaker IDs on words when available
    Multichannel = true,    // transcribe each channel independently
    Channels = 2,
    InterimResults = true,  // streaming only
    Endpointing = 10,       // streaming silence duration in milliseconds
};

var response = await transcription.GetTextAsync(File.OpenRead("call.pcm"), options);
```
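
When preparing raw audio for these options, it can help to work out how many bytes a given duration occupies at the configured `SpeechSampleRate` and `Channels`. The sketch below is illustrative only: `PcmMath.BytesFor` is a hypothetical helper, and the sample width (2 bytes, i.e. 16-bit PCM) and interleaved channel layout are assumptions that the options above do not state.

```csharp
using System;

// Hypothetical helper (not part of the xAI SDK).
public static class PcmMath
{
    // Bytes needed for `milliseconds` of interleaved raw PCM audio,
    // rounded down to a whole number of sample frames.
    public static int BytesFor(int sampleRate, int channels, int bytesPerSample, int milliseconds)
    {
        if (sampleRate <= 0) throw new ArgumentOutOfRangeException(nameof(sampleRate));
        if (channels <= 0) throw new ArgumentOutOfRangeException(nameof(channels));
        if (bytesPerSample <= 0) throw new ArgumentOutOfRangeException(nameof(bytesPerSample));
        if (milliseconds < 0) throw new ArgumentOutOfRangeException(nameof(milliseconds));

        // One frame holds one sample for every channel.
        long frames = (long)sampleRate * milliseconds / 1000;
        return checked((int)(frames * channels * bytesPerSample));
    }
}
```

With `SpeechSampleRate = 16000`, `Channels = 2`, and the assumed 2 bytes per sample, `PcmMath.BytesFor(16000, 2, 2, 100)` gives 6400 bytes for 100 ms of audio; the same duration in mono is 3200 bytes.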

<!-- #xai -->

# xAI.Protocol
93 changes: 77 additions & 16 deletions src/xAI.Tests/SanityChecks.cs
@@ -1,13 +1,8 @@
using System.Text.Json;
using Devlooped.Extensions.AI;
using DotNetEnv;
using Grpc.Core;
using Grpc.Net.Client.Configuration;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;
using xAI.Protocol;
using Xunit.Abstractions;
using Xunit.Sdk;
using static ConfigurationExtensions;
using ChatConversation = Devlooped.Extensions.AI.Chat;

namespace xAI.Tests;
@@ -18,7 +13,7 @@ public class SanityChecks(ITestOutputHelper output)
public async Task NoEmbeddingModels()
{
var services = new ServiceCollection()
- .AddxAIProtocol(Environment.GetEnvironmentVariable("CI_XAI_API_KEY")!)
+ .AddxAIProtocol(Configuration["CI_XAI_API_KEY"]!)
.BuildServiceProvider();

var client = services.GetRequiredService<Models.ModelsClient>();
@@ -33,7 +28,7 @@ public async Task NoEmbeddingModels()
public async Task ListModelsAsync()
{
var services = new ServiceCollection()
- .AddxAIProtocol(Environment.GetEnvironmentVariable("CI_XAI_API_KEY")!)
+ .AddxAIProtocol(Configuration["CI_XAI_API_KEY"]!)
.BuildServiceProvider();

var client = services.GetRequiredService<Models.ModelsClient>();
@@ -50,7 +45,7 @@ public async Task ListModelsAsync()
public async Task ExecuteLocalFunctionWithWebSearch()
{
var services = new ServiceCollection()
- .AddxAIProtocol(Environment.GetEnvironmentVariable("CI_XAI_API_KEY")!)
+ .AddxAIProtocol(Configuration["CI_XAI_API_KEY"]!)
.BuildServiceProvider();

var client = services.GetRequiredService<xAI.Protocol.Chat.ChatClient>();
@@ -161,7 +156,7 @@ public async Task ExecuteLocalFunctionWithWebSearch()
public async Task ClientSideFunction(bool streaming)
{
var getDateCalls = 0;
- var grok = new GrokClient(Env.GetString("CI_XAI_API_KEY")!)
+ var grok = new GrokClient(Configuration["CI_XAI_API_KEY"]!)
.AsIChatClient("grok-4-1-fast")
.AsBuilder()
.UseFunctionInvocation()
@@ -203,7 +198,7 @@ What is today's date? Use the get_date tool.
[InlineData(true)]
public async Task AgenticWebSearch(bool streaming)
{
- var grok = new GrokClient(Env.GetString("CI_XAI_API_KEY")!)
+ var grok = new GrokClient(Configuration["CI_XAI_API_KEY"]!)
.AsIChatClient("grok-4-1-fast");

var options = new GrokChatOptions
@@ -249,7 +244,7 @@ What is the current price of Tesla (TSLA) stock? Use web search (Yahoo Finance o
[InlineData(true)]
public async Task AgenticXSearch(bool streaming)
{
- var grok = new GrokClient(Env.GetString("CI_XAI_API_KEY")!)
+ var grok = new GrokClient(Configuration["CI_XAI_API_KEY"]!)
.AsIChatClient("grok-4-1-fast");

var options = new GrokChatOptions
@@ -288,7 +283,7 @@ What is the top news from Tesla on X? Use the X search tool.
[InlineData(true)]
public async Task AgenticMcpServer(bool streaming)
{
- var grok = new GrokClient(Env.GetString("CI_XAI_API_KEY")!)
+ var grok = new GrokClient(Configuration["CI_XAI_API_KEY"]!)
.AsIChatClient("grok-4-1-fast");

var options = new GrokChatOptions
@@ -299,7 +294,7 @@ public async Task AgenticMcpServer(bool streaming)
[
new HostedMcpServerTool("GitHub", "https://api.githubcopilot.com/mcp/")
{
- Headers = new Dictionary<string, string> { ["Authorization"] = Env.GetString("GITHUB_TOKEN")! },
+ Headers = new Dictionary<string, string> { ["Authorization"] = Configuration["GITHUB_TOKEN"]! },
AllowedTools = ["list_releases", "get_release_by_tag"],
}
]
@@ -340,7 +335,7 @@ What is the latest release version of the {{ThisAssembly.Git.Url}} repository? U
[InlineData(true)]
public async Task AgenticFileSearch(bool streaming)
{
- var grok = new GrokClient(Env.GetString("CI_XAI_API_KEY")!)
+ var grok = new GrokClient(Configuration["CI_XAI_API_KEY"]!)
.AsIChatClient("grok-4-1-fast");

var options = new GrokChatOptions
@@ -406,7 +401,7 @@ Use the collection search tool.
[InlineData(true)]
public async Task AgenticCodeInterpreter(bool streaming)
{
- var client = new GrokClient(Env.GetString("CI_XAI_API_KEY")!);
+ var client = new GrokClient(Configuration["CI_XAI_API_KEY"]!);

var grok = client.AsIChatClient("grok-4-1-fast");

@@ -451,6 +446,72 @@ parseable by a decimal parser.
output.WriteLine($"Code interpreter calls: {codeInterpreterCalls.Count}");
}

[SecretsTheory("CI_XAI_API_KEY")]
[InlineData("rex")]
public async Task TextToSpeech_SpeechToText(string voiceId)
{
    using var client = new GrokClient(Configuration["CI_XAI_API_KEY"]!);
    using var tts = client.AsITextToSpeechClient();

    var expected = "El que cree en mí, en realidad no cree en mí, sino en aquel que me envió.";
    var tempFile = System.IO.Path.Combine(System.IO.Path.GetTempPath(), $"xai-tts-{Guid.NewGuid():N}.pcm");

    try
    {
        await using (var fileStream = System.IO.File.Create(tempFile))
        {
            await foreach (var update in tts.GetStreamingAudioAsync(
                expected,
                new TextToSpeechOptions
                {
                    VoiceId = voiceId,
                    Language = "es-ES",
                    // uses mp3 by default
                }))
            {
                if (update.Kind == TextToSpeechResponseUpdateKind.AudioUpdating)
                {
                    foreach (var content in update.Contents)
                    {
                        if (content is DataContent data)
                        {
                            await fileStream.WriteAsync(data.Data);
                        }
                    }
                }
            }
        }

        Assert.True(System.IO.File.Exists(tempFile));
        Assert.True(new System.IO.FileInfo(tempFile).Length > 0);

        using var stt = client.AsISpeechToTextClient();
        await using var audioStream = System.IO.File.OpenRead(tempFile);

        // auto-detect format from content
        var transcription = await stt.GetTextAsync(audioStream);

        Assert.Equal(
            NormalizeTranscription(expected),
            NormalizeTranscription(transcription.Text),
            ignoreCase: true);
    }
    finally
    {
        if (System.IO.File.Exists(tempFile))
            System.IO.File.Delete(tempFile);
    }
}

static string NormalizeTranscription(string? text)
{
    var withoutPunctuation = new string((text ?? string.Empty)
        .Select(character => char.IsPunctuation(character) ? ' ' : character)
        .ToArray());

    return string.Join(" ", withoutPunctuation.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries));
}

static async Task<ChatResponse> GetResponseAsync(IChatClient client, ChatConversation chat, GrokChatOptions options, bool streaming)
{
if (!streaming)