Add ITextToSpeechClient abstraction, middleware, and OpenAI implementation #7381
stephentoub merged 2 commits into dotnet:main
Pull request overview
This PR adds a comprehensive ITextToSpeechClient abstraction to Microsoft.Extensions.AI, mirroring the existing ISpeechToTextClient pattern. It introduces core abstraction types, middleware components (logging, OpenTelemetry, options configuration), DI integration, and an OpenAI implementation backed by AudioClient.GenerateSpeechAsync.
Changes:
- New abstraction types in Microsoft.Extensions.AI.Abstractions: ITextToSpeechClient, DelegatingTextToSpeechClient, TextToSpeechOptions, TextToSpeechResponse, TextToSpeechResponseUpdate, TextToSpeechResponseUpdateKind, TextToSpeechClientMetadata, TextToSpeechClientExtensions, and TextToSpeechResponseUpdateExtensions.
- Middleware pipeline in Microsoft.Extensions.AI: TextToSpeechClientBuilder, DI service collection extensions, ConfigureOptionsTextToSpeechClient, LoggingTextToSpeechClient, and OpenTelemetryTextToSpeechClient with builder extensions.
- OpenAI implementation (OpenAITextToSpeechClient) wrapping AudioClient with voice/speed/format mapping and non-streaming fallback for GetStreamingAudioAsync.
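The "non-streaming fallback" mentioned in the last bullet can be pictured roughly as follows. This is only a sketch of the general technique, not the code in this PR: `SketchAudio`, `SketchUpdate`, and `FallbackSpeechSketch` are hypothetical stand-ins for the real types, and the choice of `"AudioUpdated"` as the kind of the single update is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins for illustration; these are NOT the Microsoft.Extensions.AI types.
public sealed record SketchAudio(byte[] Data, string MediaType);
public sealed record SketchUpdate(string Kind, SketchAudio Audio);

public sealed class FallbackSpeechSketch
{
    // Delegate standing in for a provider call that returns the whole audio payload at once.
    private readonly Func<string, CancellationToken, Task<SketchAudio>> _generate;

    public FallbackSpeechSketch(Func<string, CancellationToken, Task<SketchAudio>> generate)
        => _generate = generate ?? throw new ArgumentNullException(nameof(generate));

    // Non-streaming path: one request, one complete audio payload.
    public Task<SketchAudio> GetAudioAsync(string text, CancellationToken cancellationToken = default)
        => _generate(text, cancellationToken);

    // Streaming fallback: await the full result, then surface it as a single update.
    public async IAsyncEnumerable<SketchUpdate> GetStreamingAudioAsync(
        string text,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        SketchAudio audio = await GetAudioAsync(text, cancellationToken).ConfigureAwait(false);

        // Assumption: the one update is tagged "AudioUpdated"; the real kind may differ.
        yield return new SketchUpdate("AudioUpdated", audio);
    }
}
```

With this shape, callers can always consume the streaming API with `await foreach`, even when the underlying provider call is not incremental; they simply receive one update carrying the complete audio.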
Reviewed changes
Copilot reviewed 43 out of 43 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/.../TextToSpeech/ITextToSpeechClient.cs | Core interface with GetAudioAsync, GetStreamingAudioAsync, GetService |
| src/.../TextToSpeech/DelegatingTextToSpeechClient.cs | Base class for pipeline delegation |
| src/.../TextToSpeech/TextToSpeechOptions.cs | Options with ModelId, VoiceId, Language, AudioFormat, Speed, Pitch, Volume, RawRepresentationFactory |
| src/.../TextToSpeech/TextToSpeechResponse.cs | Response type with Contents, Usage, ToTextToSpeechResponseUpdates |
| src/.../TextToSpeech/TextToSpeechResponseUpdate.cs | Streaming update type with Kind, Contents |
| src/.../TextToSpeech/TextToSpeechResponseUpdateKind.cs | Kind struct (SessionOpen, Error, AudioUpdating, AudioUpdated, SessionClose) |
| src/.../TextToSpeech/TextToSpeechResponseUpdateExtensions.cs | Coalescing extensions for updates → response |
| src/.../TextToSpeech/TextToSpeechClientMetadata.cs | Metadata with ProviderName, ProviderUri, DefaultModelId |
| src/.../TextToSpeech/TextToSpeechClientExtensions.cs | GetService extension |
| src/.../AI/TextToSpeech/TextToSpeechClientBuilder.cs | Builder pattern for pipelines |
| src/.../AI/TextToSpeech/TextToSpeechClientBuilderServiceCollectionExtensions.cs | DI registration (keyed + unkeyed) |
| src/.../AI/TextToSpeech/TextToSpeechClientBuilderTextToSpeechClientExtensions.cs | AsBuilder() extension |
| src/.../AI/TextToSpeech/ConfigureOptionsTextToSpeechClient.cs | Options configuration middleware |
| src/.../AI/TextToSpeech/ConfigureOptionsTextToSpeechClientBuilderExtensions.cs | ConfigureOptions builder extension |
| src/.../AI/TextToSpeech/LoggingTextToSpeechClient.cs | Logging middleware (skips binary audio serialization) |
| src/.../AI/TextToSpeech/LoggingTextToSpeechClientBuilderExtensions.cs | UseLogging builder extension |
| src/.../AI/TextToSpeech/OpenTelemetryTextToSpeechClient.cs | OpenTelemetry tracing/metrics middleware |
| src/.../AI/TextToSpeech/OpenTelemetryTextToSpeechClientBuilderExtensions.cs | UseOpenTelemetry builder extension |
| src/.../AI/OpenTelemetryConsts.cs | Adds TypeAudio constant |
| src/.../AI.OpenAI/OpenAITextToSpeechClient.cs | OpenAI implementation with format mapping |
| src/.../AI.OpenAI/OpenAIClientExtensions.cs | AsITextToSpeechClient() extension on AudioClient |
| src/.../AI.Abstractions/Utilities/AIJsonUtilities.Defaults.cs | Registers TTS types for source-gen JSON serialization |
| src/Shared/DiagnosticIds/DiagnosticIds.cs | Adds AITextToSpeech experimental diagnostic ID |
| test/.../TestTextToSpeechClient.cs | Test helper client |
| test/.../TestJsonSerializerContext.cs | Adds TTS types to test serialization context |
| test/.../TextToSpeech/*Tests.cs | Comprehensive tests for all new types |
| test/.../OpenAITextToSpeechClientTests.cs | Unit tests for OpenAI implementation |
| test/.../OpenAITextToSpeechClientIntegrationTests.cs | Integration tests |
| test/.../TextToSpeechClientIntegrationTests.cs | Base integration test class |
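To make the "pipeline delegation" and "skips binary audio serialization" rows more concrete, here is a minimal sketch of the delegating-middleware pattern. It assumes a deliberately simplified, hypothetical interface (`ISketchSpeechClient`) that returns raw bytes; the real ITextToSpeechClient surface is richer, and none of the type names below come from the PR.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, simplified client shape used only for this sketch.
public interface ISketchSpeechClient
{
    Task<byte[]> GetAudioAsync(string text, CancellationToken cancellationToken = default);
}

// Stand-in provider that always returns the same bytes.
public sealed class FixedSketchSpeechClient : ISketchSpeechClient
{
    private readonly byte[] _audio;
    public FixedSketchSpeechClient(byte[] audio) => _audio = audio;
    public Task<byte[]> GetAudioAsync(string text, CancellationToken cancellationToken = default)
        => Task.FromResult(_audio);
}

// Base class for middleware: forwards every call to the inner client unless overridden.
public class DelegatingSketchSpeechClient : ISketchSpeechClient
{
    protected ISketchSpeechClient InnerClient { get; }

    public DelegatingSketchSpeechClient(ISketchSpeechClient innerClient)
        => InnerClient = innerClient ?? throw new ArgumentNullException(nameof(innerClient));

    public virtual Task<byte[]> GetAudioAsync(string text, CancellationToken cancellationToken = default)
        => InnerClient.GetAudioAsync(text, cancellationToken);
}

// Logging middleware: records metadata about the call but never the audio bytes themselves.
public sealed class LoggingSketchSpeechClient : DelegatingSketchSpeechClient
{
    private readonly Action<string> _log;

    public LoggingSketchSpeechClient(ISketchSpeechClient innerClient, Action<string> log)
        : base(innerClient)
        => _log = log ?? throw new ArgumentNullException(nameof(log));

    public override async Task<byte[]> GetAudioAsync(string text, CancellationToken cancellationToken = default)
    {
        _log($"GetAudioAsync invoked with {text.Length} characters of input.");
        byte[] audio = await base.GetAudioAsync(text, cancellationToken).ConfigureAwait(false);

        // Only the payload size is logged; the binary content is skipped.
        _log($"GetAudioAsync completed with {audio.Length} bytes of audio.");
        return audio;
    }
}
```

Because each middleware wraps an inner client of the same interface, layers can be stacked in any order, which is the role a builder typically plays in this kind of design.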
ericstj left a comment
This looks pretty good and a rather straightforward adaptation of established patterns.
@stephentoub I am wondering if we care to add another interface for retrieving the supported voices? The TTS service should be able to support a specific voice.

I don't understand the question. Can you elaborate?

As you know, TTS providers typically offer multiple voices that can be used when synthesizing audio. In order to request synthesis using a specific voice, we need a way to discover which voices a provider supports. I recently worked on a similar initiative to add TTS support and introduced a Task<SpeechVoice[]> GetSpeechVoicesAsync() method that returns the list of supported voices (when the provider exposes them). Having such a method is very useful because it allows consumers to know which voices are available and request the synthesized utterance using one of those voices. I’m wondering if we could add a similar method to ITextToSpeechClient to expose the list of available voices. I'm not entirely sure whether ITextToSpeechClient is the best place for this or if it should be introduced through a separate interface. That said, adding it directly to ITextToSpeechClient might still make sense since it is closely related to the same client and its capabilities.

Thanks. My concern is, at least as far as I'm aware, a bunch of services don't actually expose that, e.g. to my knowledge OpenAI doesn't provide an API for retrieving the list of voices, Gemini doesn't appear to, etc. Am I just missing it?

I hear you. However, many providers do expose their voices, and in many real-world use cases this is crucial. I think all providers will probably expose multiple voices as demand grows. I use ElevenLabs and Azure TTS, and both have ways to retrieve the list. Azure TTS does not support OpenAI as far as I know, yet it still exposes its voices.
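For illustration only, the "separate interface" option raised in this thread could be sketched as below. `SpeechVoice` and `GetSpeechVoicesAsync` are the names mentioned in the comment above; the members of `SpeechVoice`, the `CancellationToken` parameter, `ISpeechVoiceProvider`, `StaticSpeechVoiceProvider`, and `SpeechVoiceDiscovery` are all hypothetical and are not part of this PR.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical voice descriptor; the actual members would depend on what providers report.
public sealed record SpeechVoice(string Id, string DisplayName, string Language);

// Hypothetical optional interface, implemented only by clients whose provider can list voices.
public interface ISpeechVoiceProvider
{
    Task<SpeechVoice[]> GetSpeechVoicesAsync(CancellationToken cancellationToken = default);
}

// Stand-in provider with a fixed voice list, used here only to exercise the sketch.
public sealed class StaticSpeechVoiceProvider : ISpeechVoiceProvider
{
    private readonly SpeechVoice[] _voices;
    public StaticSpeechVoiceProvider(params SpeechVoice[] voices) => _voices = voices;
    public Task<SpeechVoice[]> GetSpeechVoicesAsync(CancellationToken cancellationToken = default)
        => Task.FromResult(_voices);
}

public static class SpeechVoiceDiscovery
{
    // Returns the provider's voices, or an empty array when the client
    // does not implement the optional interface.
    public static Task<SpeechVoice[]> GetVoicesOrEmptyAsync(
        object client, CancellationToken cancellationToken = default)
        => client is ISpeechVoiceProvider provider
            ? provider.GetSpeechVoicesAsync(cancellationToken)
            : Task.FromResult(Array.Empty<SpeechVoice>());
}
```

Keeping discovery on an optional interface would let providers without a voice-listing API, such as those named in the reply above, implement the core client without stubbing out a method they cannot support.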
This PR adds a comprehensive ITextToSpeechClient abstraction to Microsoft.Extensions.AI, the inverse of the existing ISpeechToTextClient.

Abstractions (Microsoft.Extensions.AI.Abstractions)
- ITextToSpeechClient - Core interface with GetAudioAsync / GetStreamingAudioAsync
- DelegatingTextToSpeechClient - Pipeline delegation base class
- TextToSpeechOptions - Options with ModelId, VoiceId, Language, AudioFormat, Speed, Pitch, Volume, and RawRepresentationFactory
- TextToSpeechResponse / TextToSpeechResponseUpdate - Response types with DataContent binary audio
- TextToSpeechResponseUpdateKind - Kind struct (SessionOpen, Error, AudioUpdating, AudioUpdated, SessionClose)
- TextToSpeechResponseUpdateExtensions - Coalescing extensions
- TextToSpeechClientMetadata / TextToSpeechClientExtensions - Metadata and GetService

Middleware (Microsoft.Extensions.AI)
- TextToSpeechClientBuilder - Builder pattern with DI integration
- ConfigureOptionsTextToSpeechClient - Options configuration middleware
- LoggingTextToSpeechClient - Logging middleware (skips binary audio serialization)
- OpenTelemetryTextToSpeechClient - OpenTelemetry tracing middleware

OpenAI Implementation (Microsoft.Extensions.AI.OpenAI)
- OpenAITextToSpeechClient wrapping AudioClient.GenerateSpeechAsync, exposed via AsITextToSpeechClient()

Tests