Add ITextToSpeechClient abstraction, middleware, and OpenAI implementation#7381

Merged
stephentoub merged 2 commits into dotnet:main from stephentoub:tts on Mar 10, 2026

Conversation

@stephentoub
Member

@stephentoub stephentoub commented Mar 10, 2026

This PR adds a comprehensive ITextToSpeechClient abstraction to Microsoft.Extensions.AI, the inverse of the existing ISpeechToTextClient.

Abstractions (Microsoft.Extensions.AI.Abstractions)

  • ITextToSpeechClient - Core interface with GetAudioAsync/GetStreamingAudioAsync
  • DelegatingTextToSpeechClient - Pipeline delegation base class
  • TextToSpeechOptions - Options with ModelId, VoiceId, Language, AudioFormat, Speed, Pitch, Volume, and RawRepresentationFactory
  • TextToSpeechResponse / TextToSpeechResponseUpdate - Response types with DataContent binary audio
  • TextToSpeechResponseUpdateKind - Kind struct (SessionOpen, Error, AudioUpdating, AudioUpdated, SessionClose)
  • TextToSpeechResponseUpdateExtensions - Coalescing extensions
  • TextToSpeechClientMetadata / TextToSpeechClientExtensions - Metadata and GetService
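Based only on the type and member names listed above, the abstraction's shape can be sketched as follows. This is a hypothetical reconstruction for illustration, not the PR's actual source; the authoritative definitions live in Microsoft.Extensions.AI.Abstractions.

```csharp
// Hypothetical sketch of the abstraction's shape, inferred from the names above.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public interface ITextToSpeechClient : IDisposable
{
    // Synthesizes the provided text into a single audio response.
    Task<TextToSpeechResponse> GetAudioAsync(
        string text, TextToSpeechOptions? options = null, CancellationToken cancellationToken = default);

    // Streams audio updates as they're produced by the provider.
    IAsyncEnumerable<TextToSpeechResponseUpdate> GetStreamingAudioAsync(
        string text, TextToSpeechOptions? options = null, CancellationToken cancellationToken = default);
}

public class TextToSpeechOptions
{
    public string? ModelId { get; set; }
    public string? VoiceId { get; set; }
    public string? Language { get; set; }
    public string? AudioFormat { get; set; }
    public float? Speed { get; set; }
    public float? Pitch { get; set; }
    public float? Volume { get; set; }
}

public class TextToSpeechResponse { /* carries DataContent binary audio */ }
public class TextToSpeechResponseUpdate { /* streaming chunk tagged with a Kind */ }
```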

Middleware (Microsoft.Extensions.AI)

  • TextToSpeechClientBuilder - Builder pattern with DI integration
  • ConfigureOptionsTextToSpeechClient - Options configuration middleware
  • LoggingTextToSpeechClient - Logging middleware (skips binary audio serialization)
  • OpenTelemetryTextToSpeechClient - OpenTelemetry tracing middleware
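The middleware all follows the delegating-client pattern used elsewhere in Microsoft.Extensions.AI: each component wraps an inner client and forwards calls it doesn't intercept. A minimal, self-contained sketch of that composition, using simplified stand-in types rather than the PR's actual signatures:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Simplified stand-in for the core abstraction (the real one returns TextToSpeechResponse).
public interface ITextToSpeechClient
{
    Task<byte[]> GetAudioAsync(string text, CancellationToken ct = default);
}

// Delegating base class: middleware derives from this, overrides what it needs,
// and forwards everything else to the inner client.
public abstract class DelegatingTextToSpeechClient : ITextToSpeechClient
{
    protected ITextToSpeechClient InnerClient { get; }
    protected DelegatingTextToSpeechClient(ITextToSpeechClient inner) => InnerClient = inner;

    public virtual Task<byte[]> GetAudioAsync(string text, CancellationToken ct = default)
        => InnerClient.GetAudioAsync(text, ct);
}

// Example middleware mirroring LoggingTextToSpeechClient: logs around the call,
// without serializing the binary audio payload itself.
public sealed class LoggingClient : DelegatingTextToSpeechClient
{
    public LoggingClient(ITextToSpeechClient inner) : base(inner) { }

    public override async Task<byte[]> GetAudioAsync(string text, CancellationToken ct = default)
    {
        Console.WriteLine($"TTS request: {text.Length} chars");
        byte[] audio = await InnerClient.GetAudioAsync(text, ct);
        Console.WriteLine($"TTS response: {audio.Length} bytes");
        return audio;
    }
}

// Fake terminal client for demonstration.
public sealed class FakeClient : ITextToSpeechClient
{
    public Task<byte[]> GetAudioAsync(string text, CancellationToken ct = default)
        => Task.FromResult(new byte[] { 1, 2, 3 });
}
```

TextToSpeechClientBuilder presumably composes chains of such wrappers the same way ChatClientBuilder does for IChatClient.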

OpenAI Implementation (Microsoft.Extensions.AI.OpenAI)

  • OpenAITextToSpeechClient wrapping AudioClient.GenerateSpeechAsync, exposed via AsITextToSpeechClient()
  • Maps VoiceId, Speed, AudioFormat; supports RawRepresentationFactory for full SDK escape hatch
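Assuming it follows the adapter pattern of the library's other AsI*Client() extensions, consumption might look like the sketch below. This requires the OpenAI and Microsoft.Extensions.AI.OpenAI packages; method and option names are taken from this PR description and exact signatures may differ.

```csharp
// Sketch only: names follow the PR description, not verified signatures.
using System;
using Microsoft.Extensions.AI;
using OpenAI.Audio;

AudioClient audioClient = new("tts-1", Environment.GetEnvironmentVariable("OPENAI_API_KEY"));
ITextToSpeechClient tts = audioClient.AsITextToSpeechClient();

TextToSpeechResponse response = await tts.GetAudioAsync(
    "Hello from Microsoft.Extensions.AI!",
    new TextToSpeechOptions { VoiceId = "alloy", Speed = 1.0f });
```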

Tests

  • 9 abstraction tests, 5 middleware tests, 16 OpenAI unit tests, 5 integration tests
  • All pass across net462, net8.0, net9.0, net10.0

@stephentoub stephentoub requested review from a team as code owners March 10, 2026 14:10
Copilot AI review requested due to automatic review settings March 10, 2026 14:10
@github-actions github-actions bot added the area-ai Microsoft.Extensions.AI libraries label Mar 10, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a comprehensive ITextToSpeechClient abstraction to Microsoft.Extensions.AI, mirroring the existing ISpeechToTextClient pattern. It introduces core abstraction types, middleware components (logging, OpenTelemetry, options configuration), DI integration, and an OpenAI implementation backed by AudioClient.GenerateSpeechAsync.

Changes:

  • New abstraction types in Microsoft.Extensions.AI.Abstractions: ITextToSpeechClient, DelegatingTextToSpeechClient, TextToSpeechOptions, TextToSpeechResponse, TextToSpeechResponseUpdate, TextToSpeechResponseUpdateKind, TextToSpeechClientMetadata, TextToSpeechClientExtensions, and TextToSpeechResponseUpdateExtensions.
  • Middleware pipeline in Microsoft.Extensions.AI: TextToSpeechClientBuilder, DI service collection extensions, ConfigureOptionsTextToSpeechClient, LoggingTextToSpeechClient, and OpenTelemetryTextToSpeechClient with builder extensions.
  • OpenAI implementation (OpenAITextToSpeechClient) wrapping AudioClient with voice/speed/format mapping and non-streaming fallback for GetStreamingAudioAsync.

Reviewed changes

Copilot reviewed 43 out of 43 changed files in this pull request and generated no comments.

Summary per file (File — Description):
src/.../TextToSpeech/ITextToSpeechClient.cs Core interface with GetAudioAsync, GetStreamingAudioAsync, GetService
src/.../TextToSpeech/DelegatingTextToSpeechClient.cs Base class for pipeline delegation
src/.../TextToSpeech/TextToSpeechOptions.cs Options with ModelId, VoiceId, Language, AudioFormat, Speed, Pitch, Volume, RawRepresentationFactory
src/.../TextToSpeech/TextToSpeechResponse.cs Response type with Contents, Usage, ToTextToSpeechResponseUpdates
src/.../TextToSpeech/TextToSpeechResponseUpdate.cs Streaming update type with Kind, Contents
src/.../TextToSpeech/TextToSpeechResponseUpdateKind.cs Kind struct (SessionOpen, Error, AudioUpdating, AudioUpdated, SessionClose)
src/.../TextToSpeech/TextToSpeechResponseUpdateExtensions.cs Coalescing extensions for updates → response
src/.../TextToSpeech/TextToSpeechClientMetadata.cs Metadata with ProviderName, ProviderUri, DefaultModelId
src/.../TextToSpeech/TextToSpeechClientExtensions.cs GetService extension
src/.../AI/TextToSpeech/TextToSpeechClientBuilder.cs Builder pattern for pipelines
src/.../AI/TextToSpeech/TextToSpeechClientBuilderServiceCollectionExtensions.cs DI registration (keyed + unkeyed)
src/.../AI/TextToSpeech/TextToSpeechClientBuilderTextToSpeechClientExtensions.cs AsBuilder() extension
src/.../AI/TextToSpeech/ConfigureOptionsTextToSpeechClient.cs Options configuration middleware
src/.../AI/TextToSpeech/ConfigureOptionsTextToSpeechClientBuilderExtensions.cs ConfigureOptions builder extension
src/.../AI/TextToSpeech/LoggingTextToSpeechClient.cs Logging middleware (skips binary audio serialization)
src/.../AI/TextToSpeech/LoggingTextToSpeechClientBuilderExtensions.cs UseLogging builder extension
src/.../AI/TextToSpeech/OpenTelemetryTextToSpeechClient.cs OpenTelemetry tracing/metrics middleware
src/.../AI/TextToSpeech/OpenTelemetryTextToSpeechClientBuilderExtensions.cs UseOpenTelemetry builder extension
src/.../AI/OpenTelemetryConsts.cs Adds TypeAudio constant
src/.../AI.OpenAI/OpenAITextToSpeechClient.cs OpenAI implementation with format mapping
src/.../AI.OpenAI/OpenAIClientExtensions.cs AsITextToSpeechClient() extension on AudioClient
src/.../AI.Abstractions/Utilities/AIJsonUtilities.Defaults.cs Registers TTS types for source-gen JSON serialization
src/Shared/DiagnosticIds/DiagnosticIds.cs Adds AITextToSpeech experimental diagnostic ID
test/.../TestTextToSpeechClient.cs Test helper client
test/.../TestJsonSerializerContext.cs Adds TTS types to test serialization context
test/.../TextToSpeech/*Tests.cs Comprehensive tests for all new types
test/.../OpenAITextToSpeechClientTests.cs Unit tests for OpenAI implementation
test/.../OpenAITextToSpeechClientIntegrationTests.cs Integration tests
test/.../TextToSpeechClientIntegrationTests.cs Base integration test class

Member

@ericstj ericstj left a comment


This looks pretty good and is a rather straightforward adaptation of established patterns.

@stephentoub stephentoub enabled auto-merge (squash) March 10, 2026 19:58
@stephentoub stephentoub merged commit cbde52a into dotnet:main Mar 10, 2026
6 checks passed
@MikeAlhayek

@stephentoub I am wondering if we care to add another interface for enumerating the supported voices? The TTS service should be able to support a specific voice.

@stephentoub stephentoub deleted the tts branch March 13, 2026 19:30
@stephentoub
Member Author

@stephentoub I am wondering if we care to add another interface for enumerating the supported voices? The TTS service should be able to support a specific voice.

I don't understand the question. Can you elaborate?

@MikeAlhayek

@stephentoub

As you know, TTS providers typically offer multiple voices that can be used when synthesizing audio. In order to request synthesis using a specific voice, we need a way to discover which voices a provider supports.

I recently worked on a similar initiative to add TTS support and introduced a Task<SpeechVoice[]> GetSpeechVoicesAsync() method that returns the list of supported voices (when the provider exposes them). Having such a method is very useful because it allows consumers to know which voices are available and request the synthesized utterance using one of those voices.

I’m wondering if we could add a similar method to ITextToSpeechClient to expose the list of available voices. I'm not entirely sure whether ITextToSpeechClient is the best place for this or if it should be introduced through a separate interface. That said, adding it directly to ITextToSpeechClient might still make sense since it is closely related to the same client and its capabilities.

I did this here:
https://github.com/CrestApps/CrestApps.OrchardCore/blob/eedb7ecf8550509fb8ec222856642faf44d580ab/src/Abstractions/CrestApps.OrchardCore.AI.Abstractions/IAIClientProvider.cs#L54-L68
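The proposal above could be sketched as a small companion interface. This is a hypothetical shape for discussion only; neither the record nor the interface below is part of this PR, and the names are illustrative.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical description of an available voice; not part of this PR.
public sealed record SpeechVoice(string Id, string? DisplayName, string? Language);

// Hypothetical companion interface for providers that can enumerate their voices.
// Providers without a voice-listing API would simply not implement it, which is
// one argument for keeping it separate from ITextToSpeechClient.
public interface ISpeechVoiceProvider
{
    Task<SpeechVoice[]> GetSpeechVoicesAsync(CancellationToken cancellationToken = default);
}
```

A consumer could then probe for it via GetService, the same discovery mechanism the existing clients use for optional capabilities.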

@stephentoub
Member Author

Thanks. My concern is, at least as far as I'm aware, a bunch of services don't actually expose that, e.g. to my knowledge OpenAI doesn't provide an API for retrieving the list of voices, Gemini doesn't appear to, etc. Am I just missing it?

@MikeAlhayek

I hear you. However, many providers do expose these voices, and in many real-world use cases they are crucial. I think all providers will probably expose multiple voices as demand grows. I use ElevenLabs and Azure TTS, and both have ways to retrieve them. Azure TTS does not support OpenAI as far as I know, yet they still expose them.
