
Commit 293be37

feat(rag): share STT and vision providers between voice pipeline and multimodal RAG
Add SpeechProviderAdapter to bridge SpeechToTextProvider (voice pipeline) to ISpeechToTextProvider (multimodal indexer). Add LLMVisionAdapter re-export and createMultimodalIndexerFromResolver factory that wires both STT and vision from the existing SpeechProviderResolver and VisionPipeline into a ready-to-use MultimodalIndexer. 28 new tests covering adapter mapping, error propagation, factory resolution, and provider precedence rules.
1 parent 4011279 commit 293be37

8 files changed

Lines changed: 1067 additions & 0 deletions
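The commit's tests cover "provider precedence rules" when wiring resolver-provided STT into the indexer. A minimal sketch of one plausible precedence rule — the option field names and the `pickSTT` helper are hypothetical illustrations, not the actual factory implementation:

```typescript
// Hypothetical sketch: an explicitly supplied provider should win over one
// derived from the resolver. Names below are illustrative, not the real API.

interface ISpeechToTextProvider {
  transcribe(audio: Uint8Array, language?: string): Promise<string>;
}

interface FactoryOptions {
  sttProvider?: ISpeechToTextProvider;       // explicit override
  resolveSTT?: () => ISpeechToTextProvider;  // resolver-backed fallback
}

function pickSTT(opts: FactoryOptions): ISpeechToTextProvider | undefined {
  // Precedence: explicit provider > resolver-derived provider > none.
  return opts.sttProvider ?? opts.resolveSTT?.();
}

const explicit: ISpeechToTextProvider = { transcribe: async () => 'explicit' };
const resolved: ISpeechToTextProvider = { transcribe: async () => 'resolved' };

const a = pickSTT({ sttProvider: explicit, resolveSTT: () => resolved });
const b = pickSTT({ resolveSTT: () => resolved });
const c = pickSTT({});
```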

src/rag/index.ts

Lines changed: 6 additions & 0 deletions

```diff
@@ -193,6 +193,12 @@ export {
 // ============================================================================

 export { MultimodalIndexer } from './multimodal/index.js';
+export { SpeechProviderAdapter } from './multimodal/index.js';
+export { LLMVisionAdapter, type LLMVisionAdapterConfig } from './multimodal/index.js';
+export {
+  createMultimodalIndexerFromResolver,
+  type MultimodalIndexerFromResolverOptions,
+} from './multimodal/index.js';

 export type {
   ContentModality,
```
src/rag/multimodal/LLMVisionAdapter.ts (new file)

Lines changed: 49 additions & 0 deletions

````typescript
/**
 * @module rag/multimodal/LLMVisionAdapter
 *
 * Wraps a vision-capable LLM as an {@link IVisionProvider} for the
 * multimodal RAG indexer.
 *
 * Unlike the full {@link VisionPipeline} which runs OCR, handwriting,
 * and document-AI tiers before escalating to cloud, this adapter goes
 * straight to the LLM — making it the simplest path for teams that
 * only need cloud vision and don't want the multi-tier pipeline.
 *
 * ## Relationship to LLMVisionProvider
 *
 * The `core/vision/providers/LLMVisionProvider` class fills the same
 * role and already exists. This file re-exports it under the multimodal
 * module namespace so consumers importing from `rag/multimodal` can
 * access it without reaching into `core/vision/`. The underlying
 * implementation is identical — this is a convenience re-export plus
 * an alias type.
 *
 * @see {@link LLMVisionProvider} for the implementation.
 * @see {@link PipelineVisionProvider} for the full-pipeline alternative.
 * @see {@link IVisionProvider} for the interface contract.
 *
 * @example
 * ```typescript
 * import { LLMVisionAdapter } from './LLMVisionAdapter.js';
 *
 * const vision = new LLMVisionAdapter({
 *   provider: 'openai',
 *   model: 'gpt-4o',
 *   prompt: 'Describe this image for a RAG search index.',
 * });
 *
 * const indexer = new MultimodalIndexer({
 *   embeddingManager,
 *   vectorStore,
 *   visionProvider: vision,
 * });
 * ```
 */

// Re-export the existing LLMVisionProvider from core/vision so that
// consumers importing from the multimodal module don't need to reach
// into core/vision/ directly. The underlying class is unchanged.
export {
  LLMVisionProvider as LLMVisionAdapter,
  type LLMVisionProviderConfig as LLMVisionAdapterConfig,
} from '../../core/vision/providers/LLMVisionProvider.js';
````
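The alias re-export resolves to the identical constructor rather than a wrapper class. A self-contained miniature of that pattern — the `VisionProvider` class here is a stand-in, not the real `LLMVisionProvider`:

```typescript
// Miniature of the alias re-export pattern: a value alias plus a type alias
// in one file, equivalent to `export { X as Y, type XConfig as YConfig }`.
// The class body is a stand-in for illustration only.

class VisionProvider {
  constructor(public readonly provider: string) {}
  describeImage(_image: Uint8Array): Promise<string> {
    return Promise.resolve(`described by ${this.provider}`);
  }
}

// Value alias (same constructor) and type alias (same instance type).
const VisionAdapter = VisionProvider;
type VisionAdapter = VisionProvider;

const adapter: VisionAdapter = new VisionAdapter('openai');
const sameClass = VisionAdapter === VisionProvider; // identity, not a subclass
```

Because the alias is the same constructor, `instanceof` checks and spec assertions like `expect(LLMVisionAdapter).toBe(LLMVisionProvider)` hold without any extra wiring.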
src/rag/multimodal/SpeechProviderAdapter.ts (new file)

Lines changed: 167 additions & 0 deletions

````typescript
/**
 * @module rag/multimodal/SpeechProviderAdapter
 *
 * Adapts the voice-pipeline's {@link SpeechToTextProvider} to the narrow
 * {@link ISpeechToTextProvider} interface expected by the multimodal RAG
 * indexer.
 *
 * ## Why this adapter exists
 *
 * The speech subsystem (`src/speech/`) and the multimodal RAG pipeline
 * (`src/rag/multimodal/`) each define their own STT contract:
 *
 * | Contract                  | Input                | Output                      |
 * |---------------------------|----------------------|-----------------------------|
 * | `SpeechToTextProvider`    | `SpeechAudioInput`   | `SpeechTranscriptionResult` |
 * | `ISpeechToTextProvider`   | `Buffer`             | `string`                    |
 *
 * The voice pipeline's providers (Whisper, Deepgram, AssemblyAI, Azure)
 * implement the richer `SpeechToTextProvider` contract. This adapter
 * wraps any of them so the multimodal indexer can consume them without
 * requiring separate STT configuration.
 *
 * ## Mapping details
 *
 * - **Input**: The raw `Buffer` is wrapped in a `SpeechAudioInput` with
 *   a default MIME type of `audio/wav`. The optional `language` parameter
 *   is forwarded via `SpeechTranscriptionOptions.language`.
 *
 * - **Output**: The full `SpeechTranscriptionResult` is reduced to just
 *   the `text` string. Rich metadata (segments, confidence, usage) is
 *   intentionally discarded because the indexer only needs the text for
 *   embedding generation.
 *
 * @see {@link SpeechToTextProvider} for the voice pipeline contract.
 * @see {@link ISpeechToTextProvider} for the multimodal indexer contract.
 * @see {@link SpeechProviderResolver} for resolving STT providers.
 *
 * @example
 * ```typescript
 * import { SpeechProviderResolver } from '../../speech/SpeechProviderResolver.js';
 * import { SpeechProviderAdapter } from './SpeechProviderAdapter.js';
 *
 * const resolver = new SpeechProviderResolver();
 * await resolver.refresh();
 * const stt = resolver.resolveSTT();
 * const adapter = new SpeechProviderAdapter(stt);
 *
 * const indexer = new MultimodalIndexer({ sttProvider: adapter, ... });
 * ```
 */

import type { SpeechToTextProvider } from '../../speech/types.js';
import type { ISpeechToTextProvider } from './types.js';

// ---------------------------------------------------------------------------
// Implementation
// ---------------------------------------------------------------------------

/**
 * Bridges the voice-pipeline's `SpeechToTextProvider` to the multimodal
 * indexer's `ISpeechToTextProvider` interface.
 *
 * Converts raw `Buffer` audio into the `SpeechAudioInput` shape expected
 * by voice providers, forwards the language hint through
 * `SpeechTranscriptionOptions`, and extracts the plain transcript text
 * from the rich `SpeechTranscriptionResult`.
 *
 * @example
 * ```typescript
 * const whisper = resolver.resolveSTT();
 * const adapted = new SpeechProviderAdapter(whisper);
 *
 * // Now usable by the multimodal indexer:
 * const text = await adapted.transcribe(audioBuffer, 'en');
 * ```
 */
export class SpeechProviderAdapter implements ISpeechToTextProvider {
  /**
   * The underlying voice-pipeline STT provider being adapted.
   * Held as a readonly reference — the caller retains ownership.
   */
  private readonly _provider: SpeechToTextProvider;

  /**
   * Default MIME type applied to raw audio buffers when no format
   * information is available. WAV is the most universally supported
   * format across STT providers.
   */
  private readonly _defaultMimeType: string;

  /**
   * Create a new adapter wrapping a voice-pipeline STT provider.
   *
   * @param provider - A configured `SpeechToTextProvider` instance
   *   (e.g. Whisper, Deepgram, AssemblyAI, Azure Speech).
   * @param defaultMimeType - MIME type to assume for raw audio buffers.
   *   Defaults to `'audio/wav'` which is accepted by all major STT
   *   providers. Override to `'audio/mpeg'` or `'audio/ogg'` when
   *   indexing MP3/OGG files.
   *
   * @throws {Error} If provider is null or undefined.
   *
   * @example
   * ```typescript
   * const adapter = new SpeechProviderAdapter(whisperProvider);
   * const mp3Adapter = new SpeechProviderAdapter(whisperProvider, 'audio/mpeg');
   * ```
   */
  constructor(provider: SpeechToTextProvider, defaultMimeType = 'audio/wav') {
    if (!provider) {
      throw new Error(
        'SpeechProviderAdapter: a SpeechToTextProvider instance is required.',
      );
    }
    this._provider = provider;
    this._defaultMimeType = defaultMimeType;
  }

  /**
   * Transcribe audio data to text.
   *
   * Wraps the raw buffer in a `SpeechAudioInput` and delegates to the
   * underlying voice-pipeline provider. The rich transcription result
   * is reduced to the plain text string that the multimodal indexer
   * needs for embedding generation.
   *
   * @param audio - Raw audio data as a Buffer (WAV, MP3, OGG, etc.).
   * @param language - Optional BCP-47 language code hint for improved
   *   transcription accuracy (e.g. `'en'`, `'es'`, `'ja'`).
   * @returns The transcribed text content.
   *
   * @throws {Error} If the underlying STT provider fails.
   *
   * @example
   * ```typescript
   * const transcript = await adapter.transcribe(wavBuffer);
   * const spanishTranscript = await adapter.transcribe(audioBuffer, 'es');
   * ```
   */
  async transcribe(audio: Buffer, language?: string): Promise<string> {
    const result = await this._provider.transcribe(
      {
        data: audio,
        mimeType: this._defaultMimeType,
      },
      language ? { language } : undefined,
    );
    return result.text;
  }

  /**
   * Get the display name of the underlying STT provider.
   *
   * Useful for logging and diagnostics — lets callers identify which
   * voice-pipeline provider is actually handling transcription.
   *
   * @returns The provider's display name or ID string.
   *
   * @example
   * ```typescript
   * console.log(`STT via: ${adapter.getProviderName()}`); // "openai-whisper"
   * ```
   */
  getProviderName(): string {
    return this._provider.displayName ?? this._provider.id;
  }
}
````
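The adapter's input/output mapping can be exercised end to end with a fake provider. Everything below is illustrative — the trimmed interface shapes mirror the contracts described above, and the fake Whisper object simply echoes what it receives so the mapping is observable:

```typescript
// Sketch of the Buffer -> SpeechAudioInput -> string mapping, using trimmed
// stand-in types and a fake provider (not the real Whisper integration).

interface SpeechAudioInput { data: Uint8Array; mimeType: string; }
interface SpeechTranscriptionOptions { language?: string; }
interface SpeechTranscriptionResult { text: string; confidence?: number; }

const fakeWhisper = {
  id: 'fake-whisper',
  displayName: 'Fake Whisper',
  async transcribe(
    input: SpeechAudioInput,
    opts?: SpeechTranscriptionOptions,
  ): Promise<SpeechTranscriptionResult> {
    // A real provider would decode input.data; the fake echoes its inputs.
    return {
      text: `${input.mimeType}:${opts?.language ?? 'auto'}`,
      confidence: 0.99,
    };
  },
};

// The adapter's transcribe() boils down to this three-step mapping:
async function adaptedTranscribe(
  audio: Uint8Array,
  language?: string,
): Promise<string> {
  const result = await fakeWhisper.transcribe(
    { data: audio, mimeType: 'audio/wav' },  // wrap with default MIME type
    language ? { language } : undefined,     // forward the language hint
  );
  return result.text;                        // discard rich metadata
}

const out = await adaptedTranscribe(new Uint8Array([1, 2, 3]), 'es');
// out === 'audio/wav:es' — MIME default and language hint both reached the provider
```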
src/rag/multimodal/__tests__/LLMVisionAdapter.spec.ts (new file)

Lines changed: 66 additions & 0 deletions

```typescript
/**
 * @module rag/multimodal/__tests__/LLMVisionAdapter.spec
 *
 * Unit tests for the {@link LLMVisionAdapter} re-export.
 *
 * The LLMVisionAdapter is a convenience re-export of `LLMVisionProvider`
 * from `core/vision/providers/`. These tests verify that:
 *
 * - The re-export resolves to the correct class
 * - The adapter implements IVisionProvider
 * - Constructor validates required config
 *
 * More thorough LLMVisionProvider tests live in
 * `core/vision/__tests__/LLMVisionProvider.spec.ts`. This file only
 * validates the re-export wiring and basic contract.
 */

import { describe, it, expect } from 'vitest';
import { LLMVisionAdapter, type LLMVisionAdapterConfig } from '../LLMVisionAdapter.js';
import { LLMVisionProvider } from '../../../core/vision/providers/LLMVisionProvider.js';

// ---------------------------------------------------------------------------
// Tests
// ---------------------------------------------------------------------------

describe('LLMVisionAdapter', () => {
  it('should be the same class as LLMVisionProvider', () => {
    // The re-export should resolve to the exact same constructor
    expect(LLMVisionAdapter).toBe(LLMVisionProvider);
  });

  it('should throw if provider name is missing', () => {
    expect(() => new LLMVisionAdapter({ provider: '' })).toThrow(
      /provider name is required/,
    );
  });

  it('should construct with valid config', () => {
    const config: LLMVisionAdapterConfig = {
      provider: 'openai',
      model: 'gpt-4o',
    };

    const adapter = new LLMVisionAdapter(config);
    expect(adapter).toBeInstanceOf(LLMVisionProvider);
  });

  it('should have a describeImage method (implements IVisionProvider)', () => {
    const adapter = new LLMVisionAdapter({ provider: 'openai' });
    expect(typeof adapter.describeImage).toBe('function');
  });

  it('should accept custom prompt in config', () => {
    // Should not throw — just verifying the config shape is accepted
    expect(
      () =>
        new LLMVisionAdapter({
          provider: 'anthropic',
          model: 'claude-sonnet-4-20250514',
          prompt: 'Describe this image for a search index.',
          apiKey: 'test-key',
          baseUrl: 'https://custom.endpoint.com',
        }),
    ).not.toThrow();
  });
});
```
