7 changes: 7 additions & 0 deletions .changeset/dull-ligers-bow.md
@@ -0,0 +1,7 @@
---
'firebase': minor
'@firebase/ai': minor
---

Deprecate `sendMediaChunks()` and `sendMediaStream()`. Instead, use the new methods added to the `LiveSession` class.
Add `sendTextRealtime()`, `sendAudioRealtime()`, and `sendVideoRealtime()` to the `LiveSession` class.
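
A minimal migration sketch (the session variable and payload are illustrative):

```javascript
// Before (deprecated): one method for all realtime media.
await liveSession.sendMediaChunks([{ mimeType: 'audio/pcm', data: pcmData }]);

// After: media-specific realtime methods.
await liveSession.sendAudioRealtime({ mimeType: 'audio/pcm', data: pcmData });
await liveSession.sendTextRealtime('Hello!');
```
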
5 changes: 5 additions & 0 deletions .changeset/fast-rocks-sing.md
@@ -0,0 +1,5 @@
---
'@firebase/ai': minor
---

Add support for audio transcriptions in the Live API.
18 changes: 18 additions & 0 deletions common/api-review/ai.api.md
@@ -92,6 +92,10 @@ export interface AudioConversationController {
stop: () => Promise<void>;
}

// @public
export interface AudioTranscriptionConfig {
}

// @public
export abstract class Backend {
protected constructor(type: BackendType);
@@ -922,7 +926,9 @@ export interface LanguageModelPromptOptions {
// @beta
export interface LiveGenerationConfig {
frequencyPenalty?: number;
inputAudioTranscription?: AudioTranscriptionConfig;
maxOutputTokens?: number;
outputAudioTranscription?: AudioTranscriptionConfig;
presencePenalty?: number;
responseModalities?: ResponseModality[];
speechConfig?: SpeechConfig;
@@ -975,8 +981,10 @@ export type LiveResponseType = (typeof LiveResponseType)[keyof typeof LiveRespon

// @beta
export interface LiveServerContent {
inputTranscription?: Transcription;
interrupted?: boolean;
modelTurn?: Content;
outputTranscription?: Transcription;
turnComplete?: boolean;
// (undocumented)
type: 'serverContent';
@@ -1005,9 +1013,14 @@ export class LiveSession {
isClosed: boolean;
receive(): AsyncGenerator<LiveServerContent | LiveServerToolCall | LiveServerToolCallCancellation>;
send(request: string | Array<string | Part>, turnComplete?: boolean): Promise<void>;
sendAudioRealtime(blob: GenerativeContentBlob): Promise<void>;
sendFunctionResponses(functionResponses: FunctionResponse[]): Promise<void>;
// @deprecated
sendMediaChunks(mediaChunks: GenerativeContentBlob[]): Promise<void>;
// @deprecated (undocumented)
sendMediaStream(mediaChunkStream: ReadableStream<GenerativeContentBlob>): Promise<void>;
sendTextRealtime(text: string): Promise<void>;
sendVideoRealtime(blob: GenerativeContentBlob): Promise<void>;
}

// @public
@@ -1337,6 +1350,11 @@ export interface ToolConfig {
functionCallingConfig?: FunctionCallingConfig;
}

// @beta
export interface Transcription {
text?: string;
}

// @public
export type TypedSchema = IntegerSchema | NumberSchema | StringSchema | BooleanSchema | ObjectSchema | ArraySchema | AnyOfSchema;

4 changes: 4 additions & 0 deletions docs-devsite/_toc.yaml
@@ -18,6 +18,8 @@ toc:
path: /docs/reference/js/ai.arrayschema.md
- title: AudioConversationController
path: /docs/reference/js/ai.audioconversationcontroller.md
- title: AudioTranscriptionConfig
path: /docs/reference/js/ai.audiotranscriptionconfig.md
- title: Backend
path: /docs/reference/js/ai.backend.md
- title: BaseParams
@@ -202,6 +204,8 @@ toc:
path: /docs/reference/js/ai.thinkingconfig.md
- title: ToolConfig
path: /docs/reference/js/ai.toolconfig.md
- title: Transcription
path: /docs/reference/js/ai.transcription.md
- title: URLContext
path: /docs/reference/js/ai.urlcontext.md
- title: URLContextMetadata
19 changes: 19 additions & 0 deletions docs-devsite/ai.audiotranscriptionconfig.md
@@ -0,0 +1,19 @@
Project: /docs/reference/js/_project.yaml
Book: /docs/reference/_book.yaml
page_type: reference

{% comment %}
DO NOT EDIT THIS FILE!
This is generated by the JS SDK team, and any local changes will be
overwritten. Changes should be made in the source code at
https://github.com/firebase/firebase-js-sdk
{% endcomment %}

# AudioTranscriptionConfig interface
The audio transcription configuration.

<b>Signature:</b>

```typescript
export interface AudioTranscriptionConfig
```
32 changes: 32 additions & 0 deletions docs-devsite/ai.livegenerationconfig.md
@@ -26,7 +26,9 @@ export interface LiveGenerationConfig
| Property | Type | Description |
| --- | --- | --- |
| [frequencyPenalty](./ai.livegenerationconfig.md#livegenerationconfigfrequencypenalty) | number | <b><i>(Public Preview)</i></b> Frequency penalties. |
| [inputAudioTranscription](./ai.livegenerationconfig.md#livegenerationconfiginputaudiotranscription) | [AudioTranscriptionConfig](./ai.audiotranscriptionconfig.md#audiotranscriptionconfig_interface) | <b><i>(Public Preview)</i></b> Enables transcription of audio input.<!-- -->When enabled, the model will respond with transcriptions of your audio input in the <code>inputTranscription</code> property in [LiveServerContent](./ai.liveservercontent.md#liveservercontent_interface) messages. Note that the transcriptions are broken up across messages, so you may only receive small amounts of text per message. For example, if you ask the model "How are you today?", the model may transcribe that input across three messages, broken up as "How a", "re yo", "u today?". |
| [maxOutputTokens](./ai.livegenerationconfig.md#livegenerationconfigmaxoutputtokens) | number | <b><i>(Public Preview)</i></b> Specifies the maximum number of tokens that can be generated in the response. The number of tokens per word varies depending on the output language. This is unbounded by default. |
| [outputAudioTranscription](./ai.livegenerationconfig.md#livegenerationconfigoutputaudiotranscription) | [AudioTranscriptionConfig](./ai.audiotranscriptionconfig.md#audiotranscriptionconfig_interface) | <b><i>(Public Preview)</i></b> Enables transcription of audio output.<!-- -->When enabled, the model will respond with transcriptions of its audio output in the <code>outputTranscription</code> property in [LiveServerContent](./ai.liveservercontent.md#liveservercontent_interface) messages. Note that the transcriptions are broken up across messages, so you may only receive small amounts of text per message. For example, if the model says "How are you today?", the model may transcribe that output across three messages, broken up as "How a", "re yo", "u today?". |
| [presencePenalty](./ai.livegenerationconfig.md#livegenerationconfigpresencepenalty) | number | <b><i>(Public Preview)</i></b> Positive penalties. |
| [responseModalities](./ai.livegenerationconfig.md#livegenerationconfigresponsemodalities) | [ResponseModality](./ai.md#responsemodality)<!-- -->\[\] | <b><i>(Public Preview)</i></b> The modalities of the response. |
| [speechConfig](./ai.livegenerationconfig.md#livegenerationconfigspeechconfig) | [SpeechConfig](./ai.speechconfig.md#speechconfig_interface) | <b><i>(Public Preview)</i></b> Configuration for speech synthesis. |
@@ -47,6 +49,21 @@ Frequency penalties.
frequencyPenalty?: number;
```

## LiveGenerationConfig.inputAudioTranscription

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
>

Enables transcription of audio input.

When enabled, the model will respond with transcriptions of your audio input in the `inputTranscription` property in [LiveServerContent](./ai.liveservercontent.md#liveservercontent_interface) messages. Note that the transcriptions are broken up across messages, so you may only receive small amounts of text per message. For example, if you ask the model "How are you today?", the model may transcribe that input across three messages, broken up as "How a", "re yo", "u today?".

<b>Signature:</b>

```typescript
inputAudioTranscription?: AudioTranscriptionConfig;
```
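
### Example

A minimal sketch enabling transcription in both directions (this assumes an initialized `firebaseApp`; the model name and setup are illustrative, not prescriptive):

```javascript
import { getAI, getLiveGenerativeModel, ResponseModality } from 'firebase/ai';

// Empty config objects enable transcription of the user's audio input
// and of the model's audio output.
const model = getLiveGenerativeModel(getAI(firebaseApp), {
  model: 'gemini-live-2.5-flash-preview', // illustrative model name
  generationConfig: {
    responseModalities: [ResponseModality.AUDIO],
    inputAudioTranscription: {},
    outputAudioTranscription: {},
  },
});
const session = await model.connect();
```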

## LiveGenerationConfig.maxOutputTokens

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
@@ -60,6 +77,21 @@ Specifies the maximum number of tokens that can be generated in the response. Th
maxOutputTokens?: number;
```

## LiveGenerationConfig.outputAudioTranscription

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
>

Enables transcription of audio output.

When enabled, the model will respond with transcriptions of its audio output in the `outputTranscription` property in [LiveServerContent](./ai.liveservercontent.md#liveservercontent_interface) messages. Note that the transcriptions are broken up across messages, so you may only receive small amounts of text per message. For example, if the model says "How are you today?", the model may transcribe that output across three messages, broken up as "How a", "re yo", "u today?".

<b>Signature:</b>

```typescript
outputAudioTranscription?: AudioTranscriptionConfig;
```

## LiveGenerationConfig.presencePenalty

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
28 changes: 28 additions & 0 deletions docs-devsite/ai.liveservercontent.md
@@ -25,11 +25,26 @@ export interface LiveServerContent

| Property | Type | Description |
| --- | --- | --- |
| [inputTranscription](./ai.liveservercontent.md#liveservercontentinputtranscription) | [Transcription](./ai.transcription.md#transcription_interface) | <b><i>(Public Preview)</i></b> Transcription of the audio that was input to the model. |
| [interrupted](./ai.liveservercontent.md#liveservercontentinterrupted) | boolean | <b><i>(Public Preview)</i></b> Indicates whether the model was interrupted by the client. An interruption occurs when the client sends a message before the model finishes its turn. This is <code>undefined</code> if the model was not interrupted. |
| [modelTurn](./ai.liveservercontent.md#liveservercontentmodelturn) | [Content](./ai.content.md#content_interface) | <b><i>(Public Preview)</i></b> The content that the model has generated as part of the current conversation with the user. |
| [outputTranscription](./ai.liveservercontent.md#liveservercontentoutputtranscription) | [Transcription](./ai.transcription.md#transcription_interface) | <b><i>(Public Preview)</i></b> Transcription of the audio output from the model. |
| [turnComplete](./ai.liveservercontent.md#liveservercontentturncomplete) | boolean | <b><i>(Public Preview)</i></b> Indicates whether the turn is complete. This is <code>undefined</code> if the turn is not complete. |
| [type](./ai.liveservercontent.md#liveservercontenttype) | 'serverContent' | <b><i>(Public Preview)</i></b> |

## LiveServerContent.inputTranscription

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
>

Transcription of the audio that was input to the model.

<b>Signature:</b>

```typescript
inputTranscription?: Transcription;
```

## LiveServerContent.interrupted

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
@@ -56,6 +71,19 @@ The content that the model has generated as part of the current conversation wit
modelTurn?: Content;
```

## LiveServerContent.outputTranscription

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
>

Transcription of the audio output from the model.

<b>Signature:</b>

```typescript
outputTranscription?: Transcription;
```
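
### Example

A sketch of accumulating the transcription fragments from a session's `receive()` loop (variable names are illustrative):

```javascript
// Transcription text arrives in small fragments across messages,
// so accumulate it until the turn completes.
let inputText = '';
let outputText = '';
for await (const message of liveSession.receive()) {
  if (message.type === 'serverContent') {
    inputText += message.inputTranscription?.text ?? '';
    outputText += message.outputTranscription?.text ?? '';
    if (message.turnComplete) {
      console.log('You said:', inputText);
      console.log('Model said:', outputText);
      inputText = outputText = '';
    }
  }
}
```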

## LiveServerContent.turnComplete

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
130 changes: 128 additions & 2 deletions docs-devsite/ai.livesession.md
@@ -39,9 +39,12 @@ export declare class LiveSession
| [close()](./ai.livesession.md#livesessionclose) | | <b><i>(Public Preview)</i></b> Closes this session. All methods on this session will throw an error once this resolves. |
| [receive()](./ai.livesession.md#livesessionreceive) | | <b><i>(Public Preview)</i></b> Yields messages received from the server. This can only be used by one consumer at a time. |
| [send(request, turnComplete)](./ai.livesession.md#livesessionsend) | | <b><i>(Public Preview)</i></b> Sends content to the server. |
| [sendAudioRealtime(blob)](./ai.livesession.md#livesessionsendaudiorealtime) | | <b><i>(Public Preview)</i></b> Sends audio data to the server in realtime. |
| [sendFunctionResponses(functionResponses)](./ai.livesession.md#livesessionsendfunctionresponses) | | <b><i>(Public Preview)</i></b> Sends function responses to the server. |
| [sendMediaChunks(mediaChunks)](./ai.livesession.md#livesessionsendmediachunks) | | <b><i>(Public Preview)</i></b> Sends realtime input to the server. |
| [sendMediaStream(mediaChunkStream)](./ai.livesession.md#livesessionsendmediastream) | | <b><i>(Public Preview)</i></b> Sends a stream of [GenerativeContentBlob](./ai.generativecontentblob.md#generativecontentblob_interface)<!-- -->. |
| [sendMediaStream(mediaChunkStream)](./ai.livesession.md#livesessionsendmediastream) | | <b><i>(Public Preview)</i></b> |
| [sendTextRealtime(text)](./ai.livesession.md#livesessionsendtextrealtime) | | <b><i>(Public Preview)</i></b> Sends text to the server in realtime. |
| [sendVideoRealtime(blob)](./ai.livesession.md#livesessionsendvideorealtime) | | <b><i>(Public Preview)</i></b> Sends video data to the server in realtime. |

## LiveSession.inConversation

@@ -135,6 +138,45 @@ Promise&lt;void&gt;

If this session has been closed.

## LiveSession.sendAudioRealtime()

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
>

Sends audio data to the server in realtime.

The server requires that the audio data be base64-encoded 16-bit PCM at 16kHz, little-endian.

<b>Signature:</b>

```typescript
sendAudioRealtime(blob: GenerativeContentBlob): Promise<void>;
```

#### Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| blob | [GenerativeContentBlob](./ai.generativecontentblob.md#generativecontentblob_interface) | The base64-encoded PCM data to send to the server in realtime. |

<b>Returns:</b>

Promise&lt;void&gt;

#### Exceptions

If this session has been closed.

### Example


```javascript
// const pcmData = ... base64-encoded 16-bit PCM at 16kHz little-endian.
const blob = { mimeType: "audio/pcm", data: pcmData };
liveSession.sendAudioRealtime(blob);

```
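
A sketch of producing such data from raw microphone samples (this assumes `float32Samples` is a `Float32Array` captured at 16kHz; the conversion is a common Web Audio pattern, not part of this SDK):

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM,
// then base64-encode the bytes for sendAudioRealtime().
function toBase64Pcm16(float32Samples) {
  const view = new DataView(new ArrayBuffer(float32Samples.length * 2));
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // little-endian
  }
  const bytes = new Uint8Array(view.buffer);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

await liveSession.sendAudioRealtime({
  mimeType: 'audio/pcm',
  data: toBase64Pcm16(float32Samples),
});
```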

## LiveSession.sendFunctionResponses()

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
@@ -167,6 +209,11 @@ If this session has been closed.
> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
>

> Warning: This API is now obsolete.
>
> Use `sendTextRealtime()`<!-- -->, `sendAudioRealtime()`<!-- -->, and `sendVideoRealtime()` instead.
>

Sends realtime input to the server.

<b>Signature:</b>
@@ -194,7 +241,12 @@ If this session has been closed.
> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
>

Sends a stream of [GenerativeContentBlob](./ai.generativecontentblob.md#generativecontentblob_interface)<!-- -->.
> Warning: This API is now obsolete.
>
> Use `sendTextRealtime()`<!-- -->, `sendAudioRealtime()`<!-- -->, and `sendVideoRealtime()` instead.
>
> Sends a stream of [GenerativeContentBlob](./ai.generativecontentblob.md#generativecontentblob_interface)<!-- -->.
>

<b>Signature:</b>

@@ -216,3 +268,77 @@

If this session has been closed.

## LiveSession.sendTextRealtime()

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
>

Sends text to the server in realtime.

<b>Signature:</b>

```typescript
sendTextRealtime(text: string): Promise<void>;
```

#### Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | The text data to send. |

<b>Returns:</b>

Promise&lt;void&gt;

#### Exceptions

If this session has been closed.

### Example


```javascript
liveSession.sendTextRealtime("Hello, how are you?");

```

## LiveSession.sendVideoRealtime()

> This API is provided as a preview for developers and may change based on feedback that we receive. Do not use this API in a production environment.
>

Sends video data to the server in realtime.

The server requires that the video be sent as individual video frames at 1 FPS. It is recommended to set `mimeType` to `image/jpeg`<!-- -->.

<b>Signature:</b>

```typescript
sendVideoRealtime(blob: GenerativeContentBlob): Promise<void>;
```

#### Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| blob | [GenerativeContentBlob](./ai.generativecontentblob.md#generativecontentblob_interface) | The base64-encoded video data to send to the server in realtime. |

<b>Returns:</b>

Promise&lt;void&gt;

#### Exceptions

If this session has been closed.

### Example


```javascript
// const videoFrame = ... base64-encoded JPEG data
const blob = { mimeType: "image/jpeg", data: videoFrame };
liveSession.sendVideoRealtime(blob);

```
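
A sketch of capturing frames from a playing `<video>` element at 1 FPS via a canvas (the capture approach is illustrative, not part of this SDK):

```javascript
// Draw the current video frame to a canvas once per second, encode it
// as JPEG, and send the base64 payload (without the data-URL prefix).
const canvas = document.createElement('canvas');
canvas.width = 640;
canvas.height = 480;
const ctx = canvas.getContext('2d');

setInterval(() => {
  ctx.drawImage(videoElement, 0, 0, canvas.width, canvas.height);
  const data = canvas.toDataURL('image/jpeg').split(',')[1];
  liveSession.sendVideoRealtime({ mimeType: 'image/jpeg', data });
}, 1000);
```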

2 changes: 2 additions & 0 deletions docs-devsite/ai.md
@@ -56,6 +56,7 @@ The Firebase AI Web SDK.
| [AI](./ai.ai.md#ai_interface) | An instance of the Firebase AI SDK.<!-- -->Do not create this instance directly. Instead, use [getAI()](./ai.md#getai_a94a413)<!-- -->. |
| [AIOptions](./ai.aioptions.md#aioptions_interface) | Options for initializing the AI service using [getAI()](./ai.md#getai_a94a413)<!-- -->. This allows specifying which backend to use (Vertex AI Gemini API or Gemini Developer API) and configuring its specific options (like location for Vertex AI). |
| [AudioConversationController](./ai.audioconversationcontroller.md#audioconversationcontroller_interface) | <b><i>(Public Preview)</i></b> A controller for managing an active audio conversation. |
| [AudioTranscriptionConfig](./ai.audiotranscriptionconfig.md#audiotranscriptionconfig_interface) | The audio transcription configuration. |
| [BaseParams](./ai.baseparams.md#baseparams_interface) | Base parameters for a number of methods. |
| [ChromeAdapter](./ai.chromeadapter.md#chromeadapter_interface) | <b><i>(Public Preview)</i></b> Defines an inference "backend" that uses Chrome's on-device model, and encapsulates logic for detecting when on-device inference is possible.<!-- -->These methods should not be called directly by the user. |
| [Citation](./ai.citation.md#citation_interface) | A single citation. |
@@ -134,6 +135,7 @@ The Firebase AI Web SDK.
| [TextPart](./ai.textpart.md#textpart_interface) | Content part interface if the part represents a text string. |
| [ThinkingConfig](./ai.thinkingconfig.md#thinkingconfig_interface) | Configuration for "thinking" behavior of compatible Gemini models.<!-- -->Certain models utilize a thinking process before generating a response. This allows them to reason through complex problems and plan a more coherent and accurate answer. |
| [ToolConfig](./ai.toolconfig.md#toolconfig_interface) | Tool config. This config is shared for all tools provided in the request. |
| [Transcription](./ai.transcription.md#transcription_interface) | <b><i>(Public Preview)</i></b> Transcription of audio. This can be returned from a [LiveGenerativeModel](./ai.livegenerativemodel.md#livegenerativemodel_class) if transcription is enabled with the <code>inputAudioTranscription</code> or <code>outputAudioTranscription</code> properties on the [LiveGenerationConfig](./ai.livegenerationconfig.md#livegenerationconfig_interface)<!-- -->. |
| [URLContext](./ai.urlcontext.md#urlcontext_interface) | <b><i>(Public Preview)</i></b> Specifies the URL Context configuration. |
| [URLContextMetadata](./ai.urlcontextmetadata.md#urlcontextmetadata_interface) | <b><i>(Public Preview)</i></b> Metadata related to [URLContextTool](./ai.urlcontexttool.md#urlcontexttool_interface)<!-- -->. |
| [URLContextTool](./ai.urlcontexttool.md#urlcontexttool_interface) | <b><i>(Public Preview)</i></b> A tool that allows you to provide additional context to the models in the form of public web URLs. By including URLs in your request, the Gemini model will access the content from those pages to inform and enhance its response. |