# 8.3 Speech & Audio

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/forhow134/ai-coding-guide/blob/main/demos/08-multimodal/speech_audio.ipynb)

**Estimated API cost: ~$0.01**

Speech-to-text with Whisper and text-to-speech with the TTS endpoint.

In [None]:
!pip install -q openai

In [None]:
import os
from getpass import getpass

if not os.environ.get('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass('OpenAI API Key: ')

## Experiment 1: Whisper Speech-to-Text

In [None]:
from openai import OpenAI

client = OpenAI()

# Example: uncomment once you have an audio file (mp3/wav/m4a)
# with open('meeting.mp3', 'rb') as audio_file:
#     transcript = client.audio.transcriptions.create(
#         model='whisper-1',
#         file=audio_file,
#         response_format='text'
#     )
# print(transcript)

print('Tip: prepare an audio file (mp3/wav/m4a), then uncomment the code above to test')
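
Before uploading, it can help to check the file locally. The sketch below is a hypothetical helper (`check_audio_file` is not part of the OpenAI SDK). It assumes a 25 MB upload limit and the extension list shown, which match the transcription documentation as far as I know but should be verified against the current docs.

```python
from pathlib import Path

# Assumptions: the 25 MB limit and this extension list reflect the OpenAI
# transcription docs at the time of writing; verify before relying on them.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024
SUPPORTED_EXTENSIONS = {'.mp3', '.mp4', '.mpeg', '.mpga', '.m4a', '.wav', '.webm'}


def check_audio_file(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    file = Path(path)
    if not file.is_file():
        return [f'{path} does not exist']
    problems = []
    if file.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f'unsupported extension {file.suffix!r}')
    if file.stat().st_size > MAX_UPLOAD_BYTES:
        problems.append('file is larger than 25 MB')
    return problems


# Prints a "does not exist" message until meeting.mp3 is placed next to the notebook.
print(check_audio_file('meeting.mp3'))
```

Returning a list rather than raising lets the caller report every problem at once before spending an API call.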

## Experiment 2: Output Formats

In [None]:
from openai import OpenAI

client = OpenAI()

# Assumes an audio file is available
# audio_file = open('audio.mp3', 'rb')
# Note: if you reuse the same handle for several requests,
# call audio_file.seek(0) before each one so the upload starts at the beginning.

# 1. Plain text
# text = client.audio.transcriptions.create(model='whisper-1', file=audio_file, response_format='text')

# 2. JSON
# json_result = client.audio.transcriptions.create(model='whisper-1', file=audio_file, response_format='json')

# 3. SRT subtitles
# srt = client.audio.transcriptions.create(model='whisper-1', file=audio_file, response_format='srt')

# 4. Verbose JSON (word- and segment-level timestamps)
# verbose = client.audio.transcriptions.create(
#     model='whisper-1',
#     file=audio_file,
#     response_format='verbose_json',
#     timestamp_granularities=['word', 'segment']
# )

print('Tip: uncomment the calls above to try the different output formats')
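
The `srt` format comes back as a plain string, so a little parsing is needed to use the timestamps in code. The following is a minimal sketch that assumes well-formed SRT (index line, timing line, then text). `Cue` and `parse_srt` are illustrative names, and the sample string is hand-written rather than real API output.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


_TIME = re.compile(r'(\d+):(\d{2}):(\d{2})[,.](\d{3})')


def _to_seconds(stamp: str) -> float:
    """Convert an 'HH:MM:SS,mmm' timestamp to seconds."""
    hours, minutes, seconds, millis = map(int, _TIME.fullmatch(stamp.strip()).groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt: str) -> list[Cue]:
    """Parse SRT text into cues; blocks with fewer than three lines are skipped."""
    cues = []
    for block in re.split(r'\n\s*\n', srt.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start, end = lines[1].split('-->')
        cues.append(Cue(int(lines[0]), _to_seconds(start), _to_seconds(end), '\n'.join(lines[2:])))
    return cues


# Hand-written sample standing in for a real transcription result.
sample = """1
00:00:00,000 --> 00:00:02,500
Hello and welcome.

2
00:00:02,500 --> 00:00:05,000
Let's get started.
"""
for cue in parse_srt(sample):
    print(f'{cue.start:6.2f}s - {cue.end:6.2f}s  {cue.text}')
```

If you need timestamps as structured data from the start, `verbose_json` avoids the parsing step. `srt` is most useful when the result goes straight into a video player.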

## Experiment 3: TTS Text-to-Speech

In [None]:
from openai import OpenAI
from pathlib import Path

client = OpenAI()

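speech_file = Path('output.mp3')

# with_streaming_response writes the audio to disk as it arrives;
# calling stream_to_file on a plain response is deprecated in the v1 SDK.
with client.audio.speech.with_streaming_response.create(
    model='tts-1',
    voice='alloy',
    input='Welcome to the AI voice assistant. Today I will show you how to use OpenAI text-to-speech.'
) as response:
    response.stream_to_file(speech_file)

print(f'Speech generated: {speech_file}')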

## Experiment 4: Voice Comparison

In [None]:
from openai import OpenAI
from pathlib import Path

client = OpenAI()

text = 'This is a test sentence.'
voices = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']

for voice in voices:
    # Stream each voice sample straight to its own file
    with client.audio.speech.with_streaming_response.create(
        model='tts-1',
        voice=voice,
        input=text
    ) as response:
        response.stream_to_file(f'{voice}.mp3')
    print(f'Generated {voice}.mp3')
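
The loop above varies only `voice`. The speech endpoint also accepts a `speed` argument, and this notebook's key points give its range as 0.25 to 4.0. Here is a small sketch of a helper that validates the value before a request is sent. `speech_params` is a hypothetical name, not an SDK function, and the actual API call is left commented out because it needs a key and network access.

```python
# Range taken from the key points of this notebook (0.25x to 4.0x).
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def speech_params(text: str, voice: str = 'alloy', speed: float = 1.0) -> dict:
    """Build keyword arguments for client.audio.speech.create (hypothetical helper)."""
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f'speed must be between {MIN_SPEED} and {MAX_SPEED}, got {speed}')
    return {'model': 'tts-1', 'voice': voice, 'input': text, 'speed': speed}


params = speech_params('This is a test sentence.', voice='nova', speed=1.25)
print(params)

# To send the request (requires an API key and network access):
# with client.audio.speech.with_streaming_response.create(**params) as response:
#     response.stream_to_file('nova_fast.mp3')
```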

## Experiment 5: End-to-End Voice Conversation

In [None]:
from openai import OpenAI

client = OpenAI()

def voice_chat(audio_path: str):
    # 1. Speech to text
    with open(audio_path, 'rb') as audio_file:
        question = client.audio.transcriptions.create(
            model='whisper-1',
            file=audio_file
        ).text
    print(f'User: {question}')

    # 2. LLM answer
    answer = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': question}]
    ).choices[0].message.content
    print(f'AI: {answer}')

    # 3. Text to speech, streamed to disk
    with client.audio.speech.with_streaming_response.create(
        model='tts-1',
        voice='nova',
        input=answer
    ) as response:
        response.stream_to_file('reply.mp3')
    print('Spoken reply saved to reply.mp3')

# voice_chat('user_question.mp3')
print('Tip: prepare an audio file, then uncomment the call above to test the full pipeline')
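
One practical gap in the pipeline above is that the LLM answer is passed to TTS in a single request. As far as I know the speech endpoint caps `input` at 4096 characters, so a long answer may need to be split first. The sketch below treats that limit as an assumption, and `split_for_tts` is a hypothetical helper. It prefers sentence boundaries and falls back to a hard cut for oversized sentences.

```python
import re

# Assumption: 4096 characters is the per-request input limit of the speech endpoint.
MAX_TTS_CHARS = 4096


def split_for_tts(text: str, limit: int = MAX_TTS_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, preferring sentence ends.

    Joining the returned chunks reproduces the original text exactly.
    """
    # Each piece is a sentence plus its trailing whitespace (Latin or CJK punctuation).
    pieces = re.findall(r'.+?(?:[.!?。!?]+\s*|\Z)', text, flags=re.S)
    chunks, current = [], ''
    for piece in pieces:
        # Flush the current chunk if adding this piece would exceed the limit.
        if current and len(current) + len(piece) > limit:
            chunks.append(current)
            current = ''
        # A single sentence longer than the limit is cut at the limit.
        while len(piece) > limit:
            chunks.append(piece[:limit])
            piece = piece[limit:]
        current += piece
    if current:
        chunks.append(current)
    return chunks


demo = 'One. Two! Three? ' * 3
for chunk in split_for_tts(demo, limit=20):
    print(repr(chunk))
```

Each chunk would then be sent as its own speech request, producing several audio files that the caller plays or concatenates in order.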

## Key Takeaways

1. **Whisper**: $0.006 per minute of audio, with support for 99 languages
2. **Output formats**: text | json | srt | vtt | verbose_json
3. **TTS**: $15 per 1M characters (tts-1)
4. **6 voices**: alloy, echo, fable, onyx, nova, shimmer
5. **Adjustable speed**: 0.25x to 4.0x
6. **Combined use**: STT + LLM + TTS = voice assistant
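
The two prices in the list above can be combined into a rough estimate for one voice exchange. This is only a sketch: the constants are copied from this notebook and may not match current pricing, `estimate_cost` is an illustrative name, and the LLM call in the middle of the pipeline is not included.

```python
# Prices copied from the list above; treat them as a snapshot, not current pricing.
WHISPER_USD_PER_MINUTE = 0.006
TTS_USD_PER_MILLION_CHARS = 15.0


def estimate_cost(audio_seconds: float, tts_characters: int) -> float:
    """Estimate the USD cost of transcribing audio and synthesizing a reply."""
    stt = audio_seconds / 60 * WHISPER_USD_PER_MINUTE
    tts = tts_characters / 1_000_000 * TTS_USD_PER_MILLION_CHARS
    return stt + tts


# A 30-second question answered with a 500-character spoken reply.
print(f'${estimate_cost(30, 500):.4f}')  # prints $0.0105
```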

---

**Next**: continue with [8.4 Video & Realtime](./realtime.ipynb)