# 🤖 Agentic Audio Notebook (Py 3.12 friendly)

Agentic pipeline that works on **Google Colab Python 3.12** without Coqui TTS:

- **Agent (ReAct)** via LangChain + **Gemini**
- **PDF → Text** with PyPDF2
- **Text → Speech** using **edge-tts** (no API key required)
- **Speech → Text** using **Whisper** (local)
- **Speech → Speech (voice cloning)** using **OpenVoice** *(optional setup cell provided)*

**No Streamlit/UI** — run cells top to bottom. Use GPU for Whisper speed if you like (not required).

## 0) Install Dependencies (Py 3.12-safe)

In [4]:
#@title ⬇️ Install clean dependencies (choose CPU or GPU Torch)

USE_GPU = True  #@param {type:"boolean"}

# System deps
!apt-get -y update && apt-get -y install -qq ffmpeg > /dev/null
!pip -q install --upgrade pip setuptools wheel

# --- Torch installation ---
if USE_GPU:
    print("⚡ Installing GPU-enabled Torch (CUDA 12.1 build)...")
    !pip uninstall -y torch torchvision torchaudio -q
    !pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 \
        --index-url https://download.pytorch.org/whl/cu121
else:
    print("🐢 Installing CPU-only Torch...")
    !pip uninstall -y torch torchvision torchaudio -q
    !pip install torch==2.3.1+cpu -f https://download.pytorch.org/whl/cpu

# --- Rest of stack (no torch conflicts) ---
!pip install --no-deps \
  langchain==0.2.14 langchain-core==0.2.32 "langsmith<0.2.0" \
  langchain-google-genai==2.1.10 google-ai-generativelanguage==0.6.18 \
  openai-whisper==20231117 PyPDF2==3.0.1 pydub==0.25.1 soundfile==0.12.1 \
  librosa==0.10.2.post1 edge-tts==6.1.11

# Optional: install OpenVoice deps (only if you want speech→speech cloning)
# !pip install --no-deps huggingface_hub==0.23.0 onnxruntime==1.18.1 "protobuf<5"

import sys, torch
print("✅ Installs complete. Python:", sys.version)
print("Torch version:", torch.__version__, "| CUDA available:", torch.cuda.is_available())


0% [Working]            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.82)] [                                                                               Hit:2 http://archive.ubuntu.com/ubuntu jammy InRelease
0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.82)] [                                                                               Hit:3 https://cli.github.com/packages stable InRelease
0% [Waiting for headers] [Connecting to security.ubuntu.com (185.125.190.82)] [                                                                               Hit:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:6 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:7 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:8 http://security.ubun

## (Optional) 0b) Install OpenVoice (Speech→Speech Voice Cloning)
Run **only if you want local voice cloning**. This fetches the OpenVoice repo and checkpoints.

If you skip this, the notebook still works (PDF→speech via edge-tts, and Whisper transcription).

In [6]:
#@title ⬇️ Optional: Install OpenVoice
USE_OPENVOICE = True #@param {type:"boolean"}
if USE_OPENVOICE:
    !pip -q install 'huggingface_hub==0.23.0'
    !pip -q install 'onnxruntime==1.18.1'  # CPU fallback for some OpenVoice components
    !pip -q install 'protobuf<5'
    !pip -q install 'soundfile==0.12.1' 'numpy<2.3'
    # Clone repository
    !rm -rf OpenVoice
    !git clone -q https://github.com/myshell-ai/OpenVoice.git

    # Download checkpoints to local folder (public)
    from huggingface_hub import snapshot_download
    ckpt_dir = snapshot_download(
        repo_id='myshell-ai/OpenVoice',
        repo_type='model',
        local_dir='openvoice_ckpts',
        allow_patterns=['**/*.pt','**/*.pth','**/*.onnx','**/*.bin','**/*.ckpt','**/*.json']
    )
    print('OpenVoice checkpoints at:', ckpt_dir)
else:
    print('Skipping OpenVoice install — speech→speech cloning will be disabled.')

[0m[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
grpcio-status 1.74.0 requires protobuf<7.0.0,>=6.31.1, but you have protobuf 4.25.8 which is incompatible.
google-adk 1.13.0 requires tenacity<9.0.0,>=8.0.0, but you have tenacity 9.1.2 which is incompatible.
ydf 0.13.0 requires protobuf<7.0.0,>=5.29.1, but you have protobuf 4.25.8 which is incompatible.
google-generativeai 0.8.5 requires google-ai-generativelanguage==0.6.15, but you have google-ai-generativelanguage 0.6.18 which is incompatible.[0m[31m
[0m

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

OpenVoice checkpoints at: /content/openvoice_ckpts


In [1]:
# Fix NumPy / Whisper ABI mismatch
!pip install --force-reinstall "numpy<2.0" numba==0.59.1

import numpy as np
import whisper

print("✅ Whisper import fixed. NumPy:", np.__version__)


[0mCollecting numpy<2.0
  Using cached numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting numba==0.59.1
  Using cached numba-0.59.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.7 kB)
Collecting llvmlite<0.43,>=0.42.0dev0 (from numba==0.59.1)
  Using cached llvmlite-0.42.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.8 kB)
Using cached numba-0.59.1-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.7 MB)
Using cached numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.0 MB)
Using cached llvmlite-0.42.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (43.8 MB)
[0mInstalling collected packages: numpy, llvmlite, numba
[2K  Attempting uninstall: numpy
[2K    Found existing installation: numpy 1.26.4
[2K    Uninstalling numpy-1.26.4:
[2K      Successfully uninstalled numpy-1.26.4
[2K  Attempting uninstall: llvmlite
[2K    Found existing in

✅ Whisper import fixed. NumPy: 1.26.4


In [9]:
!pip install --upgrade --force-reinstall langchain langchain-core langchain-google-genai


[0mCollecting langchain
  Downloading langchain-0.3.27-py3-none-any.whl.metadata (7.8 kB)
Collecting langchain-core
  Using cached langchain_core-0.3.75-py3-none-any.whl.metadata (5.7 kB)
Collecting langchain-google-genai
  Using cached langchain_google_genai-2.1.10-py3-none-any.whl.metadata (7.2 kB)
Collecting langchain-text-splitters<1.0.0,>=0.3.9 (from langchain)
  Using cached langchain_text_splitters-0.3.11-py3-none-any.whl.metadata (1.8 kB)
Collecting langsmith>=0.1.17 (from langchain)
  Downloading langsmith-0.4.26-py3-none-any.whl.metadata (14 kB)
Collecting pydantic<3.0.0,>=2.7.4 (from langchain)
  Using cached pydantic-2.11.7-py3-none-any.whl.metadata (67 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Using cached sqlalchemy-2.0.43-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting requests<3,>=2 (from langchain)
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting PyYAML>=5.3 (from langchain)
  Using cached 

In [1]:
from langchain_google_genai import ChatGoogleGenerativeAI
print("✅ LangChain + Gemini imports working")


✅ LangChain + Gemini imports working


## 1) Configure Gemini & Imports

In [2]:
#@title 🔑 Provide your Gemini API key
GEMINI_API_KEY = "AIzaSyAt6FzjQz7fT_OpQjZjVimvAcbwTEVNo1w" #@param {type:"string"}
import os
os.environ['GOOGLE_API_KEY'] = GEMINI_API_KEY.strip()
if not os.environ['GOOGLE_API_KEY']:
    raise RuntimeError('Please paste a valid Gemini API key above.')
print('Gemini key set.')

import asyncio
import io
from typing import List, Optional
from pydub import AudioSegment
from PyPDF2 import PdfReader
import whisper
import numpy as np

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import initialize_agent, Tool

print('Imports loaded.')

Gemini key set.
Imports loaded.


## 2) Utilities: PDF text, audio concat, TTS wrapper (edge-tts)

In [3]:
def extract_pdf_text(pdf_file: str) -> str:
    reader = PdfReader(pdf_file)
    parts = []
    for p in reader.pages:
        try:
            parts.append(p.extract_text() or '')
        except Exception as e:
            print('Warn: page extract failed:', e)
    return ' '.join('\n'.join(parts).split())

def split_text(text: str, max_chars: int = 220) -> List[str]:
    out, cur = [], ''
    for w in text.split():
        if len(cur) + len(w) + 1 <= max_chars:
            cur = (cur + ' ' + w).strip()
        else:
            if cur:
                out.append(cur)
            cur = w
    if cur:
        out.append(cur)
    return out

def concat_wavs(wav_paths: List[str], out_path: str) -> str:
    seg = AudioSegment.empty()
    for p in wav_paths:
        seg += AudioSegment.from_wav(p)
    seg.export(out_path, format='wav')
    return out_path

# edge-tts is async; provide a sync helper
async def _edge_tts_synthesize_async(text: str, voice: str, out_path: str):
    import edge_tts
    communicate = edge_tts.Communicate(text, voice=voice)
    await communicate.save(out_path)

def edge_tts_synthesize(text: str, voice: str = 'en-US-JennyNeural', out_path: str = 'tts_edge.wav') -> str:
    """Synthesize speech using edge-tts and save to WAV (via pydub conversion)."""
    # edge-tts outputs mp3/ogg — we can save to mp3 then convert to wav
    mp3_path = out_path.replace('.wav', '.mp3')
    asyncio.run(_edge_tts_synthesize_async(text, voice, mp3_path))
    # Convert to wav
    AudioSegment.from_file(mp3_path).export(out_path, format='wav')
    return out_path

## 3) Whisper ASR Model (local)

In [4]:
print('Loading Whisper (base)...')
whisper_model = whisper.load_model('base')
print('Whisper ready.')

Loading Whisper (base)...


100%|███████████████████████████████████████| 139M/139M [00:01<00:00, 95.5MiB/s]


Whisper ready.


## 4) Optional: OpenVoice Helpers (speech→speech cloning)

In [8]:
#@title 🔊 Initialize OpenVoice (safe wrapper with subprocess fallback)

# Ensure flag exists
try:
    USE_OPENVOICE
except NameError:
    USE_OPENVOICE = False

OPENVOICE_OK = False
if USE_OPENVOICE:
    import os, subprocess

    # Confirm repo + checkpoints exist
    if not os.path.exists("OpenVoice") or not os.path.exists("openvoice_ckpts"):
        print("⚠️ OpenVoice repo or checkpoints not found. Run the install cell first.")
    else:
        OPENVOICE_OK = True
        print("✅ OpenVoice ready for speech→speech cloning.")
else:
    print("OpenVoice disabled by user.")

def openvoice_convert(source_wav: str, ref_speaker_wav: str, out_path: str = "s2s_openvoice.wav") -> str:
    """
    Run OpenVoice voice conversion using its inference script.
    Relies on the cloned repo structure: OpenVoice/inference/infer.py
    """
    if not OPENVOICE_OK:
        return "OpenVoice is not installed/configured."

    try:
        cmd = [
            "python", "OpenVoice/inference/infer.py",
            "--src", source_wav,
            "--ref", ref_speaker_wav,
            "--out", out_path,
            "--ckpt_dir", "openvoice_ckpts"
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print("🔄 OpenVoice log:", result.stdout)
        return out_path
    except subprocess.CalledProcessError as e:
        return f"OpenVoice conversion failed:\n{e.stderr}"


✅ OpenVoice ready for speech→speech cloning.


## 5) Global state & Tools (Agent uses these)

In [10]:
CURRENT_PDF: Optional[str] = None
CURRENT_SOURCE_AUDIO: Optional[str] = None
CURRENT_VOICE_REF: Optional[str] = None

def tool_pdf_to_text(_: str) -> str:
    if not CURRENT_PDF or not os.path.exists(CURRENT_PDF):
        return 'No PDF uploaded.'
    return extract_pdf_text(CURRENT_PDF)

def tool_text_to_speech(text: str) -> str:
    if not text.strip():
        return 'No input text for TTS.'
    # Choose a default neural voice (change if you prefer a male/female variant)
    return edge_tts_synthesize(text, voice='en-US-JennyNeural', out_path='edge_tts_output.wav')

def tool_transcribe(_: str) -> str:
    if not CURRENT_SOURCE_AUDIO or not os.path.exists(CURRENT_SOURCE_AUDIO):
        return 'No source audio uploaded.'
    res = whisper_model.transcribe(CURRENT_SOURCE_AUDIO)
    return res.get('text','').strip() or '(empty transcription)'

def tool_speech_to_speech(_: str) -> str:
    if not CURRENT_SOURCE_AUDIO or not os.path.exists(CURRENT_SOURCE_AUDIO):
        return 'No source audio uploaded.'
    if not CURRENT_VOICE_REF or not os.path.exists(CURRENT_VOICE_REF):
        return 'No reference voice sample uploaded.'
    if not OPENVOICE_OK:
        return 'OpenVoice not installed. Run the optional install cell above.'
    return openvoice_convert(CURRENT_SOURCE_AUDIO, CURRENT_VOICE_REF, out_path='openvoice_s2s.wav')

## 6) Build the Agent (ReAct with tool selection)

In [15]:
#@title 🤖 Initialize Audio Agent (LangGraph if available, fallback to LangChain)

!pip install -q langgraph

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.agents import initialize_agent, Tool

# Try LangGraph import
USE_LANGGRAPH = False
try:
    from langgraph.prebuilt import create_react_agent
    USE_LANGGRAPH = True
    print("✅ LangGraph detected, will use modern ReAct agent.")
except ImportError:
    print("⚠️ LangGraph not installed, falling back to legacy initialize_agent.")

# Define the LLM
llm = ChatGoogleGenerativeAI(model='models/gemini-1.5-flash', temperature=0.3)

# Define tool wrappers
tools = [
    Tool(
        name='PDFToText',
        func=tool_pdf_to_text,
        description='Extract text from the uploaded PDF and return it as plain text.'
    ),
    Tool(
        name='TextToSpeech',
        func=tool_text_to_speech,
        description='Convert input text to audio using edge-tts (neural voices). Returns WAV path.'
    ),
    Tool(
        name='TranscribeAudio',
        func=tool_transcribe,
        description='Transcribe the uploaded source audio to text using Whisper.'
    ),
    Tool(
        name='SpeechToSpeech',
        func=tool_speech_to_speech,
        description='Convert uploaded source speech into the reference voice (OpenVoice). Returns WAV path.'
    ),
]

# Build agent
if USE_LANGGRAPH:
    agent = create_react_agent(
        model=llm,
        tools=tools,
        prompt=(
            "You are a helpful audio assistant. You can: "
            "(1) read a PDF via TTS, "
            "(2) transcribe audio, "
            "(3) convert speech→speech using OpenVoice (if installed). "
            "Plan and call tools as needed. Keep answers concise and return file paths clearly."
        ),
    )
else:
    agent = initialize_agent(
        tools=tools,
        llm=llm,
        agent='zero-shot-react-description',
        verbose=True,
        agent_kwargs={
            'prefix': (
                'You are a helpful audio assistant. You can: (1) read a PDF via TTS, '
                '(2) transcribe audio, (3) convert speech→speech using OpenVoice (if installed). '
                'Plan and call tools as needed. Keep answers concise and return file paths clearly.'
            )
        }
    )


[0m✅ LangGraph detected, will use modern ReAct agent.


## 7) Upload Files (PDF, reference voice, source audio)

In [16]:
#@title ⬆️ Upload inputs
from google.colab import files

print('Upload a PDF (optional) ...')
u1 = files.upload()
if u1:
    CURRENT_PDF = next(iter(u1.keys()))
    print('PDF:', CURRENT_PDF)

print('Upload a reference voice sample (WAV/MP3, for OpenVoice speech→speech) ...')
u2 = files.upload()
if u2:
    CURRENT_VOICE_REF = next(iter(u2.keys()))
    print('Reference voice:', CURRENT_VOICE_REF)

print('Upload a source audio (WAV/MP3, to convert to reference voice) ...')
u3 = files.upload()
if u3:
    CURRENT_SOURCE_AUDIO = next(iter(u3.keys()))
    print('Source audio:', CURRENT_SOURCE_AUDIO)

print('Ready. Proceed to instruction cell.')

Upload a PDF (optional) ...


Saving pdf.pdf to pdf.pdf
PDF: pdf.pdf
Upload a reference voice sample (WAV/MP3, for OpenVoice speech→speech) ...


Saving Zainab.wav to Zainab.wav
Reference voice: Zainab.wav
Upload a source audio (WAV/MP3, to convert to reference voice) ...


Saving Zainab.wav to Zainab (1).wav
Source audio: Zainab (1).wav
Ready. Proceed to instruction cell.


## 8) Give the Agent an Instruction
Examples:
- "Read my PDF in a natural voice and give me the WAV path."
- "Transcribe the uploaded audio."
- "Convert the uploaded speech into the reference voice and return the WAV path."

In [19]:
#@title 🧭 Agent instruction
instruction = "Read my PDF in a natural voice and give me the WAV path." #@param {type:"string"}

try:
    # Try LangGraph style
    result = agent.invoke({
        "messages": [("user", instruction)]
    })
except Exception:
    # Fallback to LangChain legacy style
    result = agent.invoke({"input": instruction})

print("\n=== Agent Result ===\n", result)



=== Agent Result ===
 {'messages': [HumanMessage(content='Read my PDF in a natural voice and give me the WAV path.', additional_kwargs={}, response_metadata={}, id='fb257455-567f-47e1-8aa2-94f1cb2d1914'), AIMessage(content='Please provide the PDF file path.', additional_kwargs={}, response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'model_name': 'gemini-1.5-flash', 'safety_ratings': []}, id='run--8b096f99-1898-457a-9fd0-b5fc5f1199f8-0', usage_metadata={'input_tokens': 171, 'output_tokens': 8, 'total_tokens': 179, 'input_token_details': {'cache_read': 0}})]}


## 9) Playback Helper

In [22]:
from IPython.display import Audio, display
for p in ['edge_tts_output.wav', 'openvoice_s2s.wav']:
    if os.path.exists(p):
        print('Found:', p)
        display(Audio(p))

In [25]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


---
### Notes
- **Text→speech voice** is from edge-tts (choose different voices by changing the `voice` argument).
- **Voice cloning (speech→speech)** needs the optional OpenVoice install & checkpoints; wrapper provided but may need small adjustments if OpenVoice APIs change.
- For scanned PDFs, you'll need OCR (not included here).