# OCI Speech-to-Text: Step by Step Transcription

**What this notebook does:**
Demonstrates audio transcription with OCI Speech-to-Text, step by step: it lists the MP3 files, selects one, uploads it to Object Storage, submits a transcription job, polls for completion, and displays the transcript.

**Documentation to reference:**
- Service docs:    https://docs.oracle.com/en-us/iaas/Content/speech/home.htm
- Python SDK:      https://github.com/oracle/oci-python-sdk/tree/master/src/oci/ai_speech
- Real time:       https://github.com/oracle/oci-ai-speech-realtime-python-sdk
- Model comparison: https://docs.oracle.com/en-us/iaas/Content/speech/using/speech.htm#compare-models

**Relevant Slack channels:**
- #oci_speech_service_users: *questions about speech service*
- #igiu-innovation-lab: *project ideas*
- #igiu-ai-learning: *issues with code or environment*

**Env setup:**
- sandbox.yaml: Contains OCI config, compartment, bucket details.
- .env: Load environment variables if needed.
- Make Jupyter's working directory match your workspace so that relative paths in the Python code resolve:
    - In VS Code, open Settings > Extensions > Jupyter > Notebook File Root.
    - Change the value from `${fileDirname}` to `${workspaceFolder}`.


**How to run in notebook:**
- Run cells in order.

**MP3 Demos to try:**
- `voice_sample_english.mp3` (included)
- Add more MP3 files (for example, named `voice_sample-*.mp3`) to the `speech/` directory if you like!

*The notebook prints the transcript(s) inline; it writes no local files.*

**Experimentation ideas:**
- Try different MP3 files by changing `selected_index`, which sets `AUDIO_FILE`.
- Switch between the 'WHISPER_MEDIUM' and 'ORACLE' models.
- Experiment with different language codes for the ORACLE model.
- Request more output formats alongside SRT.


## Helpful Links and Prerequisites
Before starting, review these resources:
- **Service docs:** https://docs.oracle.com/en-us/iaas/Content/speech/home.htm
- **Python SDK:** https://github.com/oracle/oci-python-sdk/tree/master/src/oci/ai_speech
- **Real-time SDK:** https://github.com/oracle/oci-ai-speech-realtime-python-sdk
- **Support Slack channels:** #oci_speech_service_users (for speech service questions), #igiu-innovation-lab (project ideas), #igiu-ai-learning (troubleshooting)

**Prerequisites:**
- Ensure `sandbox.yaml` is set up with OCI config, compartment, and bucket details.
- Place MP3 files in the `speech/` directory (e.g., `voice_sample_english.mp3`).
- This notebook assumes you have the necessary permissions and dependencies installed.

## Step 1: Setup and Imports
In this step, we import necessary libraries, load the configuration, and set up clients. This prepares the environment for interacting with OCI services like Object Storage and Speech AI.

**Key Concepts:**
- Loading YAML config for OCI credentials and bucket settings.
- Initializing Object Storage client for uploading audio files.
- Later, we'll use the Speech client for transcription.
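
The configuration check can be made more specific than a single "invalid configuration" error. The sketch below lists which keys are missing. The key names are the ones this notebook reads from `sandbox.yaml`; the helper `missing_keys` and the example values are hypothetical, and the sketch assumes the config has already been converted to a plain nested `dict`.

```python
# Keys this notebook reads from sandbox.yaml (taken from the cells in this notebook).
REQUIRED_KEYS = {
    "oci": ("configFile", "profile", "compartment"),
    "bucket": ("namespace", "bucketName", "prefix"),
}


def missing_keys(cfg):
    """Return the dotted names of required keys that are absent from cfg."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = cfg.get(section)
        if not isinstance(values, dict):
            # The whole section is absent (or not a mapping).
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in values)
    return missing


# Placeholder values for illustration only, not real settings.
example_cfg = {
    "oci": {"configFile": "~/.oci/config", "profile": "DEFAULT"},
    "bucket": {"namespace": "my-namespace", "bucketName": "my-bucket", "prefix": "demo"},
}
print(missing_keys(example_cfg))  # ['oci.compartment']
```

Reporting the exact missing key tends to shorten setup debugging compared with a generic failure message.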

In [None]:
import os
import time
from pathlib import Path
import oci
from oci.object_storage import ObjectStorageClient
from oci.ai_speech import AIServiceSpeechClient
from dotenv import load_dotenv
from envyaml import EnvYAML
load_dotenv()

# Load the sandbox YAML file via EnvYAML
def load_yaml(path: Path):
    try:
        return EnvYAML(path)
    except Exception as exc:
        # Chain the original error so the root cause stays visible in the traceback
        raise RuntimeError(f"Invalid sandbox yaml {path}.") from exc

# Load sandbox config
SANDBOX_CONFIG_FILE = "sandbox.yaml"
scfg = load_yaml(SANDBOX_CONFIG_FILE)
if scfg is None or "oci" not in scfg or "bucket" not in scfg:
    raise RuntimeError("Invalid sandbox configuration.")

# Extract bucket and OCI config
bucket_cfg = scfg["bucket"]  # holds namespace, bucketName, and prefix
oci_cfg = oci.config.from_file(os.path.expanduser(scfg["oci"]["configFile"]), scfg["oci"]["profile"])
compartment_id = scfg["oci"]["compartment"]
prefix = bucket_cfg["prefix"]

# Initialize Object Storage client
client = ObjectStorageClient(oci_cfg)

print("Setup complete: Config loaded, Object Storage client initialized.")

## Step 2: Select Audio File
Here, we list available MP3 files in the speech directory and select one for transcription. This demonstrates how to prepare input audio for the Speech API.

**Key Concepts:**
- Scanning a directory for audio files.
- Selecting a file (you can modify AUDIO_FILE to choose different samples).
- Supported formats: MP3 (and others like WAV, but this example uses MP3).
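
Because `Path.glob` does not guarantee any particular ordering, an index-based selection can point at a different file from one machine to the next. The sketch below sorts the candidates first and also allows selection by file name. The helper `select_audio` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def select_audio(files, name=None, index=0):
    """Pick a file by exact name if given, otherwise by index in name-sorted order."""
    ordered = sorted(files, key=lambda p: p.name)
    if name is not None:
        for path in ordered:
            if path.name == name:
                return path
        raise FileNotFoundError(f"{name} not found among {len(ordered)} files")
    return ordered[index]


# Illustrative paths only; nothing is read from disk here.
candidates = [Path("speech/b.mp3"), Path("speech/a.mp3")]
print(select_audio(candidates).name)                # a.mp3
print(select_audio(candidates, name="b.mp3").name)  # b.mp3
```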

In [None]:
# Directory containing audio files
audio_dir = Path("speech")

# List MP3 files
mp3_files = list(audio_dir.glob("*.mp3"))
if not mp3_files:
    raise RuntimeError("No MP3 files found in speech directory.")

print("Available MP3 files:")
for i, f in enumerate(mp3_files):
    print(f"{i}: {f.name}")

# Select the first file (change index to try others)
selected_index = 0  # Modify this to select a different file
AUDIO_FILE = mp3_files[selected_index]

print(f"Selected audio file: {AUDIO_FILE}")

## Step 3: Upload Audio to Object Storage
Upload the selected audio file to OCI Object Storage. The Speech API requires audio files to be stored in Object Storage for processing.

**Key Concepts:**
- Using Object Storage client to upload files.
- Constructing object names with prefixes for organization.
- This step ensures the audio is accessible by the Speech service.
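
Object names are plain strings, so a prefix with a trailing slash (or an empty prefix) can produce names such as `demo//file.mp3` or `/file.mp3`. The sketch below shows one way to normalize the name before uploading. The helper `build_object_name` is hypothetical and not part of the notebook.

```python
def build_object_name(prefix, filename):
    """Join a bucket prefix and a file name without doubled or leading slashes."""
    cleaned = prefix.strip("/")
    return f"{cleaned}/{filename}" if cleaned else filename


print(build_object_name("demo", "voice_sample_english.mp3"))   # demo/voice_sample_english.mp3
print(build_object_name("demo/", "voice_sample_english.mp3"))  # demo/voice_sample_english.mp3
print(build_object_name("", "voice_sample_english.mp3"))       # voice_sample_english.mp3
```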

In [None]:
# Construct object name in bucket
object_name = f"{prefix}/{AUDIO_FILE.name}"

# Upload the file
with AUDIO_FILE.open("rb") as fh:
    client.put_object(bucket_cfg["namespace"], bucket_cfg["bucketName"], object_name, fh)

print(f"Uploaded {AUDIO_FILE} → oci://{bucket_cfg['namespace']}/{bucket_cfg['bucketName']}/{object_name}")

## Step 4: Submit Transcription Job
Create and submit a transcription job to the Speech API. Configure the model, language, and output settings.

**Key Concepts:**
- Choosing between models: 'WHISPER_MEDIUM' (auto-detects language) or 'ORACLE' (specify language).
- Supported language codes for ORACLE: en-US, es-ES, pt-BR, etc. (list provided in code).
- Additional settings: diarization (speaker identification), punctuation, output formats (e.g., SRT).
- The job is asynchronous; we get a job ID to track progress.
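
The model and language rule described above can be checked before a job is submitted, so that a typo in a language code fails locally rather than at the service. The following is a minimal sketch: the helper `resolve_language` is hypothetical, the language set is copied from this notebook's list and may not match what the service currently accepts, and the `"auto"` value mirrors the notebook's own choice for WHISPER_MEDIUM.

```python
# Language codes listed in this notebook for the ORACLE model (may be out of date).
ORACLE_LANGUAGES = {
    "en-US", "es-ES", "pt-BR", "en-GB", "en-AU",
    "en-IN", "hi-IN", "fr-FR", "de-DE", "it-IT",
}


def resolve_language(model_type, requested="en-US"):
    """Return the language code to send for the given model type."""
    if model_type == "WHISPER_MEDIUM":
        # The notebook lets this model detect the language itself.
        return "auto"
    if model_type == "ORACLE":
        if requested not in ORACLE_LANGUAGES:
            raise ValueError(f"Unsupported language code for ORACLE: {requested}")
        return requested
    raise ValueError(f"Unknown model type: {model_type}")


print(resolve_language("WHISPER_MEDIUM"))   # auto
print(resolve_language("ORACLE", "es-ES"))  # es-ES
```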

In [None]:
# Supported language codes for ORACLE model
SUPPORTED_LANGUAGE_CODES = {
    "en-US": "English - United States",
    "es-ES": "Spanish - Spain",
    "pt-BR": "Portuguese - Brazil",
    "en-GB": "English - Great Britain",
    "en-AU": "English - Australia",
    "en-IN": "English - India",
    "hi-IN": "Hindi - India",
    "fr-FR": "French - France",
    "de-DE": "German - Germany",
    "it-IT": "Italian - Italy",
}

# Speech service endpoint (Phoenix region)
SPEECH_SERVICE_ENDPOINT = "https://speech.aiservice.us-phoenix-1.oci.oraclecloud.com"

# Initialize Speech client
speech_client = AIServiceSpeechClient(
    config=oci_cfg,
    signer=oci.signer.Signer(
        tenancy=oci_cfg["tenancy"],
        user=oci_cfg["user"],
        fingerprint=oci_cfg["fingerprint"],
        private_key_file_location=oci_cfg["key_file"],
    ),
    service_endpoint=SPEECH_SERVICE_ENDPOINT,
)

# Model configuration
model_type = "WHISPER_MEDIUM"  # Change to "ORACLE" if needed
language_code = "auto" if model_type == "WHISPER_MEDIUM" else "en-US"  # Adjust for ORACLE

# Input location (the uploaded audio)
object_location = oci.ai_speech.models.ObjectLocation(
    namespace_name=bucket_cfg["namespace"],
    bucket_name=bucket_cfg["bucketName"],
    object_names=[object_name],
)
input_location = oci.ai_speech.models.ObjectListInlineInputLocation(
    location_type="OBJECT_LIST_INLINE_INPUT_LOCATION",
    object_locations=[object_location],
)

# Output location (same bucket, with prefix)
output_location = oci.ai_speech.models.OutputLocation(
    namespace_name=bucket_cfg["namespace"],
    bucket_name=bucket_cfg["bucketName"],
    prefix=prefix,
)

# Normalization and settings
normalization = oci.ai_speech.models.TranscriptionNormalization(
    is_punctuation_enabled=True
)
transcription_settings = oci.ai_speech.models.TranscriptionSettings(
    diarization=oci.ai_speech.models.Diarization(is_diarization_enabled=True)
)

# Model details
model_details = oci.ai_speech.models.TranscriptionModelDetails(
    language_code=language_code,
    model_type=model_type,
    domain="GENERIC",
    transcription_settings=transcription_settings,
)

# Job details
job_details = oci.ai_speech.models.CreateTranscriptionJobDetails(
    display_name=f"{prefix}-nb-stt-job",
    compartment_id=compartment_id,
    description="STT Jupyter Notebook Demo",
    model_details=model_details,
    input_location=input_location,
    output_location=output_location,
    normalization=normalization,
    additional_transcription_formats=["SRT"],
)

# Submit the job
response = speech_client.create_transcription_job(create_transcription_job_details=job_details)
job_id = response.data.id
print(f"Transcription job submitted! OCID: {job_id}")

## Step 5: Wait for Transcription Job to Complete
Poll the job status until it finishes. This step demonstrates asynchronous job handling.

**Key Concepts:**
- Asynchronous processing: Jobs take time; we poll every few seconds.
- Possible states: the job passes through in-progress states (such as `ACCEPTED` and `IN_PROGRESS`) and ends as `SUCCEEDED`, `FAILED`, or `CANCELED`.
- Once succeeded, we can retrieve results.
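
A polling loop is easier to reason about and to test when the state lookup is passed in as a function and the loop has an upper time limit. The sketch below is a generic version of that pattern. The helper `poll_until_done` and its parameters are hypothetical, and the state names are assumed to match the job lifecycle states. The fake state source stands in for a real `get_transcription_job` call.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELED")


def poll_until_done(get_state, timeout=600.0, interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call get_state() until it returns a terminal state or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"Job still in state {state} after {timeout} seconds")
        sleep(interval)


# Simulated job that finishes on the third check; sleeping is disabled for the demo.
fake_states = iter(["ACCEPTED", "IN_PROGRESS", "SUCCEEDED"])
print(poll_until_done(lambda: next(fake_states), sleep=lambda seconds: None))  # SUCCEEDED
```

Injecting `sleep` and `clock` keeps the loop testable without waiting; in the notebook, `get_state` would wrap the Speech client call.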

In [None]:
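def wait_for_job(client, job_id, poll_interval=5):
    """Poll the transcription job until it reaches a terminal state."""
    while True:
        job = client.get_transcription_job(job_id).data
        state = job.lifecycle_state
        print(f"Job state: {state}")
        if state == "SUCCEEDED":
            return job
        # Treat cancellation as terminal too, so the loop cannot spin forever
        if state in ("FAILED", "CANCELED"):
            raise RuntimeError(f"Transcription job ended in state {state}.")
        time.sleep(poll_interval)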

# Wait for the job to complete
job_result = wait_for_job(speech_client, job_id)
print("Transcription job completed successfully!")

## Step 6: Retrieve and Display Transcription Results
Download and print the transcript files from Object Storage. This shows how to access job outputs.

**Key Concepts:**
- Listing objects in the bucket under the output prefix.
- Retrieving text-based outputs (e.g., .txt, .srt).
- Displaying results inline (no local file saves in this notebook).
- Experiment with different files or settings by re-running earlier steps.
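
When SRT output is requested, the captions arrive as numbered blocks with timestamps. If only the spoken text is needed, it can be extracted with a few lines of standard-library code. The sketch below assumes conventional SRT layout (index line, timestamp line containing `-->`, then caption text). The helper `srt_to_text` and the sample captions are hypothetical.

```python
import re


def srt_to_text(srt):
    """Return the caption text of an SRT document, one caption per line."""
    captions = []
    normalized = srt.replace("\r\n", "\n").strip()
    # Caption blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            if "-->" in line:
                # Everything after the timestamp line is caption text.
                text = " ".join(part.strip() for part in lines[i + 1:] if part.strip())
                if text:
                    captions.append(text)
                break
    return "\n".join(captions)


sample = """1
00:00:00,000 --> 00:00:02,500
Hello and welcome.

2
00:00:02,500 --> 00:00:05,000
This is a
two-line caption.
"""
print(srt_to_text(sample))
```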

In [None]:
# List output objects
outputs = client.list_objects(
    namespace_name=bucket_cfg["namespace"],
    bucket_name=bucket_cfg["bucketName"],
    prefix=job_result.output_location.prefix,
).data.objects

# Print transcripts
for obj in outputs:
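    # The primary transcript is typically written as JSON, so include it alongside any text or SRT outputs
    if obj.name.lower().endswith(('.json', '.txt', '.srt')):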
        resp = client.get_object(bucket_cfg["namespace"], bucket_cfg["bucketName"], obj.name)
        try:
            text = resp.data.content.decode('utf-8')
        except UnicodeDecodeError:
            # Fall back to a single-byte decoding if the payload is not valid UTF-8
            text = resp.data.content.decode('latin-1')
        print("\n----- Output: {} -----\n".format(obj.name))
        print(text[:10000])  # Limit to first 10k chars for display
        print("\n[...End of {}...]\n".format(obj.name))

print("\n---\nTry another file by re-running Step 2!")

## Experiment and Project Ideas

Now that you've seen the basic steps for transcribing audio with OCI Speech-to-Text, here are some ideas to expand your understanding and build projects:

### Basic Experiments
- **Compare Models:** Switch between WHISPER_MEDIUM and ORACLE models for the same audio file. Note differences in accuracy, language detection, and processing time.
- **Language Variations:** For ORACLE model, try different language codes (e.g., 'es-ES' for Spanish, 'fr-FR' for French) on multilingual audio.
- **Diarization Toggle:** Enable/disable speaker diarization in the transcription settings and observe how it affects the output.
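- **Output Formats:** Add another format such as 'WEBVTT' to the `additional_transcription_formats` list and compare the results. Check the SDK reference for the currently allowed values, since a format such as 'SBV' may not be accepted.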

### Intermediate Projects
- **Batch Processing:** Modify the notebook to process multiple audio files in a loop, uploading and submitting jobs for each.
- **Audio Preprocessing:** Add steps to convert audio formats or trim files before uploading (using libraries like pydub).
- **Cost Estimation:** Calculate estimated costs based on audio duration and model used (check OCI pricing docs).
- **Error Handling:** Improve robustness by adding retries for failed uploads or jobs, and better error messages.
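
For the error-handling idea, one common pattern is to retry a failing step with exponentially growing delays. The following is a minimal sketch and not part of the notebook: the helper `with_retries` and its parameters are hypothetical, and the simulated upload stands in for a real call such as an Object Storage upload.

```python
import time


def with_retries(action, attempts=3, base_delay=1.0,
                 sleep=time.sleep, retry_on=(Exception,)):
    """Run action(), retrying on failure with delays of base_delay * 2**n seconds."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                # Out of attempts: let the last error propagate.
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated upload that fails twice before succeeding.
attempts_seen = []


def flaky_upload():
    attempts_seen.append(1)
    if len(attempts_seen) < 3:
        raise ConnectionError("simulated transient failure")
    return "uploaded"


print(with_retries(flaky_upload, sleep=lambda seconds: None))  # uploaded
```

Narrowing `retry_on` to the exceptions that are actually transient avoids retrying errors that will never succeed, such as a bad bucket name.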

### Advanced Projects
- **Real-Time Integration:** Combine with the real-time speech SDK for live transcription in a web app.
- **Post-Processing with AI:** Use the transcript as input to other OCI AI services like Language for sentiment analysis or summarization.
- **Multi-Modal App:** Build a full application that transcribes audio, then uses Vision API on related images or videos.
- **Custom Vocabulary:** If supported, experiment with custom vocabularies or domain-specific models for specialized audio (e.g., medical or legal terms).

### Tips for Learning
- Always check the OCI documentation for the latest features and supported formats.
- Join the #oci_speech_service_users Slack channel for community support.
- Experiment in a free tier or sandbox environment to avoid unexpected costs.

Have fun exploring OCI Speech-to-Text!