# OCI Speech-to-Text: Step-by-Step Transcription

Use this notebook to transcribe audio (MP3) files with Oracle Cloud Infrastructure's Speech-to-Text. 

**MP3 Demos to try:**
- `voice_sample_english.mp3` (included)
- Add more MP3 files to the `speech/` directory if you like (for example, named `voice_sample_*.mp3`).

*Notebook prints transcript(s) below, no local file output!*


## 1. Helpful Links
Docs / Links:
- Service docs:    https://docs.oracle.com/en-us/iaas/Content/speech/home.htm
- Python SDK:      https://github.com/oracle/oci-python-sdk/tree/master/src/oci/ai_speech
- Real time:       https://github.com/oracle/oci-ai-speech-realtime-python-sdk
- Support Slack:   #oci_speech_service_users   |  #igiu-innovation-lab
- Troubleshooting: #igiu-ai-learning

## 2. Imports, Logging, and Helpers

In [1]:
import os
from pathlib import Path
from dotenv import load_dotenv
from envyaml import EnvYAML
import oci
from oci.ai_speech import AIServiceSpeechClient
from oci.object_storage import ObjectStorageClient
import time
import logging
load_dotenv()
logging.basicConfig(level=logging.INFO, format="%(asctime)s  %(levelname)-8s  %(message)s")
logger = logging.getLogger("notebook_stt")

def load_config(config_path):
    try:
        return EnvYAML(config_path)
    except Exception as e:
        logger.error(f"Error loading config: {e}")
        return None

## 3. Choose an MP3 File
All `*.mp3` files from the `speech/` directory will be listed. Add your own to try them!

In [2]:
audio_dir = Path("speech")
mp3_files = list(audio_dir.glob('*.mp3'))
if not mp3_files:
    raise FileNotFoundError("No MP3 files found in 'speech/' directory.")
print("Available MP3 files:", [f.name for f in mp3_files])

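# Select the file to transcribe. The listing order is not guaranteed,
# so change the index (or pick by name) to choose a specific file.
AUDIO_FILE = mp3_files[0]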
print(f"Using audio file: {AUDIO_FILE}")

Available MP3 files: ['voice_sample_spanish.mp3', 'voice_sample_french.mp3', 'text_sample_hindi.mp3', 'voice_sample_hindi.mp3', 'voice_sample_english.mp3', 'text_sample_english.mp3']
Using audio file: speech/voice_sample_spanish.mp3
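
Because `mp3_files[0]` depends on the order in which the directory happens to be listed, selecting a file by name can be more predictable. The following is a minimal sketch; `pick_audio_file` is a hypothetical helper that is not part of the original notebook, and the example paths only mirror the file names printed above.

```python
from pathlib import Path


def pick_audio_file(files, name):
    """Return the Path in `files` whose file name equals `name`.

    Hypothetical helper for illustration; not part of the notebook.
    """
    matches = [f for f in files if f.name == name]
    if not matches:
        available = ", ".join(sorted(f.name for f in files))
        raise FileNotFoundError(f"{name!r} not found. Available: {available}")
    return matches[0]


# Example paths that mirror the listing above (no disk access is needed).
candidates = [
    Path("speech/voice_sample_spanish.mp3"),
    Path("speech/voice_sample_english.mp3"),
]
chosen = pick_audio_file(candidates, "voice_sample_english.mp3")
print(f"Using audio file: {chosen}")
```

In the notebook, `AUDIO_FILE = pick_audio_file(mp3_files, "voice_sample_english.mp3")` could stand in for the index-based selection.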


## 4. Set Up OCI Config and Upload Audio

In [3]:
SANDBOX_CONFIG_FILE = "sandbox.yaml"

scfg = load_config(SANDBOX_CONFIG_FILE)
if scfg is None or "oci" not in scfg or "bucket" not in scfg:
    raise RuntimeError("Invalid sandbox configuration.")

bucket_cfg = scfg["bucket"]
oci_cfg = oci.config.from_file(os.path.expanduser(scfg["oci"]["configFile"]), scfg["oci"]["profile"])
compartment_id = scfg["oci"]["compartment"]
prefix = bucket_cfg["prefix"]

client = ObjectStorageClient(oci_cfg)
object_name = f"{prefix}/{AUDIO_FILE.name}"
with AUDIO_FILE.open("rb") as fh:
    client.put_object(bucket_cfg["namespace"], bucket_cfg["bucketName"], object_name, fh)
print(f"Uploaded {AUDIO_FILE} → oci://{bucket_cfg['namespace']}/{bucket_cfg['bucketName']}/{object_name}")

Uploaded speech/voice_sample_spanish.mp3 → oci://axaemuxiyife/AISandbox-Bucket/asagarwa/voice_sample_spanish.mp3
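
The cell above reads six settings from `sandbox.yaml`: `configFile`, `profile`, and `compartment` under `oci`, and `namespace`, `bucketName`, and `prefix` under `bucket`. A small check like the one sketched below can report which of them are missing before any upload is attempted. `missing_config_keys` is a hypothetical helper, the sample values are placeholders, and the sketch works on a plain dictionary; whether it can be applied unchanged to the `EnvYAML` object is an assumption.

```python
# Keys the notebook reads from the sandbox configuration.
REQUIRED_KEYS = {
    "oci": ("configFile", "profile", "compartment"),
    "bucket": ("namespace", "bucketName", "prefix"),
}


def missing_config_keys(cfg):
    """Return dotted names of required keys absent from a nested dict.

    Hypothetical helper for illustration; not part of the notebook.
    """
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        values = cfg.get(section) or {}
        for key in keys:
            if key not in values:
                missing.append(f"{section}.{key}")
    return missing


# Placeholder configuration with one key deliberately left out.
sample_cfg = {
    "oci": {"configFile": "~/.oci/config", "profile": "DEFAULT"},
    "bucket": {
        "namespace": "my-namespace",
        "bucketName": "my-bucket",
        "prefix": "my-prefix",
    },
}
problems = missing_config_keys(sample_cfg)
print("Missing keys:", problems)
```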


## 5. Start Speech Transcription Job
- You can switch `model_type` to `'ORACLE'` or `'WHISPER_MEDIUM'` below.
- If you use `ORACLE`, make sure `language_code` matches the language spoken in the audio.

Supported language codes (`SUPPORTED_LANGUAGE_CODES`):
- `en-US`: English - United States
- `es-ES`: Spanish - Spain
- `pt-BR`: Portuguese - Brazil
- `en-GB`: English - Great Britain
- `en-AU`: English - Australia
- `en-IN`: English - India
- `hi-IN`: Hindi - India
- `fr-FR`: French - France
- `de-DE`: German - Germany
- `it-IT`: Italian - Italy
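
The rule above (a matching language code for `ORACLE`, `"auto"` for `WHISPER_MEDIUM`) can be checked before a job is submitted. The sketch below assumes a hypothetical helper `resolve_language_code`. Its table copies only the codes listed above, so it may not cover everything the service accepts.

```python
# Codes copied from the list above; not necessarily exhaustive.
SUPPORTED_LANGUAGE_CODES = {
    "en-US": "English - United States",
    "es-ES": "Spanish - Spain",
    "pt-BR": "Portuguese - Brazil",
    "en-GB": "English - Great Britain",
    "en-AU": "English - Australia",
    "en-IN": "English - India",
    "hi-IN": "Hindi - India",
    "fr-FR": "French - France",
    "de-DE": "German - Germany",
    "it-IT": "Italian - Italy",
}


def resolve_language_code(model_type, language_code=None):
    """Return the language code to send for the given model type.

    Hypothetical helper: WHISPER_MEDIUM always gets "auto" (as in this
    notebook); ORACLE requires one of the codes in the table above.
    """
    if model_type == "WHISPER_MEDIUM":
        return "auto"
    if model_type == "ORACLE":
        if language_code not in SUPPORTED_LANGUAGE_CODES:
            raise ValueError(f"Unsupported language code for ORACLE: {language_code!r}")
        return language_code
    raise ValueError(f"Unknown model_type: {model_type!r}")


print(resolve_language_code("WHISPER_MEDIUM"))
print(resolve_language_code("ORACLE", "es-ES"))
```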

In [6]:
SPEECH_SERVICE_ENDPOINT = "https://speech.aiservice.us-phoenix-1.oci.oraclecloud.com"

speech_client = AIServiceSpeechClient(
    config=oci_cfg,
    signer=oci.signer.Signer(
        tenancy=oci_cfg["tenancy"],
        user=oci_cfg["user"],
        fingerprint=oci_cfg["fingerprint"],
        private_key_file_location=oci_cfg["key_file"],
    ),
    service_endpoint=SPEECH_SERVICE_ENDPOINT,
)

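model_type = "WHISPER_MEDIUM"  # or "ORACLE"
# WHISPER_MEDIUM uses "auto"; for ORACLE, replace "en-US" with the code
# that matches your audio (for example "es-ES" for the Spanish sample).
language_code = "auto" if model_type == "WHISPER_MEDIUM" else "en-US"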

object_location = oci.ai_speech.models.ObjectLocation(
    namespace_name=bucket_cfg["namespace"],
    bucket_name=bucket_cfg["bucketName"],
    object_names=[object_name],
)
input_location = oci.ai_speech.models.ObjectListInlineInputLocation(
    location_type="OBJECT_LIST_INLINE_INPUT_LOCATION",
    object_locations=[object_location],
)
output_location = oci.ai_speech.models.OutputLocation(
    namespace_name=bucket_cfg["namespace"],
    bucket_name=bucket_cfg["bucketName"],
    prefix=prefix,
)
normalization = oci.ai_speech.models.TranscriptionNormalization(
    is_punctuation_enabled=True
)
transcription_settings = oci.ai_speech.models.TranscriptionSettings(
    diarization=oci.ai_speech.models.Diarization(is_diarization_enabled=True)
)

model_details = oci.ai_speech.models.TranscriptionModelDetails(
    language_code=language_code,
    model_type=model_type,
    domain="GENERIC",
    transcription_settings=transcription_settings,
)
job_details = oci.ai_speech.models.CreateTranscriptionJobDetails(
    display_name=f"{prefix}-nb-stt-job",
    compartment_id=compartment_id,
    description="STT Jupyter Notebook Demo",
    model_details=model_details,
    input_location=input_location,
    output_location=output_location,
    normalization=normalization,
    additional_transcription_formats=["SRT"],
)
response = speech_client.create_transcription_job(create_transcription_job_details=job_details)
job_id = response.data.id
print(f"Transcription job submitted! OCID: {job_id}")

Transcription job submitted! OCID: ocid1.aispeechtranscriptionjob.oc1.phx.amaaaaaaghwivzaaco5v5e6h2jpdrpdrpcoakz3amn4qprxe5v3475ller6a


## 6. Wait for Transcription to Finish

In [7]:
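def wait_for_job(client, job_id, poll_interval=5, timeout=1800):
    """Poll the job until it succeeds; raise if it fails, is canceled, or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = client.get_transcription_job(job_id).data
        state = job.lifecycle_state
        print(f"Job state: {state}")
        if state == "SUCCEEDED":
            return job
        # Stop on any unsuccessful terminal state so the loop cannot spin forever.
        if state in ("FAILED", "CANCELED"):
            raise RuntimeError(f"Transcription job ended in state {state}!")
        time.sleep(poll_interval)
    raise TimeoutError(f"Transcription job not finished after {timeout} seconds.")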

job_result = wait_for_job(speech_client, job_id)

Job state: IN_PROGRESS
Job state: IN_PROGRESS
Job state: SUCCEEDED
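
The loop above polls at a fixed interval. For longer audio, a capped exponential backoff reduces the number of status calls. The sketch below is a generic, hypothetical `poll_until_terminal` helper that takes any zero-argument function returning the state string, so it can be tried without calling OCI. Treating `FAILED` and `CANCELED` as terminal alongside `SUCCEEDED` is an assumption of this sketch.

```python
import time


def poll_until_terminal(get_state, initial_delay=1.0, max_delay=30.0,
                        timeout=600.0, sleep=time.sleep):
    """Call get_state() until it returns a terminal state.

    Hypothetical helper: the delay doubles after each poll, capped at
    max_delay; TimeoutError is raised once `timeout` seconds of waiting
    have accumulated.
    """
    terminal = {"SUCCEEDED", "FAILED", "CANCELED"}  # assumed terminal states
    delay = initial_delay
    waited = 0.0
    while True:
        state = get_state()
        if state in terminal:
            return state
        if waited >= timeout:
            raise TimeoutError(f"Still {state} after {waited:.0f} seconds")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)


# Simulated job: a canned sequence of states and a fake sleep that records delays.
states = iter(["ACCEPTED", "IN_PROGRESS", "IN_PROGRESS", "SUCCEEDED"])
delays = []
final_state = poll_until_terminal(lambda: next(states), sleep=delays.append)
print(final_state, delays)
```

In the notebook, `get_state` could be `lambda: speech_client.get_transcription_job(job_id).data.lifecycle_state`.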


## 7. Print the Transcript(s) from Object Storage
Nothing is saved locally: the cell below reads each transcript from Object Storage and prints it. Try streaming, or rerun the notebook with a different `.mp3` file!

In [8]:
outputs = client.list_objects(
    namespace_name=bucket_cfg["namespace"],
    bucket_name=bucket_cfg["bucketName"],
    prefix=job_result.output_location.prefix,
).data.objects

for obj in outputs:
    if obj.name.lower().endswith(('.txt', '.srt')):
        resp = client.get_object(bucket_cfg["namespace"], bucket_cfg["bucketName"], obj.name)
        try:
            text = resp.data.content.decode('utf-8')
        except UnicodeDecodeError:
            # Fall back to latin-1, which can decode any byte sequence.
            text = resp.data.content.decode('latin-1')
        print("\n----- Output: {} -----\n".format(obj.name))
        print(text[:10000])
        print("\n[...End of {}...]\n".format(obj.name))


----- Output: asagarwa/job-amaaaaaaghwivzaaco5v5e6h2jpdrpdrpcoakz3amn4qprxe5v3475ller6a/axaemuxiyife_AISandbox-Bucket_asagarwa/voice_sample_spanish.mp3.srt -----

1
00:00:00,250 --> 00:00:03,850
Un párrafo es una serie de oraciones
organizadas y coherentes,

2
00:00:04,150 --> 00:00:06,250
y todas relacionadas con un solo tema.

3
00:00:06,810 --> 00:00:09,990
Casi todos los escritos que escribas que
sean más largos que

4
00:00:09,990 --> 00:00:12,950
unas pocas oraciones deben organizarse en
párrafos.



[...End of asagarwa/job-amaaaaaaghwivzaaco5v5e6h2jpdrpdrpcoakz3amn4qprxe5v3475ller6a/axaemuxiyife_AISandbox-Bucket_asagarwa/voice_sample_spanish.mp3.srt...]
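
The SRT output above interleaves cue numbers, timestamps, and caption text. If only the spoken text is needed, the cues can be flattened into a single string. The sketch below assumes a hypothetical helper `srt_to_text` and uses the first two cues from the output above as sample input.

```python
import re


def srt_to_text(srt):
    """Join the caption lines of an SRT document into one plain-text string.

    Hypothetical helper: each cue is expected to be a number line, a
    timing line containing "-->", then one or more text lines.
    """
    pieces = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        parts = block.strip().splitlines()
        if len(parts) >= 3 and "-->" in parts[1]:
            pieces.extend(p.strip() for p in parts[2:])
    return " ".join(pieces)


sample_srt = """1
00:00:00,250 --> 00:00:03,850
Un párrafo es una serie de oraciones
organizadas y coherentes,

2
00:00:04,150 --> 00:00:06,250
y todas relacionadas con un solo tema.
"""
plain_text = srt_to_text(sample_srt)
print(plain_text)
```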



---
Try another file by re-running the file-selection cell!