# Description

- torchcodec is required to decode (read/load) files.
- torchcodec follows a compatibility matrix with torch versions: https://github.com/meta-pytorch/torchcodec?tab=readme-ov-file#installing-torchcodec
- torchcodec depends on ffmpeg (versions 4 to 8). I install ffmpeg inside the Docker image with `apt-get install -y ffmpeg`.
- HF `datasets.load_dataset()` creates an audio instance that expects a torchcodec version providing the `AudioDecoder` class.
    - However, torchcodec is only compatible with certain torch versions, and the torch version must in turn be compatible with the CUDA version. My CUDA version is currently 12.2, which allows me to upgrade torch to v2.5.1, which in turn only allows torchcodec v0.1. That version does not have the `AudioDecoder` class expected by the audio instances created with `datasets.load_dataset()`.
    - CUDA 12.2 → Torch 2.5.1 → Torchcodec 0.1 (no AudioDecoder) ← HF datasets (expects AudioDecoder)
    - Verification of the classes available in the current torchcodec version:
      ```
      import torchcodec.decoders
      print(dir(torchcodec.decoders))
      ```
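The `dir()` check above can also be wrapped in a small helper that answers the yes/no question directly. This is a minimal sketch: the helper name `module_has_attr` is hypothetical, and it treats any import failure (for example, missing ffmpeg libraries) as "not available".

```python
import importlib


def module_has_attr(module_name: str, attr_name: str) -> bool:
    """Return True if `module_name` imports cleanly and exposes `attr_name`."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Any import problem (missing package, missing shared libraries, ...)
        # is reported as "not available".
        return False
    return hasattr(module, attr_name)


# Intended use in this notebook (assumes torchcodec is installed):
# module_has_attr("torchcodec.decoders", "AudioDecoder")
```

If this returns `False`, the installed torchcodec is too old for what `datasets` expects, or it could not be imported at all.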
      

* When using `transformers.pipeline()`, some models are loaded with `torch.load()`. The issue is that torch versions <= 2.5 have vulnerability issues.
* My current transformers version (v5.0.0) blocks the use of `torch.load()` if the torch version is not >= 2.6.
* Workaround:
    * load the weights from `model.safetensors` instead.
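To decide up front whether the safetensors workaround is needed, the installed torch version can be compared with the 2.6 threshold noted above. This is a minimal sketch: the function name `torch_load_allowed` is hypothetical, and the threshold is taken from the note above, not from the transformers source.

```python
import re


def torch_load_allowed(torch_version: str, minimum=(2, 6)) -> bool:
    """Return True if `torch_version` (e.g. '2.5.1+cu121') is >= `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"Cannot parse torch version: {torch_version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return major_minor >= minimum


# Intended use (assumes torch is installed):
# import torch
# needs_safetensors = not torch_load_allowed(torch.__version__)
```

With torch 2.5.1 this returns `False`, which is the case where `use_safetensors=True` is passed to the pipeline.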

# Conf

In [1]:
import os

class cfg: 

    # to store HF pre-trained models weights and configs
    HF_CACHE_ROOT = os.path.join("..", "..", "..",
                                 "data",
                                 "05_cache", 
                                 "HF"
                                )



    # to store fine-tuned models (weights and configs)
    HF_FINETUNE_ROOT = os.path.join("..", "..", "..",
                                    "data",
                                    "06_fine_tune",
                                    "01_tuto",
                                    "01_hug_llm",
                                    "ch03",
                                   )

# HF Cache management

https://huggingface.co/docs/datasets/en/cache

In [2]:
print("HF_HOME:", os.environ.get("HF_HOME"))
os.environ["HF_HOME"] = cfg.HF_CACHE_ROOT
print("HF_HOME:", os.environ.get("HF_HOME"))

HF_HOME: None
HF_HOME: ../../../data/05_cache/HF


In [3]:
print("HF_HUB_CACHE:", os.environ.get("HF_HUB_CACHE"))
os.environ["HF_HUB_CACHE"] = cfg.HF_CACHE_ROOT
print("HF_HUB_CACHE:", os.environ.get("HF_HUB_CACHE"))

HF_HUB_CACHE: None
HF_HUB_CACHE: ../../../data/05_cache/HF
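A relative path such as `../../../data/05_cache/HF` depends on the notebook's working directory, and `huggingface_hub` reads these variables when it is first imported, so they should be set before importing `transformers` or `datasets`. The sketch below resolves the path to an absolute one and creates the folder first. The helper name `set_hf_cache` is hypothetical; like the cells above, it points both variables at the same root.

```python
import os
from pathlib import Path


def set_hf_cache(root: str) -> Path:
    """Create the cache folder and point HF_HOME / HF_HUB_CACHE at it."""
    cache_dir = Path(root).expanduser().resolve()
    cache_dir.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = str(cache_dir)
    os.environ["HF_HUB_CACHE"] = str(cache_dir)
    return cache_dir


# Intended use in this notebook:
# set_hf_cache(cfg.HF_CACHE_ROOT)
```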


In [26]:
import transformers
transformers.__version__

'5.0.0'

# Import libraries

In [67]:
import io
import sys
from dotenv import load_dotenv


#_________
import torch
import torchaudio

#__________
from transformers import pipeline

#_________
import pandas as pd 



from IPython.display import Audio

# Service Token Authentication

In [27]:
# Verify token is loaded
load_dotenv()

HF_TOKEN_READ = os.getenv("07_FR_phone_TokenType_READ")
print(f"Token loaded: {'Yes' if HF_TOKEN_READ else 'No'}")

Token loaded: Yes


# Probing CTC Models

## Load Audio

In [6]:
from datasets import load_dataset
from datasets import Audio as Audio_ds # the instances generated by load_dataset use Audio under the hood to decode (Audio uses torchcodec).


dataset_path = "hf-internal-testing/librispeech_asr_dummy"
dataset = load_dataset(path=dataset_path, # dataset HF path
                       name="clean", 
                       split="validation",
                       token=HF_TOKEN_READ, 
                      )
print("\n==========================") 
dataset




Dataset({
    features: ['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'],
    num_rows: 73
})

### Solving loading and decoding audio without torchcodec

In [7]:
# datasets expects a torchcodec version with the 'AudioDecoder' class.
## However, I need an older version compatible with my torch version, so automatic decoding is disabled here.

ds = dataset.cast_column("audio", Audio_ds(decode=False)) 

In [8]:

sample = ds[2]
# print(sample["text"])
# Audio(sample["audio"]["array"], rate=sample["audio"]["sampling_rate"])

In [9]:
# Pending, this needs to be decoded. 
sample["audio"].keys()

dict_keys(['bytes', 'path'])

In [12]:
sample["audio"]["path"]

'1272-128104-0002.flac'

In [10]:
sample.keys()

dict_keys(['file', 'audio', 'text', 'speaker_id', 'chapter_id', 'id'])

In [23]:
audio_bytes = sample["audio"]["bytes"]
buffer = io.BytesIO(audio_bytes) # using io.BytesIO() to simulate file-like object from binary string data. 
waveform, sample_rate = torchaudio.load(buffer)

print(f"\n Transcription:\n {sample['text']}")
display(Audio(data=waveform, rate=sample_rate))


 Transcription:
 HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
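The `io.BytesIO` step above works because audio readers generally accept any file-like object, not only a path on disk. The sketch below shows the same pattern with only the standard library. It uses WAV because the stdlib `wave` module cannot read FLAC, so it illustrates the pattern and is not a replacement for `torchaudio.load`. Both helper names are hypothetical.

```python
import io
import wave


def make_wav_bytes(n_frames: int = 1600, sample_rate: int = 16000) -> bytes:
    """Build a short silent mono 16-bit WAV file entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    return buffer.getvalue()


def read_wav_info(audio_bytes: bytes):
    """Wrap raw bytes in a file-like object and read the WAV header."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        return wav.getframerate(), wav.getnframes()


sample_rate_hz, n_frames = read_wav_info(make_wav_bytes())
```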


## Load HF pipeline task

### Solving for Vulnerability with torch <= 2.5

In [28]:
from transformers import pipeline

pipe_task = "automatic-speech-recognition"
checkpoint = "facebook/wav2vec2-base-100h"

pipe = pipeline(pipe_task, 
                model=checkpoint,
                token=HF_TOKEN_READ,
                model_kwargs={"use_safetensors": True},  # <<<<<<< see Description: pickle vulnerability | workaround to avoid upgrading torch
               )

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/378M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/212 [00:00<?, ?it/s]

Wav2Vec2ForCTC LOAD REPORT from: facebook/wav2vec2-base-100h
Key                           | Status     | 
------------------------------+------------+-
wav2vec2.mask_time_emb_vector | UNEXPECTED | 
wav2vec2.masked_spec_embed    | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


tokenizer_config.json:   0%|          | 0.00/376 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/358 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/96.0 [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

In [40]:
sample_rate

16000

In [47]:
waveform.numpy()[0]

array([-6.7138672e-04,  6.1035156e-05,  5.1879883e-04, ...,
        1.5258789e-04,  2.1362305e-04,  1.8310547e-04], dtype=float32)

### Solving for pipeline task ASR
https://huggingface.co/docs/transformers/v4.36.1/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline
- Note: the linked docs are for v4.36.1, while this notebook uses transformers v5.0.0.

In [50]:
original_torchcodec = sys.modules.get("torchcodec", None)
print(original_torchcodec)

<module 'torchcodec' from '/usr/local/lib/python3.10/dist-packages/torchcodec/__init__.py'>


In [55]:

# Create a dummy stand-in for the torchcodec module
class DummyTorchCodec:
    class decoders:
        AudioDecoder = type('AudioDecoder', (), {})  # Empty class



# Temporarily replace torchcodec in sys.modules
# sys.modules["torchcodec"] = None  # HF treats this as a loading problem rather than as a missing module
sys.modules["torchcodec"] = DummyTorchCodec()


In [56]:
print(sys.modules.get("torchcodec", None))

<__main__.DummyTorchCodec object at 0x7de7e48cfac0>


In [63]:

pipe(inputs=waveform.squeeze())

{'text': 'HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAUS AND ROSE BEEF LOOMING BEFORE US SIMALYIS DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND'}

In [64]:
sys.modules["torchcodec"] = original_torchcodec

In [65]:
sys.modules.get("torchcodec", None)

<module 'torchcodec' from '/usr/local/lib/python3.10/dist-packages/torchcodec/__init__.py'>

### Solving pipeline task ASR with @contextmanager

In [68]:
from contextlib import contextmanager

In [73]:
@contextmanager  
def disable_torchcodec():
    """
    Temporarily replace the torchcodec module with a dummy stand-in.

    Lets code run that would otherwise fail due to torchcodec import issues;
    the original module is restored on exit.
    """
    # Dummy stand-in exposing `decoders.AudioDecoder`
    class DummyTorchCodec:
        class decoders:
            AudioDecoder = type('AudioDecoder', (), {})  # Empty class

    # out: <module 'torchcodec' from '/usr/local/lib/python3.10/dist-packages/torchcodec/__init__.py'>
    original_torchcodec = sys.modules.get("torchcodec")

    # out: <__main__.DummyTorchCodec object at 0x7de7e48cfac0>
    sys.modules["torchcodec"] = DummyTorchCodec()  # simply setting `sys.modules["torchcodec"] = None` raises ModuleNotFoundError

    #________________________________________________
    # try/finally is the structure @contextmanager expects around `yield`:
    #  the code before `yield` plays the role of __enter__, the `finally` block that of __exit__.

    try:
        yield
    finally:
        if original_torchcodec is not None:
            sys.modules["torchcodec"] = original_torchcodec
        else:
            del sys.modules["torchcodec"]


    
    

In [76]:

with disable_torchcodec():
    out = pipe(inputs=waveform.squeeze())
out['text']

'HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAUS AND ROSE BEEF LOOMING BEFORE US SIMALYIS DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND'

In [75]:
sample["text"]

'HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND'
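The gap between the prediction and the reference above can be summarised as a word error rate (WER): the word-level edit distance divided by the number of reference words. This is a minimal pure-Python sketch with no text normalisation; the function name `word_error_rate` is hypothetical, and dedicated libraries handle more cases.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Intended use in this notebook:
# word_error_rate(sample["text"], out["text"])
```

For this sample the prediction differs from the 32-word reference in three words, so the WER is 3/32, roughly 9.4%.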

In [71]:
sys.modules.get("torchcodec", None)

<module 'torchcodec' from '/usr/local/lib/python3.10/dist-packages/torchcodec/__init__.py'>
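The same swap-and-restore idea can be written for any module name, using `types.ModuleType` so the stand-in is an actual module object. This is a sketch under assumptions: `stub_module` is a hypothetical helper, and whether a given library accepts such a stub depends on what it checks at import time.

```python
import sys
import types
from contextlib import contextmanager

_MISSING = object()  # sentinel: the module was not in sys.modules at all


@contextmanager
def stub_module(name: str, **attrs):
    """Temporarily replace `sys.modules[name]` with an empty stub module."""
    stub = types.ModuleType(name)
    for attr_name, value in attrs.items():
        setattr(stub, attr_name, value)

    original = sys.modules.get(name, _MISSING)
    sys.modules[name] = stub
    try:
        yield stub
    finally:
        # Restore the previous state, whether or not the body raised.
        if original is _MISSING:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = original
```

The sentinel separates "module was never imported" from "module entry was `None`", which a plain `is not None` check cannot do.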

### Shortcomings of a CTC model

- CHRISTMAUS vs. CHRISTMAS
- ROSE vs. ROAST /ˈrəʊz/ vs /ˈrəʊst/
- SIMALYIS vs. SIMILES

This highlights the shortcoming of a CTC model. A CTC model is essentially an ‘acoustic-only’ model: it consists of an encoder which forms hidden-state representations from the audio inputs, and a linear layer which maps the hidden-states to characters.

This means that the system almost entirely bases its prediction on the acoustic input it was given (the phonetic sounds of the audio), and so has a tendency to transcribe the audio in a phonetic way (e.g. CHRISTMAUS). It gives less importance to the language modelling context of previous and successive letters, and so is prone to phonetic spelling errors. A more intelligent model would identify that CHRISTMAUS is not a valid word in the English vocabulary, and correct it to CHRISTMAS when making its predictions. We’re also missing two big features in our prediction - casing and punctuation - which limits the usefulness of the model’s transcriptions to real-world applications.
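To make the "acoustic-only" point concrete, here is a sketch of greedy CTC decoding: take the most likely character per audio frame, merge consecutive repeats, then drop the blank token. No language model is involved at any step, which is why phonetic spellings slip through. The tiny vocabulary and the blank id `0` are hypothetical values chosen for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


vocab = {0: "<blank>", 1: "B", 2: "E", 3: "F"}

# A blank between the two E's keeps them as separate characters.
with_blank = ctc_greedy_decode([1, 1, 2, 0, 2, 2, 3], vocab)

# Without the blank, the repeated E frames merge into a single E.
without_blank = ctc_greedy_decode([1, 1, 2, 2, 2, 3], vocab)
```

The first call yields `BEEF` and the second `BEF`: the output is determined frame by frame, with no check that the result is a real word.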

### Bug: ASR pipeline task, method `preprocess`, line 388

https://github.com/huggingface/transformers/blob/v5.0.0/src/transformers/pipelines/automatic_speech_recognition.py

## Work with Whisper

In [77]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

'cuda:0'

In [78]:
from transformers import pipeline

pipe_task = "automatic-speech-recognition"
checkpoint = "openai/whisper-base"

pipe = pipeline(pipe_task, 
                model=checkpoint,
                token=HF_TOKEN_READ,
                device=device,
                model_kwargs={"use_safetensors": True},  # <<<<<<< see Description: pickle vulnerability | workaround to avoid upgrading torch
               )

config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/290M [00:00<?, ?B/s]

Loading weights:   0%|          | 0/245 [00:00<?, ?it/s]

generation_config.json: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

normalizer.json: 0.00B [00:00, ?B/s]

added_tokens.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

preprocessor_config.json: 0.00B [00:00, ?B/s]

In [83]:
with disable_torchcodec():
    out = pipe(inputs=waveform.squeeze(),
               max_new_tokens=256,
               chunk_length_s=30,
               stride_length_s=5,
               
              )
out

Both `max_new_tokens` (=256) and `max_length`(=448) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


{'text': ' He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly is drawn from eating and its results occur most readily to the mind.'}

**Warning when using chunk_length_s**

Using `chunk_length_s` is very experimental with seq2seq models. The results will not necessarily be entirely accurate and will have caveats. More information: https://github.com/huggingface/transformers/pull/20104. Ignore this warning with pipeline(..., ignore_warning=True). To use Whisper for long-form transcription, use rather the model's `generate` method directly as the model relies on it's own chunking mechanism (cf. Whisper original paper, section 3.8. Long-form Transcription).
Passing `generation_config` together with generation-related arguments=({'max_new_tokens'}) is deprecated and will be removed in future versions. Please pass either a `generation_config` object OR all generation parameters explicitly, but not both.
Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
Both `max_new_tokens` (=256) and `max_length`(=448) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> will take precedence. Please check the docstring of <class 'transformers.generation.logits_process.SuppressTokensLogitsProcessor'> to see related `.generate()` flags.
A custom logits processor of type <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> has been passed to `.generate()`, but it was also created in `.generate()`, given its parameterization. The custom <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> will take precedence. Please check the docstring of <class 'transformers.generation.logits_process.SuppressTokensAtBeginLogitsProcessor'> to see related `.generate()` flags.


### Multilingual Whisper - Getting data config info

In [94]:
from datasets import load_dataset_builder

# Get dataset builder object with full metadata
builder = load_dataset_builder("facebook/multilingual_librispeech", "spanish")

print("=" * 50)
print(f"Dataset: {builder.info.dataset_name}")
print(f"Config: {builder.config.name}")
print(f"Description: {builder.info.description[:200]}...")
print(f"Homepage: {builder.info.homepage}")
print(f"License: {builder.info.license}")
print(f"Citation: {builder.info.citation[:100]}...")

print(f"\nFeatures (columns):")
for feature_name, feature_type in builder.info.features.items():
    print(f"  - {feature_name}: {feature_type}")

print(f"\nAvailable splits:")
for split_name, split_info in builder.info.splits.items():
    print(f"  - {split_name}: {split_info.num_examples:,} examples")

# print(f"\nDataset size: {builder.info.size_in_bytes/1024**3:.2f} GB")
# print("=" * 50)

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]

Dataset: multilingual_librispeech
Config: spanish
Description: ...
Homepage: 
License: 
Citation: ...

Features (columns):
  - audio: Audio(sampling_rate=None, decode=True, num_channels=None, stream_index=None)
  - original_path: Value('string')
  - begin_time: Value('float64')
  - end_time: Value('float64')
  - transcript: Value('string')
  - audio_duration: Value('float64')
  - speaker_id: Value('string')
  - chapter_id: Value('string')
  - file: Value('string')
  - id: Value('string')

Available splits:
  - dev: 2,408 examples
  - test: 2,385 examples
  - train: 220,701 examples
  - 9_hours: 2,110 examples
  - 1_hours: 233 examples


In [91]:
from datasets import get_dataset_split_names

# Check splits for a specific config (language)
splits = get_dataset_split_names("facebook/multilingual_librispeech", "spanish")
print(f"Splits available for 'spanish' config:")
for split in splits:
    print(f"  - {split}")
# Output: ['dev', 'test', 'train', '9_hours', '1_hours']

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]

Splits available for 'spanish' config:
  - dev
  - test
  - train
  - 9_hours
  - 1_hours


In [90]:
from datasets import get_dataset_config_names

# List all configurations (languages, subsets, etc.)
configs = get_dataset_config_names("facebook/multilingual_librispeech")
print("Available configs (languages):")
for config in configs:
    print(f"  - {config}")

Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Available configs (languages):
  - dutch
  - french
  - german
  - italian
  - polish
  - portuguese
  - spanish


### Loading multilingual librispeech (spanish) dataset

In [95]:
dataset_path = "facebook/multilingual_librispeech"
dataset = load_dataset(path=dataset_path, 
                       name="spanish",
                       split="test", 
                       streaming=True,
                       token=HF_TOKEN_READ,
)


Resolving data files:   0%|          | 0/48 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/30 [00:00<?, ?it/s]

In [100]:
ds = dataset.cast_column("audio", Audio_ds(decode=False))
ds

IterableDataset({
    features: ['audio', 'original_path', 'begin_time', 'end_time', 'transcript', 'audio_duration', 'speaker_id', 'chapter_id', 'file', 'id'],
    num_shards: 1
})

In [102]:
sample = next(iter(ds))

In [104]:
sample.keys()

dict_keys(['audio', 'original_path', 'begin_time', 'end_time', 'transcript', 'audio_duration', 'speaker_id', 'chapter_id', 'file', 'id'])

In [108]:
sample["audio"].keys()

dict_keys(['bytes', 'path'])

In [133]:
audio_bytes = sample["audio"]["bytes"]
buffer = io.BytesIO(audio_bytes) # using io.BytesIO() to simulate file-like object from binary string data. 
waveform_m, sample_rate = torchaudio.load(buffer, )

#=======================================
if sample_rate != 16000:
    waveform_m = torchaudio.functional.resample(waveform_m, sample_rate, 16000)
    sample_rate = 16000


print(f"\n Transcription:\n {sample['transcript']}")
display(Audio(data=waveform_m, rate=sample_rate))


 Transcription:
 y las almas buscando algún alivio se revuelven ansiosas y hacen el mundo que así resulta ser del dolor obra el dolor o la nada quien tenga corazón venga y escoja


In [124]:
from transformers import pipeline

pipe_task = "automatic-speech-recognition"
checkpoint = "openai/whisper-base"

pipe = pipeline(pipe_task, 
                model=checkpoint,
                token=HF_TOKEN_READ,
                device=device,
                model_kwargs={"use_safetensors": True},  # <<<<<<< see Description: pickle vulnerability | workaround to avoid upgrading torch
               )

Loading weights:   0%|          | 0/245 [00:00<?, ?it/s]

In [136]:
with disable_torchcodec():
    out = pipe(waveform_m.squeeze().numpy(), 
               max_new_tokens=256,
               generate_kwargs={"language": "spanish", "task": "transcribe"},
               # return_timestamps=True,
               chunk_length_s=30,
               stride_length_s=5,
              )
out

Both `max_new_tokens` (=256) and `max_length`(=448) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


{'text': ' Y las almas, buscando alguna livión, se revuelven ansiosas y hacen el mundo, que así resulta ser del dolor obra. El dolor o la nada, que entenga corazón venga y escoja.'}

In [119]:
display(Audio(data=waveform_m.squeeze(), rate=sample_rate))

In [137]:
with disable_torchcodec():
    out = pipe(waveform_m.squeeze().numpy(), 
               max_new_tokens=256,
               generate_kwargs={"language": "spanish", "task": "translate"},
               # return_timestamps=True,
               chunk_length_s=30,
               stride_length_s=5,
              )
out

Both `max_new_tokens` (=256) and `max_length`(=448) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


{'text': ' And the souls, looking for some relief, are relinquishing themselves in the world, which thus results in the pain of the work. The pain or nothing, who has heart, come and take it.'}

### Chunk long audio files | Same principle for live inference

https://huggingface.co/blog/asr-chunking
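The `chunk_length_s=30` and `stride_length_s=5` arguments used earlier describe overlapping windows: each chunk shares some audio with its neighbours so that words cut at a boundary can be recovered from the overlap. The sketch below computes such window boundaries in samples. It is a simplified illustration, not the pipeline's own code; the helper name `chunk_windows` is hypothetical, and it assumes the same stride on the left and right of each chunk.

```python
def chunk_windows(n_samples, sample_rate=16000,
                  chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping chunks."""
    chunk_len = int(round(chunk_length_s * sample_rate))
    stride = int(round(stride_length_s * sample_rate))
    step = chunk_len - 2 * stride  # non-overlapping part of each chunk
    if step <= 0:
        raise ValueError("chunk_length_s must be larger than 2 * stride_length_s")

    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        windows.append((start, end))
        if end >= n_samples:
            break
        start += step
    return windows


# 70 seconds of 16 kHz audio -> three overlapping 30 s windows
windows_70s = chunk_windows(70 * 16000)
```

With these settings a new chunk starts every 20 seconds and consecutive chunks overlap by 10 seconds.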


**Absolute Latency Targets:**

| **Application** | **Acceptable Latency** | **Industry Standard** | **Notes** |
|-----------------|-----------------------|----------------------|-----------|
| **Live Captioning** | 0.5 - 3 seconds | **< 2 seconds** | Broadcast TV, live events |
| **Video Conferencing** | 0.2 - 1 second | **< 500ms** | Zoom, Teams, Meet |
| **Telephony/IVR** | 0.1 - 0.5 seconds | **< 300ms** | Call centers, voice assistants |
| **Live Streaming** | 1 - 5 seconds | **< 3 seconds** | Twitch, YouTube Live |
| **Court Reporting** | 2 - 10 seconds | **< 5 seconds** | Accuracy > speed |
| **Subtitle Generation** | 5 - 30 seconds | **< 10 seconds** | Post-production |
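To check a setup against the targets in the table, one option is to time a transcription call and compute the real-time factor (processing time divided by audio duration). This is a minimal sketch; the helper names `timed_call` and `real_time_factor` are hypothetical, and the commented usage assumes the `pipe`, `waveform` and `sample_rate` objects from the cells above.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run `fn` and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_s = time.perf_counter() - start
    return result, elapsed_s


def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Values below 1.0 mean the audio is processed faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return processing_s / audio_s


# Intended use in this notebook:
# out, elapsed_s = timed_call(pipe, waveform.squeeze().numpy())
# audio_s = waveform.shape[-1] / sample_rate
# print(elapsed_s, real_time_factor(elapsed_s, audio_s))
```

The first call usually includes warm-up cost (for example, CUDA initialisation), so timing a second call tends to be more representative.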

