# Automatic Speech Recognition (ASR)
The **Automatic Speech Recognition (ASR)** pipeline in Hugging Face's Transformers library provides a streamlined way to convert spoken audio into text.

This pipeline abstracts away the complexities of model loading, pre-processing, and post-processing, allowing for quick and efficient **ASR** inference.

## The ASR pipeline typically involves the following stages:
- **Feature Extraction:**

  The **raw audio input is pre-processed** by a feature extractor to convert it into a **numerical representation** suitable for the model, such as **log-mel spectrograms** for Whisper or a normalized waveform for Wav2Vec2.

- **Model Inference:**
  
  A **pre-trained ASR** model then takes these features and performs the **sequence-to-sequence mapping**, predicting a sequence of tokens.

- **Tokenization and Post-processing:**
  
  A **tokenizer** converts these *predicted tokens into human-readable text.* This stage may also involve further **post-processing** steps depending on the specific model and task (*e.g., adding timestamps*).
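
To make the feature-extraction stage above more concrete, the following is a minimal, dependency-free sketch of how one frame of audio could be turned into log-mel features. It is an illustration only: the frame length, the 20 mel bands, the HTK-style mel formula, and all function names are assumptions chosen for readability, and real feature extractors differ in such details.

```python
import cmath
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mels (HTK-style formula, assumed here for simplicity)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels, sample_rate):
    """Return n_mels + 2 edge frequencies (Hz), evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    return [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]


def power_spectrum(frame):
    """Apply a Hann window and return |DFT|^2 for the non-negative frequencies."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def log_mel_frame(frame, sample_rate=16000, n_mels=20):
    """Pool the power spectrum with triangular mel filters, then take log10."""
    n = len(frame)
    power = power_spectrum(frame)
    edges = mel_band_edges(n_mels, sample_rate)
    features = []
    for m in range(n_mels):
        low, center, high = edges[m], edges[m + 1], edges[m + 2]
        energy = 0.0
        for k, p in enumerate(power):
            freq = k * sample_rate / n
            if low < freq <= center:      # rising edge of the triangle
                energy += p * (freq - low) / (center - low)
            elif center < freq < high:    # falling edge of the triangle
                energy += p * (high - freq) / (high - center)
        features.append(math.log10(max(energy, 1e-10)))  # clamp to avoid log(0)
    return features


# A 1 kHz test tone, 256 samples long, sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2.0 * math.pi * 1000.0 * t / sample_rate) for t in range(256)]
features = log_mel_frame(tone, sample_rate)
loudest = max(range(len(features)), key=lambda m: features[m])
print(f"{len(features)} mel bands; loudest band index: {loudest}")
```

The band with the most energy should be the one whose triangular filter covers 1 kHz. A full spectrogram repeats this computation for overlapping frames across the whole recording.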



## Using the ASR Pipeline:
To use the **ASR pipeline**, import the `pipeline` function from `transformers` and specify the task as `"automatic-speech-recognition"`. If no model is specified, a default model is used; otherwise, a specific model from the **Hugging Face Hub** can be provided.

In [1]:
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.57.0-py3-none-any.whl.metadata (41 kB)
Downloading transformers-4.57.0-py3-none-any.whl (12.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.56.2
    Uninstalling transformers-4.56.2:
      Successfully uninstalled transformers-4.56.2
Successfully installed transformers-4.57.0


In [2]:
from transformers import pipeline

In [3]:
task = "automatic-speech-recognition"

# facebook/wav2vec2-base-960h
`facebook/wav2vec2-base-960h` is an **automatic speech recognition (ASR)** model developed by Facebook (now Meta AI). It is a variant of the **Wav2Vec2** framework, known for its ability to learn powerful representations from speech audio alone.

## Key features
- Purpose: The model's primary function is to transcribe English audio into text.
- Architecture: It is a "base" model, meaning it is the smaller version compared to the "large" variant. It uses a Transformer-based architecture.
- Training data:
  - Pre-training: Like other Wav2Vec models, it was initially trained in a self-supervised manner on a large amount of unlabeled audio data.
  - Fine-tuning: It was then fine-tuned on 960 hours of labeled speech data from the Librispeech dataset. This final step adapts the model for the ASR task. The "960h" in its name specifically refers to this fine-tuning on 960 hours of Librispeech data.
- Input requirements: The model expects audio sampled at 16 kHz; audio recorded at another rate should be resampled first.
- Performance: The Wav2Vec2 framework, including this base model, is known for reaching strong recognition accuracy even with limited labeled data, because most of the learning happens during self-supervised pre-training.
- Platform: The model is widely available on the Hugging Face model hub, making it easy for developers to download, use, and further fine-tune for specific applications.
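
As a rough illustration of the 16 kHz input requirement, the sketch below resamples a list of samples with plain linear interpolation. The function name is hypothetical, and the method is deliberately simplistic: it applies no anti-aliasing filter, so in practice a dedicated audio library (for example librosa or torchaudio) would be a better choice.

```python
def resample_linear(samples, source_rate, target_rate=16000):
    """Resample a mono signal by linear interpolation (no anti-aliasing)."""
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("sample rates must be positive")
    if source_rate == target_rate or len(samples) < 2:
        return list(samples)

    n_out = len(samples) * target_rate // source_rate
    step = source_rate / target_rate  # input samples advanced per output sample
    last = len(samples) - 1
    resampled = []
    for i in range(n_out):
        position = i * step
        left = min(int(position), last)
        right = min(left + 1, last)
        fraction = position - left
        resampled.append(samples[left] * (1.0 - fraction) + samples[right] * fraction)
    return resampled


# One second of a 48 kHz ramp becomes one second at 16 kHz.
ramp_48k = list(range(48000))
ramp_16k = resample_linear(ramp_48k, source_rate=48000)
print(len(ramp_16k))  # 16000 samples
```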

## How it works
The model processes speech in two main stages:
1. Feature encoding: The model's convolutional feature encoder processes raw audio waveforms to create a sequence of latent speech representations.

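2. Contextualization: These representations are then fed into a Transformer model, which produces context-aware representations for each time step. During self-supervised pre-training, some time steps are masked and the model solves a contrastive task to identify the correct representation for each masked step, which forces it to learn the underlying structure of speech. During fine-tuning, an output layer on top of the Transformer predicts characters.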
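
The fine-tuned checkpoint is loaded as `Wav2Vec2ForCTC`, meaning its per-frame predictions are turned into text with CTC decoding. The following is a minimal sketch of greedy CTC decoding, assuming a toy vocabulary invented for this example (the real checkpoint has its own vocabulary, in which the padding token plays the role of the CTC blank and `|` marks word boundaries).

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the checkpoint's real one.
toy_vocab = {0: "<pad>", 1: "|", 2: "A", 3: "D", 4: "E", 5: "M", 6: "R"}

# One predicted id per audio frame (the argmax over the model's logits).
frame_predictions = [2, 2, 0, 1, 1, 3, 6, 6, 0, 4, 2, 2, 5, 0]
print(ctc_greedy_decode(frame_predictions, toy_vocab))  # A DREAM
```

The blank token is what lets the model emit a genuine double letter: `[4, 4, 0, 4]` decodes to `EE`, whereas `[4, 4, 4]` collapses to a single `E`.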

## Use cases
The model is suitable for a variety of speech-to-text applications, such as:
- Transcribing audio files from podcasts or meetings
- Building components for voice-controlled devices
- Assisting people with disabilities who cannot use a keyboard

In [4]:
model_fb_wav2vec2_base_960h = "facebook/wav2vec2-base-960h"
source_1 = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
asr_transcribe = pipeline(task, model=model_fb_wav2vec2_base_960h)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cuda:0


In [5]:
asr_transcribe(inputs=source_1)

{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'}

# openai/whisper-large-v3
`openai/whisper-large-v3` is a state-of-the-art, **multilingual automatic speech recognition (ASR)** and **speech translation model** developed by **OpenAI**. It is designed to *transcribe and translate audio with high accuracy*, even in challenging conditions like *background noise and accents*.

Trained on about **5 million hours** of audio (roughly 1 million hours of weakly labeled audio plus 4 million hours pseudo-labeled with Whisper large-v2), this model provides robust performance for various applications, including short and long-form transcription, and can be used via APIs or run locally.

## Key Features
- **High Accuracy:** Whisper large-v3 offers improved accuracy over previous versions, with a reported *10-20% reduction in errors* compared to Whisper large-v2.

- **Multilingual Support:** It supports **99 languages**, enabling both **transcription** in the spoken language and **translation** into English.

- **Zero-Shot Generalization:** The model generalizes well to many datasets and domains, even without fine-tuning.

- **Robustness:** It is highly robust to various audio conditions, including *background noise, accents, and technical jargon.*

- **Versatile Applications:** It can be used for various applications, from short audio clips to sequential or chunked long-form audio transcription and translation.
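
Since accuracy claims for ASR models are usually stated as word error rate (WER), here is a minimal sketch of how WER can be computed: the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The function name and the lowercase/whitespace normalization are assumptions made for this example; published benchmarks typically apply more elaborate text normalization.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row in memory.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            current.append(min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref_words)


reference = "the cat sat on the mat"
print(word_error_rate(reference, "the cat sat on the mat"))  # 0.0
print(word_error_rate(reference, "the cat sat on mat"))      # one deletion out of six words
```

Read this way, a relative error reduction of 10-20% would take a WER of, say, 0.10 down to roughly 0.08-0.09.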


## How It Works

* **Input:** The model takes an *audio file as input*.

* **Processing:** The audio is converted into a log-mel spectrogram, which a Transformer encoder-decoder trained on *massive multilingual and multitask datasets* turns into a sequence of text tokens.

* **Output:** It outputs the *transcribed text of the audio* in the spoken language, or alternatively a translation of the speech into English.
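
Whether Whisper transcribes or translates is selected at generation time. In Transformers this is commonly done by passing `task` and `language` through the pipeline's `generate_kwargs` argument. The sketch below wraps that in two hypothetical helpers (`whisper_generate_kwargs` and `run_whisper` are names invented here); the import is deferred so that nothing is downloaded until `run_whisper` is actually called.

```python
def whisper_generate_kwargs(task="transcribe", language=None):
    """Build the generation options that select Whisper's task and language."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    options = {"task": task}
    if language is not None:
        options["language"] = language  # e.g. "hindi"; omit to auto-detect
    return options


def run_whisper(audio_path, task="transcribe", language=None,
                model_name="openai/whisper-small"):
    """Run the ASR pipeline on one file (requires transformers and a model download)."""
    from transformers import pipeline  # deferred import: heavy dependency

    asr = pipeline("automatic-speech-recognition", model=model_name)
    return asr(
        audio_path,
        return_timestamps=True,
        generate_kwargs=whisper_generate_kwargs(task, language),
    )


print(whisper_generate_kwargs("translate"))
print(whisper_generate_kwargs("transcribe", language="hindi"))
```

For example, `run_whisper("song.mp3", task="translate")` would request an English translation of a hypothetical file `song.mp3`, while `task="transcribe"` keeps the output in the spoken language.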


## Use Cases

- **Transcription Services:** *Transcribing meetings, interviews, and other audio content.*

- **Content Localization:** *Translating spoken content from other languages into English.*

- **Accessibility:** *Providing accurate captions for videos and audio.*

- **Voice Assistants:** *Enabling more robust and accurate speech understanding for voice-controlled devices.*

In [6]:
model_openai_whisper_large_v3 = "openai/whisper-large-v3"
# Audio file uploaded directly to the Google Colab session
source_2 = "Aasan Nahin Yahan.mp3"
asr_transcribe = pipeline(task, model=model_openai_whisper_large_v3)

Device set to use cuda:0


In [7]:
asr_transcribe(inputs=source_2, return_timestamps=True, language='en')

Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


{'text': ' 🎵 Oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, oh, Oh, oh, oh, oh, oh, oh, oh Bato se aage, wado se aage, dekho zara tum kabhi Yeh toh hai shola, yeh 

# openai/whisper-small
`openai/whisper-small` is the multilingual version of **OpenAI's small-sized Whisper** model for **automatic speech recognition (ASR)** and speech translation. The model is available on the Hugging Face Hub for developers to use and fine-tune.

## Core capabilities
- **Automatic Speech Recognition (ASR):** Transcribes spoken language into written text in the same language as the audio.

- **Speech translation:** Translates speech from a source language into English.

- **Multilingual support:** Trained on a diverse dataset of **680,000 hours** of labeled audio data, it can process speech in **98 different languages.**

- **Timestamp prediction:** Predicts sequence-level timestamps for transcriptions.
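
When the pipeline is called with `return_timestamps=True`, its result contains a `chunks` list alongside `text`, where each chunk carries a `(start, end)` timestamp in seconds. The sketch below formats such chunks for display. The helper names and the sample data are invented for illustration; the `None` end time mirrors the case where Whisper does not predict a final timestamp.

```python
def format_seconds(seconds):
    """Render seconds as MM:SS.ss, or a placeholder when the time is missing."""
    if seconds is None:
        return "??:??.??"
    minutes, remainder = divmod(seconds, 60)
    return f"{int(minutes):02d}:{remainder:05.2f}"


def format_chunks(chunks):
    """Turn pipeline-style timestamped chunks into printable lines."""
    lines = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        lines.append(
            f"[{format_seconds(start)} -> {format_seconds(end)}] {chunk['text'].strip()}"
        )
    return lines


# Made-up example shaped like the pipeline's "chunks" output.
sample_chunks = [
    {"timestamp": (0.0, 4.5), "text": " Hello there"},
    {"timestamp": (65.5, None), "text": " and goodbye"},
]
for line in format_chunks(sample_chunks):
    print(line)
```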

## Technical specifications
- **Model size:** The "small" model has **244 million parameters**, balancing performance and efficiency.

- **Model architecture:** Uses a **transformer-based encoder-decoder** architecture, also known as a **sequence-to-sequence model.**

- **Training data:** Built using a large and diverse dataset, which makes it robust to accents, background noise, and technical language.

- **Input processing:** Takes *audio input of up to* **30 seconds.** Longer audio can be transcribed by splitting it into smaller chunks using a chunking algorithm.
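
The chunking idea mentioned above can be sketched as follows: split a long recording into windows of at most 30 seconds that overlap slightly, so that words cut at a boundary appear whole in a neighboring window. The function name and the 5-second overlap are assumptions for illustration; the Transformers pipeline exposes its own chunking through the `chunk_length_s` argument, and its internal handling of the overlap is more involved than this.

```python
def chunk_boundaries(duration_s, chunk_length_s=30.0, overlap_s=5.0):
    """Return (start, end) times in seconds for overlapping fixed-length windows."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    if not 0 <= overlap_s < chunk_length_s:
        raise ValueError("overlap must be non-negative and shorter than the chunk")

    step = chunk_length_s - overlap_s
    boundaries = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, duration_s)
        boundaries.append((start, end))
        if end >= duration_s:
            break
        start += step
    return boundaries


# A 70-second recording is covered by three overlapping windows.
for start, end in chunk_boundaries(70.0):
    print(f"{start:5.1f}s to {end:5.1f}s")
```

Each window would be transcribed separately, and the overlapping regions are then used to stitch the partial transcripts together.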

## How it compares to other Whisper models

Whisper models come in several sizes to accommodate different needs, balancing speed, accuracy, and computational requirements.

- **Larger models (e.g., medium, large):** Offer higher accuracy but require more memory (VRAM) and are slower.
- **Smaller models (e.g., tiny, base):** Are faster and more efficient for devices with limited resources, but are less accurate.
- **Small model:** Offers a strong balance between high accuracy and efficient performance, making it suitable for applications that require good recognition quality without the need for the largest, most resource-intensive models.

## Usage on Hugging Face
- To use `openai/whisper-small` via the Hugging Face library, you pair the model with a **WhisperProcessor**. This processor handles both the conversion of raw audio into a format the model can use and the conversion of the model's output tokens back into readable text.
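
A minimal sketch of that pairing is shown below, assuming the audio has already been loaded as a one-dimensional array of samples at 16 kHz. The wrapper function `transcribe_with_processor` is a hypothetical name; the imports are deferred so that the model is only downloaded when the function is called with valid input.

```python
def transcribe_with_processor(waveform, sampling_rate,
                              model_name="openai/whisper-small"):
    """Transcribe up to 30 s of 16 kHz audio using WhisperProcessor directly."""
    if sampling_rate != 16000:
        raise ValueError("Whisper expects 16 kHz audio; resample the input first")

    # Deferred imports: these require transformers, torch, and a model download.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_name)
    model = WhisperForConditionalGeneration.from_pretrained(model_name)

    # Feature extraction: raw samples -> log-mel input features.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

    # Model inference: generate token ids from the features.
    predicted_ids = model.generate(inputs.input_features)

    # Decoding: token ids -> text, dropping special tokens.
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

The three commented steps correspond to the feature extraction, model inference, and tokenization stages that the `pipeline` helper otherwise performs automatically.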

In [10]:
model_openai_whisper_small = "openai/whisper-small"
source_3 = "Mera Hua Ek Deewane Ki Deewaniyat.mp3"
asr_transcribe = pipeline(task, model=model_openai_whisper_small)

Device set to use cuda:0


In [15]:
from transformers import pipeline

# Create an ASR pipeline configured for translation to English
translator = pipeline(task="automatic-speech-recognition", model="openai/whisper-small", language="en")

# No prior transcript is needed for this approach:
# the audio source is passed directly to the pipeline.
source_to_translate = source_3

# Translate the audio
translated_output = translator(source_to_translate, return_timestamps=True)

# Display the translated text
print("Translated Text:")
print(translated_output['text'])

Device set to use cuda:0


Translated Text:
 Music I have become better than you I have made a promise to you I have become better than you I have made a promise to you So, what do I want from this world? You have become mine from both the worlds You have fallen in love with me You have fallen in love with me I've fallen in love with you since you were so young Oh, I have never seen you like this before I have never seen you like this before I have seen you in the evening, I have seen you in the day I have seen you in the evening, I have seen you in the day I met you a hundred times, but there was something missing Why do I feel as if I have never lived before? You are mine, so I am yours What do I want from this world? Both of us have grown up and you have become mine We have fallen in love and you have become mine We have fallen in love and you have become mine We have fallen in love and you have become mine You


In [12]:
asr_transcribe(inputs=source_3, return_timestamps=True)

Using custom `forced_decoder_ids` from the (generation) config. This is deprecated in favor of the `task` and `language` flags/config options.
Transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English. This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`. See https://github.com/huggingface/transformers/pull/28687 for more details.
Whisper did not predict an ending timestamp, which can happen if audio is cut off in the middle of a word. Also make sure WhisperTimeStampLogitsProcessor was used during generation.


{'text': ' ملکے تجسے بہتر خوانے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے کے بارے خود سے ہو گئے ہیں ہم تیری کسم چھوکے جب سے مہکے تج سے ہو گئے ہیں ہم تیری کسم تو جو میرا تو میں چاہوں اس دنیا سے ب