# Automatic Speech Recognition (ASR)
The **Automatic Speech Recognition (ASR)** pipeline in Hugging Face's Transformers library provides a streamlined way to convert spoken audio into text.

This pipeline abstracts away the complexities of:
- model loading,
- pre-processing, and
- post-processing,

allowing for quick and efficient **ASR** inference.

## The ASR pipeline typically involves the following stages:
- **Feature Extraction:**

  The **raw audio input** is pre-processed by a feature extractor, which converts it into a **numerical representation** suitable for the model. Depending on the model, this is typically a **log-mel spectrogram** (e.g. Whisper) or a normalized waveform (e.g. Wav2Vec2).

- **Model Inference:**
  
  A **pre-trained ASR model** takes these features and maps them to a sequence of predicted tokens, either frame by frame with a CTC head (e.g. Wav2Vec2) or with an encoder-decoder that generates one token at a time (e.g. Whisper).

- **Tokenization and Post-processing:**
  
  A **tokenizer** converts the predicted tokens into human-readable text. This stage may also involve further **post-processing** steps depending on the specific model and task (e.g., adding timestamps).
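
To make the last stage concrete, here is a minimal sketch of greedy CTC decoding, the kind of post-processing applied to the output of CTC models such as Wav2Vec2. The vocabulary, the frame predictions, and the function name `ctc_greedy_decode` are hypothetical and chosen only for illustration; a real tokenizer has a larger vocabulary and handles more special cases.

```python
from itertools import groupby

# Hypothetical toy vocabulary (an assumption for this sketch).
# Index 0 plays the role of the CTC blank; "|" marks word boundaries.
VOCAB = ["<pad>", "|", "A", "D", "E", "M", "R"]
BLANK_ID = 0


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID, word_delimiter="|"):
    """Turn per-frame token ids into text using the standard CTC rules."""
    # 1. Collapse runs of identical consecutive ids into one id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks, which only separate repeated characters.
    chars = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Replace the word delimiter with a space.
    return "".join(chars).replace(word_delimiter, " ").strip()


# One predicted id per audio frame (made-up values).
frames = [2, 2, 0, 1, 3, 3, 6, 0, 4, 4, 2, 5, 5, 0]
print(ctc_greedy_decode(frames))  # A DREAM
```

The blank is what lets a CTC model emit doubled letters: `[4, 0, 4]` decodes to `EE`, while `[4, 4]` collapses to a single `E`.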



## Using the ASR Pipeline:
To use the **ASR pipeline**, import the `pipeline` function from `transformers` and specify the task as `"automatic-speech-recognition"`. If no model is specified, a default model is used; otherwise, any compatible model from the **Hugging Face Hub** can be passed by name.

In [1]:
from transformers import pipeline

In [3]:
task = "automatic-speech-recognition"
model_fb_wav2vec2_base_960h = "facebook/wav2vec2-base-960h"
model_openai_whisper_large_v3 = "openai/whisper-large-v3"
model_openai_whisper_small = "openai/whisper-small"

In [4]:
asr_transcribe = pipeline(task, model=model_fb_wav2vec2_base_960h)

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu
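
The log line above shows that the pipeline ran on the CPU. As a sketch, the device can also be chosen explicitly through the `device` argument of `pipeline`. The helper names `pick_device` and `build_asr` are hypothetical, and the snippet assumes a PyTorch backend.

```python
def pick_device() -> str:
    """Return "cuda:0" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so fall back to the CPU.
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


def build_asr(model_id: str = "facebook/wav2vec2-base-960h"):
    """Create an ASR pipeline on the device chosen by pick_device()."""
    # Imported lazily so that pick_device() can be used on its own.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        device=pick_device(),
    )


print(pick_device())
```

Calling `build_asr()` downloads the model on first use, exactly as the cell above does.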


In [8]:
source_1 = "https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac"
source_2 = "Aasan Nahin Yahan.mp3"

In [9]:
asr_transcribe(inputs=source_1)

{'text': 'I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP LIVE UP THE TRUE MEANING OF ITS TREES'}
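
The transcript is close but not exact. One common way to quantify this is the word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below assumes the clip contains the well-known sentence from the speech; that reference text is an assumption and is not part of the notebook. The function names are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """WER = edit distance / number of reference words (case-insensitive)."""
    ref_words = reference.upper().split()
    hyp_words = hypothesis.upper().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Assumed reference text (not taken from the notebook).
reference = (
    "I have a dream that one day this nation will rise up "
    "and live out the true meaning of its creed"
)
# Transcript produced by the pipeline above.
hypothesis = (
    "I HAVE A DREAM BUT ONE DAY THIS NATION WILL RISE UP "
    "LIVE UP THE TRUE MEANING OF ITS TREES"
)
print(f"{word_error_rate(reference, hypothesis):.3f}")  # 0.190 (4 edits / 21 words)
```

Under that assumption the four errors are three substitutions (`THAT`, `OUT`, `CREED`) and one dropped word (`AND`).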

In [10]:
asr_transcribe(inputs=source_2)

{'text': 'AA H ASANAGIACHI GODANA BUGABATATASATANAA ACHIAMITETAGESAATE SAMBANAITAEATANAA    HO H MABOSAGET BATOSAGE TEOSA TTATIA GAACATA GATITATI TTAVTT  H HH      DISMA GATITE AGAATITET DITTLE GATUUNTEA ABAGAHIGACHIGIA HAITATA TATE  H HHOO    A TAATAA'}
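
This output is not a usable transcript. `facebook/wav2vec2-base-960h` was trained on English speech only, so it is a poor fit if the audio is in another language, as the file name suggests. A multilingual model such as the `openai/whisper-small` checkpoint defined earlier may do better. The sketch below shows one way to call it; the helper names and the choice of options are assumptions, and the result is not shown because it was not run here.

```python
def whisper_call_kwargs(language=None, timestamps=False):
    """Build the keyword arguments passed when calling a Whisper pipeline."""
    kwargs = {"return_timestamps": timestamps}
    if language is not None:
        # Forwarded to generate(); leave language unset to let Whisper detect it.
        kwargs["generate_kwargs"] = {"language": language, "task": "transcribe"}
    return kwargs


def transcribe_with_whisper(audio_path, model_id="openai/whisper-small", **options):
    """Transcribe one audio file with a Whisper checkpoint (downloads the model)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long recordings into 30-second chunks
    )
    return asr(audio_path, **whisper_call_kwargs(**options))


# Example call (not executed here):
# transcribe_with_whisper("Aasan Nahin Yahan.mp3", timestamps=True)
print(whisper_call_kwargs(timestamps=True))
```

Passing `timestamps=True` asks the pipeline to return timestamped chunks alongside the text, which is the kind of post-processing mentioned in the stage list.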