# Speech to text / Automatic Speech Recognition (ASR)  

# by ---  "AD ACADEMY" - AI for Aam Janta

Mentor - Dr Ayan Debnath, alumnus of IIT Delhi and Harvard University

LinkedIn: [dr_ayan_debnath](https://www.linkedin.com/in/ayan-debnath/)

YouTube: [AD ACADEMY AI](https://www.youtube.com/@ad_academy)

Topic: Speech to text - using OpenAI Whisper

Class on 31st August 2024

# Steps:
Using the Hugging Face Transformers library

1.   Install required packages and import libraries
2.   Model initialization - load the OpenAI Whisper speech recognition model
3.   Audio data processing
4.   Generate text from the model



In [1]:
# Import necessary libraries
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

WhisperProcessor: This wraps the Whisper feature extractor and tokenizer. It converts raw audio into the log-Mel spectrogram features the model expects, and it decodes the generated token IDs back into text.
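To make "features" concrete, the arithmetic below sketches the shape of the `input_features` tensor the processor is expected to produce. It assumes the commonly documented defaults for `openai/whisper-large` (16 kHz sampling, a 30-second window, a hop length of 160 samples, 80 Mel bins); the helper name `expected_feature_shape` is hypothetical, and the real values can be read from `processor.feature_extractor`.

```python
def expected_feature_shape(batch_size=1, sampling_rate=16000, chunk_seconds=30,
                           hop_length=160, n_mels=80):
    """Return the (batch, mel bins, frames) shape under the assumed defaults."""
    # Every input is padded or truncated to one fixed-length window.
    n_samples = sampling_rate * chunk_seconds
    # One spectrogram frame is produced per hop of `hop_length` samples.
    n_frames = n_samples // hop_length
    return (batch_size, n_mels, n_frames)


print(expected_feature_shape())  # (1, 80, 3000)
```

Each frame therefore covers 10 ms of audio, and a single clip becomes an 80 x 3000 spectrogram regardless of its original length.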

WhisperForConditionalGeneration: This is the Whisper model itself, an encoder-decoder Transformer that generates text token IDs from the input features.

from_pretrained(model_name): This method downloads (on first use) and loads the pre-trained weights and configuration from the Hugging Face model hub for the specified model_name (openai/whisper-large in this case). It is called once for the processor and once for the model.


In [2]:
# Load the Whisper model and processor
model_name = "openai/whisper-large"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)



In [3]:
# Function to load and process audio file
def load_audio(audio_path):
    # torchaudio.load returns a (channels, samples) tensor and the file's sampling rate
    speech_array, sampling_rate = torchaudio.load(audio_path)
    # Whisper expects 16 kHz audio, so resample anything else
    if sampling_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16000)
        speech_array = resampler(speech_array)
    # Average across channels so the audio becomes a 1D (mono) tensor
    if speech_array.ndim > 1:
        speech_array = torch.mean(speech_array, dim=0)
    return speech_array, 16000



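Resampling: Whisper expects audio sampled at 16 kHz, so audio at any other sampling rate is resampled to 16,000 Hz.

Downmixing to Mono: torchaudio.load returns a tensor of shape (channels, samples). The speech_array tensor is averaged across the channel dimension, so audio with more than one channel (e.g., stereo with left and right channels) becomes a mono signal, and the result is always a 1D tensor.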
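The two preprocessing steps in load_audio can be illustrated on tiny inputs. This is a plain-Python sketch rather than the torchaudio implementation: `downmix_to_mono` and `resampled_length` are hypothetical helpers, and the length formula is only an approximation of what a resampler returns.

```python
import math


def downmix_to_mono(channels):
    """Average equal-length channels sample by sample to get one mono channel."""
    n_channels = len(channels)
    return [sum(samples) / n_channels for samples in zip(*channels)]


def resampled_length(n_samples, orig_freq, new_freq=16000):
    """Approximate sample count after resampling from orig_freq to new_freq."""
    return math.ceil(n_samples * new_freq / orig_freq)


# Two channels with three samples each
stereo = [[0.5, 1.0, -1.0], [0.0, 0.0, 1.0]]
print(downmix_to_mono(stereo))  # [0.25, 0.5, 0.0]

# One second of 44.1 kHz audio becomes one second of 16 kHz audio
print(resampled_length(44100, orig_freq=44100))  # 16000
```

Note that averaging can cancel signals that are out of phase between channels, as in the last sample above.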

The function returns the processed speech_array and the sampling rate (16,000 Hz).
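One caveat about the returned waveform: by default the Whisper feature extractor pads or truncates each input to a single 30-second window, so a longer recording passed in one call is cut off after the first 30 seconds. A minimal sketch of one workaround, splitting the waveform into 30-second pieces before transcription, is shown below. The helper `split_into_chunks` is hypothetical, and fixed-size cuts are an assumption made for simplicity; they can split a word across two chunks.

```python
def split_into_chunks(waveform, sampling_rate=16000, chunk_seconds=30):
    """Split a 1D waveform (list or tensor) into consecutive fixed-length chunks."""
    chunk_size = sampling_rate * chunk_seconds
    return [waveform[start:start + chunk_size]
            for start in range(0, len(waveform), chunk_size)]


# 70 seconds of silence at 16 kHz splits into 30 s, 30 s and 10 s
chunks = split_into_chunks([0.0] * (70 * 16000))
print([len(chunk) / 16000 for chunk in chunks])  # [30.0, 30.0, 10.0]
```

Each chunk can then be transcribed on its own and the resulting texts joined with spaces.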

In [4]:
# Convert speech to text
def speech_to_text(audio_path):
    speech_array, sampling_rate = load_audio(audio_path)
    # Convert the 1D waveform into the log-Mel input features the model expects
    inputs = processor(speech_array.numpy(), sampling_rate=sampling_rate, return_tensors="pt")
    # Generate token IDs from the model
    generated_ids = model.generate(inputs.input_features)
    # Decode the token IDs into text, dropping special tokens
    transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return transcription[0]

# Path to the audio file
audio_path = "/content/drive/MyDrive/Video/output_16000Hz.wav"

# Perform speech-to-text conversion
transcription = speech_to_text(audio_path)

# Print the transcription
print("Transcription:", transcription)


Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Transcription:  I also offered to help your brother to escape, but he would not go.
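The first warning above says that a multilingual Whisper checkpoint now detects the spoken language and transcribes in that language by default. A minimal sketch of setting the language and task explicitly follows. `transcribe_with_options` is a hypothetical helper that receives the processor and model as arguments instead of loading them, and it assumes a Transformers version in which the Whisper `generate` method accepts the `language` and `task` keyword arguments (the warning itself refers to `language`).

```python
def transcribe_with_options(processor, model, waveform, sampling_rate=16000,
                            language="en", task="transcribe"):
    """Transcribe one waveform with an explicit language and task."""
    # Convert the waveform into the model's input features
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    # Fix the language and task instead of relying on automatic detection
    generated_ids = model.generate(inputs.input_features, language=language, task=task)
    # Decode the token IDs into text, dropping special tokens
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

With the objects defined in the cells above, a call would look like transcribe_with_options(processor, model, speech_array.numpy()). Passing task="translate" instead asks Whisper to produce English text from non-English speech.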


Speech-to-text technology is used across many industries and in everyday activities. Here are some notable applications:

### 1. **Virtual Assistants**
   - **Application:** Virtual assistants like Siri, Google Assistant, and Alexa use speech-to-text to understand and respond to user commands. This enables hands-free operation for tasks like setting reminders, sending messages, or searching the web.

### 2. **Transcription Services**
   - **Application:** Speech-to-text technology is used to automatically transcribe spoken content into text. This is valuable in contexts like:
     - **Meeting Minutes:** Converting spoken discussions in meetings into written minutes.
     - **Lecture and Podcast Transcription:** Providing text versions of spoken content for accessibility and archiving.
     - **Legal and Medical Transcription:** Documenting conversations, interviews, or dictations in a legal or medical context.

### 3. **Accessibility for the Deaf and Hard of Hearing**
   - **Application:** Speech-to-text technology provides real-time subtitles or captions, making spoken content accessible to people who are deaf or hard of hearing. This is commonly used in live broadcasts, online videos, and classrooms.

### 4. **Voice-Activated Systems**
   - **Application:** Devices like smart home systems, car infotainment systems, and IoT devices utilize speech-to-text for voice commands. Users can control devices, navigate, and perform tasks without needing to touch or look at a screen.

### 5. **Customer Service and Call Centers**
   - **Application:** Speech-to-text is used to transcribe customer service calls for quality assurance, training, and analysis. It helps in monitoring customer interactions, analyzing sentiment, and ensuring compliance.

### 6. **Language Learning Tools**
   - **Application:** Language learning apps use speech-to-text to provide feedback on pronunciation and spoken exercises. Users can practice speaking and receive real-time text feedback on their performance.

### 7. **Search Engines**
   - **Application:** Voice search allows users to search the web by speaking rather than typing. Speech-to-text technology converts the spoken query into text that the search engine can process.

### 8. **Real-Time Translation**
   - **Application:** Speech-to-text is often the first step in real-time translation systems, where spoken language is converted to text, translated into another language, and then converted back into speech.

### 9. **Content Creation and Note-Taking**
   - **Application:** Writers, journalists, and professionals use speech-to-text to dictate content, making the process of writing faster and more efficient. Similarly, note-taking apps allow users to quickly jot down ideas by speaking.

### 10. **Assistive Technology for Mobility Impairments**
   - **Application:** People with mobility impairments use speech-to-text to interact with computers and smartphones without needing to type. This enables them to compose emails, browse the internet, and control applications through voice commands.

### 11. **Telecommunications**
   - **Application:** In telecommunications, speech-to-text is used to convert voicemail messages into text, allowing users to read messages instead of listening to them.

### 12. **Surveillance and Security**
   - **Application:** Speech-to-text can be used to transcribe audio feeds from surveillance systems, enabling easier analysis and keyword searching in security contexts.

### 13. **Market Research and Sentiment Analysis**
   - **Application:** Speech-to-text technology can transcribe customer feedback, interviews, and focus groups, allowing companies to analyze sentiments, trends, and opinions at scale.

These applications demonstrate the versatility of speech-to-text technology in improving accessibility, enhancing user experience, and enabling new functionalities across various domains.