<a href="https://colab.research.google.com/github/dhruvm-04/EchoLang/blob/main/EchoLang.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Statement
---
**Real-Time Speech-to-Text (STT) Translation and Transcription Tool**

A system that transcribes and translates code-switched speech, mixing Hindi-English (Hinglish) or Tamil-English, into **English text** in real time.

### Core Objectives:
- Transcribe and translate speech that includes a mixture of regional Indian languages (Hindi, Tamil) and English.
- Ensure high accuracy in handling code-switched and accented speech.

### Primary Use Cases:
1. **Accessing a Yellow Page Directory**  
   Helping blue-collar workers in Tier 2/3 Indian cities to search and access local services using voice-based interaction.

2. **Automated Medical Prescription Creation**  
   Passive observation and transcription of doctor-patient conversations (often in regional dialects) to automatically generate medical prescriptions and records, particularly suited for low-resource healthcare settings in rural and semi-urban areas.


## Features
---
- **Input**: Multilingual Speech (Hindi / Tamil / English)  
- **Output**: English Text  

### Core:
- Support for mixed local languages (code-switched)
- Real-time Translation + Transcription
- Contextual Understanding of code-switched speech
- Robust Speech-to-Text (STT) conversion
- Integration with both use cases (Yellow Pages and Medical Prescriptions)
- Relatively low resource consumption for deployment in Tier 2/3 cities

### More than just a translator:
- Context-aware responses
- Conversion of input to actionable information
- Designed for users with limited literacy
- Focuses on the intent behind user input, not just the literal words
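
To make "code-switched" concrete, the sketch below tags each whitespace-separated word with its dominant Unicode script and reports where the script changes. It is only an illustration under simple assumptions: the helper names `word_script` and `switch_points` are hypothetical and not part of this notebook, and romanized Hindi or Tamil (for example "pannala") is indistinguishable from English at the script level, so it is reported as Latin.

```python
import unicodedata

def word_script(word):
    """Return the dominant script of a word, judged by Unicode character names."""
    counts = {}
    for ch in word:
        if not ch.isalpha():
            continue  # skip punctuation, digits, and combining vowel signs
        name = unicodedata.name(ch, "")
        for script in ("DEVANAGARI", "TAMIL", "LATIN"):
            if name.startswith(script):
                counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None

def switch_points(text):
    """List (word, script) pairs where the script differs from the previous word."""
    points, previous = [], None
    for word in text.split():
        script = word_script(word)
        if script is None:
            continue  # word has no letters in the three tracked scripts
        if script != previous:
            points.append((word, script))
        previous = script
    return points

sample = "Tomorrow class है क्या? நான் calendar check pannala."
for word, script in switch_points(sample):
    print(f"{script:<10} starts at {word!r}")
```

Script changes are only a coarse signal. Separating romanized regional words from English needs vocabulary or model-based cues on top of this.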


In [None]:
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q pydub ffmpeg-python ipywidgets
!apt-get -qq install ffmpeg

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
import base64

import torch
import whisper
from IPython.display import display, Javascript
from google.colab import output
from pydub import AudioSegment

# Audio recording JavaScript
RECORD_JS = '''
const sleep = time => new Promise(resolve => setTimeout(resolve, time));
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader();
  reader.onloadend = () => resolve(reader.result);
  reader.readAsDataURL(blob);
});
var record = time => new Promise(async resolve => {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
  const chunks = [];
  recorder.ondataavailable = e => chunks.push(e.data);
  recorder.onstop = async () => {
    // Release the microphone once recording ends
    stream.getTracks().forEach(track => track.stop());
    const blob = new Blob(chunks, { type: 'audio/webm' });
    resolve(await b2text(blob));
  };
  recorder.start();
  await sleep(time);
  recorder.stop();
});
'''

def record_audio(seconds=5):
    display(Javascript(RECORD_JS))
    print(f"Recording for {seconds} seconds...")
    audio_data = output.eval_js(f'record({seconds * 1000})')
    header, encoded = audio_data.split(',', 1)
    audio_bytes = base64.b64decode(encoded)
    with open('recording.webm', 'wb') as f:
        f.write(audio_bytes)
    audio = AudioSegment.from_file('recording.webm')
    audio.export('recording.wav', format='wav')
    print("Audio recording complete")
    return 'recording.wav'

def load_whisper_model(model_size="medium"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_size).to(device)
    print(f"Loaded {model_size} model on {device.upper()}")
    return model

def detect_language(audio_path, model):
    # Whisper's language detector looks at a single 30-second window
    audio = whisper.load_audio(audio_path)
    audio = whisper.pad_or_trim(audio)
    # Match the mel-bin count to the loaded checkpoint
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    _, probs = model.detect_language(mel)
    detected_lang = max(probs, key=probs.get)
    print(f"Detected language: {detected_lang}")
    return detected_lang

def transcribe_audio(audio_path, model, language=None):
    options = {"fp16": False} if model.device.type == "cpu" else {}
    if language:
        options["language"] = language
    result = model.transcribe(audio_path, **options)
    return result["text"]
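
`record_audio` receives the recording from the browser as a base64 data URL and splits it at the first comma. The sketch below isolates that step with a little validation. The helper name `decode_data_url` is hypothetical, and the payload is a stand-in rather than real WebM audio.

```python
import base64

def decode_data_url(data_url):
    """Split a 'data:<mime>;base64,<payload>' string into (mime_type, raw_bytes)."""
    header, sep, encoded = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(encoded)

# Stand-in payload; a real recording would hold WebM container bytes.
fake_payload = base64.b64encode(b"not real audio").decode("ascii")
mime, raw = decode_data_url(f"data:audio/webm;base64,{fake_payload}")
print(mime, len(raw))
```

Failing early on a malformed string gives a clearer error than letting the audio conversion fail later on a corrupt file.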

In [None]:
# Configure settings
DURATION = 10  # Recording duration in seconds
MODEL_SIZE = "small"

# Load model once (cached for subsequent runs)
model = load_whisper_model(MODEL_SIZE)

# Record audio
audio_path = record_audio(seconds=DURATION)

# Detect language
detected_lang = detect_language(audio_path, model)

# Transcribe audio
transcription = transcribe_audio(audio_path, model, language=detected_lang)

# Display results
print("\n" + "-"*50)
print("TRANSCRIPTION RESULT:")
print("-"*50)
print(transcription)

Loaded small model on CUDA



Recording for 10 seconds...
Audio recording complete
Detected language: en

--------------------------------------------------
TRANSCRIPTION RESULT:
--------------------------------------------------
 Random Music
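
The run above transcribes in the detected language, while the stated goal is English text. Whisper's `transcribe` accepts a `task` decoding option, and `task="translate"` asks the model for English output. The sketch below shows one way the options dictionary could be built. The helper `build_transcribe_options` is hypothetical and not part of this notebook, and translation quality on code-switched speech is not evaluated here.

```python
def build_transcribe_options(device_type, language=None, translate=False):
    """Assemble keyword arguments for whisper's model.transcribe()."""
    options = {}
    if device_type == "cpu":
        options["fp16"] = False  # half precision is not supported on CPU
    if language:
        options["language"] = language
    # "translate" requests English output; "transcribe" keeps the source language
    options["task"] = "translate" if translate else "transcribe"
    return options

print(build_transcribe_options("cpu", language="hi", translate=True))
```

With a loaded model, the result would be passed as `model.transcribe(audio_path, **options)`, in the same way `transcribe_audio` passes its own options.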


In [None]:
# Install core dependencies
!pip install torch transformers
!pip install langdetect fasttext-langdetect
!pip install numpy pandas matplotlib seaborn

# IndicTrans2 dependencies
!pip install sentencepiece sacremoses
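# Skip the clone when the directory already exists so the cell can be re-run
![ -d IndicTransToolkit ] || git clone https://github.com/VarunGumma/IndicTransToolkit
# %cd treats the rest of its line as a path, so each step gets its own line
%cd IndicTransToolkit
!pip install --editable . --use-pep517
%cd ..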

# Download language detection models
!wget -q https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

fatal: destination path 'IndicTransToolkit' already exists and is not an empty directory.
[Errno 2] No such file or directory: 'IndicTransToolkit && pip install --editable . --use-pep517 && cd ..'
/content


In [None]:
import torch
import numpy as np
from transformers import AutoTokenizer
from collections import defaultdict, Counter
import unicodedata

# Language code mappings for common scripts
SCRIPT_LANGUAGE_MAP = {
    'Devanagari': 'hi',  # Hindi
    'Tamil': 'ta',       # Tamil
    'Latin': 'en'        # English (default for Latin script)
}

# Common language patterns in WordPiece tokens
LANGUAGE_TOKEN_PATTERNS = {
    'hindi': {
        'prefixes': ['##ा', '##ि', '##ी', '##ु', '##ू', '##े', '##ै', '##ो', '##ौ'],
        'suffixes': ['ने', 'को', 'से', 'में', 'पर', 'का', 'की', 'के'],
        'common_tokens': ['है', 'हैं', 'था', 'थी', 'थे', 'और', 'या', 'में', 'से']
    },
    'tamil': {
        'prefixes': ['##ா', '##ி', '##ீ', '##ு', '##ூ', '##ெ', '##ே', '##ை', '##ொ', '##ோ', '##ௌ'],
        'suffixes': ['ும்', 'அது', 'இது', 'அந்த', 'இந்த', 'ான்', 'ின்', 'ில்'],
        'common_tokens': ['அது', 'இது', 'ும்', 'ான்', 'பண்ண', 'என்ன', 'எல்லா']
    },
    'english': {
        'prefixes': ['##ing', '##ed', '##er', '##est', '##ly', '##tion', '##ness'],
        'suffixes': ['the', 'and', 'or', 'but', 'with', 'from', 'to', 'at', 'in', 'on'],
        'common_tokens': ['the', 'and', 'or', 'but', 'with', 'from', 'to', 'at', 'in', 'on', 'is', 'are', 'was', 'were']
    }
}

def load_wordpiece_tokenizer(model_name="bert-base-multilingual-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(f"Loaded tokenizer: {model_name}")
    return tokenizer

def detect_unicode_script(text):
    script_counts = defaultdict(int)
    for char in text:
        if char.isalpha():
            name = unicodedata.name(char, '')
            script = name.split()[0] if name else 'Unknown'
            if script.startswith('DEVANAGARI'):
                script_counts['Devanagari'] += 1
            elif script.startswith('TAMIL'):
                script_counts['Tamil'] += 1
            elif script.startswith('LATIN'):
                script_counts['Latin'] += 1
    if not script_counts:
        return 'Latin', 0.0
    primary_script = max(script_counts.items(), key=lambda x: x[1])
    total_chars = sum(script_counts.values())
    confidence = primary_script[1] / total_chars if total_chars > 0 else 0.0
    return primary_script[0], confidence

def analyze_wordpiece_token_language(token, vocab_patterns=LANGUAGE_TOKEN_PATTERNS):
    language_scores = defaultdict(float)
    for lang, patterns in vocab_patterns.items():
        for prefix in patterns.get('prefixes', []):
            if token.startswith(prefix):
                language_scores[lang] += 0.8
        for suffix in patterns.get('suffixes', []):
            if token.endswith(suffix):
                language_scores[lang] += 0.7
        if token in patterns.get('common_tokens', []):
            language_scores[lang] += 1.0
    if token.replace('##', ''):
        script, script_confidence = detect_unicode_script(token)
        if script in SCRIPT_LANGUAGE_MAP:
            lang_code = SCRIPT_LANGUAGE_MAP[script]
            lang_map = {'hi': 'hindi', 'ta': 'tamil', 'en': 'english'}
            if lang_code in lang_map:
                language_scores[lang_map[lang_code]] += script_confidence * 0.9
    total_score = sum(language_scores.values())
    if total_score > 0:
        language_scores = {lang: score / total_score for lang, score in language_scores.items()}
    return dict(language_scores)

def wordpiece_language_detection(text, tokenizer):
    encoded = tokenizer(text, add_special_tokens=False, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])
    sentence_language_scores = defaultdict(float)
    for token in tokens:
        if token in ['[CLS]', '[SEP]', '[PAD]', '[UNK]']:
            continue
        token_lang_scores = analyze_wordpiece_token_language(token)
        token_weight = len(token.replace('##', '')) / 10.0 + 0.1
        for lang, score in token_lang_scores.items():
            sentence_language_scores[lang] += score * token_weight
    total_sentence_score = sum(sentence_language_scores.values())
    if total_sentence_score > 0:
        sentence_language_scores = {
            lang: score / total_sentence_score
            for lang, score in sentence_language_scores.items()
        }
    result = {
        'original_text': text,
        'wordpiece_tokens': tokens,
        'token_count': len(tokens),
        'sentence_languages': dict(sentence_language_scores),
        'dominant_language': max(sentence_language_scores.items(), key=lambda x: x[1])[0] if sentence_language_scores else 'unknown'
    }
    return result

def format_wordpiece_analysis(analysis_result):
    result = analysis_result
    output = []
    output.append(f"Text: {result['original_text']}")
    output.append(f"WordPiece Tokens ({result['token_count']}): {' | '.join(result['wordpiece_tokens'])}")
    output.append("")
    output.append("Sentence-Level Language Distribution:")
    for lang, score in sorted(result['sentence_languages'].items(), key=lambda x: x[1], reverse=True):
        percentage = score * 100
        output.append(f"  {lang.title()}: {percentage:.1f}%")
    output.append(f"Dominant Language: {result['dominant_language'].title()}")
    return "\n".join(output)

def batch_wordpiece_analysis(texts, tokenizer):
    results = []
    for i, text in enumerate(texts, 1):
        print(f"Analyzing text {i}/{len(texts)}...")
        result = wordpiece_language_detection(text, tokenizer)
        results.append(result)
    return results

def generate_language_statistics(results):
    all_languages = set()
    token_counts = []
    dominant_languages = []
    for result in results:
        all_languages.update(result['sentence_languages'].keys())
        token_counts.append(result['token_count'])
        dominant_languages.append(result['dominant_language'])
    stats = {
        'total_texts': len(results),
        'unique_languages_detected': len(all_languages),
        'languages_detected': sorted(list(all_languages)),
        'average_tokens_per_text': np.mean(token_counts),
        'dominant_language_distribution': dict(Counter(dominant_languages))
    }
    return stats
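
One caveat with the default `bert-base-multilingual-uncased` tokenizer: uncased BERT tokenizers lowercase the text and, by default, strip accents by decomposing it and dropping nonspacing marks (Unicode category `Mn`). In Devanagari and Tamil that category includes the virama and several vowel signs, so words reach WordPiece already altered. The sketch below reproduces that normalization with the standard library only. It is an approximation of the behavior, not the tokenizer's actual code, and `strip_nonspacing_marks` is a hypothetical name.

```python
import unicodedata

def strip_nonspacing_marks(text):
    """Lowercase, NFD-normalize, and drop characters in category Mn."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

for word in ["मेरा", "है", "क्या", "நான்"]:
    print(word, "->", strip_nonspacing_marks(word))
```

Patterns in `LANGUAGE_TOKEN_PATTERNS` that rely on stripped marks, such as `##े` or `##ै`, may therefore never match under this tokenizer. A cased checkpoint such as `bert-base-multilingual-cased` does not lowercase and is worth comparing if the vowel signs matter.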


In [None]:
# Configuration
SAMPLE_TEXTS = [
    "Hi doctor, मेरा पेट दर्द हो रहा है, can you help pannunga?",
    "Tomorrow class है क्या? நான் calendar check pannala.",
    "मैंने homework finish कर लिया, now I'm going to play cricket with nanban.",
    "இந்த book super है, I read it last night before सोने गया।",
    "Please come early, क्योंकि हमें बस पकड़नी है, illa late aagidum.",
    "I don't understand ये वाला concept, teacher kitte केलुंगो.",
    "Naan lunch skip pannitten, अभी मुझे बहुत भूख लगी है, let's eat?",
    "He is not feeling well, इसलिए आज स्कूल नहीं आया, avar rest eduthukaraaru.",
    "My sister exam के लिए पढ़ रही है, அவள் ரொம்ப nervous-ஆ இருக்கா.",
    "चलो चलते हैं, bus stop के पास milalaam, athu nalla idea."
]

# Load WordPiece tokenizer
print("Loading BERT Multilingual WordPiece Tokenizer...")
print("=" * 60)
tokenizer = load_wordpiece_tokenizer()
print()

# Perform WordPiece-based language analysis
print("WordPiece-Based Language Detection Analysis")
print("=" * 60)
all_results = batch_wordpiece_analysis(SAMPLE_TEXTS, tokenizer)

# Display sentence-level analysis for each text
for i, result in enumerate(all_results, 1):
    print(f"Analysis {i}:")
    print("-" * 40)
    formatted_analysis = format_wordpiece_analysis(result)
    print(formatted_analysis)
    print("=" * 60)

# Generate and display summary statistics
print("\nLanguage Detection Statistics:")
print("-" * 40)
stats = generate_language_statistics(all_results)

print(f"Total texts analyzed: {stats['total_texts']}")
print(f"Unique languages detected: {stats['unique_languages_detected']}")
print(f"Languages found: {', '.join([lang.title() for lang in stats['languages_detected']])}")
print(f"Average WordPiece tokens per text: {stats['average_tokens_per_text']:.1f}")
print()

print("Dominant Language Distribution:")
for lang, count in sorted(stats['dominant_language_distribution'].items(), key=lambda x: x[1], reverse=True):
    percentage = (count / stats['total_texts']) * 100
    print(f"  {lang.title()}: {count} texts ({percentage:.1f}%)")

Loading BERT Multilingual WordPiece Tokenizer...
Loaded tokenizer: bert-base-multilingual-uncased

WordPiece-Based Language Detection Analysis
Analyzing text 1/10...
Analyzing text 2/10...
Analyzing text 3/10...
Analyzing text 4/10...
Analyzing text 5/10...
Analyzing text 6/10...
Analyzing text 7/10...
Analyzing text 8/10...
Analyzing text 9/10...
Analyzing text 10/10...
Analysis 1:
----------------------------------------
Text: Hi doctor, मेरा पेट दर्द हो रहा है, can you help pannunga?
WordPiece Tokens (20): hi | doctor | , | म | ##रा | प | ##ट | दर | ##द | हो | रहा | ह | , | can | you | help | pan | ##nung | ##a | ?

Sentence-Level Language Distribution:
  English: 59.6%
  Hindi: 40.4%
Dominant Language: English
Analysis 2:
----------------------------------------
Text: Tomorrow class है क्या? நான் calendar check pannala.
WordPiece Tokens (12): tomorrow | class | ह | कया | ? | ந | ##ான | calendar | check | panna | ##la | .

Sentence-Level Language Distribution:
  English: 78.0%
  Hin