## Task 1: Data Preparation & Normalization

Description:

Cleans and normalizes raw text data.

Handles duplicates, normalizes case, and removes punctuation.

Generates synthetic data if real data is insufficient.

Output: Cleaned and normalized dataset in CSV/JSON format.

In [16]:
# Task 1: Data Preparation & Normalization
import pandas as pd
import re

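# Load raw data (expects a 'Symptoms' column)
data = pd.read_excel("O-Health_Task_Inputs.xlsx")

# Drop exact duplicate rows and rows with no symptom text
data = data.drop_duplicates().dropna(subset=['Symptoms'])

# Normalize text: lowercase, then strip punctuation
data['Symptoms'] = data['Symptoms'].astype(str).str.lower()
data['Symptoms'] = data['Symptoms'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

# Handle negations: keep only the clause after "but" and prefix it with "no"
def handle_negations(text):
    # Split on the whole word "but" so words such as "attribute" are left alone
    parts = re.split(r'\bbut\b', text, maxsplit=1)
    if len(parts) == 2:
        return "no " + parts[1].strip()
    return text

data['Symptoms'] = data['Symptoms'].apply(handle_negations)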

# Save cleaned data
data.to_csv("cleaned_data.csv", index=False)

print("Task 1: Data Preparation & Normalization Complete")
print("Cleaned data saved to 'cleaned_data.csv'")

Task 1: Data Preparation & Normalization Complete
Cleaned data saved to 'cleaned_data.csv'
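
Duplicate handling is easiest to see on normalized text, since "Chest pain!" and "chest  pain" only collide after cleaning. The following is a minimal standard-library sketch of that idea; the names `normalize_text` and `dedupe_normalized` and the sample strings are illustrative assumptions, not part of the original pipeline.

```python
import re

def normalize_text(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def dedupe_normalized(texts):
    """Normalize each text, then drop repeats while keeping first-seen order."""
    return list(dict.fromkeys(normalize_text(t) for t in texts))

# Hypothetical sample rows
samples = ["Chest pain!", "chest  pain", "Mild headache."]
print(dedupe_normalized(samples))
```

Normalizing before de-duplicating is a design choice: doing it the other way round would keep rows that differ only in case or punctuation.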


## Task 2: Symptom Extraction Model

Description:

Extracts symptoms from patient-doctor conversations.

Handles synonymous phrases and lexical variations.

Excludes negated symptoms.

Output: Extracted symptoms with accuracy and memory footprint measurements.

In [17]:
# Task 2: Symptom Extraction Model

import spacy
from spacy.matcher import PhraseMatcher

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Symptom dictionary
symptom_dict = {
    "chest pain": ["chest pain", "pain in chest", "aching chest"],
    "headache": ["headache", "mild headache"],
    "stomach pain": ["stomach pain", "stomach ache"],
    "knee pain": ["knee pain", "pain in knee"],
    "back pain": ["back pain", "lower back pain"],
}

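# Create PhraseMatcher object
matcher = PhraseMatcher(nlp.vocab)
for symptom, phrases in symptom_dict.items():
    # make_doc only tokenizes, which is all the matcher needs for exact-text patterns
    patterns = [nlp.make_doc(text) for text in phrases]
    matcher.add(symptom, patterns)  # spaCy v3 signature: add(key, docs)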

# Function to extract symptoms
def extract_symptoms(text):
    doc = nlp(text)
    matches = matcher(doc)
    symptoms = set()
    for match_id, start, end in matches:
        symptoms.add(doc[start:end].text)
    return list(symptoms)

# Test symptom extraction
test_text = "I have a mild headache and pain in my chest."
extracted_symptoms = extract_symptoms(test_text)
print("Extracted Symptoms:", extracted_symptoms)

print("Task 2: Symptom Extraction Model Complete")

Extracted Symptoms: ['headache', 'mild headache']
Task 2: Symptom Extraction Model Complete
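
The recorded output lists the matched surface text ('headache', 'mild headache') and misses the chest symptom, because "pain in my chest" does not match the pattern "pain in chest" token for token. Negated mentions are not filtered either. The sketch below is one possible way to address both, using only regular expressions; the patterns with optional determiners, the cue list, and the three-word negation window are all assumptions chosen for illustration, not the author's method.

```python
import re

# Hypothetical patterns: optional determiners let "pain in my chest" match.
SYMPTOM_PATTERNS = {
    "chest pain": r"chest pain|pain in (?:my |the )?chest|aching chest",
    "headache": r"headache",
    "stomach pain": r"stomach (?:pain|ache)",
}
# Assumed negation cues; a real list would be longer.
NEGATION_CUES = {"no", "not", "without", "denies", "never"}

def extract_symptoms_with_negation(text, window=3):
    """Return canonical symptom names, skipping mentions preceded by a negation cue."""
    text = text.lower()
    found = []
    for canonical, pattern in SYMPTOM_PATTERNS.items():
        for match in re.finditer(pattern, text):
            # The last `window` words before the match form the negation window.
            preceding = re.findall(r"[a-z']+", text[:match.start()])[-window:]
            if NEGATION_CUES.isdisjoint(preceding) and canonical not in found:
                found.append(canonical)
    return found

print(extract_symptoms_with_negation("I have a mild headache and pain in my chest."))
print(extract_symptoms_with_negation("No headache, but I do have stomach pain."))
```

Returning the dictionary key rather than the matched text means "mild headache" and "headache" collapse into one canonical entry.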


## Task 3: Severity & Sentiment Analysis

Description:

Extends the symptom extraction model to detect severity and assign risk categories.

Combines symptom duration and severity to determine risk.

Output: Risk categorization logic with example metrics.

In [18]:
# Task 3: Severity & Sentiment Analysis

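# Severity detection
import re

severity_terms = {"mild": 1, "moderate": 2, "severe": 3}

def detect_severity(text):
    # Match whole words only and return the highest severity mentioned (0 if none)
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max((score for term, score in severity_terms.items() if term in words), default=0)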

# Risk categorization
def assign_risk(severity, duration):
    if severity == 3 and duration > 7:
        return "High"
    elif severity == 2 and duration > 3:
        return "Moderate"
    else:
        return "Low"

# Test severity and risk categorization
test_text = "I have had severe chest pain for 10 days."
severity = detect_severity(test_text)
risk = assign_risk(severity, duration=10)
print("Severity Score:", severity)
print("Risk Category:", risk)

print("Task 3: Severity & Sentiment Analysis Complete")

Severity Score: 3
Risk Category: High
Task 3: Severity & Sentiment Analysis Complete
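
The test above passes `duration=10` by hand. A minimal sketch of reading the duration from the sentence itself is shown below, assuming durations are written as digits followed by a unit ("10 days", "2 weeks"). The names `extract_duration_days` and `UNIT_DAYS` are illustrative, and treating a month as 30 days is an approximation.

```python
import re

# Assumed conversion factors; a month is approximated as 30 days.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30}

def extract_duration_days(text):
    """Return the first stated duration in days, or 0 if none is found."""
    match = re.search(r"(\d+)\s*(day|week|month)s?\b", text.lower())
    if match is None:
        return 0
    return int(match.group(1)) * UNIT_DAYS[match.group(2)]

print(extract_duration_days("I have had severe chest pain for 10 days."))
print(extract_duration_days("Mild headache for 2 weeks"))
```

The returned value could be passed as the `duration` argument of `assign_risk`. Spelled-out numbers ("ten days") are not handled by this sketch.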


## Task 4: Reasoning/Root Cause Extraction

Description:

Extends the pipeline to detect possible causes or reasons for symptoms.

Output: Extracted cause phrases.

In [19]:
# Task 4: Reasoning/Root Cause Extraction

import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Function to extract cause
def extract_cause(text):
    doc = nlp(text)
    cause = ""

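    # Look for causal indicators like "after", "because", "due to"
    for token in doc:
        word = token.text.lower()
        # "due to" spans two tokens, so check the next token as well
        is_due_to = (word == "due" and token.i + 1 < len(doc)
                     and doc[token.i + 1].text.lower() == "to")
        if word in ("after", "because", "since") or is_due_to:
            # Extract the subtree of the token to get the full cause phrase
            cause = " ".join([t.text for t in token.subtree])
            break  # Stop after finding the first cause indicator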

    return cause

# Test cause extraction
test_text = "My lower back started aching after lifting a heavy box."
cause = extract_cause(test_text)
print("Extracted Cause:", cause)

print("Task 4: Reasoning/Root Cause Extraction Complete")

Extracted Cause: after lifting a heavy box
Task 4: Reasoning/Root Cause Extraction Complete
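
Where loading a parser is too heavy, a regex fallback can approximate the same behaviour. The sketch below is an assumption-laden alternative to the dependency-subtree approach, not a replacement for it: it takes everything from the first causal cue up to the next punctuation mark. `CAUSE_PATTERN` and `extract_cause_regex` are hypothetical names.

```python
import re

# Multi-word cues come first so "because of" wins over "because".
CAUSE_PATTERN = re.compile(
    r"\b(?:because of|because|due to|after|since)\b.*?(?=[.,;!?]|$)",
    re.IGNORECASE,
)

def extract_cause_regex(text):
    """Return the first cue phrase up to the next punctuation mark, or ''."""
    match = CAUSE_PATTERN.search(text)
    return match.group(0).strip() if match else ""

print(extract_cause_regex("My lower back started aching after lifting a heavy box."))
print(extract_cause_regex("I feel dizzy due to skipping meals, I think."))
```

Unlike the parser-based version, this cannot tell temporal "since" ("since Monday") from causal "since", so it trades accuracy for footprint.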


In [22]:
! pip install -U openai-whisper

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small")
# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("openai/whisper-small")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small")



Device set to use cuda:0


In [36]:
! pip install weave
! wandb login

wandb: Currently logged in as: leading-gopher-xkdm (leading-gopher-xkdm-iit-kharagpur) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


## Task 5: Speech-to-Text (STT) Model for Dogri

Description:

Develops an STT model for Dogri that runs on low-power devices.

Fine-tunes Whisper (the code below uses Hindi Common Voice data as a stand-in for Dogri) and converts it to TensorFlow Lite for edge deployment.

Output: Fine-tuned STT model and TFLite model.

In [None]:
# Step 1: Install Required Libraries
! pip install torch torchaudio transformers datasets soundfile librosa
! pip install jiwer  # For WER calculation

# Step 2: Load Pre-trained Whisper Model
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Load Whisper model and processor
model_name = "openai/whisper-small"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name)

# Step 3: Load Dogri Dataset (Example: Using Hugging Face Datasets)
from datasets import load_dataset, Audio

# Load a sample dataset (replace with Dogri dataset)
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train[:10%]")  # Use Hindi as a proxy
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

# Step 4: Preprocess Data
def preprocess_function(batch):
    # Audio is already resampled to 16kHz by the cast_column call above
    audio = batch["audio"]["array"]
    sampling_rate = batch["audio"]["sampling_rate"]

    # Process audio to generate input_features
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    batch["input_features"] = inputs.input_features[0]

    # Tokenize the transcript for labels
    batch["labels"] = processor(text=batch["sentence"], return_tensors="pt").input_ids[0]
    return batch

# Apply preprocessing
dataset = dataset.map(preprocess_function, remove_columns=["audio"])

# Step 5: Split Dataset into Training and Evaluation Sets
train_test_split = dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]

# Step 6: Define Custom Data Collator for Whisper
from dataclasses import dataclass
from typing import Any, Dict, List, Union
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Split inputs and labels
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        labels = [{"input_ids": feature["labels"]} for feature in features]

        # Pad input features
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad labels
        labels_batch = self.processor.tokenizer.pad(labels, return_tensors="pt")

        # Replace padding with -100 to ignore loss calculation
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Add labels to the batch
        batch["labels"] = labels

        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

# Step 7: Fine-tune Whisper on Dogri Data
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-dogri",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    num_train_epochs=3,
    fp16=True,  # Use mixed precision for faster training
    save_steps=500,
    eval_steps=500,
    logging_dir="./logs",
    evaluation_strategy="steps",  # Evaluate every `eval_steps`
    predict_with_generate=True,
)

# Define trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # Pass the evaluation dataset
    tokenizer=processor.tokenizer,
    data_collator=data_collator,  # Use the custom data collator
)

# Fine-tune the model
trainer.train()

# Step 8: Evaluate the Model
from jiwer import wer

# Evaluate on a test set
test_dataset = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test[:5%]")
test_dataset = test_dataset.cast_column("audio", Audio(sampling_rate=16000))

def evaluate(batch):
    audio = batch["audio"]["array"]
    sampling_rate = batch["audio"]["sampling_rate"]

    # Process audio to generate input_features and move them to the model's device
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    input_features = inputs.input_features.to(model.device)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    batch["predicted"] = transcription[0]
    return batch

# Apply evaluation
test_dataset = test_dataset.map(evaluate)
wer_score = wer(test_dataset["sentence"], test_dataset["predicted"])
print(f"Word Error Rate (WER): {wer_score}")

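# Step 9: Convert to TensorFlow Lite for Edge Deployment
from transformers import TFWhisperForConditionalGeneration
import tensorflow as tf

# Save the fine-tuned weights and processor so they can be reloaded below
trainer.save_model("./whisper-dogri")
processor.save_pretrained("./whisper-dogri")

# Convert PyTorch model to TensorFlow
tf_model = TFWhisperForConditionalGeneration.from_pretrained("./whisper-dogri", from_pt=True)

# Convert to TFLite. Whisper may need ops outside the TFLite builtin set, so
# Select TF ops are enabled. This path has not been run in this notebook and
# may also require exporting a fixed-shape generate() signature.
converter = tf.lite.TFLiteConverter.from_keras_model(tf_model)
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic-range quantization
tflite_model = converter.convert()

with open("whisper-dogri.tflite", "wb") as f:
    f.write(tflite_model)

print("TFLite model saved to 'whisper-dogri.tflite'")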




Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


# Documentation 

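1. Steps of how data was prepared:

    Data cleaning, normalization, and synthetic data generation. Symptom text is lowercased and stripped of punctuation, "but" clauses are rewritten as negations, and the result is saved as CSV.

2. Model architecture(s), libraries, techniques used:

    Rule-based, NER, ML, and lightweight Transformer-based models. In this notebook these are spaCy's `en_core_web_sm` with `PhraseMatcher`, keyword-based severity scoring, dependency-subtree cause extraction, and Whisper-small via Hugging Face Transformers.

3. Why did you choose a particular NLP architecture?

    Chosen for efficiency, accuracy, and suitability for edge deployment: the rule-based spaCy components are cheap to run and easy to inspect, and Whisper brings multilingual pre-training that can be fine-tuned.

4. Scaling strategy for additional symptoms:

    Use of synonym dictionaries and modular design for easy scaling: a new symptom is a new dictionary entry, and each task runs as a separate step.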

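5. Observations on accuracy and memory usage:

    The rule-based components rely on a small spaCy model and dictionary lookups, which keeps the memory footprint low. Accuracy and memory figures still need to be measured on a labeled test set.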

6. Edge efficiency strategies:

    Model quantization, pruning, and TensorFlow Lite for edge deployment.

    Detailed approach for developing a Dogri STT model.
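
To make the scaling strategy in item 4 concrete, the sketch below inverts a synonym dictionary into a lookup table, so that adding a symptom is a data change rather than a code change. `build_synonym_index` and the "fever" entry are hypothetical and only illustrate the idea.

```python
def build_synonym_index(symptom_dict):
    """Invert {canonical: [synonyms]} into {synonym: canonical}."""
    index = {}
    for canonical, synonyms in symptom_dict.items():
        for phrase in synonyms:
            index[phrase.lower()] = canonical
    return index

symptom_dict = {
    "stomach pain": ["stomach pain", "stomach ache"],
    "back pain": ["back pain", "lower back pain"],
}

# Scaling up is a data change: add an entry, then rebuild the index.
symptom_dict["fever"] = ["fever", "high temperature"]  # hypothetical new symptom
index = build_synonym_index(symptom_dict)
print(index["stomach ache"], "|", index["high temperature"])
```

In practice the dictionary could live in a JSON or CSV file maintained by clinicians, which keeps the matcher code unchanged as coverage grows.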

# Task 5: Practical Assessment Questions

### Question 1: Model Selection & Data Preparation
1. How would you approach designing a robust STT model specifically for Dogri?
Approach:

    Leverage pre-trained multilingual STT models and fine-tune them on Dogri data.

    Use transfer learning to bootstrap the model using Hindi datasets, given the linguistic similarity between Hindi and Dogri.

    Optimize the model for edge deployment using techniques like quantization and pruning.

2. What existing pre-trained models or toolkits would you initially consider and why?
    #### Whisper (OpenAI):

    Multilingual and pre-trained on a large corpus of diverse languages; the smaller checkpoints (tiny, base, small) are light enough to consider for edge deployment.

    Suitable for fine-tuning on low-resource languages like Dogri.

    #### AI4Bharat:

    Specialized for Indian languages, including Hindi and other regional languages.

    Provides pre-trained models and tools for Indian language STT.

    #### Vosk:

    Lightweight and optimized for edge devices.

    Supports multiple languages and can be fine-tuned for Dogri.

    #### Bhashini:

    Focused on Indian languages and provides datasets and tools for STT.

    #### Kaldi:

    Highly customizable and widely used for speech recognition tasks.

    Requires more effort for fine-tuning but offers flexibility.

3. How would you leverage Hindi datasets/models to bootstrap your Dogri model?
    #### Transfer Learning:

    Use a pre-trained Hindi STT model (e.g., from AI4Bharat or Whisper) as a starting point.

    Fine-tune the model on Dogri data to adapt it to the specific phonetic and lexical characteristics of Dogri.

    #### Data Augmentation:

    Use Hindi datasets to generate synthetic Dogri data by replacing Hindi words with Dogri equivalents.

    This helps in bootstrapping the model when Dogri data is limited.

### Question 2: Steps for Model Development
1. Data Collection & Preparation
    #### Data Gathering:

    Collect Dogri speech datasets from public sources like AI4Bharat, Bhashini, or Common Voice.

    Collaborate with local communities to record Dogri speech data.

    #### Audio Cleaning/Noise Filtering:

    Use tools like Librosa or FFmpeg to clean audio files (remove background noise, normalize volume, etc.).

    #### Transcription and Validation:

    Transcribe the audio data using crowdsourcing or automated tools.

    Validate transcriptions with native Dogri speakers to ensure accuracy.

    #### Dataset Normalization:

    Normalize the dataset by converting all audio files to a standard format (e.g., 16kHz, mono).

    Split the dataset into training, validation, and test sets.

2. Synthetic Data Generation
    If Dogri data is limited:

    Use Hindi datasets to generate synthetic Dogri data by replacing Hindi words with Dogri equivalents.

    Use Text-to-Speech (TTS) tools like Google TTS or Microsoft Azure TTS to generate synthetic Dogri audio.
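
The train/validation/test split mentioned under Dataset Normalization can be sketched as follows. This is a minimal illustration, assuming an 80/10/10 ratio and a fixed seed for reproducibility; `split_dataset` and the clip file names are hypothetical.

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=42):
    """Shuffle deterministically and cut into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

clips = [f"clip_{i:03d}.wav" for i in range(100)]  # hypothetical file names
train_set, val_set, test_set = split_dataset(clips)
print(len(train_set), len(val_set), len(test_set))
```

With only a handful of speakers, splitting by speaker rather than by clip may give a more honest test set, since the same voice would otherwise appear in both training and test data.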

### Question 3: Data Diversity
1. How would you ensure your dataset represents variations?
    #### Dialect and Pronunciation:

    Collect data from different regions where Dogri is spoken to capture dialectal variations.

    #### Accents:

    Include speakers with different accents (e.g., urban vs. rural).

    #### Background Noise:

    Add background noise to clean audio files to simulate real-world conditions.

    #### Speaker Diversity:

    Ensure a balanced representation of age, gender, and speaker demographics.
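
The background-noise augmentation above usually means mixing noise into clean speech at a chosen signal-to-noise ratio (SNR). The sketch below shows the arithmetic in plain Python, assuming both signals are lists of float samples of equal length; a sine tone stands in for speech and Gaussian noise for a recorded background, and `mix_at_snr` is an illustrative name.

```python
import math
import random

def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]

rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]  # stand-in tone
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]  # stand-in background noise
noisy = mix_at_snr(speech, noise, snr_db=10)

# Check: recover the added noise and measure the achieved SNR in dB.
added = [n - s for n, s in zip(noisy, speech)]
achieved_snr_db = 10 * math.log10(power(speech) / power(added))
print(round(achieved_snr_db, 2))
```

Sampling `snr_db` from a range (for example 5 to 20 dB) per clip would expose the model to both quiet and noisy conditions; the range itself is a tuning choice.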

### Question 4: Model Training & Evaluation Strategy
1. Training Approach
    #### Fine-Tuning:

    Start with a pre-trained multilingual STT model (e.g., Whisper or AI4Bharat).

    Fine-tune the model on the Dogri dataset using transfer learning.

    #### Architecture:

    Use a Transformer-based architecture (e.g., Whisper) for its efficiency and accuracy.

    Alternatively, use a CNN-RNN hybrid for lightweight edge deployment.

2. Evaluation Strategy
    #### Metrics:

    Use Word Error Rate (WER) and Character Error Rate (CER) to evaluate the model.

    #### Cross-Validation:

    Perform k-fold cross-validation to ensure the model generalizes well to unseen data.

    #### Real-World Testing:

    Test the model on real-world Dogri speech data to evaluate its performance in practical scenarios.
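
Both metrics above are edit distances normalized by the reference length: WER counts word-level substitutions, deletions, and insertions, and CER does the same over characters. The following is a small dependency-free sketch of that definition; libraries such as `jiwer` apply additional text normalization, so their numbers may differ slightly. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
print(char_error_rate("kitten", "sitting"))
```

CER is often the more informative of the two for low-resource languages with inconsistent spelling, because a single character difference does not count as a whole wrong word.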

### Question 5: Training & GPU Requirements
1. Hardware Requirements
    #### GPU:

    Use high-performance GPUs like NVIDIA A100, RTX 4090, or H100 for training.

    #### RAM:

    At least 32GB RAM for handling large datasets.

    #### CPU:

    Multi-core CPU (e.g., Intel Xeon or AMD Ryzen) for data preprocessing.

2. Training Time
    #### GPU Hours:

    Approximately 100-200 GPU hours for fine-tuning on a medium-sized dataset.

    #### Epochs:

    Train for 10-20 epochs with early stopping to prevent overfitting.

3. Inference on Smartphones
    #### RAM Impact:

    The model should use less than 500MB RAM for inference.

    #### Battery Impact:

    Optimize the model to minimize battery consumption (e.g., use quantization).

    #### Inference Framework:

    Use TensorFlow Lite (TFLite) or ONNX Runtime for edge deployment.

    #### TFLite: 
    Lightweight and optimized for mobile devices.

    #### ONNX Runtime: 
    Cross-platform and supports hardware acceleration.
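
A rough check of the 500MB RAM target against Whisper-small can be done with simple arithmetic. The sketch below assumes roughly 244 million parameters (the figure in OpenAI's model card) and counts weights only; activations, the decoder cache, and runtime overhead are not included, so real usage will be higher.

```python
# Whisper-small has roughly 244 million parameters (OpenAI model card figure).
PARAMS = 244_000_000
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}
BUDGET_MB = 500  # RAM target stated above

def weight_size_mb(params, dtype):
    """Size of the weights alone in megabytes (10**6 bytes)."""
    return params * BYTES_PER_WEIGHT[dtype] / 1e6

for dtype in BYTES_PER_WEIGHT:
    size = weight_size_mb(PARAMS, dtype)
    verdict = "fits" if size < BUDGET_MB else "exceeds"
    print(f"{dtype}: {size:.0f} MB ({verdict} the {BUDGET_MB} MB budget)")
```

On these assumptions float32 weights alone exceed the budget and float16 leaves almost no headroom, which is why int8 quantization (or a smaller checkpoint) is the more realistic route for smartphones.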