<a href="https://colab.research.google.com/github/ahgomaa/TaqreeRx/blob/main/Untitled5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install required libraries
!pip install -q transformers torch torchaudio soundfile
!pip install -q git+https://github.com/openai/whisper.git
!pip install -q accelerate bitsandbytes sentencepiece

# Set up to use the GPU
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# System imports
import whisper
import soundfile as sf
import io
import time
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

Using device: cuda


In [None]:
# Define the audio file path
AUDIO_FILE_PATH = "arabic_clinical_audio.mp3"

# --- Simulating Audio Recording ---
# In a real application, this part would handle real-time audio input.
# For this PoC, we only check that the file exists and can be opened.
try:
    with open(AUDIO_FILE_PATH, 'rb') as f:
        print(f"Successfully loaded audio file: {AUDIO_FILE_PATH}")
except FileNotFoundError:
    print(f"ERROR: Audio file '{AUDIO_FILE_PATH}' not found. Please upload it.")
    # Exit or provide placeholder text if the file is missing
    raise

Successfully loaded audio file: arabic_clinical_audio.mp3
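
The existence check above could be extended to catch a few more upload mistakes before the slow transcription step runs. The following is a minimal sketch, not part of the original notebook: the helper name `validate_audio_file` and the `SUPPORTED_EXTENSIONS` set are assumptions chosen for illustration, and the extension list is not an authoritative statement of what Whisper or ffmpeg can decode.

```python
from pathlib import Path

# Assumed allow-list for illustration only; adjust to the formats you expect.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def validate_audio_file(path):
    """Return a Path if the file exists, has an expected extension, and is non-empty."""
    audio_path = Path(path)
    if not audio_path.is_file():
        raise FileNotFoundError(f"Audio file '{audio_path}' not found. Please upload it.")
    if audio_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unexpected audio extension: '{audio_path.suffix}'")
    if audio_path.stat().st_size == 0:
        raise ValueError(f"Audio file '{audio_path}' is empty.")
    return audio_path


# Hypothetical usage with the notebook's path:
# audio_path = validate_audio_file("arabic_clinical_audio.mp3")
```

Raising distinct exceptions keeps the "file missing" case separate from the "wrong or empty file" case, so the notebook can print a targeted message for each.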


In [None]:
# Load the Whisper Large-v3 model for Arabic transcription
# This will automatically download the model (~3GB)
print("\n--- 1. Loading ASR Model (Whisper Large-v3) ---")
asr_model = whisper.load_model("large-v3", device=device)

def transcribe_audio(file_path):
    print("Transcribing audio...")
    start_time = time.time()

    # Force language to Arabic for better results
    result = asr_model.transcribe(file_path, language="ar", verbose=False)

    end_time = time.time()
    print(f"Transcription complete in {end_time - start_time:.2f} seconds.")
    return result["text"]

# Run the transcription
transcript = transcribe_audio(AUDIO_FILE_PATH)
print("\n--- Generated Transcript ---")
print(transcript)


--- 1. Loading ASR Model (Whisper Large-v3) ---


100%|█████████████████████████████████████| 2.88G/2.88G [00:57<00:00, 53.5MiB/s]


Transcribing audio...


100%|██████████| 21326/21326 [01:00<00:00, 353.03frames/s]

Transcription complete in 62.57 seconds.

--- Generated Transcript ---
 اتفضل يا عم بفتحي اسأل ايه حضرتك خير ان شاء الله اهلا دكتور والله من ابخير قلبي وجادي ومش عارف اخد نفسي. قال في السلام عليك يا حاج. احكي لي كده. الالم دوت بدأ امتى بالزبط? فين مكانه بالتحديد? بدأ معي من حوالي يومين تلاتة كده يا دكتور. بس النهاردة الصبح زاد زاد خالص. الالم تحس انه ضغطه كده على صدري. من النص وساعات بيسمع في دراعي الشمال. وبيسمع في دراعك الشمال بس ولا ممكن كمان يروح لحد الرقبة او ناحية الفك. ويا ترى بيجي لك وانت بتعمل مجهود? ولا وانت قاعد مسترايح? غالبا بيجي لي لما اتحرك كتير واطلع السلالم بس النهاردة حتى وانا قاعد في البيت عسيت بيس حتى حاسك ايدي مخنوق يا دكتور. تمام على مقياس واحد لحد عشر. لو عشرة اقصى الالم. تقدر تحدد لي الالم قد ايه? سبعة خمسة اربعة بيجع جامد ولا ما بيجعش جامد. الله يا دكتور اقول لك سبعة تمانية. الالم كده جامد وحامي ومش بيروح بسهولة. ولما بيجي خلاص بحس بها مدان شديد جدا وعرق في وشي. اه تمام. طيب يا طيب حضرتك بتدخن? ولا كنت اه بتدخن ووقفت? كنت بقى اشرب سجايرة زبال بس بطلت بقى لي حوا

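
Besides `"text"`, the dictionary returned by Whisper's `transcribe` also carries a `"segments"` list whose entries include `"start"`, `"end"`, and `"text"`. A timestamped view can make a long consultation like the one above easier to review. The sketch below is an illustration only: `format_timestamp`, `format_segments`, and the `example_segments` data are hypothetical and are not taken from this run.

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(segments):
    """Turn Whisper-style segment dicts into one timestamped line per segment."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start}-{end}] {seg['text'].strip()}")
    return "\n".join(lines)


# Made-up segments shaped like Whisper's output, for illustration only.
example_segments = [
    {"start": 0.0, "end": 4.2, "text": " first utterance"},
    {"start": 4.2, "end": 65.9, "text": " second utterance"},
]
print(format_segments(example_segments))
```

To use this with the notebook, `transcribe_audio` would need to return the full `result` dictionary rather than only `result["text"]`.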



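The next cell loads the LLM in 4-bit to fit Colab's GPU memory. A back-of-envelope estimate shows why this matters. It assumes the ~15GB checkpoint stores weights at 16 bits each, and it ignores quantization overhead, any layers that stay unquantized, and the memory used by activations and the KV cache. The function name `estimate_weight_memory_gb` is hypothetical.

```python
def estimate_weight_memory_gb(checkpoint_gb, stored_bits=16, target_bits=4):
    """Scale a checkpoint size to a different number of bits per weight."""
    return checkpoint_gb * target_bits / stored_bits


# ~15GB is the download size quoted in the notebook; 16-bit storage is an assumption.
checkpoint_gb = 15.0
four_bit_gb = estimate_weight_memory_gb(checkpoint_gb)
print(f"16-bit weights: ~{checkpoint_gb:.1f} GB, 4-bit weights: ~{four_bit_gb:.2f} GB")
```

Real usage will be somewhat higher than this figure, so treat it as a lower bound on the weight memory rather than a measurement.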
In [None]:
# --- 2. Loading LLM for Note Generation (Qwen-1.5-7B-Chat) ---
# This model will be automatically downloaded (~15GB)
print("\n--- 2. Loading LLM (Qwen-1.5-7B-Chat) ---")
model_name = "Qwen/Qwen1.5-7B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_name)

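from transformers import BitsAndBytesConfig

# Load the model in 4-bit for reduced VRAM usage in Colab.
# `dtype` replaces the deprecated `torch_dtype`, and the quantization flag
# moves into a BitsAndBytesConfig passed as `quantization_config`.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)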

def generate_soap_note(transcript):
    print("Generating structured SOAP note...")
    start_time = time.time()

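    # Define the prompt (in Arabic) that instructs the model to turn the
    # transcript into a SOAP note. English translation of the system prompt:
    # "You are a professional medical assistant. Your task is to convert the
    # medical dialogue text into a structured clinical note in SOAP format
    # (Subjective, Objective, Assessment, Plan) in Modern Standard Arabic.
    # Do not add any information that is not in the text."
    # The user prompt prefixes the transcript with "Dialogue text:".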
    system_prompt = "أنت مساعد طبي محترف. مهمتك هي تحويل النص الحواري الطبي إلى مذكرة سريرية منظمة بصيغة SOAP (Subjective, Objective, Assessment, Plan) باللغة العربية الفصحى. لا تضف أي معلومات غير موجودة في النص."
    user_prompt = f"النص الحواري: \n---\n{transcript}"

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
    ]

    # Tokenize the input; return_dict=True also returns the attention mask,
    # which generate() needs because the pad token equals the EOS token
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True
    ).to(model.device)

    # Generate the response
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens (skip the prompt)
    prompt_length = inputs["input_ids"].shape[-1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

    end_time = time.time()
    print(f"Note generation complete in {end_time - start_time:.2f} seconds.")
    return response

# Run the note generation
soap_note = generate_soap_note(transcript)
print("\n====================================")
print("--- FINAL GENERATED CLINICAL NOTE ---")
print("====================================")
print(soap_note)


--- 2. Loading LLM (Qwen-1.5-7B-Chat) ---


Generating structured SOAP note...
Note generation complete in 54.29 seconds.

--- FINAL GENERATED CLINICAL NOTE ---
SOAP (Subjective, Objective, Assessment, Plan):

Subjective:
- The patient, "عم", is experiencing severe and persistent chest pain that worsens in the morning, specifically in the left side, with a dull, heavy sensation.
- He reports no relief from moving or resting, and the pain intensifies when he looks up or moves his arm.
- The patient has not smoked for five years but has recently increased his weight, affecting his chest discomfort.
- He has diabetes on medication for high blood sugar and hypertension, and his cholesterol levels are high.

Objective:
- Chest pain: The patient's chest pain is described as a cold, hard, and unrelenting sensation.
- Heart rate: There is a significant increase in heart rate with the pain.
- Medical history: The patient has a history of high blood pressure and a minor heart condition, with weakened cardiac muscle.

Assessment:
- The pat