# Quickstart: Using the Speech Service from Python

## Get the Speech SDK Python Package

Run the following command to install the Cognitive Services Speech SDK Python package:

`pip install azure-cognitiveservices-speech`

## Download the PriMock57 Dataset

We will be using the [PriMock57 dataset](https://github.com/babylonhealth/primock57/tree/main) for testing the relevant evaluation metrics.

The transcription texts are stored as `.TextGrid` files, which require the `textgrid` Python library to read. Install it by running the following command:

`pip install textgrid`

We define some helper functions for ingesting the data in the formats described above.

In [30]:
import re
import textgrid

def extract_text_from_textgrid(file_path, tier_name=None):
    # Load the TextGrid file
    tg = textgrid.TextGrid.fromFile(file_path)
    
    # Find the relevant tier (if tier_name is provided)
    if tier_name:
        tier = tg.getFirst(tier_name)
    else:
        # If no specific tier is mentioned, extract from the first tier
        tier = tg[0]

    # Extract the intervals with text and concatenate them
    extracted_text = []
    for interval in tier:
        if interval.mark.strip():  # Only consider non-empty intervals
            extracted_text.append(interval.mark)
    
    # Return all the concatenated text
    return " ".join(extracted_text)

def remove_intents(text):
    """
    Remove annotation tags such as <UNSURE> and <UNIN/> from the text.
    Only the tags themselves are removed; words between an opening and a closing tag are kept.
    """
    # Regular expression pattern to match any text within angle brackets including the brackets
    pattern = r'<[^>]*>'
    # Use re.sub to replace matches with an empty string
    cleaned_text = re.sub(pattern, '', text)
    # Optionally, you can remove extra spaces left after removing tags
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    return cleaned_text
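
As a quick sanity check of the tag-stripping step, the sketch below applies the same two substitutions to a made-up transcript fragment. The function name `strip_tags` and the sample string are hypothetical and are not taken from PriMock57; the point is only to show that the tags disappear while the words between them survive.

```python
import re

def strip_tags(text):
    # Same two-step clean-up as remove_intents:
    # 1) drop anything enclosed in angle brackets, 2) collapse leftover whitespace.
    without_tags = re.sub(r"<[^>]*>", "", text)
    return re.sub(r"\s+", " ", without_tags).strip()

# Hypothetical sample line (not taken from the dataset)
sample = "Hello <UNSURE>how</UNSURE> can I <UNIN/> help you   today?"
cleaned = strip_tags(sample)
print(cleaned)
```

Because the pattern matches only the bracketed markers, a word wrapped in `<UNSURE>...</UNSURE>` stays in the reference text and still counts towards the evaluation metrics.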

## Speech Recognition Using the Speech SDK

We import the relevant libraries for using Speech SDK.

In [17]:
import os
import time
from dotenv import load_dotenv
import azure.cognitiveservices.speech as speechsdk


Set up the subscription info for Speech Service:

In [10]:
# Set up subscription info for the Speech Service
load_dotenv() # Load environment variables such as Speech SDK API keys
AZURE_SPEECH_KEY = os.getenv("SPEECHSDK_API_KEY")
AZURE_SERVICE_REGION = os.getenv("SPEECHSDK_REGION")
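
`os.getenv` returns `None` when a variable is missing, and the resulting failure only surfaces later, inside the SDK. A minimal sketch of an early check is shown below; `require_env` is a hypothetical helper, not part of the Speech SDK or `python-dotenv`, and `EXAMPLE_SETTING` is a placeholder variable used only for the demonstration.

```python
import os

def require_env(name):
    # Hypothetical helper: fail early with a clear message if a setting is missing or empty
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file.")
    return value

# Demonstration with a placeholder variable so the example runs anywhere
os.environ["EXAMPLE_SETTING"] = "demo"
print(require_env("EXAMPLE_SETTING"))
```

In the notebook, the same helper could wrap the two lookups above, for example `require_env("SPEECHSDK_API_KEY")`.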

The following code uses the real-time speech-to-text capability of the Azure Speech Service through a `SpeechRecognizer` instance. It reads an input **audio file** as a continuous stream, simulating live audio input.

In [18]:
# Initialize Speech Service
def initialize_speech_service(audio_file_path):
    speech_config = speechsdk.SpeechConfig(subscription=AZURE_SPEECH_KEY, region=AZURE_SERVICE_REGION)
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file_path)
    return speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Perform transcription on the audio file
def transcribe_continuous_audio_file(recognizer):
    recognized_speech = []

    # Set a variable to manage the state of transcription
    done = False

    def handle_recognized(evt):
        print(f"Recognized: {evt.result.text}")
        recognized_speech.append(evt.result.text)
    
    def handle_canceled(evt):
        nonlocal done
        # The cancellation reason is in the cancellation details;
        # evt.result.reason is a ResultReason, not a CancellationReason.
        details = evt.cancellation_details
        print(f"Recognition canceled: {details.reason}")
        if details.reason == speechsdk.CancellationReason.Error:
            print(f"Error details: {details.error_details}")
        done = True

    def handle_session_stopped(evt):
        # Also finish when the recognition session ends
        nonlocal done
        done = True

    # Attach handlers for recognized results, cancellations, and session end
    recognizer.recognized.connect(handle_recognized)
    recognizer.canceled.connect(handle_canceled)
    recognizer.session_stopped.connect(handle_session_stopped)
    
    # Start continuous recognition
    recognizer.start_continuous_recognition()
    print("Transcribing...")

    # Wait for completion (i.e., done = True); `time` is imported with the other libraries
    try:
        while not done:
            time.sleep(.5)
    except KeyboardInterrupt:
        print("Transcription stopped by user.")
    finally:
        # Stop from the main thread rather than from inside a callback
        recognizer.stop_continuous_recognition()

    return " ".join(recognized_speech)  # Return all the transcriptions concatenated

In [19]:
# Example usage
recognizer = initialize_speech_service("C:/Users/faith/batcave/primock57/audio/day1_consultation01_doctor.wav")
hypothesis_text = transcribe_continuous_audio_file(recognizer)

Transcribing...
Recognized: Hello.
Recognized: Hi. Yeah. OK. Hello. Good morning. So how can I help you this morning?
Recognized: Yeah, I'm sorry to hear that. And when you say diarrhoea, what do you mean by diarrhea? Do you mean you're going to the toilet more often or are your stools more loose?
Recognized: OK. And how many times a day are you going, let's say over the last couple of days?
Recognized: 6-7 times a day and you mentioned this mainly water tree. Have you noticed any other things like blood in your stools?
Recognized: OK. And you mentioned you've had some pain in your tummy as well. Whereabouts is the pain exactly?
Recognized: One side. And what side is that?
Recognized: That's right. OK. And can you describe the pain to me?
Recognized: OK. And there's a pain. Is that is it there all the time or does it come and go?
Recognized: Does the pain move anywhere else because on between your back?
Recognized: OK, fine. And you mentioned you've been feeling quite weak and shaky as

## Evaluation Metrics

We use `jiwer`, a Python package for evaluating an automatic speech recognition system. It supports the following word measures: word error rate (WER), match error rate (MER), word information lost (WIL) and word information preserved (WIP).

- Official documentation: https://jitsi.github.io/jiwer/usage/ 
- GitHub repository: https://github.com/jitsi/jiwer/tree/master?tab=readme-ov-file

### 1. Word Error Rate (WER)
- **Definition**: WER is a metric used to evaluate the accuracy of speech recognition systems by comparing the output of the system (hypothesis) to a reference (ground truth).
- **Calculation**:
  $$
  \text{WER} = \frac{S + D + I}{N}
  $$
  where:
  - \(S\) = Number of substitutions
  - \(D\) = Number of deletions
  - \(I\) = Number of insertions
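  - \(N\) = Total number of words in the reference (\(N = H + S + D\), where \(H\) is the number of correctly recognized words, or hits)
- **Interpretation**: A lower WER indicates better performance. It is usually reported as a fraction or percentage of the number of words in the reference. Because insertions are counted as errors, WER can exceed 100%.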
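
To make the formula concrete, here is a minimal pure-Python sketch that computes WER with a word-level edit distance. It is an illustration only, not how `jiwer` is implemented; the function name `word_error_rate` and the two example sentences are made up for this purpose, and the text is assumed to be already normalized (lower case, no punctuation).

```python
def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("The reference must contain at least one word.")

    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # match or substitution
            )

    # S + D + I is the edit distance; N is the reference length
    return dist[len(ref)][len(hyp)] / len(ref)

# Hypothetical example: two substitutions and one insertion against five reference words
wer = word_error_rate("the pain comes and goes", "the pain come and go away")
print(f"WER = {wer:.2f}")
```

In this example the reference has five words and the cheapest alignment needs three edits, so the WER is 3/5.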

### 2. Match Error Rate (MER)
- **Definition**: MER is the proportion of aligned word positions that are errors, i.e. the probability that a given match between reference and hypothesis is incorrect. Like WER, it counts substitutions, deletions, and insertions, but it normalizes by the total number of aligned positions (hits plus errors) rather than by the reference length.
- **Calculation**:
  $$
  \text{MER} = \frac{S + D + I}{H + S + D + I}
  $$
  where:
  - \(S\), \(D\), \(I\) = Number of substitutions, deletions, and insertions, as in WER
  - \(H\) = Number of hits (correctly recognized words)
- **Interpretation**: A lower MER indicates better performance. Unlike WER, MER always lies between 0 and 1.

### 3. Word Information Lost (WIL)
- **Definition**: WIL estimates the proportion of word information lost between the reference and the system's output. It is a word-level measure computed from the alignment counts, and it is the complement of word information preserved (WIP).
- **Calculation**:
  $$
  \text{WIL} = 1 - \frac{H}{N} \cdot \frac{H}{N_{\text{hyp}}}
  $$
  where:
  - \(H\) = Number of hits (correctly recognized words)
  - \(N\) = Total number of words in the reference
  - \(N_{\text{hyp}}\) = Total number of words in the hypothesis
- **Interpretation**: A higher WIL indicates a greater loss of information. A value of 0 means the hypothesis matches the reference exactly; a value of 1 means no word was recognized correctly.

### 4. Word Information Preserved (WIP)
- **Definition**: WIP estimates the proportion of word information preserved between the reference and the system's output. It is the product of the fraction of reference words that were recognized correctly and the fraction of hypothesis words that are correct.
- **Calculation**:
  $$
  \text{WIP} = \frac{H}{N} \cdot \frac{H}{N_{\text{hyp}}}
  $$
  where:
  - \(H\) = Number of hits (correctly recognized words)
  - \(N\) = Total number of words in the reference
  - \(N_{\text{hyp}}\) = Total number of words in the hypothesis
- **Interpretation**: A higher WIP indicates better performance. WIP and WIL always sum to 1, so \(\text{WIP} = 1 - \text{WIL}\).
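
The four measures can all be derived from the same alignment counts. The sketch below follows the definitions used by `jiwer`, where WIP stands for word information preserved and WIL is its complement. The function name `measures_from_counts` and the example counts are hypothetical, and treating an empty hypothesis as WIP = 0 is an assumption of this sketch.

```python
def measures_from_counts(hits, substitutions, deletions, insertions):
    errors = substitutions + deletions + insertions
    n_ref = hits + substitutions + deletions   # words in the reference
    n_hyp = hits + substitutions + insertions  # words in the hypothesis
    if n_ref == 0:
        raise ValueError("The reference must contain at least one word.")

    wer = errors / n_ref
    mer = errors / (hits + errors)
    # Assumption of this sketch: an empty hypothesis preserves no information
    wip = (hits / n_ref) * (hits / n_hyp) if n_hyp > 0 else 0.0
    wil = 1.0 - wip
    return {"wer": wer, "mer": mer, "wil": wil, "wip": wip}

# Hypothetical counts: 8 hits, 1 substitution, 1 deletion, 2 insertions
measures = measures_from_counts(hits=8, substitutions=1, deletions=1, insertions=2)
for name, value in measures.items():
    print(f"{name.upper()}: {value:.3f}")
```

With these counts the reference has 10 words and the hypothesis has 11. WER divides the 4 errors by the 10 reference words, while MER divides them by all 12 aligned positions, which is why MER is never larger than WER.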


In [59]:
# Initialize reference text
reference_text = extract_text_from_textgrid("C:/Users/faith/batcave/primock57/transcripts/day1_consultation01_doctor.TextGrid")
cleaned_reference_text = remove_intents(reference_text)

In [60]:
# Visualize alignment using jiwer package
import jiwer

def get_alignment(reference_text, hypothesis_text):
    # Normalize the text by transforming it to lower case and removing punctuation
    transformation = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.RemovePunctuation(),
        jiwer.RemoveMultipleSpaces(),
        jiwer.Strip()
    ])

    ref = transformation(reference_text)
    hyp = transformation(hypothesis_text)
    
    out = jiwer.process_words(ref, hyp)
    print(jiwer.visualize_alignment(out))

get_alignment(cleaned_reference_text, hypothesis_text)

sentence 1
REF: hello hi um should we start yeah okay hello how um good morning sir how can i help you this morning **** ** sorry to hear that um and and when you say ********* diarrhea whatd you mean by diarrhea do you mean youre going to the toilet more often or are your stools more loose okay and how many times a day are you going lets say   in the last couple of days six seven times a day and you   mention  its mainly ***** watery have you noticed any other things like blood in your stools okay and you mentioned youve had some pain in your tummy as well whereabouts is the pain exactly one side and what side is that  left  side okay and can you describe the pain to me okay and     is the pain is that is it there all the time or does it come and go come and go does the pain move anywhere else     for example towards your back okay fine and you mentioned youve been feeling quite weak and shaky as well what do you mean by shaky do you mean youve been having uh have you been feeling fev