# Evaluating Speech Recognition

In this notebook, we evaluate the speech recognition capabilities of the Azure Speech SDK and Whisper using the metrics introduced in `quickstart.ipynb`.

We evaluate the following 3 key aspects of speech recognition:
1. Real-time speech recognition (English)
2. Speech diarization
3. Multilingual speech recognition (i.e. Mother Tongue languages such as Mandarin, Tamil and Malay) - **KIV (need to check the data)**



## Description of Dataset

The dataset that we will be using is the [National Speech Corpus (NSC)](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus). 

### National Speech Corpus (NSC)

The **National Speech Corpus (NSC)** is a large-scale dataset of Singaporean English speech recordings, developed to support research and development in speech technologies such as speech recognition and synthesis. It was created by Singapore’s Infocomm Media Development Authority (IMDA) in collaboration with research institutions.

#### Key features:
- **Diverse Accents**: The corpus includes a wide variety of Singaporean English accents and pronunciations, reflecting the multilingual and multicultural context of Singapore. It also contains multilingual conversation transcriptions, which are useful for diarization & multilingual evaluations.
- **Speech Data**: It contains recordings of both read and spontaneous speech, covering various speaking styles and scenarios. This lets us evaluate how well a speech-to-text system handles context-switching.
- **Applications**: The NSC is used in developing AI systems for natural language processing, speech-to-text transcription, and improving voice-activated services tailored to local contexts.

It serves as a valuable resource for training speech recognition models, especially those designed for Singapore's linguistic environment.


In [1]:
import textgrid

def extract_text_from_textgrid(file_path, tier_name=None):
    # Load the TextGrid file
    tg = textgrid.TextGrid.fromFile(file_path)
    
    # Find the relevant tier (if tier_name is provided)
    if tier_name:
        tier = tg.getFirst(tier_name)
    else:
        # If no specific tier is mentioned, extract from the first tier
        tier = tg[0]

    # Extract the intervals with text and concatenate them
    extracted_text = []
    for interval in tier:
        if interval.mark.strip():  # Only consider non-empty intervals
            extracted_text.append(interval.mark)
    
    # Return all the concatenated text
    return " ".join(extracted_text)

In [2]:
def truncate_after_marker(input_string, phrase, marker):
    # Find the index of the phrase in the string
    phrase_index = input_string.lower().find(phrase.lower())
    
    if phrase_index != -1:
        # Find the index of the marker after the phrase
        marker_index = input_string.find(marker, phrase_index + len(phrase))
        
        if marker_index != -1:
            # Truncate the string after the marker
            truncated_string = input_string[:marker_index + len(marker)]
        else:
            # If the marker is not found after the phrase, return up to the phrase
            truncated_string = input_string[:phrase_index + len(phrase)]
    else:
        # If the phrase is not found, return the original string
        truncated_string = input_string
    
    return truncated_string

# We truncate the audio data to 8 minutes only (original is 2h long!)
# Truncate the reference text data as well:
# Last topic in the conversation was about tuition - we use this as a marker for truncation
txt1 = extract_text_from_textgrid("C:/Users/faith/Downloads/3010-1.TextGrid")
speaker1_transcript = truncate_after_marker(txt1, "tuition", "<S>") 
txt2 = extract_text_from_textgrid("C:/Users/faith/Downloads/3010-2.TextGrid")
speaker2_transcript = truncate_after_marker(txt2, "tuition", "<S>")

## Real-Time Diarization from Speech SDK

1. Set up subscription service and import necessary libraries.

In [3]:
import os
import json
import time
from dotenv import load_dotenv
import azure.cognitiveservices.speech as speechsdk

# Set up subscription info for the Speech Service
load_dotenv() # Load environment variables such as Speech SDK API keys
AZURE_SPEECH_KEY = os.getenv("SPEECHSDK_API_KEY")
AZURE_SERVICE_REGION = os.getenv("SPEECHSDK_REGION")

2. Edit below fields to customize file paths.

In [5]:
# Edit below fields
AUDIO_FILENAME = "../data/nsc/3010_short.wav" # Change audio filename
LOG_FILEPATH = "../data/nsc/3010_transcript.json" # Log transcript JSON

3. Real-time diarization quickstart code adapted from [documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-stt-diarization?tabs=windows&pivots=programming-language-python).

In [6]:
# Helper function to log transcription data while conversation is being transcribed
def log_transcription(speaker_id, text):
    """Log the speaker ID and text to a JSON file."""
    log_entry = {
        "speaker_id": speaker_id,
        "transcription": text
    }
    
    # Check if the file already exists and load existing data
    if os.path.exists(LOG_FILEPATH):
        with open(LOG_FILEPATH, "r") as log_file:
            data = json.load(log_file)
    else:
        data = []
    
    # Append new entry
    data.append(log_entry)
    
    # Write the updated data back to the file
    with open(LOG_FILEPATH, "w") as log_file:
        json.dump(data, log_file, indent=4)

def conversation_transcriber_recognition_canceled_cb(evt: speechsdk.SessionEventArgs):
    print('Canceled event')

def conversation_transcriber_session_stopped_cb(evt: speechsdk.SessionEventArgs):
    print('SessionStopped event')

def conversation_transcriber_transcribed_cb(evt: speechsdk.SpeechRecognitionEventArgs):
    print('TRANSCRIBED:')
    if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print('\tText={}'.format(evt.result.text))
        print('\tSpeaker ID={}'.format(evt.result.speaker_id))
                
        # Log the transcription
        log_transcription(evt.result.speaker_id, evt.result.text)

    elif evt.result.reason == speechsdk.ResultReason.NoMatch:
        print('\tNOMATCH: Speech could not be TRANSCRIBED: {}'.format(evt.result.no_match_details))

def conversation_transcriber_session_started_cb(evt: speechsdk.SessionEventArgs):
    print('SessionStarted event')

def recognize_from_file():
    # Uses the key and region loaded above from the "SPEECHSDK_API_KEY" and "SPEECHSDK_REGION" environment variables
    speech_config = speechsdk.SpeechConfig(subscription=AZURE_SPEECH_KEY, region=AZURE_SERVICE_REGION)
    speech_config.speech_recognition_language="en-SG"  # Set to Singaporean English setting

    audio_config = speechsdk.audio.AudioConfig(filename=AUDIO_FILENAME)
    conversation_transcriber = speechsdk.transcription.ConversationTranscriber(speech_config=speech_config, audio_config=audio_config)

    transcribing_stop = False

    def stop_cb(evt: speechsdk.SessionEventArgs):
        """Callback that signals to stop continuous recognition upon receiving an event `evt`."""
        print('CLOSING on {}'.format(evt))
        nonlocal transcribing_stop
        transcribing_stop = True

    # Connect callbacks to the events fired by the conversation transcriber
    conversation_transcriber.transcribed.connect(conversation_transcriber_transcribed_cb)
    conversation_transcriber.session_started.connect(conversation_transcriber_session_started_cb)
    conversation_transcriber.session_stopped.connect(conversation_transcriber_session_stopped_cb)
    conversation_transcriber.canceled.connect(conversation_transcriber_recognition_canceled_cb)
    # stop transcribing on either session stopped or canceled events
    conversation_transcriber.session_stopped.connect(stop_cb)
    conversation_transcriber.canceled.connect(stop_cb)

    conversation_transcriber.start_transcribing_async()

    # Waits for completion.
    while not transcribing_stop:
        time.sleep(.5)

    conversation_transcriber.stop_transcribing_async()

# Main

try:
    recognize_from_file()
except Exception as err:
    print("Encountered exception. {}".format(err))

SessionStarted event
TRANSCRIBED:
	Text=Ya, so, uh, we just start talking right? OK, so umm, we should we start with maybe the first one or Oh no, what do you do in JB yesterday?
	Speaker ID=Guest-1
TRANSCRIBED:
	Text=Yesterday ah wah, that's a very jam. We we awake at the count the the checkpoint more than 1 1/2 hour in Malaysia when we reach after the checkpoint already 12.
	Speaker ID=Guest-2
TRANSCRIBED:
	Text=No time to take breakfast already. Just proceed to KSL that's all. What's what is KSLKSL is a mole lah in in JB. What time did you leave house? It's about 7:00.
	Speaker ID=Guest-2
TRANSCRIBED:
	Text=About 7, about 7:00 in the morning, ya, and then you reached a Kranji MRT reach Kranji MRT about 838 thirty, ya, very packed already. That's like, MMM. Then we took a bus, OK, 'cause every Saturday when I'm there, then there's like a very long queue for the train, for the buses into Malaysia.
	Speaker ID=Guest-2
TRANSCRIBED:
	Text=You want to have a coffee? Coffee. Then daddy say

4. Evaluate transcription data.

*Note:* Most diarization ground-truth datasets provide tuples of the form `(start_time, end_time, speaker, transcript)`. However, the ground-truth labels in the NSC dataset do not capture the interaction between the two speakers well, because each speaker's speech is transcribed separately. **Manual labelling** may be required to capture **time data**, or at least to **preserve the order of interaction**, e.g. Speaker A -> Speaker B.

We work around this limitation by evaluating the transcription of each speaker separately, measuring the amount of **information lost** (missed transcriptions) and **extraneous information** (transcribed wrongly) for each speaker.
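
One way to put numbers on "information lost" and "extraneous information" is to align each speaker's reference and hypothesis word by word and count deletions and insertions. The sketch below is a plain-Python minimum-edit alignment, not the `jiwer` call used later in this notebook. The function name `edit_counts` and the toy sentences are made up for illustration.

```python
def edit_counts(ref_words, hyp_words):
    """Count hits, substitutions, deletions and insertions in a
    minimum-edit-distance word alignment (hypothetical helper)."""
    n, m = len(ref_words), len(hyp_words)
    # cost[i][j] = edit distance between ref_words[:i] and hyp_words[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = ref_words[i - 1] != hyp_words[j - 1]
            cost[i][j] = min(cost[i - 1][j - 1] + mismatch,  # hit or substitution
                             cost[i - 1][j] + 1,             # deletion
                             cost[i][j - 1] + 1)             # insertion

    # Walk back from the end of both sequences to count each operation
    counts = {"hits": 0, "substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = i > 0 and j > 0 and ref_words[i - 1] != hyp_words[j - 1]
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + mismatch:
            counts["substitutions" if mismatch else "hits"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1   # reference word missing from hypothesis
            i -= 1
        else:
            counts["insertions"] += 1  # hypothesis word absent from reference
            j -= 1
    return counts


# Toy example (not taken from the NSC data)
ref = "what time did you leave house".split()
hyp = "what time you leave the house".split()
counts = edit_counts(ref, hyp)
print(counts)  # {'hits': 5, 'substitutions': 0, 'deletions': 1, 'insertions': 1}

# Rates relative to the reference length
information_lost = counts["deletions"] / len(ref)
extraneous_information = counts["insertions"] / len(ref)
```

Substitutions are reported separately here. Whether they should count towards lost or extraneous information is a design choice this sketch leaves open.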

---

**How are `start_time` and `end_time` used in the metric? Are they important?**

The start and end times of certain segments are crucial to track how well the system **maintains speaker identities over time**. Precise `start_time` and `end_time` are necessary for **correctly aligning speaker changes and overlaps**.

For example, the **Diarization Error Rate (DiAR, more commonly abbreviated DER)** is a metric that combines several types of speaker diarization error into a single percentage. It includes substitution errors, insertion errors, and deletion errors, each measured as a duration. The formula for calculating DiAR is:

$$
\text{DiAR} = \frac{\text{Substitution Errors} + \text{Insertion Errors} + \text{Deletion Errors}}{\text{Total Duration}} \times 100
$$

- **Substitution Errors**: Time, in seconds, during which speech is assigned to the wrong speaker (often called speaker confusion).
- **Insertion Errors**: Time, in seconds, of hypothesized speech that matches no ground-truth segment (often called false alarm).
- **Deletion Errors**: Time, in seconds, of ground-truth speech that is missing from the diarization output (often called missed detection).
- **Total Duration**: The total duration, in seconds, that the errors are normalized by. The common convention is to use the total duration of ground-truth speech rather than the full audio length, so that silence does not dilute the score.
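
To make the formula concrete, here is a minimal sketch that computes the three error durations from `(start_time, end_time, speaker)` tuples. It rests on several assumptions: speakers never overlap within a transcript, the hypothesis speaker labels have already been mapped to the reference labels, and no forgiveness collar is applied around segment boundaries. The function name `diarization_error_rate` and the segments are invented for illustration. By default it divides by the total reference speech time, and `total_duration` can be passed to normalize by the audio length instead.

```python
def diarization_error_rate(reference, hypothesis, total_duration=None):
    """Percentage diarization error from lists of (start, end, speaker) tuples.

    Assumes no overlapping speech and that speaker labels are already mapped.
    """
    # Every segment boundary splits the timeline into homogeneous intervals
    bounds = sorted({t for start, end, _ in reference + hypothesis
                     for t in (start, end)})

    def speaker_at(segments, t):
        # Return the speaker active at time t, or None during silence
        for start, end, speaker in segments:
            if start <= t < end:
                return speaker
        return None

    substitution = insertion = deletion = reference_speech = 0.0
    for left, right in zip(bounds, bounds[1:]):
        duration = right - left
        midpoint = (left + right) / 2
        ref_speaker = speaker_at(reference, midpoint)
        hyp_speaker = speaker_at(hypothesis, midpoint)
        if ref_speaker is not None:
            reference_speech += duration
        if ref_speaker is not None and hyp_speaker is None:
            deletion += duration        # speech missed by the system
        elif ref_speaker is None and hyp_speaker is not None:
            insertion += duration       # speech hallucinated by the system
        elif ref_speaker is not None and hyp_speaker != ref_speaker:
            substitution += duration    # speech given to the wrong speaker

    total = reference_speech if total_duration is None else total_duration
    return 100 * (substitution + insertion + deletion) / total


# Toy segments in seconds (not taken from the NSC data)
reference = [(0, 10, "A"), (10, 20, "B")]
hypothesis = [(0, 8, "A"), (8, 18, "B"), (22, 24, "A")]
print(diarization_error_rate(reference, hypothesis))  # 30.0
```

In this example there are 2 s of substitution (8 to 10), 2 s of deletion (18 to 20) and 2 s of insertion (22 to 24) against 20 s of reference speech, giving 6 / 20 = 30%.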

---

In [17]:
# First, let's remove any markers in <> brackets
import re

cleaned_speaker1_transcript = re.sub(r'<[^>]*>', '', speaker1_transcript)
cleaned_speaker2_transcript = re.sub(r'<[^>]*>', '', speaker2_transcript)
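
The alignment output further down still shows reference tokens such as `ppb`, `ppo` and `ppl`, and truncated words such as `unc~`. These look like annotation markers rather than spoken words, so they may inflate the error counts. The sketch below shows one possible extra cleaning pass. It assumes the raw markers are written in parentheses, e.g. `(ppb)`, and that words ending in `~` are partial words that can be dropped. Both assumptions should be checked against the NSC transcription guidelines, and the helper name `strip_annotation_markers` is made up.

```python
import re

def strip_annotation_markers(text):
    """Remove assumed NSC-style annotation markers from a transcript."""
    # Drop tags in angle brackets, e.g. <S>
    text = re.sub(r"<[^>]*>", " ", text)
    # Drop parenthesised markers, e.g. (ppb), (ppo), (ppl) -- assumed raw form
    text = re.sub(r"\([^)]*\)", " ", text)
    # Drop truncated words marked with a trailing tilde, e.g. unc~
    text = re.sub(r"\S+~", " ", text)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())


# Toy string modelled on the tokens visible in the alignment output
sample = "(ppb) wait so unc~ uncle quek wanted to meet <S> (ppo) then"
print(strip_annotation_markers(sample))  # wait so uncle quek wanted to meet then
```

Replacing each marker with a space, rather than an empty string, avoids gluing neighbouring words together when a marker sits directly between them.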

In [25]:
# Then, get hypothesis text by parsing through JSON data

def group_texts_by_speaker(json_data):
    # Map each speaker_id to the list of its transcribed utterances, in order
    speaker_texts = dict()
    
    # Iterate over the entries passed in (not the global `data`)
    for entry in json_data:
        speaker_id = entry['speaker_id']
        transcription = entry['transcription']
        speaker_texts.setdefault(speaker_id, []).append(transcription)
    
    # Join each speaker's utterances with single spaces
    return {speaker_id: " ".join(texts) for speaker_id, texts in speaker_texts.items()}

# Read JSON data from a file
with open(LOG_FILEPATH, 'r') as file:
    data = json.load(file)

individual_transcriptions = group_texts_by_speaker(data)

In [30]:
# Calculate alignment between reference and hypothesis text for speaker 1 and speaker 2
import jiwer

def get_alignment(reference_text, hypothesis_text):
    # Normalize the text: lower-case, expand contractions, drop Kaldi non-words, remove punctuation
    transformation = jiwer.Compose([
        jiwer.ToLowerCase(),
        jiwer.ExpandCommonEnglishContractions(),
        jiwer.RemoveKaldiNonWords(),
        jiwer.RemovePunctuation()
    ])

    ref = transformation(reference_text)
    hyp = transformation(hypothesis_text)
    
    out = jiwer.process_words(ref, hyp)
    print(jiwer.visualize_alignment(out))

get_alignment([cleaned_speaker1_transcript, cleaned_speaker2_transcript], 
              [individual_transcriptions["Guest-1"], individual_transcriptions["Guest-2"]])

sentence 1
REF: yes so uh we just start talking right okay so  um ppb    ppb which we start with maybe the first one or ** no what did you do in jb   yesterday what  is what is ksl mmhmm what time did you leave house about seven in the morning and then you reached uh kranji mrt eight thirty k cause every saturday when i am there then there is like a very long queue for the train no for the buses into malaysia ppb in malaysia or in singapore wait so  unc~ uncle quek wanted to meet for breakfast yesterday ppo then so what time did you reach malaysia from kranji mrt ten thir~ uh huh so what bus did you take into malaysia one ppo sb~ the sbs bus hmm from kranji mrt or maybe next saturday you can go into my shelter instead of malaysia ****** ****** **** ******** * ** ***** ** ******* **       ppl what specs frame did you get **** spec~ spectacles frame spectacles frame what was the colour and what was the shade col~ no colour ppo the frame and then the  shape rectangle         ppb rectangul