# Speech Recognition in Banking Dialogues with CRDNN and Language Models

## Introduction

This project builds and fine-tunes a speech recognition system for banking dialogues, using a Convolutional, Recurrent, and Dense Neural Network (CRDNN) with Language Model (LM) support. Using the HarperValleyBank (HVB) corpus, a publicly available spoken dialog corpus annotated for banking interactions, we show how such a system can transcribe customer service dialogues in the financial sector.

### Project Objective

The main objective of this project is to enhance the accuracy and efficiency of automatic speech recognition (ASR) systems in the context of banking dialogues. By training and fine-tuning the CRDNN model with the HVB dataset and integrating it with LM support, we aim to create a robust ASR system that can accurately transcribe spoken dialogues, thereby facilitating better customer service automation and analysis in the banking industry.

### Dataset

- **HarperValleyBank (HVB) Corpus**: A publicly available spoken dialog corpus designed for the banking sector. It contains annotated dialogues covering various banking-related interactions. More details can be found in the [HVB Corpus Documentation](https://arxiv.org/pdf/2010.13929.pdf).

### Methodology

1. **CRDNN Model Tuning**: Starting with a pre-trained CRDNN model from SpeechBrain, we further tune the model using the HVB dataset to adapt it to banking dialogues.
2. **Language Model Integration**: Incorporate language models to improve the transcription accuracy by providing contextual clues that are specific to banking conversations.
3. **Evaluation**: Assess the performance of the fine-tuned CRDNN model, with and without LM support, on a test subset of the HVB corpus to evaluate its effectiveness in transcribing banking dialogues.

### Libraries and Tools Used

- **SpeechBrain**: A PyTorch-based open-source speech toolkit that provides pre-trained models and tools for ASR systems.
- **Hugging Face**: For accessing and deploying the SpeechBrain pre-trained CRDNN model.
- **Python Libraries**: `torchaudio`, `torch`, `json`, and others for data handling and model training.


# Dependencies 

In [None]:
# !gdown 1oJh0U3g_bUx6UPX4xix2UHMVHeCE_H1y
!gdown 1_OXiLOL2RBsbdCb4WyQsLudYxzJxMDJr
!unzip -q hvb.zip
!mv content/data /content/
!rm -r /content/content

!gdown 1a0EGlsLbXnGn1xwZoSqT0tcdAQ1L2nfd # train.py
!gdown 1yCmjRbxXRxfEN5LXdnE1Zpl8ZOIzdrAO # train.yaml
!gdown 1KHmdcLVFI9ontvGmi5J6vfaropGYuKcr # inference.yaml

In [None]:
!pip install speechbrain -q

import speechbrain as sb
from speechbrain.pretrained import EncoderDecoderASR
import json
import torchaudio
import torch
from torch import nn
from tqdm import tqdm
from collections import Counter
from IPython.display import Audio
from scipy.io import wavfile

device = 'cuda' if torch.cuda.is_available() else 'cpu'

### Evaluation and Fine-Tuning of a Pretrained CRDNN Model with HVB

We start from the [SpeechBrain CRDNN model pretrained on LibriSpeech](https://huggingface.co/speechbrain/asr-crdnn-rnnlm-librispeech). SpeechBrain's utility functions fetch pretrained models directly from the Hugging Face repository.

We first run inference with the pretrained CRDNN model on the first 200 examples from `test_manifest.json`, which is part of the `hvb.zip` archive downloaded earlier. We then fine-tune the model on the HVB training set and re-evaluate it on the test examples to measure how fine-tuning changes its performance.

In [None]:
crdnn = EncoderDecoderASR.from_hparams(
    source='speechbrain/asr-crdnn-rnnlm-librispeech',
    savedir='asr-crdnn-rnnlm-librispeech',
    run_opts={'device': device}  # reuse the device selected above instead of hard-coding 'cuda'
)

We use JSON manifests prepared for SpeechBrain. Each manifest maps an utterance ID to its audio path, its duration in seconds, and its reference transcript:

```
{
    "15748": {
        "wav": "/content/data/segments/15748.wav",
        "length": 1.86,
        "words": "WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT"
    },
    ...
}
```
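
As a quick sanity check on this format, a manifest can be summarized before any audio is loaded. The sketch below is illustrative only: `summarize_manifest` is a hypothetical helper, not part of SpeechBrain, and it assumes every entry carries the `length` and `words` fields shown above. The second entry in the toy manifest is made up.

```python
def summarize_manifest(manifest):
    """Return basic statistics for a SpeechBrain-style manifest dict."""
    durations = [entry["length"] for entry in manifest.values()]
    num_words = sum(len(entry["words"].split()) for entry in manifest.values())
    return {
        "utterances": len(manifest),
        "total_seconds": round(sum(durations), 2),
        "longest_seconds": max(durations) if durations else 0.0,
        "words": num_words,
    }


# Tiny in-memory manifest mirroring the structure above (second entry is invented).
example_manifest = {
    "15748": {
        "wav": "/content/data/segments/15748.wav",
        "length": 1.86,
        "words": "WHAT DAY WOULD YOU LIKE FOR YOUR APPOINTMENT",
    },
    "00001": {
        "wav": "/content/data/segments/00001.wav",
        "length": 0.5,
        "words": "HELLO",
    },
}
stats = summarize_manifest(example_manifest)
print(stats)
```

On a real manifest, the same call would be `summarize_manifest(json.load(f))` after opening the manifest file.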

We first load the test manifest, then define a generator that groups its entries into padded batches that our `EncoderDecoderASR` object can consume directly.

In [None]:
TEST_SIZE = 200  # evaluate on a subset for faster processing

with open('data/test_manifest.json', 'r') as f:
    test_manifest = json.load(f)
# Keep only the first TEST_SIZE entries (dicts preserve insertion order).
test_manifest = dict(list(test_manifest.items())[:TEST_SIZE])

def batchify(manifest, batch_size):
    """Yield (keys, padded waveforms, relative lengths) for each batch."""
    keys = list(manifest.keys())
    wav_paths = [entry['wav'] for entry in manifest.values()]
    num_examples = len(manifest)
    for i in range(0, num_examples, batch_size):
        batch_keys = keys[i:i + batch_size]
        # Load each waveform and zero-pad to the longest one in the batch.
        batch_wavs = nn.utils.rnn.pad_sequence([
            torchaudio.load(path)[0].squeeze(0)
            for path in wav_paths[i:i + batch_size]
        ], batch_first=True)
        # SpeechBrain expects lengths relative to the longest item in the batch (max = 1.0).
        batch_wav_lens = torch.tensor([
            manifest[key]['length'] for key in batch_keys
        ])
        batch_wav_lens = batch_wav_lens / batch_wav_lens.max()
        yield batch_keys, batch_wavs, batch_wav_lens

The next step is to use the pre-trained ASR model to transcribe the test examples from the HVB corpus:

In [None]:
true_dict = {key: test_manifest[key]['words'] for key in test_manifest}

def inference(model, test_manifest, batch_size=8):
    """Transcribe every utterance in the manifest and return {key: transcription}."""
    torch.cuda.empty_cache()
    pred_dict = {}
    # Ceiling division gives the exact number of batches for the progress bar.
    num_batches = (len(test_manifest) + batch_size - 1) // batch_size
    for keys, wavs, wav_lens in tqdm(batchify(test_manifest, batch_size), total=num_batches):
        transcriptions, _ = model.transcribe_batch(wavs.to(device), wav_lens.to(device))
        for key, transcription in zip(keys, transcriptions):
            pred_dict[key] = transcription
    return pred_dict

pred_dict = inference(crdnn, test_manifest)
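
The progress bar in `inference()` needs the number of batches that `batchify` yields, which is the ceiling of the example count divided by the batch size. A minimal sketch of that arithmetic follows; `num_batches` is a name introduced here for illustration only.

```python
import math


def num_batches(num_examples, batch_size):
    """Number of batches needed to cover num_examples items."""
    return math.ceil(num_examples / batch_size)


# 200 utterances in batches of 8 fill exactly 25 batches;
# 201 utterances need one extra, partially filled batch.
print(num_batches(200, 8), num_batches(201, 8))
```

Rounding `n / batch_size + 0.5` is not a safe substitute, because Python rounds halves to the nearest even integer.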

### Determining Word Error Rate for Pretrained Model Predictions

The code below calculates the word error rate (WER) for the predictions made by the pretrained model on the first 200 test instances in `test_manifest.json`. It processes the transcripts by splitting them into word lists to evaluate the WER.

No new implementation is needed; simply execute the provided code to compute and review the WER for the generated results.

In [None]:
# this data structure stores WER information we use later. 
details_by_utterance = sb.utils.edit_distance.wer_details_by_utterance(
    {k: v.split() for k, v in true_dict.items()},
    {k: v.split() for k, v in pred_dict.items()},
)

In [None]:
# word error rate (WER) summary using data structure we just created
sb.utils.edit_distance.wer_summary(details_by_utterance)
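
For intuition about what the summary reports, WER is the number of word substitutions, deletions, and insertions divided by the number of reference words. The sketch below recomputes it in plain Python on made-up transcripts. It is not SpeechBrain's implementation, and `word_errors` and `corpus_wer` are hypothetical names.

```python
def word_errors(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = sum(word_errors(references[k].split(), hypotheses[k].split())
                 for k in references)
    ref_len = sum(len(references[k].split()) for k in references)
    return 100.0 * errors / ref_len


# Toy transcripts, for illustration only.
refs = {"a": "WHAT DAY WOULD YOU LIKE", "b": "THANK YOU"}
hyps = {"a": "WHAT DAY WOULD LIKE", "b": "THANK YOU"}
print(f"{corpus_wer(refs, hyps):.2f}")  # one deletion over seven reference words
```

Note that WER is computed over the whole corpus, so long utterances weigh more than short ones, and it can exceed 100% when the hypothesis contains many insertions.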

Expect a high WER for the pretrained system on HVB data, around 72% or more. This reflects a mismatch between the LibriSpeech training data and HVB's call-center speech, and it persists even though the audio has been resampled to 16 kHz to match the pretrained model's training inputs.

ASR systems often reveal specific error patterns, which can help identify performance challenges. To understand where the pretrained system struggles with HVB data, start by examining some utterances with the highest misrecognition rates.

In [None]:
def summarize(detail_dict, true_dict, pred_dict):
    print(f"{detail_dict['key']}: {detail_dict['WER']}")
    print(f"\tTrue: {true_dict[detail_dict['key']]}")
    print(f"\tPred: {pred_dict[detail_dict['key']]}")

for wer_dict in sb.utils.edit_distance.top_wer_utts(details_by_utterance, 10)[0]:
    summarize(wer_dict, true_dict, pred_dict)

It seems that our predictions keep outputting the same word over and over. Let's see why.

In this section of our project tutorial, we will analyze examples where the model's predictions deviate from expected outcomes. This analysis is crucial for understanding the data's nuances that may negatively impact the model's performance. Common errors to look for include:

- Words consistently misidentified by the model across various audio samples.
- Accurate capture of some words, but overall transcripts that do not align well with the original audio content.

To perform a thorough analysis, identify at least three audio files where the model's predictions are notably inaccurate. For each file, document the specific error types observed, detailing the discrepancies between the model's output and the actual audio content.
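
One way to narrow down candidates for this analysis is to flag predictions that repeat a single word many times in a row. The helper names below (`longest_repeat_run`, `flag_looping_predictions`) and the threshold of three are assumptions made for this sketch, and the predictions are invented.

```python
from collections import Counter


def longest_repeat_run(words):
    """Length of the longest run of one word repeated back to back."""
    longest = current = 0
    previous = None
    for word in words:
        current = current + 1 if word == previous else 1
        longest = max(longest, current)
        previous = word
    return longest


def flag_looping_predictions(predictions, min_run=3):
    """Return {utterance key: run length} for predictions with a long repeat."""
    flagged = {}
    for key, text in predictions.items():
        run = longest_repeat_run(text.split())
        if run >= min_run:
            flagged[key] = run
    return flagged


# Made-up predictions, for illustration only.
toy_preds = {"u1": "THE THE THE THE", "u2": "I NEED TO CHECK MY BALANCE"}
print(flag_looping_predictions(toy_preds))
# The most frequent predicted word across all utterances is also worth a look.
print(Counter(" ".join(toy_preds.values()).split()).most_common(1))
```

In the notebook, the same call would be `flag_looping_predictions(pred_dict)`, and the flagged keys can then be passed to `summarize` and played back with `Audio`.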

In [None]:
example = details_by_utterance[1]
summarize(example, true_dict, pred_dict)
Audio(test_manifest[example['key']]['wav'])

## Finetuning CRDNN Model

The pretrained model was trained on LibriSpeech, which differs substantially from HVB's call-center speech. By fine-tuning it on the HVB training data, we aim to measure how much its performance improves. Only minimal changes are needed, mainly to launch training. Experiment with training and decoding parameters to see their effects. Fine-tuning on a new corpus with SpeechBrain is useful hands-on practice for ASR model development in industry projects.

All necessary components, including the training script, experiment YAML, and inference YAML, are provided. Reviewing these components and understanding the setup, especially the structure of a well-organized ML experiment YAML file, can be enlightening. Training this model for 2 epochs on Colab GPUs is expected to take about 1.5 hours.

During training, the model will save checkpoints for later inference/testing. This setup allows you to benefit from a fine-tuned model version without completing the entire training duration. You can submit your work with the model partially trained, as long as it has undergone some fine-tuning, even if it doesn't reach the full 2 epochs.

In [None]:
# this downloads the training and config files for our fine tuning setup
!gdown 1v_3Kl8OrUd6_1_D0ZGoYVFEuOKhZ7YMo # train.py
!gdown 17cQIpx5kLLMCD23EDaE0EYg2E9LPqMCF # train.yaml
!gdown 1CWYOD2PC97gXguW4krc9122HKAraHkYS # inference.yaml

**Finetuning with HVB Data**

The code below implements the finetuning process for the neural net ASR system using the HVB data. 

- `train.yaml`: This YAML config file specifies the network/ASR system architecture and training parameters such as loss functions, datasets, and more. While you can't modify the network architecture, you can adjust hyperparameters like loss function weights, learning rate, and training time.
- `train.py`: This file contains the main training loop for fitting the acoustic model. No modifications are needed here.

Edit `train.yaml` with your chosen hyperparameters, then run the training loop. Record the train loss, valid loss, valid CER, and valid WER from the output in `README.txt`, including the epoch number. (Using the default `train.yaml`, we achieved training and validation loss < 1.75 after epoch 1.)


In [None]:
torch.cuda.empty_cache()

!python train.py train.yaml --batch_size=4
# OOM on batch_size=5

## Evaluate your finetuned model

Inference uses a separate YAML configuration file, `inference.yaml`, so that the fine-tuned checkpoint can be loaded through the `EncoderDecoderASR` class.

**Note**: Set your checkpoint path correctly in multiple locations to avoid issues, especially before any debugging attempts (e.g., downloading from HuggingFace).

To enable inference, follow these steps:

1. Determine the directory where your checkpoints are saved (usually under `./results/CRDNN_BPE_960h_LM/2602/save/{your checkpoint directory here}`).
2. Paste this directory into the `ckptdir` entry in the `inference.yaml` file.
3. Paste this directory after `ckpt_path = ` in the cell below.

Both locations must agree: `ckpt_path` in the cell below and `ckptdir` in `inference.yaml` have to point to the same checkpoint directory. Update `inference.yaml` before running the cell, because the cell copies the file into the checkpoint directory and that copy is the one used for inference.
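
To avoid copying the directory name by hand, the newest checkpoint can be located programmatically. This is a sketch under one assumption: checkpoint directories are named `CKPT+<timestamp>` as in the example path in this notebook, so sorting the names also sorts them by time. `latest_checkpoint` is a hypothetical helper, and the demo uses a temporary directory with one invented name.

```python
import tempfile
from pathlib import Path


def latest_checkpoint(save_dir):
    """Return the newest CKPT+* directory under save_dir, or None if none exist."""
    candidates = sorted(p for p in Path(save_dir).glob("CKPT+*") if p.is_dir())
    return candidates[-1] if candidates else None


# Demo on a throwaway directory; the first name is invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("CKPT+2023-04-08+05-01-10+00", "CKPT+2023-04-08+06-09-55+00"):
        (Path(tmp) / name).mkdir()
    print(latest_checkpoint(tmp).name)
```

The newest checkpoint is not necessarily the best one on the validation set, so check the training log before relying on it.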

In [None]:
ckpt_path = "/content/results/CRDNN_BPE_960h_LM/2602/save/CKPT+2023-04-08+06-09-55+00"
!cp inference.yaml {ckpt_path}

To evaluate the finetuned model on the first 50 test sentences in `test_manifest.json`, follow these steps:

1. Ensure the checkpoint paths are correctly set in the copy of `inference.yaml` used for running inference.
2. Set up a model object.
3. Call `inference()` to generate predictions for the test subset.
4. Populate `pred_dict` with the generated inferences.
5. Compute the Word Error Rate (WER) for both the finetuned model and the pretrained model on the test subset of size 50.
6. Write down the WER for both models in `README.txt`.
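
When recording both numbers, the relative WER reduction is a compact way to express the gain from fine-tuning. The sketch below uses placeholder values, not measured results, and `relative_wer_reduction` is a name introduced here for illustration.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent; positive means the new model is better."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Placeholder numbers for illustration only, not measured results.
pretrained_wer, finetuned_wer = 80.0, 20.0
reduction = relative_wer_reduction(pretrained_wer, finetuned_wer)
print(f"{reduction:.1f}% relative reduction")
```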

**Note**: Running inference on approximately 50 utterances might require around 15 minutes of computation on a Colab CPU. The code below uses CPU inference due to issues with checkpoint-loaded DNNs on SpeechBrain's GPU inference, but you can try it yourself.

Replace `./results/finetuned_model/checkpoints` and `./results/pretrained_model/checkpoints` with the correct paths to the checkpoint directories for the finetuned and pretrained models respectively. Additionally, ensure you have a function `load_test_data` to load data from `test_manifest.json`. Finally, compute the WERs for both models and write them to `README.txt`.


In [None]:
device = 'cpu'
our_model = EncoderDecoderASR.from_hparams(
    source=ckpt_path, 
    hparams_file='inference.yaml', 
    savedir="our_ckpt",
    run_opts={'device': device}
)

In [None]:
### Populate test_manifest with the first 50 test sentences and then call inference() below###
EVAL_SIZE = 50
test_manifest = {
    k: v for k, v in list(test_manifest.items())[:EVAL_SIZE]
}
true_dict = {key: test_manifest[key]['words'] for key in test_manifest}

pred_dict = inference(our_model.to(device), test_manifest)
# this data structure stores WER information we use later. 
details_by_utterance = sb.utils.edit_distance.wer_details_by_utterance(
    {k: v.split() for k, v in true_dict.items()},
    {k: v.split() for k, v in pred_dict.items()},
)
# word error rate (WER) summary using data structure we just created
sb.utils.edit_distance.wer_summary(details_by_utterance)

In [None]:
pred_dict = inference(crdnn, test_manifest)
# this data structure stores WER information we use later. 
details_by_utterance = sb.utils.edit_distance.wer_details_by_utterance(
    {k: v.split() for k, v in true_dict.items()},
    {k: v.split() for k, v in pred_dict.items()},
)
# word error rate (WER) summary using data structure we just created
sb.utils.edit_distance.wer_summary(details_by_utterance)

## Training a sentiment detection model


In [None]:
# !gdown 1-s2e8dZYSjhVgfo_TL0V_89RZVhGnZ1Y #transcript.zip
!gdown 1oCn4PoJO-9XMEh-RtuatRgtKKL6ZyXb6
!unzip -q transcript.zip

!gdown 1ChdI1XyhmGq9z8Y8M38yXPMob6oPRqPO  #train.txt
!gdown 10w15DnUbJcQRBSZWP03qjM6Oq8l7WSVQ  #dev.txt

# !gdown 1eimo-BFXZz6Z3FeZK8uC-wT84ji-JVos #hvb-audio.zip
!gdown 1xMyXiFpQo3reF5sWFi6MahUQviW4Z1RA
!unzip -q hvb-audio.zip

device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
with open('transcript/370981f1f0254ebc.json', 'r') as f:
    print(f.read())

For sentiment classification on HVB utterances, follow these steps:

1. Create train and dev splits from `train.txt` and `dev.txt`. Each line in these files represents a conversation-ID with sequential utterances. Use the provided text files to create train and dev sets, and extract the relevant "emotion" metadata from the corresponding transcript JSON files.

2. Extract audio segments for the utterances using information from the transcript JSON files. The audio files are named after the conversation-ID within `/content/audio/agent` and `/content/audio/caller`. Utilize fields such as `channel_index`, `index`, `start_ms`, and `duration_ms` from the JSON files to extract the audio segments.

3. Load the pretrained CRDNN Librispeech model and extract the encoder.

4. Add a linear layer to map the audio signal encoding to sentiment predictions. This layer will be randomly initialized.

5. Train both the new linear layer and all the pretrained encoder layers using cross-entropy loss with the reference emotion labels derived from the JSON files.

6. Evaluate the trained sentiment detection model on utterances listed in `dev.txt` and compute the overall accuracy for sentiment prediction.

7. Write down the accuracy obtained in `README.txt`.
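
Steps 1 and 2 rely on two small conversions, sketched below in plain Python. The line format `<conversation-ID>:<i1>,<i2>,...` is inferred from how this notebook parses `train.txt`. The helper names are hypothetical, the indices in the demo line are invented, and the 8000 Hz sample rate is only an example value.

```python
def parse_split_line(line):
    """Split '<conversation-ID>:<i1>,<i2>,...' into (conversation_id, [indices])."""
    conv_id, index_str = line.strip().split(":", 1)
    return conv_id, [int(i) for i in index_str.split(",")]


def segment_sample_range(start_ms, duration_ms, sample_rate):
    """Convert a millisecond span into [start, end) sample offsets."""
    start = start_ms * sample_rate // 1000
    end = (start_ms + duration_ms) * sample_rate // 1000
    return start, end


conv_id, indices = parse_split_line("370981f1f0254ebc:1,4,7\n")
print(conv_id, indices)
print(segment_sample_range(start_ms=1500, duration_ms=1860, sample_rate=8000))
```

Libraries that slice audio directly in milliseconds, such as pydub, make the second helper unnecessary. It matters when slicing a waveform tensor by sample index.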

In [None]:
!pip install pydub

In [None]:
import json
from pydub import AudioSegment

def build_split(list_file, suffix=''):
    """Build the utterance records for one split and export each utterance as a WAV file.

    Each line of `list_file` has the form `<conversation-ID>:<i1>,<i2>,...`,
    where the indices are 1-based positions in that conversation's transcript.
    """
    data = []
    with open(list_file, 'r') as f:
        for line in f:
            parts = line.split(':')
            conv_id = parts[0]
            indices = list(map(int, parts[1].split(',')))
            with open(f'transcript/{conv_id}.json', 'r') as f1:
                utts = json.load(f1)
            for i in indices:
                utt = utts[i - 1]
                record = {
                    'words': utt['transcript'],
                    'emotion': list(utt['emotion'].values()),
                    'channel_index': utt['channel_index'],
                    'start_ms': utt['start_ms'],
                    'duration_ms': utt['duration_ms'],
                    'length': utt['duration_ms'] / 1000,  # seconds
                }
                # Caller and agent audio are stored in separate per-conversation files.
                path = 'audio/caller/' if record['channel_index'] == 1 else 'audio/agent/'
                original_wav = AudioSegment.from_wav(f'{path}{conv_id}.wav')
                # pydub slices audio by milliseconds.
                start = record['start_ms']
                end = start + record['duration_ms']
                record['wav'] = f'{path}{conv_id}_{i}{suffix}.wav'
                original_wav[start:end].export(record['wav'], format='wav')
                data.append(record)
    return data

train_data = build_split('train.txt')
dev_data = build_split('dev.txt', suffix='_dev')

In [None]:
crdnn = EncoderDecoderASR.from_hparams(
    source='speechbrain/asr-crdnn-rnnlm-librispeech',
    savedir='asr-crdnn-rnnlm-librispeech',
    run_opts={'device': device}
)

max_length = 0

def getMaxLength(manifest):
    """Store the longest waveform length (in samples) in the global max_length."""
    global max_length
    for entry in manifest.values():
        wav = torchaudio.load(entry['wav'])[0].squeeze(0)
        max_length = max(max_length, wav.shape[0])

train_manifest = {key: train_data[key] for key in range(len(train_data))}

getMaxLength(train_manifest)

In [None]:
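import torch
from torch import nn

class Network(nn.Module):
    """Pretrained CRDNN encoder followed by a linear 3-class sentiment head."""

    def __init__(self, crdnn):
        super().__init__()
        self.enc = crdnn.mods.encoder
        # 233 encoder frames x 512 features per frame. The frame count is tied to
        # waveforms padded to max_length, so update it if max_length changes.
        self.lin = nn.Linear(233 * 512, 3)

    def forward(self, wav, wav_len):
        # Flatten each utterance's (frames, features) encoding into one vector.
        x = self.enc(wav, wav_len).reshape(wav.shape[0], -1)
        return self.lin(x)

model = Network(crdnn).to(device)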

In [None]:
# training
from tqdm.notebook import tqdm

def mybatchify(manifest, batch_size):
    """Like batchify, but pads every batch to max_length and also yields emotion scores."""
    keys = list(manifest.keys())
    wav_paths = [entry['wav'] for entry in manifest.values()]
    num_examples = len(manifest)
    for i in range(0, num_examples, batch_size):
        batch_keys = keys[i:i + batch_size]
        batch_wavs = nn.utils.rnn.pad_sequence([
            torchaudio.load(path)[0].squeeze(0)
            for path in wav_paths[i:i + batch_size]
        ], batch_first=True)
        # Truncate or zero-pad to exactly max_length samples so the encoder output
        # always matches the fixed input size of the linear layer.
        batch_wavs = batch_wavs[:, :max_length]
        batch_wavs = nn.functional.pad(batch_wavs, (0, max_length - batch_wavs.size(1)))
        # Note: lengths are relative to the longest utterance in the batch, not to max_length.
        batch_wav_lens = torch.tensor([
            manifest[key]['length'] for key in batch_keys
        ])
        batch_wav_lens = batch_wav_lens / batch_wav_lens.max()
        emotions = torch.tensor([
            manifest[key]['emotion'] for key in batch_keys
        ])
        yield batch_keys, batch_wavs, batch_wav_lens, emotions

def train(model, train_manifest, batch_size=8):
    torch.cuda.empty_cache()
    model.train()
    optim = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=0.01)
    loss_fn = torch.nn.functional.cross_entropy
    num_batches = (len(train_manifest) + batch_size - 1) // batch_size
    total_loss = 0.0
    for keys, wavs, wav_lens, emotions in tqdm(mybatchify(train_manifest, batch_size), total=num_batches):
        optim.zero_grad()
        preds = model(wavs.to(device), wav_lens.to(device))
        # The reference class is the emotion with the highest annotated score.
        targets = emotions.to(device).argmax(dim=1)
        loss = loss_fn(preds, targets)
        total_loss += loss.item()
        loss.backward()
        optim.step()
    print(f"mean train loss: {total_loss / max(num_batches, 1):.4f}")

train(model, train_manifest, 1)
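
As a sanity check on the loss values printed during training, the cross-entropy for a single utterance can be worked out by hand. The sketch below is plain Python rather than the PyTorch call used above, and `softmax_cross_entropy` is a name introduced here for illustration.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a single class index."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


# A 3-class head that scores all classes equally has loss ln(3), about 1.0986.
print(softmax_cross_entropy([0.0, 0.0, 0.0], target_index=0))
# A confident, correct prediction drives the loss toward zero.
print(softmax_cross_entropy([5.0, 0.0, 0.0], target_index=0))
```

A training loss that stays near ln(3) therefore suggests the classifier has not yet learned anything beyond a uniform guess.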

In [None]:
# Evaluate sentiment accuracy on the utterances listed in dev.txt.
def evaluate(model, test_manifest, batch_size=8):
    torch.cuda.empty_cache()
    model.eval()
    score = 0
    num_batches = (len(test_manifest) + batch_size - 1) // batch_size
    with torch.no_grad():  # no gradients are needed for evaluation
        for keys, wavs, wav_lens, emotions in tqdm(mybatchify(test_manifest, batch_size), total=num_batches):
            preds = model(wavs.to(device), wav_lens.to(device))
            emotions = emotions.to(device)
            # A prediction is correct when its arg-max class matches the reference.
            score += (preds.argmax(dim=1) == emotions.argmax(dim=1)).sum().item()
    accuracy = score / len(test_manifest)
    print(accuracy)
    return accuracy

test_manifest = {key: dev_data[key] for key in range(len(dev_data))}

evaluate(model, test_manifest, 1)
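
To put the reported accuracy in context, it helps to compare it with a majority-class baseline, that is, the accuracy of always predicting the most frequent reference label. The sketch below is an assumption-labeled illustration: `majority_baseline` is a hypothetical helper and the score vectors are made up, laid out like the `emotion` lists built earlier.

```python
from collections import Counter


def majority_baseline(emotion_vectors):
    """Return (most frequent arg-max class, accuracy of always predicting it)."""
    labels = [max(range(len(vec)), key=vec.__getitem__) for vec in emotion_vectors]
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


# Made-up score vectors, for illustration only.
toy = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.5, 0.25, 0.25]]
print(majority_baseline(toy))
```

In the notebook, the equivalent call would be `majority_baseline([d['emotion'] for d in dev_data])`. A model that does not beat this number has learned little beyond the label distribution.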

# Using Whisper's pretrained model to evaluate HVB

In [None]:
!pip install openai-whisper

In [None]:
import whisper

TEST_SIZE = 200

with open('data/test_manifest.json', 'r') as f:
    test_manifest = json.load(f)
test_manifest = {
    k: v for k, v in list(test_manifest.items())[:TEST_SIZE]
}
true_dict = {key: test_manifest[key]['words'] for key in test_manifest}

whisp = whisper.load_model('small', device=device)  # falls back to CPU when no GPU is available

pred_dict = {}
for key in test_manifest:
    audio = test_manifest[key]['wav']
    result = whisp.transcribe(audio)
    pred_dict[key] = result['text'].upper()

# this data structure stores WER information we use later. 
details_by_utterance = sb.utils.edit_distance.wer_details_by_utterance(
    {k: v.split() for k, v in true_dict.items()},
    {k: v.split() for k, v in pred_dict.items()},
)
# word error rate (WER) summary using data structure we just created
sb.utils.edit_distance.wer_summary(details_by_utterance)
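
Whisper returns cased, punctuated text, while the HVB references in the manifest example are upper-case words without punctuation. Upper-casing alone leaves tokens such as `APPOINTMENT?` that count as errors. The sketch below shows one possible normalization. It assumes references contain only letters, digits, and apostrophes, which may not match every HVB convention, and `normalize_for_wer` is a hypothetical helper.

```python
import re


def normalize_for_wer(text):
    """Upper-case, drop punctuation except apostrophes, and collapse whitespace."""
    text = text.upper()
    text = re.sub(r"[^A-Z0-9' ]+", " ", text)
    return " ".join(text.split())


print(normalize_for_wer(" What day would you like, for your appointment?"))
```

To try it, replace `result['text'].upper()` with `normalize_for_wer(result['text'])` in the loop above. Differences such as digits versus spelled-out numbers are not handled here.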