# What's all this about?

This notebook shows reference and inference (hypothesis) transcriptions side by side, so that the texts can be compared and rated qualitatively without having to read through the raw files by hand. The comparison layout uses existing software to align the texts and highlight the differences between them in colour. The interface also includes buttons for adding a qualitative rating to each inference text. The rating labels can be changed, so you can set your own scale, such as "Wow, meh, horrid".

There are a lot of alignment algorithms around, and many papers about using these algorithms, but very few implementations in ways that are easy to incorporate into existing tools. This notebook uses [TextErrors](https://github.com/RuABraun/texterrors), which shows nice summary info about insertions, deletions and substitutions, including how many times each word error occurs. It is designed for command-line usage, but can be adjusted to work in a Colab notebook. Maybe in the future it could be adapted and used in the Elpis GUI.

## How?

* Get some files to work with
* Install the required packages
* Prepare the text data
* Do the alignment 
* Build an interface to compare the texts
* Compare and rate the text


# Configuration

Run the entire notebook for a demonstration of what it does. It will download some data, process the text and display the evaluation interface. When you are ready to use your own data, edit the values in these cells to change the rating labels and to point to a different source folder on Google Drive.

Specify the rating labels you want to use by changing the values in the list.

In [1]:
rating_options=['Useless', 'Glance', 'Refer', 'Edit', 'Wow']

longest_word_sequence_threshold = 3



Install the gdown tool, which downloads files from Google Drive by ID.

In [2]:
%%capture
!pip install gdown
import gdown

Uncomment the following lines and run this cell to delete previously downloaded files in the output folder.

In [3]:
# import shutil
# shutil.rmtree(f"/content/output", ignore_errors=True)

Change the ID value here to suit your data. Get it from the Share link of your folder.

In [None]:
# id = "1LyBG54tGKuPnRcEDpCvjraXauc7A7U0l" # cer-test
# id = "1pKpnIrk0603pe8rOr1bshaEMh3jR67CO" # eaf-ref-inf/output 
# id = "1SZEDUlZbmDH-uL5_kitdh4J54PwQwqNe" # first (millions-sm)
id = "1p2XvJ5a1yPXlf3WNkcTlMMdlav4F4ZxJ" # second (pronoun and transaction)
# id = "12eQjkZGNtOIS2bHasAvatpUIZUP-RGHZ" # third (inf v inf)



# Change the suffixes here to reflect how your files are named
# eg, if your files are named with _ref and _inf suffixes, then you would change these...
ref_suffix = "_0"
inf_suffix = "_1"

# Probably don't need to change this, but can be used to have separate experiment dirs
file_dir = "output"

# This does the download
gdown.download_folder(id=id, output=file_dir, quiet=False)

## Colab setup

These cells install the required packages, set up the notebook and define the functions that do the data prep. You shouldn't need to change any of this code.

Set an env var before installation to enable colour output in Colab. See the [issue report here](https://github.com/ipython/ipykernel/issues/1024).

In [5]:
%%capture
%env FORCE_COLOR=1

### Set some styles for better Colab layout.

In [6]:
from IPython.display import display, HTML

def set_css_in_cell_output():
    display(HTML("""<style>
        .lm-Widget > .lm-Widget.lm-Panel {
            max-height: none !important;
        }
        .widget-toggle-buttons .widget-label {
            display: none !important;
        }
        .file_panel {
            display: flex;
            flex-direction: row;
            justify-content: space-between;
            margin-top: 1em;
        }
        .ref_text {
            border-right: 1px solid pink;
            margin-right: 1em;
            padding-right: 1em;
        }
        .inf_text {
            margin-left: 1em;
        }
        .aligned_lines {
            border-right: 2px solid pink;
            padding-right: 1em;
        }
        .matched_lines {
            margin: 0 2em;
            max-width: none;
            line-height: 1.5;
        }
        .matched_lines hr {
            border: 0;
            border-top: 1px dotted black;
        }
        .widget-html-content {
            line-height: 1.6;
        }
        </style>
"""))

get_ipython().events.register('pre_run_cell', set_css_in_cell_output)

### Install the required packages

In [7]:
%%capture

!pip install texterrors
!pip install loguru
!pip install termcolor==2.1.0

# This library is used to convert the console-colour codes into HTML so we can see them in the notebook
!pip install ansi2html

# Use an accordion widget to show the content, limiting information overload if multiple ref/inf files.
!pip install "ipywidgets>=7,<8"

# Use BeautifulSoup to parse the coded HTML, to strip the coloured tags and leave remainder text (effectively the longest common word sequences)
!pip install bs4

# Jiwer for simple WER and CER
!pip install jiwer

Import libraries

In [8]:
import io
import ipywidgets as widgets
import pandas as pd
import texterrors
import difflib
import os
import jiwer
from ansi2html import Ansi2HTMLConverter
from dataclasses import dataclass
from pathlib import Path
from natsort import natsorted
from typing import List, Match


Optionally, remove the default logger handler to suppress log output. This seems to be needed because of some internal logging in the TextErrors library.

In [9]:
from loguru import logger
logger.remove()

## Processing functions

These functions will be used to prepare the data, and then do the alignment and error calculations.



This function groups the files we are working with, so we can easily compare them and make accordions. `build_file_groups` returns a dictionary with the file base names as keys, and the file paths and text content as values.

In [10]:
def get_file_text(file_path):
    with open(file_path, 'r') as text_file:
        return text_file.read()
        

def build_file_groups(file_dir):
    txt_file_list = natsorted(list(Path(f"/content/{file_dir}/").glob('*.txt')), key=str)

    #  Make a data structure that can hold the file paths, text and results
    file_groups = {}  

    for txt_file in txt_file_list:
        # Work out which suffix the file name ends with, so that ref and inf
        # suffixes of different lengths are both handled
        if txt_file.stem.endswith(ref_suffix):
            role, suffix = "ref", ref_suffix
        elif txt_file.stem.endswith(inf_suffix):
            role, suffix = "inf", inf_suffix
        else:
            # Skip files that have neither suffix
            continue

        basename = txt_file.stem[:-len(suffix)]
        group = file_groups.get(basename, {"rating": None})
        group[role] = txt_file
        group[f"{role}_text"] = get_file_text(txt_file)
        file_groups[basename] = group
    return file_groups
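
As a rough illustration of the grouping described above, the following sketch builds the same kind of dictionary from two temporary files. The helper name `group_by_suffix` and the file names `story_0.txt` and `story_1.txt` are made up for this example, and plain `sorted` stands in for `natsorted` to keep it self-contained.

```python
import tempfile
from pathlib import Path


def group_by_suffix(folder, ref_suffix="_0", inf_suffix="_1"):
    """Hypothetical helper: pair ref/inf text files by their shared base name."""
    groups = {}
    for txt_file in sorted(Path(folder).glob("*.txt")):
        for role, suffix in (("ref", ref_suffix), ("inf", inf_suffix)):
            if txt_file.stem.endswith(suffix):
                basename = txt_file.stem[: -len(suffix)]
                group = groups.setdefault(basename, {"rating": None})
                group[role] = txt_file
                group[f"{role}_text"] = txt_file.read_text()
                break
    return groups


# Made-up demo files, written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "story_0.txt").write_text("the cat sat")
    Path(tmp, "story_1.txt").write_text("the cat sit")
    demo_groups = group_by_suffix(tmp)

# One group per base name, holding the paths, the text and an empty rating
print(sorted(demo_groups["story"]))
# ['inf', 'inf_text', 'rating', 'ref', 'ref_text']
```

Each group starts with `rating` set to `None`, which is the value the rating buttons later overwrite.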

This code formats the text from the ref/inf files into the `Utt` data type that TextErrors uses. Ideally this would be done by calling a function from the library, but the relevant functions are not exposed for import as a Python lib. Perhaps in the future this library could be forked and the functions could be exported.


In [11]:
def read_file(utterance_file):
    @dataclass
    class Utt:
        uid: str
        words: list
        times: list = None
        durs: list = None
        
    utts = {}
    # Handle empty file
    if os.path.getsize(utterance_file) == 0:
        # Use the same string key and list type as the non-empty case
        utts["0"] = Utt("0", [])
    else: 
        with open(utterance_file) as fh:
            for i, line in enumerate(fh):
                words = line.split()
                i = str(i)
                utts[i] = Utt(i, words)
    return utts
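
To make the resulting structure concrete, here is a small sketch that applies the same per-line splitting to an in-memory string instead of a file. The helper `lines_to_utts` is hypothetical and only mirrors the logic of `read_file`; it is not part of TextErrors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utt:
    uid: str
    words: list
    times: Optional[list] = None
    durs: Optional[list] = None


def lines_to_utts(text):
    """Hypothetical helper: one Utt per line, keyed by the line index as a string."""
    return {
        str(i): Utt(str(i), line.split())
        for i, line in enumerate(text.splitlines())
    }


demo_utts = lines_to_utts("the cat sat\non the mat\n")
print(demo_utts["1"].words)  # ['on', 'the', 'mat']
```

Because the keys are line indices, a ref file and an inf file are matched up line by line, so both files need the same number of lines in the same order.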

This function handles clicks on the rating buttons. When a button is clicked, the rating value for the respective text group is updated, using the button label. Each time a rating is made a CSV is saved as `results.csv`. The CSV contains info about the files, the error metrics that TextErrors calculated, and the rating.

In [12]:
output = widgets.Output()

def show_data_frame(file_groups, longest_word_sequence_threshold):
    with output:
        df = pd.DataFrame.from_dict(file_groups)
        df = df.reindex(index = ["rating", "sequence_count", "wer", "cer"])
        df = df.transpose()
        # Show it in the output (which will be displayed with the accordion)
        output.clear_output()
        df.columns = ["rating", f"{longest_word_sequence_threshold} word sequences", "wer", "cer"]
        display(df)

        # Save the data
        df.to_csv("/content/results.csv")

# Ugh, having to use file_groups, longest_word_sequence_threshold from global scope 
# Can't see how to pass them into this handler :( 
@output.capture(clear_output=True)
def on_button_clicked(b):
    file_groups[b.owner.description]["rating"] = b.owner.value
    show_data_frame(file_groups, longest_word_sequence_threshold)
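
One possible way around the global-scope problem noted in the comment above is to bind the extra arguments with `functools.partial` before registering the handler. This is only a sketch under assumptions: `record_rating` and `FakeButtons` are invented for the example, and the change event is faked as a plain dict with `owner` and `new` keys rather than coming from a real widget.

```python
import functools


def record_rating(change, groups, on_update):
    """Store the newly selected rating, then pass all groups to on_update."""
    groups[change["owner"].description]["rating"] = change["new"]
    on_update(groups)


class FakeButtons:
    """Stand-in for a ToggleButtons widget; only the description is needed here."""
    description = "story"


demo_groups = {"story": {"rating": None}}
updates = []

# Bind the data structure and the refresh callback ahead of time
handler = functools.partial(record_rating, groups=demo_groups, on_update=updates.append)

# In the notebook the handler would be registered with something like
# buttons.observe(handler, names="value"); here the event is faked instead.
handler({"owner": FakeButtons(), "new": "Wow"})
print(demo_groups["story"]["rating"])  # Wow
```

Observing only the `value` trait would also mean the handler runs once per rating change, rather than for every trait change on the widget.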

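This function finds the word sequences that the two texts have in common, keeping those that are at least `threshold` words long. Thanks to Harry Keightley for this one.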

In [13]:
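from typing import Tuple

def common_words2(text: str, other_text: str, threshold: int) -> Tuple[int, List[str]]:
    # Split both texts the same way so the match indices line up with the words
    words = text.split()
    other_words = other_text.split()
    matcher = difflib.SequenceMatcher(None, words, other_words)
    # Keep only the matching blocks that are at least `threshold` words long
    matches = [match for match in matcher.get_matching_blocks() if match.size >= threshold]

    def match_to_text(match: difflib.Match) -> str:
        return " ".join(words[match.a : match.a + match.size])

    matched = [match_to_text(match) for match in matches]

    # Return the number of matching sequences and the sequences themselves
    return len(matched), matched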
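
The matching step can be seen in isolation with a pair of made-up sentences. This sketch uses only `difflib` from the standard library; the example texts and the threshold of 3 are arbitrary choices.

```python
import difflib

ref_words = "the cat sat on the mat today".split()
hyp_words = "a cat sat on the rug today".split()

matcher = difflib.SequenceMatcher(None, ref_words, hyp_words)
# Each block has fields a, b and size: a run of `size` equal words starting at
# ref_words[a] and hyp_words[b]. The last block is always a zero-size sentinel.
blocks = matcher.get_matching_blocks()

threshold = 3
runs = [
    " ".join(ref_words[block.a : block.a + block.size])
    for block in blocks
    if block.size >= threshold
]
print(runs)  # ['cat sat on the']
```

The single shared word "today" is also a matching block, but it is dropped because it is shorter than the threshold.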


Process the text lines, get alignment, error info and common word sequences.

In [14]:
import functools

def process_text(file_groups, longest_word_sequence_threshold, silent=True):

    # Configuration settings for TextErrors - shouldn't need to change any of these
    cer=True
    num_top_errors=10
    oov_set=[]
    debug=True
    use_chardiff=True
    isctm=False
    skip_detailed=False
    keywords=[]
    utt_group_map=None
    oracle_wer=False
    freq_sort=False
    nocolor=False
    insert_tok='<eps>'
    terminal_width=120
    group_stats=True

    if silent:
        debug = False

    # Use this library to convert the ANSI codes that TextErrors outputs into HTML
    conv = Ansi2HTMLConverter()

    # Will be used to hold the accordion panels
    accordion_children = []

    # Iterate the file pairs
    for key in file_groups:
        
        # Will be used to compile the coloured, aligned lines
        ansi_list = []
        ansi_lines = []

        # Add file info for the ref and hyp files to the data structure
        file_group = file_groups[key]
        ref_utts = read_file(file_group["ref"])
        hyp_utts = read_file(file_group["inf"])

        if not silent and debug:
            print("")
            print(file_group["ref_text"])
            print(file_group["inf_text"])

        # Process lines for each file pair, this does the alignment and error calc
        multilines, error_stats = texterrors.process_lines(ref_utts, hyp_utts, debug, use_chardiff, isctm, skip_detailed, terminal_width, oracle_wer, keywords, oov_set, cer, utt_group_map, group_stats, nocolor, insert_tok)
        
        # Show me the errors 
        if not silent:
            print(error_stats)

        # Rebuild the error values, the lib doesn't expose them easily
        ins_count = sum(error_stats.ins.values())
        del_count = sum(error_stats.dels.values())
        sub_count = sum(error_stats.subs.values())

        # These don't seem to handle cer for empty files well
        # wer_raw = (ins_count + del_count + sub_count) / float(error_stats.total_count)
        # cer_raw = error_stats.char_error_count / float(error_stats.char_count)

        jiwer_wer = jiwer.wer(file_group["ref_text"], file_group["inf_text"])
        jiwer_cer = jiwer.cer(file_group["ref_text"], file_group["inf_text"])

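        # Use separate names for the percentages so the `cer` flag that is
        # passed to TextErrors above is not overwritten on the first iteration
        wer_pct = round(jiwer_wer*100, 2)
        cer_pct = round(jiwer_cer*100, 2)

        # Keep the error values in the data structure
        file_group["wer"] = wer_pct
        file_group["cer"] = cer_pct

        if not silent:
            print("jiwer_wer", jiwer_wer)
            print("jiwer_cer", jiwer_cer)
            print("WER", wer_pct)
            print("CER", cer_pct)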

        # Compile the aligned lines
        for multiline in multilines:

            for lines in multiline.iter_construct():
                # Keep the ansi-coded data in a list so we can parse it later for working out longest common word sequence
                ansi_list.append(lines)

                # Group the ansi coded data for each file for easier display in the accordion panel
                ansi_lines.append("\n".join(lines))

        # Get the longest word sequence
        sequence_count, sequence_matches = common_words2(file_group["ref_text"], file_group["inf_text"], longest_word_sequence_threshold)

        # Keep the word sequence information handy
        file_group["sequence_count"] = sequence_count
        file_group["sequence_matches"] = sequence_matches

        # Set up buttons to handle user quality rating
        buttons = widgets.ToggleButtons(
            options=rating_options,
            description=key,
            style={"button_width": "60px"},
            value=file_group["rating"]
        )
        # buttons.observe(on_button_clicked, 'value')
        buttons.observe(on_button_clicked)
        
        # Use conv.convert to reformat ansi-coded data to html so we can display it in a widget
        html_lines = conv.convert('\n\n'.join(ansi_lines))

        ref_html_section  = f"<div class='ref_text'>{file_group['ref_text']}</div>"
        inf_html_section  = f"<div class='inf_text'>{file_group['inf_text']}</div>"
        aligned_html_section  = f"<div class='aligned_lines'>{html_lines}</div>"
        matched_html_section  = f"<div class='matched_lines'>{'<hr>'.join(sequence_matches)}</div>"
        
        panel_accordion = widgets.Accordion(children=[
            widgets.HBox([widgets.HTML(ref_html_section), widgets.HTML(inf_html_section)]),
            widgets.HTML(matched_html_section), 
            widgets.HTML(aligned_html_section)], 
            selected_index=None)
        
        panel_accordion.set_title(0, "Ref/inf text")
        panel_accordion.set_title(1, f"Sequences of {longest_word_sequence_threshold} or more words")
        panel_accordion.set_title(2, "Aligned lines")

        accordion_children.append([buttons, panel_accordion])
    return accordion_children
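
For readers unfamiliar with the WER and CER figures stored for each file group, the sketch below computes them from scratch as edit distance divided by reference length. It is an illustration of the metric only, not jiwer's implementation, so results may differ from jiwer where it normalises text differently. The function names and the handling of an empty reference are assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion from the reference
                current[j - 1] + 1,      # insertion into the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def error_rate(ref_items, hyp_items):
    """Edit distance as a fraction of the reference length."""
    if not ref_items:
        # Arbitrary convention for this sketch: an empty reference scores 0
        # against an empty hypothesis and 1 against anything else
        return 0.0 if not hyp_items else 1.0
    return edit_distance(ref_items, hyp_items) / len(ref_items)


ref_text = "the cat sat on the mat"
hyp_text = "the cat sit on mat"

# Word error rate compares lists of words, character error rate compares characters
wer_pct = round(error_rate(ref_text.split(), hyp_text.split()) * 100, 2)
cer_pct = round(error_rate(list(ref_text), list(hyp_text)) * 100, 2)
print(wer_pct)  # 33.33 (one substitution and one deletion over six words)
```

Both rates can exceed 100% when the hypothesis is much longer than the reference, since insertions count as errors but do not add to the denominator.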



This compiles the accordion panels into an accordion display widget.

In [15]:
def show_accordion(file_groups, accordion_children):
    for i, title in enumerate(file_groups):
        accordion = widgets.Accordion(children=[widgets.VBox(accordion_children[i])])
        accordion.set_title(0, title)
        display(accordion)

# Show me

Display the interface and show the data as a dataframe. To re-run the interface with different data, download the files using the cells above, then come back and re-run this cell (you may also need to re-run the cell after it).

In [16]:
# Note: this overrides the value set in the Configuration section
longest_word_sequence_threshold = 4

# Rebuild the data structure based on what's currently in file_dir
file_groups = build_file_groups(file_dir)

# This does the alignment and error calculation
accordion_children = process_text(file_groups, longest_word_sequence_threshold)

# And this draws the accordion widget
show_accordion(file_groups, accordion_children)

Accordion(children=(VBox(children=(ToggleButtons(description='ZMS_EIP_010_Pronoun_cv_all', options=('Useless',…

Accordion(children=(VBox(children=(ToggleButtons(description='ZMS_EIP_013_Transaction_cv_all', options=('Usele…

In [17]:
display(output)
output.clear_output()
show_data_frame(file_groups, longest_word_sequence_threshold)

Output()