# Metrics

Note: For a tutorial on the command-line tool, see <https://benchmarkstt.readthedocs.io/en/latest/tutorial.html>

Note: It is assumed that you have gone through the example in the Normalization tutorial first.

## Recap

We created three normalizers, one for each file we intend to use in the benchmark. The resulting code was:

In [1]:
from benchmarkstt.normalization import NormalizationComposite
from benchmarkstt.normalization.core import Regex, Replace, Lowercase

####### CONSTRUCT NORMALIZERS #######

# aws
normalizer_aws = NormalizationComposite()
normalizer_aws.add(Regex('^.*"transcript":"([^"]+)".*', '\\1'))
normalizer_aws.add(Lowercase())
                                  
# kaldi
normalizer_kaldi = NormalizationComposite()
normalizer_kaldi.add(Regex('^.*"text":"([^"]+)".*', '\\1'))
normalizer_kaldi.add(Lowercase())

# subtitles (reference)
normalizer_ref = NormalizationComposite()
normalizer_ref.add(Regex(r"</?[?!\[\]a-zA-Z][^>]*>", " ")) # Remove XML-tags (raw string avoids invalid escape warnings)
normalizer_ref.add(Regex(r"[\n\s]+", " ")) # Collapse runs of newlines and spaces into a single space

# Remove non-dialogue text
normalizer_ref.add(Replace('APPLAUSE', ''))
normalizer_ref.add(Replace('SPEAKS OFF MIC', ''))
normalizer_ref.add(Replace('INDISTINCT', ''))
normalizer_ref.add(Replace('CHATTER FROM AUDIENCE', ''))
normalizer_ref.add(Replace('LAUGHTER', ''))
normalizer_ref.add(Replace('DROWNS OUT SPEECH', ''))
normalizer_ref.add(Replace('GROANING', ''))
normalizer_ref.add(Replace('CHEERING', ''))
normalizer_ref.add(Lowercase())

And the code for loading the filenames into variables:

In [2]:
# you can download these files at 
# https://github.com/ebu/benchmarkstt/tree/master/docs/_static/demos

from os import path
ROOT = "../_static/demos"

####### LOADING THE FILES #######

# Subtitle file
subtitle_file = path.join(ROOT, "qt_subs.xml")

# Transcript generated by AWS
aws_transcript_file = path.join(ROOT, "qt_aws.json")

# Transcript generated by Kaldi
kaldi_transcript_file = path.join(ROOT, "qt_kaldi.json")
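
As a reminder of what the hypothesis normalizers do, the sketch below applies the same two steps as `normalizer_aws` to a single line, using only Python's `re` module. It assumes that `Regex` behaves like `re.sub` and that `Lowercase` behaves like `str.lower`. The helper `normalize_aws_line` and the sample line are made up for illustration and are not taken from `qt_aws.json`.

```python
import re

def normalize_aws_line(line):
    """Mimic normalizer_aws on one line (hypothetical helper, not part of benchmarkstt)."""
    # Step 1: keep only the value of the "transcript" field
    text = re.sub('^.*"transcript":"([^"]+)".*', '\\1', line)
    # Step 2: lowercase everything
    return text.lower()

# Invented sample line, shaped like a transcription result
sample = '{"results":{"transcripts":[{"transcript":"Tonight, the Prime Minister"}]}}'
print(normalize_aws_line(sample))  # tonight, the prime minister
```

The Kaldi normalizer follows the same pattern, with `"text"` in place of `"transcript"`.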

## Using Input classes

We will use [`benchmarkstt.input.core.File`](https://benchmarkstt.readthedocs.io/en/latest/modules/benchmarkstt.input.core.html#benchmarkstt.input.core.File) to load each file's contents, normalize them, and split them into segments.

According to the documentation, input classes are "responsible for dealing with input formats and converting them to benchmarkstt native schema". This is the format that the `compare` method of the metrics classes expects.

In [3]:
from benchmarkstt.input.core import File

reference = File(
    subtitle_file, 
    'plaintext', 
    normalizer_ref
)

hypothesis_aws = File(
    aws_transcript_file, 
    'plaintext',
    normalizer_aws
)

hypothesis_kaldi = File(
    kaldi_transcript_file,
    'plaintext',
    normalizer_kaldi
)

## Calculating Word Error Rate

We want to calculate the Word Error Rate (WER). We can use the [`benchmarkstt.metrics.core.WER`](https://benchmarkstt.readthedocs.io/en/latest/modules/benchmarkstt.metrics.core.html#benchmarkstt.metrics.core.WER) class directly for this.

In [4]:
from benchmarkstt.metrics.core import WER
wer = WER()

Let's check the WER for both ref/aws and ref/kaldi:

In [5]:
print("AWS: %.4f" % wer.compare(reference, hypothesis_aws))
print("Kaldi: %.4f" % wer.compare(reference, hypothesis_kaldi))

AWS: 0.2987
Kaldi: 0.3002


We now have a Word Error Rate for both Kaldi and AWS. For this single example, using the normalization rules defined above, Kaldi has a marginally higher WER than AWS (0.3002 versus 0.2987).
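
For intuition about what the number means: WER is commonly defined as the number of word substitutions, deletions and insertions needed to turn the reference into the hypothesis, divided by the number of words in the reference. The sketch below computes this with a word-level edit distance on a toy sentence pair. It is only an illustration; `toy_wer` is a made-up name, and benchmarkstt's `WER` class may align words differently.

```python
def toy_wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] = edits needed to turn the ref words seen so far into the first j hyp words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# One substitution (of -> off) and one insertion (the) against 5 reference words
print("%.4f" % toy_wer("the leader of labour party", "the leader off the labour party"))  # 0.4000
```

Because insertions count as errors but do not add to the denominator, WER can exceed 1.0 when the hypothesis is much longer than the reference.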

## Calculating DiffCounts

Using [`benchmarkstt.metrics.core.DiffCounts`](https://benchmarkstt.readthedocs.io/en/latest/modules/benchmarkstt.metrics.core.html#benchmarkstt.metrics.core.DiffCounts), we can get more detail about the differences than [`WER`](https://benchmarkstt.readthedocs.io/en/latest/modules/benchmarkstt.metrics.core.html#benchmarkstt.metrics.core.WER) alone provides.

In [6]:
from benchmarkstt.metrics.core import DiffCounts
diffcounts = DiffCounts()

In [7]:
print("AWS:")
print(diffcounts.compare(reference, hypothesis_aws))

print("Kaldi:")
print(diffcounts.compare(reference, hypothesis_kaldi))

AWS:
OpcodeCounts(equal=11462, replace=2200, insert=682, delete=1710)
Kaldi:
OpcodeCounts(equal=11645, replace=2769, insert=888, delete=958)
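
These counts are consistent with the WER values reported earlier if WER is taken as `(replace + insert + delete) / (equal + replace + delete)`, that is, errors divided by the number of words in the reference. The sketch below redoes that arithmetic. The local `OpcodeCounts` namedtuple is a stand-in defined here for illustration, not the class returned by benchmarkstt, and the formula is an assumption checked only against the figures on this page.

```python
from collections import namedtuple

# Local stand-in mirroring the printed output above (not imported from benchmarkstt)
OpcodeCounts = namedtuple("OpcodeCounts", ["equal", "replace", "insert", "delete"])

def wer_from_counts(counts):
    """Errors divided by the number of words in the reference."""
    errors = counts.replace + counts.insert + counts.delete
    reference_words = counts.equal + counts.replace + counts.delete
    return errors / reference_words

aws = OpcodeCounts(equal=11462, replace=2200, insert=682, delete=1710)
kaldi = OpcodeCounts(equal=11645, replace=2769, insert=888, delete=958)

print("AWS: %.4f" % wer_from_counts(aws))      # AWS: 0.2987
print("Kaldi: %.4f" % wer_from_counts(kaldi))  # Kaldi: 0.3002
```

The breakdown also shows where the two systems differ: AWS deletes more reference words (1710 versus 958), while Kaldi has more replacements and insertions.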


## Show complete differences

Using [`benchmarkstt.metrics.core.WordDiffs`](https://benchmarkstt.readthedocs.io/en/latest/modules/benchmarkstt.metrics.core.html#benchmarkstt.metrics.core.WordDiffs), we can get the full diff between reference and hypothesis.

In [8]:
from benchmarkstt.metrics.core import WordDiffs

# we are using "ansi" diff dialect here, as it shows colored output,
# making the output more human readable

worddiffs = WordDiffs('ansi') 

print("AWS:\n")
print(worddiffs.compare(reference, hypothesis_aws)[:300])

print("\n\nKaldi:")
print(worddiffs.compare(reference, hypothesis_kaldi)[:300])

AWS:

Color key: Unchanged [31mReference[0m [32mHypothesis[0m

[31m·bbc·2017·tonight,[0m[32m·tonight[0m·the·prime[31m·minister,[0m[32m·minister[0m·theresa·may,·the·leader·of·the·conservative[31m·party,·and[0m[32m·party·on·dh,[0m·the·leader·of[32m·the[0m·labour·party,·jeremy·corbyn,·face·


Kaldi:
Color key: Unchanged [31mReference[0m [32mHypothesis[0m

[31m·bbc·2017·tonight,[0m[32m·tonight[0m·the·prime[31m·minister,[0m[32m·minister[0m·theresa[31m·may,[0m[32m·may[0m·the·leader·of·the·conservative[31m·party,[0m[32m·party[0m·and·the·leader·of[32m·the[0m·labour[31m·party,

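
The color codes above are only readable in a terminal that renders ANSI escapes. As a rough illustration of what a word-level diff contains, the sketch below uses `difflib.SequenceMatcher` from the standard library and marks reference-only words as `[-...-]` and hypothesis-only words as `{+...+}`. This is a stand-in under stated assumptions: `toy_word_diff` is a made-up helper, and benchmarkstt's own differ and diff dialects may align and render words differently.

```python
from difflib import SequenceMatcher

def toy_word_diff(reference, hypothesis):
    """Render a word-level diff with [-reference-] and {+hypothesis+} markers."""
    ref, hyp = reference.split(), hypothesis.split()
    # autojunk=False disables the heuristic that ignores very frequent items in long inputs
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(" ".join(ref[i1:i2]))
        if tag in ("replace", "delete"):
            parts.append("[-" + " ".join(ref[i1:i2]) + "-]")
        if tag in ("replace", "insert"):
            parts.append("{+" + " ".join(hyp[j1:j2]) + "+}")
    return " ".join(parts)

print(toy_word_diff(
    "the prime minister, theresa may",
    "the prime minister theresa may",
))
```

In this toy pair the only difference is the trailing comma on `minister,`, which is the kind of punctuation-only mismatch visible in the real output above.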

## Finetuning

You can see that some of the differences are due to punctuation. Because in our case we are only interested in the correct identification of words, these types of differences should not count as errors. To get a more accurate WER, we will remove punctuation marks.

We will do this for the reference and both hypothesis files.

In [9]:
remove_punctuation = Regex('[,.-]', '')

normalizer_ref.add(remove_punctuation)
normalizer_aws.add(remove_punctuation)
normalizer_kaldi.add(remove_punctuation)
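
To see the effect of this pattern in isolation, the sketch below applies it with `re.sub`, assuming `Regex` behaves the same way. A `-` placed last inside a character class is a literal hyphen, so `[,.-]` matches commas, periods and hyphens. The sample string is invented.

```python
import re

# Same pattern as remove_punctuation above, compiled with the standard library
PUNCTUATION = re.compile('[,.-]')

sample = "the prime minister, theresa may, a well-known leader."
print(PUNCTUATION.sub('', sample))  # the prime minister theresa may a wellknown leader
```

Note that removing `-` joins hyphenated words (`well-known` becomes `wellknown`); whether that is desirable depends on how the transcripts write such words.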

Let's re-check the WER and the differences:

In [10]:
print("AWS WER: %.4f" % wer.compare(reference, hypothesis_aws))
print("Kaldi WER: %.4f\n" % wer.compare(reference, hypothesis_kaldi))

print("AWS diffs:\n")
print(worddiffs.compare(reference, hypothesis_aws)[:400])

print("\nKaldi diffs:\n")
print(worddiffs.compare(reference, hypothesis_kaldi)[:400])

AWS WER: 0.2403
Kaldi WER: 0.1978

AWS diffs:

Color key: Unchanged [31mReference[0m [32mHypothesis[0m

[31m·bbc·2017[0m·tonight·the·prime·minister·theresa·may·the·leader·of·the·conservative·party[31m·and[0m[32m·on·dh[0m·the·leader·of[32m·the[0m·labour·party·jeremy·corbyn·face·the·voters·welcome·to·question·time·so·over·the·next[31m·90[0m[32m·ninety[0m·minutes·the·leaders[31m·of[0m[32m·off[0m·the·two·larger·parties·are·goin

Kaldi diffs:

Color key: Unchanged [31mReference[0m [32mHypothesis[0m

[31m·bbc·2017[0m·tonight·the·prime·minister·theresa·may·the·leader·of·the·conservative·party·and·the·leader·of[32m·the[0m·labour·party·jeremy·corbyn·face·the·voters·welcome[31m·to·question·time[0m·so·over·the·next[31m·90[0m[32m·ninety[0m·minutes·the·leaders·of·the·two·larger·parties·are·going·to·be·quizzed·by·our·audience·here·


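As you can see, the WER drops for both systems: AWS goes from 0.2987 to 0.2403 and Kaldi from 0.3002 to 0.1978. Kaldi now has the lower WER, reversing the earlier ranking, and the WordDiffs look more like what we would expect.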

It is left as an exercise to the reader to further extend the normalizers to get even more representative WERs (e.g. by adding a `Replace("ninety", "90")` to the hypothesis normalizers, removing the "bbc 2017" at the start, etc.).