# Normalization

Note: for a command line tool tutorial, see <https://benchmarkstt.readthedocs.io/en/latest/tutorial.html>


To follow this tutorial you will need a working installation of
`benchmarkstt` (see https://benchmarkstt.readthedocs.io/en/latest/INSTALL.html).

### Import

First import `benchmarkstt`.

In [1]:
import benchmarkstt

We will also use 3 test files from the source repository:

In [2]:
# you can download these files from https://github.com/ebu/benchmarkstt/tree/master/docs/_static/demos
from os import path
ROOT = "../_static/demos"

# the used files:
subtitle_file = path.join(ROOT, "qt_subs.xml") # Subtitle file
aws_transcript_file = path.join(ROOT, "qt_aws.json") # Transcript generated by AWS
kaldi_transcript_file = path.join(ROOT, "qt_kaldi.json") # Transcript generated by Kaldi

Let's quickly check the data in these files.

##### Subtitle file

In [3]:
with open(subtitle_file, 'r') as f:
    lines = f.readlines()
    print(''.join(lines[0:6]) + '    [...]')
    print(''.join(lines[15:20]))

##### Transcript generated by AWS

In [4]:
with open(aws_transcript_file, 'r') as f:
    print(f.readline()[0:500])

{"jobName":"b08tllrs_qt.mp4","accountId":"038610054328","results":{"transcripts":[{"transcript":"tonight the Prime minister Theresa May, the leader of the Conservative Party on DH, the leader of the Labour Party, Jeremy Corbyn, face the voters. Welcome to question time. So over the next ninety minutes, the leaders off the two larger parties are going to be quizzed. Our audience here in York. Now this audience is made up like this. Just a third say they intend to vote conservative next week. Same


##### Transcript generated by Kaldi

In [5]:
with open(kaldi_transcript_file, 'r') as f:
    print(f.readline()[0:500])

{"metadata":{"version":"0.0.11"},"text":"tonight the prime minister theresa may the leader of the conservative party and the leader of the labour party jeremy corbyn face the voters welcome so over the next ninety minutes the leaders of the two larger parties are going to be quizzed by our audience here in york now this audience is made up like this just say they intend to vote conserve it the same numbers say they're going to vote labour and the rest either support other parties or have yet to 


### Normalization

Creating accurate verbatim transcripts for use as reference is
time-consuming and expensive. As a quick and easy alternative, we will
make a "reference" from a subtitles file. Subtitles are slightly edited
and they include additional text like descriptions of sounds and
actions, so they are not a verbatim transcription of the speech.
Consequently, they are not suitable for calculating absolute WER.
However, we are interested in calculating relative WER for illustration
purposes only, so this use of subtitles is deemed acceptable.
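
To make "relative WER" concrete, here is a minimal sketch of a word-level WER calculation using plain Python. It is not the `benchmarkstt` implementation: the function name `word_error_rate` and the sample sentences are made up for illustration. The point is only that two hypotheses scored against the same imperfect reference can still be ranked against each other.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference).

    Illustrative helper only; assumes a non-empty reference.
    """
    # Row 0: turning an empty reference into hypothesis[:j] takes j insertions.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # turning reference[:i] into an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)


# Hypothetical reference and two hypothetical engine outputs.
reference = "the leaders of the two larger parties".split()
hyp_a = "the leaders off the two larger parties".split()
hyp_b = "the leaders of two large parties".split()

wer_a = word_error_rate(reference, hyp_a)
wer_b = word_error_rate(reference, hyp_b)
print(wer_a < wer_b)
```

Neither number is meaningful as an absolute score when the reference is not verbatim, but the comparison between the two still tells us which hypothesis is closer to it.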

Warning: evaluations in this tutorial are not done for the purpose of assessing
tools. The use of subtitles as reference will skew the results so they
should not be taken as an indication of overall performance or as an
endorsement of a particular vendor or engine.

We will use the subtitles file for the BBC's Question Time Brexit
debate. This program was chosen for its length (90 minutes) and because
live debates are particularly challenging to transcribe.

The subtitles file includes a lot of extra text in XML tags. This text
shouldn't be used in the calculation: for both reference and hypotheses,
we want to run the tool on plain text only. To strip out the XML tags,
we could use the `benchmarkstt-tools` command, with the `normalization`
subcommand:

```
benchmarkstt-tools normalization --inputfile qt_subs.xml --regex "</?[?!\[\]a-zA-Z][^>]*>" " "
```

Or we could do the equivalent using Python code:

In [6]:
from benchmarkstt import normalization

# The regular expression replaces XML tags with a space
normalizer_xmltags = normalization.core.Regex(r"</?[?!\[\]a-zA-Z][^>]*>", " ")

with open(subtitle_file, 'r') as f:
    normalized = normalizer_xmltags.normalize(f.read())

# minimalistic debug function to display specific line ranges
def debug(text, from_line, amount=5):
    print('\n'.join(text.split('\n')[from_line:from_line+amount]))

# output a couple of lines to check result
debug(normalized, 16)

   
     
       Tonight, the Prime Minister, Theresa May, 
       the leader of the Conservative Party, 
       and the leader of Labour Party, Jeremy Corbyn, face the voters. 


The normalization class [`Regex`](https://benchmarkstt.readthedocs.io/en/latest/modules/benchmarkstt.normalization.core.html#benchmarkstt.normalization.core.Regex) takes two parameters: a regular
expression pattern and the replacement string. See the documentation for more information.
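
To see what this particular pattern matches, we can try it on a small string with Python's standard `re` module. This is only a sketch: it assumes the normalizer behaves like a plain `re.sub` call, and the sample markup is invented rather than taken from `qt_subs.xml`.

```python
import re

# Same pattern as in the cell above, written as a raw string.
XML_TAG = re.compile(r"</?[?!\[\]a-zA-Z][^>]*>")

# Hypothetical sample, loosely modelled on a subtitle file.
sample = ('<?xml version="1.0"?><p begin="00:00:01">'
          '<span>Tonight,</span> the Prime Minister</p>')

# Every opening tag, closing tag and XML declaration becomes one space.
stripped = XML_TAG.sub(" ", sample)
print(repr(stripped))
```

The pattern matches `<`, an optional `/`, one character that can start a tag name (a letter, `?`, `!`, `[` or `]`), and then everything up to the next `>`.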

In this case all XML tags will be replaced with a space. This will
result in a lot of space characters, but these are ignored by the diff
algorithm later so we don't really have to clean these up. 
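
A quick way to convince ourselves of this, assuming the comparison works on whitespace-delimited words (an assumption for this sketch, not a description of the actual `benchmarkstt` tokenizer):

```python
# Two versions of the same sentence, differing only in whitespace.
messy = "   Tonight,  \n   the Prime   Minister \n"
tidy = "Tonight, the Prime Minister"

# str.split() without arguments splits on any run of whitespace
# and drops leading and trailing whitespace.
messy_words = messy.split()
tidy_words = tidy.split()

print(messy_words == tidy_words)
```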

However, let's do it anyway for easier debugging:

In [7]:
# The regular expression replaces each run of newline and space characters with a single space
normalizer_spaces = normalization.core.Regex(r"[\n\s]+", " ")
normalized = normalizer_spaces.normalize(normalized)

# output first 500 characters
print(normalized[:500])

 BBC 2017 Tonight, the Prime Minister, Theresa May, the leader of the Conservative Party, and the leader of Labour Party, Jeremy Corbyn, face the voters. Welcome to Question Time. So, over the next 90 minutes, the leaders of the two larger parties are going to be quizzed by our audience here in York. Now, this audience is made up like this - just a third say they intend to vote Conservative next week. The same number say they're going to vote Labour, and the rest either support other parties, or


`normalized` now contains cleaned and readable plain text. A combination of `normalizer_xmltags` and `normalizer_spaces` would be handy.

Meet [NormalizationComposite](https://benchmarkstt.readthedocs.io/en/latest/modules/benchmarkstt.normalization.html#benchmarkstt.normalization.NormalizationComposite), a normalizer that lets us combine other normalizers.

In [8]:
from benchmarkstt import normalization

normalizer_xmltags = normalization.core.Regex(r"</?[?!\[\]a-zA-Z][^>]*>", " ")
normalizer_spaces = normalization.core.Regex(r"[\n\s]+", " ")

# making a single normalizer out of xmltags and spaces regex
normalizer = normalization.NormalizationComposite()
normalizer.add(normalizer_xmltags)
normalizer.add(normalizer_spaces)

with open(subtitle_file, 'r') as f:
    subtitles = f.read()

normalized = normalizer.normalize(subtitles)
# output first 500 characters
print(normalized[:500])

 BBC 2017 Tonight, the Prime Minister, Theresa May, the leader of the Conservative Party, and the leader of Labour Party, Jeremy Corbyn, face the voters. Welcome to Question Time. So, over the next 90 minutes, the leaders of the two larger parties are going to be quizzed by our audience here in York. Now, this audience is made up like this - just a third say they intend to vote Conservative next week. The same number say they're going to vote Labour, and the rest either support other parties, or


You can see that the XML tags are gone and the superfluous whitespace has been collapsed. From now on we can simply keep adding normalizers to the composite.
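
Conceptually, a composite just applies its normalizers one after another, each working on the output of the previous one. The sketch below illustrates that idea with a hypothetical `SimpleComposite` class built on the standard library; it is not how `benchmarkstt` implements `NormalizationComposite`, only a picture of the pattern.

```python
import re
from typing import Callable, List


class SimpleComposite:
    """Hypothetical stand-in that chains text-to-text steps in order."""

    def __init__(self) -> None:
        self._steps: List[Callable[[str], str]] = []

    def add(self, step: Callable[[str], str]) -> None:
        self._steps.append(step)

    def normalize(self, text: str) -> str:
        # Each step receives the output of the previous one.
        for step in self._steps:
            text = step(text)
        return text


pipeline = SimpleComposite()
pipeline.add(lambda t: re.sub(r"</?[?!\[\]a-zA-Z][^>]*>", " ", t))  # strip tags
pipeline.add(lambda t: re.sub(r"\s+", " ", t))                      # collapse whitespace

# Invented sample markup.
result = pipeline.normalize("<p>\n  Welcome to <i>Question Time</i>\n</p>")
print(repr(result))
```

Because the steps run in the order they were added, the order of the `add` calls matters.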

Let's do a quick check of our results by checking the most common words.

In [9]:
from collections import Counter

wordcounts = Counter(normalized.split(' '))
wordcounts.most_common(50)

[('the', 728),
 ('to', 514),
 ('of', 340),
 ('that', 320),
 ('and', 318),
 ('a', 304),
 ('I', 295),
 ('in', 282),
 ('you', 234),
 ('we', 222),
 ('is', 176),
 ('have', 154),
 ('for', 138),
 ('be', 128),
 ('are', 115),
 ('it', 115),
 ('on', 107),
 ('with', 97),
 ('think', 97),
 ('about', 88),
 ('will', 80),
 ('not', 79),
 ('do', 77),
 ('as', 74),
 ('people', 73),
 ('at', 72),
 ('our', 70),
 ('this', 70),
 ('would', 70),
 ('your', 67),
 ("it's", 62),
 ('was', 59),
 ('want', 57),
 ('what', 56),
 ('APPLAUSE', 55),
 ('can', 54),
 ('but', 52),
 ('just', 49),
 ('because', 49),
 ('going', 48),
 ("I'm", 47),
 ('get', 47),
 ('We', 47),
 ('from', 46),
 ('there', 46),
 ('or', 45),
 ('so', 45),
 ('all', 44),
 ('more', 44),
 ('those', 41)]

The output is mostly what we would expect, but the file still contains non-dialogue text like 'APPLAUSE'. Let's add another normalizer!

In [10]:
normalizer.add(normalization.core.Replace('APPLAUSE', ''))

Maybe there are other upper-case words that should be filtered out? Let's use a `Regex` normalizer to strip everything except upper-case letters, so that any remaining upper-case text stands out.

In [11]:
normalization.core.Regex('[^A-Z]+ *', ' ').normalize(normalized)

' BBC T P M T M C P L P J C W Q T S Y N C T L A T BBCQT F P T S C P P M T M APPLAUSE T T T G P M Y A E W H S P M W D APPLAUSE T A I L I H S A I I I H S I I I I I I I I I I DNA D A I S APPLAUSE W A E W P M Y C P M Y A J C W APPLAUSE F I I I I I A S I Y I A E U B A I OK I P M I B I I B I I W I I B I B APPLAUSE Y W L A B L D I B W A P T S P M B Y L D I J C N L D S Y D A C J M D M N S T F INTO EU B APPLAUSE Y T T P M I I T I O A N I I Y A I B B E B B W D I Y THAT THAT N I I A Y J I N C I I APPLAUSE I P M D C I I I I B I Y E I C P Y C P N I APPLAUSE I I B B I P M M B B OK APPLAUSE W B O I I W TV APPLAUSE W I UK I I I P M I OK L APPLAUSE W I L B B C G G I EU W APPLAUSE I I N I I EU B B W W P Y C W I D E UK EU I U K I APPLAUSE Y G Y B B S B W I I I UK E U L A I I EU B B I I I B I B W I APPLAUSE I R D C P M Y R E U Y B LAUGHTER I W W I A I I EU I EU W B P I T EU B G W B W D I N O Y I B Y R N T R W A C I O I I EU I S I P M A T T Y N Y G E Y I LAUGHTER I C W I F U K A I I I I P M E U B I Y T D B

Here we can easily identify more upper-case-only text that we want removed, such as "DROWNS OUT SPEECH", "CHEERING" and "CHATTER FROM AUDIENCE". Let's add a `Replace` normalizer for each of them.
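
Scanning that long string by eye works, but the candidates can also be counted. The following is a rough sketch using only the standard library; the excerpt is invented and the pattern (runs of words made of two or more capital letters) is just one possible heuristic.

```python
import re
from collections import Counter

# Hypothetical excerpt of normalized subtitle text.
text = ("Thank you. APPLAUSE I think the EU is... CHATTER FROM AUDIENCE "
        "Well, OK, but APPLAUSE we will see. LAUGHTER")

# Runs of fully upper-case words (2+ letters each) separated by single spaces.
CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?: [A-Z]{2,})*\b")

candidates = Counter(CAPS_RUN.findall(text))
for phrase, count in candidates.most_common():
    print(count, phrase)
```

Genuine acronyms such as "EU" or "OK" show up as well, and two cues that directly follow each other would be merged into one phrase, so the list still needs a manual review before turning entries into `Replace` normalizers.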

In [12]:
normalizer.add(normalization.core.Replace('SPEAKS OFF MIC', ''))
normalizer.add(normalization.core.Replace('INDISTINCT', ''))
normalizer.add(normalization.core.Replace('CHATTER FROM AUDIENCE', ''))
normalizer.add(normalization.core.Replace('LAUGHTER', ''))
normalizer.add(normalization.core.Replace('DROWNS OUT SPEECH', ''))
normalizer.add(normalization.core.Replace('GROANING', ''))
normalizer.add(normalization.core.Replace('CHEERING', ''))
normalized = normalizer.normalize(subtitles)

# print first 500 characters
print(normalized[:500])

 BBC 2017 Tonight, the Prime Minister, Theresa May, the leader of the Conservative Party, and the leader of Labour Party, Jeremy Corbyn, face the voters. Welcome to Question Time. So, over the next 90 minutes, the leaders of the two larger parties are going to be quizzed by our audience here in York. Now, this audience is made up like this - just a third say they intend to vote Conservative next week. The same number say they're going to vote Labour, and the rest either support other parties, or


We could continue to experiment with and fine-tune these normalizers, but for the purposes of this demo this basic normalization should suffice. We now have a normalizer for the reference file.
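
One example of such fine-tuning: our whitespace normalizer runs before the `Replace` normalizers, so every removed cue leaves two adjacent spaces behind. As noted earlier this should not affect the comparison, but for tidy debug output a final whitespace pass could be added. The sketch below illustrates the effect with `str.replace` and `re.sub` standing in for `Replace` and `Regex` (an assumption made to keep the example self-contained), on an invented sentence.

```python
import re

# Text that has already been through the whitespace normalizer.
text = "Thank you. APPLAUSE Next question, please."

# Removing the cue afterwards leaves a double space behind.
removed = text.replace("APPLAUSE", "")
print(repr(removed))

# Collapsing whitespace once more as the last step tidies the result.
tidied = re.sub(r"\s+", " ", removed).strip()
print(repr(tidied))
```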

We ended up with:

In [13]:
from benchmarkstt import normalization

# a single composite normalizer that chains all the steps below
normalizer = normalization.NormalizationComposite()

# Replace XML tags with a space
normalizer.add(normalization.core.Regex(r"</?[?!\[\]a-zA-Z][^>]*>", " "))
# Collapse runs of newlines and spaces into a single space
normalizer.add(normalization.core.Regex(r"[\n\s]+", " "))

# Remove non-dialogue text
normalizer.add(normalization.core.Replace('APPLAUSE', ''))
normalizer.add(normalization.core.Replace('SPEAKS OFF MIC', ''))
normalizer.add(normalization.core.Replace('INDISTINCT', ''))
normalizer.add(normalization.core.Replace('CHATTER FROM AUDIENCE', ''))
normalizer.add(normalization.core.Replace('LAUGHTER', ''))
normalizer.add(normalization.core.Replace('DROWNS OUT SPEECH', ''))
normalizer.add(normalization.core.Replace('GROANING', ''))
normalizer.add(normalization.core.Replace('CHEERING', ''))

with open(subtitle_file, 'r') as f:
    subtitles = f.read()

normalized = normalizer.normalize(subtitles)
# output first 500 characters
print(normalized[:500])

 BBC 2017 Tonight, the Prime Minister, Theresa May, the leader of the Conservative Party, and the leader of Labour Party, Jeremy Corbyn, face the voters. Welcome to Question Time. So, over the next 90 minutes, the leaders of the two larger parties are going to be quizzed by our audience here in York. Now, this audience is made up like this - just a third say they intend to vote Conservative next week. The same number say they're going to vote Labour, and the rest either support other parties, or


The next step is to get the machine-generated transcripts for benchmarking.