# TEST-SCRIPT for Normalizer class in the normalization module

The Normalizer class includes common normalization operations that are useful in the context of ASR evaluation and other applications requiring literal text comparisons.


In [1]:
import re, sys, os, copy
import pkg_resources
import evalign as eva
#
data_nl = pkg_resources.resource_filename('evalign', 'data/')

In [2]:
help(eva.normalize)

Help on module evalign.normalize in evalign:

NAME
    evalign.normalize

DESCRIPTION
    @author:  compi
    @version: 0.1
    @revised: 23/11/2022

CLASSES
    builtins.object
        Normalizer
    
    class Normalizer(builtins.object)
     |  Summary:
     |  --------
     |  The Normalizer class delivers a set of text normalization operations
     |  that are handy in ASR (and NLP) evaluations
     |  
     |  A Normalizer object is formed by adding elementary operations 
     |  to the processing pipeline.
     |  
     |  Normalization can then be applied to a string or corpus (list of strings)
     |  by calling Normalizer.process(text)
     |  
     |  The elementary opertions are:
     |      ToLowerCase          convert to lower case
     |      ToUpperCase          convert to upper case
     |      ReduceWhiteSpace    converts all white space in between words to single blank 
     |      DeleteTags           delete all XML like tagged tokens (i.e. between <>)
     |      S

## Reference Example
This example shows how to implement the accepted normalization for the ASR NBEST test suite for Dutch (BE) with the Normalizer class.   

A normalization pipeline is defined, consisting of:   
- removing all additional white space
- normalizing defined abbreviations
- normalizing fillers
    + e.g. uh --> \<h\>
- rewrite rules for Dutch numbers < 100 (these complement the ASR routine, which generates numbers in parts)
    + e.g. één-en tachtig -->  eenentachtig
- normalizing for spelling variants in the NBEST test set
    + e.g. E negentien --> E19
- removing all tags (normalized fillers)
- converting to lower case   

The normalization with the **norm** object is applied to a raw text **text_raw** by simply running:
>  text_norm = norm.process(text_raw)

In [3]:
text_raw="""de rijweg is vrij op de E negentien Antwerpen Brussel in de Craeybeckxtunnel \n
uh uh een goede morgen \n       \n
Antwerpen-Charleroi eindigde op  één-en tachtig drie-en zestig
"""
print("--INPUT:  ")
print(text_raw)
# 1. 
# Load specifications from files (or create them here), read functions are defined in evalign.utils
nl_abbrev = eva.LoadSubstitutionsFromFile(data_nl+'nl_abbrev.lst')
cgn_fillers = eva.LoadSubstitutionsFromFile(data_nl+'cgn_fillers.lst')
nl_getallen100 = eva.LoadSubstitutionsFromFile(data_nl+'nl_getallen100.lst')
nbest = eva.LoadSubstitutionsFromFile(data_nl+'nbest.lst')
# 2. 
# initialize a Normalizer object
# and define a processing pipe by adding elementary operations with their parameters  
norm = eva.Normalizer()
norm.add_pipe("RemovePunctuation")
norm.add_pipe("SubstituteWords",cgn_fillers)
norm.add_pipe("SubstituteWords",nl_abbrev)
norm.add_pipe("Substitute",nl_getallen100)
norm.add_pipe("Substitute",nbest)
norm.add_pipe("RemoveTags")
norm.add_pipe("Lower")
norm.add_pipe("RemoveWhiteSpace")
nbest_norm = copy.deepcopy(norm)
# 3.
# run the normalizer on a raw text input
text_norm = norm.process(text_raw)
print(text_norm)

--INPUT:  
de rijweg is vrij op de E negentien Antwerpen Brussel in de Craeybeckxtunnel 

uh uh een goede morgen 
       

Antwerpen-Charleroi eindigde op  één-en tachtig drie-en zestig

de rijweg is vrij op de e19 antwerpen brussel in de craeybeckxtunnel een goede morgen antwerpen-charleroi eindigde op eenentachtig drieënzestig
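
The pipeline idea itself can be sketched without evalign. The class below is a hypothetical stand-in (the name `MiniNormalizer` and the use of plain callables are assumptions made for illustration; evalign's `add_pipe` takes operation names as strings): steps are stored in a list and applied in the order in which they were added, to a single string or to a list of strings.

```python
from typing import Callable, List, Union


class MiniNormalizer:
    """Hypothetical illustration of a normalization pipeline; not evalign's code."""

    def __init__(self) -> None:
        self.steps: List[Callable[[str], str]] = []

    def add_pipe(self, step: Callable[[str], str]) -> None:
        # Steps run in the order in which they were added.
        self.steps.append(step)

    def process(self, text: Union[str, List[str]]) -> Union[str, List[str]]:
        # A corpus (list of strings) is processed utterance by utterance.
        if isinstance(text, list):
            return [self.process(utt) for utt in text]
        for step in self.steps:
            text = step(text)
        return text


mini = MiniNormalizer()
mini.add_pipe(str.lower)                       # convert to lower case
mini.add_pipe(lambda s: " ".join(s.split()))   # collapse white space
print(mini.process("  Een   Goede \n Morgen "))
print(mini.process(["A  B", "C"]))
```

Because each step only sees the output of the previous one, the order of the `add_pipe` calls is part of the normalization definition.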


## The Elementary Operations

#### RemovePunctuation
removes the most common punctuation marks **.,!?:;**; as the output below shows, word-internal punctuation is kept and white space is left as is

In [4]:
text = " This is a sentence with spaces   , tabs \t and line breaks \n  Get rid of all Punctuation !!?  except: when word,internal."
print("input:\n", text)
Norm = eva.Normalizer()
Norm.add_pipe("RemovePunctuation")
text_norm = Norm.process(text)
print("output:\n", text_norm)

input:
  This is a sentence with spaces   , tabs 	 and line breaks 
  Get rid of all Punctuation !!?  except: when word,internal.
output:
  This is a sentence with spaces    tabs 	 and line breaks 
  Get rid of all Punctuation   except when word,internal
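
The behaviour seen in this output can be approximated with a single regular expression. This is a sketch under the assumption that a punctuation character is kept only when it has a word character directly on both sides; the function name `remove_punctuation` is hypothetical and the code is not taken from evalign.

```python
import re

# A punctuation character is deleted when the character before it or the
# character after it is not a word character (or is missing altogether).
PUNCT = re.compile(r"(?<!\w)[.,!?:;]|[.,!?:;](?!\w)")


def remove_punctuation(text: str) -> str:
    # White space is deliberately left untouched.
    return PUNCT.sub("", text)


print(remove_punctuation("Get rid of all Punctuation !!?  except: when word,internal."))
```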


#### RemoveWhiteSpace
removes all redundant white space, including line breaks; i.e. the input is considered a single utterance

In [5]:
text = " This is a sentence with spaces   , tabs \t and line breaks \n  Get rid of them !!"
print("input:\n", text)
norm = eva.Normalizer()
norm.add_pipe("RemoveWhiteSpace")
text_norm = norm.process(text)
print("output:\n", text_norm)

input:
  This is a sentence with spaces   , tabs 	 and line breaks 
  Get rid of them !!
output:
 This is a sentence with spaces , tabs and line breaks Get rid of them !!
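
A minimal equivalent in plain Python, assuming that leading and trailing white space may be dropped as well (the function name `remove_white_space` is hypothetical and this is not evalign's implementation):

```python
def remove_white_space(text: str) -> str:
    # str.split() without arguments splits on any run of white space
    # (blanks, tabs, line breaks) and ignores leading/trailing white space.
    return " ".join(text.split())


print(remove_white_space(" spaces   , tabs \t and line breaks \n  Get rid of them !!"))
```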


#### Substitute
**Substitute** implements generic pattern substitutions that are applied, in the order specified, to the full text (whole words, word-internal fragments and across word boundaries). 
Substitute patterns can hence contain blanks.  
Given this broad scope, Substitute rules should be applied with great care to avoid unexpected modifications. 
The substitutions are executed one by one, hence an original fragment may be modified multiple times.

In this example numbers (written in the decompounded format used in the ESAT ASR system) are compounded in 2 steps: (1) the first part is rewritten with a trailing compounding character '\_'; (2) the compound is accepted when a valid second part follows, at which point the marker and the blank are removed.
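
The two-step mechanism can be illustrated with literal string replacement. This is a sketch only: whether evalign treats the patterns as literal strings or as regular expressions is not stated here, and the function name `substitute` is hypothetical. The four rules are a subset of the `nl_getallen100` list.

```python
def substitute(text: str, patterns: dict) -> str:
    # Dictionaries preserve insertion order (Python 3.7+), so the rules are
    # applied one by one in the order in which they are listed.
    for target, replacement in patterns.items():
        text = text.replace(target, replacement)
    return text


rules = {
    "vijf-en": "vijfen_",      # step 1: mark the first part with '_'
    "drie-en": "drieën_",
    "n_ twintig": "ntwintig",  # step 2: a valid second part absorbs the marker
    "n_ tachtig": "ntachtig",
}
print(substitute("vijf-en twintig duizend drie honderd drie-en tachtig", rules))
```

If a first part is not followed by a valid second part, the '_' marker simply stays in the text, which makes such cases easy to spot.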

#### SubstituteWords
**SubstituteWords** applies substitutions to whole words (tokens surrounded by white space). 
This is more robust and easier to scope than the more generic **Substitute** substitutions.
However, be aware that in the current implementation all extraneous white space is lost by this routine.

In this example, a file with filler normalizations into tags \<h\> and \<g\> is loaded and applied.
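
A word-level substitution can be sketched as a dictionary lookup per token. The function name `substitute_words` and the two filler entries are illustrative assumptions, not the contents of `cgn_fillers.lst`. Splitting on white space and re-joining with single blanks is also one plausible reason why extra white space disappears.

```python
def substitute_words(text: str, patterns: dict) -> str:
    # Each white-space delimited token is replaced only if it matches a
    # target exactly; all other tokens pass through unchanged.
    return " ".join(patterns.get(word, word) for word in text.split())


fillers = {"uh": "<h>", "xxx": "<x>"}   # illustrative entries only
print(substitute_words("zoals xxx uh   kunnen we uhm", fillers))
```

Note that `uhm` is untouched here because only exact token matches are replaced, which is what makes this variant easier to scope than free pattern substitution.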
    
#### Loading Substitution Patterns from File
Both **Substitute** and **SubstituteWords** require definitions of the target and substitution patterns.  These are passed to the Normalizer as Python dictionaries.
In the **utils** module, a routine **LoadSubstitutionsFromFile** is available. The file
should hold one substitution pattern per line, with target and replacement separated by '|'.
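
To make the file format concrete, here is a hypothetical reader for such a file. It assumes that the target comes before the '|' and the replacement after it, and that blank lines are skipped; the function `load_substitutions` is an illustration and not the routine shipped with evalign.

```python
import io


def load_substitutions(lines) -> dict:
    """Hypothetical reader: one 'target|replacement' pair per line."""
    patterns = {}
    for line in lines:
        line = line.rstrip("\n")
        if "|" not in line:
            continue  # skip blank or malformed lines
        # Split on the first '|' only; blanks inside patterns are preserved.
        target, replacement = line.split("|", 1)
        patterns[target] = replacement
    return patterns


demo_file = io.StringIO("uh|<h>\nn_ twintig|ntwintig\n\n")
print(load_substitutions(demo_file))
```

The same function works on a real file object, e.g. inside `with open(path, encoding="utf-8") as f:`.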

In [6]:
text = "vijf-en twintig duizend drie honderd drie-en tachtig"
Norm = eva.Normalizer()
print("input:\n", text)
Norm.add_pipe("RemoveWhiteSpace")
nl_getallen100 = eva.LoadSubstitutionsFromFile(data_nl+'nl_getallen100.lst')
Norm.add_pipe("Substitute",nl_getallen100)
Norm.pipe=["single_space","sub_patterns"]
text_norm = Norm.process(text)
print(nl_getallen100)
print("output:\n", text_norm)

input:
 vijf-en twintig duizend drie honderd drie-en tachtig
{'één-en': 'eenen_', 'twee-en': 'tweeën_', 'drie-en': 'drieën_', 'vier-en': 'vieren_', 'vijf-en': 'vijfen_', 'zes-en': 'zesen_', 'zeven-en': 'zevenen_', 'acht-en': 'achten_', 'negen-en': 'negenen_', 'n_ twintig': 'ntwintig', 'n_ dertig': 'ndertig', 'n_ veertig': 'nveertig', 'n_ vijftig': 'nvijftig', 'n_ zestig': 'nzestig', 'n_ zeventig': 'nzeventig', 'n_ tachtig': 'ntachtig', 'n_ negentig': 'nnegentig'}
output:
 vijfentwintig duizend drie honderd drieëntachtig


In [7]:
text = "Fillers zoals xxx uh he ggg,ggg  kunnen we reduceren tot een aantal minimale tags "
print("input:\n", text)
Norm = eva.Normalizer()
word_substitutions = eva.LoadSubstitutionsFromFile(data_nl+'cgn_fillers.lst')
Norm.add_pipe("SubstituteWords",word_substitutions)
# to see the standard filler substitutions
# print(word_substitutions)
text_norm = Norm.process(text)
print("output:\n", text_norm)

input:
 Fillers zoals xxx uh he ggg,ggg  kunnen we reduceren tot een aantal minimale tags 
output:
 Fillers zoals <x> <h> <h> <g>  kunnen we reduceren tot een aantal minimale tags 


#### RemoveTags
   
**RemoveTags** will remove all tags of the form **\<???\>** from the text.  Such tags are usually not part of the text proper, as they are commonly used for meta-information, class descriptors, non-speech events, sentence and speaker boundaries, etc.     
The implementation assumes that tags are at most 32 characters long, to avoid collisions with (unlikely) standalone < or > characters.
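
The length limit can be expressed directly in a regular expression. The sketch below assumes that the limit of 32 applies to the characters between the angle brackets; the function name `remove_tags` is hypothetical and the pattern is not taken from evalign.

```python
import re

# '<', then 1 to 32 characters that are neither '<' nor '>', then '>'.
TAG = re.compile(r"<[^<>]{1,32}>")


def remove_tags(text: str) -> str:
    return TAG.sub("", text)


short_tag = "<" + "x" * 32 + ">"   # within the limit: removed
long_tag = "<" + "x" * 33 + ">"    # too long: left in place
print(remove_tags("fillers <h> <g> and <SPKR ID=M13> but a < b stays"))
print(remove_tags(short_tag + long_tag) == long_tag)
```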

In [8]:
text = "All HTML like tags like <h> <g> for fillers and others like <SPKR ID=M13> </SPKR> and <UNK> and <br> or line end and start <s> and </s> \
    can be deleted with a single call but not <012345678901234567890123456789012> which is too long"
print("input:\n", text)
Norm = eva.Normalizer()
Norm.add_pipe("RemoveTags")
text_norm=Norm.process(text)
print("output:\n", text_norm)

input:
 All HTML like tags like <h> <g> for fillers and others like <SPKR ID=M13> </SPKR> and <UNK> and <br> or line end and start <s> and </s>     can be deleted with a single call but not <012345678901234567890123456789012> which is too long
output:
 All HTML like tags like   for fillers and others like   and  and  or line end and start  and      can be deleted with a single call but not <012345678901234567890123456789012> which is too long


#### StripHyphen, SplitHyphen and DecompHyphen
**StripHyphen** will remove all hyphens in word-initial and word-final positions    
**SplitHyphen** splits hyphen-compounds into their parts    
**DecompHyphen** splits hyphen-compounds and attaches '\_' to the compounding edges 


In [9]:
text = "De ex-gouverneur laat appel- en perencontainers plaatsen"
print("input:\n", text)
Norm = eva.Normalizer()
Norm.add_pipe("StripHyphen")
text_norm=Norm.process(text)
print("output (strip):\n", text_norm)
Norm = eva.Normalizer()
Norm.add_pipe("SplitHyphen")
text_norm=Norm.process(text)
print("output (split):\n", text_norm)
Norm = eva.Normalizer()
Norm.add_pipe("DecompHyphen")
text_norm=Norm.process(text)
print("output (decomp):\n", text_norm)

input:
 De ex-gouverneur laat appel- en perencontainers plaatsen
output (strip):
 De ex-gouverneur laat appel en perencontainers plaatsen
output (split):
 De ex gouverneur laat appel- en perencontainers plaatsen
output (decomp):
 De ex_ _gouverneur laat appel_ en perencontainers plaatsen
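
The three hyphen operations can be approximated with lookaround patterns that distinguish word-internal hyphens from hyphens at a word edge. The function names and patterns below are assumptions chosen to reproduce the outputs above; they are not evalign's code.

```python
import re

INTERNAL = r"(?<=\S)-(?=\S)"                # hyphen with non-space on both sides
EDGE = r"(?<=\S)-(?!\S)|(?<!\S)-(?=\S)"     # hyphen at the end or start of a word


def strip_hyphen(text: str) -> str:
    # Remove hyphens that have white space (or a string boundary) on one side.
    return re.sub(r"(?<!\S)-|-(?!\S)", "", text)


def split_hyphen(text: str) -> str:
    # Replace word-internal hyphens by a blank.
    return re.sub(INTERNAL, " ", text)


def decomp_hyphen(text: str) -> str:
    # Split word-internal hyphens and mark both edges with '_',
    # then mark the remaining word-edge hyphens.
    text = re.sub(INTERNAL, "_ _", text)
    return re.sub(EDGE, "_", text)


text = "De ex-gouverneur laat appel- en perencontainers plaatsen"
print(strip_hyphen(text))
print(split_hyphen(text))
print(decomp_hyphen(text))
```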


#### Other functions and Special Characters

There are a few other functions supported:
> **lower**: conversion to lower case   
> **upper**: conversion to upper case
 
A few characters have a reserved meaning in Normalizer and should be used with care:
> '|' is used as separator in the substitution files   
> '<>' is used to determine tags   
> '_'  plays a role in certain compounding routines   
> '\' both the input string and the substitution patterns may contain escaped character sequences   


In [10]:
#  Remark: in the text below, an isolated '\' and '\\' yield the same input
text = """Escape characters like backslash-t(\t) and backslash-n(\n) are allowed 
but be careful with single backslash(\\) AND the sequence of operations !! """
print("input:\n", text)
Norm = eva.Normalizer()
special_chars =  {"\t":"<TAB>", "\n":"<NEWLINE>", '\\':"<BS>"}
Norm.add_pipe("Substitute",special_chars)
text_norm=Norm.process(text)
print("output:\n", text_norm)

input:
 Escape characters like backslash-t(	) and backslash-n(
) are allowed 
but be careful with single backslash(\) AND the sequence of operations !! 
output:
 Escape characters like backslash-t(<TAB>) and backslash-n(<NEWLINE>) are allowed <NEWLINE>but be careful with single backslash(<BS>) AND the sequence of operations !! 
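
Why the sequence of operations matters can be shown with plain string replacement. In this sketch (hypothetical helper `substitute`, literal replacement assumed), a rule that writes a backslash into the text interacts with a later rule that targets backslashes.

```python
def substitute(text: str, patterns: dict) -> str:
    # Rules are applied one by one, in dictionary order.
    for target, replacement in patterns.items():
        text = text.replace(target, replacement)
    return text


text = "a\tb\\c"   # a, TAB, b, one backslash, c

# TAB is first rewritten as backslash + 't'; that new backslash is then
# caught by the backslash rule as well.
tab_first = substitute(text, {"\t": "\\t", "\\": "<BS>"})

# With the backslash rule first, only the original backslash is rewritten.
bs_first = substitute(text, {"\\": "<BS>", "\t": "\\t"})

print(tab_first)
print(bs_first)
```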


## Examples for Dutch from CGN and NBEST 
#### Dutch numbers > 100 and in all contexts

In [11]:
# 1. apply number rewriting in Dutch 1-100 as foreseen in the
# Nbest normalizer above
#
text = "In het jaar dertien honderd veertien  reden twee honderd vijf-en twintig duizend honderd en acht kruisvaarders twee-en twintig keer twee-en  half uur lang "
text1=nbest_norm.process(text)
print(text1)
# 2.
# in this case the recognizer came up with an unusual sequence (here twee-en not followed by a tens number)
#    some spurious compounding characters '_' can creep into the text;
#    they are easily removed, assuming that they have no meaningful occurrences
norm_cleanup = eva.Normalizer()
norm_cleanup.add_pipe("Substitute",{'_':' '})
text2=norm_cleanup.process(text1)
print(text2)

in het jaar dertien honderd veertien reden twee honderd vijfentwintig duizend honderd en acht kruisvaarders tweeëntwintig keer tweeën_ half uur lang
in het jaar dertien honderd veertien reden twee honderd vijfentwintig duizend honderd en acht kruisvaarders tweeëntwintig keer tweeën  half uur lang


#### NBest

In [13]:
corpus_fname = "testdata/demo2_17_ref.txt"
utts, ids = eva.read_corpus(corpus_fname,KEYS=True)
    
for utt in utts:
    utt_norm = nbest_norm.process(utt)
    if utt_norm != utt:
        print('--\n',utt)
        print(utt_norm)


--
 hm

--
 hm

--
 hm

--
 hm

--
 hm

--
 wel ik denk dat de keuze nu ook uhm uhm gebaseerd is op op meer inhoudelijke argumenten
wel ik denk dat de keuze nu ook gebaseerd is op op meer inhoudelijke argumenten
--
 dus we hadden de vorige keer dat die lijst tot stand is gekomen was eigenlijk gewoon uh alle Erasmuspartners die we toen hadden die uh hadden we toen aangeschreven
dus we hadden de vorige keer dat die lijst tot stand is gekomen was eigenlijk gewoon alle erasmuspartners die we toen hadden die hadden we toen aangeschreven
--
 dus nu is het ..
dus nu is het
--
 dus nu hebben we uhm ja op inhoudelijke xxx ..
dus nu hebben we ja op inhoudelijke
--
 en 't is dus ook de bedoeling om om die lijst te laten aangroeien
en het is dus ook de bedoeling om om die lijst te laten aangroeien
--
 dus uh als er nieuwe groepen zijn die we moeten aanschrijven
dus als er nieuwe groepen zijn die we moeten aanschrijven
--
 hm

--
 hm

--
 xxx

--
 Steven uh organiseert een uh workshop ook
steven or