# TEST-SCRIPT for EVALUATION of ASR systems

- For details on the low level functions, see also:
  + distance_test: example usage of the levenshtein() and edit_distance() routines
  + normalization_test: example usage of text normalization
  

In [1]:
# do all the imports
import sys, os
import numpy as np
import pandas as pd
import evalign as eva
import pkg_resources
resources = pkg_resources.resource_filename('evalign', 'data/')
testdata = "testdata/"

## **eval_corpus()** : Main Scoring Routine

This routine takes two aligned lists of utterances, hypothesis and reference, as input
and returns its results in a dictionary.   
Both word and character error rates can be computed.
The results dictionary contains the global error rate and the detailed counts of substitutions (S), insertions (I) and deletions (D).   
Optionally (ALIGN=True) it also contains the alignments between hypothesis and reference and a list of the errors that were made.
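
As a rough illustration of what this kind of scoring involves, the sketch below counts substitutions, insertions and deletions with a plain unit-cost Levenshtein alignment. It is not the evalign implementation: `count_edits` and `word_error_rate` are hypothetical helpers, tokens are obtained by simple whitespace splitting, and unit costs are an assumption, so `eval_corpus()` may break ties between equally cheap alignments differently.

```python
def count_edits(hyp, ref):
    """Return (subs, ins, dels) for one minimum unit-cost alignment of hyp to ref."""
    n, m = len(hyp), len(ref)
    # dist[i][j] = edit distance between hyp[:i] and ref[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i              # i extra hypothesis tokens: insertions
    for j in range(1, m + 1):
        dist[0][j] = j              # j missing reference tokens: deletions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = int(hyp[i - 1] != ref[j - 1])
            dist[i][j] = min(dist[i - 1][j - 1] + mismatch,   # hit or substitution
                             dist[i - 1][j] + 1,              # insertion
                             dist[i][j - 1] + 1)              # deletion
    # Trace back one optimal path and count the operations on it.
    subs = ins = dels = 0
    i, j = n, m
    while i > 0 or j > 0:
        mismatch = int(i > 0 and j > 0 and hyp[i - 1] != ref[j - 1])
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ins += 1
            i -= 1
        else:
            dels += 1
            j -= 1
    return subs, ins, dels


def word_error_rate(hyp_utts, ref_utts):
    """Corpus-level error rate in percent: total errors / total reference tokens."""
    errors = ref_tokens = 0
    for hyp, ref in zip(hyp_utts, ref_utts):
        hyp_words, ref_words = hyp.split(), ref.split()
        errors += sum(count_edits(hyp_words, ref_words))
        ref_tokens += len(ref_words)
    return 100.0 * errors / ref_tokens
```

Errors are summed over all utterances before dividing, so long utterances weigh more than short ones in the corpus-level rate.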

In [2]:
ref_utt = ["Minister Daems stelt de vakbonden voor de keuze ."," Good Morning Vietnam"]
hyp_utt = ["minister Daems geeft de vakbonden geen keuze","jolly good day to Nam"]

In [3]:
# by default WORD error rates will be computed
results = eva.eval_corpus(hyp_utt,ref_utt)
#print(results)
assert(results['total']==10)
eva.pp_results(results)

Error Rate: 83.33% 
Error Details: #S=6 #I=2 #D=2
Edit Distance: 10.60 
Tokens (HYP): 12    (REF): 12 
Utterances: 2
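
The reported figures can be cross-checked by hand. The error rate is the number of errors (S+I+D) divided by the number of reference tokens. The "Edit Distance" values printed throughout this notebook are consistent with a cost of 1.1 per substitution, 1.0 per insertion or deletion and 0.2 per accepted compound; these weights are inferred from the printed outputs and are an assumption, not something taken from the evalign documentation. Both helper functions are hypothetical.

```python
def error_rate(subs, ins, dels, n_ref):
    """Error rate in percent: all errors relative to the reference token count."""
    return 100.0 * (subs + ins + dels) / n_ref


def weighted_distance(subs, ins, dels, compounds=0,
                      w_sub=1.1, w_indel=1.0, w_cmp=0.2):
    """Weighted edit distance; the default weights are inferred, not documented."""
    return w_sub * subs + w_indel * (ins + dels) + w_cmp * compounds


# Counts from the word-level result above: #S=6 #I=2 #D=2, 12 reference tokens
print(f"{error_rate(6, 2, 2, 12):.2f}%")        # 83.33%
print(f"{weighted_distance(6, 2, 2):.2f}")      # 10.60
```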


### Character Error Rates
Character error rates are computed by passing the argument TOKEN='CHAR'.  
Note that punctuation is removed first and all white space is reduced to single blanks.   
To keep the input character sequence exactly as given, use TOKEN=None instead.
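
The preprocessing described for TOKEN='CHAR' can be sketched as follows. `to_char_tokens` is a hypothetical helper, and using `string.punctuation` as the punctuation set is an assumption; the set that evalign actually removes may differ.

```python
import re
import string


def to_char_tokens(text):
    """Remove punctuation, reduce whitespace runs to single blanks, split into characters."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    single_blanks = re.sub(r"\s+", " ", no_punct)
    return list(single_blanks)


ref_utt = ["Minister Daems stelt de vakbonden voor de keuze .", " Good Morning Vietnam"]
raw_count = sum(len(utt) for utt in ref_utt)                    # characters as given
char_count = sum(len(to_char_tokens(utt)) for utt in ref_utt)   # after preprocessing
print(raw_count, char_count)
```

For the two reference utterances of this notebook, only the final period is removed, so the two counts differ by one.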

In [4]:
# character error rates on the input character sequence as given (TOKEN=None)
results = eva.eval_corpus(hyp_utt,ref_utt,TOKEN=None)
eva.pp_results(results)
assert(results['total']==30)
# character error rates after punctuation removal and white space reduction (TOKEN='CHAR')
results = eva.eval_corpus(hyp_utt,ref_utt,TOKEN='CHAR')
eva.pp_results(results)
assert(results['total']==29)

Error Rate: 42.86% 
Error Details: #S=19 #I=3 #D=8
Edit Distance: 31.90 
Tokens (HYP): 65    (REF): 70 
Utterances: 2
Error Rate: 42.03% 
Error Details: #S=19 #I=3 #D=7
Edit Distance: 30.90 
Tokens (HYP): 65    (REF): 69 
Utterances: 2


In [5]:
# When adding the option ALIGN=True, alignments and error details will be in the results structure
# you can also print the alignments and the errors
results = eva.eval_corpus(hyp_utt,ref_utt,TOKEN='CHAR',ALIGN=True)
eva.pp_results(results,['align','errors'])


ALIGNMENTS



Unnamed: 0,S,H,H.1,H.2,H.3,H.4,H.5,H.6,H.7,H.8,H.9,H.10,H.11,H.12,H.13,S.1,S.2,H.14,S.3,H.15,H.16,H.17,H.18,H.19,H.20,H.21,H.22,H.23,H.24,H.25,H.26,H.27,H.28,H.29,D,D.1,D.2,D.3,S.4,S.5,H.30,I,H.31,H.32,H.33,H.34,H.35,H.36,D.4
x,m,i,n,i,s,t,e,r,,D,a,e,m,s,,g,e,e,f,t,,d,e,,v,a,k,b,o,n,d,e,n,,_,_,_,_,g,e,e,n,,k,e,u,z,e,_
y,M,i,n,i,s,t,e,r,,D,a,e,m,s,,s,t,e,l,t,,d,e,,v,a,k,b,o,n,d,e,n,,v,o,o,r,,d,e,_,,k,e,u,z,e,


Unnamed: 0,D,S,H,I,S.1,S.2,H.1,S.3,H.2,S.4,S.5,I.1,S.6,S.7,S.8,H.3,D.1,S.9,S.10,S.11,S.12,H.4,H.5
x,_,j,o,l,l,y,,g,o,o,d,,d,a,y,,_,t,o,,N,a,m
y,,G,o,_,o,d,,M,o,r,n,_,i,n,g,,V,i,e,t,n,a,m



ERRORS



Unnamed: 0,x,y,E
0,m,M,S
1,g,s,S
2,e,t,S
3,f,l,S
4,_,v,D
5,_,o,D
6,_,o,D
7,_,r,D
8,g,,S
9,e,d,S


In [6]:
# with text normalization - implemented as a pipeline process
norm_x1 = eva.Normalizer()
norm_x1.add_pipe("RemovePunctuation")                  # remove most common punctuation
norm_x1.add_pipe("SubstituteWords",{"Nam":"Vietnam"})  # normalize synonyms
norm_x1.add_pipe("Lower")                              # decapitalize

results = eva.eval_corpus(hyp_utt,ref_utt,norm=norm_x1,ALIGN=True)
eva.pp_results(results,['results','align'])
assert(results['total']==6)

Error Rate: 54.55% 
Error Details: #S=3 #I=2 #D=1
Edit Distance: 6.30 
Tokens (HYP): 12    (REF): 11 
Utterances: 2

ALIGNMENTS



Unnamed: 0,H,H.1,S,H.2,H.3,D,S.1,H.4
x,minister,daems,geeft,de,vakbonden,_,geen,keuze
y,minister,daems,stelt,de,vakbonden,voor,de,keuze


Unnamed: 0,I,H,I.1,S,H.1
x,jolly,good,day,to,vietnam
y,_,good,_,morning,vietnam


## EVALUATION from a TEXT CORPUS
Combine:
- **read_corpus()** to read reference and hypothesis texts from file and split them into lines
- **eval_corpus()** to do the evaluation
- optionally define text normalization in a **Normalizer** object to be applied to test and reference
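
The idea of a normalization pipeline, a list of text transformations applied in order to both hypothesis and reference, can be illustrated without evalign. The functions below are hypothetical stand-ins for the "RemovePunctuation", "SubstituteWords" and "Lower" pipes and only approximate what the library does.

```python
import string


def remove_punctuation(text):
    """Drop all characters in string.punctuation (an assumed punctuation set)."""
    return text.translate(str.maketrans("", "", string.punctuation))


def substitute_words(mapping):
    """Build a step that replaces whole words according to mapping."""
    def step(text):
        return " ".join(mapping.get(word, word) for word in text.split())
    return step


def run_pipeline(text, steps):
    """Apply each normalization step in order."""
    for step in steps:
        text = step(text)
    return text


steps = [remove_punctuation, substitute_words({"Nam": "Vietnam"}), str.lower]
print(run_pipeline("jolly good day to Nam", steps))
# jolly good day to vietnam
print(run_pipeline("Minister Daems stelt de vakbonden voor de keuze .", steps))
# minister daems stelt de vakbonden voor de keuze
```

The order of the steps matters: the substitution for "Nam" has to run before lowercasing, otherwise the key no longer matches.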

### CGN: raw word error rate

In [7]:
ref_fname = testdata+ "cgndev1_ref.txt"
hyp_fname = testdata + "cgndev1_asr1.txt"
ref_utt = eva.read_corpus(ref_fname)
hyp_utt = eva.read_corpus(hyp_fname)

print("\nRaw (Word) Error Rate")
results = eva.eval_corpus(hyp_utt,ref_utt)
eva.pp_results(results)


Raw (Word) Error Rate
Error Rate: 36.43% 
Error Details: #S=34 #I=7 #D=10
Edit Distance: 54.40 
Tokens (HYP): 137    (REF): 140 
Utterances: 5


### CGN: with text normalization

In [9]:
# 1. Load substitution patterns  from files 
cgn_fillers = eva.LoadSubstitutionsFromFile(resources+'cgn_fillers.lst')
nl_abbrev = eva.LoadSubstitutionsFromFile(resources+'nl_abbrev.lst')
nl_getallen100 = eva.LoadSubstitutionsFromFile(resources+'nl_getallen100.lst')
nbest = eva.LoadSubstitutionsFromFile(resources+'nbest.lst')
# 2. Create the Normalizer object  
norm_nl = eva.Normalizer()
norm_nl.add_pipe("RemovePunctuation")
norm_nl.add_pipe("SubstituteWords",nl_abbrev)
norm_nl.add_pipe("SubstituteWords",cgn_fillers)
norm_nl.add_pipe("Substitute",nl_getallen100)
norm_nl.add_pipe("Substitute",nbest)
norm_nl.add_pipe("RemoveTags")
norm_nl.add_pipe("Lower")
norm_nl.add_pipe("RemoveWhiteSpace")
#norm_nl.info()
##
print("\nError Rate after normalization and allowing for compounds")
results = eva.eval_corpus(hyp_utt,ref_utt,norm=norm_nl,CMPND=['','-'])
eva.pp_results(results)


Error Rate after normalization and allowing for compounds
Error Rate: 28.47% 
Error Details: #S=25 #I=4 #D=10
Accepted Compounds: #C=2
Edit Distance: 41.90 
Tokens (HYP): 131    (REF): 137 
Utterances: 5


### CGN: Character Error Rates

In [10]:
print("\nRaw (Character) Error Rate")
results = eva.eval_corpus(hyp_utt,ref_utt,TOKEN='CHAR',ALIGN=True)
eva.pp_results(results)


Raw (Character) Error Rate
Error Rate: 17.74% 
Error Details: #S=46 #I=50 #D=47
Edit Distance: 147.60 
Tokens (HYP): 809    (REF): 806 
Utterances: 5


### Error Analysis
Remember to set the ALIGN flag (ALIGN=True); alignments and error details are only stored in the results when it is set.

In [11]:
print("Error Rate after Normalization WITHOUT Compounds")
res1 = eva.eval_corpus(hyp_utt,ref_utt,norm=norm_nl,ALIGN=True)
eva.pp_results(res1)
#
print("Error Rate after Normalization WITH Compounds")
res2 = eva.eval_corpus(hyp_utt,ref_utt,norm=norm_nl,CMPND=['','-'],ALIGN=True)
eva.pp_results(res2)

Error Rate after Normalization WITHOUT Compounds
Error Rate: 31.39% 
Error Details: #S=27 #I=5 #D=11
Edit Distance: 45.70 
Tokens (HYP): 131    (REF): 137 
Utterances: 5
Error Rate after Normalization WITH Compounds
Error Rate: 28.47% 
Error Details: #S=25 #I=4 #D=10
Accepted Compounds: #C=2
Edit Distance: 41.90 
Tokens (HYP): 131    (REF): 137 
Utterances: 5


In [12]:
# alignment of first sentence for test2
print('ALIGNMENT of sentence(0) for test 2')
print(res2['align'][0])

ALIGNMENT of sentence(0) for test 2
[('alle', 'alle'), ('opleidingen', 'opleidingen'), ('die', 'die'), ('van', 'van'), ('de', 'de'), ('eerste', 'eerste'), ('cyclus', 'cyclus'), ('aan', 'aan'), ('het', 'het'), ('trucje', 'ruca'), ('of', 'of'), ('de', 'de'), ('ufsia', 'ufsia'), ('doorstromen', 'doorstromen'), ('naar', 'naar'), ('het', 'de'), ('tweede', 'tweede'), ('cyclus', 'cyclus'), ('van', 'van'), ('de', 'de'), ('uia', 'uia'), ('zullen', 'zullen'), ('per', 'per'), ('één', 'één'), ('oktober', 'oktober'), ('negenennegentig', 'negenennegentig'), ('door', 'door'), ('faculteiten', 'facultaire'), ('eu-beleid', '_'), ('organen', 'ua-beleidsorganen'), ('worden', 'worden'), ('gestuwd', 'gestuurd')]
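
A list of (hypothesis, reference) pairs like the one printed above is easier to read when laid out in two aligned rows. The sketch below does this in plain Python; `format_alignment` is a hypothetical helper and not part of evalign.

```python
def format_alignment(pairs):
    """Return two strings with hypothesis and reference tokens in aligned columns."""
    widths = [max(len(hyp), len(ref)) for hyp, ref in pairs]
    hyp_row = " ".join(hyp.ljust(w) for (hyp, _), w in zip(pairs, widths))
    ref_row = " ".join(ref.ljust(w) for (_, ref), w in zip(pairs, widths))
    return hyp_row.rstrip(), ref_row.rstrip()


# Excerpt of the alignment printed above
pairs = [("het", "het"), ("trucje", "ruca"), ("of", "of"), ("eu-beleid", "_")]
hyp_row, ref_row = format_alignment(pairs)
print("HYP:", hyp_row)
print("REF:", ref_row)
```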


In [13]:
print('Errors in test1')
eva.pp_results(res1,'errors')

print('\nAll Errors and Compounds in test2')
eva.pp_results(res2,'errors')

print('\nCompounds found in test2')
errc = [ error for error in res2['errors'] if (error[2] in ['C']) ]
print(errc)

Errors in test1

ERRORS



Unnamed: 0,x,y,E
0,trucje,ruca,S
1,het,de,S
2,faculteiten,facultaire,S
3,eu-beleid,_,I
4,organen,ua-beleidsorganen,S
5,gestuwd,gestuurd,S
6,verdeling,verdunning,S
7,eis,_,I
8,schaal,eischaal,S
9,collecties,eicollecties,S



All Errors and Compounds in test2

ERRORS



Unnamed: 0,x,y,E
0,trucje,ruca,S
1,het,de,S
2,faculteiten,facultaire,S
3,eu-beleid,_,I
4,organen,ua-beleidsorganen,S
5,gestuwd,gestuurd,S
6,verdeling,verdunning,S
7,eis,_,I
8,schaal,eischaal,S
9,collecties,eicollecties,S



Compounds found in test2
[('verdergaan', 'verder+gaan', 'C'), ('beet+pak', 'beetpak', 'C')]
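
The compound entries above suggest that, with CMPND=['','-'], two adjacent tokens on one side are accepted as a match when joining them with one of the listed strings gives the single token on the other side. The check below is a minimal sketch of that reading; `is_accepted_compound` is a hypothetical helper and the actual matching logic inside evalign is not shown in this notebook.

```python
def is_accepted_compound(single, parts, joiners=("", "-")):
    """True if joining parts with any of the joiners reproduces the single token."""
    return any(joiner.join(parts) == single for joiner in joiners)


print(is_accepted_compound("verdergaan", ["verder", "gaan"]))   # True
print(is_accepted_compound("beetpak", ["beet", "pak"]))         # True
print(is_accepted_compound("eischaal", ["eis", "schaal"]))      # False
```

The last case illustrates why the pair "eis schaal" / "eischaal" stays in the error list: neither "eisschaal" nor "eis-schaal" equals the reference token.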


## Some Extra Functionalities in **read_corpus()** 
+ by default it assumes that line feeds separate utterances and that utterances in reference and hypothesis match line by line
+ alternatively it accepts files in which each utterance starts with a unique KEY
    + a selection of utterances defined by keys can be made using **select_from_corpus()** 
    + a selection of matching utterances can be made with **match_corpora()**
    + this is particularly handy if you want to evaluate a small test set against a larger reference corpus
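
The key-based matching described above can be sketched in plain Python. Both functions are hypothetical illustrations, not the evalign API, and the file layout (first whitespace-delimited field is the key, the rest is the utterance) is an assumption.

```python
def parse_keyed_lines(lines):
    """Split 'KEY text ...' lines into parallel lists of utterances and keys."""
    utterances, keys = [], []
    for line in lines:
        fields = line.strip().split(maxsplit=1)
        if not fields:
            continue                      # skip empty lines
        keys.append(fields[0])
        utterances.append(fields[1] if len(fields) > 1 else "")
    return utterances, keys


def match_by_key(ref_utt, hyp_utt, ref_ids, hyp_ids):
    """Keep only utterances whose key occurs in both corpora, in reference order."""
    ref_by_key = dict(zip(ref_ids, ref_utt))
    hyp_by_key = dict(zip(hyp_ids, hyp_utt))
    common = [key for key in ref_ids if key in hyp_by_key]
    return ([ref_by_key[k] for k in common],
            [hyp_by_key[k] for k in common],
            common)


ref_utt, ref_ids = parse_keyed_lines(["utt1 hello world", "utt2 good morning", "utt3 bye"])
hyp_utt, hyp_ids = parse_keyed_lines(["utt3 by", "utt1 hello word", "utt9 extra"])
print(match_by_key(ref_utt, hyp_utt, ref_ids, hyp_ids))
# (['hello world', 'bye'], ['hello word', 'by'], ['utt1', 'utt3'])
```

Keys that occur in only one of the two corpora ("utt2" and "utt9" here) are dropped, so the resulting lists always have equal length.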

In [14]:
ref_utt, ref_ids = eva.read_corpus(testdata+"demo1_ref.txt",KEYS=True)
hyp_utt, hyp_ids = eva.read_corpus(testdata+"demo1_hyp.txt",KEYS=True)
#
# in this example there are multiple anomalies that need to be resolved 
# 1. find corresponding utterances in reference set based on test set ...
ref_utt, hyp_utt, sel_ids = eva.match_corpora(ref_utt,hyp_utt,ref_ids,hyp_ids)
#ref_utt, sel_ids = eva.select_from_corpus(ref_utt, ref_ids,selection=hyp_ids)
# 2. just in case that the test contained an utterance not in the reference ..
#hyp_utt, _ = eva.select_from_corpus(hyp_utt, hyp_ids, selection = sel_ids)
#
assert(len(hyp_utt) == len(ref_utt))
print(ref_ids,hyp_ids,sel_ids)

['v60049_nbest-dev-2008-bn-vl_004_002319-005151', 'v60049_nbest-dev-2008-bn-vl_004_012243-022033', 'v60073_nbest-dev-2008-bn-vl_001_000000-003073', 'v60073_nbest-dev-2008-bn-vl_001_044106-051592', 'v60073_nbest-dev-2008-bn-vl_001_055116-060351', 'v60073_nbest-dev-2008-bn-vl_006_000000-003472', 'v60074_nbest-dev-2008-bn-vl_001_003093-012772'] ['v60049_nbest-dev-2008-bn-vl_004_002319-005151', 'v60049_nbest-dev-2008-bn-vl_004_012243-022033', 'v60073_nbest-dev-2008-bn-vl_001_055116-060351', 'v60073_nbest-dev-2008-bn-vl_006_000000-003472', 'v60074_nbest-dev-2008-bn-vl_001_003093-012772', 'v60074_nbest-dev-2008-bn-vl_001_012792-017385'] ['v60049_nbest-dev-2008-bn-vl_004_002319-005151', 'v60049_nbest-dev-2008-bn-vl_004_012243-022033', 'v60073_nbest-dev-2008-bn-vl_001_055116-060351', 'v60073_nbest-dev-2008-bn-vl_006_000000-003472', 'v60074_nbest-dev-2008-bn-vl_001_003093-012772']


In [15]:
results = eva.eval_corpus(hyp_utt,ref_utt)
assert(results['total']==39)
eva.pp_results(results)

Error Rate: 4.63% 
Error Details: #S=20 #I=5 #D=14
Edit Distance: 41.00 
Tokens (HYP): 834    (REF): 843 
Utterances: 5


In [16]:
#
print("\nResults after Normalization")
res1 = eva.eval_corpus(hyp_utt,ref_utt,norm=norm_nl)
eva.pp_results(res1)
#
print("\nResults after Normalization and Compounding")
res2 = eva.eval_corpus(hyp_utt,ref_utt,norm=norm_nl,CMPND=['','-'],ALIGN=True)
eva.pp_results(res2)
assert(round(res2['err'],2)==3.2)


Results after Normalization
Error Rate: 4.27% 
Error Details: #S=18 #I=4 #D=14
Edit Distance: 37.80 
Tokens (HYP): 833    (REF): 843 
Utterances: 5

Results after Normalization and Compounding
Error Rate: 3.20% 
Error Details: #S=12 #I=3 #D=12
Accepted Compounds: #C=5
Edit Distance: 29.20 
Tokens (HYP): 833    (REF): 843 
Utterances: 5


## ERROR ANALYSIS

In [17]:
# look at selected errors (substitutions and deletions) and at the accepted compounds in res2

print('Substitutions and Deletions in test2')
err_sd = [ error for error in res2['errors'] if (error[2] in ['S','D']) ]
print(err_sd)
print('Compounds found in test2')
errc = [ error for error in res2['errors'] if (error[2] in ['C']) ]
print(errc)

Substitutions and Deletions in test2
[('een', 'één', 'S'), ('_', 'en', 'D'), ('_', 'meer', 'D'), ('begroting', 'begroeting', 'S'), ('ballet', 'palais', 'S'), ('_', 'we', 'D'), ('russen', 'zullen', 'S'), ('liggen', 'je', 'S'), ('worden', 'woede', 'S'), ('er', 'het', 'S'), ('_', 'nu', 'D'), ('_', 'ze', 'D'), ('_', 'een', 'D'), ('verloor', 'verloog', 'S'), ('antwerps', 'antwerpse', 'S'), ('dassen', 'dasse', 'S'), ('_', 'wel', 'D'), ('_', 'dat', 'D'), ('_', 'de', 'D'), ('_', 'de', 'D'), ('voren', 'voor', 'S'), ('_', 'de', 'D'), ('_', 'toch', 'D'), ('dassen', 'dasse', 'S')]
Compounds found in test2
[('laurent-désiré', 'laurent+désiré', 'C'), ('nul-twee', 'nul+twee', 'C'), ('vijftig+duizend', 'vijftigduizend', 'C'), ('zolang', 'zo+lang', 'C'), ('mini+koningskwestie', 'mini-koningskwestie', 'C')]
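
Since the error list is a list of (hypothesis, reference, type) tuples, standard Python tools are enough for a quick tally. The sketch below uses `collections.Counter` on a hand-copied excerpt of the output above; with a real result you would pass `res2['errors']` instead.

```python
from collections import Counter

# Excerpt of the error tuples printed above: (hyp token, ref token, error type)
errors = [
    ("een", "één", "S"), ("_", "en", "D"), ("begroting", "begroeting", "S"),
    ("dassen", "dasse", "S"), ("_", "de", "D"), ("_", "de", "D"),
    ("dassen", "dasse", "S"), ("zolang", "zo+lang", "C"),
]

# How many entries of each type
by_type = Counter(kind for _, _, kind in errors)
print(by_type)

# Most frequent substitution pair
confusions = Counter((hyp, ref) for hyp, ref, kind in errors if kind == "S")
print(confusions.most_common(1))
# [(('dassen', 'dasse'), 2)]
```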


## Dutch ASR Evaluation for NBest and CGN benchmarks

## MINI-CGN DEMO

In [18]:
# Load the small CGN development subset
ref_utt = eva.read_corpus(testdata+"cgndev1_ref.txt")
hyp_utt = eva.read_corpus(testdata+"cgndev1_asr1.txt")
res_word = eva.eval_corpus(hyp_utt,ref_utt)
res_char = eva.eval_corpus(hyp_utt,ref_utt,TOKEN="CHAR")
print("WORD ERROR RATE")
eva.pp_results(res_word)
print("CHARACTER ERROR RATE")
eva.pp_results(res_char)

WORD ERROR RATE
Error Rate: 36.43% 
Error Details: #S=34 #I=7 #D=10
Edit Distance: 54.40 
Tokens (HYP): 137    (REF): 140 
Utterances: 5
CHARACTER ERROR RATE
Error Rate: 17.74% 
Error Details: #S=46 #I=50 #D=47
Edit Distance: 147.60 
Tokens (HYP): 809    (REF): 806 
Utterances: 5


## NBest dev-set (full)
Normalization and compound processing reduce the error rate from 8.3% to 4.3%.   
This test takes some time (a few minutes), depending on the speed of your machine. The cause is the very long, paragraph-length
utterances (> 100 words/paragraph) that need to be processed.
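
The size of this improvement is often quoted as a relative reduction. The small helper below (hypothetical, not part of evalign) computes it from the two error rates checked in the cell that follows.

```python
def relative_reduction(before, after):
    """Relative error rate reduction in percent."""
    return 100.0 * (before - after) / before


# 8.31% raw error rate versus 4.32% after normalization and compounding
print(f"{relative_reduction(8.31, 4.32):.1f}%")   # 48.0%
```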

In [19]:
ref_file = testdata+"nbdev_ref.txt"
hyp_file = testdata+"nbdev_asr100.txt"
ref_utt, ref_ids = eva.read_corpus(ref_file,KEYS=True)
hyp_utt, hyp_ids = eva.read_corpus(hyp_file,KEYS=True)

print("\nRaw Error Rate")
results1 = eva.eval_corpus(hyp_utt,ref_utt)
eva.pp_results( results1 )
print("\nError Rate after normalization")
eva.pp_results( eva.eval_corpus(hyp_utt,ref_utt,norm=norm_nl) )
print("\nError Rate after normalization and allowing for compounds")
results3 = eva.eval_corpus(hyp_utt,ref_utt,norm=norm_nl,CMPND=['','-']) 
eva.pp_results(results3)
assert( round(results1['err'],2) == 8.31 )
assert( round(results3['err'],2) == 4.32 )


Raw Error Rate
Error Rate: 8.31% 
Error Details: #S=519 #I=220 #D=119
Edit Distance: 909.90 
Tokens (HYP): 10425    (REF): 10324 
Utterances: 71

Error Rate after normalization
Error Rate: 6.24% 
Error Details: #S=373 #I=151 #D=120
Edit Distance: 681.30 
Tokens (HYP): 10350    (REF): 10319 
Utterances: 71

Error Rate after normalization and allowing for compounds
Error Rate: 4.32% 
Error Details: #S=268 #I=73 #D=105
Accepted Compounds: #C=101
Edit Distance: 493.00 
Tokens (HYP): 10350    (REF): 10319 
Utterances: 71
