<a href="https://colab.research.google.com/github/danijel3/CroatianSpeech/blob/main/Matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automatic close matching

This is an attempt to automatically match ASR output against long sequences of reference text. Neither the ASR output nor the reference text is fully accurate, but we do our best to match them as closely as possible.

Here we download 3 files:
* char-level ASR output in `output.json`
* LM-corrected ASR output in `output_lm.json`
* reference text in `ParlaMint...txt`

In [2]:
!wget https://github.com/danijel3/CroatianSpeech/raw/main/output.json
!wget https://github.com/danijel3/CroatianSpeech/raw/main/output_lm.json
!wget https://github.com/danijel3/CroatianSpeech/raw/main/ParlaMint-HR_S01.text.txt

--2021-12-26 21:18:51--  https://github.com/danijel3/CroatianSpeech/raw/main/output.json
Resolving github.com (github.com)... 13.114.40.48
Connecting to github.com (github.com)|13.114.40.48|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/danijel3/CroatianSpeech/main/output.json [following]
--2021-12-26 21:18:52--  https://raw.githubusercontent.com/danijel3/CroatianSpeech/main/output.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11513 (11K) [text/plain]
Saving to: ‘output.json’


2021-12-26 21:18:52 (91.0 MB/s) - ‘output.json’ saved [11513/11513]

--2021-12-26 21:18:52--  https://github.com/danijel3/CroatianSpeech/raw/main/output_lm.json
Resolving github.com (github.com)... 13.114.40.48
C

Next we also download some normalization code. The ASR output is normalized, so the reference text also has to be normalized to be able to match anything.

In [3]:
!wget https://github.com/danijel3/TextNormalize/raw/master/hrvatski.py
!wget https://github.com/danijel3/TextNormalize/raw/master/strings_hr.py

--2021-12-26 21:18:55--  https://github.com/danijel3/TextNormalize/raw/master/hrvatski.py
Resolving github.com (github.com)... 13.114.40.48
Connecting to github.com (github.com)|13.114.40.48|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/danijel3/TextNormalize/master/hrvatski.py [following]
--2021-12-26 21:18:56--  https://raw.githubusercontent.com/danijel3/TextNormalize/master/hrvatski.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5761 (5.6K) [text/plain]
Saving to: ‘hrvatski.py’


2021-12-26 21:18:56 (55.9 MB/s) - ‘hrvatski.py’ saved [5761/5761]

--2021-12-26 21:18:56--  https://github.com/danijel3/TextNormalize/raw/master/strings_hr.py
Resolving github.com (github.com)... 140.82.112.4
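
To make the need for normalization concrete, here is a minimal sketch. The function `toy_normalize` is a hypothetical stand-in written for this illustration only; it is not the code in `hrvatski.py`, which may well do more (for instance handling numbers or abbreviations). It only lowercases the text and drops punctuation, which is already enough to show why raw reference text would not line up with ASR-style output.

```python
import re

def toy_normalize(line: str) -> str:
    """Hypothetical stand-in for a normalizer: lowercase and drop punctuation."""
    line = line.lower()
    # \w is Unicode-aware in Python 3, so letters such as č, ć, đ, š, ž are kept
    line = re.sub(r"[^\w\s]", " ", line)
    # collapse runs of whitespace left behind by the removed punctuation
    return " ".join(line.split())

raw = "Cijenjene gospođe i gospodo, sukladno članku..."
print(raw.split())                  # raw tokens still carry capitals and punctuation
print(toy_normalize(raw).split())   # normalized tokens are comparable to ASR words
```

Without this step, a token such as `gospodo,` would never equal the ASR word `gospodo`, so no anchor positions would be found for it.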


Below we define a class for the reference text dictionary. We replace all words with integer IDs to make the whole task faster and more robust. We also define a function to load the reference file - this creates the dictionary and automatically converts the reference to integer IDs.

In [54]:
from hrvatski import normalize
from dataclasses import dataclass,field
from typing import Dict,Set,List,Tuple

@dataclass
class Dictionary:
  word2id:Dict[str,int]=field(default_factory=dict)
  id2word:Dict[int,str]=field(default_factory=dict)
  oov_token:str='<unk>'

  def put(self, words:Set[str]):
    # assign consecutive IDs in sorted order so the mapping is deterministic
    for idx,word in enumerate(sorted(words)):
      self.word2id[word]=idx
      self.id2word[idx]=word

  def get_id(self,word:str,warn_oov:bool=False)->int:
    # out-of-vocabulary words map to -1
    if word not in self.word2id:
      if warn_oov:
        print(f'WARN: missing word "{word}"')
      return -1
    return self.word2id[word]

  def get_word(self,id:int)->str:
    # -1 is the out-of-vocabulary ID
    if id==-1:
      return self.oov_token
    return self.id2word[id]

  def get_text(self,ids:List[int])->str:
    return ' '.join([self.get_word(x) for x in ids])

  def get_ids(self,text:str,warn_oov:bool=False)->List[int]:
    return [self.get_id(x,warn_oov) for x in text.strip().split()]


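def load_ref(file):
  lines=[]
  text=[]
  words=set()
  vocab=Dictionary()
  # the reference contains Croatian diacritics, so read it explicitly as UTF-8
  with open(file,encoding='utf-8') as f:
    for l in f:
      tok=normalize(l.strip()).strip().split()
      lines.append(tok)
      words.update(tok)
  # build the vocabulary first, then flatten all lines into one ID sequence
  vocab.put(words)
  for l in lines:
    for w in l:
      text.append(vocab.get_id(w))
  return text,vocab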

Here we use the above function to load the reference:

In [5]:
text,vocab=load_ref('ParlaMint-HR_S01.text.txt')

Looking at the first few words:

In [6]:
print(text[:10])
print(vocab.get_text(text[:10]))

[1937, 5788, 6501, 5783, 30234, 36700, 36530, 29594, 4281, 21203]
cijenjene gospođe i gospodo sukladno članku četiri stavak dva poslovnika


Next we load the ASR output:

In [36]:
import json
with open('output.json') as f:
  output=json.load(f)

And the LM corrected version:

In [53]:
import json
with open('output_lm.json') as f:
  output_lm=json.load(f)

This is what the ASR text output looks like. We can also convert it to IDs if necessary.

In [14]:
print(output[0]['text'])
print(vocab.get_ids(output[0]['text'],True))

potpredsjedniče poštovane kolegice i kolege ovo je još jedan od zakona koji raspravljamo ovih dana ili ćemo raspravljati narednih dana koji su možda imali dobar motiv da budu upućeni hrvatskom saboru ali
[21617, 22179, 9837, 6501, 9833, 18701, 8899, 9107, 8901, 16637, 35299, 9809, 25836, 18664, 2180, 6624, 36406, 25844, 14112, 2180, 9809, 30076, 13050, 6648, 3058, 13009, 2105, 1731, 32899, 6431, 27303, 226]
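
The following minimal sketch shows what happens to a word that is not in the reference vocabulary. It does not use the `Dictionary` class itself: the toy vocabulary, the helper `to_ids`, and the misrecognized word `gospoe` are all made up for the illustration, and only mirror the conventions used above (sorted words get consecutive IDs, unknown words become -1).

```python
# Toy vocabulary built the same way as Dictionary.put: sorted words get consecutive IDs
ref_words = "cijenjene gospođe i gospodo".split()
word2id = {w: i for i, w in enumerate(sorted(set(ref_words)))}

def to_ids(text):
    # unknown words map to -1, mirroring Dictionary.get_id
    return [word2id.get(w, -1) for w in text.split()]

ref_ids = to_ids("cijenjene gospođe i gospodo")
asr_ids = to_ids("cijenjene gospoe i gospodo")  # "gospoe" is a made-up misrecognition

print(ref_ids)
print(asr_ids)
# positions where the two ID sequences disagree
print([i for i, (a, r) in enumerate(zip(asr_ids, ref_ids)) if a != r])
```

Because -1 never occurs in the reference ID sequence, a misrecognized word can never be matched by accident; it simply counts as a mismatch.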


## Main method

This is the main method for finding closely matching word sequences. The algorithm is as follows:
1. first find all occurrences of the first word of the ASR sequence in the reference sequence
2. this usually yields anything from a few to a few hundred positions; if none are found, try the second word, then the third, and so on, until at least one match is found
3. next use `SequenceMatcher` to calculate a Levenshtein-like ratio between the ASR sequence and the reference window starting at each position from the previous step
4. find the position with the best ratio
5. usually this yields one result; if none is found, print an error; if several are found, the ASR output occurs several times in the reference text
6. finally, determine the length of the match within the reference sequence by creating another `SequenceMatcher` object and finding the end of the last matching block
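
Steps 1, 3 and 4 can be tried in isolation on toy data. The ID lists `asr` and `ref` below are invented for this sketch, and the candidate search is written as a plain list comprehension rather than the helper used in the notebook.

```python
from difflib import SequenceMatcher

asr = [5, 9, 2, 7]                        # toy ASR word IDs
ref = [5, 1, 1, 4, 5, 9, 3, 7, 8, 5, 9]   # toy reference word IDs

# step 1: every position where the first ASR word occurs in the reference
candidates = [i for i, w in enumerate(ref) if w == asr[0]]

# steps 3-4: score the same-length window at each candidate and keep the best one
scores = {
    p: SequenceMatcher(a=asr, b=ref[p:p + len(asr)], autojunk=False).ratio()
    for p in candidates
}
best = max(scores, key=scores.get)

print(candidates)
print(scores)
print(best)
```

The ratio is `2*M/T`, where `M` is the number of matched elements and `T` the combined length of both sequences, so the window starting at position 4 (three of four words matched) wins over the other two candidates.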

In [56]:
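from difflib import SequenceMatcher
from typing import List,Optional,Tuple

def findall(id, sequence):
  # collect every position at which id occurs in sequence
  ret=[]
  off=0
  N=len(sequence)
  while off<N:
    try:
      pos=sequence.index(id,off)
      ret.append(pos)
      off=pos+1
    except ValueError:
      break
  return ret

def close_match(ids:List[int],text:List[int])->Optional[Tuple[int,int]]:
  N=len(ids)
  if N==0:
    print('ERROR: empty input sequence!')
    return None
  # candidate start positions: occurrences of the first ASR word
  p1=findall(ids[0],text)
  poff=1
  # if that word is missing, anchor on the next word and shift back by its offset;
  # skip anchors that would imply a start before the beginning of the reference
  while len(p1)==0 and poff<N:
    p2=findall(ids[poff],text)
    p1=[x-poff for x in p2 if x>=poff]
    poff+=1
  # score each candidate window and keep the position(s) with the best ratio
  max_r=0
  pf=[]
  for p in p1:
    sm=SequenceMatcher(a=ids,b=text[p:p+N],autojunk=False)
    r=sm.ratio()
    if r>max_r:
      max_r=r
      pf=[p]
    elif r==max_r:
      pf.append(p)

  if len(pf)==0:
    print('ERROR: no candidates found!')
    return None
  elif len(pf)>1:
    print('WARNING: multiple candidates found!')

  pf=pf[0]

  # re-match against a slightly longer window; the final block is a zero-size
  # sentinel, so mb[-2] is the last real matching block
  mb=SequenceMatcher(a=ids,b=text[pf:pf+N+10],autojunk=False).get_matching_blocks()
  m=mb[-2]
  M=m.b+m.size

  return pf,pf+M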
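
The end-of-match step at the bottom of `close_match` can be illustrated on its own. The two ID lists here are toy data invented for the sketch; the reference window contains one word more than the ASR sequence, as happens when the recognizer drops a word.

```python
from difflib import SequenceMatcher

asr = [5, 9, 2, 7]               # toy ASR word IDs
window = [5, 9, 3, 3, 7, 8, 8]   # toy reference window, longer than the ASR sequence

blocks = SequenceMatcher(a=asr, b=window, autojunk=False).get_matching_blocks()

# the final block is always a zero-length sentinel (len(a), len(b), 0),
# so blocks[-2] is the last real match
last = blocks[-2]
end = last.b + last.size

print(blocks)
print(window[:end])
```

Here the last real block matches the final ASR word at index 4 of the window, so the matched reference span covers five words even though the ASR sequence has only four. This is why the second matcher looks at a window somewhat longer than `N`.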

We can test this on all ASR outputs:

In [60]:
for seg in output:
  ids=vocab.get_ids(seg['text'])
  match=close_match(ids,text)
  if match is None:
    # close_match already printed an error for this segment
    continue
  ps,pe=match
  print('ASR: '+seg['text'])
  print('REF: '+vocab.get_text(text[ps:pe]))

ASR: potpedsjednie poštovane kolegice i kolege ovo je još jedan od zakona koji raspravljamo ovih dana ili ćemo raspravljati narednih dana koji su možda imali dobar motiv da budu upućeni hrvatskom saboru ali
REF: potpredsjedniče poštovane kolegice i kolege ovo je još jedan od zakona koji raspravljamo ovih dana ili ćemo raspravljati narednih dana koji su možda imali dobar motiv da budu upućeni hrvatskom saboru ali
ASR: oblik u kojem su došli u hrvatski sabor prijedlog za rješenje problema ili motiva koji je naše kolege motivirao da to upute u hrvatski sabor nažalost nije takav i bojimo se da će stvoriti mnogo više teškoća nego li što će donijeti rješenja sa sobom to je ujedno iskustvo
REF: oblik u kojem su došli u hrvatski sabor prijedlog za rješenje problema ili motiva koji je naše kolege motivirao da to upute u hrvatski sabor nažalost nije takav i bojimo se da će stvoriti mnogo više teškoća nego li što će donijeti rješenja sa sobom to je ujedno iskustvo
ASR: nadam se da se to vidi i po

And the same for the LM-corrected output (which is generally simpler to match):

In [61]:
for seg in output_lm:
  ids=vocab.get_ids(seg['text'])
  match=close_match(ids,text)
  if match is None:
    # close_match already printed an error for this segment
    continue
  ps,pe=match
  print('ASR: '+seg['text'])
  print('REF: '+vocab.get_text(text[ps:pe]))

ASR: potpredsjedniče poštovane kolegice i kolege ovo je još jedan od zakona koji raspravljamo ovih dana ili ćemo raspravljati narednih dana koji su možda imali dobar motiv da budu upućeni hrvatskom saboru ali
REF: potpredsjedniče poštovane kolegice i kolege ovo je još jedan od zakona koji raspravljamo ovih dana ili ćemo raspravljati narednih dana koji su možda imali dobar motiv da budu upućeni hrvatskom saboru ali
ASR: oblik u kojem su došli u hrvatski sabor prijedlog za rješenje problema ili motiva koji je naše kolege motivirao da to upute u hrvatski sabor nažalost nije takav i bojimo se da će stvoriti mnogo više teškoća nego li što će donijeti rješenja sa sobom to je ujedno iskustvo
REF: oblik u kojem su došli u hrvatski sabor prijedlog za rješenje problema ili motiva koji je naše kolege motivirao da to upute u hrvatski sabor nažalost nije takav i bojimo se da će stvoriti mnogo više teškoća nego li što će donijeti rješenja sa sobom to je ujedno iskustvo
ASR: nadam se da se to vidi i 