# BLEU Scores for Transcriptions

Why BLEU? It was designed for machine translation: it scores a hypothesis against multiple expert references, and it uses higher-order n-grams to approximate fluency and other
stylistic elements. If we instead score the expert transcriptions against each other, the result should be a reasonable measure of transcriber agreement.
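
To make the n-gram intuition concrete, here is a minimal sketch of the clipped (modified) n-gram precision that BLEU is built on. It is not NLTK's implementation: the helper names `ngrams` and `clipped_precision` and the example sentences are made up for illustration, and the brevity penalty and geometric averaging over n are left out.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(references, hypothesis, n):
    """Share of hypothesis n-grams found in the references, with each
    n-gram's count clipped to its maximum count in any single reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())

# Hypothetical transcriptions of one utterance.
refs = ["unit five responding to the scene".split(),
        "unit five is responding".split()]
hyp = "unit five responding now".split()

p1 = clipped_precision(refs, hyp, 1)  # 3 of 4 unigrams match
p2 = clipped_precision(refs, hyp, 2)  # 2 of 3 bigrams match
```

The bigram precision drops faster than the unigram precision when word order or phrasing differs, which is why higher-order n-grams act as a rough proxy for fluency.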

### Load Data

In [1]:
from asr_dataset.police import BpcETL, AmbiguityStrategy
from asr_dataset.constants import Cluster
import pandas as pd
import numpy as np
import librosa

In [2]:
etl = BpcETL(Cluster.RCC, 
    filter_inaudible=False, 
    filter_numeric=False, 
    filter_uncertain=False,
    ambiguity=AmbiguityStrategy.ALL)

In [3]:
data = etl.extract()

### Define Utterance Alignments

The problem is that transcribers disagree on utterance existence and duration.
The goal is to produce an alignment (mapping) that links utterances across transcribers.
The solution is a monotonic alignment that increments when there's unanimous agreement the utterance is over. In other words, we merge utterances if there's any evidence they're connected.

Algorithm:
Every utterance starts in a singleton group. Sort the groups by end time and compare
the i-th group with the (i-1)-th group. If the i-th group overlaps the previous group in time, merge them.
Otherwise, keep the i-th utterance as a new group. Continue until all groups are disjoint.
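
The merge step can be sketched in plain Python on bare `(start, end)` tuples. This is a minimal sketch, not the notebook's code: the function name `merge_intervals` and the sample times are hypothetical. It also sorts by start time rather than end time (a swapped-in choice), because with a start-time sort a single pass that checks only the most recent group is enough to leave all groups disjoint.

```python
def merge_intervals(intervals):
    """Merge (start, end) pairs that overlap, directly or transitively."""
    merged = []
    for start, end in sorted(intervals):  # sorted by start time
        if merged and merged[-1][1] >= start:
            # Overlaps the most recent group: extend that group.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Gap before this interval: start a new group.
            merged.append((start, end))
    return merged

# Hypothetical offsets and ends (in seconds), pooled across transcribers.
utterances = [(0.0, 2.0), (1.5, 3.0), (2.8, 4.0), (6.0, 7.0)]
groups = merge_intervals(utterances)
```

Here the first and third utterances do not overlap each other, but both overlap the second, so all three end up in one group.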

In [4]:
data = data.assign(end=data['offset']+data['duration'])

In [5]:
class OverInterval(pd.Interval):
    """ An Interval that is considered equal to other overlapping Intervals."""
    def __lt__(self, other):
        return other.left > self.right
    def __gt__(self, other):
        return self.left > other.right
    def __le__(self, other):
        return self < other or self == other
    def __ge__(self, other):
        return self > other or self == other
    def __eq__(self, other):
        return not self < other and not self > other


In [6]:
# Group all utterances that overlap transitively,
#   e.g. group a,b,c together if a & b and b & c regardless of a & c
alignments = {}
for tup in data.sort_values('end').itertuples():
    row = tup._asdict()
    current = OverInterval(row['offset'], row['end'])
    intervals = alignments.get(row['original_audio'], [])
    if intervals and intervals[-1].right >= current.left:
        left = min(intervals[-1].left, current.left)
        right = max(intervals[-1].right, current.right)
        intervals[-1] = OverInterval(left, right)
    else:
        intervals.append(current)
    alignments[row['original_audio']] = intervals

In [7]:
from bisect import bisect_left, bisect_right

def index(a, x):
    'Locate the leftmost value exactly equal to x'
    i = bisect_left(a, x)
    if i != len(a) and a[i] == x:
        return i
    raise ValueError

In [8]:
def aligner(row):
    iv = OverInterval(row['offset'], row['end'])
    return index(alignments[row['original_audio']], iv)

aligned = data.apply(aligner, axis=1)
data = data.assign(alignment = aligned)
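
The lookup works because overlapping intervals compare as equal, so `bisect_left` on a sorted list of disjoint groups lands on the one group that contains the utterance. The sketch below shows this with a small stand-in class; `Span` and `find_group` are hypothetical names, not part of the notebook or of pandas.

```python
from bisect import bisect_left

class Span:
    """Stand-in for OverInterval: overlapping spans compare as equal."""
    def __init__(self, left, right):
        self.left, self.right = left, right
    def __lt__(self, other):
        return self.right < other.left
    def __gt__(self, other):
        return self.left > other.right
    def __eq__(self, other):
        return not self < other and not self > other

def find_group(groups, span):
    """Index of the group that overlaps span; ValueError if none does."""
    i = bisect_left(groups, span)
    if i != len(groups) and groups[i] == span:
        return i
    raise ValueError("no overlapping group")

# Disjoint groups, sorted by time (hypothetical values in seconds).
groups = [Span(0.0, 4.0), Span(6.0, 7.0), Span(9.0, 12.0)]
idx = find_group(groups, Span(6.2, 6.8))  # falls inside the second group
```

This relies on the groups being disjoint and sorted; otherwise the ordering used by `bisect_left` is not well defined.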

### Filter mono-transcriptions

We can't compute agreement scores when only one transcriber saw the audio.

In [9]:
scribers = data[['original_audio', 'transcriber']].drop_duplicates()
scribecount = scribers.groupby('original_audio').count().rename(columns={'transcriber':'count'}).reset_index()
multi_scribers = scribecount.loc[scribecount['count'] > 1, 'original_audio']
scribers = scribers.loc[scribers['original_audio'].isin(multi_scribers)]

### Infill missing utterances

After merging utterances, some transcribers still did not record speech during
the merged interval. To get consistent corpus BLEU scores, we need to fill these
in as empty strings.
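
The same idea can be sketched without pandas: build every (audio, alignment, transcriber) combination and fall back to an empty string where a transcriber recorded nothing. The record layout and file name below are hypothetical and only mirror the columns used in this notebook.

```python
from collections import defaultdict

# Hypothetical records: (audio, alignment, transcriber) -> text
recorded = {
    ("a.wav", 0, "t1"): "copy that",
    ("a.wav", 0, "t2"): "copy",
    ("a.wav", 1, "t1"): "ten four",  # t2 recorded nothing for group 1
}

# Collect the transcribers and alignment groups seen for each audio file.
transcribers = defaultdict(set)
groups = defaultdict(set)
for audio, alignment, transcriber in recorded:
    transcribers[audio].add(transcriber)
    groups[audio].add(alignment)

# One entry per (audio, alignment, transcriber); gaps become "".
filled = {
    (audio, alignment, transcriber): recorded.get((audio, alignment, transcriber), "")
    for audio in groups
    for alignment in sorted(groups[audio])
    for transcriber in sorted(transcribers[audio])
}
```

After infilling, every transcriber contributes exactly one string per alignment group, so the per-transcriber sentence lists have equal length.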

In [40]:
aud_intervals = data[['original_audio', 'alignment']].drop_duplicates()
corpus_prep = aud_intervals.merge(scribers) \
                    .merge(data, how='left') \
                    .assign(text = lambda x: x['text'].fillna(""))

### Use alignment to concatenate utterances and corpora

In [123]:
# Sentence-level data frame for sentence-level bleu score
utterances = corpus_prep.sort_values('offset') \
            .groupby(['original_audio', 
                    'alignment',
                    'transcriber']) \
            .agg({"text": " ".join}) \
            .reset_index()      

In [124]:
# Corpus-level data frame for corpus-level bleu score
corpuses = utterances.sort_values('alignment') \
                    .groupby(['original_audio', 'transcriber']) \
                    .agg({"text": lambda x: list(x)}) \
                    .reset_index()       

### Define BLEU

We use each transcriber in turn as the hypothesis, with the remaining transcribers as references, and average the scores over all audio files.

In [115]:
import warnings
from nltk.translate.gleu_score import sentence_gleu, corpus_gleu
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

def sentence_metric(sents, mfunc):
    """Averages a sentence-level metric over a list of candidate strings, using each candidate in turn as the hypothesis"""
    metric = 0
    assert len(sents) > 1
    for i in range(len(sents)):
        refs = [x.split() for x in sents]
        hyp = refs.pop(i)
        metric += mfunc(refs, hyp)
    return metric / len(sents)

def corpus_metric(corps, mfunc):
    """Averages a corpus-level metric over a list of lists of candidate strings, using each corpus in turn as the hypothesis"""
    metric = 0
    for i in range(len(corps)):
        refs = [[sent.split() for sent in corp] for corp in corps]
        hyp = refs.pop(i)
        refs = [list(group) for group in zip(*refs)]  # transpose: one list of references per sentence
        metric += mfunc(refs, hyp)
    return metric / len(corps)

def sblue(sents):
    return sentence_metric(sents, sentence_bleu)

def sglue(sents):
    return sentence_metric(sents, sentence_gleu)

def cblue(corps):
    return corpus_metric(corps, corpus_bleu)

def cglue(corps):
    return corpus_metric(corps, corpus_gleu)

def score_sentence(utterances, text_col):
    """ 
        Expects data frame of {original_audio, alignment, transcriber, text_col} 
        where text_col contains sentence strings
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        scores = utterances.groupby(['original_audio', 'alignment']) \
                        .agg({text_col: [sblue, sglue]}) \
                        .agg('mean')
        print(scores)

def score_corpus(corpuses, text_col):
    """ 
        Expects data frame of {original_audio, transcriber, text_col} 
        where text_col contains a list of sentence strings
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        scores = corpuses.groupby(['original_audio']) \
                .agg({text_col: [cblue, cglue]}) \
                .agg('mean')
        print(scores)
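
The reshaping inside `corpus_metric` is easy to misread, so here it is in isolation. NLTK's `corpus_bleu` expects, for each hypothesis sentence, a list of reference sentences, whereas the data arrives as one list of sentences per transcriber. The toy corpora below are hypothetical.

```python
# One list of tokenized sentences per transcriber, all aligned to the
# same three utterance groups (an empty list is an infilled gap).
corpora = [
    [["copy", "that"], ["ten", "four"], []],        # transcriber 0
    [["copy"], ["ten", "four"], ["over"]],          # transcriber 1
    [["copy", "that"], ["four"], ["over", "out"]],  # transcriber 2
]

refs = [list(corp) for corp in corpora]
hyp = refs.pop(0)  # transcriber 0 becomes the hypothesis

# Transpose from [transcriber][sentence] to [sentence][transcriber].
refs = [list(group) for group in zip(*refs)]
```

After the transpose, `refs[k]` holds every remaining transcriber's version of sentence `k`, which lines up one-to-one with `hyp[k]`.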


### Compute BLEU

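**Takeaway** Agreement is low: sentence-level BLEU is about 0.14, while sentence-level GLEU and both corpus-level scores are about 0.33 to 0.34.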

In [114]:
score_sentence(utterances, 'text')

text  sblue    0.144809
      sglue    0.330698
dtype: float64


In [116]:
score_corpus(corpuses, 'text')

text  cblue    0.335554
      cglue    0.339263
dtype: float64


### Repeat BLEU with word stemming

**Takeaway** Stemming barely changes the scores, and the result is insensitive to the choice of stemming algorithm (Snowball vs. Lancaster).

In [117]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
corpuses = corpuses.assign(stemmed = [[" ".join([stemmer.stem(wrd) for wrd in sent.split()]) for sent in corp] for corp in corpuses['text']])
utterances = utterances.assign(stemmed = [" ".join([stemmer.stem(wrd) for wrd in sent.split()]) for sent in utterances['text']])

In [118]:
score_sentence(utterances, 'stemmed')

stemmed  sblue    0.148959
         sglue    0.334716
dtype: float64


In [119]:
score_corpus(corpuses, 'stemmed')

stemmed  cblue    0.343932
         cglue    0.346531
dtype: float64


In [120]:
from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
corpuses = corpuses.assign(stemmed = [[" ".join([stemmer.stem(wrd) for wrd in sent.split()]) for sent in corp] for corp in corpuses['text']])
utterances = utterances.assign(stemmed = [" ".join([stemmer.stem(wrd) for wrd in sent.split()]) for sent in utterances['text']])

In [121]:
score_sentence(utterances, 'stemmed')

stemmed  sblue    0.148435
         sglue    0.334357
dtype: float64


In [122]:
score_corpus(corpuses, 'stemmed')

stemmed  cblue    0.340857
         cglue    0.343985
dtype: float64
