# Extended Examples

In [1]:
import gtn
import nb_utils
import math
nb_utils.init()

## Counting $n$-grams

In this example, we'll use graph operations to count the number of $n$-grams in a string. 

Suppose we have a string $aaabaa$ and we want to know the frequency of each bigram. In this case the bigrams contained in the string are $aa$, $ab$, and $ba$ with frequencies of $3$, $1$, and $1$ respectively. In the general case we are given an input string $x$ and an $n$-gram, $y$, and the goal is to count the number of occurrences of $y$ in $x$.
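As a sanity check on these counts, the bigram frequencies can be tallied directly with a sliding window in plain Python, without any graphs. This is only a reference sketch. The helper name `count_ngrams` is ours and is not part of `gtn`.

```python
from collections import Counter

def count_ngrams(x, n):
    """Count every n-gram in the string x using a sliding window of width n."""
    return Counter(x[i:i + n] for i in range(len(x) - n + 1))

# Tally the bigrams of the example string.
bigram_counts = count_ngrams("aaabaa", 2)
print(dict(bigram_counts))
```

The graph-based approach in the rest of this example should agree with these counts for any single bigram.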

For a given $n$-gram, the first step is to construct the graph which matches that $n$-gram at any location in the string. We want to construct the graph equivalent of the regular expression $.*y.*$ where $.*$ indicates zero or more occurrences of any token in the token set.

Suppose we want to count the number of occurrences of the bigram $aa$ in $aaabaa$. For the bigram $aa$ and the token set $\{a, b, c\}$, the $n$-gram matching graph is shown below.

<div class="figure">
  <div class="img">
    <img src="figures/bigram_aa.svg"/>
  </div>
  <div class="caption" markdown="span">
      The $n$-gram matching graph for the bigram $aa$ over the token set $\{a, b, c\}$.
  </div>
</div>

We can encode the string $aaabaa$ as a simple chain graph.

In [2]:
symbols = {0: "a", 1: "b", 2: "c"}

# Encode the string "aaabaa" as integer ids:
x = [0, 0, 0, 1, 0, 0]  
g = gtn.Graph()
g.add_node(start=True)
for i, l in enumerate(x):
    g.add_node(accept=(i + 1 == len(x)))
    g.add_arc(src_node=i, dst_node=i + 1, label=l)
gtn.draw(g, "figures/nb/ngram_string.svg", isymbols=symbols)

bigram = gtn.Graph()
bigram.add_node(start=True)
bigram.add_node()
bigram.add_node(accept=True)
bigram.add_arc(src_node=0, dst_node=1, label=0)
bigram.add_arc(src_node=1, dst_node=2, label=0)
for l in range(len(symbols)):
    bigram.add_arc(src_node=0, dst_node=0, label=l)
    bigram.add_arc(src_node=2, dst_node=2, label=l)

<div class="figure">
  <div class="img">
    <img src="figures/nb/ngram_string.svg"/>
  </div>
  <div class="caption" markdown="span">
      The string $aaabaa$ encoded as a chain graph.
  </div>
</div>

We then compute the intersection of the graph representing the string and the graph representing the bigram. The number of paths in this graph represents the number of occurrences of the bigram in the string.
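To see why the path count equals the occurrence count, here is a minimal plain-Python sketch that counts accepting paths of the matching graph on a string by dynamic programming. It does not use `gtn`. The function name `count_matching_paths` and the state numbering are assumptions made for illustration. Each accepting path is determined by the position at which it leaves the start state, so there is exactly one path per occurrence.

```python
def count_matching_paths(x, ngram, tokens):
    """Count accepting paths of the `.*y.*` matcher on the string x.

    State k means the first k tokens of the n-gram have been matched.
    State 0 (start) and state n (accept) have self-loops on every token.
    """
    n = len(ngram)
    paths = [0] * (n + 1)  # paths[k] = number of partial paths ending in state k
    paths[0] = 1           # one empty path sits in the start state
    for tok in x:
        new = [0] * (n + 1)
        if tok in tokens:
            new[0] += paths[0]  # self-loop on the start state
            new[n] += paths[n]  # self-loop on the accept state
        for k in range(n):
            if tok == ngram[k]:
                new[k + 1] += paths[k]  # advance one step through the n-gram
        paths = new
    return paths[n]

num_aa = count_matching_paths("aaabaa", "aa", "abc")
print(num_aa)
```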

In [19]:
bigram_paths = gtn.intersect(g, bigram)
gtn.draw(bigram_paths, "figures/nb/bigram_paths.svg", isymbols=symbols)

<div class="figure">
  <div class="img">
    <img src="figures/nb/bigram_paths.svg"/>
  </div>
  <div class="caption" markdown="span">
      The intersection of the string graph and the bigram matching graph.
  </div>
</div>

Since each path has a weight of $0$, we can count the number of unique paths in the intersected graph by using the forward score. Assume the intersected graph has $p$ paths. The forward score of the graph is $s = \log \sum_{i=1}^p e^0 = \log p$. So the total number of paths is $p = e^s$.
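The identity $p = e^s$ can be checked numerically with a hand-written log-sum-exp. The sketch below is plain Python and independent of `gtn`. The helper `log_sum_exp` is our own name, and the three zero-weight paths are assumed purely as an example.

```python
import math

def log_sum_exp(scores):
    """Numerically stable log(sum(exp(s) for s in scores))."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

path_scores = [0.0, 0.0, 0.0]  # three paths, each with weight 0
s = log_sum_exp(path_scores)   # plays the role of the forward score
p = math.exp(s)                # exponentiate to recover the path count
print(round(p))
```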

In [20]:
s = gtn.forward_score(bigram_paths)
p = math.exp(s.item())
print(f"The number of occurrences of 'aa' in 'aaabaa' is {p:.0f}")

The number of occurrences of 'aa' in 'aaabaa' is 3


## Edit Distance

In this example we'll use transducers to compute the Levenshtein edit distance between two sequences. The edit distance is a way to measure the similarity between two sequences by computing the minimum number of operations required to change one sequence into the other. The Levenshtein edit distance allows for insertion, deletion, and substitution operations.

For example, consider the two strings "saturday" and "sunday". The edit distance between them is $3$. One way to minimally edit "saturday" to "sunday" is with two deletions (D) and a substitution (S) as below:
```
s a t u r d a y
  D D   S
s     u n d a y
```
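For reference, the same distance can be computed with the textbook dynamic program for Levenshtein distance. This is a plain-Python sketch, not the transducer-based method developed in this section, and the function name `levenshtein` is our own. It is useful mainly as a cross-check on the graph-based result.

```python
def levenshtein(x, y):
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, xi in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, yj in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete xi
                curr[j - 1] + 1,           # insert yj
                prev[j - 1] + (xi != yj),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]

distance = levenshtein("saturday", "sunday")
print(distance)
```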

We can compute the edit distance between two strings with the use of transducers. The idea is to transduce the first string into the second according to the allowed operations encoded as a graph.

In [15]:
edits = gtn.Graph()
edits.add_node(start=True)
edits.add_node(accept=True)
edits.add_node(accept=True)
edits.add_node(accept=True)
edits.add_node(accept=True)
edits.add_arc(0, 1, gtn.epsilon, 0, 1)
edits.add_arc(0, 1, gtn.epsilon, 1, 1)
edits.add_arc(0, 2, 0, gtn.epsilon, 1)
edits.add_arc(0, 2, 1, gtn.epsilon, 1)
edits.add_arc(0, 3, 0, 1, 1)
edits.add_arc(0, 3, 1, 0, 1)
edits.add_arc(0, 4, 0, 0)
edits.add_arc(0, 4, 1, 1)
gtn.draw(edits, "figures/nb/edits.svg", isymbols=symbols, osymbols=symbols)

We first construct an edits graph $E$ which encodes the allowed operations. An example of an edits graph assuming a token set of $\{a, b\}$ is shown below. The insertion of a token is represented by the arcs from state $0$ to state $1$, each of which has a cost of $1$. The deletion of a token is represented by the arcs from state $0$ to state $2$, which also incur a cost of $1$. All possible substitutions are encoded in the arcs from $0$ to $3$ and again have a cost of $1$. We also have to encode the possibility of leaving a token unchanged. This is represented by the arcs from $0$ to $4$, and the cost is $0$.

<div class="figure">
  <div class="img">
    <img src="figures/nb/edits.svg"/>
  </div>
  <div class="caption" markdown="span">
      The edits graph $E$ for the token set $\{a, b\}$.
  </div>
</div>

We then take the closure of the edits graph $E$ to represent the fact that we can make zero or more of any of the allowed edits. We then encode the first string $x$ in a graph $X$ and the second string $y$ in a graph $Y$. All possible ways of editing $x$ to $y$ can be computed by taking the composition:

$$
P = X \circ E^* \circ Y
$$

The graph $P$ represents the set of all possible unique ways we can edit the string $x$ into $y$. The score of a given path in $P$ is the associated cost. We can then find the edit distance by finding the path with the smallest score in $P$. For this, we use the Viterbi algorithm with a $\min$ instead of a $\max$. Alternatively, we can use weights of $-1$ instead of $1$ in $E$ and use the Viterbi algorithm unchanged. The actual edits (*i.e.* the insertions, deletions, and substitutions) can be found by computing the Viterbi path.
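The negated-weight trick can be illustrated without `gtn` by running a max-based Viterbi recursion over a lattice whose cells $(i, j)$ record how much of $x$ and $y$ has been consumed. This is a sketch under our own assumptions: the function name `viterbi_edits` and the operation letters `D`, `I`, `S`, and `K` (keep) are hypothetical, and the lattice only roughly mirrors the states of $P$.

```python
def viterbi_edits(x, y):
    """Return the best (maximum) score and one best edit sequence.

    Insertions (I), deletions (D), and substitutions (S) score -1 and kept
    tokens (K) score 0, so the best score is minus the edit distance.
    """
    n, m = len(x), len(y)
    # best[i][j] = (score, op, prev_i, prev_j) for turning x[:i] into y[:j].
    best = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0, None, None, None)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:  # delete x[i - 1]
                candidates.append((best[i - 1][j][0] - 1, "D", i - 1, j))
            if j > 0:  # insert y[j - 1]
                candidates.append((best[i][j - 1][0] - 1, "I", i, j - 1))
            if i > 0 and j > 0:  # keep (score 0) or substitute (score -1)
                keep = x[i - 1] == y[j - 1]
                candidates.append((best[i - 1][j - 1][0] - (0 if keep else 1),
                                   "K" if keep else "S", i - 1, j - 1))
            best[i][j] = max(candidates)  # Viterbi keeps the highest score
    # Follow the back-pointers from the final cell to recover the edits.
    ops, i, j = [], n, m
    while (i, j) != (0, 0):
        _, op, i, j = best[i][j]
        ops.append(op)
    return best[n][m][0], "".join(reversed(ops))

score, ops = viterbi_edits("abab", "aaabb")
print(-score, ops)
```

Several edit sequences can attain the same best score, so the recovered sequence is one optimal choice rather than the only one.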

---

### Example

Compute the edit distance between $x = abab$ and $y = aaabb$ using graphs.

We first construct the edits graph $E^*$ and the graphs $X$ and $Y$ corresponding to the strings $x$ and $y$. Note that we are using a minimal representation of $E^*$, but the graph is equivalent to the closure of the edits graph above.

In [4]:
edits = gtn.Graph()
edits.add_node(start=True, accept=True)
edits.add_arc(0, 0, gtn.epsilon, 0, -1)
edits.add_arc(0, 0, gtn.epsilon, 1, -1)
edits.add_arc(0, 0, 0, gtn.epsilon, -1)
edits.add_arc(0, 0, 1, gtn.epsilon, -1)
edits.add_arc(0, 0, 0, 1, -1)
edits.add_arc(0, 0, 1, 0, -1)
edits.add_arc(0, 0, 0, 0)
edits.add_arc(0, 0, 1, 1)

X = gtn.Graph()
X.add_node(start=True)
X.add_node()
X.add_node()
X.add_node()
X.add_node(accept=True)
X.add_arc(0, 1, 0)
X.add_arc(1, 2, 1)
X.add_arc(2, 3, 0)
X.add_arc(3, 4, 1)

Y = gtn.Graph()
Y.add_node(start=True)
Y.add_node()
Y.add_node()
Y.add_node()
Y.add_node()
Y.add_node(accept=True)
Y.add_arc(0, 1, 0)
Y.add_arc(1, 2, 0)
Y.add_arc(2, 3, 0)
Y.add_arc(3, 4, 1)
Y.add_arc(4, 5, 1)

Note also that in the edits graph $E^*$ we use scores of $-1$, so that the Viterbi score, which is the score of the maximum-scoring path, yields the smallest edit distance. The next step is to compute $P = X \circ E^* \circ Y$. The graph $P$ is shown below. Each path in $P$ represents a unique conversion of $x$ into $y$ using insertion, deletion, and substitution operations. The negation of the score of the path is the number of such operations required.

For example the path along the state sequence $0 \rightarrow 1 \rightarrow 4 \rightarrow 9 \rightarrow 16 \rightarrow 25$ converts $x$ to $y$ with a substitution for the second letter and an insertion at the end:
```
x = a b a b
y = a a a b b
      S     I
```

In [9]:
edit_paths = gtn.compose(X, gtn.compose(edits, Y))
gtn.draw(edit_paths, "figures/nb/edit_paths.svg", isymbols=symbols, osymbols=symbols)

<div class="figure">
  <div class="img">
    <img src="figures/nb/edit_paths.svg"/>
  </div>
  <div class="caption" markdown="span">
    The graph $P = X \circ E^* \circ Y$ encoding all ways of editing $x = abab$ into $y = aaabb$.
  </div>
</div>

The Viterbi score and Viterbi path then yield the edit distance between $x$ and $y$ and one sequence of edit operations required to attain the edit distance. The Viterbi path for the example is shown below.

In [12]:
# The edit distance is the negation of the Viterbi score
edit_distance = -gtn.viterbi_score(edit_paths).item()
edit_path = gtn.viterbi_path(edit_paths)
gtn.draw(edit_path, "figures/nb/edit_path.svg", isymbols=symbols, osymbols=symbols)

<div class="figure">
  <div class="img">
    <img src="figures/nb/edit_path.svg"/>
  </div>
  <div class="caption" markdown="span">
    The Viterbi path in $P$, giving one sequence of edits which attains the edit distance.
  </div>
</div>

## $n$-gram Language Model
$\DeclareMathOperator*{\LSE}{\textrm{LSE}}$In this example we'll encode an $n$-gram language model as an acceptor. We'll then use the acceptor to compute the language model probability of a given sequence.

Let's start with a very simple example. Suppose we have the token set $\{a, b, c, \textrm{</s>}\}$ and we want to construct a unigram language model. Note that $\textrm{</s>}$ is the end-of-sentence token and is required for the language model to be a valid probability distribution. Given counts of occurrences for each token in the vocabulary, we can construct an acceptor to represent the unigram language model. Suppose we are given the probabilities $0.5$, $0.2$, and $0.3$ for $a$, $b$, and $c$ respectively. The corresponding unigram graph is shown below. Note that the edge weights are the log probabilities.

<div class="figure">
  <div class="img">
    <img src="figures/nb/unigram.svg"/>
  </div>
  <div class="caption" markdown="span">
    The unigram language model graph $U$.
  </div>
</div>

Now assume we are given the sequence $aa$ for which we would like to compute the probability. The probability under the language model is $\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$. We can compute the log probability of $aa$ by intersecting its graph representation $X$ with the unigram graph $U$ and then computing the forward score:

$$
\log p(aa) = \LSE (X \circ U)
$$
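Because the intersected graph has a single path, the forward score reduces to a sum of per-token log probabilities. The plain-Python sketch below checks the value $\frac{1}{4}$ without `gtn`. The dictionary `unigram_logp` and the helper `unigram_log_prob` are names we introduce for illustration, using the probabilities given in the text.

```python
import math

# Unigram log probabilities for a, b, and c, taken from the text.
unigram_logp = {"a": math.log(0.5), "b": math.log(0.2), "c": math.log(0.3)}

def unigram_log_prob(seq):
    """Sum the per-token log probabilities along the single path for seq."""
    return sum(unigram_logp[tok] for tok in seq)

log_p = unigram_log_prob("aa")
print(math.exp(log_p))
```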

In [44]:
# The unigram graph U:
U = gtn.Graph()
U.add_node(start=True, accept=True)
U.add_arc(src_node=0, dst_node=0, label=0)
U.add_arc(src_node=0, dst_node=0, label=1)
U.add_arc(src_node=0, dst_node=0, label=2)
U.set_weights([math.log(p) for p in [0.5, 0.2, 0.3]])

# The graph representing the sequence "aa":
X = gtn.Graph()
X.add_node(start=True)
X.add_node()
X.add_node(accept=True)
X.add_arc(src_node=0, dst_node=1, label=0)
X.add_arc(src_node=1, dst_node=2, label=0)

# Compute the unigram score of X:
x_scored = gtn.intersect(X, U)
x_prob = math.exp(gtn.forward_score(x_scored).item())
gtn.draw(x_scored, "figures/nb/unigram_aa_scored.svg", isymbols=symbols)

The graph below shows the intersection $X \circ U$. The arcs in the intersected graph contain the correct unigram scores, and the forward score gives the log probability of the sequence $aa$. In this case the Viterbi score would give the same result since the graph has only one path.

<div class="figure">
  <div class="img">
    <img src="figures/nb/unigram_aa_scored.svg"/>
  </div>
  <div class="caption" markdown="span">
    The intersection $X \circ U$ for the sequence $aa$.
  </div>
</div>

For an arbitrary sequence $x$ with a graph representation $X$ and an arbitrary $n$-gram language model with graph representation $N$, the log probability of $x$ is given by:

$$
\log p(x) = \LSE(X \circ N)
$$

Next, let's see how to represent a bigram language model as a graph. From there, the generalization to arbitrary order $n$ is relatively straightforward. Assume again we have the token set $\{a, b, c\}$. The bigram model is shown in the graph below.

<div class="figure">
  <div class="img">
    <img src="figures/nb/bigram.svg"/>
  </div>
  <div class="caption" markdown="span">
    A bigram language model for the token set $\{a, b, c\}$.
  </div>
</div>

Each state is labeled with the token representing the most recently seen input. For a bigram model we only need to remember the previous token to know which score to use when processing the next token. For a trigram model we would need to remember the previous two tokens. For an $n$-gram model we would need to remember the previous $n-1$ tokens. The label and score pair leaving each state represent the corresponding conditional probability (technically these should be log probabilities). Each state has an outgoing arc for every possible token in the token set.
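The role of the state as a memory of the previous token can be sketched in plain Python with a nested dictionary. Everything here is hypothetical: the conditional probabilities are made up for illustration, and the start-of-sequence state `<s>` is an assumption that the graph above does not necessarily include.

```python
import math

# Hypothetical conditional probabilities p(next | previous). The "<s>" row is
# an assumed start state; all numbers are made up for illustration.
bigram_p = {
    "<s>": {"a": 0.6, "b": 0.3, "c": 0.1},
    "a": {"a": 0.5, "b": 0.25, "c": 0.25},
    "b": {"a": 0.4, "b": 0.2, "c": 0.4},
    "c": {"a": 0.3, "b": 0.3, "c": 0.4},
}

def bigram_log_prob(seq):
    """Walk the bigram graph: the current state is the previous token."""
    state = "<s>"
    total = 0.0
    for tok in seq:
        total += math.log(bigram_p[state][tok])  # score of the arc taken
        state = tok                              # remember the token just seen
    return total

log_p_ab = bigram_log_prob("ab")
print(math.exp(log_p_ab))
```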

---

### Example

Compute the number of states and arcs in a graph representation of an $n$-gram language model for a given order $n$ and a token set size of $v$.

For order $n$, the graph needs a state for every possible token sequence of length $n-1$. This means that the graph will have $v^{n-1}$ states. Each state has $v$ outgoing arcs. Thus the total number of arcs in the graph is $v \cdot v^{n-1}= v^n$. This should be expected given that the language model assigns a score for every possible sequence of length $n$.
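The counts can be confirmed by enumerating the states and arcs explicitly. This is a small plain-Python sketch, and the function name `ngram_graph_size` is our own. It assumes each state is a tuple of the previous $n-1$ tokens.

```python
from itertools import product

def ngram_graph_size(tokens, n):
    """Enumerate the states and arcs of an order-n model over `tokens`."""
    # One state per possible history of length n - 1.
    states = list(product(tokens, repeat=n - 1))
    # One outgoing arc per (state, next token) pair.
    arcs = [(state, tok) for state in states for tok in tokens]
    return len(states), len(arcs)

num_states, num_arcs = ngram_graph_size("abc", 3)
print(num_states, num_arcs)
```

For $n = 1$ the history is empty, so the enumeration gives a single state with $v$ self-loops, matching the unigram graph.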

---

## Automatic Segmentation Criterion
$\newcommand{\bX}{{\bf X}}
\newcommand{\by}{{\bf y}}
\newcommand{\bx}{{\bf x}}
\newcommand{\ba}{{\bf a}}
\newcommand{\bp}{{\bf p}}
$
In speech recognition and other problems, we often need to compute a conditional probability of an output sequence given an input sequence when the two sequences do not have the same length. The Automatic Segmentation criterion (ASG) is one of several common loss functions for which this is possible. However, ASG is limited to the case when the output sequence is no longer than the input sequence. 

Assume we have an input sequence of vectors $\bX = [\bx_1, \ldots, \bx_T]$ of length $T$ and an output token sequence $\by = [y_1, \ldots, y_U]$ of length $U$ such that $U \le T$. We don't know the actual alignment between $\by$ and $\bX$ and in most applications, including speech recognition, we don't need it. To get around not knowing this alignment, the ASG criterion marginalizes over all possible alignments between $\bX$ and $\by$.

In ASG, the output sequence is aligned to a given input by allowing one or more consecutive repeats of each token in the output. Let's look at an example. Suppose we have an input of length $5$ and the output sequence $ab$. Some possible alignments of the output are $aaabb$, $abbbb$, and $aaaab$. Some invalid alignments are $abbbba$, $aaab$, and $aaaaa$. The first corresponds to the output $aba$, the second is too short, and the third corresponds to the output $a$.
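The alignment rule can be made concrete by enumerating every sequence of length $T$ and keeping those which collapse to the output. The plain-Python sketch below uses helper names of our own (`collapse`, `valid_alignments`) and assumes the output has no adjacent repeated tokens, as is the case for $ab$.

```python
from itertools import groupby, product

def collapse(alignment):
    """Merge consecutive repeats, e.g. 'aaabb' becomes 'ab'."""
    return "".join(tok for tok, _ in groupby(alignment))

def valid_alignments(y, T, tokens):
    """All length-T sequences over `tokens` which collapse to y."""
    return ["".join(a) for a in product(tokens, repeat=T) if collapse(a) == y]

alignments = valid_alignments("ab", 5, "abc")
print(alignments)
```

Brute-force enumeration visits every one of the $v^T$ sequences, so it is only practical for very small examples.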

For each time-step of the input, our model assigns a score for every possible output token. Let $\ba = [a_1, \ldots, a_T]$ be one possible alignment between $\bX$ and $\by$. The alignment $\ba$ also has length $T$. To compute a score for $\ba$, we sum the scores of its tokens at each time-step:

$$
s(\ba) = \sum_{t=1}^T s_t(a_t)
$$

Let $\mathcal{A}_{\bX,\by}$ denote the set of all possible alignments between $\bX$ and $\by$. We can then use the individual alignment scores to compute a conditional probability of the output $\by$ given the input $\bX$:

$$
\log p(\by \mid \bX) = \log \sum_{\ba \in \mathcal{A}_{\bX, \by}} e^{s(\ba)} - \log Z
$$

where $Z$ is a normalization term:

$$
Z = \sum_{\bp \in \mathcal{Z}_\bX} e^{s(\bp)}
$$

where $\mathcal{Z}_\bX$ is the set of all possible alignments of the same length as $\bX$. Computing the summations over $\mathcal{A}_{\bX,\by}$ and $\mathcal{Z}_\bX$ explicitly is not tractable because the sizes of these sets grow rapidly with the lengths of $\bX$ and $\by$. We will instead use automata to encode these sets and efficiently compute the summation using the forward score algorithm.
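For a tiny problem the two summations can still be written out by brute force, which gives a reference value for the quantities defined above. The sketch below is plain Python and does not use `gtn`. The names `brute_force_asg` and `emissions` are ours, the all-zero scores are assumed only as an example, and the output is assumed to have no adjacent repeated tokens.

```python
import math
from itertools import groupby, product

def log_sum_exp(scores):
    """Numerically stable log(sum(exp(s) for s in scores))."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def brute_force_asg(emissions, y):
    """log p(y | X), where emissions[t][k] is the score of token k at step t."""
    T, V = len(emissions), len(emissions[0])

    def score(a):
        # s(a) = sum over time-steps of the score of the aligned token.
        return sum(emissions[t][k] for t, k in enumerate(a))

    all_seqs = list(product(range(V), repeat=T))  # every sequence of length T
    aligned = [a for a in all_seqs
               if [k for k, _ in groupby(a)] == list(y)]  # those collapsing to y
    log_numerator = log_sum_exp([score(a) for a in aligned])
    log_z = log_sum_exp([score(a) for a in all_seqs])
    return log_numerator - log_z

emissions = [[0.0] * 3 for _ in range(4)]   # T = 4 steps, 3 tokens, all scores 0
log_p = brute_force_asg(emissions, [0, 1])  # output sequence "ab" as token ids
print(math.exp(log_p))
```

With all scores equal to zero, the result is simply the fraction of length-$4$ sequences which are valid alignments of $ab$.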

Let's start with the normalization term $Z$. The set $\mathcal{Z}_\bX$ just encodes all possible outputs of length $T$, where $T$ is the length of $\bX$. Assume $T=4$ and that we have three possible output tokens. If the scores for each output are independent, we can represent $\mathcal{Z}_\bX$ with the graph below. Each output token at each step is given a score by the model. These scores are often called the *emissions* and the graph itself is sometimes called the emissions graph. We'll use $\mathcal{E}$ to represent the emissions graph. In this case the emissions graph $\mathcal{E}$ is the same as the normalization graph $\mathcal{Z}_\bX$; however, in general they may be different. The log of the normalization term is the forward score of the emissions graph, $\log Z = \LSE(\mathcal{E})$.

In [5]:
E = gtn.linear_graph(4, 3)
gtn.draw(E, "figures/nb/asg_emissions.svg", isymbols=symbols)

<div class="figure">
  <div class="img">
    <img src="figures/nb/asg_emissions.svg"/>
  </div>
  <div class="caption" markdown="span">
    The emissions graph $\mathcal{E}$ for $T = 4$ and three output tokens.
  </div>
</div>

Let's turn to the set $\mathcal{A}_{\bX, \by}$ which we will also represent as an acceptor. This acceptor should have a path for every possible alignment between $\bX$ and $\by$. We'll construct $\mathcal{A}_{\bX, \by}$ in two steps. First, we can encode the set of allowed alignments of arbitrary length for a given sequence $\by$ using a graph. For example, for the sequence $ab$ the graph is shown below.

In [6]:
ay = gtn.Graph()
ay.add_node(start=True)
ay.add_node()
ay.add_node(accept=True)
ay.add_arc(src_node=0, dst_node=0, label=0)
ay.add_arc(src_node=0, dst_node=1, label=0)
ay.add_arc(src_node=1, dst_node=1, label=1)
ay.add_arc(src_node=1, dst_node=2, label=1)
gtn.draw(ay, "figures/nb/asg_alignments.svg", isymbols=symbols)

<div class="figure">
  <div class="img">
    <img src="figures/nb/asg_alignments.svg"/>
  </div>
  <div class="caption" markdown="span">
    The alignment graph $\mathcal{A}_\by$ for the sequence $ab$.
  </div>
</div>

This graph has a simple interpretation. Each token in the output $ab$ can repeat one or more times in the alignment. We'll use the symbol $\mathcal{A}_\by$ to represent this graph as it is only dependent on $\by$ and not $\bX$. We can then construct $\mathcal{A}_{\bX,\by}$ by intersecting $\mathcal{A}_\by$ with the graph containing all possible sequences of length $T$. The latter graph is precisely the emissions graph $\mathcal{E}$ we constructed above, so $\mathcal{A}_{\bX,\by} = \mathcal{A}_\by \circ \mathcal{E}$. An example graph is shown below for the sequence $ab$ with $T=4$.

In [7]:
axy = gtn.intersect(ay, E)
gtn.draw(axy, "figures/nb/asg_constrained.svg", isymbols=symbols)

<div class="figure">
  <div class="img">
    <img src="figures/nb/asg_constrained.svg"/>
  </div>
  <div class="caption" markdown="span">
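    The graph $\mathcal{A}_{\bX,\by} = \mathcal{A}_\by \circ \mathcal{E}$ for the sequence $ab$ with $T = 4$.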
  </div>
</div>

In terms of graph operations, we can then write the ASG loss function as:

$$
\log p(\by \mid \bX) = \LSE(\mathcal{A}_{\by} \circ \mathcal{E}) - \LSE(\mathcal{E})
$$
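The two forward scores in this expression can be sketched as a hand-written recursion in plain Python. This is not the `gtn` implementation. The function names are ours, emissions are assumed independent across time-steps (so the normalizer factorizes per step), and the output is assumed to have no adjacent repeated tokens.

```python
import math

NEG_INF = float("-inf")

def log_add(a, b):
    """log(exp(a) + exp(b)), treating -inf as log(0)."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def asg_forward(emissions, y):
    """log p(y | X) via a forward recursion over the alignment graph."""
    T, U = len(emissions), len(y)
    # alpha[u] = log-sum of scores of partial alignments currently on y[u].
    alpha = [NEG_INF] * U
    alpha[0] = emissions[0][y[0]]  # an alignment must start on y[0]
    for t in range(1, T):
        new = [NEG_INF] * U
        for u in range(U):
            stay = alpha[u]                               # repeat y[u]
            advance = alpha[u - 1] if u > 0 else NEG_INF  # move on from y[u-1]
            new[u] = log_add(stay, advance) + emissions[t][y[u]]
        alpha = new
    log_numerator = alpha[U - 1]  # an alignment must end on the last token
    # Independent emissions: the forward score of E is a sum of per-step LSEs.
    log_z = 0.0
    for row in emissions:
        step = NEG_INF
        for s in row:
            step = log_add(step, s)
        log_z += step
    return log_numerator - log_z

emissions = [[0.0] * 3 for _ in range(4)]  # T = 4 steps, 3 tokens, all scores 0
log_p = asg_forward(emissions, [0, 1])     # output sequence "ab" as token ids
print(math.exp(log_p))
```

The recursion costs on the order of $T \cdot U$ operations for the numerator, in contrast to enumerating all $v^T$ sequences.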

### Transitions

The original ASG loss function also includes bigram transition scores. If we let $h(a_{t-1}, a_t)$ denote the transition function, then we can incorporate the transition score into the score of the alignment:

$$
s(\ba) = \sum_{t=1}^T \left( s_t(a_t) + h(a_{t-1}, a_t) \right)
$$

where we assume $a_0$ is a special start of sequence token $\textrm{<s>}$. We can use the alignment scores in the same manner as above and the rest of the loss function remains unchanged. 
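As a concrete reading of this score, the sketch below adds a transition term to each emission term, starting from the assumed token `<s>`. It is plain Python with hypothetical names (`alignment_score`, `emissions`, `transitions`), and all the numbers are made up for illustration.

```python
def alignment_score(alignment, emissions, transitions, start="<s>"):
    """s(a) = sum over t of emissions[t][a_t] + transitions[(a_{t-1}, a_t)]."""
    prev = start  # a_0 is the start-of-sequence token
    total = 0.0
    for t, tok in enumerate(alignment):
        total += emissions[t][tok] + transitions[(prev, tok)]
        prev = tok
    return total

# Hypothetical emission scores for T = 2 steps over the tokens a and b.
emissions = [{"a": 1.0, "b": 0.0}, {"a": 0.5, "b": 2.0}]
# Hypothetical transition scores h(previous, current).
transitions = {
    ("<s>", "a"): 0.1, ("<s>", "b"): 0.2,
    ("a", "a"): 0.3, ("a", "b"): 0.4,
    ("b", "a"): 0.5, ("b", "b"): 0.6,
}

score_ab = alignment_score("ab", emissions, transitions)
print(score_ab)
```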

We'll next show how to incorporate transitions using an acceptor and graph operations. I'll rely on the ideas introduced in the section on $n$-gram language models, so now is a good time to read or review that section. The first step is to encode the bigram model as a graph as shown below:

<div class="figure">
  <div class="img">
    <img src="figures/nb/asg_bigrams.svg"/>
  </div>
  <div class="caption" markdown="span">
    The bigram transition model encoded as a graph.
  </div>
</div>


### ASG with Transducers

As a final step, I'll show how to construct the ASG criterion from even simpler transducer building blocks. The advantage of this approach is that it lets us experiment with changes to the criterion
