# MLMI3: Probabilistic Automata
Lecturer: Prof. Bill Byrne

----

# Table of Contents
## 1. Formal Languages and Language Hierarchies
* 1.1. Finite Languages
* 1.2. Finite State Acceptors
* 1.3. Regular Languages
* 1.4. Context Free Grammars
* 1.5. Right Linear Languages

## 2. Probabilistic Automata
* 2.1. PCFGs
* 2.2. Weighted Finite State Acceptors
* 2.3. WFSA Operations
* 2.4. Weighted Finite State Transducers
* 2.5. WFST Operations

## 3. Distances, Kernels, Semirings
* 3.1. String Distances
* 3.2. Kernels and Counting Transducers
* 3.3. Semirings

## 4. Applications of Weighted Automata
* 4.1. Acoustic Likelihoods
* 4.2. Language Models
* 4.3. Lexicons
* 4.4. CI-to-CD Transducers
* 4.5. WFSA ASR 
* 4.6. Tagging
* 4.7. Keyboards

## 5. Inference and Computation
* 5.1. MLE
* 5.2. MBR Training
* 5.3. Distance Computation
* 5.4. Inference Functions
----

# 1. Formal Languages and Language Hierarchies

* **The Chomsky Hierarchy**

>|Type|Name|Rule|Instance|
|-|-----------------|-----------------------------------------------|-----------|
|0|Turing Equivalent|||
|1|Context Sensitive|$\alpha A \beta \rightarrow \alpha \gamma \beta, \gamma \neq \epsilon$||
|2|Context Free     |$A \rightarrow \gamma$                         |           |
|3|Regular          |$A \rightarrow xB$ or $A \rightarrow x$        |FSA        |
|-|Finite           |$x \in \{x_1, \dots, x_N ; x_n \in \Sigma^* \}$|N-Best list|

## 1.1. Finite Languages

* **Formal Languages**

>* **Formal Language** $=$ set of **strings** 
>* **String** $=$ composed of **symbols**
>* **Symbols** $=$ drawn from a set $\Sigma$ (**alphabet** or **vocabulary**)
>* **Automata** $=$ **acceptor** or **generator**

* **Finite Languages**

>* **Finite vocabulary** & Strings of a **fixed maximum length**
>* $\Rightarrow$ possible to enumerate every sentence

## 1.2. Finite State Acceptors

* **Lattice**: Weighted, directed acyclic graph

>* $\Rightarrow$ used to represent the *output* of language processing systems

* **FSA (Finite State Acceptor)**: directed graph specified as a 5-tuple $M=(Q, \Sigma, E, q_0, F)$

>* $Q$: finite set of states
>* $\Sigma$: alphabet (or vocabulary)
>* $E$: set of edges - from $s(e)$ to $f(e)$, each labelled with a symbol $i(e) \in \Sigma$
>* $q_0$: initial state
>* $F \subset Q$: set of final states 

* **Complete Path**: $p=e_1 \dots e_{n_p}$

>* Initial state, $i_p = s(e_1)$
>* Final state $f_p = f(e_{n_p})$
>* Produces the string $x=i(e_1) \dots i(e_{n_p})$
>* FSA *accepts* or *generates* the string $x$
>* The generated strings form the language $L_M$
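
The definitions above can be sketched in Python. This is a minimal illustration, assuming a hypothetical toy acceptor for the language $ab^*c$; the `FSA` container, the `accepts` helper, and the state names are illustrative choices, not part of the lecture.

```python
from typing import NamedTuple

class FSA(NamedTuple):
    states: frozenset     # Q
    alphabet: frozenset   # Sigma
    edges: frozenset      # E: (s(e), i(e), f(e)) triples
    start: str            # q_0
    finals: frozenset     # F

def accepts(m: FSA, x: str) -> bool:
    """Return True if some complete path in m produces the string x."""
    current = {m.start}
    for symbol in x:
        # Follow every edge whose source is currently reachable and whose label matches
        current = {f for (s, i, f) in m.edges if s in current and i == symbol}
        if not current:
            return False
    return bool(current & m.finals)

# Hypothetical toy acceptor for the regular language a b* c
toy = FSA(
    states=frozenset({"q0", "q1", "q2"}),
    alphabet=frozenset("abc"),
    edges=frozenset({("q0", "a", "q1"), ("q1", "b", "q1"), ("q1", "c", "q2")}),
    start="q0",
    finals=frozenset({"q2"}),
)
```

Tracking a set of reachable states means the same helper also works when the acceptor is non-deterministic.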

## 1.3. Regular Languages

* **Operations on Strings**

>* Length: $|s|$
>* Empty string: $\epsilon$, $|\epsilon|=0$, $xy = \epsilon xy = x \epsilon y = xy \epsilon$
>* Concatenation:
>  * $abc$ with $cde$ $\rightarrow$ $abccde$
>  * $abc$ with $\epsilon$ $\rightarrow$ $abc$
>  * string $x$ with itself $n$ times $\rightarrow$ $x^n$

* **Operations on Sets of Strings**

>* **Union**: $L_1 \cup L_2$
>* **Intersection**: $L_1 \cap L_2$
>* **Concatenation**: $L_1 L_2$ $\rightarrow$ $L^0=\{\epsilon\}, L^1=L, L^2=LL, \dots$
>* **Kleene Closure**: $L^* = \cup_{n \geq 0} L^n$
>* **Positive Closure**: $L^+ = \cup_{n \geq 1} L^n = LL^*$ (Kleene closure but excluding $\epsilon$)

* **Regular Languages**: class of languages that are definable by regular expressions

>* If $L_1$ and $L_2$ are regular languages over $\Sigma$, so are (closure under)

>  * $L_1 \cup L_2$ , $L_1 \cap L_2$ , $L_1 L_2$ , $L_1 - L_2$
>  * Complementation: $\Sigma^* - L_1$ (all possible strings not in $L_1$)
>  * Reversal: $L_1^R$ (the reversal of all strings in $L_1$)
>  * Kleene Closure: $L_1^*$ and $L_2^*$

* **Pumping Lemma for Regular Languages**

>$$\text{Let } L \text{ be an infinite regular language,}$$
>$$\text{Any sufficiently long } w \in L \text{ can be split as } w=xyz \text{ with } y \neq \epsilon \text{ such that } xy^nz \in L \text{ for all } n \geq 0$$

>* String longer than the no. of states in FSA $\rightarrow$ some states are visited multiple times
>* $y$: string that is **pumped**
>* If arbitrarily long strings in the language cannot be pumped under any split $\rightarrow$ the language is **not regular**

## 1.4. Context Free Grammars

* **CFG**: defined as a 4-tuple $G=(N,\Sigma,R,S)$ 

>* $N$: set of nonterminal symbols 
>  * Can be renamed arbitrarily
>  * For any two grammars $G_1$ and $G_2$, we can assume that $N_1 \cap N_2 = \emptyset$
>* $\Sigma$: set of terminal symbols
>* $R$: set of productions $A \rightarrow \beta$, where $A \in N$ and $\beta \in (\Sigma \cup N)^*$
>* $S$: start symbol

* **Derivations**

>* **Direct Derivation:** $\alpha A \gamma \Rightarrow \alpha \beta \gamma$ (rule: $A \rightarrow \beta$ and $\alpha, \gamma \in (\Sigma \cup N)^*$)
>* **Derivation:** $\alpha_1 \underset{G}{\overset{*}{\Rightarrow}} \alpha_m$ (arbitrary no. of rules derive $\alpha_m$)
>* **Language:** set of strings derived from the start symbol

>$$\mathcal{L}_G = \{ w|w \in \Sigma^* \text{ and } S \underset{G}{\overset{*}{\Rightarrow}} w \}$$

>* **Ambiguity:** the same string of words can have different trees
>* $\mathcal{T}_G(S)$: all the trees with yield $S$ generated by the grammar $G$

## 1.5. Right Linear Languages

* **Definitions**

>* **Linear Grammar:** CFG with at most **one** non-terminal on the RHS of its rules
>* **Right-Linear:** All non-terminals in RHS are at the right ends
>* **Left-Linear:** All non-terminals in RHS are at the left ends

* **Construct FSA from a Right-Linear Grammar**

>$$G = (N, \Sigma, P, S) \;\; \rightarrow \;\; M = (Q, \Sigma, E, q_0, F)$$

>* $Q = N \cup \{f\}$ and $F=\{f\}$
>* $A \rightarrow xB \;\;\; \Rightarrow \;\;\; x: (A) \overset{x}{\rightarrow} (B)$
>* $A \rightarrow x \;\;\; \Rightarrow \;\;\; x: (A) \overset{x}{\rightarrow} (f)$
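
The construction above can be sketched as follows. This is a minimal illustration under stated assumptions: rules are encoded as hypothetical `(A, x, B)` triples (with `B = None` for $A \rightarrow x$), the fresh final state is named `"f"`, and the toy grammar is invented for the example.

```python
def grammar_to_fsa(nonterminals, rules, start):
    """Build (Q, E, q0, F) from right-linear rules.

    rules: list of (A, x, B) for A -> xB, or (A, x, None) for A -> x.
    """
    final = "f"  # fresh final state; assumed not to clash with a nonterminal name
    assert final not in nonterminals
    states = set(nonterminals) | {final}
    edges = set()
    for lhs, symbol, rhs in rules:
        target = rhs if rhs is not None else final
        edges.add((lhs, symbol, target))
    return states, edges, start, {final}

def accepts(edges, start, finals, x):
    """Check whether some complete path produces x (set-of-states simulation)."""
    current = {start}
    for symbol in x:
        current = {t for (s, i, t) in edges if s in current and i == symbol}
    return bool(current & finals)

# Hypothetical grammar: S -> aS | bA, A -> bA | b  (language a* b b+)
rules = [("S", "a", "S"), ("S", "b", "A"), ("A", "b", "A"), ("A", "b", None)]
Q, E, q0, F = grammar_to_fsa({"S", "A"}, rules, "S")
```

Each nonterminal becomes a state, so a derivation $S \Rightarrow aS \Rightarrow abA \Rightarrow abb$ corresponds to the path $S \rightarrow S \rightarrow A \rightarrow f$.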

* **Right-Linear Languages: closure under**

>* **Union:**  $L_A \cup L_B = L_C$
>* **Concatenation:** $L_A L_B = L_C$
>* **Kleene Closure:** $L_A^* = L_C$

# 2. Probabilistic Automata

>$$0 \leq P(w) \leq 1 \;\;\;,\;\;\; \sum_{w \in L} P(w)=1$$

>$$P(w): \text{ how sensible } w \text{ is}$$

## 2.1. PCFGs

* **PCFG = CFG + Probabilities**, $G = (N, \Sigma, R, S)$
* $R$: set of productions of the form $A \rightarrow \beta/p$, where $A \in N$ and $\beta \in (\Sigma \cup N)^*$

>$$p=P(A \rightarrow \beta | A) \;\;\;,\;\;\; \sum_\beta P(A \rightarrow \beta | A) = 1$$

* **Probabilities over Derivations**

>$$d = r_1, \dots, r_n \text{ derives } s \in \Sigma^* \text{ from } S$$

>$$p(d|S) = \prod^n_{i=1} p(r_i) = \prod_{r \in R} p(r)^{\#_d (r)}$$

>* Probability of a tree = Probability of its derivation

>* $\#_d (r)$: no. of times $r$ occurs in the derivation $d$
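
A short sketch of the two equivalent products for $p(d|S)$, assuming a hypothetical toy PCFG whose rules and probabilities are invented for illustration:

```python
from collections import Counter
from math import prod

# Hypothetical toy PCFG: rule -> probability P(A -> beta | A)
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.6,
    ("NP", ("fish",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("V", ("eats",)): 1.0,
}

def derivation_prob(derivation):
    """p(d|S) as a product over the rule applications r_1, ..., r_n."""
    return prod(rule_prob[r] for r in derivation)

def derivation_prob_counts(derivation):
    """The same quantity as a product over rules of p(r) ** #_d(r)."""
    counts = Counter(derivation)
    return prod(rule_prob[r] ** n for r, n in counts.items())

# Leftmost derivation of "she eats fish"
d = [
    ("S", ("NP", "VP")),
    ("NP", ("she",)),
    ("VP", ("V", "NP")),
    ("V", ("eats",)),
    ("NP", ("fish",)),
]
```

The count-based form is the one that generalises to estimation, since the likelihood then depends on the data only through the rule counts.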

* **Ambiguity in PCFGs**

>$$P(S) = \sum_{T \in \mathcal{T}_G (S)} P(T)$$

>* $\mathcal{T}_G (S)$: trees with yield $S$ generated by the grammar $G$
>* $P(S)$: prob. over sentences / $P(T)$: prob. over trees

## 2.2. Weighted Finite State Acceptors

* **WFSA = FSA + Weights**

>$$ M = (Q, \Sigma, E, q_0, F, \rho) $$

>$$s(e) \overset{i(e)/w(e)}{\longrightarrow}f(e) \;\;\;,\;\;\; \rho(f): \text{weight for final states } f \in F$$

* **Weights Assigned to Strings by Acceptors**

>$$w(p) = w(e_1) \otimes \dots \otimes w(e_{n_p}) \otimes \rho(f_p) = (\otimes^{n_p}_{j=1} w(e_j)) \otimes \rho(f_p)$$

>$$[\![ A ]\!] (x) = \underset{p \in P(x)}{\bigoplus} w(p)$$

>* $P(x)$: set of complete paths which generate $x$
>* $[\![ A ]\!] (x)$: cost assigned to the string $x$ by the acceptor

* **Operations on Weights**

>|**Semiring**|$\mathbb{K}$|$\oplus$|$\otimes$|$\bar{0}$|$\bar{1}$|
|-----------|---------------------------------------|------------|-|-|-|
|**Probability**|$\mathbb{R}_+$                         |$+$|$\times$|$0$|$1$|
|**Log**        |$\mathbb{R} \cup \{ -\infty, \infty \}$|$\oplus_{\log}$|$+$|$\infty$|$0$|
|**Tropical**   |$\mathbb{R} \cup \{ -\infty, \infty \}$|$\min$|$+$|$\infty$|$0$|

>* $k_1 \oplus_{\log} k_2 = - \log (e^{-k_1} + e^{-k_2})$
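
The three semirings in the table can be sketched as interchangeable $(\oplus, \otimes, \bar{0}, \bar{1})$ tuples. This is a minimal illustration, assuming a hypothetical `Semiring` container and two invented paths that generate the same string; `string_weight` computes $[\![ A ]\!](x)$ directly from lists of arc weights rather than from a real automaton.

```python
import math
from functools import reduce
from typing import Callable, NamedTuple

class Semiring(NamedTuple):
    plus: Callable[[float, float], float]
    times: Callable[[float, float], float]
    zero: float
    one: float

def log_plus(k1: float, k2: float) -> float:
    """k1 (+)_log k2 = -log(exp(-k1) + exp(-k2)), computed stably."""
    if k1 == math.inf:
        return k2
    if k2 == math.inf:
        return k1
    return min(k1, k2) - math.log1p(math.exp(-abs(k1 - k2)))

PROBABILITY = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)
LOG = Semiring(log_plus, lambda a, b: a + b, math.inf, 0.0)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0.0)

def string_weight(sr: Semiring, paths):
    """[[A]](x): (+)-sum over paths of the (x)-product of each path's arc weights."""
    path_weights = (reduce(sr.times, p, sr.one) for p in paths)
    return reduce(sr.plus, path_weights, sr.zero)

# Two hypothetical paths generating the same string, given as arc probabilities
prob_paths = [[0.5, 0.4], [0.5, 0.2]]
# The same paths with negative log probabilities as arc costs
cost_paths = [[-math.log(w) for w in p] for p in prob_paths]
```

With these toy paths the probability semiring gives the marginal $0.2 + 0.1$, the log semiring gives its negative logarithm, and the tropical semiring gives the cost of the single best path.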

* **Weights under Semirings** (paths $p_1$ and $p_2$ generate $\text{"a b"}$)

>\begin{align}
\textbf{Probability: } [\![ A ]\!] (\text{"a b"}) &= p_1 (\text{"a b"}) p_1 + p_2 (\text{"a b"}) p_2 \\
&= \text{marginal probability} \\
\\
\textbf{Log: } [\![ A ]\!] (\text{"a b"}) &= - \log \big[ p_1 (\text{"a b"}) p_1 + p_2 (\text{"a b"}) p_2 \big] \\
&= \text{negative log marginal probability} \\
\\
\textbf{Tropical: } [\![ A ]\!] (\text{"a b"}) &= -\max \big[ \log p_1 (\text{"a b"}) p_1, \log p_2 (\text{"a b"}) p_2 \big] \\
&= \text{negative log Viterbi likelihood}
\end{align}

* **Tropicalization**

><img src = 'images/image01.png' width=500>

>* Replace $(+, \times)$ by $(\min, +)$
>* Replace joint probabilities by their negative logarithms
>* This process is consistent (i.e. invertible) for arc weights and path weights

## 2.3. WFSA Operations

* **Intersection:** $[\![ C ]\!] (x) = [\![ A ]\!] (x) \otimes [\![ B ]\!] (x)$

* **Union:** $[\![ C ]\!] (x) = [\![ A ]\!] (x) \oplus [\![ B ]\!] (x)$



><img src = 'images/image02.png' width = 600>

><img src = 'images/image03.png' width = 400>

><img src = 'images/image04.png' width = 500>

* **Concatenation:** $[\![ C ]\!] (x) = \underset{x_1,x_2:x=x_1x_2}{\bigoplus} [\![ A ]\!] (x_1) \otimes [\![ B ]\!] (x_2)$
* **Closure:** $[\![ A^* ]\!] (x) = \bigoplus^\infty_{n=0} [\![ A^n ]\!] (x)$



><img src = 'images/image05.png' width = 550>

* **Determinization**

>* After **determinization**,
>  * unique starting state
>  * no two transitions leaving a state share the same input label 
>  * arc weights may change / but string weights are unchanged
>  * there may be new epsilon arcs
>* **Minimization:** finds an equivalent machine with a minimal no. of states and arcs

* **Pruning**

>$$\text{edge: } e = p \overset{i/w}{\longrightarrow} n$$

>$$\text{delete } e \text{ if } d^* \otimes c < d^r [p] \otimes w \otimes d[n]$$

>* $d^*$: the weight of the best path through the FST
>* $c$: pruning threshold
>* $d^r[p]$: shortest distance from the start state to $p$
>* $d[n]$: shortest distance from $n$ to a final state

* **Pushing**

>* **Pushing** moves weights and/or labels towards the start or the end state
>  * Towards the **start** state: **improve pruning**
>  * Towards the **end** state: **helps accumulate costs over paths**
>* **Algorithm:**
>  * Pushing makes the WFSA stochastic (in real semiring, for each state, weights add to 1)
>  * Tropical & log semiring $\rightarrow$ multiplicative inverse is simply arithmetic subtraction

>$$w \leftarrow (d[p])^{-1} \otimes w \otimes d[n] \;\;\;,\;\;\; d[q]=\underset{\pi \in P(q,F)}{\bigoplus} w[\pi]$$
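
In the tropical semiring the reweighting above becomes $w + d[n] - d[p]$. The following is a minimal sketch of pushing towards the start state, assuming a hypothetical toy acceptor in which every state can reach a final state; the function names and the `(source, label, weight, target)` arc encoding are illustrative and do not correspond to any library API.

```python
import math

def shortest_to_final(states, arcs, final_weight):
    """d[q]: tropical (+)-sum (= min) of path weights from q to a final state."""
    d = {q: final_weight.get(q, math.inf) for q in states}
    for _ in range(len(states) - 1):          # Bellman-Ford style relaxation
        for (p, label, w, n) in arcs:
            d[p] = min(d[p], w + d[n])
    return d

def push_to_start(states, arcs, final_weight):
    """Reweight w <- d[p]^{-1} (x) w (x) d[n]; in tropical, w + d[n] - d[p]."""
    d = shortest_to_final(states, arcs, final_weight)
    pushed_arcs = [(p, label, w + d[n] - d[p], n) for (p, label, w, n) in arcs]
    pushed_final = {f: rho - d[f] for f, rho in final_weight.items()}
    return pushed_arcs, pushed_final, d

# Hypothetical toy acceptor; arcs are (source, label, weight, target)
states = [0, 1, 2, 3]
arcs = [(0, "a", 1.0, 1), (0, "b", 2.0, 2), (1, "c", 5.0, 3), (2, "c", 1.0, 3)]
final_weight = {3: 0.5}
pushed_arcs, pushed_final, d = push_to_start(states, arcs, final_weight)
```

After pushing, the best continuation out of every state costs $0$, and every complete path weight is reduced by the same residual $d[q_0]$, which would have to be kept (for example as an initial weight) if string weights must be preserved exactly.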

* **Failure Transitions**

>||Consumes no symbol|Consumes symbol|
|-|-|-|
|Matches all|$\epsilon$|$\sigma$|
|Matches rest|$\phi$|$\rho$|

## 2.4. Weighted Finite State Transducers

* **WFST = WFSA + symbol-to-symbol mappings**

>$$ M = (Q, \Sigma, \Delta, E, q_0, F, \rho) $$

>$$s(e) \overset{i(e):o(e)/w(e)}{\longrightarrow}f(e) \;\;\;,\;\;\; \Sigma \text{ and } \Delta: \text{input and output alphabet}$$

* **Weighted Mapping**

>$$p \in P(x,y) \;\;\;,\;\;\; x = i(e_1) \dots i(e_{n_p}) \;\;\;,\;\;\; y=o(e_1) \dots o(e_{n_p})$$

>$$w(p) = (\otimes^{n_p}_{j=1} w(e_j)) \otimes \rho(f_p)$$

>$$[\![ T ]\!] (x,y) = \underset{p \in P(x,y)}{\bigoplus} w(p)$$

>* $[\![ T ]\!] (x,y)$: sum of all path weights along which $x$ is mapped to $y$

* **WFST: Mapping between Regular Languages**

>* $T_1$ and $T_2$: WFSAs $\rightarrow$ $L_{T_1}$ and $L_{T_2}$: Regular languages
>* $T$ maps strings $x \in L_{T_1}$ to $y \in L_{T_2}$ with weight $[\![ T ]\!] (x,y)$
>* Regular languages and context-free languages are **closed under (finite) transduction**

## 2.5. WFST Operations

* **Projection**

>* $T_1$: Projection on input $\rightarrow$ $L_{T_1}$: input language
>* $T_2$: Projection on output $\rightarrow$ $L_{T_2}$: output language

* **Composition**

>$$[\![ A \circ B ]\!] (x,z) = \underset{y}{\bigoplus} [\![ A ]\!] (x,y) \otimes [\![ B ]\!] (y,z)$$

* **Union**

>$$[\![ A \oplus B ]\!] (x,y) = [\![ A ]\!] (x,y) \oplus [\![ B ]\!] (x,y)$$

* **Concatenation**

>$$[\![ A \otimes B ]\!] (x,y) = \underset{x=x_1 x_2, y=y_1 y_2}{\bigoplus} [\![ A ]\!] (x_1,y_1) \otimes [\![ B ]\!] (x_2,y_2)$$

* **Closure**

>$$[\![ T^* ]\!] (x,y) = \overset{\infty}{\underset{n=0}{\bigoplus}} [\![ T^n ]\!] (x,y)$$

* **Disambiguation**

>* **Ambiguity:** multiple paths accept same input string
>* **Non-Functional:** multiple output paths for a single input string
>* **Disambiguation:** creating a new WFST that encodes only the best-scoring path of each input string, while maintaining the arc-level mapping between input and output symbols

* **Other Operations**

>* **Connect:** remove useless states and arcs
>* **Invert:** swaps input and output labels
>* **Reverse:** reverse input and output languages

* **Operational Complexity**

>* Operations on one automaton
>  * Reversal, Inversion, Projection, Connection $\rightarrow$ $O(|Q|+|E|)$
>  * Epsilon removal $\rightarrow$ **cubic**
>  * Determinization $\rightarrow$ **exponential**

>* Operations on 2 automata
>  * Composition, Intersection, Difference $\rightarrow$ $O((|Q_1|+|E_1|)(|Q_2|+|E_2|))$

# 3. Distances, Kernels, Semirings

## 3.1. String Distances

* **Symbol-to-Symbol Distance**

>$$d(x,y) = \bigg\{ \begin{matrix} 0 & y=x \\ d_r & y \neq x \end{matrix} \;\;\; \text{or} \;\;\; d(x,y) = \Bigg\{ \begin{matrix} 0 & y=x \\ d_r & y \neq x  \\ d_d & x=\epsilon \text{ or } y=\epsilon \end{matrix}$$

>* **Ambiguity** $\rightarrow$ choose minimum distance under all allowable alignments

* **Edit Distance Transducers** $T$

>* Assigns costs for edits (replacement, deletion, insertion)

>$$A \circ T \circ B \rightarrow \text{all alignments with all costs}$$

>* **Method 1:** shortest path on $A \circ T \circ B$
>* **Method 2:** input projection $\rightarrow$ epsilon removal $\rightarrow$ determinization (wrt tropical semiring) $\rightarrow$ single path

>  * But the actual symbol-to-symbol alignment can be lost
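
For two plain strings, the shortest-path cost over $A \circ T \circ B$ can be computed with the familiar dynamic programme, which is a $(\min, +)$ recursion over the alignment grid. The sketch below assumes a one-state edit transducer with constant costs; the parameter names `sub_cost`, `del_cost`, and `ins_cost` are illustrative.

```python
def edit_distance(a, b, sub_cost=1.0, del_cost=1.0, ins_cost=1.0):
    """Cost of the best alignment of a to b: min over paths, sum along a path."""
    # dist[i][j]: best cost of aligning a[:i] with b[:j]
    dist = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, len(b) + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            match = 0.0 if a[i - 1] == b[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j - 1] + match,   # x:y arc (match or replacement)
                dist[i - 1][j] + del_cost,    # x:epsilon arc (deletion)
                dist[i][j - 1] + ins_cost,    # epsilon:y arc (insertion)
            )
    return dist[len(a)][len(b)]
```

Each cell of the table plays the role of a state in the composed machine, and the three candidates in the `min` are the three kinds of arc in the edit transducer.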

* **Lattice & String**

>* **Method 1:** find every alignment of every string in $C$ to $B$ ($C \circ T \circ B$)
>* **Method 2:** find the single string in $C$ that aligns best to $B$ (ShortestPath $C \circ T \circ B$)
>* **Method 3:** find the cost of the best alignment of every string in $C$ to $B$
>* **Method 4:** use disambiguation algorithm (useful for Lattice-to-Lattice)

## 3.2. Kernels and Counting Transducers

* **Kernel Functions**

>* $\kappa (x,x')$: Measures the similarity between two strings
>* Typically **symmetric** & **positive**

* **Mercer Kernel**

>$$ \kappa (x,x') = \sum_{s \in \Sigma^*} w_s \phi_s (x) \phi_s (x')$$

>* $\phi_s (x)$: no. of times a substring $s$ occurs in a string $x$
>* Monotonicity constraints relaxed (or removed) / but subsequences in strings should be counted

* **Examples**

>* **Bag-of-Characters** kernel: $w_s=0$ for $|s|>1$
>* **Bag-of-Words** kernel: $w_s=0$ unless $s$ is bounded by white space
>* **All-subsequences** kernel: $w_s = 1$
>* **K-Spectrum** kernel: $\kappa(x,x') = \sum_{s\in \Sigma^k} \phi_s (x) \phi_s (x')$
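
As a small worked example, the K-Spectrum kernel on plain strings can be sketched with substring counts. The helper names are illustrative, and this direct count-based version is a stand-in for the transducer-based computation.

```python
from collections import Counter

def substring_counts(x, k):
    """phi_s(x) for every length-k substring s of x."""
    return Counter(x[i:i + k] for i in range(len(x) - k + 1))

def k_spectrum_kernel(x, y, k):
    """kappa(x, y) = sum over s in Sigma^k of phi_s(x) * phi_s(y)."""
    cx, cy = substring_counts(x, k), substring_counts(y, k)
    # Substrings absent from y contribute zero because Counter returns 0 for them
    return sum(n * cy[s] for s, n in cx.items())
```

For instance, with $k=2$ the strings `"abab"` and `"bab"` share the bigrams `ab` (counts 2 and 1) and `ba` (counts 1 and 1), so the kernel value is $2 \cdot 1 + 1 \cdot 1 = 3$.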

* **Kernels for Lattices**

>\begin{align}
\text{expected count: } c(A,s) &= \sum_{x \in L_A} P_A (x) \phi_s (x) \\
\text{lattice kernel: } \kappa (A,B) &= \sum _ {s \in \Sigma^*} c(A,s) c(B,s) \\
&= \sum_{s \in \Sigma^*} \sum_{x \in L_A} P_A (x) \phi_s (x) \sum_{x' \in L_B} P_B (x') \phi_s (x') \\
&= \sum_{x \in L_A} \sum_{x' \in L_B} P_A (x) P_B (x') \kappa(x,x')
\end{align}

>* Compares WFSAs as the **weighted similarity** of the strings in their languages
>* $A \cap B = \emptyset$ does not imply that $\kappa(A,B)=0$

* **Counting Transducers** (efficiently count n-gram in log semiring)

>$$ A \circ T1 (\text{or } T2) \rightarrow \text{Output Projection} \rightarrow \epsilon \;\text{Removal} \rightarrow \text{Determinization}$$

>* **Gappy N-Gram Kernels:** penalty $\lambda$ for each gap

## 3.3. Semirings

* **Tropical Semirings - Feature Vectors**

>$$s(e) \overset{i(e)/v(e)}{\longrightarrow} f(e) \;\;\;\Rightarrow\;\;\; w(e)=\theta \cdot v(e)$$

>* $v(e)$: unweighted n-dim feature vector
>* $\theta$: Parameter vector, applied to compute weights

>\begin{align}
\otimes &: \;\;\; v_3 = v_1 + v_2 &\Rightarrow \theta \cdot v_3 = w_1 \otimes w_2 \\
\oplus &: \;\;\; v_3 = \bigg\{ \begin{matrix} v_1 \;\;\; \text{if} \;\;\; \theta \cdot v_1 \leq \theta \cdot v_2 \\ v_2 \;\;\; \text{if} \;\;\; \theta \cdot v_2 < \theta \cdot v_1 \end{matrix} &\Rightarrow \theta \cdot v_3 = w_1 \oplus w_2
\end{align}

* **Transducer Composition**

>$$\underset{x,y}{\min}[\![ A \circ B ]\!] (x,y) =\underset{x,y}{\min} \underset{z}{\min} ( [\![ A ]\!](x,z) + [\![ B ]\!](z,y))$$

>$$x \in L_{A_1} \;\;\;,\;\;\; z \in L_{A_2} \cap L_{B_1} \;\;\;,\;\;\; y \in L_{B_2}$$

>* The weights of each component transducer have their own position in the feature vectors
>* $\Rightarrow$ The contribution of each transducer can be tracked

* **Bottleneck Semiring**

>* $\mathbb{K} = \mathbb{R} \cup \{ -\infty, \infty \}, \;\; \bar{0}=-\infty, \;\; \bar{1}=\infty$
>* $\otimes = \min$: **bottleneck** along any particular path
>* $\oplus = \max$: cost along the path with the most throughput
>* Measures maximal **throughput** or **capacity** through a network
>* Arc weights: analogous to **pipe widths**

* **Possibilistic Semiring**

>* $\mathbb{K} = [0,1], \;\; \bar{0}=0, \;\; \bar{1}=1$
>* $\otimes = \times$: probability of success of any particular sequence
>* $\oplus = \max$: best probability of success
>* Measures **maximal possibility** or **maximum reliability**

* **Formal Language Semiring**

>* $\mathbb{K} = P(\Sigma^*)$ (power set)$, \;\; \bar{0}=\emptyset, \;\; \bar{1}=\{\epsilon\}$
>* $\oplus = \cup$
>* $\otimes = \cdot\;$ (concatenation)
>* The distance from the start state to the final state: yields the language of the automaton

# 4. Applications of Weighted Automata

## 4.1. Acoustic Likelihoods

* **HMM Likelihood**

>$$P(O,X) = a_{x(0),x(1)} \prod^T_{t=1} b_{x(t)}(o_t) a_{x(t),x(t+1)}$$

* **Conditional Likelihood**

>$$[\![ A ]\!] (X) = P(O|X) = \prod^T_{t=1} b_{x(t)} (o_t)$$

* **HMM Likelihood with WFSAs in Tropical Semiring**

>$$\log P(O,X) = \log a_{x(0),x(1)} + \sum^T_{t=1} \big[ \log b_{x(t)}(o_t) + \log a_{x(t),x(t+1)} \big]$$

>$$\text{Joint Likelihood: } [\![ A \circ B ]\!] (X) = -\log P(O,X)$$

>\begin{align}
&A: \text{HMM observation distribution} &(t-1) \overset{n/-\log b_n(o_t)}{\longrightarrow} (t) \\
&B: \text{HMM transition probabilities} &(n) \overset{n'/-\log a_{n,n'}}{\longrightarrow} (n') 
\end{align}
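
For a fixed state sequence, the weight $[\![ A \circ B ]\!](X)$ is just the sum of the observation costs from $A$ and the transition costs from $B$. The sketch below assumes a hypothetical two-state HMM with discrete observations and explicit entry and exit states standing in for $x(0)$ and $x(T+1)$; all probabilities are invented.

```python
import math

def neg_log_joint(state_seq, obs_seq, trans, emit):
    """-log P(O, X) for X = x(0), x(1), ..., x(T), x(T+1) and O = o_1, ..., o_T."""
    T = len(obs_seq)
    assert len(state_seq) == T + 2  # includes entry state x(0) and exit state x(T+1)
    # Costs contributed by B: one transition arc per step, weight -log a
    cost = sum(-math.log(trans[state_seq[t], state_seq[t + 1]]) for t in range(T + 1))
    # Costs contributed by A: one observation arc per frame, weight -log b
    cost += sum(-math.log(emit[state_seq[t]][obs_seq[t - 1]]) for t in range(1, T + 1))
    return cost

# Hypothetical two-state HMM with discrete observations
trans = {("start", "s1"): 1.0, ("s1", "s1"): 0.6, ("s1", "s2"): 0.4,
         ("s2", "s2"): 0.5, ("s2", "end"): 0.5}
emit = {"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}}
X = ["start", "s1", "s1", "s2", "end"]
O = ["x", "x", "y"]
```

Here the transitions contribute $1.0 \times 0.6 \times 0.4 \times 0.5$ and the observations $0.9 \times 0.9 \times 0.8$, so the returned cost is $-\log 0.07776$.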

## 4.2. Language Models

* **Back-off Bigram Language Model**

>$$\hat{P}(w_j|w_i) = \bigg\{ \begin{matrix} p(w_i,w_j) & f(w_i,w_j) > C \\ \alpha(w_i) \hat{P}(w_j) & \text{otherwise} \end{matrix}$$

>$$p(w_i,w_j) = d(f(w_i,w_j)) \frac{f(w_i,w_j)}{f(w_i)}$$

>* $f(w_i,w_j)$: no. of times $w_i, w_j$ are observed
>* WFSAs: cannot implement **otherwise** $\rightarrow$ use **failure transition** (e.g. $\phi$)

* **A Small Back-off Bigram Language Model**

><img src = 'images/image09.png' width = 600>

>* Approximate implementation: uses $\epsilon$ instead of $\phi$
>* $\Rightarrow$ The back-off path can be taken even if a non-back-off path is present

* **Building a WFSA for a Bigram Language Model**

><img src = 'images/image10.png' width = 600>

>* **State** for every word & a unigram back-off state $\epsilon$
>* **Arc** for each pair of words $w$ and $w'$ for which $f(w,w')>C$
>* **Back-off Arc** from $w$ to $\epsilon$ / **Unigram Arc** from $\epsilon$ to $w'$
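
The difference between the exact $\phi$ implementation and the $\epsilon$ approximation can be sketched numerically. The parameters below are hypothetical and assumed to be already discounted and normalised; `cost_phi` and `cost_epsilon` are illustrative helpers that return tropical costs for a single word given its predecessor.

```python
import math

# Hypothetical model parameters (already discounted and normalised elsewhere)
bigram = {("the", "cat"): 0.05}           # p(w_i, w_j) for pairs with f(w_i, w_j) > C
backoff = {"the": 0.4}                    # alpha(w_i)
unigram = {"cat": 0.2, "dog": 0.01}       # P_hat(w_j)

def cost_phi(prev, word):
    """Exact back-off: the failure arc is taken only if no bigram arc matches."""
    if (prev, word) in bigram:
        return -math.log(bigram[prev, word])
    return -math.log(backoff[prev]) - math.log(unigram[word])

def cost_epsilon(prev, word):
    """Epsilon approximation in the tropical semiring: best of both paths."""
    via_backoff = -math.log(backoff[prev]) - math.log(unigram[word])
    if (prev, word) in bigram:
        return min(-math.log(bigram[prev, word]), via_backoff)
    return via_backoff
```

With these numbers the back-off path for "the cat" has probability $0.4 \times 0.2 = 0.08$, which beats the bigram probability $0.05$, so the $\epsilon$ version assigns a lower cost than the model intends. For unseen pairs such as "the dog" the two versions agree.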

## 4.3. Lexicons

><img src = 'images/image11.png' width = 600>

>* Maps **phone sequences** to **word sequences**

## 4.4. CI-to-CD Transducers

* **CI-to-CD transducer**

>* Maps **monophone sequence** to **triphone sequence**

>$$(p_1,p_2) \overset{p_3:t=p_1\text{-}p_2\text{+}p_3}{\longrightarrow} (p_2,p_3) $$

>* Silence models, monophones, etc. must be handled differently

* **WFST to Map Triphone Sequences to HMM State Sequences**

>* **State-clustering:** share GMM across triphones
>* Each triphone state has a pointer to one of the HMM states: $p(t_m)=s_n$
>* Transducer maps from HMM state sequences to triphone sequences

## 4.5. WFSA ASR

>$$\text{Acoustic Likelihoods} \overset{A \circ B}{\longrightarrow} \text{HMM State Sequences} \overset{S}{\rightarrow}
\text{Triphone Sequences} \overset{C}{\rightarrow} $$
>$$\text{Monophone Sequences} \overset{L}{\rightarrow} \text{Word Sequences} \overset{G}{\rightarrow} \text{Language Model Scores}$$

## 4.6. Tagging

* **Truecasing**

>$$B \circ T_\text{case} \circ G_\text{case}$$

>* $B$: acceptor for uncased sentence
>* $T_\text{case}$: maps between all cased and uncased variants found in the text
>* $G_\text{case}$: acceptor for cased N-gram LM
>* $P_\text{case}(W) = \prod^N_{j=1} P(W_j|W_{j-1},...,W_{j-N+1})$
>* `fstshortestpath` over $B \circ T_\text{case} \circ G_\text{case}$ to find $\hat{W} = \underset{W:lc(W)=w}{\text{argmax}} \; P_\text{case}(W)$

* **Part-of-Speech Tagging**

>$$\hat{t}^n_1 = \underset{t^n_1}{\text{argmax}} \; P(t^n_1|w^n_1) \approx \underset{t^n_1}{\text{argmax}} \prod^n_{i=1} P(w_i|t_i)P(t_i|t_{i-1})$$

>* $P(w_i|t_i)$ and $P(t_i|t_{i-1})$: estimated from annotated text collections


## 4.7. Keyboards

>* **Input modes:** tap typing and gesture typing
>* **Challenges:** fat finger errors / ambiguity
>* **Desired Features:** autocorrection / word suggestions / alternative suggestions
>* **Keyboard Transducers:** handle bi-key sequences in either input mode
>* **Transliteration:** Graphemic conversion (i.e., from one script to another)
>* **Romanization:** Mapping sequences in the target script to sequences of Latin symbols

# 5. Inference and Computation

* Estimate **automata weights** from **data**

## 5.1. MLE

* **Likelihood Function**

>$$Q(\mathcal{T}) = \prod_{x \in \mathcal{T}} Q(x) = \prod_{x \in \mathcal{X}} Q(x)^{\#(x)}$$

>* $\mathcal{X}$: random variable set
>* $\mathcal{T}$: finite training set (i.i.d. samples)

* **KL Divergence**

>$$D(P||Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$

>* Define the empirical distribution $\hat{P}(x) = \#(x)/M$, where $M = |\mathcal{T}|$ is the number of training samples
>* If $D(\hat{P}||Q_1) \leq D(\hat{P}||Q_2)$, then $Q_2(\mathcal{T}) \leq Q_1(\mathcal{T})$

* **MLE for Bigram LM**

>* Likelihood

>$$P(W) = \prod^N_{n=1} P(w_n|w_{n-1}) = \prod_w \prod_{w'} p(w'|w)^{\#_{w,w'}(W)}$$

>* MLE solution

>$$\tilde{p}(w'|w) = \frac{\hat{\#}_{w,w'}}{\hat{\#}_{w}} \;\;\;\text{where}\;\;\; \hat{\#}_{w,w'}=\sum_{W\in\mathcal{T}} \#_{w,w'}(W) \;\;\;\text{and}\;\;\; \hat{\#}_{w}=\sum_{w'} \hat{\#}_{w,w'}$$
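
The relative-frequency solution can be sketched directly from counts. The toy corpus and the explicit `<s>` / `</s>` boundary markers are assumptions made for the example.

```python
from collections import Counter

def bigram_mle(corpus):
    """Relative-frequency estimate p(w'|w) = #(w, w') / #(w) from sentences."""
    pair_counts = Counter()
    for sentence in corpus:
        pair_counts.update(zip(sentence, sentence[1:]))
    history_counts = Counter()
    for (w, _), n in pair_counts.items():
        history_counts[w] += n          # #(w) = sum over w' of #(w, w')
    return {(w, w2): n / history_counts[w] for (w, w2), n in pair_counts.items()}

# Hypothetical training set with explicit sentence boundary markers
corpus = [["<s>", "a", "b", "</s>"], ["<s>", "a", "a", "</s>"]]
p = bigram_mle(corpus)
```

Because $\#(w)$ is defined as the sum of the pair counts with history $w$, the estimates for each history sum to one by construction.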

* **MLE and WFSAs**

>* Extend to WFSAs from labelled training sets (**N-Gram**, **Part-of-Speech**, **PCFGs**, ...)
>* Unlabelled data $\rightarrow$ replace empirical counts with expected counts & use **EM algorithm**

* **Expectation Semiring** (E-step in EM)

>* **Step 1:** Define a **value function** $v(e) \in \mathbb{R}^n$ $\Rightarrow$ value of path $v(\pi)=\sum^n_{i=1} v(e_i)$
>* **Step 2:** Introduce paired weights $(p,v) \in \mathbb{R} \times \mathbb{R}^n$ 
>* **Step 3:** Define the semiring as 

>$$(p_1,v_1) \oplus (p_2,v_2) = (p_1+p_2, v_1+v_2)$$

>$$(p_1,v_1) \otimes (p_2,v_2) = (p_1p_2, p_1v_2+p_2v_1)$$

>$$\bar{0} = (0,0) \;\;\; , \;\;\; \bar{1} = (1,0)$$

>* **Step 4:** Suppose $\pi = \pi_1 \pi_2$ (prefix + suffix)

>$$P(\pi) = P(\pi_1) P(\pi_2) \;\;\; \text{and} \;\;\; v(\pi) = v(\pi_1) + v(\pi_2)$$

>\begin{align}
(P(\pi_1),P(\pi_1)v(\pi_1)) \otimes (P(\pi_2),P(\pi_2)v(\pi_2)) &= (P(\pi_1)P(\pi_2), P(\pi_1)P(\pi_2)(v(\pi_1)+v(\pi_2))) \\
&= (P(\pi), P(\pi)v(\pi)) \\
\bigoplus_{\pi \in \Pi} (P(\pi),P(\pi)v(\pi)) &= \left( \sum_{\pi \in \Pi} P(\pi), \sum_{\pi \in \Pi} P(\pi)v(\pi) \right) \\
&= \left( P(\Pi),E[v(\pi) \times 1_{\Pi}(\pi)] \right) \\
E[v(\pi)|\Pi] &= E[v(\pi) \times 1_{\Pi}(\pi)]/P(\Pi)
\end{align}
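
The semiring operations can be checked on a small example. This sketch assumes scalar values ($n=1$) and initialises each arc with the pair $(p, p \cdot v)$, so that $\otimes$ along a path accumulates $(P(\pi), P(\pi)v(\pi))$; the two paths and their probabilities are hypothetical.

```python
def e_plus(a, b):
    """(p1, v1) (+) (p2, v2) = (p1 + p2, v1 + v2)."""
    return (a[0] + b[0], a[1] + b[1])

def e_times(a, b):
    """(p1, v1) (x) (p2, v2) = (p1 p2, p1 v2 + p2 v1)."""
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

E_ZERO, E_ONE = (0.0, 0.0), (1.0, 0.0)

def path_weight(arcs):
    """(x)-product over arcs, each lifted to the pair (p, p * v)."""
    total = E_ONE
    for p, v in arcs:
        total = e_times(total, (p, p * v))
    return total

# Two hypothetical paths; each arc is (probability, value), with a scalar value
# such as 1 if the arc is the event being counted and 0 otherwise
paths = [[(0.5, 1.0), (0.4, 1.0)], [(0.5, 0.0), (0.6, 1.0)]]
total = E_ZERO
for arcs in paths:
    total = e_plus(total, path_weight(arcs))
mass, weighted_value = total
expected_value = weighted_value / mass
```

The first path has probability $0.2$ and value $2$, the second has probability $0.3$ and value $1$, so the $\oplus$-sum is $(0.5, 0.7)$ and the conditional expectation is $0.7 / 0.5 = 1.4$.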

## 5.2. MBR Training

* **Baum-Welch** (improves the likelihood of the correct answer)

>$$\underset{\lambda}{\text{argmax}} P_\lambda (O,W_{ref})$$

* **MBR: Minimum Bayes Risk estimation** (Generalization of Baum-Welch)

>$$\underset{\lambda}{\text{argmin}} \sum_{W'} L(W_{ref},W') P_\lambda (W'|O)$$

>* $L(W_{ref},W')$: loss function (e.g. no. of word errors)

* **Optimizing expected WER via sampling for speech recognition**

>* Acoustic model: stacked LSTM network
>* Produces logit vector sequence $z_t \in [0,1]^Q$

><img src = 'images/image13_.png' width = 500>

>* **Unrolled decoder graph:** $U(z) = S(z) \circ (C \circ L \circ G)$

>  * Input: $z$'s / Output: ASR word hypotheses $o(\pi)$

>* For each path $\pi$ with weight $w(\pi,z)$, the posterior is $P(\pi|z,\lambda)=\frac{w(\pi,z)}{\sum_{\pi'} w(\pi',z)}$ so that

>$$\nabla_z \log P(\pi|z) = \nabla_z \log w(\pi,z) - \mathbb{E}_{P(\pi|z)} \nabla_z \log w(\pi,z)$$

>* $\nabla_z \log P(\pi|z)$: needed for backpropagation of the gradient wrt $\lambda$
>* $\nabla_z \log w(\pi,z)$: $T\times Q$ matrix with $1$ at each $(t,q_t)$ in $\pi$ and $0$ elsewhere

* **Backward filtering - forward sampling**

>* **Sample $\pi_i \sim P(\pi|z,\lambda)$ from $U(z)$:**
>  * Push the $U(z)$'s weights to make it stochastic
>  * Ancestral sampling: sample edges from the initial state to create complete paths $\pi_i$
>  * For any path $\pi$, compute $L(\pi)=L(W_{ref},o(\pi))$

>* **Expected loss** (approximated using Monte Carlo approximation)

>$$\mathbb{E}_{P(\pi|z)} L(\pi) = \sum_\pi P(\pi|z) L(\pi) \approx \frac{1}{I} \sum^I_{i=1} L(\pi_i) = \overline{L(\pi_i)}$$

>* **MBR gradient**

>\begin{align}
\nabla_z \mathbb{E}_{P(\pi|z)} L(\pi) &= \sum_\pi P(\pi|z) L(\pi) \nabla_z \log P(\pi|z) \\
&= \mathbb{E}_{P(\pi|z)} L(\pi) \nabla_z \log w(\pi,z) - \mathbb{E}_{P(\pi|z)} L(\pi) \mathbb{E}_{P(\pi|z)} \nabla_z \log w(\pi,z) \\
&\approx \frac{I}{I-1} \overline{(L(\pi_i)-\overline{L(\pi_i)}) \nabla_z \log w(\pi_i,z)}
\end{align}

## 5.3. Distance Computation

* **Tropical Arithmetic and Dynamic Programming**

>* Addition and Multiplication

>$$x \oplus y = \min(x,y) \;\;\;,\;\;\; x \otimes y = x+y$$

>* Exponentiation

>$$(x \oplus y)^n = x^n \oplus y^n$$

>* Matrix and vector operations

>$$(u_1,u_2,u_3) \otimes (v_1,v_2,v_3)^T = u_1 \otimes v_1 \oplus u_2 \otimes v_2 \oplus u_3 \otimes v_3 = \min \{ u_1+v_1, u_2+v_2, u_3+v_3 \}$$

* **Shortest Paths in a Weighted Directed Graph**

>* **Adjacency matrix**

>$$D_G = [d_{i,j}] \;\;\;,\;\;\; d_{i,j}:\text{distance}$$

>* $D_G^{\otimes n-1}$: $n \times n$ matrix with entries in $\mathbb{R}_{\geq 0} \cup \{+\infty\}$
>* **Proposition:** the entry $[D_G^{\otimes n-1}]_{i,j}$ is the length of the shortest path from node $i$ to node $j$
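
The proposition can be illustrated with a tropical matrix power. This is a minimal sketch, assuming a zero diagonal ($d_{i,i}=0$), no negative cycles, and a hypothetical 4-node graph; a missing arc is encoded as $+\infty$.

```python
import math

INF = math.inf

def trop_matmul(A, B):
    """Tropical matrix product: entry (i, j) is min over k of A[i][k] + B[k][j]."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_pairs_shortest(D):
    """Tropical power D^(n-1), assuming zero diagonal and no negative cycles."""
    n = len(D)
    result = D
    for _ in range(n - 2):
        result = trop_matmul(result, D)
    return result

# Hypothetical 4-node graph; D[i][j] is the arc length, INF if there is no arc
D = [
    [0.0, 1.0, 4.0, INF],
    [INF, 0.0, 2.0, 6.0],
    [INF, INF, 0.0, 1.0],
    [INF, INF, INF, 0.0],
]
S = all_pairs_shortest(D)
```

The power $n-1$ suffices because a shortest path never needs to visit a node twice, so it uses at most $n-1$ arcs; the zero diagonal lets shorter paths survive each multiplication.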

## 5.4. Inference Functions

* **Inference function:** maps an observation to an explanation
* **The Few Inference Functions Theorem**

>* The no. of inference fn. grows polynomially in the complexity of the graphical model. However, very few of the $k^{n(l)^n}$ mappings are inference functions. (e.g. for HMM, there are at most $C_{k,l} n^{k(k+l)}$ explanations)


* **Example: HMM**

>* A set of HMM parameters $\theta$ specifies a particular inference fn.

>$$\hat{X} = \underset{X}{\text{argmax}} P_\theta (Y,X)$$

>* There are $k^{n(l)^n}$ possible functions mapping strings from $\Delta^n$ to $\Sigma^n$

>* **Joint Probability:**

>\begin{align}
P(W,T) &= \prod_i p(w_i|t_i)p(t_i|t_{i-1}) \\
&= \prod_{w,t,t'} p(w|t)^{\#_{w,t}(W,T)} p(t|t')^{\#_{t,t'}(T)} \\
\log P(W,T) &= \sum_{w,t,t'} \#_{w,t}(W,T) \log p(w|t) + \#_{t,t'}(T) \log p(t|t') \\
&= \sum_{w,t,t'} \theta_{w,t} \#_{w,t} (W,T) + \theta_{t,t'} \#_{t,t'}(T) = \theta \cdot \#(W,T)
\end{align}

>* **Condition:** $T_k = \text{argmax}_T P(W_k,T)$ for all $k=1,...,l^n$
>* Or equivalently,

>$$\theta \cdot [\#(W_k,T) - \#(W_k,T_k)] \leq 0 \;\;\; \forall T$$

* **More Challenging Inference Problem**

>* For a pair $(X',Y')$, find all parameters $\theta$ s.t. $X'=\text{argmax}_X P_{\theta} (X,Y')$