# 1. Probability of Error & Decision Boundaries

## 1.1. Bayes' Decision Rule

* **Expected Loss**

>$$\mathcal{L}_{\text{act}} = \int \Big[ \sum^K_{i=1} \mathcal{L} (f(\mathbf{x},\boldsymbol{\theta}),\omega_i) P(\omega_i|\mathbf{x}) \Big] p(\mathbf{x})d\mathbf{x}$$

>* $f(\mathbf{x},\boldsymbol{\theta})$: prediction given model parameters

* **Empirical Loss**

>$$\mathcal{L}_{\text{eval}} = \frac{1}{N} \sum^N_{i=1} \mathcal{L} \big( f(\mathbf{x}_i,\boldsymbol{\theta}),y_i \big)$$

>* Computed using **held-out** evaluation set
>* $y_i \in \{\omega_1,...,\omega_K\}$: labels
>* As $N\rightarrow \infty$, $\mathcal{L}_{\text{eval}} \rightarrow \mathcal{L}_{\text{act}}$ (for i.i.d. evaluation samples drawn from $p(\mathbf{x},\omega)$)
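
As a minimal sketch of the empirical loss, the snippet below averages a per-sample loss over a held-out set. The function names and the toy labels are illustrative assumptions, not part of the notes.

```python
def zero_one_loss(prediction, label):
    """0-1 loss: 0 for a correct decision, 1 otherwise."""
    return 0.0 if prediction == label else 1.0


def empirical_loss(predictions, labels, loss=zero_one_loss):
    """Average loss over N held-out (prediction, label) pairs."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(loss(p, y) for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical predictions f(x_i, theta) and reference labels y_i.
preds = ["w1", "w2", "w2", "w1"]
labels = ["w1", "w2", "w1", "w1"]
error_rate = empirical_loss(preds, labels)  # one mistake out of four
```

With the 0-1 loss the empirical loss is simply the error rate on the evaluation set.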

* **Bayes' Decision Rule** 

>* BDR: make the decision that minimizes the expected loss (equivalently, the probability of error under the 0-1 loss below)
>* Assume that $\mathcal{L}(\hat{\omega},\omega_i) = 0$ if $\hat{\omega}=\omega_i$ and $1$ otherwise

>\begin{align}
\hat{\omega} &= f(\mathbf{x}^\star, \boldsymbol{\theta}) \\
&= \text{argmin}_\omega \bigg\{ \sum^K_{i=1} \mathcal{L}(\omega,\omega_i) P(\omega_i|\mathbf{x}^\star) \bigg\} \\
&= \text{argmax}_\omega \bigg\{ P(\omega|\mathbf{x}^\star) \bigg\}
\end{align}

>* $P(\omega|\mathbf{x}^\star)$: **classifier**
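
The equivalence between minimum expected loss and maximum posterior can be checked numerically. This is a small sketch with made-up posterior values; `bayes_decision` is a hypothetical helper name.

```python
def bayes_decision(posteriors, loss):
    """Return the class with minimum expected loss.

    posteriors: dict mapping class -> P(class | x)
    loss: function loss(decision, true_class)
    """
    def expected_loss(decision):
        return sum(loss(decision, w) * p for w, p in posteriors.items())

    return min(posteriors, key=expected_loss)


def zero_one(decision, true_class):
    return 0.0 if decision == true_class else 1.0


# Assumed posteriors for a single test point x*.
posteriors = {"w1": 0.2, "w2": 0.5, "w3": 0.3}
decision = bayes_decision(posteriors, zero_one)
map_class = max(posteriors, key=posteriors.get)
```

Under the 0-1 loss the expected loss of deciding $\omega$ is $1 - P(\omega|\mathbf{x}^\star)$, so both rules pick the same class.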

## 1.2. Classifier

* **Types of Classifiers**

>* **Generative Models:** model the joint distribution $p(\mathbf{x},\omega;\boldsymbol{\theta})$
>  * Posterior: obtained from **Bayes' rule**

>$$P(\omega_i|\mathbf{x}^\star;\boldsymbol{\theta}) = \frac{p(\mathbf{x}^\star,\omega_i;\boldsymbol{\theta})}{\sum^K_{j=1} p(\mathbf{x}^\star,\omega_j;\boldsymbol{\theta})}$$

>* **Discriminative Models:** model the posterior $P(\omega|\mathbf{x}^\star;\boldsymbol{\theta})$

>* **Discriminant Functions:** model the mapping directly (no posterior)

* **Binary Classification** (assume $\omega_1 = 1, \omega_2 = -1$)

>* **Empirical Loss**

>$$\mathcal{L}_{\text{eval}} = P(\text{error}) = \frac{1}{2N} \sum^N_{i=1} |f(\mathbf{x}_i,\boldsymbol{\theta})-y_i|$$

>* **True Probability of Error**

><img src="images/image01.png" width=250>

>\begin{align}
P(\text{error}) &= P(\mathbf{x}\in\Omega_2,\omega_1) + P(\mathbf{x}\in\Omega_1,\omega_2) \\
&= P(\mathbf{x}\in\Omega_2|\omega_1)P(\omega_1) + P(\mathbf{x}\in\Omega_1|\omega_2)P(\omega_2) \\
&= \int_{\Omega_2} p(\mathbf{x}|\omega_1)P(\omega_1)d\mathbf{x} + \int_{\Omega_1} p(\mathbf{x}|\omega_2)P(\omega_2)d\mathbf{x} 
\end{align}

>* **$P(\text{error})$ marginalizes over joint distribution**

>$$P(\text{error}) = \int_{\Omega_2} p(\mathbf{x},\omega_1)d\mathbf{x} + \int_{\Omega_1} p(\mathbf{x},\omega_2)d\mathbf{x}$$

>* **Generative Model**

>$$P(\text{error}) = \int_{\Omega_2} p(\mathbf{x}|\omega_1)P(\omega_1)d\mathbf{x} + \int_{\Omega_1} p(\mathbf{x}|\omega_2)P(\omega_2)d\mathbf{x} $$

>* **Discriminative Model**

>$$P(\text{error}) = \int_{\Omega_2} P(\omega_1|\mathbf{x})p(\mathbf{x})d\mathbf{x} + \int_{\Omega_1} P(\omega_2|\mathbf{x})p(\mathbf{x})d\mathbf{x} $$
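
For a worked instance of the error integrals, consider two one-dimensional Gaussian class-conditional densities and a single threshold, so that $\Omega_1 = \{x < \text{threshold}\}$ and $\Omega_2 = \{x \geq \text{threshold}\}$. The means, variances and priors below are assumed values chosen for illustration.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a 1-D Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def error_probability(threshold, mu1, std1, prior1, mu2, std2, prior2):
    """P(error) with Omega_1 = {x < threshold}, Omega_2 = {x >= threshold}."""
    # Integral of p(x|w1) P(w1) over Omega_2: class w1 lands on the wrong side.
    miss_w1 = prior1 * (1.0 - normal_cdf(threshold, mu1, std1))
    # Integral of p(x|w2) P(w2) over Omega_1.
    miss_w2 = prior2 * normal_cdf(threshold, mu2, std2)
    return miss_w1 + miss_w2


# Symmetric toy problem: equal priors, unit variances, means at -1 and +1.
p_err_bayes = error_probability(0.0, -1.0, 1.0, 0.5, 1.0, 1.0, 0.5)
p_err_shifted = error_probability(0.5, -1.0, 1.0, 0.5, 1.0, 1.0, 0.5)
```

In this symmetric case the posteriors are equal at $x = 0$, so moving the threshold away from $0$ can only increase the error.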

* **Unequal Loss Function** 

>* **Loss**

>\begin{align}
\mathcal{L}(f(\mathbf{x}^\star,\boldsymbol{\theta})=\omega_2,\omega_1) &= \mathcal{C}_{21} \\ \mathcal{L}(f(\mathbf{x}^\star,\boldsymbol{\theta})=\omega_1,\omega_2) &= \mathcal{C}_{12}
\end{align}

>* **Bayes' Decision Rule**

>$$\hat{\omega} = \text{argmin}_\omega \bigg\{ \sum^2_{i=1} \mathcal{L}(\omega,\omega_i) P(\omega_i|\mathbf{x}^\star;\boldsymbol{\theta})\bigg\}$$

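>$$\frac{P(\omega_1|\mathbf{x}^\star;\boldsymbol{\theta})}{P(\omega_2|\mathbf{x}^\star;\boldsymbol{\theta})} = \frac{\mathcal{C}_{12}}{\mathcal{C}_{21}} \rightarrow \text{operating threshold}$$

>* Decide $\omega_1$ when the posterior ratio exceeds $\mathcal{C}_{12}/\mathcal{C}_{21}$, otherwise decide $\omega_2$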
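
The effect of unequal costs on the decision can be sketched as follows. The cost values are assumptions for illustration; `c21` is the cost of deciding $\omega_2$ when the truth is $\omega_1$, matching $\mathcal{C}_{21}$ above.

```python
def decide_binary(p_w1, c12, c21):
    """Minimum expected loss decision for two classes.

    p_w1: posterior P(w1 | x); P(w2 | x) = 1 - p_w1
    c12:  cost of deciding w1 when the truth is w2
    c21:  cost of deciding w2 when the truth is w1
    """
    p_w2 = 1.0 - p_w1
    risk_w1 = c12 * p_w2  # expected loss of deciding w1
    risk_w2 = c21 * p_w1  # expected loss of deciding w2
    return "w1" if risk_w1 < risk_w2 else "w2"


# Same posterior, different costs: missing w1 is five times as expensive.
equal_cost = decide_binary(0.3, c12=1.0, c21=1.0)
unequal_cost = decide_binary(0.3, c12=1.0, c21=5.0)
```

With the posterior ratio $0.3/0.7 \approx 0.43$, the decision flips to $\omega_1$ once the threshold $\mathcal{C}_{12}/\mathcal{C}_{21}$ drops to $0.2$.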

## 1.3. Parameter Estimation

* **MLE - Maximum Likelihood Estimation**

>* **Supervised:**

>\begin{align}
\hat{\boldsymbol{\theta}} &= \text{argmax}_{\boldsymbol{\theta}} \bigg\{ \log P(y_1,...,y_N|\mathbf{x}_1,...,\mathbf{x}_N;\boldsymbol{\theta}) \bigg\} \\
&= \text{argmax}_{\boldsymbol{\theta}} \bigg\{ \sum^N_{i=1} \log P(y_i|\mathbf{x}_i;\boldsymbol{\theta}) \bigg\}
\end{align}

>* **Unsupervised:**

>$$\hat{\boldsymbol{\theta}} = \text{argmax}_{\boldsymbol{\theta}} \bigg\{ \sum^N_{i=1} \log p(\mathbf{x}_i;\boldsymbol{\theta}) \bigg\}$$

* **Generative Models**

>$$P(\omega_i|\mathbf{x}^\star) \approx \frac{p(\mathbf{x}^\star,\omega_i;\boldsymbol{\theta})}{\sum^K_{j=1} p(\mathbf{x}^\star,\omega_j;\boldsymbol{\theta})} = \frac{p(\mathbf{x}^\star|\omega_i;\boldsymbol{\theta})P(\omega_i)}{\sum^K_{j=1} p(\mathbf{x}^\star|\omega_j;\boldsymbol{\theta})P(\omega_j)}$$

>$$\hat{\boldsymbol{\theta}}_i = \text{argmax}_{\boldsymbol{\theta}} \bigg\{ \sum_{j:y_j=\omega_i} \log p(\mathbf{x}_j|\omega_i;\boldsymbol{\theta}) \bigg\}$$

* **Multivariate Gaussian Class Conditional PDFs**

>$$p(\mathbf{x}|\omega_i;\boldsymbol{\theta}_i) = \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{\frac{d}{2}} |\boldsymbol{\Sigma}_i|^{\frac{1}{2}}} \exp \left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) \right)$$

>\begin{align}
\hat{\boldsymbol{\mu}}_i &= \frac{\sum_{j:y_j=\omega_i} \mathbf{x}_j}{\sum_{j:y_j=\omega_i} 1} \\
\hat{\boldsymbol{\Sigma}}_i &= \frac{\sum_{j:y_j=\omega_i} (\mathbf{x}_j-\hat{\boldsymbol{\mu}}_i)(\mathbf{x}_j-\hat{\boldsymbol{\mu}}_i)^T}{\sum_{j:y_j=\omega_i} 1}
\end{align}
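
The per-class estimates can be sketched in plain Python as below. The data points and labels are invented for illustration, and the helper names are hypothetical.

```python
def gaussian_mle(samples):
    """ML mean vector and covariance matrix (normalized by the sample count)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[k] for x in samples) / n for k in range(d)]
    cov = [
        [sum((x[r] - mean[r]) * (x[c] - mean[c]) for x in samples) / n
         for c in range(d)]
        for r in range(d)
    ]
    return mean, cov


def fit_class_gaussians(data, labels):
    """Estimate (mean, covariance) separately from the samples of each class."""
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append(x)
    return {w: gaussian_mle(xs) for w, xs in by_class.items()}


# Toy 2-D training set.
data = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [5.0, 5.0], [7.0, 5.0]]
labels = ["w1", "w1", "w1", "w1", "w2", "w2"]
params = fit_class_gaussians(data, labels)
```

Note that the second class here has a singular covariance estimate (no spread in the second dimension), which is a reminder that the ML covariance needs enough data per class before it can be inverted.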

## 1.4. Decision Boundary

* **Binary Classification**

>* Boundary occurs when class posteriors are the same

>$$\log P(\omega_1|\mathbf{x};\boldsymbol{\theta}) = \log P(\omega_2|\mathbf{x};\boldsymbol{\theta})$$

>* For generative classifier:

>$$\log (P(\omega_1)p(\mathbf{x}|\omega_1;\boldsymbol{\theta}_1)) = \log (P(\omega_2)p(\mathbf{x}|\omega_2;\boldsymbol{\theta}_2))$$

* **Multivariate Gaussian**

>* General: **Hyper-quadratic** decision boundary
>* $\boldsymbol{\Sigma}_1=\boldsymbol{\Sigma}_2$ $\rightarrow$ the quadratic term vanishes ($\mathbf{A}=\mathbf{0}$), giving a **Linear** decision boundary

>$$\mathbf{x}^T\mathbf{Ax} + \mathbf{b}^T\mathbf{x} + c = 0$$

>* $\mathbf{A} = \boldsymbol{\Sigma}^{-1}_1 - \boldsymbol{\Sigma}^{-1}_2$
>* $\mathbf{b} = 2(\boldsymbol{\Sigma}^{-1}_2 \boldsymbol{\mu}_2 - \boldsymbol{\Sigma}^{-1}_1 \boldsymbol{\mu}_1)$
>* $c=\boldsymbol{\mu}^T_1 \boldsymbol{\Sigma}^{-1}_1 \boldsymbol{\mu}_1 - \boldsymbol{\mu}^T_2 \boldsymbol{\Sigma}^{-1}_2 \boldsymbol{\mu}_2 - \log \left( \frac{|\boldsymbol{\Sigma}_2|}{|\boldsymbol{\Sigma}_1|} \right) - 2\log \left( \frac{P(\omega_1)}{P(\omega_2)} \right)$
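
The coefficients can be sanity-checked in one dimension, where $\boldsymbol{\Sigma}_i$ reduces to a variance and the boundary is the set of roots of $ax^2 + bx + c = 0$. This is a sketch with assumed parameter values; the function names are illustrative.

```python
import math


def boundary_coefficients(mu1, var1, p1, mu2, var2, p2):
    """1-D versions of A, b and c from the formulas above."""
    a = 1.0 / var1 - 1.0 / var2
    b = 2.0 * (mu2 / var2 - mu1 / var1)
    c = (mu1 ** 2 / var1 - mu2 ** 2 / var2
         - math.log(var2 / var1) - 2.0 * math.log(p1 / p2))
    return a, b, c


def boundary_points(a, b, c):
    """Real roots of a x^2 + b x + c = 0 (linear when a is zero)."""
    if abs(a) < 1e-12:
        return [] if abs(b) < 1e-12 else [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])


def log_joint(x, mu, var, prior):
    """log( P(w) p(x|w) ) for a 1-D Gaussian class-conditional density."""
    return math.log(prior) - 0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var


# Equal variances: a = 0 and the boundary is the single point x = 0.
linear = boundary_points(*boundary_coefficients(-1.0, 1.0, 0.5, 1.0, 1.0, 0.5))
# Unequal variances: two boundary points.
quadratic = boundary_points(*boundary_coefficients(0.0, 1.0, 0.5, 3.0, 4.0, 0.5))
```

At every returned point the two log joint densities agree, which is exactly the boundary condition stated for the generative classifier.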

# 2. Graphical Models and Conditional Independence

* **Structured distribution:** written as a product of simpler factors $\rightarrow$ encodes **conditional independence (CI)**
* **Graphical Models:** nodes are random variables / edges connect variables for which no CI statement holds
* **Inference in Graphical Models:** use factorization to reduce computational cost

## 2.1. Bayesian Networks 

* **Bayesian Networks:** directed acyclic graph

* **Factorization**

>$$p(X_1,...,X_d) = \prod^d_{i=1} p(X_i|PA^\mathcal{G}_{X_i})$$
>* $PA^\mathcal{G}_{X_i}$: parents of $X_i$ in $\mathcal{G}$

* **Conditional Independencies**

>$$X_i \perp ND^\mathcal{G}_{X_i} | PA^\mathcal{G}_{X_i}$$
>* $ND^\mathcal{G}_{X_i}$: non-descendants of $X_i$ in $\mathcal{G}$
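
A small chain $A \rightarrow B \rightarrow C$ illustrates both the factorization and the implied independence $C \perp A \,|\, B$. The conditional probability tables are made-up numbers.

```python
from itertools import product

# Assumed conditional probability tables for binary variables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p_c_given_b[b][c]


def joint(a, b, c):
    """p(A, B, C) as the product of each node given its parents."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]


def p_c_given_ab(c, a, b):
    """P(C = c | A = a, B = b), computed from the joint."""
    return joint(a, b, c) / sum(joint(a, b, other) for other in (0, 1))


total = sum(joint(a, b, c) for a, b, c in product((0, 1), repeat=3))

# C has parent B and non-descendant A, so conditioning on A changes nothing.
max_gap = max(
    abs(p_c_given_ab(c, a, b) - p_c_given_b[b][c])
    for a, b, c in product((0, 1), repeat=3)
)
```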

## 2.2. Markov Networks

* **Markov Networks:** undirected graphical models

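* **Gaussian with Sparse Precision Matrix** ($\lambda_{i,j}$: elements of $\boldsymbol{\Sigma}^{-1}$, with $\lambda_{i,j} \neq 0$ only for $(i,j)\in\mathcal{E}$)

>\begin{align}
p(X_1,...,X_d) &= \frac{1}{\sqrt{(2\pi)^d|\boldsymbol{\Sigma}|}} \exp \bigg\{ -\frac{1}{2} (X_1,...,X_d)^T \boldsymbol{\Sigma}^{-1} (X_1,...,X_d) \bigg\} \\
&\propto \exp \bigg\{ -\frac{1}{2} \sum_{(i,j)\in \mathcal{E}} \lambda_{i,j} X_i X_j\bigg\} = \prod_{(i,j)\in \mathcal{E}} \exp \bigg\{ -\frac{1}{2} \lambda_{i,j} X_i X_j \bigg\}
\end{align}

>* Zero-mean case; the sum runs over all ordered pairs with $\lambda_{i,j} \neq 0$, including the diagonal terms $(i,i)$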

* **Positive Potential Functions** and **Cliques**

>$$\phi_1(\mathbf{D}_1),...,\phi_k(\mathbf{D}_k)$$

>* $\mathbf{D}_i$: forms a clique of $\mathcal{G}$
>* **Clique:** fully connected subset of nodes

* **Factorization**

>$$p(X_1,...,X_d) = Z^{-1} \prod^k_{i=1} \phi_i (\mathbf{D}_i)$$

>* **Partition function** (or normalizing constant): $Z = \sum_{X_{1:d}} \prod^k_{i=1} \phi_i (\mathbf{D}_i)$

* **Conditional Independencies**

>$$A \perp B | C$$
>* Such that $C$ separates $A$ from $B$ in $\mathcal{G}$ (i.e. $C$ blocks all paths in $\mathcal{G}$ between $A$ and $B$)
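
For a tiny undirected chain $X_1 - X_2 - X_3$ the partition function can be computed by brute force, and the separation property $X_1 \perp X_3 \,|\, X_2$ can be verified directly. The pairwise potential is an arbitrary positive function chosen for illustration.

```python
from itertools import product

states = (0, 1)


def phi(u, v):
    """Assumed positive pairwise potential that favours equal neighbours."""
    return 2.0 if u == v else 1.0


def unnormalized(x1, x2, x3):
    """Product of clique potentials for the cliques {X1, X2} and {X2, X3}."""
    return phi(x1, x2) * phi(x2, x3)


# Partition function: sum of the potential product over all joint states.
Z = sum(unnormalized(*x) for x in product(states, repeat=3))


def prob(x1, x2, x3):
    return unnormalized(x1, x2, x3) / Z


def conditional_independence_gap():
    """Largest |p(x1, x3 | x2) - p(x1 | x2) p(x3 | x2)| over all states."""
    gap = 0.0
    for x2 in states:
        p_x2 = sum(prob(a, x2, b) for a in states for b in states)
        for x1, x3 in product(states, repeat=2):
            p13 = prob(x1, x2, x3) / p_x2
            p1 = sum(prob(x1, x2, b) for b in states) / p_x2
            p3 = sum(prob(a, x2, x3) for a in states) / p_x2
            gap = max(gap, abs(p13 - p1 * p3))
    return gap


gap = conditional_independence_gap()
```

Brute-force enumeration costs $O(|\text{states}|^d)$, which is why inference algorithms exploit the factorization instead.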


# 3. Latent Variable and Sequence Models

## 3.1. Latent Variable Models

* **Latent Variables**

>* Do not have to have any meaning
>* Never observed in test / possibly in training
>* Marginalized over to get probabilities

><img src="images/image02.png" width=400>

>* **Discrete (mixture models, HMMs):** $\sum^M_{m=1} P(c_m)p(\mathbf{x}|c_m)$
>* **Continuous (factor-analysis):** $\int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z}$

* **Gaussian Mixture Models**

>$$p(\mathbf{x}) = \sum^M_{m=1} P(c_m)p(\mathbf{x}|c_m) = \sum^M_{m=1} P(c_m) \mathcal{N}(\mathbf{x};\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)$$

>* component **prior** $\times$ component **distribution**
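
As a one-dimensional sketch of the mixture density, the code below sums component prior times component Gaussian. The weights, means and variances are assumed values.

```python
import math


def gaussian_pdf(x, mean, var):
    """1-D Gaussian density N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """p(x) = sum_m P(c_m) N(x; mu_m, var_m)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("component priors must sum to one")
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Two equally weighted unit-variance components centred at -1 and +1.
density_at_zero = gmm_pdf(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```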

* **Factor Analysis**

>* Can be viewed as: (1) Low-dim manifold representation or (2) compact covariance matrix for multivariate Gaussians

>$$p(\mathbf{x}) = \int p(\mathbf{x}|\mathbf{z})p(\mathbf{z})d\mathbf{z} = \mathcal{N}(\mathbf{x};\mathbf{0},\mathbf{CC}^T + \boldsymbol{\Sigma}_{\text{diag}})$$

>* $p(\mathbf{z}) = \mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})$: low-dim subspace representation
>* $p(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x};\mathbf{Cz},\boldsymbol{\Sigma}_{\text{diag}})$ where $\mathbf{C}$: loading matrix / $\boldsymbol{\Sigma}_{\text{diag}}$: diagonal covariance matrix
>* If $\boldsymbol{\Sigma}_{\text{diag}} = \sigma^2 \mathbf{I}$ $\rightarrow$ probabilistic PCA

## 3.2. Expectation Maximization

* **Simple EM for GMM**

>* Make initial guess of the parameters $\boldsymbol{\theta}^{(0)}$
>* Repeat
>  * Assign each observation to a component using **Bayes' decision rule**
>  * Update the parameters $\boldsymbol{\theta}^{(k+1)}$

* **Jensen's Inequality**

>$$f \left( \sum^M_{m=1} \lambda_m x_m \right) \geq \sum^M_{m=1} \lambda_m f(x_m)$$

>* $f(\cdot)$: any concave function, with weights $\lambda_m \geq 0$ and $\sum_m \lambda_m = 1$
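
A quick numerical check of the inequality, using the concave function $\log$ (the one used in the EM derivation). The weights and points are arbitrary illustrative values.

```python
import math


def jensen_gap(weights, points, f=math.log):
    """f(sum_m w_m x_m) - sum_m w_m f(x_m); non-negative for concave f."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be non-negative and sum to one")
    lhs = f(sum(w * x for w, x in zip(weights, points)))
    rhs = sum(w * f(x) for w, x in zip(weights, points))
    return lhs - rhs


gap = jensen_gap([0.2, 0.3, 0.5], [1.0, 4.0, 10.0])
# With a single point the two sides coincide.
no_gap = jensen_gap([1.0], [3.0])
```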

* **EM for Discrete Latent Variables**

>\begin{align}
\mathcal{L}(\boldsymbol{\theta}^{(k+1)}) - \mathcal{L}(\boldsymbol{\theta}^{(k)}) &= \sum^N_{i=1} \log \left( \frac{p(\mathbf{x}_i;\boldsymbol{\theta}^{(k+1)})}{p(\mathbf{x}_i;\boldsymbol{\theta}^{(k)})} \right) \\
&= \sum^N_{i=1} \log \left( \frac{1}{p\left(\mathbf{x}_i|\boldsymbol{\theta}^{(k)}\right)} \sum^M_{m=1} \left( p(\mathbf{x}_i,c_m|\boldsymbol{\theta}^{(k+1)})\right)\right) \\
&= \sum^N_{i=1} \log \left( \frac{1}{p\left(\mathbf{x}_i|\boldsymbol{\theta}^{(k)}\right)} \sum^M_{m=1} \left( \frac{P(c_m|\mathbf{x}_i,\boldsymbol{\theta}^{(k)}) p(\mathbf{x}_i,c_m|\boldsymbol{\theta}^{(k+1)})}{P(c_m|\mathbf{x}_i,\boldsymbol{\theta}^{(k)})} \right)\right) \\
&\geq \sum^N_{i=1} \sum^M_{m=1} P(c_m|\mathbf{x}_i,\boldsymbol{\theta}^{(k)}) \log \left( \frac{p(\mathbf{x}_i,c_m|\boldsymbol{\theta}^{(k+1)})}{p\left(\mathbf{x}_i|\boldsymbol{\theta}^{(k)}\right)P(c_m|\mathbf{x}_i,\boldsymbol{\theta}^{(k)})} \right) \\
&= \mathcal{Q}(\boldsymbol{\theta}^{(k)},\boldsymbol{\theta}^{(k+1)}) - \mathcal{Q}(\boldsymbol{\theta}^{(k)},\boldsymbol{\theta}^{(k)})
\end{align}

* **Auxiliary Function**

>$$\mathcal{Q}(\boldsymbol{\theta}^{(k)},\boldsymbol{\theta}^{(k+1)}) = \sum^N_{i=1} \sum^M_{m=1} P(c_m|\mathbf{x}_i,\boldsymbol{\theta}^{(k)}) \log \left(p(\mathbf{x}_i,c_m|\boldsymbol{\theta}^{(k+1)}) \right)$$

* **Continuous Auxiliary Functions**

>$$\mathcal{Q}(\boldsymbol{\theta}^{(k)},\boldsymbol{\theta}^{(k+1)}) = \int p\left(\mathbf{Z}|\mathbf{X},\boldsymbol{\theta}^{(k)}\right) \log \left( p\left(\mathbf{X},\mathbf{Z}|\boldsymbol{\theta}^{(k+1)}\right) \right) d\mathbf{Z}$$
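
The bound above implies that each EM iteration cannot decrease the log-likelihood. The following is a minimal sketch for a one-dimensional GMM with soft component posteriors; the data, the initial parameters and the variance floor are assumptions for illustration (the floor is not triggered on this data).

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-6):
    # E-step: component posteriors P(c_m | x_i, theta^(k)).
    resp = []
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    # M-step: maximize the auxiliary function Q with respect to theta^(k+1).
    new_w, new_m, new_v = [], [], []
    for m_idx in range(len(weights)):
        occ = sum(r[m_idx] for r in resp)
        mean = sum(r[m_idx] * x for r, x in zip(resp, data)) / occ
        var = sum(r[m_idx] * (x - mean) ** 2 for r, x in zip(resp, data)) / occ
        new_w.append(occ / len(data))
        new_m.append(mean)
        new_v.append(max(var, var_floor))
    return new_w, new_m, new_v


data = [-2.1, -1.9, -2.0, -1.5, 1.8, 2.2, 2.0, 2.5]
params = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])  # initial guess theta^(0)
history = [log_likelihood(data, *params)]
for _ in range(10):
    params = em_step(data, *params)
    history.append(log_likelihood(data, *params))
```

Replacing the soft posteriors with a hard assignment to the most likely component gives the "simple EM" scheme listed at the start of this section.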

## 3.3. Hidden Markov Models

* **Discrete Kalman Filters**

><img src="images/image03.png" width=300>

>$$\mathbf{z}_t = \mathbf{Az}_{t-1} + \boldsymbol{\nu}_t \;\;\;,\;\;\; \mathbf{x}_t = \mathbf{Cz}_t + \boldsymbol{\epsilon}_t$$

>* $\boldsymbol{\nu}_t \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{\boldsymbol{\nu}})$ and $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma}_{\boldsymbol{\epsilon}})$

>\begin{align}
p(\mathbf{x}_t|\mathbf{x}_{1:t-1}) &= \int p(\mathbf{x}_t|\mathbf{z}_t) p(\mathbf{z}_t|\mathbf{x}_{1:t-1}) d\mathbf{z}_t \\
&= \int p(\mathbf{x}_t|\mathbf{z}_t) \int p(\mathbf{z}_t|\mathbf{z}_{t-1}) p(\mathbf{z}_{t-1}|\mathbf{x}_{1:t-1}) d\mathbf{z}_{t-1} d\mathbf{z}_t \\
&= \int p(\mathbf{x}_t|\mathbf{z}_t) \int p(\mathbf{z}_t|\mathbf{z}_{t-1}) \frac{p(\mathbf{x}_{t-1}|\mathbf{z}_{t-1}) p(\mathbf{z}_{t-1}|\mathbf{x}_{1:t-2})}{p(\mathbf{x}_{t-1}|\mathbf{x}_{1:t-2})} d\mathbf{z}_{t-1} d\mathbf{z}_t
\end{align}
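
Because every term in the recursion is Gaussian, the integrals have closed forms. The sketch below shows one predict and update step for the scalar case, where $\mathbf{A}$, $\mathbf{C}$ and the noise covariances reduce to the numbers `a`, `c`, `q` and `r`; the parameter names and values are assumptions for illustration.

```python
def kalman_step(mean, var, x, a, c, q, r):
    """One scalar Kalman filter step.

    (mean, var): parameters of p(z_{t-1} | x_{1:t-1})
    x:           new observation x_t
    a, c:        scalar state and observation coefficients
    q, r:        state and observation noise variances
    Returns the parameters of p(z_t | x_{1:t}) and of p(x_t | x_{1:t-1}).
    """
    # Predict: p(z_t | x_{1:t-1}), integrating out z_{t-1}.
    pred_mean = a * mean
    pred_var = a * a * var + q
    # Predictive distribution of the observation, integrating out z_t.
    obs_mean = c * pred_mean
    obs_var = c * c * pred_var + r
    # Update: condition on x_t (Bayes' rule on two Gaussians).
    gain = pred_var * c / obs_var
    new_mean = pred_mean + gain * (x - obs_mean)
    new_var = (1.0 - gain * c) * pred_var
    return new_mean, new_var, obs_mean, obs_var


# Static state (a = 1, q = 0) observed with unit noise, prior N(0, 1).
new_mean, new_var, obs_mean, obs_var = kalman_step(
    mean=0.0, var=1.0, x=2.0, a=1.0, c=1.0, q=0.0, r=1.0
)
```

Here the prior and the observation are equally uncertain, so the updated mean lands halfway between them and the variance halves.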

* **Hidden Markov Models**

><img src="images/image04.png" width=250>

>* $q_t$: latent variables / discrete state-space
>* States - emitting or non-emitting
>* **Conditional independence:** $P(q_t|q_{0:t-1}) = P(q_t|q_{t-1})$ and $p(\mathbf{x}_t|\mathbf{x}_{1:t-1}, q_{0:t}) = p(\mathbf{x}_t|q_t)$

* **Likelihood**

>$$p(\mathbf{x}_{1:T}) = \sum_{\mathbf{q}\in \mathbf{Q}_T} P(\mathbf{q})p(\mathbf{x}_{1:T}|\mathbf{q}) = \sum_{\mathbf{q}\in \mathbf{Q}_T} P(q_0) \prod^T_{t=1} P(q_t|q_{t-1})p(\mathbf{x}_t|q_t)$$

* **Parameters**

>* **Transition matrix:** $\mathbf{A}$ $\rightarrow$ $a_{ij}=P(q_t=s_j|q_{t-1}=s_i)$
>* **State output probability:** $b_j(\mathbf{x}_t) = p(\mathbf{x}_t|q_t=s_j)$
>* Parameters usually estimated using **EM**

## 3.4. Viterbi Algorithm

* **Viterbi Algorithm**

>$$p(\mathbf{x}_{1:T}) = \sum_{\mathbf{q}\in\mathbf{Q}_T} p(\mathbf{x}_{1:T},\mathbf{q}) \approx p(\mathbf{x}_{1:T},\hat{\mathbf{q}})$$

>* $\hat{\mathbf{q}} = \text{argmax}_{\mathbf{q}\in\mathbf{Q}_T} p(\mathbf{x}_{1:T},\mathbf{q})$
>* Method: **extend partial paths in time** OR **best partial path to a state/time**
>* **Total cost:** the full likelihood log-sums the costs of all paths, $\log(\exp(a)+\exp(b))$; Viterbi keeps only the best path, $\max(a,b)$

* **Formulation**

>* **Initialization**
>  * $\phi_1(0) = 0.0$, $\phi_j(0) = \log (0)$ for $j > 1$, $\phi_1(t) = \log (0)$ for $t \geq 1$

>* **Recursion**
>  * for $t=1,...,T$ / for $j=2,...,N-1$
>  * $\phi_j(t) = \max_{1\leq k < N} \{ \phi_k (t-1) + \log(a_{kj}) \} + \log (b_j(\mathbf{x}_t))$

>* **Termination**
>  * $\log (p(\mathbf{x}_{1:T},\hat{\mathbf{q}})) = \max_{1 < k < N} \{ \phi_k(T) + \log(a_{kN}) \}$
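
The recursion can be sketched in the log domain as follows. States are 0-indexed here, so state `0` plays the role of the non-emitting entry state $s_1$ and state `n - 1` the non-emitting exit state $s_N$. The transition matrix and output probabilities are a hypothetical left-to-right toy model, and the back-pointers are an addition needed to recover $\hat{\mathbf{q}}$.

```python
import math

LOG_ZERO = float("-inf")


def safe_log(p):
    return math.log(p) if p > 0 else LOG_ZERO


def viterbi(trans, emit_probs):
    """Return (log p(x_{1:T}, q_hat), best emitting-state sequence).

    trans[k][j]      = a_kj
    emit_probs[t][j] = b_j(x_{t+1}); entries for states 0 and n-1 are unused
    """
    n = len(trans)
    phi = [0.0] + [LOG_ZERO] * (n - 1)  # initialization: phi_j(0)
    back = []
    for frame in emit_probs:
        new_phi = [LOG_ZERO] * n  # entry state stays at log(0) for t >= 1
        ptr = [0] * n
        for j in range(1, n - 1):
            best_k = max(range(n - 1),
                         key=lambda k: phi[k] + safe_log(trans[k][j]))
            new_phi[j] = (phi[best_k] + safe_log(trans[best_k][j])
                          + safe_log(frame[j]))
            ptr[j] = best_k
        phi = new_phi
        back.append(ptr)
    # Termination: best emitting state followed by the exit transition.
    last = max(range(1, n - 1),
               key=lambda k: phi[k] + safe_log(trans[k][n - 1]))
    best_log_prob = phi[last] + safe_log(trans[last][n - 1])
    # Trace back; back[0] only points at the entry state, so it is skipped.
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    path.reverse()
    return best_log_prob, path


trans = [
    [0.0, 1.0, 0.0, 0.0],  # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0],  # exit state
]
emit_probs = [
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 0.1, 0.9, 0.0],
]
best_log_prob, best_path = viterbi(trans, emit_probs)
```

For this model only two state sequences reach the exit state, and the recursion picks the more likely one.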

## 3.5. Forward-Backward Algorithm

* **Forward Probability**

>\begin{align}
\alpha_j(t) &= \log(p(\mathbf{x}_{1:t}, q_t=s_j)) \\
&= \log \left( \sum^N_{k=1} \exp \left( \alpha_k(t-1) + \log(a_{kj}) \right) \right) + \log (b_j(\mathbf{x}_t))
\end{align}

* **Backward Probability**

>\begin{align}
\beta_j(t) &= \log (p(\mathbf{x}_{t+1:T}|q_t=s_j)) \\
&= \log \left( \sum^N_{k=1} \exp \left( \beta_k(t+1) + \log(a_{jk}) + \log(b_k(\mathbf{x}_{t+1})) \right) \right)
\end{align}

* **Posterior**

>\begin{align}
P(q_t=s_j|\mathbf{x}_{1:T}) &= \frac{\exp (\alpha_j(t)+\beta_j(t))}{Z} \\
Z &= \sum^N_{i=1} \exp(\alpha_i(t) + \beta_i(t))
\end{align}
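
A log-domain sketch of the forward and backward recursions and the resulting state posteriors is given below. It assumes the same entry/exit state convention as the Viterbi formulation (0-indexed, state `0` entry and state `n - 1` exit), takes $\beta_j(T) = \log(a_{jN})$ as the backward initialization, and uses a hypothetical toy model; none of these choices are fixed by the notes.

```python
import math

LOG_ZERO = float("-inf")


def safe_log(p):
    return math.log(p) if p > 0 else LOG_ZERO


def log_sum_exp(values):
    m = max(values)
    if m == LOG_ZERO:
        return LOG_ZERO
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_backward(trans, emit_probs):
    """Return (state posteriors per frame, log p(x_{1:T}))."""
    n, T = len(trans), len(emit_probs)
    emitting = range(1, n - 1)
    log_a = [[safe_log(p) for p in row] for row in trans]
    log_b = [[safe_log(p) for p in row] for row in emit_probs]

    alpha = [[LOG_ZERO] * n for _ in range(T + 1)]
    alpha[0][0] = 0.0  # start in the entry state
    for t in range(1, T + 1):
        for j in emitting:
            alpha[t][j] = log_sum_exp(
                [alpha[t - 1][k] + log_a[k][j] for k in range(n - 1)]
            ) + log_b[t - 1][j]

    beta = [[LOG_ZERO] * n for _ in range(T + 1)]
    for j in emitting:
        beta[T][j] = log_a[j][n - 1]  # assumed: must leave via the exit state
    for t in range(T - 1, 0, -1):
        for j in emitting:
            # Transition j -> k, then emit x_{t+1} from state k.
            beta[t][j] = log_sum_exp(
                [beta[t + 1][k] + log_a[j][k] + log_b[t][k] for k in emitting]
            )

    posteriors = []
    for t in range(1, T + 1):
        scores = [alpha[t][j] + beta[t][j] for j in emitting]
        log_z = log_sum_exp(scores)
        posteriors.append([math.exp(s - log_z) for s in scores])
    log_likelihood = log_sum_exp([alpha[T][j] + beta[T][j] for j in emitting])
    return posteriors, log_likelihood


trans = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0],
]
emit_probs = [
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 0.1, 0.9, 0.0],
]
posteriors, log_likelihood = forward_backward(trans, emit_probs)
```

The normalizer $Z$ is the same at every frame and equals the sequence likelihood $p(\mathbf{x}_{1:T})$.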

## 3.6. Conditional Random Fields

* **Maximum Entropy Markov Model**

>\begin{align}
P(q_{0:T}|\mathbf{x}_{1:T}) &= \prod^T_{t=1} P(q_t|q_{t-1},\mathbf{x}_t) \\
P(q_t|q_{t-1},\mathbf{x}_t) &= \frac{1}{Z_t} \exp \left( \sum^D_{i=1} \lambda_i f_i (q_t,q_{t-1},\mathbf{x}_t) \right)
\end{align}

>* **Extend to complete sequence**

>$$P(q_{0:T}|\mathbf{x}_{1:T}) = \frac{1}{Z} \exp \left( \sum^D_{i=1} \lambda_i f_i (q_{0:T},\mathbf{x}_{1:T}) \right)$$

* **Simple Linear Chain CRF**

><img src="images/image05.png" width=250>

>$$P(q_{0:T}|\mathbf{x}_{1:T}) = \frac{1}{Z} \exp \left( \sum^T_{t=1} \left( \sum^{D_t}_{i=1} \lambda_i^t f_i (q_t,q_{t-1}) + \sum^{D_a}_{i=1} \lambda^a_i f_i (q_t,\mathbf{x}_t) \right) \right)$$

>* $D_t$: # transition style features with parameters $\boldsymbol{\lambda}^t$
>* $D_a$: # acoustic style features with parameters $\boldsymbol{\lambda}^a$
>* Directly related to unnormalized HMM parameters

* **Linear Chain CRF**

><img src="images/image06.png" width=250>

>$$P(q_{0:T}|\mathbf{x}_{1:T}) = \frac{1}{Z} \exp \left( \sum^T_{t=1} \left( \sum^D_{i=1} \lambda_i f_i (q_t,q_{t-1},\mathbf{x}_t) \right) \right)$$

>* Features similar to general MEMM / but normalized globally, not locally

* **Normalization Term**

>$$Z = \sum_{\mathbf{q} \in \mathbf{Q}_T} \exp \left( \sum^T_{t=1} \left( \sum^{D_t}_{i=1} \lambda_i^t f_i (q_t,q_{t-1}) + \sum^{D_a}_{i=1} \lambda^a_i f_i (q_t,\mathbf{x}_t) \right) \right)$$

>* Use equivalent of forward-backward algorithm
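
The forward-style computation of $\log Z$ can be sketched and compared against brute-force enumeration. In this sketch `trans_score[k][j]` stands for the weighted transition feature sum $\sum_i \lambda^t_i f_i(q_t = j, q_{t-1} = k)$ and `obs_score[t][j]` for the weighted acoustic feature sum at time $t$; the score values and the fixed start state $q_0$ are assumptions for illustration.

```python
import math
from itertools import product


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_log_z(trans_score, obs_score, start=0):
    """log Z via a forward recursion over time, O(T n^2)."""
    n = len(trans_score)
    alpha = [trans_score[start][j] + obs_score[0][j] for j in range(n)]
    for t in range(1, len(obs_score)):
        alpha = [
            log_sum_exp([alpha[k] + trans_score[k][j] for k in range(n)])
            + obs_score[t][j]
            for j in range(n)
        ]
    return log_sum_exp(alpha)


def sequence_score(path, trans_score, obs_score, start=0):
    """Unnormalized log score of one label sequence q_{1:T}."""
    prev, total = start, 0.0
    for t, q in enumerate(path):
        total += trans_score[prev][q] + obs_score[t][q]
        prev = q
    return total


trans_score = [[0.5, -0.2], [0.1, 0.3]]
obs_score = [[1.0, 0.0], [0.2, 0.7], [-0.5, 0.4]]

log_z = crf_log_z(trans_score, obs_score)
# Brute force: enumerate every label sequence, O(n^T).
log_z_brute = log_sum_exp([
    sequence_score(path, trans_score, obs_score)
    for path in product(range(2), repeat=3)
])
```

Subtracting `log_z` from `sequence_score` gives the globally normalized log probability of a label sequence.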

* **General Sequence CRF**

>* Undirected graph repeated at each time instance, giving a set of cliques

>$$P(q_{0:T}|\mathbf{x}_{1:T}) = \frac{1}{Z} \exp \left( \sum^T_{t=1} \sum_{\mathcal{C} \in \mathbf{C}} \boldsymbol{\lambda}^T_{\mathcal{C}} \mathbf{f}(\mathbf{q}_{\mathcal{C}t},\mathbf{x}_{1:T},t) \right)$$

>* $\boldsymbol{\lambda}^T_{\mathcal{C}}$: time-independent parameters associated with clique $\mathcal{C}$
>* $\mathbf{f}(\mathbf{q}_{\mathcal{C}t},\mathbf{x}_{1:T},t)$: time-dependent features extracted from clique $\mathcal{C}$ with time-dependent label sequence $\mathbf{q}_{\mathcal{C}t}$

* **Simple Example**

><img src="images/image07.png" width=250>

>\begin{align}
P(q_{0:T}|\mathbf{x}_{1:T}) &= \frac{1}{Z} \exp \left( \sum^T_{t=1} \sum_{\mathcal{C} \in \mathbf{C}} \boldsymbol{\lambda}^T_{\mathcal{C}} \mathbf{f} (\mathbf{q}_{\mathcal{C}t},\mathbf{x}_{1:T},t) \right) \\
&= \frac{1}{Z} \exp \left( \sum^T_{t=1} \left( \boldsymbol{\lambda}^{tT} \mathbf{f} (q_t,q_{t-1}) + \boldsymbol{\lambda}^{aT} \mathbf{f}(q_t,\mathbf{x}_t) \right) \right)
\end{align}

>* **Parameter Estimation** (fully observed, no need for EM)
>* $\hat{\boldsymbol{\lambda}} = \text{argmax}_{\boldsymbol{\lambda}} \{P(y_{1:T}|\mathbf{x}_{1:T},\boldsymbol{\lambda})\}$, where $y_{1:T}$ is the reference label sequence for $\mathbf{x}_{1:T}$
