# Undirected graphical models (Markov random fields)

## Introduction

Also known as: undirected graphical model (UGM), Markov random field (MRF), or Markov network.

![](../images/19.UGM.png)

### Conditional independence of UGMs
+ Global Markov property: $\mathbf { x } _ { A } \perp_G \mathbf { x } _ { B } \left| \mathbf { x } _ { C }\right.$ iff $C$ separates $A$ from $B$ in the graph $G$, e.g.: $1,2 \perp 6,7 | 3,4,5$
+ Local Markov property: $t \perp \mathcal { V } \backslash \operatorname { cl } ( t ) | \mathrm { mb } ( t )$, where $\mathrm { mb } ( t )$ is the Markov blanket of $t$ (its neighbors in the graph) and the closure is $\mathrm { cl } ( t ) \triangleq \mathrm { mb } ( t ) \cup \{ t \}$, e.g.: $1 \perp \text { rest } | 2,3$
+ Pairwise Markov property: $s \perp t \left| \mathcal { V } \backslash \{ s , t \} \Longleftrightarrow G _ { s t } = 0\right.$, e.g.: $1 \perp 7 | \mathrm { rest }$
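
The three properties can all be checked by graph separation. The sketch below is a minimal illustration in Python. The 7-node edge list is hypothetical (the figure's exact edges are not reproduced in these notes); it was chosen only so that the three example statements above hold. The helper name `is_separated` is also made up for this example.

```python
from collections import deque

def is_separated(adj, A, B, C):
    """True if every path from a node in A to a node in B passes through C."""
    blocked = set(C)
    frontier = deque(a for a in A if a not in blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in B:
            return False          # reached B without crossing C
        for nbr in adj[node]:
            if nbr not in blocked and nbr not in seen:
                seen.add(nbr)
                frontier.append(nbr)
    return True

# Hypothetical 7-node graph (an assumption, not taken from the figure).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (3, 5),
         (4, 5), (4, 6), (5, 6), (5, 7), (6, 7)]
adj = {v: set() for v in range(1, 8)}
for s, t in edges:
    adj[s].add(t)
    adj[t].add(s)

nodes = set(adj)
mb1 = adj[1]   # Markov blanket of node 1 = its neighbors in a UGM

print(is_separated(adj, {1, 2}, {6, 7}, {3, 4, 5}))     # global property
print(is_separated(adj, {1}, nodes - mb1 - {1}, mb1))   # local property
print(is_separated(adj, {1}, {7}, nodes - {1, 7}))      # pairwise property
```

Removing the conditioning set and running a breadth-first search is enough here because, in an undirected graph, conditional independence is plain separation (there is no explaining-away effect as in directed models).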

### Comparing directed and undirected graphical models:
![](../images/19.Compare.png)


## Parameterization of MRFs
Representing joint distribution for a UGM:

+ The Hammersley-Clifford theorem:

    A positive distribution $p(\mathbf{y}) > 0$ satisfies the CI properties of an undirected graph $G$ iff $p$ can be represented as a product of factors, one per maximal clique, i.e.,

    $$p ( \mathbf { y } | \boldsymbol { \theta } ) = \frac { 1 } { Z ( \boldsymbol { \theta } ) } \prod _ { c \in \mathcal { C } } \psi _ { c } \left( \mathbf { y } _ { c } | \boldsymbol { \theta } _ { c } \right)$$

    where $\psi _ { c } \left( \mathbf { y } _ { c } | \boldsymbol { \theta } _ { c } \right) \geq 0$ is the potential function (or factor) for clique $c$, $\mathcal{C}$ is the set of all the (maximal) cliques of $G$, and $Z(\boldsymbol{\theta})$ is the partition function, which ensures that the overall distribution sums to 1:

    $$Z ( \boldsymbol { \theta } ) \triangleq \sum _ { \mathbf { y } } \prod _ { c \in \mathcal { C } } \psi _ { c } \left( \mathbf { y } _ { c } | \boldsymbol { \theta } _ { c } \right)$$

    Gibbs distribution:
    $$p ( \mathbf { y } | \boldsymbol { \theta } ) = \frac { 1 } { Z ( \boldsymbol { \theta } ) } \exp \left( - \sum _ { c } E \left( \mathbf { y } _ { c } | \boldsymbol { \theta } _ { c } \right) \right)$$

    where $E(\mathbf{y}_c) > 0$ is the energy associated with the variables in clique $c$. This is a UGM whose potentials are $\psi _ { c } \left( \mathbf { y } _ { c } | \boldsymbol { \theta } _ { c } \right) = \exp \left( - E \left( \mathbf { y } _ { c } | \boldsymbol { \theta } _ { c } \right) \right)$.

    High-probability states correspond to low-energy configurations ==> energy-based models.

    ![](../images/10.PGM.png)

    $$p ( \mathbf { y } | \boldsymbol { \theta } ) = \frac { 1 } { Z ( \boldsymbol { \theta } ) } \psi _ { 123 } \left( y _ { 1 } , y _ { 2 } , y _ { 3 } \right) \psi _ { 234 } \left( y _ { 2 } , y _ { 3 } , y _ { 4 } \right) \psi _ { 35 } \left( y _ { 3 } , y _ { 5 } \right)$$

    where: $$Z = \sum _ { \mathbf { y } } \psi _ { 123 } \left( y _ { 1 } , y _ { 2 } , y _ { 3 } \right) \psi _ { 234 } \left( y _ { 2 } , y _ { 3 } , y _ { 4 } \right) \psi _ { 35 } \left( y _ { 3 } , y _ { 5 } \right)$$

    Pairwise MRF: restrict parameterization to the edges of the graph rather than the maximal cliques

    $$\begin{aligned} p ( \mathbf { y } | \boldsymbol { \theta } ) & \propto \psi _ { 12 } \left( y _ { 1 } , y _ { 2 } \right) \psi _ { 13 } \left( y _ { 1 } , y _ { 3 } \right) \psi _ { 23 } \left( y _ { 2 } , y _ { 3 } \right) \psi _ { 24 } \left( y _ { 2 } , y _ { 4 } \right) \psi _ { 34 } \left( y _ { 3 } , y _ { 4 } \right) \psi _ { 35 } \left( y _ { 3 } , y _ { 5 } \right) \\ & \propto \prod _ { s \sim t } \psi _ { s t } \left( y _ { s } , y _ { t } \right) \end{aligned}$$

+ Representing potential functions:
    log potentials: $$\log \psi _ { c } \left( \mathbf { y } _ { c } \right) \triangleq \boldsymbol { \phi } _ { c } \left( \mathbf { y } _ { c } \right) ^ { T } \boldsymbol { \theta } _ { c }$$

    where $\boldsymbol { \phi } _ { c } \left( \mathbf { y } _ { c } \right)$ is a feature vector derived from the values of the variables $y_c$

    ==> log probability (**Maximum entropy, log-linear model**): $$\log p ( \mathbf { y } | \boldsymbol { \theta } ) = \sum _ { c } \boldsymbol { \phi } _ { c } \left( \mathbf { y } _ { c } \right) ^ { T } \boldsymbol { \theta } _ { c } - \log Z ( \boldsymbol { \theta } )$$
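
To make the factorization concrete, here is a small brute-force sketch in Python for the 5-node example with maximal cliques {1,2,3}, {2,3,4}, {3,5}. Everything numeric is an assumption for illustration: the variables are taken to be binary, each clique gets a single hypothetical indicator feature ("all variables in the clique agree"), and the weights in `theta` are arbitrary. It computes $Z$ by enumeration and checks numerically that node 1 is independent of node 4 given nodes 2 and 3, as the graph separation implies.

```python
import itertools
import math

# Maximal cliques {1,2,3}, {2,3,4}, {3,5} of the example, written 0-based.
cliques = [(0, 1, 2), (1, 2, 3), (2, 4)]
theta = [0.8, -0.5, 1.2]   # hypothetical weights, one per clique

def log_potential(c, y):
    """log psi_c = theta_c * phi_c, with phi_c = 1 if all variables in clique c agree."""
    values = {y[i] for i in cliques[c]}
    return theta[c] * float(len(values) == 1)

def log_p_tilde(y):
    """Unnormalized log probability: sum of log potentials over cliques."""
    return sum(log_potential(c, y) for c in range(len(cliques)))

states = list(itertools.product([0, 1], repeat=5))
Z = sum(math.exp(log_p_tilde(y)) for y in states)        # partition function
p = {y: math.exp(log_p_tilde(y)) / Z for y in states}

def marginal(fixed):
    """Probability that y[i] == v for every (i, v) in fixed."""
    return sum(pr for y, pr in p.items()
               if all(y[i] == v for i, v in fixed.items()))

def ci_gap():
    """Largest violation of: node 1 independent of node 4 given nodes 2 and 3."""
    gap = 0.0
    for a, b, c, d in itertools.product([0, 1], repeat=4):
        joint = marginal({0: a, 1: b, 2: c, 3: d})
        given = marginal({1: b, 2: c})
        left = marginal({0: a, 1: b, 2: c})
        right = marginal({1: b, 2: c, 3: d})
        gap = max(gap, abs(joint * given - left * right))
    return gap

print(sum(p.values()), ci_gap())
```

Enumeration costs $2^5$ terms here; the same loop is exponential in the number of nodes, which is why $Z$ is the bottleneck for larger models.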

## Examples of MRFs:

### Ising model:
+ Pairwise clique potential: 
$$\psi _ { s t } \left( y _ { s } , y _ { t } \right) = \left( \begin{array} { c c } { e ^ { w _ { s t } } } & { e ^ { - w _ { s t } } } \\ { e ^ { - w _ { s t } } } & { e ^ { w _ { s t } } } \end{array} \right)$$

$w_{st}$ is the coupling strength between nodes $s$ and $t$. We assume all edges have the same strength: $w_{st} = J$. 

+ Assuming $y_t \in \{-1, +1\}$, the unnormalized log probability is:

$$\log \tilde { p } ( \mathbf { y } ) = \sum _ { s \sim t } w _ { s t } y _ { s } y _ { t } + \sum _ { s } b _ { s } y _ { s } = \frac { 1 } { 2 } \mathbf { y } ^ { T } \mathbf { W } \mathbf { y } + \mathbf { b } ^ { T } \mathbf { y }$$

where $\theta = (W,b)$ and $D$ is the number of nodes. Computing the normalizing term $Z$ requires summing over all $2^D$ bit vectors, which is NP-hard in general.
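
As a sanity check on the two forms of the unnormalized log probability, the following Python sketch evaluates both on a tiny model and computes $Z$ by brute force. The 2x2 grid, the coupling `J = 0.5`, and the biases are hypothetical values chosen for illustration.

```python
import itertools
import math

J = 0.5                                   # hypothetical coupling strength
edges = [(0, 1), (1, 3), (3, 2), (2, 0)]  # hypothetical 2x2 grid, D = 4 nodes
b = [0.1, 0.0, -0.2, 0.0]                 # hypothetical bias terms
D = len(b)

# Symmetric weight matrix with zero diagonal: W[s][t] = w_st if s ~ t, else 0.
W = [[0.0] * D for _ in range(D)]
for s, t in edges:
    W[s][t] = W[t][s] = J

def log_p_tilde_edges(y):
    """Sum over edges of w_st * y_s * y_t, plus the bias term."""
    return (sum(W[s][t] * y[s] * y[t] for s, t in edges)
            + sum(bs * ys for bs, ys in zip(b, y)))

def log_p_tilde_matrix(y):
    """Same quantity as (1/2) y^T W y + b^T y; the 1/2 undoes double counting."""
    quad = sum(y[s] * W[s][t] * y[t] for s in range(D) for t in range(D))
    return 0.5 * quad + sum(bs * ys for bs, ys in zip(b, y))

states = list(itertools.product([-1, 1], repeat=D))       # all 2^D configurations
Z = sum(math.exp(log_p_tilde_edges(y)) for y in states)   # brute-force partition function
p = {y: math.exp(log_p_tilde_edges(y)) / Z for y in states}
```

The factor $\frac{1}{2}$ appears because the quadratic form visits every edge twice, once as $(s,t)$ and once as $(t,s)$.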

### Hopfield network:
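+ A Hopfield network is a fully connected Ising model with a symmetric weight matrix; it can be viewed as a recurrent neural network.
+ Used for associative memory or pattern completion.
+ Inference uses iterative conditional modes (ICM): set each node in turn to its most likely (lowest-energy) state, given all its neighbors. The full conditional is: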

$$p \left( y _ { s } = 1 | \mathbf { y } _ { - s } , \boldsymbol { \theta } \right) = \operatorname { sigm } \left( \mathbf { w } _ { s,: } ^ { T } \boldsymbol { y }_{-s}  + b _ { s } \right)$$

+ Boltzmann machine: generalizes the Hopfield/Ising model by including some hidden nodes, which makes the model representationally more powerful
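
The ICM update for a Hopfield network can be sketched as follows. This is a minimal illustration, not a reference implementation: the stored pattern is hypothetical, and setting the weights by the Hebbian outer-product rule is an assumption that these notes do not cover. For $y_s \in \{-1, +1\}$, picking the most likely state of node $s$ amounts to taking the sign of its local field $\sum_t w_{st} y_t + b_s$.

```python
def icm(W, b, y, max_sweeps=10):
    """Iterative conditional modes for y_s in {-1, +1}: coordinate-wise energy descent."""
    y = list(y)
    for _ in range(max_sweeps):
        changed = False
        for s in range(len(y)):
            # Local field: node s contributes y_s * field to the unnormalized log probability.
            field = sum(W[s][t] * y[t] for t in range(len(y)) if t != s) + b[s]
            best = 1 if field >= 0 else -1
            if best != y[s]:
                y[s] = best
                changed = True
        if not changed:       # no node wants to flip: a local optimum was reached
            break
    return y

# Hypothetical stored pattern; Hebbian outer-product weights are an assumption.
pattern = [1, -1, 1, 1, -1, -1, 1, -1]
D = len(pattern)
W = [[pattern[s] * pattern[t] if s != t else 0 for t in range(D)] for s in range(D)]
b = [0.0] * D

noisy = list(pattern)
noisy[0] = -noisy[0]          # corrupt two entries
noisy[5] = -noisy[5]
recovered = icm(W, b, noisy)
print(recovered == pattern)
```

ICM only finds a local optimum of the energy, so with several stored patterns or heavier corruption it may settle on a different state than the intended one.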

### Potts model:
Generalizes the Ising model to multiple discrete states $y _ { t } \in \{ 1,2 , \dots , K \}$.

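+ Potential function (shown for $K = 3$; off-diagonal entries are $e^0 = 1$, so disagreeing neighbors are less likely when $J > 0$ but not impossible):
$$\psi _ { s t } \left( y _ { s } , y _ { t } \right) = \left( \begin{array} { c c c } { e ^ { J } } & { e ^ { 0 } } & { e ^ { 0 } } \\ { e ^ { 0 } } & { e ^ { J } } & { e ^ { 0 } } \\ { e ^ { 0 } } & { e ^ { 0 } } & { e ^ { J } } \end{array} \right)$$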

+ Used as a prior for image segmentation: neighboring pixels are likely to have the same discrete label and hence belong to the same segment.
$$p ( \mathbf { y } , \mathbf { x } | \boldsymbol { \theta } ) = p ( \mathbf { y } | J ) \prod _ { t } p \left( x _ { t } | y _ { t } , \boldsymbol { \theta } \right) = \left[ \frac { 1 } { Z ( J ) } \prod _ { s \sim t } \psi \left( y _ { s } , y _ { t } ; J \right) \right] \prod _ { t } p \left( x _ { t } | y _ { t } , \boldsymbol { \theta } \right)$$

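+ The resulting model is a chain graph, i.e. a combination of an undirected and a directed graph: undirected edges connect neighboring labels, and directed edges carry the local evidence $p(x_t|y_t, \theta)$.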

![](../images/19.Potts.png)
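
The smoothing effect of the Potts prior can be seen on a toy problem. The Python sketch below makes several assumptions that are not in the notes: the "image" is a 1D chain of 7 pixels, the local evidence $p(x_t|y_t)$ is Gaussian with hypothetical class means 0 and 1, and the MAP labeling is found by brute force. The pairwise log potential is $J$ when labels agree and $0$ (potential $e^0 = 1$) otherwise.

```python
import itertools
import math

def potts_log_potential(ys, yt, J):
    """log psi(ys, yt; J): J if the labels agree, 0 otherwise."""
    return J if ys == yt else 0.0

def log_gaussian(x, mean, sigma):
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_joint(y, x, J, means, sigma):
    """Unnormalized log p(y, x): Potts prior on a 1D chain plus local evidence."""
    prior = sum(potts_log_potential(y[t], y[t + 1], J) for t in range(len(y) - 1))
    evidence = sum(log_gaussian(x[t], means[y[t]], sigma) for t in range(len(y)))
    return prior + evidence

def map_labeling(x, J, means=(0.0, 1.0), sigma=0.5):
    """Brute-force MAP over all K^T labelings (only feasible for tiny chains)."""
    labelings = itertools.product(range(len(means)), repeat=len(x))
    return max(labelings, key=lambda y: log_joint(y, x, J, means, sigma))

x = [0.0, 0.1, 0.7, 0.1, 1.0, 0.9, 1.1]   # hypothetical noisy 1D "image"
print(map_labeling(x, J=0.0))   # labels chosen independently per pixel
print(map_labeling(x, J=1.0))   # neighboring labels encouraged to agree
```

With `J = 0` each pixel is labeled by its own evidence, so the outlier at `x[2]` gets its own segment; with `J = 1` the prior outweighs that pixel's evidence and the labeling becomes two contiguous segments.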

### Gaussian MRFs:

$$\begin{aligned} p ( \mathbf { y } | \boldsymbol { \theta } ) & \propto \prod _ { s \sim t } \psi _ { s t } \left( y _ { s } , y _ { t } \right) \prod _ { t } \psi _ { t } \left( y _ { t } \right) \\ \psi _ { s t } \left( y _ { s } , y _ { t } \right) & = \exp \left( - \frac { 1 } { 2 } y _ { s } \Lambda _ { s t } y _ { t } \right) \\ \psi _ { t } \left( y _ { t } \right) & = \exp \left( - \frac { 1 } { 2 } \Lambda _ { t t } y _ { t } ^ { 2 } + \eta _ { t } y _ { t } \right) \end{aligned}$$
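
Multiplying these potentials gives the information (canonical) form of a multivariate Gaussian. The short derivation below is a sketch under two assumptions: the product over $s \sim t$ is read as ranging over ordered pairs, so each edge is counted in both directions, and $\boldsymbol{\Lambda}$ is symmetric positive definite with $\Lambda_{st} = 0$ whenever $s$ and $t$ are not neighbors.

$$\begin{aligned} \log p ( \mathbf { y } | \boldsymbol { \theta } ) & = \text{const} - \frac { 1 } { 2 } \sum _ { s \neq t } y _ { s } \Lambda _ { s t } y _ { t } - \frac { 1 } { 2 } \sum _ { t } \Lambda _ { t t } y _ { t } ^ { 2 } + \sum _ { t } \eta _ { t } y _ { t } \\ & = \text{const} + \boldsymbol { \eta } ^ { T } \mathbf { y } - \frac { 1 } { 2 } \mathbf { y } ^ { T } \boldsymbol { \Lambda } \mathbf { y } \\ \Longrightarrow \quad p ( \mathbf { y } | \boldsymbol { \theta } ) & = \mathcal { N } \left( \mathbf { y } | \boldsymbol { \Lambda } ^ { - 1 } \boldsymbol { \eta } , \boldsymbol { \Lambda } ^ { - 1 } \right) \end{aligned}$$

Under this reading, a zero entry $\Lambda_{st} = 0$ corresponds to a missing edge between $s$ and $t$, which is the pairwise Markov property for Gaussians.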

## Learning
+ Training MaxEnt models using gradient methods:
    $$p ( \mathbf { y } | \boldsymbol { \theta } ) = \frac { 1 } { Z ( \boldsymbol { \theta } ) } \exp \left( \sum _ { c } \boldsymbol { \theta } _ { c } ^ { T } \boldsymbol { \phi } _ { c } ( \mathbf { y } ) \right)$$

    Scaled log-likelihood: 
    $$\ell ( \boldsymbol { \theta } ) \triangleq \frac { 1 } { N } \sum _ { i } \log p \left( \mathbf { y } _ { i } | \boldsymbol { \theta } \right) = \frac { 1 } { N } \sum _ { i } \left[ \sum _ { c } \boldsymbol { \theta } _ { c } ^ { T } \boldsymbol { \phi } _ { c } \left( \mathbf { y } _ { i } \right) - \log Z ( \boldsymbol { \theta } ) \right]$$

    so we have $$\frac { \partial \ell } { \partial \boldsymbol { \theta } _ { c } } = \left[ \frac { 1 } { N } \sum _ { i } \boldsymbol { \phi } _ { c } \left( \mathbf { y } _ { i } \right) \right] - \mathbb { E } \left[ \boldsymbol { \phi } _ { c } ( \mathbf { y } ) \right] = \mathbb { E } _ { p _ { \mathrm { emp } } } \left[ \boldsymbol { \phi } _ { c } ( \mathbf { y } ) \right] - \mathbb { E } _ { p ( \cdot | \boldsymbol { \theta } ) } \left[ \boldsymbol { \phi } _ { c } ( \mathbf { y } ) \right]$$

    In the first term, $\mathbf{y}$ is fixed to its observed values; this is the **clamped term**. In the second term, $\mathbf{y}$ is free; this is the **unclamped term** or contrastive term. Note that **computing the unclamped term requires inference in the model, and this must be done once per gradient step**. That makes UGM training much slower than DGM training.

    The gradient of the log likelihood is thus the expected feature vector according to the empirical distribution minus the model's expectation of the feature vector. At the optimum the gradient is zero, so we have:
    $$\mathbb { E } _ { p _ { \mathrm { emp } } } \left[ \boldsymbol { \phi } _ { c } ( \mathbf { y } ) \right] = \mathbb { E } _ { p ( \cdot | \boldsymbol { \theta } ) } \left[ \boldsymbol { \phi } _ { c } ( \mathbf { y } ) \right]$$

    ==> Moment matching
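
Moment matching can be observed directly on a toy model. The Python sketch below is an illustration under stated assumptions: three binary variables on a chain, one hypothetical "neighbors agree" feature per edge, a made-up data set, and plain gradient ascent with a fixed step size. The model expectation is computed by enumerating all states, which stands in for the inference step needed at every gradient step.

```python
import itertools
import math

edges = [(0, 1), (1, 2)]                          # hypothetical 3-node chain
states = list(itertools.product([0, 1], repeat=3))

def features(y):
    """phi_c(y): one indicator per edge, 1 if the two endpoints agree."""
    return [float(y[s] == y[t]) for s, t in edges]

def empirical_expectation(data):
    """Clamped term: average feature vector over the observed data."""
    return [sum(features(y)[k] for y in data) / len(data) for k in range(len(edges))]

def model_expectation(theta):
    """Unclamped term: E[phi(y)] under p(y | theta), by brute-force enumeration."""
    weights = [math.exp(sum(th * f for th, f in zip(theta, features(y)))) for y in states]
    Z = sum(weights)
    return [sum(w * features(y)[k] for w, y in zip(weights, states)) / Z
            for k in range(len(edges))]

def fit(data, lr=1.0, steps=500):
    """Gradient ascent on the scaled log-likelihood."""
    theta = [0.0] * len(edges)
    emp = empirical_expectation(data)
    for _ in range(steps):
        mod = model_expectation(theta)             # inference once per gradient step
        theta = [th + lr * (e - m) for th, e, m in zip(theta, emp, mod)]
    return theta

# Hypothetical observations.
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0), (0, 1, 1), (0, 0, 0)]
theta_hat = fit(data)
print(empirical_expectation(data), model_expectation(theta_hat))
```

After fitting, the two printed vectors should agree to numerical precision, which is the moment-matching condition.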

+ Training partially observed MaxEnt models (with hidden variables)
    $$p ( \mathbf { y } , \mathbf { h } | \boldsymbol { \theta } ) = \frac { 1 } { Z ( \boldsymbol { \theta } ) } \exp \left( \sum _ { c } \boldsymbol { \theta } _ { c } ^ { T } \boldsymbol { \phi } _ { c } ( \mathbf { h } , \mathbf { y } ) \right)$$
    
    Log likelihood:
    $$\ell ( \boldsymbol { \theta } ) = \frac { 1 } { N } \sum _ { i } \log \left( \sum _ { \mathbf { h } _ { i } } p \left( \mathbf { y } _ { i } , \mathbf { h } _ { i } | \boldsymbol { \theta } \right) \right) = \frac { 1 } { N } \sum _ { i } \log \left( \frac { 1 } { Z ( \boldsymbol { \theta } ) } \sum _ { \mathbf { h } _ { i } } \tilde { p } \left( \mathbf { y } _ { i } , \mathbf { h } _ { i } | \boldsymbol { \theta } \right) \right)$$
    
    where unnormalized distribution: 
    $$\tilde { p } ( \mathbf { y } , \mathbf { h } | \boldsymbol { \theta } ) \triangleq \exp \left( \sum _ { c } \boldsymbol { \theta } _ { c } ^ { T } \boldsymbol { \phi } _ { c } ( \mathbf { h } , \mathbf { y } ) \right)$$
    
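    Therefore the gradient is again a difference of two expectations. In the first, $\mathbf{y}$ is clamped to $\mathbf{y}_i$ and we average over $\mathbf{h}$; in the second, we average over both $\mathbf{h}$ and $\mathbf{y}$: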
    $$\frac { \partial \ell } { \partial \boldsymbol { \theta } _ { c } } = \frac { 1 } { N } \sum _ { i } \left\{ \mathbb { E } \left[ \boldsymbol { \phi } _ { c } \left( \mathbf { h } , \mathbf { y } _ { i } \right) | \boldsymbol { \theta } \right] - \mathbb { E } \left[ \boldsymbol { \phi } _ { c } ( \mathbf { h } , \mathbf { y } ) | \boldsymbol { \theta } \right] \right\}$$
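
The hidden-variable gradient can be checked numerically against finite differences. The Python sketch below uses a deliberately tiny, hypothetical model (one binary hidden variable, two binary visible variables, two made-up agreement features) so that every expectation can be computed by enumeration. None of the names or numbers come from the notes.

```python
import itertools
import math

H = [0, 1]                                        # one binary hidden variable
Y = list(itertools.product([0, 1], repeat=2))     # two binary visible variables

def phi(h, y):
    """Hypothetical features: does h agree with y[0]? with y[1]?"""
    return [float(h == y[0]), float(h == y[1])]

def p_tilde(h, y, theta):
    return math.exp(sum(t * f for t, f in zip(theta, phi(h, y))))

def log_lik(theta, data):
    """Scaled log-likelihood of the visible data, with h summed out."""
    Z = sum(p_tilde(h, y, theta) for h in H for y in Y)
    return sum(math.log(sum(p_tilde(h, y, theta) for h in H) / Z)
               for y in data) / len(data)

def expected_phi(pairs, theta):
    """E[phi(h, y)] over the listed (h, y) pairs, weighted by p_tilde."""
    weights = [p_tilde(h, y, theta) for h, y in pairs]
    total = sum(weights)
    return [sum(w * phi(h, y)[k] for w, (h, y) in zip(weights, pairs)) / total
            for k in range(len(theta))]

def gradient(theta, data):
    free = expected_phi([(h, y) for h in H for y in Y], theta)   # unclamped term
    grad = [0.0] * len(theta)
    for y_i in data:
        clamped = expected_phi([(h, y_i) for h in H], theta)     # y fixed to y_i
        for k in range(len(theta)):
            grad[k] += (clamped[k] - free[k]) / len(data)
    return grad

def numeric_gradient(theta, data, eps=1e-5):
    """Central finite differences of log_lik, used only as a check."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((log_lik(up, data) - log_lik(down, data)) / (2 * eps))
    return grad

data = [(0, 0), (1, 1), (0, 1), (1, 1)]    # hypothetical observations
theta = [0.3, -0.7]
print(gradient(theta, data), numeric_gradient(theta, data))
```

Both expectations need inference here: the clamped term requires the posterior over $\mathbf{h}$ for each data case, and the unclamped term requires the full joint.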

## Conditional Random Field (CRFs)

### Introduction
Also known as a discriminative random field. A CRF is a version of an MRF in which all clique potentials are conditioned on input features. It is a structured-output extension of logistic regression.

$$p ( \mathbf { y } | \mathbf { x } , \mathbf { w } ) = \frac { 1 } { Z ( \mathbf { x } , \mathbf { w } ) } \prod _ { c } \psi _ { c } \left( \mathbf { y } _ { c } | \mathbf { x } , \mathbf { w } \right)$$

where: $\psi _ { c } \left( \mathbf { y } _ { c } | \mathbf { x } , \mathbf { w } \right) = \exp \left( \mathbf { w } _ { c } ^ { T } \boldsymbol { \phi } \left( \mathbf { x } , \mathbf { y } _ { c } \right) \right)$, and $\boldsymbol { \phi } \left( \mathbf { x } , \mathbf { y } _ { c } \right)$ is a feature vector derived from the global inputs $\mathbf{x}$ and the local set of labels $\mathbf{y}_c$.

+ The advantage of a CRF over an MRF is analogous to the advantage of a discriminative classifier over a generative classifier:
    + We don't waste resources modeling things that we always observe. Instead, we focus our attention on modeling what we care about, namely the distribution of labels given the data.
    + The potentials of the model can be made data-dependent.

+ Disadvantages: CRFs require labeled training data, and they are slower to train.

### Chain-structured CRFs, MEMMs:
![](../images/19.CRF.png)
+ Hidden Markov Models (HMMs):
$$p ( \mathbf { x } , \mathbf { y } | \mathbf { w } ) = \prod _ { t = 1 } ^ { T } p \left( y _ { t } | y _ { t - 1 } , \mathbf { w } \right) p \left( \mathbf { x } _ { t } | y _ { t } , \mathbf { w } \right)$$

+ Maximum Entropy Markov Models (MEMMs):
$$p ( \mathbf { y } | \mathbf { x } , \mathbf { w } ) = \prod _ { t } p \left( y _ { t } | y _ { t - 1 } , \mathbf { x } , \mathbf { w } \right)$$

+ Chain-structured CRFs:
$$p ( \mathbf { y } | \mathbf { x } , \mathbf { w } ) = \frac { 1 } { Z ( \mathbf { x } , \mathbf { w } ) } \prod _ { t = 1 } ^ { T } \psi \left( y _ { t } | \mathbf { x } , \mathbf { w } \right) \prod _ { t = 1 } ^ { T - 1 } \psi \left( y _ { t } , y _ { t + 1 } | \mathbf { x } , \mathbf { w } \right)$$
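
For a chain-structured CRF, $Z(\mathbf{x}, \mathbf{w})$ can be computed exactly with a forward recursion instead of summing over all label sequences. The Python sketch below is a minimal illustration under assumptions: the log potentials are hypothetical numbers (in a real CRF they would be computed from $\mathbf{x}$ and $\mathbf{w}$), and the edge table is assumed to be shared across positions. A brute-force sum is included only to check the recursion.

```python
import itertools
import math

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def score(y, node, edge):
    """Sum of log node potentials and log edge potentials along the chain."""
    total = sum(node[t][y[t]] for t in range(len(y)))
    total += sum(edge[y[t]][y[t + 1]] for t in range(len(y) - 1))
    return total

def log_partition(node, edge):
    """log Z by the forward recursion: O(T K^2) instead of O(K^T)."""
    K = len(edge)
    alpha = list(node[0])
    for t in range(1, len(node)):
        alpha = [node[t][k] + logsumexp([alpha[j] + edge[j][k] for j in range(K)])
                 for k in range(K)]
    return logsumexp(alpha)

def log_partition_brute(node, edge):
    """log Z by enumerating every label sequence (check only)."""
    K = len(edge)
    return logsumexp([score(y, node, edge)
                      for y in itertools.product(range(K), repeat=len(node))])

# Hypothetical log potentials for T = 4 positions and K = 3 labels.
node = [[0.2, -1.0, 0.5], [1.5, 0.0, -0.3], [-0.7, 0.9, 0.1], [0.0, 0.4, -1.2]]
edge = [[1.0, -0.5, 0.0], [-0.5, 1.0, 0.2], [0.0, 0.2, 1.0]]

log_Z = log_partition(node, edge)
y = (2, 0, 1, 1)
log_prob = score(y, node, edge) - log_Z    # log p(y | x, w) for one labeling
print(log_Z, log_partition_brute(node, edge), log_prob)
```

Unlike the MEMM, which normalizes each factor locally, the CRF has a single global normalizer per input sequence, so this recursion has to be rerun for every $\mathbf{x}_i$.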

### Training:
+ log-likelihood:
$$\ell ( \mathbf { w } ) \triangleq \frac { 1 } { N } \sum _ { i } \log p \left( \mathbf { y } _ { i } | \mathbf { x } _ { i } , \mathbf { w } \right) = \frac { 1 } { N } \sum _ { i } \left[ \sum _ { c } \mathbf { w } _ { c } ^ { T } \boldsymbol { \phi } _ { c } \left( \mathbf { y } _ { i } , \mathbf { x } _ { i } \right) - \log Z \left( \mathbf { w } , \mathbf { x } _ { i } \right) \right]$$

+ Gradient:
$$\begin{aligned} \frac { \partial \ell } { \partial \mathbf { w } _ { c } } & = \frac { 1 } { N } \sum _ { i } \left[ \boldsymbol { \phi } _ { c } \left( \mathbf { y } _ { i } , \mathbf { x } _ { i } \right) - \frac { \partial } { \partial \mathbf { w } _ { c } } \log Z \left( \mathbf { w } , \mathbf { x } _ { i } \right) \right] \\ & = \frac { 1 } { N } \sum _ { i } \left[ \boldsymbol { \phi } _ { c } \left( \mathbf { y } _ { i } , \mathbf { x } _ { i } \right) - \mathbb { E } \left[ \boldsymbol { \phi } _ { c } \left( \mathbf { y } , \mathbf { x } _ { i } \right) \right] \right] \end{aligned}$$

+ Prevent overfitting by using a Gaussian prior on the weights ($\ell_2$ regularization):
$$\ell ^ { \prime } ( \mathbf { w } ) \triangleq \frac { 1 } { N } \sum _ { i } \log p \left( \mathbf { y } _ { i } | \mathbf { x } _ { i } , \mathbf { w } \right) - \lambda \| \mathbf { w } \| _ { 2 } ^ { 2 }$$

or use $\ell_1$ regularization on the edge weights $\mathbf{w}_e$ to learn a sparse graph structure, and $\ell_2$ regularization on the node weights $\mathbf{w}_n$:
$$\ell ^ { \prime } ( \mathbf { w } ) \triangleq \frac { 1 } { N } \sum _ { i } \log p \left( \mathbf { y } _ { i } | \mathbf { x } _ { i } , \mathbf { w } \right) - \lambda _ { 1 } \left\| \mathbf { w } _ { e } \right\| _ { 1 } - \lambda _ { 2 } \left\| \mathbf { w } _ { n } \right\| _ { 2 } ^ { 2 }$$
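
The step in the gradient that replaces $\frac{\partial}{\partial \mathbf{w}_c} \log Z$ by an expectation can be spelled out as follows. This is a short sketch that uses only the definitions above; the index $c'$ is introduced here to keep the sum over cliques separate from the clique $c$ being differentiated.

$$\begin{aligned} \frac { \partial } { \partial \mathbf { w } _ { c } } \log Z \left( \mathbf { w } , \mathbf { x } _ { i } \right) & = \frac { 1 } { Z \left( \mathbf { w } , \mathbf { x } _ { i } \right) } \sum _ { \mathbf { y } } \frac { \partial } { \partial \mathbf { w } _ { c } } \exp \left( \sum _ { c ^ { \prime } } \mathbf { w } _ { c ^ { \prime } } ^ { T } \boldsymbol { \phi } _ { c ^ { \prime } } \left( \mathbf { y } , \mathbf { x } _ { i } \right) \right) \\ & = \sum _ { \mathbf { y } } \frac { \exp \left( \sum _ { c ^ { \prime } } \mathbf { w } _ { c ^ { \prime } } ^ { T } \boldsymbol { \phi } _ { c ^ { \prime } } \left( \mathbf { y } , \mathbf { x } _ { i } \right) \right) } { Z \left( \mathbf { w } , \mathbf { x } _ { i } \right) } \boldsymbol { \phi } _ { c } \left( \mathbf { y } , \mathbf { x } _ { i } \right) \\ & = \sum _ { \mathbf { y } } p \left( \mathbf { y } | \mathbf { x } _ { i } , \mathbf { w } \right) \boldsymbol { \phi } _ { c } \left( \mathbf { y } , \mathbf { x } _ { i } \right) = \mathbb { E } _ { p \left( \mathbf { y } | \mathbf { x } _ { i } , \mathbf { w } \right) } \left[ \boldsymbol { \phi } _ { c } \left( \mathbf { y } , \mathbf { x } _ { i } \right) \right] \end{aligned}$$

With the $\ell_2$ penalty, the gradient simply gains one extra term:

$$\frac { \partial \ell ^ { \prime } } { \partial \mathbf { w } _ { c } } = \frac { \partial \ell } { \partial \mathbf { w } _ { c } } - 2 \lambda \mathbf { w } _ { c }$$

Note that the expectation depends on $\mathbf{x}_i$, so inference has to be run separately for every training case at every gradient step.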



