# Markov Chains and Markov Processes — Comprehensive Reference

**Scope:** Discrete- and continuous-time Markov models: mathematical intuition, key theorems, derivations, worked examples, and applications (finance, NLP, queues, physics). Displayed equations use `$$`; inline mathematics uses `$`.

---

## Table of Contents

1. [Core intuition: the Markov property](#core-intuition-the-markov-property)
2. [Discrete-Time Markov Chain (DTMC): formal definition](#discrete-time-markov-chain-dtmc-formal-definition)
3. [Evolution, Chapman–Kolmogorov, and k-step transition probabilities](#evolution-chapmankolmogorov-and-k-step-transition-probabilities)
4. [Classification of states: communication, recurrence, transience, absorbing states](#classification-of-states-communication-recurrence-transience-absorbing-states)
5. [Stationary (invariant) distributions, limiting distributions & ergodicity](#stationary-invariant-distributions-limiting-distributions--ergodicity)
6. [Reversibility and detailed balance; MCMC insight](#reversibility-and-detailed-balance-mcmc-insight)
7. [Spectral view, convergence rates and mixing times](#spectral-view-convergence-rates-and-mixing-times)
8. [Absorbing chains and the fundamental matrix](#absorbing-chains-and-the-fundamental-matrix)
9. [Hitting times, mean recurrence times, and linear systems](#hitting-times-mean-recurrence-times-and-linear-systems)
10. [Continuous-Time Markov Chains (CTMC): generator, Kolmogorov equations, matrix exponential](#continuous-time-markov-chains-ctmc-generator-kolmogorov-equations-matrix-exponential)
11. [Birth–Death processes and queueing example (M/M/1)](#birthdeath-processes-and-queueing-example-mm1)
12. [Markov processes on continuous state spaces (diffusions)](#markov-processes-on-continuous-state-spaces-diffusions)
13. [Strong Markov property and stopping times](#strong-markov-property-and-stopping-times)
14. [Examples: random walk, PageRank, credit migration, HMM (NLP)](#examples-random-walk-pagerank-credit-migration-hmm-nlp)
15. [Key theorems: statements and sketches](#key-theorems--statements-and-sketches)
16. [Practical implementation notes and modeling tips](#practical-implementation-notes-and-modeling-tips)
17. [References & further reading](#references--further-reading)

---

## 1. Core intuition — *the Markov property*

A stochastic process $\{X_t\}$ (index $t$ may be discrete or continuous) is said to have the **Markov property** if the future distribution depends only on the present state, not on the full past. Formally, for discrete-time $t\in\{0,1,2,\dots\}$:

$$
\Pr\bigl(X_{t+1}=j \,\big|\, X_t=i, X_{t-1}=i_{t-1},\dots,X_0=i_0\bigr)
= \Pr\bigl(X_{t+1}=j \,\big|\, X_t=i\bigr).
$$

This is the **memoryless** (one-step) property. It is the defining simplification: to specify dynamics, it suffices to give the transition law from the current state.

---

## 2. Discrete-Time Markov Chain (DTMC): formal definition

Let the state space $S$ be finite or countable, for example $S=\{1,2,\dots,n\}$ or $S=\mathbb{Z}$. A (time-homogeneous) discrete-time Markov chain is specified by a **transition matrix** $P=[p_{ij}]$ that does not depend on $t$, where

$$
p_{ij} = \Pr\bigl(X_{t+1}=j \mid X_t=i\bigr),
$$

and for each $i$:

$$
\sum_{j\in S} p_{ij} = 1,\qquad p_{ij}\ge 0.
$$

If the initial distribution is the row vector $\pi^{(0)}$, then the distribution at time $t$ is

$$
\pi^{(t)} = \pi^{(0)} P^t.
$$

(Here distributions are row vectors and $P$ multiplies them from the right; other authors use column vectors and the transposed matrix. Whichever convention you choose, be consistent.)
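The evolution rule $\pi^{(t)} = \pi^{(0)} P^t$ can be sketched in a few lines of plain Python. The 3-state matrix and the helper names `step` and `evolve` are illustrative assumptions, not part of any library.

```python
def step(dist, P):
    """One step of the chain: return the row vector dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def evolve(dist, P, t):
    """Distribution after t steps (dist @ P^t), computed by repeated stepping."""
    for _ in range(t):
        dist = step(dist, P)
    return dist


# Hypothetical 3-state chain; every row sums to 1.
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]
pi0 = [1.0, 0.0, 0.0]  # start in state 0 with certainty
pi2 = evolve(pi0, P, 2)  # distribution after two steps
```

Stepping the row vector costs $O(n^2)$ per step and avoids forming $P^t$ explicitly, which is usually preferable when only one initial distribution is of interest.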

---

## 3. Evolution, Chapman–Kolmogorov, and k-step transition probabilities

The **k-step transition probabilities** are defined by

$$
p_{ij}^{(k)} = \Pr\bigl(X_{t+k}=j \mid X_t=i\bigr).
$$

Writing $P^{(k)}=[p_{ij}^{(k)}]$ for the matrix of $k$-step transition probabilities, the Chapman–Kolmogorov equations (composition law) state:

$$
P^{(m+n)} = P^{(m)}P^{(n)},
$$

or entrywise:

$$
p_{ij}^{(m+n)} = \sum_{k\in S} p_{ik}^{(m)}\, p_{kj}^{(n)}.
$$

Thus $P^k$ gives the k-step transition matrix. This leads directly to closed-form expressions for many finite chains (via matrix powers or diagonalization when possible).
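A small numerical check of the composition law, as a sketch: exact rational arithmetic (`fractions.Fraction`) avoids rounding, and the 2-state matrix and the helpers `matmul` and `matpow` are assumptions made for illustration.

```python
from fractions import Fraction


def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matpow(P, k):
    """P^k for an integer k >= 0, by repeated squaring."""
    n = len(P)
    result = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    base = P
    while k > 0:
        if k & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        k >>= 1
    return result


# Hypothetical 2-state chain with exact rational entries.
P = [[Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 4), Fraction(3, 4)]]

two_step = matpow(P, 2)
# Chapman-Kolmogorov with m = 2, n = 3: P^(5) = P^(2) P^(3).
ck_holds = matpow(P, 5) == matmul(matpow(P, 2), matpow(P, 3))
```

Repeated squaring needs only $O(\log k)$ matrix products, which matters for long horizons.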

---

## 4. Classification of states: communication, recurrence, transience, absorbing states

**Accessibility and communication.** State $j$ is *accessible* from $i$ (write $i\to j$) if for some $k\ge 0$:

$$
p_{ij}^{(k)} > 0.
$$

If $i\to j$ and $j\to i$, the two states **communicate** (denoted $i\leftrightarrow j$). Communication is an equivalence relation; classes partition the state space.

**Irreducibility.** The chain is *irreducible* if all states communicate (single communicating class).

**Period.** The period of state $i$ is

$$
d(i) = \gcd\{k\ge 1: p_{ii}^{(k)} > 0\}.
$$

If $d(i)=1$ the state is aperiodic. In an irreducible chain all states have the same period.

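**Recurrence and transience (for countable chains).** A state $i$ is recurrent if, starting from $i$, the chain returns to $i$ with probability 1, and transient otherwise. Writing $f_i=\Pr(\text{return to }i\,|\,X_0=i)$, state $i$ is recurrent iff $f_i=1$ and transient iff $f_i<1$; equivalently, $i$ is recurrent iff $\sum_{k\ge 1} p_{ii}^{(k)}=\infty$. A recurrent state is *positive recurrent* if its mean return time is finite and *null recurrent* otherwise.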

**Absorbing states** are those with $p_{ii}=1$ (once entered, the chain stays there). These are recurrent.
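For a finite chain these classifications depend only on which entries of $P$ are positive, so they can be computed by graph search. The sketch below uses assumed helper names and a hypothetical 4-state matrix. The period is taken as the gcd of return times up to $3n$ steps, which is enough for a chain with $n$ states; a return value of 0 means the state is never revisited.

```python
from math import gcd


def reachable(P, i):
    """States j with i -> j; includes i itself because k = 0 is allowed."""
    seen, stack = {i}, [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def communicating_classes(P):
    """Partition the states into classes of mutually accessible states."""
    n = len(P)
    reach = [reachable(P, i) for i in range(n)]
    classes, assigned = [], set()
    for i in range(n):
        if i not in assigned:
            cls = sorted(j for j in reach[i] if i in reach[j])
            classes.append(cls)
            assigned.update(cls)
    return classes


def period(P, i):
    """gcd of the return times k <= 3n with p_ii^(k) > 0 (0 if none)."""
    n = len(P)
    current, d = {i}, 0
    for k in range(1, 3 * n + 1):
        # States reachable from i in exactly k steps.
        current = {v for u in current for v, p in enumerate(P[u]) if p > 0}
        if i in current:
            d = gcd(d, k)
    return d


# Hypothetical chain: 0 and 1 alternate, 2 is transient, 3 is absorbing.
P = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
classes = communicating_classes(P)
```

Here the class $\{0,1\}$ is closed with period 2, state 2 forms a transient class of its own, and the absorbing state 3 is aperiodic.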

---

## 5. Stationary (invariant) distributions, limiting distributions & ergodicity

A **stationary distribution** (also called an invariant distribution) $\pi= (\pi_i)_{i\in S}$ satisfies

$$
\pi = \pi P,\qquad \sum_i \pi_i = 1,\quad \pi_i\ge 0.
$$

Interpretation: if $X_0\sim\pi$ then $X_t\sim\pi$ for all $t$.

**Existence and uniqueness (finite state irreducible chains).** For a finite irreducible chain there exists a unique stationary distribution with strictly positive entries; this follows from the Perron–Frobenius theorem applied to the nonnegative irreducible matrix $P$.

**Limiting distribution.** Under additional conditions (irreducible, aperiodic, and either finite or positive recurrent), the distribution converges to stationarity regardless of the initial state:

$$
\lim_{t\to\infty} p_{ij}^{(t)} = \pi_j,\quad\text{for all }i,j.
$$
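Convergence to $\pi$ suggests a simple way to approximate the stationary distribution: push any starting distribution through $P$ until it stops changing. This is a minimal sketch assuming a finite, irreducible, aperiodic chain; the function name, tolerance, and example matrix are illustrative choices.

```python
def stationary_by_iteration(P, start=None, tol=1e-12, max_iter=100_000):
    """Iterate dist <- dist @ P until successive iterates differ by < tol."""
    n = len(P)
    dist = list(start) if start is not None else [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(new, dist)) < tol:
            return new
        dist = new
    raise RuntimeError("no convergence; is the chain irreducible and aperiodic?")


# Hypothetical 3-state chain whose stationary law is (1/4, 1/2, 1/4).
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi_from_0 = stationary_by_iteration(P, start=[1.0, 0.0, 0.0])
pi_from_2 = stationary_by_iteration(P, start=[0.0, 0.0, 1.0])
```

Both starting points lead to the same limit, in line with the statement that the limit does not depend on the initial state. For a periodic chain the iterates can oscillate, and the function raises instead of returning.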

**Ergodic theorem (time averages).** If the chain is irreducible and positive recurrent, then for any function $f:S\to\mathbb{R}$ with $\sum_i \pi_i |f(i)| < \infty$,

$$
\frac{1}{n}\sum_{t=1}^n f(X_t) \xrightarrow{a.s.} \sum_i \pi_i f(i) \quad\text{as }n\to\infty.
$$

This justifies replacing expectation under the stationary law by long-run empirical averages.

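**Relation to mean recurrence times.** For an irreducible, positive recurrent chain, the stationary probability of state $i$ is the reciprocal of its mean recurrence time $m_i=\mathbb{E}_i[T_i]$, where $T_i=\min\{t\ge 1: X_t=i\}$ is the first return time to $i$: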

$$
\pi_i = \frac{1}{m_i}.
$$

---

## 6. Reversibility and detailed balance; MCMC insight

A chain with stationary distribution $\pi$ is **reversible** if, when started from $\pi$, the time-reversed process has the same law as the forward process. This holds exactly when the detailed balance equations are satisfied:

$$
\pi_i p_{ij} = \pi_j p_{ji},\quad\text{for all }i,j.
$$

Detailed balance implies $\pi$ is stationary: summing both sides over $i$ and using $\sum_i p_{ji}=1$ gives $\sum_i \pi_i p_{ij} = \pi_j$.

**MCMC connection.** Metropolis–Hastings and many MCMC samplers construct transition probabilities that satisfy detailed balance with respect to a target $\pi$, ensuring $\pi$ is the stationary distribution. Typical acceptance probability (proposal kernel $q(i\to j)$) is

$$
\alpha(i\to j) = \min\!\left(1, \frac{\pi_j q(j\to i)}{\pi_i q(i\to j)}\right).
$$

This makes $\pi$ stationary for the resulting chain. If that chain is also irreducible and aperiodic, it is ergodic and its long-run samples approximate $\pi$.
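On a finite state space the Metropolis–Hastings kernel can be written out as an explicit matrix and checked for detailed balance. The sketch below uses exact fractions; the target, the cycle proposal, and the function name are illustrative assumptions.

```python
from fractions import Fraction


def metropolis_hastings_matrix(target, q):
    """Transition matrix: propose j from q[i], accept with prob. alpha(i -> j)."""
    n = len(target)
    P = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and q[i][j] > 0:
                ratio = (target[j] * q[j][i]) / (target[i] * q[i][j])
                P[i][j] = q[i][j] * min(Fraction(1), ratio)
        # Rejected proposals keep the chain where it is.
        P[i][i] = 1 - sum(P[i])
    return P


# Hypothetical target on 4 states, with a symmetric proposal on a cycle.
target = [Fraction(k, 10) for k in (1, 2, 3, 4)]
n = len(target)
half = Fraction(1, 2)
q = [[half if (j - i) % n in (1, n - 1) else Fraction(0) for j in range(n)]
     for i in range(n)]

P = metropolis_hastings_matrix(target, q)
detailed_balance = all(target[i] * P[i][j] == target[j] * P[j][i]
                       for i in range(n) for j in range(n))
stationary = all(sum(target[i] * P[i][j] for i in range(n)) == target[j]
                 for j in range(n))
```

Only ratios of target probabilities enter the acceptance step, which is why the method works when $\pi$ is known only up to a normalizing constant.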

---

## 7. Spectral view, convergence rates and mixing times

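Consider diagonalizing (or Jordan-decomposing) $P$. For a finite reversible chain, $P$ is self-adjoint with respect to the inner product weighted by $\pi$, so its eigenvalues are real and it has a full set of eigenvectors. If the chain is also irreducible, the eigenvalues satisfy:

$$
1 = \lambda_1 > \lambda_2 \ge \lambda_3 \ge \dots \ge \lambda_n \ge -1.
$$

If the chain is aperiodic as well (so that $\lambda_n>-1$), the powers $P^t$ converge as $t\to\infty$ to the projection onto the eigenspace for $\lambda_1=1$. The **spectral gap** is $1-\lambda_2$. The exponential rate of convergence to stationarity is governed by the **absolute spectral gap** $1-\lambda_*$, where $\lambda_*=\max(|\lambda_2|,|\lambda_n|)$: error terms decay like $\lambda_*^t$.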

**Total variation mixing time.** Define total variation distance from stationarity when starting from state $i$:

$$
\|P^t(i,\cdot) - \pi\|_{TV} = \frac{1}{2}\sum_j |p_{ij}^{(t)} - \pi_j|.
$$

The mixing time $t_{\text{mix}}(\varepsilon)$ is the smallest $t$ such that $\max_i \|P^t(i,\cdot)-\pi\|_{TV} \le \varepsilon$.
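For a small chain the mixing time can be found by brute force: track every row of $P^t$ and stop when the worst-case total variation distance drops below $\varepsilon$. The function names and the two-state example are assumptions for illustration.

```python
def tv_distance(p, q):
    """Total variation distance between two distributions on the same states."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mixing_time(P, pi, eps=0.25, max_t=10_000):
    """Smallest t with max_i TV(P^t(i, .), pi) <= eps."""
    n = len(P)
    rows = [[float(i == j) for j in range(n)] for i in range(n)]  # P^0 = I
    for t in range(max_t + 1):
        if max(tv_distance(row, pi) for row in rows) <= eps:
            return t
        # Advance every starting state by one step: rows <- rows @ P.
        rows = [[sum(row[k] * P[k][j] for k in range(n)) for j in range(n)]
                for row in rows]
    raise RuntimeError("chain did not mix within max_t steps")


# Hypothetical two-state chain; stationary law (2/3, 1/3), second eigenvalue 0.7.
P = [[0.9, 0.1],
     [0.2, 0.8]]
pi = [2 / 3, 1 / 3]
t_quarter = mixing_time(P, pi, eps=0.25)
t_small = mixing_time(P, pi, eps=0.01)
```

In this example the worst-case distance is $\tfrac{2}{3}(0.7)^t$, so the computed values are 3 for $\varepsilon=0.25$ and 12 for $\varepsilon=0.01$.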

**Conductance and Cheeger inequality.** Conductance (a bottleneck measure) bounds mixing times. Minimizing over subsets $A\subset S$ with $0<\pi(A)\le 1/2$, the conductance

$$
\Phi = \min_{A:\,0<\pi(A)\le 1/2}\frac{\sum_{i\in A,j\notin A} \pi_i p_{ij}}{\pi(A)}
$$

satisfies, for reversible chains, the Cheeger inequality $\Phi^2/2 \le 1-\lambda_2 \le 2\Phi$, which links $\Phi$ to the spectral gap.

---

## 8. Absorbing chains and the fundamental matrix

Partition the state space into transient states $T$ and absorbing states $A$. Reorder $P$ so transient states come first:

$$
P = \begin{pmatrix} Q & R \\ 0 & I \end{pmatrix}
$$

where $Q$ is the submatrix of transitions among transient states (not to be confused with the CTMC generator of Section 10, which is also written $Q$), and $R$ gives transitions from transient to absorbing states.

The **fundamental matrix** is

$$
N = (I - Q)^{-1} = I + Q + Q^2 + \cdots.
$$

Interpretation: $N_{ij}$ is the expected number of visits to transient state $j$ starting from transient state $i$, before absorption.

**Expected time to absorption** starting from transient state $i$ is

$$
\mathbf{t} = N \mathbf{1}, \qquad t_i = \sum_j N_{ij}.
$$

**Absorption probabilities.** The matrix of absorption probabilities is

$$
B = NR,
$$

so $B_{ik}$ is the probability of being absorbed in absorbing state $k$ when starting from transient $i$.
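As a sketch of these formulas, take the fair gambler's ruin on $\{0,\dots,4\}$, where states 1, 2, 3 are transient and 0, 4 are absorbing. The example and the small Gauss–Jordan helper `inverse` are illustrative assumptions; exact fractions keep the output readable.

```python
from fractions import Fraction


def inverse(M):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(M)
    # Augment M with the identity: [M | I].
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        # Swap a row with a nonzero pivot into place, then normalize it.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        # Clear the pivot column in every other row.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


half = Fraction(1, 2)
Q = [[0, half, 0], [half, 0, half], [0, half, 0]]  # transient -> transient
R = [[half, 0], [0, 0], [0, half]]                 # transient -> absorbing {0, 4}

n = len(Q)
I_minus_Q = [[Fraction(int(i == j)) - Q[i][j] for j in range(n)] for i in range(n)]
N = inverse(I_minus_Q)                 # fundamental matrix
t = [sum(row) for row in N]            # expected steps to absorption
B = [[sum(N[i][k] * R[k][j] for k in range(n)) for j in range(len(R[0]))]
     for i in range(n)]                # absorption probabilities
```

The results match the classical answers: the expected duration from state $i$ is $i(4-i)$, and the probability of ending at 4 is $i/4$.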

---

## 9. Hitting times, mean recurrence times, and linear systems

Many quantities satisfy linear systems due to the Markov property.

**Expected hitting time to set $A$.** Define for $i\notin A$:

$$
h(i) = \mathbb{E}_i[\tau_A],\qquad \tau_A = \min\{t\ge 0: X_t\in A\}.
$$

Then $h(i)$ solves

$$
\begin{cases}
\;h(i)=0, & i\in A,\\
\;h(i)=1+\sum_{j} p_{ij} h(j), & i\notin A.
\end{cases}
$$

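Restricting to the states outside $A$ (write $P_{TT}$ for the corresponding submatrix of $P$ and $h_T$ for the restriction of $h$), this is the linear system $(I - P_{TT}) h_T = \mathbf{1}$. When the system has more than one nonnegative solution, $h$ is the minimal one.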
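A worked instance, chosen here for illustration: for the symmetric random walk on $\{0,1,\dots,N\}$ with $p_{i,i\pm1}=\tfrac12$ and target set $A=\{0,N\}$, the system reads

$$
h(0)=h(N)=0,\qquad h(i) = 1 + \tfrac{1}{2}h(i-1) + \tfrac{1}{2}h(i+1),\quad 0<i<N.
$$

The candidate $h(i)=i(N-i)$ satisfies the boundary conditions, and substituting it into the right-hand side gives

$$
1 + \tfrac{1}{2}\bigl[(i-1)(N-i+1) + (i+1)(N-i-1)\bigr]
= 1 + \tfrac{1}{2}\bigl[2i(N-i) - 2\bigr]
= i(N-i),
$$

so $h(i)=i(N-i)$ is the expected time to hit the boundary. For $N=4$ this gives $h(1)=3$, $h(2)=4$, $h(3)=3$.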

**Mean return time and stationary distribution.** If state $i$ is positive recurrent with mean return time $m_i$, then $\pi_i = 1/m_i$ as noted earlier. This is derived by renewal arguments and ergodicity.

---

## 10. Continuous-Time Markov Chains (CTMC): generator, Kolmogorov equations, matrix exponential

A CTMC on finite or countable $S$ is specified by an **infinitesimal generator** (rate matrix) $Q=[q_{ij}]$, with

$$
q_{ij}\ge 0 \quad (i\ne j),\qquad q_{ii} = -\sum_{j\ne i} q_{ij}.
$$

For small $h>0$:

$$
\Pr(X_{t+h}=j \mid X_t=i) = \begin{cases} q_{ij} h + o(h), & j\ne i,\\ 1 + q_{ii}h + o(h), & j=i.\end{cases}
$$

The transition matrix function $P(t)=[p_{ij}(t)]$ satisfies the **Kolmogorov forward and backward equations**. Writing $P(t)$ with rows indexed by initial state and columns by terminal state:

**Forward equation:**

$$
\frac{d}{dt} P(t) = P(t) Q, \qquad P(0) = I.
$$

**Backward equation:**

$$
\frac{d}{dt} P(t) = Q P(t), \qquad P(0) = I.
$$

(Both are valid; they are matrix forms of the two ways to apply the Chapman–Kolmogorov composition law.) The formal solution is the **matrix exponential**:

$$
P(t) = e^{Qt} = \sum_{n=0}^{\infty} \frac{(Qt)^n}{n!}.
$$

This gives a practical way to compute $p_{ij}(t)$ when the state space is small or $Q$ is diagonalizable.

**Stationarity in CTMCs.** A probability row vector $\pi$ is stationary for the CTMC iff

$$
\pi Q = 0.
$$

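(Interpretation: in stationarity, the total probability flow out of each state equals the total flow into it, the *global balance* condition $\pi_j\sum_{i\ne j} q_{ji} = \sum_{i\ne j}\pi_i q_{ij}$.)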
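The matrix exponential can be approximated with a truncated Taylor series combined with scaling and squaring. The sketch below is a plain-Python illustration, not a production routine; the function names, the number of terms, and the two-state generator are assumptions. For a two-state generator with rates $a$ ($0\to1$) and $b$ ($1\to0$), the closed form $p_{00}(t) = \bigl(b + a e^{-(a+b)t}\bigr)/(a+b)$ gives an independent check.

```python
import math


def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(Q, t, squarings=10, terms=20):
    """Approximate exp(Q t): Taylor series of Q t / 2^s, then square s times."""
    n = len(Q)
    scale = t / 2 ** squarings
    A = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = identity
    for k in range(1, terms + 1):
        # term becomes A^k / k!
        term = [[x / k for x in row] for row in matmul(term, A)]
        result = [[r + s for r, s in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


# Hypothetical two-state generator with rates a (0 -> 1) and b (1 -> 0).
a, b = 2.0, 1.0
Q = [[-a, a], [b, -b]]
Pt = expm(Q, 0.5)
closed_form_p00 = (b + a * math.exp(-(a + b) * 0.5)) / (a + b)
```

Scaling first keeps the series argument small, so a short series is accurate; each squaring then doubles the time horizon via $P(2s) = P(s)^2$, which is the Chapman–Kolmogorov law in continuous time.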

---

## 11. Birth–Death processes and queueing example (M/M/1)

**Birth–death processes** are CTMCs on $S=\{0,1,2,\dots\}$ with transitions only between neighboring states. Let the birth rates be $\lambda_n$ ($n\to n+1$) and the death rates be $\mu_n$ ($n\to n-1$, with $\mu_0=0$). The generator has nonzero entries

$$
q_{n,n+1} = \lambda_n,\qquad q_{n,n-1} = \mu_n,\qquad q_{nn} = - (\lambda_n + \mu_n).
$$

**M/M/1 queue.** Here $\lambda_n=\lambda$ (arrival rate) for all $n\ge 0$ and $\mu_n=\mu$ (service rate) for all $n\ge 1$. If $\rho=\lambda/\mu<1$, the stationary distribution is geometric:

$$
\pi_n = (1-\rho) \rho^n,\quad n\ge 0.
$$

The M/M/1 queue is a canonical CTMC example: many performance metrics can be derived analytically, such as the queue-length distribution above, the mean number in system $\sum_n n\pi_n = \rho/(1-\rho)$, and blocking probabilities in finite-capacity variants.
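For a birth–death chain, balancing the flow between neighboring states gives $\pi_n\lambda_n = \pi_{n+1}\mu_{n+1}$, so $\pi_n \propto \prod_{k=0}^{n-1}\lambda_k/\mu_{k+1}$. The sketch below applies this product formula to a finite-capacity queue (states $0,\dots,N$); the function name and the rates are illustrative assumptions.

```python
from fractions import Fraction


def birth_death_stationary(birth, death):
    """Stationary law on {0, ..., N} from the product formula.

    birth[n] is the rate n -> n+1 and death[n] is the rate n+1 -> n,
    for n = 0, ..., N-1.
    """
    weights = [Fraction(1)]
    for lam, mu in zip(birth, death):
        weights.append(weights[-1] * Fraction(lam) / Fraction(mu))
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical M/M/1 queue with capacity 3: arrival rate 1, service rate 2.
lam, mu, capacity = 1, 2, 3
pi = birth_death_stationary([lam] * capacity, [mu] * capacity)

rho = Fraction(lam, mu)
truncated_geometric = [(1 - rho) * rho ** n / (1 - rho ** (capacity + 1))
                       for n in range(capacity + 1)]
blocking_probability = pi[-1]  # chance an arrival finds the system full
```

With constant rates the result is the geometric law truncated at the capacity, and letting the capacity grow recovers $\pi_n=(1-\rho)\rho^n$ when $\rho<1$.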

---

## 12. Markov processes on continuous state spaces (diffusions)

A continuous-state, continuous-time Markov process can often be represented as the solution to an SDE:

$$
dX_t = \mu(X_t,t)\,dt + \sigma(X_t,t)\, dW_t,
$$

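where $W_t$ is Brownian motion. These diffusions have the Markov property and, under regularity conditions on $\mu$ and $\sigma$, possess transition densities $p(x,t\mid y,s)$ (the density of $X_t$ at $x$ given $X_s=y$, for $s<t$) that solve the Kolmogorov forward and backward PDEs.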

The **Fokker–Planck (forward Kolmogorov) equation** for the density $p(x,t)$ is

$$
\frac{\partial}{\partial t} p(x,t) = -\frac{\partial}{\partial x}[\mu(x,t) p(x,t)] + \frac{1}{2} \frac{\partial^2}{\partial x^2}[\sigma^2(x,t) p(x,t)].
$$

**Brownian motion** (Wiener process) satisfies $dX_t = dW_t$; transition density is Gaussian:

$$
X_t - X_s \sim \mathcal{N}(0, t-s),\qquad p(x,t\mid y,s)=\frac{1}{\sqrt{2\pi (t-s)}} e^{-\frac{(x-y)^2}{2(t-s)}}.
$$

**Ornstein–Uhlenbeck (OU)** process: mean-reverting SDE

$$
dX_t = \theta(\mu - X_t)\,dt + \sigma\, dW_t.
$$

The OU process is Gaussian with stationary distribution $\mathcal{N}(\mu,\; \sigma^2/(2\theta))$ (for $\theta>0$). Its transition law is known in closed form.
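The closed-form OU transition law is Gaussian: given $X_s=x$, the value $X_{s+\Delta}$ has mean $\mu+(x-\mu)e^{-\theta\Delta}$ and variance $\frac{\sigma^2}{2\theta}\bigl(1-e^{-2\theta\Delta}\bigr)$, where $\mu$ is the constant long-run mean from the OU equation. This allows exact sampling on a time grid with no discretization error. The function names and parameter values below are illustrative assumptions.

```python
import math
import random


def ou_transition_moments(x, dt, theta, mu, sigma):
    """Mean and variance of X_{s+dt} given X_s = x for the OU process."""
    decay = math.exp(-theta * dt)
    mean = mu + (x - mu) * decay
    var = sigma ** 2 / (2 * theta) * (1 - decay ** 2)
    return mean, var


def sample_ou_path(x0, dt, n_steps, theta, mu, sigma, rng):
    """Exact samples of the OU process at times 0, dt, ..., n_steps * dt."""
    path = [x0]
    for _ in range(n_steps):
        mean, var = ou_transition_moments(path[-1], dt, theta, mu, sigma)
        path.append(rng.gauss(mean, math.sqrt(var)))
    return path


# Hypothetical parameters: mean-reversion speed 1.5, long-run mean 0.5.
theta, mu, sigma = 1.5, 0.5, 0.3
rng = random.Random(0)  # seeded for reproducibility
path = sample_ou_path(2.0, 0.1, 50, theta, mu, sigma, rng)
long_run_mean, long_run_var = ou_transition_moments(2.0, 100.0, theta, mu, sigma)
```

As $\Delta\to\infty$ the moments tend to $\mu$ and $\sigma^2/(2\theta)$, the stationary law stated above.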

---

## 13. Strong Markov property and stopping times

The **strong Markov property** strengthens the basic Markov property by allowing conditioning at stopping times (random times $\tau$ for which the event $\{\tau\le t\}$ is determined by the process history up to time $t$). If $\tau$ is a stopping time, then on the event $\{\tau<\infty\}$ and given $X_\tau=x$, the future $\{X_{\tau+t}:t\ge 0\}$ is independent of the pre-$\tau$ history and has the same law as the process started at $x$.

This property is crucial in optional sampling, gambler's ruin calculations, renewal theory, and many probabilistic proofs.

---

## 14. Examples: random walk, PageRank, credit migration, HMM (NLP)

### 14.1 Simple two-state DTMC (worked algebra)

Let

$$
P = \begin{pmatrix} 1-a & a \\ b & 1-b \end{pmatrix},\qquad 0\le a,b\le 1.
$$

Solve for stationary $\pi=(\pi_1,\pi_2)$ satisfying $\pi = \pi P$:

$$
\begin{cases}
\pi_1 = \pi_1 (1-a) + \pi_2 b,\\
\pi_1 + \pi_2 = 1.
\end{cases}
$$

From the first equation: $\pi_1 a = \pi_2 b$. Using normalization:

$$
\pi_1 = \frac{b}{a+b},\qquad \pi_2 = \frac{a}{a+b}.
$$

If $a,b>0$ the chain is irreducible. It is aperiodic unless $a=b=1$, in which case it alternates deterministically between the two states and has period 2. Whenever $0<a+b<2$, the distribution converges to $\pi$ from any initial state.
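The convergence can be made explicit. The eigenvalues of $P$ are $1$ and $1-a-b$, and for $a+b>0$ the $t$-step matrix has the closed form

$$
P^t = \frac{1}{a+b}\begin{pmatrix} b & a \\ b & a \end{pmatrix}
+ \frac{(1-a-b)^t}{a+b}\begin{pmatrix} a & -a \\ -b & b \end{pmatrix}.
$$

Setting $t=1$ recovers $P$, since $b + a(1-a-b) = (a+b)(1-a)$ and $a + b(1-a-b) = (a+b)(1-b)$. The first term has both rows equal to $\pi$, and the second term vanishes geometrically at rate $|1-a-b|$ when $0<a+b<2$; when $a=b=1$ the factor $(-1)^t$ does not decay, which is the period-2 case.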

### 14.2 Random walk on a finite graph

For a simple random walk on a connected undirected graph, the transition probability from node $i$ to each neighbor $j$ is $1/\deg(i)$. The stationary distribution is

$$
\pi_i = \frac{\deg(i)}{\sum_k \deg(k)}.
$$

Stationary mass is thus proportional to degree, which links the walk to degree-based centrality; PageRank (next subsection) adapts the idea to directed graphs.
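The degree formula can be checked exactly on a small graph. The adjacency lists and function names below are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical connected undirected graph as adjacency lists.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}


def degree_distribution(graph):
    """pi_i = deg(i) / sum_k deg(k)."""
    total = sum(len(nbrs) for nbrs in graph.values())
    return {i: Fraction(len(nbrs), total) for i, nbrs in graph.items()}


def walk_step(dist, graph):
    """One step of the simple random walk applied to a distribution."""
    new = {i: Fraction(0) for i in graph}
    for i, nbrs in graph.items():
        for j in nbrs:
            # Node i sends an equal share of its mass to each neighbor.
            new[j] += dist[i] / len(nbrs)
    return new


pi = degree_distribution(graph)
is_stationary = walk_step(pi, graph) == pi
```

Each edge carries mass $1/\sum_k \deg(k)$ in both directions, so the walk is reversible and the detailed balance equations hold edge by edge.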

### 14.3 PageRank (brief formalization)

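View the web as nodes with directed links. The Google matrix with damping factor $\alpha\in(0,1)$ is

$$
G = \alpha S + (1-\alpha)\, v\mathbf{1}^\top,
$$

where $S$ is the column-stochastic link matrix after fixing dangling nodes, and $v$ is a teleportation vector (a probability column vector). This subsection uses the column-vector convention, the transpose of the row-vector convention used elsewhere in this reference. The PageRank vector $x$, normalized so that $\mathbf{1}^\top x = 1$, satisfies

$$
x = Gx = \alpha S x + (1-\alpha) v.
$$

This is a stationary distribution computation for a chain that teleportation makes irreducible and aperiodic when $v$ has strictly positive entries.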
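The fixed-point equation can be solved by power iteration on sparse link lists. This sketch assumes a uniform teleportation vector and redistributes the mass of dangling nodes according to that same vector, which is one common choice rather than the only one; the graph and function name are hypothetical.

```python
def pagerank(links, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration for x = alpha * S x + (1 - alpha) * v, with uniform v.

    links[u] lists the nodes that u points to; every target must be a key.
    """
    nodes = sorted(links)
    n = len(nodes)
    v = {u: 1.0 / n for u in nodes}
    x = dict(v)
    for _ in range(max_iter):
        # Mass sitting on nodes without out-links is spread according to v.
        dangling = sum(x[u] for u in nodes if not links[u])
        new = {u: (1 - alpha) * v[u] + alpha * dangling * v[u] for u in nodes}
        for u in nodes:
            for w in links[u]:
                new[w] += alpha * x[u] / len(links[u])
        if sum(abs(new[u] - x[u]) for u in nodes) < tol:
            return new
        x = new
    raise RuntimeError("power iteration did not converge")


# Hypothetical 4-page web: node 3 has no in-links.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
ranks = pagerank(links)
top_node = max(ranks, key=ranks.get)
```

Each sweep touches every link once, so the cost per iteration is proportional to the number of links, and the error contracts by roughly a factor $\alpha$ per sweep.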

### 14.4 Credit-rating migration (finance)

Ratings form states, with an absorbing or near-absorbing default state. Given one-year transition matrix $P$, k-year migration is $P^k$. Credit-risk quantities (survival probabilities, default probabilities, expected loss) come directly from powers of $P$ and recovery assumptions.
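As a sketch of the calculation, the cumulative default probability over $k$ years starting from rating $i$ is the entry of $P^k$ in row $i$ and the default column, provided default is absorbing. The three-rating matrix below is made up purely for illustration and is not calibrated to any data.

```python
RATINGS = ["A", "B", "D"]  # D = default, modeled as absorbing

# Hypothetical one-year migration matrix (rows sum to 1).
P = [
    [0.90, 0.08, 0.02],
    [0.10, 0.80, 0.10],
    [0.00, 0.00, 1.00],
]


def cumulative_default_probs(P, start, horizon):
    """Probability of having defaulted by year 1, 2, ..., horizon."""
    n = len(P)
    default_state = n - 1
    dist = [float(i == start) for i in range(n)]
    probs = []
    for _ in range(horizon):
        # Advance the rating distribution by one year.
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
        probs.append(dist[default_state])
    return probs


pd_from_A = cumulative_default_probs(P, RATINGS.index("A"), 5)
survival_from_A = [1.0 - p for p in pd_from_A]
```

Because default is absorbing, the cumulative default probabilities are non-decreasing in the horizon, and the survival probabilities are their complements.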

### 14.5 Hidden Markov Models (HMM) in NLP

Let hidden states $z_t\in\{1,\dots,m\}$ and observations $x_t$ (words). The joint probability of a state sequence and observations:

$$
\Pr(z_{1:T}, x_{1:T}) = \pi_{z_1} b_{z_1}(x_1) \prod_{t=2}^T a_{z_{t-1},z_t} b_{z_t}(x_t),
$$

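where $\pi$ here denotes the initial state distribution, $A=[a_{ij}]$ is the state transition matrix, and $b_j(x)$ are the emission probabilities. Key algorithms:

* **Forward algorithm (likelihood):** initialize $\alpha_1(j) = \pi_j b_j(x_1)$, recurse $\alpha_t(j) = \bigl[\sum_i \alpha_{t-1}(i) a_{ij}\bigr] b_j(x_t)$, and read off $\Pr(x_{1:T}) = \sum_j \alpha_T(j)$.
* **Viterbi (most likely state path):** initialize $\delta_1(j) = \pi_j b_j(x_1)$ and recurse $\delta_t(j) = \max_i [\delta_{t-1}(i) a_{ij}]\, b_j(x_t)$, storing backpointers to recover the path.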

HMMs are central to POS tagging, speech recognition, and other sequence tasks (though modern approaches often use neural variants incorporating Markov-like structure implicitly).
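Both recursions fit in a few lines for a small model. The two-state, three-symbol HMM below is a toy with made-up numbers, and the function names are illustrative. Probabilities are multiplied directly, which is fine for short sequences; longer ones usually call for log-space or rescaling to avoid underflow.

```python
def joint_probability(init, A, B, states, obs):
    """Pr(z_{1:T}, x_{1:T}) for one specific state path."""
    p = init[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p


def forward_likelihood(init, A, B, obs):
    """Pr(x_{1:T}) by the forward recursion."""
    m = len(init)
    alpha = [init[j] * B[j][obs[0]] for j in range(m)]
    for x in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(m)) * B[j][x]
                 for j in range(m)]
    return sum(alpha)


def viterbi(init, A, B, obs):
    """Most likely state path and its joint probability."""
    m = len(init)
    delta = [init[j] * B[j][obs[0]] for j in range(m)]
    backptrs = []
    for x in obs[1:]:
        prev = delta
        # Best predecessor of each state j at this time step.
        ptr = [max(range(m), key=lambda i: prev[i] * A[i][j]) for j in range(m)]
        delta = [prev[ptr[j]] * A[ptr[j]][j] * B[j][x] for j in range(m)]
        backptrs.append(ptr)
    state = max(range(m), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1], max(delta)


# Toy HMM: 2 hidden states, 3 observation symbols (all numbers hypothetical).
init = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2]

likelihood = forward_likelihood(init, A, B, obs)
best_path, best_prob = viterbi(init, A, B, obs)
```

The forward pass sums over all $m^T$ paths in $O(Tm^2)$ time; Viterbi has the same structure with the sum replaced by a maximum.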

---

## 15. Key theorems — statements and sketches

### 15.1 Perron–Frobenius (finite irreducible, aperiodic chains)

For an irreducible stochastic matrix $P$ on a finite state space, 1 is a simple eigenvalue and there exists a strictly positive left eigenvector $\pi$ with $\pi P = \pi$. If the chain is aperiodic, then $P^t\to \mathbf{1}\pi$ as $t\to\infty$.

**Sketch idea:** apply Perron–Frobenius to the positive power $P^k$ (some power becomes strictly positive if irreducible and aperiodic), deduce dominant eigenvector positivity and uniqueness, then analyze power iteration.

### 15.2 Ergodic theorem (law of large numbers for Markov chains)

If a Markov chain is irreducible and positive recurrent with stationary distribution $\pi$ (aperiodicity is not required), then for any function $f$ with $\sum_x \pi_x |f(x)| < \infty$:

$$
\frac{1}{n}\sum_{t=1}^n f(X_t) \xrightarrow{a.s.} \sum_x \pi_x f(x).
$$

**Sketch idea:** regenerative structure via returns to a state and renewal theory.

---

## 16. Practical implementation notes and modeling tips

* For **finite large graphs** use sparse matrix representations and power iteration for stationary vectors (PageRank) rather than dense methods.
* When building **MCMC**, verify irreducibility and aperiodicity of your proposal+acceptance kernel; enforce minorization or add small random moves if mixing is poor.
* For **credit migration**, beware of non-stationary transition matrices — calibrate over appropriate time windows and test stability. Continuous-time generator-based models (calibrating a generator $Q$) often yield better interpolation between horizons.
* In **NLP**, HMMs are interpretable but can be outperformed by discriminative or neural models; still useful for primers, baselines, and constrained problems.
* Use **coupling** and **conductance** estimates when you need theoretical mixing time guarantees; use diagnostics (autocorrelation, effective sample size) in practice.

---

## 17. References & further reading

* Norris, J.R., *Markov Chains.* Cambridge University Press, 1997. (Classic, readable.)
* Levin, Peres, Wilmer, *Markov Chains and Mixing Times.* (Detailed, modern treatment with algorithms.)
* Karlin & Taylor, *A First Course in Stochastic Processes.* (Good for CTMC and birth–death processes.)
* Norris / Ross — sections on continuous-time chains and queuing.
* Durrett, *Probability: Theory and Examples* — for more depth in recurrence/transience, coupling.
* Bishop / Rabiner papers for HMMs (speech recognition literature).

