layout | title | description | category | tags | image | published | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
post-light-feature |
How does physics connect to machine learning? |
Did Richard Feynman help seed a key machine learning technique in the 60s? |
articles |
|
|
true |
Mandarin translation available - 用普通话阅读这篇文章: WeChat
I struggled to learn machine learning. I was used to variational tricks, MCMC samplers, and discreet Taylor expansions from years of physics training. Now the concepts were mixed up. The intuitive models of physical systems were replaced by abstract models of ‘data’ and amechanical patterns of cause and effect.
I had to fit these fields together. Physics and machine learning are intricately connected, but it is taking me years to make the overlaps precise. This process requires representing the new with the familiar, mapping jargon from one field to another.
A simple model of magnets---the Ising model---will help illustrate the rich connection between these fields. We first analyze this model with physics intuition. Then we derive the variational principle in physics and show that it recovers the same solution.
We then discover how that very same variational principle in physics opens a window into machine learning. We identify Boltzmann distributions as exponential families to make the mapping transparent, and show how approximate posterior inference is scaled to massive data thanks to the variational principle.
If you have a physics background, I hope you will have a better sense of machine learning and be able to read papers in the field. If you are a machine learner, I hope you will have the context to read a statistical physics paper about mean-field theory and the Ising model.
If this article is confusing, falls short of these goals, or could be improved in any way please email me, @ me, or submit a pull request.
Consider a lattice of spins that point up or down:
{% asset ising-model.svg %}What features might make this a convincing model of magnetism?
Think about playing with magnets---if you put them close together, they pull each other closer. They repulse each other when their poles oppose, and if they’re far apart, they don't attract.
This means neighboring spins should affect each other in our model: if the spins around
Let’s refer to the spin at location
We can capture our intuition about spins being attracted to each other (they want to point in the same direction) or repulsed (they want to point in opposite directions) by introducing a parameter
If two neighboring spins point in the same direction, we'll have them contribute a term
This lets us write the energy function, or Hamiltonian, of the system:
Here
A spin configuration or state of the system is a specific setting of values for all spins. The set
The second law of thermodynamics says that at a fixed temperature and entropy, a system will seek configurations that minimize its energy. This lets us reason about interactions.
If the interaction strength
Now let’s introduce a magnetic field
We can reason about the magnetic field strength
Now that we have defined the Ising model and its characteristics, let’s think about our goals. What questions can we answer about this Ising model? For example, if we observe the system, what state will it be in---what are the most likely spin configurations? What is the average magnetization?
Can we make our goals more precise and make math from words? To do this, we need to define a distribution over spin configurations. It is straightforward to derive the probability of finding the system in an equilibrium state 1:
This is the Boltzmann distribution. The numerator is called the Boltzmann factor for a particular configuration. This factor gives high or low weight to a specific state of the system according to the energy for that state.
We query the Boltzmann distribution at a specific configuration of spins to get the probability of finding the system in this state.
For example, say the first spins in our configuration happen to be up, up, down, etc. We plug this in and get
This distribution behaves intuitively: low energy states are more probable than configurations with high energy. For example, if
The parameter
The denominator
I explicitly wrote out the sum to illustrate why we can’t evaluate this distribution: we need to sum over all possible configurations. Each spin has two states, and there are
We arrived at a probability distribution describing which states of the system are likely, however we were stumped by the intractable partition function. Let's temporarily assume we have infinite computation and can calculate the Boltzmann distribution's partition function. What are some interesting things we can learn about the system from it's Boltzmann distribution?
This distribution lets us to calculate properties of the system as a whole by taking expectations (i.e. calculating observable quantities). For example, the magnetization
Why should we care about this magnetization? It tells us about the system as a whole, the macrostate, rather than a specific microstate. We lose specificity because we can't say anything about the first spin
If the spins are aligned, the system is in an ordered state and the magnetization has a positive or negative sign. If the spins are anti-aligned, the system is disordered and the average magnetization is zero.
These are global phases of the system, and they depend on temperature. If the temperature
Let's remember that we can't evaluate the partition function
Because we cannot evaluate the intractable sum required to calculate the partition function, we turn to mean-field theory.
This is an approximation technique that can still let us answer questions about the system such as the average magnetization. We will study the dependence of the magnetization
To demonstrate the technique, it is easiest to focus on a single spin:
{% asset ising-model-single-spin.svg %} The first spin of the Ising model in a magnetic field H. The magnetic field is shown with dashed lines. Its nearest neighbors provide an effective field through the interactions, denoted by lines connecting the spins.The contribution of this single spin to the total energy of the system is simply the corresponding term in the energy:
The sum is over the
The next step is crucial: we will ignore the fluctuations of neighboring spins around their mean value. In other words, we assume that the term
When is this true?
When the fluctuations around the mean value are small, such as at low temperature ‘ordered’ phases. This assumption greatly simplifies the Hamiltonian for the spin:
This is the mean-field energy function for a single spin. It is equivalent to a non-interacting spin in an effective magnetic field,
Why do we say this spin is non-interacting? The energy function for the spin only depends on its state,
In this mean-field model, each spin feels the effects of the magnetic field applied to the entire system,
We clarify this interpretation by writing
where
By ignoring the fluctuations of each spin, we have reduced the complexity of the problem. Instead of
We write the energy function for the mean-field model as
This shows that there are no interaction terms anymore (the term
In other words, we can treat each spin independently, then combine the results appropriately to model the entire system!
We have radically changed the nature of the problem.
Instead of computing the partition function
This is straightforward and has an analytic solution 4:
The partition function for the entire mean-field model with
With the partition function in hand, we can get the Boltzmann distribution and answer questions about the system such as magnetization.
We get the magnetization by taking an expectation over the distribution for the spin. The last step is to require that for any spin
This gives us a clean equation for the magnetization,
where we used that the mean-field parameter is
This is a formula for the magnetization
First let’s think about the case when there is no external field,
For high temperatures, the equation only has one solution:
For low temperatures, we see three solutions:
The "critical temperature" at which the phase transition occurs is when
This gives us a testable prediction: we can take a magnetic material, and measure what temperature its phase transition occurs at.
Have we accomplished our goal?
We set out to understand the behavior of this model at various temperatures, in terms of global properties like the magnetization.
By considering a single spin and approximating the effects of other spins as an effective magnetic field, we were able to reduce the complexity of the problem. This allowed us to study phase transitions. However, our exposition felt a little hand-wavy, so let's dive into a rigorous foundation to justify our intuitions.
Can we learn what tradeoffs we make when we make the assumption of 'ignoring fluctuations' of spins around their mean values? Specifically, how can we gauge the quality of results derived from our mean-field theory?
We can rederive the mean-field results in the previous section by directly attacking the problem of the intractable partition function. We can try to approximate this partition function with a simpler one.
Recall that the partition function
where as before, the energy for the system is
The complexity of computing the partition function comes from the interaction term with
To derive the variational principle, we will therefore assume an energy function of the form
Previously we saw that the mean-field parameter is
Now we ask the question: is this the optimal effective magnetic field? We can think of
This is known as perturbation theory: we are perturbing the magnetic field of the system and trying to find the optimal perturbation that yields a good approximation to the original system.
What does a ‘good approximation’ entail? Our difficulties were in computing the partition function. We therefore want to approximate the partition function of the original system
First let’s see if we can express the partition function of the original system
This lets us reëxpress the original partition function as:
For the next step, we need the definition of an expectation of a function
This means we can write the partition function of the system in terms of the mean-field partition function as:
This is an exact factorization of the partition function of the original system. It is the mean-field partition function weighted by the expected Boltzmann factor for energy fluctuations away from the reference system.
However, integrating this complicated exponential function is difficult, even with respect to the mean-field system. We’ll simplify it with a classic physics trick---by pulling a Taylor expansion.
Let’s assume that the fluctuations of the energy are small;
We have neglected terms of second order in the fluctuations
How good is the approximation? We need a simple identity 5:
Let’s apply this to the expectation in the exact factorization of the partition function, taking
Now we can get a lower bound on the partition function:
This inequality is the Gibbs-Bogoliubov-Feynman inequality. It tells us that with our mean-field approximation, we get a lower bound on the original partition function.
Let’s apply this theory: do we recover the same results for magnetization in the Ising model?
In the mean-field Ising model, we treat each spin independently, so the energy function of the system decomposes into independent parts:
Here
Let’s plug this into the lower bound on the partition function from the Gibbs-Bogoliubov-Feynman inequality. Then we take the derivative to maximize the lower bound:
First, we need to evaluate the expectation:
where we used the mean-field assumption that the spins are independent, hence
We also assumed that for a large enough system, spins at the edges of the model (boundary conditions) can be ignored, so all spins have the same average magnetization:
Plugging this in to the lower bound on the partition function and differentiating gives
We used that
This confirms our earlier reasoning, that the optimal mean-field parameter is
Now let’s frame what we just did in the language of machine learning. More specifically, let’s think in terms of probabilistic modeling.
We need some definitions to see how the variational principle is equivalent to variational inference in machine learning.
The Ising model is an undirected graphical model or Markov random field. We can represent the conditional dependencies of the model using a graph; the nodes in the graph are random variables. These random variables are the spins of the Ising model, so two nodes are connected by an edge if they interact. This lets us encode the joint distribution of the random variables in the following diagram:
{% asset ising-model-graphical-model.svg %} A representation of the Ising model as an undirected graphical model. The nodes are random variables (spins) and edges denote conditional dependencies between their distributions.The Boltzmann distribution is a parameterization of the joint distribution of this graphical model. This figure looks very similar to the physics spin-based representation---the spins are random variables. We can also write the joint distribution of the nodes in exponential family form. Exponential family distributions let us reason about a broad class of models and deserve a header.
A way to parameterize probability distributions like the Ising model is with exponential families. These are families of distributions that support a representation in this specific, convenient mathematical form:
Here
For example, we are used to seeing the Bernoulli distribution in the following form 6:
We can rewrite this in exponential family form:
Comparing to the above formula for exponential families reveals the natural parameter, base measure, sufficient statistic, and log normalizer for the Bernoulli, given by
More connections to physics: the log normalizer is the log of the partition function. This is made clear in the exponential family form of the Bernoulli:
Let’s connect this to the energy function of the Ising model by writing its Boltzmann distribution in exponential family form:
We have introduced some new notation common to graphical models: we have specified a joint distribution over a collection of random variables
This is the exponential family form of the Ising model, a probability model with model parameters
For the Ising model, we can see that there are two sets of model parameters. The spin-spin interaction parameter multiplied by the inverse temperature
This is a subtle but important point. Our joint distribution over the set of random variables (the
Computing the magnetization
But calculating the marginal distribution is intractable for reasons we already discussed: it requires marginalizing over all other nodes
The situation is hopeless: not only do we need to calculate the normalizing constant for the joint distribution of
This is identical to what we saw in the partition function, when thinking about this model from a physics perspective.
Can we still answer questions about the marginal distributions by resorting to a variational principle?
If we could calculate the sum over all configurations of random variables, we could calculate the partition function. But we can’t, because the sum grows as
With our physics hat on, our strategy was to approximate to the partition function.
From a machine learning perspective, this technique is known as variational inference. We vary something simple to infer something complicated.
Let’s look at how the variational free energy is derived in machine learning and used to approximate partition functions.
We have a probability model of random variables
Let’s construct a simpler probability distribution
How good is our approximation? One way of measuring how close our approximation is to our goal distribution is with the Kullback-Leibler divergence.
This divergence between
This gives us a criteria with which to vary our approximation. We vary the
The KL divergence is written with a double vertical bar as
Let’s assume we are dealing with an exponential family distribution such as the Ising model. We let
We assume that
To measure how much information we lose when we use our approximation
$$= \mathbb{E}{q\lambda} [\log q_\lambda(s)] - \mathbb{E}{q\lambda}[-\beta E(s)] + \log Z$$
where we have defined the variational lower bound
$$\mathcal{L}(\lambda) = \mathbb{E}{q\lambda}[-\beta E(s)] - \mathbb{E}{q\lambda}[\log q_\lambda(s) ]$$
We can move the variational lower bound to the other side of the equation to get the following identity:
With Jensen’s inequality it is easy to show that the KL divergence is always greater than or equal to zero. This means that if we make
This means we can vary the parameters
Note that in the definition of the variational lower bound, we do not need to worry about the arduous task of calculating the partition function: it does not depend on
This is awesome: we have constructed an approximation
The interesting part is that we get can improve the approximation to our model
Is this too clever to be true? Have we surrendered anything? We have lost the ability to measure how good our approximation is, in absolute terms---for that, we still need to calculate the partition function to compute the KL divergence. We do know that as long as our lower bound
Let’s see if this is the same as the Gibbs-Bogoliubov-Feynman inequality we saw in physics. Recall that the inequality is
Taking logarithms:
$$\Rightarrow \log Z \geq \mathbb{E}{q\lambda}[-\beta E(s)] - \mathbb{E}{q\lambda} [\log q_\lambda(s) ]$$
Where we have identified that the variational family we are using, is the mean-field Boltzmann distribution
This shows that variational inference in machine learning---maximizing a lower bound on the partition function---is exactly the Gibbs-Bogoliubov-Feynman inequality in action.
In machine learning we care about patterns in data. This gives rise to the concept of latent variables, unobserved random variables that capture patterns in observed data.
For example, in linear regression we might posit a linear relationship between someone's age and their income. This scalar coefficient captures a latent pattern that we seek to infer from many examples of (age, income) tuples.
We refer to a probability model as a model of latent variables
What is a posterior? In our regression example of the relationship between age and income, we want the posterior distribution of the regression coefficient after observing data. Our choice of prior on the coefficient is a modeling decision and reflects our belief about the statistical relationship we hope to observe.
The posterior is given by Bayes’ rule:
The denominator is the evidence; the marginal distribution of the data:
Can we still do posterior inference despite the intractable partition function?
The refrain is familiar: we have an intractable sum in our partition function, but we can approximate it using the tools we developed earlier! Variational inference to the rescue. Let's write out the variational lower bound on the partition function:
$$\log Z = \log p(x) \geq \mathcal{L}(\lambda) = \mathbb{E}{q\lambda}[\log p(x, z)] - \mathbb{E}{q\lambda}[\log q_\lambda (z)]$$
Again, by varying the parameters
If we are using the variational method to learn an approximate posterior, our partition function is the evidence
This technique has been used in machine learning for the past two decades. It is becoming popular because intractable partition functions come with the need to analyze large datasets. Because the variational principle relies on optimizing a lower bound, the field has borrowed heavily from the optimization literature to scale Bayesian inference to massive data. It's an exciting area, as new techniques from stochastic optimization may enable us to explore new physics and machine learning models.
There are many techniques for approximating partition functions developed in the machine learning community that may find use in physics.
For example, black box variational inference and automatic differentiation variational inference are generic methods that may be useful in physics. They develop frameworks for constructing expressive approximate distributions and efficient optimization techniques.
Question for physicists familiar with variational methods: is stochastic optimization used in variational methods? Would this be useful?
Yes! The Gibbs-Bogoliubov-Feynman inequality was originally developed in physics and found its way to machine learning through Michael Jordan’s group at MIT in the 90s.
There seems to be a separate literature on constructing flexible families of distributions to approximate distributions. The replica trick, renormalization group theory, and others are just some topics that are beginning to make their way from statistical physics to machine learning.
Another example of tools from physics used in machine learning is operator variational inference. In this work, we developed a framework for constructing operators (such as the KL divergence) that measure how good an approximation is. The framework enables making explicit the tradeoffs between how good our approximation is and how much computation a variational method requires. The Langevin-Stein operator is equivalent to the Hamiltonian operator in physics (note) and was originally developed in a Physical Review Letters paper.
A fun question to ponder is "why KL divergence?" and the physics perspective is illuminating. It corresponds to the first-order Taylor expansion of the partition function and comes with assumptions about the non-equilibrium perturbed distribution. Does the second-order Taylor expansion correspond to another divergence and yield more accurate solutions?
I recently learned about replica theory. The replica trick is a technique for calculating the partition function of a system exactly, using an insane formula. It begs the question: what assumptions do we need to use this for probabilistic graphical models?
I’m excited to see more work in this area as physicists migrate to data science and machine learning.
How can we make transitions faster? How can we efficiently move techniques between machine learning and physics? Would code samples be helpful?
This post is an attempt at mapping the language from one community to another. Another idea is a long review paper that to give detailed examples of models solved within a statistical physics framework (with mean-field methods, replica theory, renormalization theory, etc) and solved with modern variational inference from a machine learning perspective (black box variational inference, stochastic optimization, etc). This would highlight how the fields complement each other.
- Expectations: the angle brackets $$\langle
\cdot\rangle$$ denote an expectation. In the machine learning literature, this is denoted as $$\mathbb{E}p[\cdot]$$ for the expectation of a quantity with respect to the distribution $$p$$. For example, $$\langle f(\vec{s}) \rangle$$ denotes an expectation of a function of the spins $$f(\vec{s})$$. The expectation is implicitly with respect to the Boltzmann distribution: $$\langle f(\vec{s}) \rangle = \mathbb{E}p[f(\vec{s})] = \sum{{s_1, ..., s_N}} f(\vec{s}) p(\vec{s})$$ $$= \sum{{s_1, ..., s_N}} f(\vec{s})=\frac{e^{-\beta H(\vec{s})}}{Z}$$ - Spins in physics are called random variables in statistics and machine learning.
- The evidence lower bound in variational inference is the negative free energy in physics terminology.
Anything to add or fix in this article to reduce confusion and increase clarity? Please email me, tweet, or submit a pull request.
- Peterson & Anderson (1987) used solutions to time-dependent Ising models to learn the parameters of Boltzmann machines. This is a canonical reference for the ‘start’ of variational inference as it is known in the machine learning community.
- You can go deep into Ising models: there are hundreds of lectures and references on line. Here are the sources I used for these notes: from Basel and Munich.
- Dave’s course, Foundations of Graphical Models
- Wainwright & Jordan (2008) is challenging but worthwhile.
- David MacKay's Information Theory, Inference, and Learning Algorithms has a section on variational free energy (Chapter 33, p. 422).
- David Chandler's Introduction to Modern Statistical Mechanics (1987) has a simple derivation of the variational free energy (Section 5.1, pp. 135-138) that I followed in this exposition.
- Feynman, Statistical Mechanics - A set of lecture notes (1972) derives the variational free energy using a perturbation expansion (Section 2.11, pp. 67-71).
- Parisi's Statistical Field Theory (1988) derives the variational principle in three different ways (Section 3.2, pp. 24-31).
- Matthew Beal’s thesis has interesting references, and Rich Turner has notes on correspondences between physics and machine learning.
Thanks to Bohdan Kulchytskyy, Florian Wentzel, Siddharth Mishra-Sharma, Smiti Kaul, Guillaume Verdon, Henri Palacci, Sam Ritter, Mattias Fitzpatrick, and Sophie Kleber for comments and encouragement. Image credits: Freepik for iconography, and Analytical Scientific for the Newton's cradle image.
This blog post ended up seeding the first several chapters of my thesis.
Footnotes
-
For a tiny system, e.g. with three spins, we have $$8$$ states and the sum is doable - but the system is uninteresting. ↩
-
For example, the magnetization of dysprosium aluminium garnet at low temperatures is exactly described by this model. ↩
-
To see this, recall that $$\cosh x = \frac{e^x + e^{-x}}{2}$$ ↩
-
Visual proof that $$e^x \geq x + 1$$. ↩
-
The semicolon notation means "the distribution over $$ x $$ is parameterized in terms of the parameter $$ \pi $$". ↩
-
Writing the parameters of a distribution as a subscript ($$p_\theta(s)$$) is shorthand for writing them after the semicolon ($$p(s; \theta)$$). ↩
-
In the variational treatment of the Ising model we had one variational parameter, the perturbation to the static magnetic field $$\lambda = \Delta H$$. ↩