# Machine learning the phase transition in the 2D Ising model

*Authors: Enze Chen (University of California, Berkeley)*

![ML model](https://raw.githubusercontent.com/enze-chen/learning_modules/master/fig/ML_Ising_schematic.png)

This notebook teaches you how to use machine learning (ML) models to learn the phase transition in the 2D ferromagnetic Ising model on a square lattice. In particular we'll use **logistic regression** and a **feed-forward neural network** (FFNN). This notebook will cover the entire pipeline, including:
1. Generating the training data using Monte Carlo (MC).
1. Constructing and training the ML models. 
1. Making predictions and visualizations with the ML models.

I have tried to keep the code simple and explanations plentiful so that someone who is comfortable with Python and computational MSE can understand everything. While the *implementation* of the code isn't necessarily hard thanks to several wonderful Python libraries, the *theory* behind it—particularly the ML portions—can be a little challenging.

## How to run this notebook

If you are viewing this notebook on Google Colaboratory, then all the software is already set up for you (hooray). If you want to run the notebook locally, make sure all the Python libraries in the [`requirements.txt`](https://github.com/enze-chen/learning_modules/blob/master/requirements.txt) file are installed.

For pedagogical reasons, there are a few sections for you to complete the code in order to run the simulation. These are delineated with the dashed lines as follows, and you should **only change what's inside**. You don't have to edit the text or code anywhere else. I've also included "**TODO**" to separate the background context from the actual instructions.
```python
# ---------------------- #
# YOUR CODE HERE

# ---------------------- #
```
If you edit the code in a cell, just press `Shift+Enter` to run it again. You have to execute **all** the code cells in this notebook from top to bottom (so don't skip around). A number `[#]` will appear to the left of the code cell once it's done executing. When everything has run successfully, you should see plots of the predicted phase probabilities as a function of $T$.

## Acknowledgements

This notebook was inspired by the recent work of [Carrasquilla, J. and Melko, R.G. *Nature Physics*, **13**, 2017](https://www.nature.com/articles/nphys4035) and [Mehta et al. *arXiv:1803.08823*, 2018](https://arxiv.org/abs/1803.08823). I also drew inspiration from [Carsten Bauer's tutorial in Julia](https://juliaphysics.github.io/PhysicsTutorials.jl/tutorials/machine_learning/ml_ising/ml_ising.html). I also thank my advisor [Prof. Mark Asta](https://mse.berkeley.edu/people_new/asta/) for encouraging me in my education-related pursuits. An interactive version of this notebook can be found online at [Google Colaboratory](https://colab.research.google.com/github/enze-chen/learning_modules/blob/master/mse/Machine_learning_Ising_model.ipynb). 

## Important equations

### Ising model 

I assume that you're familiar with the Ising model and how to simulate the phase transition using Monte Carlo. If not, you can check out [my other notebook](https://github.com/enze-chen/learning_modules/blob/master/mse/Monte_Carlo_Ising_model.ipynb) or countless other resources, such as the textbooks by [Newman and Barkema](https://global.oup.com/academic/product/monte-carlo-methods-in-statistical-physics-9780198517979?cc=us&lang=en&) or [Landau and Binder](https://www.cambridge.org/core/books/guide-to-monte-carlo-simulations-in-statistical-physics/2522172663AF92943C625056C14F6055).

The most important takeaway is that in the 2D Ising model on a square lattice, there is a critical temperature for a magnetic phase transition that has a theoretical value (in units where $J = k_B = 1$) of:

$$ T_c = \frac{2J}{k_B \ln \left( 1 + \sqrt{2} \right)} \approx 2.269 $$

This phase transition can be simulated fairly accurately using Monte Carlo in the limit of large system sizes and accurate sampling. 
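As a quick check of the number above, the expression can be evaluated directly. This is a minimal sketch that assumes units where $J = k_B = 1$; the variable names are only illustrative.

```python
import math

# Exact critical temperature of the 2D square-lattice Ising model,
# evaluated in units where J = k_B = 1 (an assumption of this sketch).
J = 1.0
k_B = 1.0
T_c = 2 * J / (k_B * math.log(1 + math.sqrt(2)))
print(f"T_c = {T_c:.4f}")
```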

Therefore, we can then ask ourselves the following question: **Is it possible to use ML to predict the phase of the system purely from the raw spin configurations?** The goal of this notebook is to show you that the answer is "yes."

First, we will use a standard Metropolis-Hastings MC algorithm to generate some training data. Then we will split the data into ordered ($T \ll T_c$), disordered ($T \gg T_c$), and critical ($T \approx T_c$) subsets based on the temperature used in the simulation. We'll use most of the ordered and disordered data for training and everything else for testing.

### Machine learning

![Sigmoid](https://raw.githubusercontent.com/enze-chen/learning_modules/master/fig/sigmoid.png)

The first ML model we will try is [**logistic regression**](https://en.wikipedia.org/wiki/Logistic_regression), which is one of the simplest [linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) for binary classification. It is parameterized by a set of weights $\vec{w}$ and outputs a probability $g \in (0, 1)$ according to the **logistic function** (also known as the **sigmoid function**), which looks like the image above and is parameterized as follows:

$$ g(\vec{x}; \vec{w}) = \dfrac{1}{1 + \exp \left( -\vec{w}^{\top} \vec{x} \right)} $$

The predicted label is then $0$ if $g < 0.5$ and $1$ otherwise. Training the logistic regression model involves supplying labeled data with labels in the set $\{0, 1\}$ and optimizing the parameters $\vec{w}$.
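To make the formula concrete, here is a minimal NumPy sketch of the logistic function and the $0.5$ threshold. The weights `w` and input `x` are hypothetical values chosen only for illustration, not trained parameters.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and inputs, chosen only for illustration
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 1.0, -1.0])

g = sigmoid(w @ x)       # probability assigned to label 1
label = int(g >= 0.5)    # threshold at 0.5
print(f"g = {g:.4f}, predicted label = {label}")
```

Since $\vec{w}^{\top} \vec{x}$ is negative here, $g$ falls below $0.5$ and the predicted label is $0$.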

![Neuron math](https://raw.githubusercontent.com/enze-chen/learning_modules/master/fig/neuron_math.png)

The second ML model we will try is a [**feed-forward neural network**](https://en.wikipedia.org/wiki/Feedforward_neural_network) (also known as a **multilayer perceptron**), which is one of the simplest neural network (NN) models. An NN consists of layers of neurons (a single neuron is shown above) that pass information from one layer to the next according to two steps:

1. The inputs $\vec{x}^{[i-1]}$ from the previous layer $\ell_{i-1}$ are multiplied by a matrix of weights $W^{[i-1]}$ and summed with a vector of biases $\vec{b}^{[i-1]}$ to produce a vector $\vec{z}^{[i]}$ in the current layer $\ell_{i}$ according to:

$$ \vec{z}^{[i]} = W^{[i-1]} \vec{x}^{[i-1]} + \vec{b}^{[i-1]} $$

2. The result is then passed through a **non-linear** activation function $g(\cdot)$ in the current layer $\ell_{i}$ according to:

$$\vec{a}^{[i]} = g(\vec{z}^{[i]})$$

where $\vec{a}^{[i]}$ then becomes the input $\vec{x}^{[i]}$ for the next layer $\ell_{i+1}$. In this notebook, the output layer uses the sigmoid function so that the network returns a probability, while the hidden layer uses whichever activation is passed to the classifier (ReLU, $g(z) = \max(0, z)$, in the code further down). Note that the above image and the math contained within is for a *single neuron*, whereas the math presented in Markdown here is for a *layer of neurons*, and so the scalar/vector/matrix expressions are slightly different.

Training the NN model involves supplying the same labeled data and optimizing the parameters $W^{[\cdot]}$ and $\vec{b}^{[\cdot]}$ associated with **all the layers**. We'll provide more details on the specific NN structure (also known as the *NN architecture*) in the relevant sections below.
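The two steps above (affine transformation, then activation) can be sketched in a few lines of NumPy. This is only an illustration of the forward pass: the layer sizes, the random weights, and the function name `forward` are assumptions for the example, and no training is performed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, g=sigmoid):
    """Propagate the input x through each layer: a = g(W x + b)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b    # step 1: multiply by weights and add biases
        a = g(z)         # step 2: apply the non-linear activation
    return a

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 144, 20, 1   # illustrative sizes (144 = 12 x 12 spins)
weights = [0.1 * rng.normal(size=(n_hidden, n_in)),
           0.1 * rng.normal(size=(n_out, n_hidden))]
biases = [np.zeros(n_hidden), np.zeros(n_out)]

x = rng.choice([-1, 1], size=n_in)   # a random spin configuration
y = forward(x, weights, biases)
print(y.shape, y)
```

Each weight matrix has shape (neurons in this layer, neurons in the previous layer), which is what lets a single matrix product handle a whole layer at once.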

## Known issues

* As the code isn't heavily optimized, it will be slow if you run it for too many iterations or for too large of a system. Please be gentle. ❤


## Python library imports

These are all the required Python libraries. `sklearn` is the library name for the popular [scikit-learn](https://scikit-learn.org/stable/index.html) machine learning package in Python that we'll be using for convenience.

In [None]:
# General libraries
import os
import random

# Scientific computing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Machine learning libraries
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Set random seed
seed = 2020
random.seed(seed)
np.random.seed(seed)

## Generating the data with MC

We will use some standard MC code to generate the data for the 2D Ising model. The function is provided below.

In [None]:
Tc = 2 / (np.log(1 + np.sqrt(2)))
def mc_sweep(spins, beta):
    n = len(spins)
    for _ in range(n**2):
        i = np.random.randint(0, n)
        j = np.random.randint(0, n)
        nb_sum = spins[(i + 1) % n, j] + spins[(i - 1) % n, j] + \
                 spins[i, (j + 1) % n] + spins[i, (j - 1) % n]
        dE = 2 * spins[i, j] * nb_sum
        if np.random.random() < np.exp(-dE * beta):
            spins[i, j] *= -1
    return spins

def mc_run(Ts, L=8, eqsteps=2000, mcsteps=200, dt=100):
    data = []
    labels = []
    temps = []
    for T in Ts:
        spins = np.random.choice([1, -1], size=(L, L))
        beta = 1 / T
        for _ in range(eqsteps):
            mc_sweep(spins, beta)
        for i in range(mcsteps):
            mc_sweep(spins, beta)
            if i % dt == 0:
                temps.append(T)
                data.append(spins.flatten())
                if T < Tc:
                    labels.append(1)
                else:
                    labels.append(0)
        print(f'Finished generating data for T = {T}.')
    return (np.array(temps), np.array(data), np.array(labels))
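
One quick sanity check on generated configurations is the absolute magnetization per spin, $|m| = \left| \sum_i s_i \right| / N$, which should be close to $1$ well below $T_c$ and considerably smaller well above it. The sketch below assumes one flattened configuration per row, as in the `data` array returned by `mc_run()`. The helper name `abs_magnetization` is hypothetical and the small array is a hand-made stand-in, not simulation output.

```python
import numpy as np

def abs_magnetization(configs):
    """Absolute magnetization per spin for each row of flattened +/-1 spins."""
    configs = np.asarray(configs)
    return np.abs(configs.mean(axis=1))

# Hand-made stand-in data: one fully ordered row, one perfectly balanced row
toy = np.array([[1, 1, 1, 1],
                [1, -1, 1, -1]])
m = abs_magnetization(toy)
print(m)
```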

We'll choose a $12 \times 12$ lattice and sample every $500$ MC steps to balance statistics with computational speed, though these are just my *unproven heuristics*. This results in $1000$ samples for each $T$. But first, we check to see if a dataset already exists so that we don't have to run the previous function.

In [None]:
datapath = os.path.join('..', 'data', 'mc_data.csv')
labelpath = os.path.join('..', 'data', 'mc_labels.csv')
tempspath = os.path.join('..', 'data', 'mc_temps.csv')
dataurl = 'https://raw.githubusercontent.com/enze-chen/learning_modules/master/data/mc_data.csv'
labelurl = 'https://raw.githubusercontent.com/enze-chen/learning_modules/master/data/mc_labels.csv'
tempsurl = 'https://raw.githubusercontent.com/enze-chen/learning_modules/master/data/mc_temps.csv'

if "I don't want to wait forever,":
    try:
        temps = np.loadtxt(tempspath, delimiter=',')
        data = np.loadtxt(datapath, delimiter=',')
        labels = np.loadtxt(labelpath, delimiter=',')
        print('Loading data from disk...')
    except OSError:  # local files missing; fall back to the hosted copies
        data = pd.read_csv(dataurl, header=None).to_numpy()
        labels = pd.read_csv(labelurl, header=None).to_numpy().ravel()
        temps = pd.read_csv(tempsurl, header=None).to_numpy().ravel()
        print('Loading data from online...')
    print(f'Found existing data for {data.shape[0]} examples, ' + \
          f'{data.shape[1]} features, and {len(np.unique(temps))} temperatures.')
else:
    Ts = np.linspace(1.25, 3.25, 9)
    temps, data, labels = mc_run(Ts=Ts, L=12, eqsteps=5000, mcsteps=500000, dt=500)
    np.savetxt(tempspath, temps, delimiter=',')
    np.savetxt(datapath, data, delimiter=',')
    np.savetxt(labelpath, labels, delimiter=',')

## Splitting the data into training, validation, and test sets

Now we have to split the data. First we'll choose to **stratify the data based on** $T$ to create a harder test case by grouping together the examples with $2.0 \le T \le 2.5$. These might easily get misclassified since they lie close to the transition temperature.

Then we'll use Scikit-learn's [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to create our training and validation data sets from the **remaining data**, which are more clearly ordered ($T < 2.0$) or disordered ($T > 2.5$).

In [None]:
ord_ind = np.where(temps < 2.0)[0]
dis_ind = np.where(temps > 2.5)[0]
safe_ind = np.concatenate((ord_ind, dis_ind), axis=0)
crit_ind = np.where((temps >= 2.0) & (temps <= 2.5))[0]

X_data = data[safe_ind, :]
y_data = labels[safe_ind]
X_test = data[crit_ind, :]
y_test = labels[crit_ind]
indices = np.arange(len(safe_ind))

X_train, X_val, y_train, y_val, ind_train, ind_val = \
    train_test_split(X_data, y_data, indices, test_size=0.3, shuffle=True)

print(f'There are {X_train.shape[0]} training examples.')
print(f'There are {X_val.shape[0]} validation examples.')
print(f'There are {X_test.shape[0]} test examples.')
print(f'There are {X_train.shape[1]} features (spins).')
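
Because `train_test_split()` shuffles at random, it can be worth checking that both classes are represented in each subset. Here is a minimal sketch on toy stand-in arrays. Passing `stratify=y` (an option not used in the cell above) asks scikit-learn to preserve the class proportions in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: 6 "ordered" (1) and 4 "disordered" (0) examples
X = np.arange(20).reshape(10, 2)
y = np.array([1] * 6 + [0] * 4)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.3, shuffle=True, stratify=y, random_state=0)

# Count how many examples of each label ended up in each subset
train_counts = np.bincount(y_tr, minlength=2)
val_counts = np.bincount(y_va, minlength=2)
print("train counts (label 0, label 1):", train_counts)
print("val counts   (label 0, label 1):", val_counts)
```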

## Train an ML model

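Now that we have the data prepared, it's time to begin training! We'll start with the `DummyClassifier`, a baseline that ignores the input spins and makes predictions from the distribution of training labels alone. Any model that has actually learned something should beat it.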

In [None]:
# Initialize and train
dummy_clf = DummyClassifier(random_state=seed)
dummy_clf.fit(X_train, y_train)

# Print accuracy of predictions
print(f'The accuracy on the training set is {dummy_clf.score(X_train, y_train):.4f}')
print(f'The accuracy on the validation set is {dummy_clf.score(X_val, y_val):.4f}')
print(f'The accuracy on the test set is {dummy_clf.score(X_test, y_test):.4f}')

In [None]:
ordered = []
ordered_err = []
disordered = []
disordered_err = []

Ts = np.unique(temps)
for T in Ts:
    ind = (temps == T)
    probs = dummy_clf.predict_proba(data[ind, :])
    means = np.mean(probs, axis=0)
    stds = np.std(probs, axis=0)
    disordered.append(means[0])
    ordered.append(means[1])
    disordered_err.append(stds[0])
    ordered_err.append(stds[1])

Next we plot the probabilities for both states as a function of temperature.

In [None]:
plt.rcParams.update({'figure.figsize':(7,5), 'lines.linewidth':5, \
                     'axes.linewidth':2, 'lines.markersize':10, 'font.size':16})
fig, ax = plt.subplots()
ax.axvline(x=Tc, ymin=0, ymax=1, ls='--', c='C2', alpha=0.3)  # mark the exact T_c defined earlier
ax.errorbar(Ts, ordered, ordered_err, color='C0', fmt='-o', \
            capsize=8, capthick=3, elinewidth=3, ecolor='#c0c0c0dd', label='ordered')
ax.errorbar(Ts, disordered, disordered_err, color='C1', fmt='-o', \
            capsize=8, capthick=3, elinewidth=3, ecolor='#c0c0c0dd', label='disordered')
ax.set_xlim(min(Ts) - 0.1, max(Ts) + 0.1)
ax.set_ylim(0, 1)
ax.set_xlabel('Temperature')
ax.set_ylabel('Probability')
ax.set_title('Dummy classifier')
plt.legend()
plt.show()

### Logistic regression

Let's start with Scikit-learn's [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) classifier. Please see the help page for more information on its input arguments.
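* `random_state` seeds the pseudo-random number generator used to shuffle the data (relevant for the `liblinear`, `sag`, and `saga` solvers), which makes the results reproducible.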
* The `solver` argument specifies the optimization routine. `lbfgs` is the default, but `liblinear` works well for small datasets.
* `max_iter` describes the maximum number of iterations the solver will take to obtain convergence. `100` is the default.

In [None]:
# Initialize and train
lr_clf = LogisticRegression(random_state=seed, solver='liblinear', max_iter=1000)  # max_iter must be an integer
lr_clf.fit(X_train, y_train)

# Print accuracy of predictions
print(f'The accuracy on the training set is {lr_clf.score(X_train, y_train):.4f}')
print(f'The accuracy on the validation set is {lr_clf.score(X_val, y_val):.4f}')
print(f'The accuracy on the test set is {lr_clf.score(X_test, y_test):.4f}')

Our scores look promising! But oftentimes reporting just an accuracy isn't sufficient because it's not clear if the model has actually *learned* anything. For example, some of the results might just be due to data imbalance and the classifier predicting the majority label every time.
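
To make the imbalance concern concrete, here is a small sketch with made-up labels (not results from this notebook). A "classifier" that always predicts the majority label still reaches a high accuracy, while the confusion matrix and the balanced accuracy from `sklearn.metrics` reveal that it never identifies the minority class.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Made-up labels: 8 disordered (0) and 2 ordered (1) examples
y_true = np.array([0] * 8 + [1] * 2)
# A "classifier" that always predicts the majority label
y_pred = np.zeros_like(y_true)

acc = np.mean(y_true == y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)   # mean of per-class recall
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(f"accuracy          = {acc:.2f}")
print(f"balanced accuracy = {bal_acc:.2f}")
print(cm)   # rows: true label, columns: predicted label
```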

We'd like some more solid statistics and visualizations. Motivated by the work of Carrasquilla and Melko, let's create a plot of the classifier's predictions (with error bars) as a function of temperature.

In [None]:
ordered = []
ordered_err = []
disordered = []
disordered_err = []

Ts = np.unique(temps)
for T in Ts:
    ind = (temps == T)
    probs = lr_clf.predict_proba(data[ind, :])
    means = np.mean(probs, axis=0)
    stds = np.std(probs, axis=0)
    disordered.append(means[0])
    ordered.append(means[1])
    disordered_err.append(stds[0])
    ordered_err.append(stds[1])

Next we plot the probabilities for both states as a function of temperature.

In [None]:
plt.rcParams.update({'figure.figsize':(7,5), 'lines.linewidth':5, \
                     'axes.linewidth':2, 'lines.markersize':10, 'font.size':16})
fig, ax = plt.subplots()
ax.axvline(x=Tc, ymin=0, ymax=1, ls='--', c='C2', alpha=0.3)  # mark the exact T_c defined earlier
ax.errorbar(Ts, ordered, ordered_err, color='C0', fmt='-o', \
            capsize=8, capthick=3, elinewidth=3, ecolor='#c0c0c0dd', label='ordered')
ax.errorbar(Ts, disordered, disordered_err, color='C1', fmt='-o', \
            capsize=8, capthick=3, elinewidth=3, ecolor='#c0c0c0dd', label='disordered')
ax.set_xlim(min(Ts) - 0.1, max(Ts) + 0.1)
ax.set_ylim(0, 1)
ax.set_xlabel('Temperature')
ax.set_ylabel('Probability')
ax.set_title('Logistic regression')
plt.legend()
plt.show()
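
The per-temperature averaging loop above is the same for every classifier, so it could be wrapped in a small helper. This is a minimal sketch: the function name `probs_by_temperature` and the toy data are hypothetical and only serve to make the example self-contained.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probs_by_temperature(clf, data, temps):
    """Mean and std of clf.predict_proba over the examples at each temperature.

    Column k of the returned means/stds corresponds to clf.classes_[k].
    """
    Ts = np.unique(temps)
    means, stds = [], []
    for T in Ts:
        probs = clf.predict_proba(data[temps == T])
        means.append(probs.mean(axis=0))
        stds.append(probs.std(axis=0))
    return Ts, np.array(means), np.array(stds)

# Toy stand-in data (not MC output): all-up rows at T=1, random rows at T=3
rng = np.random.default_rng(0)
toy_data = np.vstack([np.ones((20, 16)), rng.choice([-1, 1], size=(20, 16))])
toy_temps = np.array([1.0] * 20 + [3.0] * 20)
toy_labels = (toy_temps < 2.0).astype(int)

clf = LogisticRegression(max_iter=1000).fit(toy_data, toy_labels)
Ts_toy, means, stds = probs_by_temperature(clf, toy_data, toy_temps)
print(Ts_toy)
print(means)   # column 0: disordered, column 1: ordered
```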

### Neural network

Next we will try to use an NN model, [also from Scikit-learn](https://scikit-learn.org/stable/modules/neural_networks_supervised.html). The more popular NN libraries like [TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), and [Keras](https://keras.io/) are good to know, but they require a bit more setup than I would like, so I opted not to use them for the sake of the exercise.


In [None]:
# Initialize and train
mlp_clf = MLPClassifier(hidden_layer_sizes=(20,), activation='relu', \
                        solver='lbfgs', random_state=seed)
mlp_clf.fit(X_train, y_train)

# Print accuracy of predictions
print(f'The accuracy on the training set is {mlp_clf.score(X_train, y_train):.4f}')
print(f'The accuracy on the validation set is {mlp_clf.score(X_val, y_val):.4f}')
print(f'The accuracy on the test set is {mlp_clf.score(X_test, y_test):.4f}')

As you can see, the structure of the code is nearly identical to the logistic regression case. This consistent interface is one of the design principles behind scikit-learn that makes it so user friendly. We'll therefore make the same plot of the predicted probabilities. In `MLPClassifier`, these come from the sigmoid output of the network, whose parameters are trained by minimizing the cross-entropy loss (also called the log-loss).
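
For reference, the binary cross-entropy for labels $y \in \{0, 1\}$ and predicted probabilities $p$ is the mean of $-\left[ y \ln p + (1 - y) \ln (1 - p) \right]$. Here is a minimal NumPy sketch with made-up probabilities. The function name `binary_cross_entropy` is illustrative and is not scikit-learn's internal implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean of -[y log p + (1 - y) log(1 - p)], with p clipped away from 0 and 1."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

y_true = [1, 1, 0, 0]
confident_right = [0.9, 0.9, 0.1, 0.1]   # made-up probabilities of label 1
confident_wrong = [0.1, 0.1, 0.9, 0.9]

loss_right = binary_cross_entropy(y_true, confident_right)
loss_wrong = binary_cross_entropy(y_true, confident_wrong)
print(f"confident and right: {loss_right:.4f}")
print(f"confident and wrong: {loss_wrong:.4f}")
```

The loss is small when the model is confident and correct, and it grows quickly when the model is confident and wrong, which is what pushes the predicted probabilities toward the true labels during training.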

In [None]:
ordered = []
ordered_err = []
disordered = []
disordered_err = []

Ts = np.unique(temps)
for T in Ts:
    ind = (temps == T)
    probs = mlp_clf.predict_proba(data[ind, :])
    means = np.mean(probs, axis=0)
    stds = np.std(probs, axis=0)
    disordered.append(means[0])
    ordered.append(means[1])
    disordered_err.append(stds[0])
    ordered_err.append(stds[1])

Next we plot the probabilities for both states as a function of temperature.

In [None]:
fig, ax = plt.subplots()
ax.axvline(x=Tc, ymin=0, ymax=1, ls='--', c='C2', alpha=0.3)  # mark the exact T_c defined earlier
ax.errorbar(Ts, ordered, ordered_err, color='C0', fmt='-o', \
            capsize=8, capthick=3, elinewidth=3, ecolor='#c0c0c0dd', label='ordered')
ax.errorbar(Ts, disordered, disordered_err, color='C1', fmt='-o', \
            capsize=8, capthick=3, elinewidth=3, ecolor='#c0c0c0dd', label='disordered')
ax.set_xlim(min(Ts) - 0.1, max(Ts) + 0.1)
ax.set_ylim(0, 1)
ax.set_xlabel('Temperature')
ax.set_ylabel('Probability')
ax.set_title('Multilayer perceptron')
plt.legend()
plt.show()
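
One way to read a transition temperature off a plot like this is to find where the mean "ordered" probability crosses $0.5$. Below is a minimal sketch using linear interpolation between neighboring temperatures. The function name `crossing_temperature` is hypothetical, and the probabilities are made up for illustration rather than taken from this notebook.

```python
import numpy as np

def crossing_temperature(Ts, p_ordered, level=0.5):
    """Linearly interpolate the first T where p_ordered drops through `level`."""
    Ts = np.asarray(Ts, dtype=float)
    p = np.asarray(p_ordered, dtype=float)
    for k in range(len(Ts) - 1):
        if p[k] >= level > p[k + 1]:
            frac = (p[k] - level) / (p[k] - p[k + 1])
            return Ts[k] + frac * (Ts[k + 1] - Ts[k])
    return None  # no crossing found

# Made-up probabilities for illustration only
Ts_demo = [1.5, 2.0, 2.5, 3.0]
p_demo = [0.99, 0.80, 0.20, 0.01]
T_star = crossing_temperature(Ts_demo, p_demo)
print(f"Estimated crossing at T = {T_star:.3f}")
```

With only a handful of temperatures the estimate is coarse, so a finer temperature grid near $T_c$ would be needed for anything quantitative.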

### Bonus: Unsupervised learning with t-SNE

The previous examples were all *supervised learning*. Now we'll try an *unsupervised learning* method called **t-SNE** ("tee-snee"), which is short for "t-distributed stochastic neighbor embedding." It was introduced by [van der Maaten, L. and Hinton, G. *Journal of Machine Learning Research*, **9**, 2008](http://www.jmlr.org/papers/v9/vandermaaten08a.html) for dimensionality reduction and visualization. We will once again use [the TSNE class](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) from scikit-learn.

As a word of warning, the results may or may not be good. In a way I'm just trying to share a (somewhat flashy) unsupervised learning technique; but more importantly I'm trying to reflect how an ML engineer might think about problems and the various aspects of your data that you should consider.

In [None]:
from sklearn.manifold import TSNE
tsne = TSNE(random_state=seed)
X_embed = tsne.fit_transform(data)

In [None]:
fig, ax = plt.subplots()
ax.scatter(X_embed[safe_ind, 0], X_embed[safe_ind, 1], label='training')
ax.scatter(X_embed[crit_ind, 0], X_embed[crit_ind, 1], label='test')
ax.tick_params(left=False, labelleft=False, bottom=False, labelbottom=False)
ax.legend()
plt.show()

...so you see how the "S" in t-SNE stands for "stochastic?" It turns out that as a result, we cannot predict what the output from t-SNE will look like! The t-SNE plot that I get on my computer *will differ* from the one you get on yours. For that reason, t-SNE plots always have to be taken with a grain of salt. However, I hope you're able to see distinct clusters appear in your data, corresponding to the ordered and disordered states; and possibly the states near $T_c$ as well.
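
The seed dependence can be seen directly by embedding the same data twice with different `random_state` values. This is a minimal sketch on toy stand-in data. Here `init='random'` and `perplexity=10` are choices made for this small example, not the settings used in the cell above.

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in data: two well-separated groups of +/-1 "spins" (not MC output)
rng = np.random.default_rng(0)
up = np.where(rng.random((30, 16)) < 0.8, 1, -1)
down = -np.where(rng.random((30, 16)) < 0.8, 1, -1)
toy = np.vstack([up, down]).astype(float)

# Same data, different seeds; init='random' makes the seed dependence explicit
emb_a = TSNE(perplexity=10, init='random', random_state=0).fit_transform(toy)
emb_b = TSNE(perplexity=10, init='random', random_state=1).fit_transform(toy)

print(emb_a.shape, emb_b.shape)
print("identical embeddings?", np.allclose(emb_a, emb_b))
```

The coordinates change from seed to seed, so only the grouping of points, not their absolute positions, should be interpreted.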

## Conclusion

I hope you learned how we can train various ML models on MC-generated data to predict the phases in the 2D Ising model. We covered both supervised and unsupervised learning, though admittedly this notebook was brief. If you have any remaining questions or ideas for this and other modules, please don't hesitate to reach out.


## Extensions

Not surprisingly, when it comes to any ML project, there are *tons* of little knobs you can adjust to fine-tune your model. Here are a few common suggestions:

* Regularization

* Optimizers

* Neural network structure

* Different ML algorithms: SVM, Random forest, kNN (slow!)
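
As a starting point for these experiments, here is a minimal sketch that swaps a few scikit-learn classifiers into the same fit/score pattern. The toy data, the regularization strength `C=0.1`, and the other hyperparameters are arbitrary illustrative choices, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy stand-in data (not MC output): mostly aligned rows (up or down) vs. random rows
rng = np.random.default_rng(0)
aligned = np.where(rng.random((100, 16)) < 0.9, 1, -1)
ordered = aligned * rng.choice([-1, 1], size=(100, 1))   # random overall sign per row
disordered = rng.choice([-1, 1], size=(100, 16))
X = np.vstack([ordered, disordered])
y = np.array([1] * 100 + [0] * 100)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    'logistic (C=0.1)': LogisticRegression(C=0.1, max_iter=1000),
    'SVM (RBF kernel)': SVC(kernel='rbf'),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_va, y_va)
    print(f'{name:>18s}: validation accuracy = {scores[name]:.3f}')
```

Because every scikit-learn classifier exposes the same `fit()` and `score()` methods, trying a new algorithm only requires adding one entry to the dictionary.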


## Answers

If you found yourself stuck at certain points, I provide some sample answers [here](https://github.com/enze-chen/learning_modules/blob/master/data/answers.md#Machine_learning_Ising_model).