### Maximum likelihood estimator

Our goal is to find the set of parameters $J_{i, j}(a, b)$ for all $i, j \in \{1, \dots, N\}$, $a, b \in \{1, \dots, q\}$ that maximises the likelihood
\begin{align*}
    \mathcal{L}\left(\boldsymbol{J}; \left\{\boldsymbol{x}^{(m)}\right\}_{m = 1}^M\right) & \coloneqq \mathbb{P}\left(\left\{\boldsymbol{x}^{(m)}\right\}_{m = 1}^M; \boldsymbol{J}\right)
    = \prod_{m = 1}^{M} \mathbb{P} \left(\boldsymbol{x}^{(m)}; \boldsymbol{J}\right) = \\
    & = \prod_{m = 1}^{M} \frac{1}{Z(\boldsymbol{J})} \exp\left(\sum_{i, j = 1}^N\sum_{a, b = 1}^q J_{i, j}(a, b) \ \delta_{x_i^{(m)}, a} \ \delta_{x_j^{(m)}, b}\right) = \\
    & = \frac{1}{Z(\boldsymbol{J})^M} \exp \left(\sum_{m = 1}^M \sum_{i, j = 1}^N \sum_{a, b = 1}^q J_{i, j}(a, b) \ \delta_{x_i^{(m)}, a}\ \delta_{x_j^{(m)}, b}\right),
\end{align*}
where
\begin{equation*}
    Z(\boldsymbol{J}) = \sum_{x_1, \dots, x_N = 1}^q \exp \left(\sum_{i, j = 1}^N \sum_{a, b = 1}^q J_{i, j}(a, b) \ \delta_{x_i, a} \ \delta_{x_j, b}\right)
\end{equation*}
is the normalization constant.
To this end, we take the logarithm of the likelihood and divide by $M$ (neither operation changes the maximiser), obtaining
\begin{equation*}
    l\left(\boldsymbol{J}; \left\{\boldsymbol{x}^{(m)}\right\}_{m = 1}^M\right) \coloneqq \left(\frac{1}{M} \sum_{m=1}^M\sum_{i, j = 1}^N\sum_{a, b = 1}^q J_{i, j}(a, b) \ \delta_{x_i^{(m)}, a} \ \delta_{x_j^{(m)}, b}\right) - \log{Z(\boldsymbol{J})}.
\end{equation*}
Differentiating with respect to $J_{i, j}(a, b)$, for fixed $i, j, a, b$, we get
\begin{equation*}
    \frac{\partial l\left(\boldsymbol{J}; \left\{\boldsymbol{x}^{(m)}\right\}_{m = 1}^M\right)}{\partial J_{i,j}(a,b)} 
	= \left(\frac{1}{M} \sum_{m = 1}^M \delta_{x_i^{(m)}, a} \ \delta_{x_j^{(m)}, b}\right) - \frac{1}{Z(\boldsymbol{J})} \frac{\partial Z(\boldsymbol{J})}{\partial J_{i, j}(a, b)},
\end{equation*}
where
\begin{equation*}
    \frac{\partial Z(\boldsymbol{J})}{\partial J_{i, j}(a, b)} = \sum_{x_1, \dots, x_N = 1}^q \delta_{x_i,a} \ \delta_{x_j,b} \ \exp\left(\sum_{k, l = 1}^N\sum_{c, d = 1}^q J_{k, l}(c, d) \ \delta_{x_k, c} \ \delta_{x_l, d}\right).
\end{equation*}
Plugging into the previous expression, we find that
\begin{align*}
   \frac{\partial l\left(\boldsymbol{J}; \left\{\boldsymbol{x}^{(m)}\right\}_{m = 1}^M\right)}{\partial J_{i,j}(a,b)} & = \left(\frac{1}{M} \sum_{m = 1}^M \delta_{x_i^{(m)}, a} \ \delta_{x_j^{(m)}, b}\right) - \sum_{x_1, \dots, x_N = 1}^q \delta_{x_i,a} \ \delta_{x_j,b} \ \frac{1}{Z(\boldsymbol{J})} \exp\left(\sum_{k, l = 1}^N\sum_{c, d = 1}^q J_{k, l}(c, d) \ \delta_{x_k, c} \ \delta_{x_l, d}\right) = \\
   & = \langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{data}} - \langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{model}},
\end{align*}
where $\langle \cdot \rangle_{\rm{data}}$ denotes the empirical mean over the observations, $\langle \cdot \rangle_{\rm{model}}$ denotes the expectation under the model distribution $\mathbb{P}\left(\boldsymbol{x}; \boldsymbol{J}\right)$, and $\delta_{(x_i, x_j), (a, b)}$ is shorthand for the product of Kronecker deltas $\delta_{x_i, a} \ \delta_{x_j, b}$, that is,
\begin{equation*}
	\delta_{(x_i, x_j), (a, b)} \coloneqq 
	\begin{cases}
		1 & \text{if } x_i = a \text{ and } x_j = b \\
		0 & \text{otherwise}
	\end{cases}.
\end{equation*}
Since the logarithm is increasing, maximising $\mathcal{L}\left(\boldsymbol{J}; \left\{\boldsymbol{x}^{(m)}\right\}_{m = 1}^M\right)$ is equivalent to maximising $l$. Hence, to find the value of $\boldsymbol{J}$ that maximises the likelihood, we impose, for all $i, j, a, b$,
\begin{equation*}
	\frac{\partial l\left(\boldsymbol{J}; \left\{\boldsymbol{x}^{(m)}\right\}_{m = 1}^M\right)}{\partial J_{i,j}(a,b)} = 0 \iff \langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{data}} = \langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{model}}.
\end{equation*}
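
As a sanity check of the moment-matching condition, the two averages can be computed explicitly for a very small system. The following is a minimal sketch in plain Python, not a reference implementation: the function names, the nested-list layout `J[i][j][a][b]`, and the 0-based labelling of sites and states ($0, \dots, q - 1$ instead of $1, \dots, q$) are assumptions made for illustration. The model average is obtained by brute-force enumeration of all $q^N$ configurations, so it is only usable for tiny $N$.

```python
from itertools import product
from math import exp


def zeros(N, q):
    """Nested list f[i][j][a][b] filled with 0.0 (hypothetical layout)."""
    return [[[[0.0] * q for _ in range(q)] for _ in range(N)] for _ in range(N)]


def pair_freq_data(samples, N, q):
    """Empirical averages <delta_{(x_i, x_j), (a, b)}>_data over the samples."""
    f = zeros(N, q)
    for x in samples:
        for i in range(N):
            for j in range(N):
                f[i][j][x[i]][x[j]] += 1.0 / len(samples)
    return f


def pair_freq_model_exact(J, N, q):
    """Model averages by enumerating all q**N configurations (tiny N only)."""
    f = zeros(N, q)
    Z = 0.0
    for x in product(range(q), repeat=N):
        # Unnormalised weight exp(sum_{i,j} J_{i,j}(x_i, x_j)).
        w = exp(sum(J[i][j][x[i]][x[j]] for i in range(N) for j in range(N)))
        Z += w
        for i in range(N):
            for j in range(N):
                f[i][j][x[i]][x[j]] += w
    return [[[[f[i][j][a][b] / Z for b in range(q)] for a in range(q)]
             for j in range(N)] for i in range(N)]


def gradient(samples, J, N, q):
    """Gradient of the normalised log-likelihood: data average minus model average."""
    fd = pair_freq_data(samples, N, q)
    fm = pair_freq_model_exact(J, N, q)
    return [[[[fd[i][j][a][b] - fm[i][j][a][b] for b in range(q)]
              for a in range(q)] for j in range(N)] for i in range(N)]
```

For instance, with $N = q = 2$, the two observations `(0, 0)` and `(1, 1)`, and $\boldsymbol{J} = 0$, the model is uniform, so the pair $(a, b) = (0, 0)$ at sites $(0, 1)$ has data average $1/2$, model average $1/4$, and the corresponding gradient entry is $1/4$.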

### Boltzmann machine learning scheme

What we found can be exploited iteratively to estimate the coupling matrices through a gradient ascent algorithm (Boltzmann machine learning):
\begin{align*}
	& J_{i, j}^{0}(a, b) = 0, \\
	& J_{i, j}^{t + 1}(a, b) \leftarrow J_{i, j}^{t}(a, b) + \lambda \left[\langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{data}} - \langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{model}(t)}\right], \ \forall t \geq 0.
\end{align*}
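
The update rule above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions: `ascent_step`, `boltzmann_learning` and the argument `estimate_model` are hypothetical names, couplings and averages are stored as nested lists indexed as `[i][j][a][b]` with 0-based sites and states, and a single learning rate `lam` is shared by all parameters. `estimate_model` stands for any routine mapping the current couplings to an estimate of the model averages (exact enumeration for tiny systems, Monte Carlo sampling otherwise).

```python
def ascent_step(J, f_data, f_model, lam):
    """One gradient-ascent update: J + lam * (f_data - f_model), entry by entry."""
    N, q = len(J), len(J[0][0])
    return [[[[J[i][j][a][b] + lam * (f_data[i][j][a][b] - f_model[i][j][a][b])
               for b in range(q)] for a in range(q)]
             for j in range(N)] for i in range(N)]


def boltzmann_learning(f_data, estimate_model, N, q, lam, n_steps):
    """Run n_steps updates starting from J = 0.

    estimate_model is a hypothetical callable J -> model averages,
    with the same [i][j][a][b] layout as f_data.
    """
    J = [[[[0.0] * q for _ in range(q)] for _ in range(N)] for _ in range(N)]
    for _ in range(n_steps):
        J = ascent_step(J, f_data, estimate_model(J), lam)
    return J
```

Note that the iteration has a fixed point exactly where the data and model averages coincide, which is the stationarity condition derived in the previous section; in practice one would stop after a fixed number of steps or when the difference falls below a tolerance.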
<!-- question: in the inverse Ising problem the same learning parameter is used for all the parameters to be inferred; is it worth doing the same here? In principle one could set a different learning parameter for each J_{i, j}(a, b)... -->
At every step $t$ we need $\langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{model(t)}}$, whose exact computation requires a sum over all $q^N$ configurations and therefore costs $O(q^N)$. We bypass the problem by using a Metropolis-Hastings algorithm to sample from
\begin{equation*}
	\pi_t\left(\boldsymbol{x}\right) \coloneqq \frac{1}{Z(\boldsymbol{J}^t)} \exp\left(\sum_{i, j = 1}^N J_{i, j}^{t}(x_i, x_j)\right)
\end{equation*}
and then estimate $\langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{model(t)}}$ from the resulting samples:
- `set` an initial condition $\boldsymbol{x}^{(0)}$ (drawn at random from the $q^N$ possible configurations);
- `for` $s \in \{1, \dots, T_{\rm{burn-in}} + T_{\rm{tot}} \times T_{\rm{wait}}\}$:
	1. `draw` $\boldsymbol{x} \sim p\left(\boldsymbol{x}|\boldsymbol{x}^{(s - 1)}\right)$ with 
	\begin{align*}
		p\left(\boldsymbol{x}|\boldsymbol{x}^{(s - 1)}\right) = 
		\begin{cases}
			\frac{1}{2N} & \text{if } \boldsymbol{x} = \left(x^{(s - 1)}_1, \dots, x^{(s - 1)}_{i - 1}, \left(x^{(s - 1)}_{i} \pm 1\right) \text{ mod } q, x^{(s - 1)}_{i + 1}, \dots, x^{(s - 1)}_{N}\right) \text{ for some } i \in \{1, \dots, N\} \\
			0 & \text{otherwise}
		\end{cases};
	\end{align*}
	<!-- remark: here I used the simplest possible neighbourhood, maybe one can do better -->
	2. `compute` the acceptance ratio $a\left(\boldsymbol{x}|\boldsymbol{x}^{(s - 1)}\right)$:
	\begin{align*}
		a\left(\boldsymbol{x}|\boldsymbol{x}^{(s - 1)}\right) & = \min\left[1, \frac{p\left(\boldsymbol{x}^{(s - 1)}|\boldsymbol{x}\right) \pi_t\left(\boldsymbol{x}\right)}{p\left(\boldsymbol{x}|\boldsymbol{x}^{(s - 1)}\right) \pi_t\left(\boldsymbol{x}^{(s - 1)}\right)}\right] = \\
		& = \min\left[1, \mathbf{1}_A(\boldsymbol{x}) \exp\left(\sum_{i, j = 1}^N \left[J_{i, j}^{t}\left(x_i, x_j\right) - J_{i, j}^{t}\left(x^{(s - 1)}_i, x^{(s - 1)}_j\right)\right]\right)\right],
	\end{align*}
	where we adopt the convention $\frac{p\left(\boldsymbol{x}^{(s - 1)}|\boldsymbol{x}\right)}{p\left(\boldsymbol{x}|\boldsymbol{x}^{(s - 1)}\right)} = \mathbf{1}_A(\boldsymbol{x})$ with $A \coloneqq \left\{\boldsymbol{x} \,|\, p\left(\boldsymbol{x}|\boldsymbol{x}^{(s - 1)}\right) > 0\right\}$: the proposal is symmetric, so the ratio equals $1$ on $A$ (this notation serves a purely formal purpose).
	Now assuming 
	\begin{equation*}
		\boldsymbol{x} = \left(x^{(s - 1)}_1, \dots, x^{(s - 1)}_{k - 1}, \left(x^{(s - 1)}_{k} \pm 1\right) \text{ mod } q, x^{(s - 1)}_{k + 1}, \dots, x^{(s - 1)}_{N}\right),
	\end{equation*}
	we have
	\begin{align*}
		a\left(\boldsymbol{x}|\boldsymbol{x}^{(s - 1)}\right) = \min\Bigg[1, & \exp\Bigg(\sum_{i \neq k} \left[J_{i, k}^{t}\left(x^{(s - 1)}_i, \left(x^{(s - 1)}_{k} \pm 1\right) \text{ mod } q\right) - J_{i, k}^{t}\left(x^{(s - 1)}_i, x^{(s - 1)}_k\right)\right] + \\
		& + \sum_{j \neq k} \left[J_{k, j}^{t}\left(\left(x^{(s - 1)}_{k} \pm 1\right) \text{ mod } q, x^{(s - 1)}_j\right) - J_{k, j}^{t}\left(x^{(s - 1)}_k, x^{(s - 1)}_j\right)\right] + \\
		& + \left(J_{k, k}^{t}\left(\left(x^{(s - 1)}_{k} \pm 1\right) \text{ mod } q, \left(x^{(s - 1)}_{k} \pm 1\right) \text{ mod } q\right) - J_{k, k}^{t}\left(x^{(s - 1)}_k, x^{(s - 1)}_k\right)\right) \Bigg)\Bigg].
	\end{align*}
	<!-- question: I do not think I made calculation errors and I do not think this beast simplifies, so I believe some symmetries have to be assumed, such as J_{i, j} = J_{j, i} and also each J_{i, j} being itself symmetric, but honestly I would not know what to assume -->
	3. `draw` $u \sim U[0,1)$ (with the command `rand()`);
	4. `set`
	\begin{equation*}
		\boldsymbol{x}^{(s)} \coloneqq 
		\begin{cases}
			\boldsymbol{x} & \text{if } u \leq a \\
			\boldsymbol{x}^{(s - 1)} & \text{otherwise}
		\end{cases};
	\end{equation*}
- `estimate` $\langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{model(t)}}$ as the average over the $T_{\rm{tot}}$ configurations retained after discarding the burn-in and keeping one configuration every $T_{\rm{wait}}$ steps:
\begin{equation*}
	\langle \delta_{(x_i, x_j), (a, b)} \rangle_{\rm{model(t)}} \approx \frac{1}{T_{\rm{tot}}} \sum_{n = 1}^{T_{\rm{tot}}} \delta_{(x^{(s_n)}_i, x^{(s_n)}_j), (a, b)}, \qquad s_n \coloneqq T_{\rm{burn-in}} + n \, T_{\rm{wait}}.
\end{equation*}
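
The sampling scheme above can be sketched as follows. This is a minimal illustration in plain Python under several assumptions: the names `delta_energy` and `metropolis_pair_freq` are hypothetical, couplings are stored as nested lists `J[i][j][a][b]`, and sites and states are labelled from $0$ (states $0, \dots, q - 1$), which makes the $\pm 1 \text{ mod } q$ move stay inside the state space. Only the terms involving the updated site $k$ are recomputed, so each acceptance ratio costs $O(N)$ rather than $O(N^2)$.

```python
import random
from math import exp


def delta_energy(J, x, k, new):
    """Change of sum_{i,j} J[i][j][x_i][x_j] when site k moves from x[k] to new."""
    old = x[k]
    d = J[k][k][new][new] - J[k][k][old][old]
    for i in range(len(x)):
        if i != k:
            d += J[i][k][x[i]][new] - J[i][k][x[i]][old]  # terms with j = k
            d += J[k][i][new][x[i]] - J[k][i][old][x[i]]  # terms with i = k
    return d


def metropolis_pair_freq(J, N, q, t_burn, t_tot, t_wait, rng):
    """Estimate the model averages f[i][j][a][b] with single-site +-1 moves."""
    x = [rng.randrange(q) for _ in range(N)]  # random initial configuration
    f = [[[[0.0] * q for _ in range(q)] for _ in range(N)] for _ in range(N)]
    for s in range(1, t_burn + t_tot * t_wait + 1):
        k = rng.randrange(N)                       # site to update
        new = (x[k] + rng.choice((-1, 1))) % q     # proposed state
        d = delta_energy(J, x, k, new)
        # Accept with probability min(1, exp(d)); d >= 0 is always accepted.
        if d >= 0 or rng.random() <= exp(d):
            x[k] = new
        # After the burn-in, keep one configuration every t_wait steps.
        if s > t_burn and (s - t_burn) % t_wait == 0:
            for i in range(N):
                for j in range(N):
                    f[i][j][x[i]][x[j]] += 1.0 / t_tot
    return f
```

A call such as `metropolis_pair_freq(J, N, q, 1000, 5000, 10, random.Random(0))` returns a stochastic estimate of the model averages; the values of the three time parameters are placeholders and would need tuning against the autocorrelation of the chain.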