# Problem 1. Explain the concept of score matching

## 1. Concept of score matching
We usually want to learn a **distribution model** of the data: 
$$p(x;\theta)=\frac{1}{Z(\theta)}e^{-E(x;\theta)},$$
where $E(x;\theta)$ is the energy function and $Z(\theta)=\int e^{-E(x;\theta)}dx$ is the **partition function**.  

In maximum likelihood estimation (MLE), the goal is to maximize $$L(\theta)=\mathbb{E}_{q(x)}\ln p(x;\theta),$$ where $q(x)$ is the true data distribution.  

But for high-dimensional data, the integral defining $Z(\theta)$ is usually intractable.  

Hence, the goal of ***Score Matching (Hyvärinen, 2005)*** is **to estimate $E(x;\theta)$ without knowing $Z(\theta)$.**

### Definition
The **score function** of a probability distribution is defined as:

$$
\psi(x; \theta) = \nabla_x \log p(x; \theta).
$$

It points in the direction in data space along which the log-density
increases most rapidly; in other words, it is the **gradient field of the model's log-density**.
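
For the energy-based model above, a short worked step shows why the score does not involve the partition function. Taking the logarithm of $p(x;\theta)$ and differentiating with respect to $x$ (not $\theta$), the term $\log Z(\theta)$ drops out because it does not depend on $x$:

$$
\psi(x;\theta)
= \nabla_x \big[-E(x;\theta) - \log Z(\theta)\big]
= -\nabla_x E(x;\theta).
$$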

The true data distribution  $q(x)$ also has its own score:

$$
\nabla_x \log q(x).
$$

The core idea of score matching is simple:

> If the model’s score function equals the true data score function everywhere,  
> then the two log-densities differ only by an additive constant, and since both densities integrate to 1, the two distributions coincide.

This means that by learning the gradient field $\psi(x; \theta)$,  
we can indirectly capture the geometry of the data density  
without ever computing the intractable partition function $Z(\theta)$.
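
As a minimal numerical sketch of this point, the snippet below evaluates the score of a 1-D model directly from its energy with a central finite difference. The quadratic energy, the parameters `mu` and `s`, and the helpers `score_from_energy` and `shifted_energy` are illustrative assumptions, not part of the original method description.

```python
def energy(x, mu=1.0, s=2.0):
    """Hypothetical quadratic energy of an unnormalized 1-D Gaussian."""
    return (x - mu) ** 2 / (2.0 * s * s)


def score_from_energy(energy_fn, x, h=1e-5):
    """Approximate psi(x) = -dE/dx with a central finite difference."""
    return -(energy_fn(x + h) - energy_fn(x - h)) / (2.0 * h)


def shifted_energy(x):
    # Adding a constant to E rescales Z(theta) but leaves the score unchanged.
    return energy(x) + 7.0


if __name__ == "__main__":
    x = 3.0
    print(score_from_energy(energy, x))          # numerical score
    print(-(x - 1.0) / 2.0 ** 2)                 # analytic score of N(1, 2^2)
    print(score_from_energy(shifted_energy, x))  # same score, different Z
```

At no point does the code evaluate $Z(\theta)$; the three printed values should agree up to finite-difference error.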






## 2. Denoising Score Matching (DSM)

Vincent (2010) extended Hyvärinen’s score matching idea by introducing **denoising score matching (DSM)**.  
The intuition is simple: if a data point is corrupted by Gaussian noise, a good model should learn a vector field
that points from the noisy sample back toward the clean data.

Assume a clean sample $x$ is perturbed by Gaussian noise:
$$
\tilde{x} = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I).
$$
Then, the conditional distribution of the noisy sample is:
$$
q_\sigma(\tilde{x}\,|\,x)
= \frac{1}{(2\pi\sigma^2)^{d/2}}
  \exp\!\left(-\frac{\|\tilde{x}-x\|^2}{2\sigma^2}\right).
$$
The **denoising score matching objective** is:
$$
J_{\text{DSM}}(\theta)
= \mathbb{E}_{q_\sigma(x, \tilde{x})}
  \Bigg[
  \frac{1}{2}
  \Big\|
  \psi(\tilde{x};\theta)
  - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}\,|\,x)
  \Big\|^2
  \Bigg].
$$

Since
$$
\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}\,|\,x)
= \frac{1}{\sigma^2}(x - \tilde{x}),
$$
the objective becomes:
$$
J_{\text{DSM}}(\theta)
= \mathbb{E}_{q_\sigma(x, \tilde{x})}
  \Bigg[
  \frac{1}{2}
  \Big\|
  \psi(\tilde{x};\theta)
  - \frac{1}{\sigma^2}(x - \tilde{x})
  \Big\|^2
  \Bigg].
$$
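
Because $\tilde{x} = x + \epsilon$, the regression target can also be written in terms of the noise alone, $(x-\tilde{x})/\sigma^2 = -\epsilon/\sigma^2$. Substituting this gives an equivalent form of the same objective:

$$
J_{\text{DSM}}(\theta)
= \mathbb{E}_{x \sim q,\ \epsilon \sim \mathcal{N}(0,\sigma^2 I)}
  \Bigg[
  \frac{1}{2}
  \Big\|
  \psi(x+\epsilon;\theta)
  + \frac{\epsilon}{\sigma^2}
  \Big\|^2
  \Bigg].
$$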

Thus, the network learns to predict the direction from a noisy point $\tilde{x}$
back to the clean point $x$.
Vincent showed that, for a particular choice of energy function, this objective is **equivalent**
to training a **denoising autoencoder (DAE)** with squared reconstruction loss.
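
The DSM objective can be checked on a toy problem. The sketch below assumes 1-D data $x \sim \mathcal{N}(0,1)$ and a hypothetical one-parameter score model $\psi(\tilde{x}; a) = -a\,\tilde{x}$. Both choices, and the function name `fit_linear_score_dsm`, are illustrative assumptions. Because the model is linear in $a$, the DSM loss is minimized in closed form by least squares.

```python
import random


def fit_linear_score_dsm(n_samples=50000, sigma=0.5, seed=0):
    """Fit psi(x_noisy; a) = -a * x_noisy by minimizing the DSM loss.

    Assumes toy data x ~ N(0, 1); this setup is for illustration only.
    """
    rng = random.Random(seed)
    num = 0.0  # accumulates x_noisy * target
    den = 0.0  # accumulates x_noisy ** 2
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)              # clean sample
        x_noisy = x + rng.gauss(0.0, sigma)  # Gaussian-corrupted sample
        target = (x - x_noisy) / sigma ** 2  # grad of log q_sigma(x_noisy | x)
        num += x_noisy * target
        den += x_noisy * x_noisy
    # Closed-form minimizer of sum 0.5 * (-a * x_noisy - target) ** 2 over a.
    return -num / den


if __name__ == "__main__":
    print(fit_linear_score_dsm())
```

In this toy setup the noisy marginal is $\mathcal{N}(0, 1+\sigma^2)$, whose score is $-\tilde{x}/(1+\sigma^2)$, so the fitted `a` should approach $1/(1+\sigma^2)$ as the sample size grows. In other words, DSM recovers the score of the noise-perturbed distribution rather than of the clean one.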


---



## 3. Connection to Diffusion / Score-Based Generative Models

Modern **score-based generative models** (Song & Ermon, 2019, 2020)
extend DSM to multiple noise levels.
Instead of adding noise once, data are progressively perturbed
to create a sequence of noisy distributions $p_t(x)$, indexed by time $t$.

A neural network $s_\theta(x_t, t)$ is trained to approximate
the score of each noisy distribution:
$$
s_\theta(x_t, t) \approx \nabla_{x_t} \log p_t(x_t).
$$

Training uses a DSM loss averaged over all noise levels, with a positive weighting function $\lambda(t)$:
$$
\mathbb{E}_{t, x_0, \epsilon}
  \Big[
  \lambda(t)
  \big\|
  s_\theta(x_t, t)
  - \nabla_{x_t} \log p_t(x_t|x_0)
  \big\|^2
  \Big].
$$
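
A Monte Carlo estimate of this weighted loss can be sketched on a 1-D toy problem. The assumptions here are: data $x_0 \sim \mathcal{N}(0,1)$, perturbation $x_t = x_0 + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$ at a finite set of noise levels $\sigma$ standing in for $t$, and the weighting $\lambda = \sigma^2$ (the choice used by Song & Ermon, 2019). The function names are hypothetical.

```python
import random


def weighted_dsm_loss(score_fn, sigmas, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[lambda * (s(x_t, sigma) - target)^2] in 1-D.

    Assumptions: x_0 ~ N(0, 1), x_t = x_0 + eps, eps ~ N(0, sigma^2),
    and lambda = sigma^2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        sigma = rng.choice(sigmas)   # sample a noise level (the role of t)
        x0 = rng.gauss(0.0, 1.0)     # clean sample
        eps = rng.gauss(0.0, sigma)  # noise at this level
        xt = x0 + eps
        target = -eps / sigma ** 2   # grad of log p_t(x_t | x_0)
        total += sigma ** 2 * (score_fn(xt, sigma) - target) ** 2
    return total / n_samples


def exact_score(xt, sigma):
    # Score of the perturbed marginal N(0, 1 + sigma^2) for this toy data.
    return -xt / (1.0 + sigma ** 2)


def zero_score(xt, sigma):
    # Baseline that ignores its inputs.
    return 0.0


if __name__ == "__main__":
    levels = [0.1, 0.5, 1.0, 2.0]
    print(weighted_dsm_loss(exact_score, levels))
    print(weighted_dsm_loss(zero_score, levels))
```

The loss stays above zero even for the exact score, because the per-sample target is itself noisy; what matters is that the exact score attains a lower value than other candidates.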

Once trained, new samples are generated by **reverse diffusion**:
starting from pure Gaussian noise and integrating
a reverse-time stochastic differential equation (SDE)
driven by the learned score function $s_\theta(x_t, t)$.
Hence, the model learns the gradient of log-density
for all intermediate noise levels and can “walk back”
from noise to data.
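
The sampling procedure can be sketched in 1-D under explicit assumptions: a variance-exploding noise schedule with geometrically spaced levels $\sigma$, an Euler-Maruyama-style discretization of the reverse-time SDE, and toy data $\mathcal{N}(\mu, s^2)$ whose exact perturbed score stands in for a trained network $s_\theta$. The function names and default values below are illustrative, not taken from the cited papers.

```python
import math
import random


def perturbed_score(x, sigma, mu=1.0, s=0.5):
    # Exact score of N(mu, s^2 + sigma^2); stands in for a trained s_theta.
    return -(x - mu) / (s * s + sigma * sigma)


def reverse_diffusion_sample(rng, sigma_max=10.0, sigma_min=0.01, n_steps=500):
    """Draw one sample by stepping from sigma_max down to sigma_min."""
    ratio = (sigma_min / sigma_max) ** (1.0 / n_steps)
    x = rng.gauss(0.0, sigma_max)  # start from (almost) pure noise
    sigma = sigma_max
    for _ in range(n_steps):
        sigma_next = sigma * ratio
        delta = sigma ** 2 - sigma_next ** 2  # variance removed in this step
        # Drift along the score, plus freshly injected noise.
        x = x + delta * perturbed_score(x, sigma) \
            + math.sqrt(delta) * rng.gauss(0.0, 1.0)
        sigma = sigma_next
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [reverse_diffusion_sample(rng) for _ in range(2000)]
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((v - mean) ** 2 for v in samples) / len(samples))
    print(mean, std)  # should be close to mu = 1.0 and s = 0.5
```

With the exact score, the sample mean and standard deviation should land near the toy data's $\mu$ and $s$, up to discretization and Monte Carlo error.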

---



## 4. Summary

| Concept | Mathematical Formulation | Key Idea |
|----------|--------------------------|-----------|
| **Score Matching (SM)** | $\frac{1}{2}\lVert\psi - \nabla \log q\rVert^2$ | Learn the gradient (score) of the data density without computing normalization. |
| **Implicit Score Matching (ISM)** | $\frac{1}{2}\lVert\psi\rVert^2 + \nabla_x \cdot \psi$ | Remove dependence on the true density $q$ via integration by parts. |
| **Denoising Score Matching (DSM)** | $\frac{1}{2}\lVert\psi(\tilde{x}) - (x-\tilde{x})/\sigma^2\rVert^2$ | Learn to move noisy samples back to clean data (equivalent to a DAE). |
| **Diffusion Models** | $s_\theta(x_t,t) \approx \nabla_{x_t}\log p_t(x_t)$ | Extend DSM to a continuous noise schedule; reverse diffusion generates samples. |

---


> *Reference: Vincent, P. (2010). “A Connection Between Score Matching and Denoising Autoencoders.”*
> *Technical Report 1358, Université de Montréal.*  
> **Remark:** The original English phrasing and order were revised with assistance from ChatGPT.