# Word Embedding with Global Vectors (GloVe)
🏷️sec_glove
Word-word co-occurrences
within context windows
may carry rich semantic information.
For example, in a large corpus the word "solid" is more likely to co-occur with "ice" than with "steam", whereas the word "gas" probably co-occurs with "steam" more frequently than with "ice". Moreover,
global corpus statistics
of such co-occurrences
can be precomputed:
this can lead to more efficient training.
To leverage statistical
information in the entire corpus
for word embedding,
let us first revisit the skip-gram model in :numref:subsec_skip-gram, interpreting it using global corpus statistics such as co-occurrence counts.
## Skip-Gram with Global Corpus Statistics
🏷️subsec_skipgram-global
Denoting by $q_{ij}$ the conditional probability $P(w_j\mid w_i)$ of generating word $w_j$ given word $w_i$ in the skip-gram model, we have

$$q_{ij}=\frac{\exp(\mathbf{u}_j^\top \mathbf{v}_i)}{\sum_{k \in \mathcal{V}} \exp(\mathbf{u}_k^\top \mathbf{v}_i)},$$
where for any index $i$, vectors $\mathbf{v}_i$ and $\mathbf{u}_i$ represent word $w_i$ as the center word and the context word, respectively, and $\mathcal{V} = \{0, 1, \ldots, |\mathcal{V}|-1\}$ is the index set of the vocabulary.

Consider word $w_i$, which may occur multiple times in the corpus. In the entire corpus, all the context words wherever $w_i$ is taken as their center word form a *multiset* $\mathcal{C}_i$ of word indices that *allows for multiple instances of the same element*. For any element, its number of instances is called its *multiplicity*. For example, suppose that word $w_i$ occurs twice in the corpus and the context words taking $w_i$ as their center word in the two context windows are $w_k, w_j, w_m, w_k$ and $w_k, w_l, w_k, w_j$. Then the multiset $\mathcal{C}_i = \{j, j, k, k, k, k, l, m\}$, where the multiplicities of elements $j, k, l, m$ are 2, 4, 1, 1, respectively.
Now let us denote the multiplicity of element $j$ in multiset $\mathcal{C}_i$ as $x_{ij}$. This is the global co-occurrence count of word $w_j$ (as the context word) and word $w_i$ (as the center word) in the same context window over the entire corpus. Using such global corpus statistics, the loss function of the skip-gram model is equivalent to

$$-\sum_{i\in\mathcal{V}}\sum_{j\in\mathcal{V}} x_{ij} \log\,q_{ij}.$$
:eqlabel:eq_skipgram-x_ij
We further denote by $x_i$ the number of all the context words in the context windows where $w_i$ occurs as their center word, which is equivalent to $|\mathcal{C}_i|$. Letting $p_{ij}$ be the conditional probability $x_{ij}/x_i$ of generating context word $w_j$ given center word $w_i$, :eqref:eq_skipgram-x_ij can be rewritten as

$$-\sum_{i\in\mathcal{V}} x_i \sum_{j\in\mathcal{V}} p_{ij} \log\,q_{ij}.$$
:eqlabel:eq_skipgram-p_ij
In :eqref:eq_skipgram-p_ij, $-\sum_{j\in\mathcal{V}} p_{ij} \log\,q_{ij}$ calculates the cross-entropy between the conditional distribution $p_{ij}$ of global corpus statistics and the conditional distribution $q_{ij}$ of model predictions, and this cross-entropy is weighted by $x_i$. Minimizing the loss function in :eqref:eq_skipgram-p_ij will allow the predicted conditional distribution to get close to the conditional distribution from the global corpus statistics.
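As a concrete illustration of how the global statistics $x_{ij}$, $x_i$, and $p_{ij}$ are obtained, here is a minimal Python sketch that counts co-occurrences within a symmetric context window over a toy tokenized corpus. The corpus, window size, and function name are illustrative assumptions rather than code from this book.

```python
from collections import defaultdict

def count_cooccurrences(corpus, window_size=2):
    """Count x_ij: how often word j appears in the context window of center word i."""
    x = defaultdict(float)
    for sentence in corpus:
        for center_pos, center in enumerate(sentence):
            lo = max(0, center_pos - window_size)
            hi = min(len(sentence), center_pos + window_size + 1)
            for context_pos in range(lo, hi):
                if context_pos != center_pos:
                    x[(center, sentence[context_pos])] += 1
    return x

# Toy corpus: a list of tokenized sentences (purely illustrative).
corpus = [['ice', 'is', 'a', 'solid'], ['steam', 'is', 'a', 'gas']]
x = count_cooccurrences(corpus)

x_i = defaultdict(float)  # x_i = |C_i|: total number of context words of center word w_i
for (center, _), count in x.items():
    x_i[center] += count

p = {(i, j): count / x_i[i] for (i, j), count in x.items()}  # p_ij = x_ij / x_i
print(x[('ice', 'is')], p[('ice', 'is')])  # 1.0 0.5
```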
Though commonly used for measuring the distance between probability distributions, the cross-entropy loss function may not be a good choice here.
On the one hand, as we mentioned in :numref:sec_approx_train, properly normalizing $q_{ij}$ results in a sum over the entire vocabulary, which can be computationally expensive. On the other hand, a large number of rare events from a large corpus are often modeled by the cross-entropy loss with too much weight assigned to them.
## The GloVe Model

In view of this, the GloVe model makes three changes to the skip-gram model based on squared loss :cite:Pennington.Socher.Manning.2014:
- Use variables $p'_{ij}=x_{ij}$ and $q'_{ij}=\exp(\mathbf{u}_j^\top \mathbf{v}_i)$ that are not probability distributions, and take the logarithm of both, so the squared loss term is $\left(\log\,p'_{ij} - \log\,q'_{ij}\right)^2 = \left(\mathbf{u}_j^\top \mathbf{v}_i - \log\,x_{ij}\right)^2$.
- Add two scalar model parameters for each word $w_i$: the center word bias $b_i$ and the context word bias $c_i$.
- Replace the weight of each loss term with the weight function $h(x_{ij})$, where $h(x)$ is increasing with values in the interval $[0, 1]$.
Putting all things together, training GloVe amounts to minimizing the following loss function:
$$\sum_{i\in\mathcal{V}} \sum_{j\in\mathcal{V}} h(x_{ij}) \left(\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j - \log\,x_{ij}\right)^2.$$
:eqlabel:eq_glove-loss
For the weight function, a suggested choice is $h(x) = (x/c)^\alpha$ (e.g., $\alpha = 0.75$) if $x < c$ (e.g., $c = 100$), and $h(x) = 1$ otherwise. In this case, because $h(0)=0$, the squared loss term for any $x_{ij}=0$ can be omitted for computational efficiency. For example, when using minibatch stochastic gradient descent for training, at each iteration we randomly sample a minibatch of *non-zero* $x_{ij}$ to calculate gradients and update the model parameters. Note that these non-zero $x_{ij}$ are precomputed global corpus statistics; thus, the model is called GloVe for *Global Vectors*.
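The following small sketch, assuming the example constants $c=100$ and $\alpha=0.75$ mentioned above, shows the behavior of this weight function:

```python
def h(x, c=100, alpha=0.75):
    """Suggested GloVe weight: (x/c)^alpha below the cap c, and 1 otherwise."""
    return (x / c) ** alpha if x < c else 1.0

# Rare co-occurrences are down-weighted, frequent ones are capped at weight 1,
# and h(0) = 0 means zero-count pairs can be dropped from the loss entirely.
print(h(0), h(10), h(1000))  # 0.0, ~0.178, 1.0
```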
It should be emphasized that if word $w_i$ appears in the context window of word $w_j$, then the reverse is also true. Therefore, $x_{ij}=x_{ji}$. Unlike word2vec, which fits the asymmetric conditional probability $p_{ij}$, GloVe fits the symmetric $\log\,x_{ij}$. Therefore, the center word vector and the context word vector of any word are mathematically equivalent in the GloVe model. However, in practice, owing to different initialization values, the same word may still get different values in these two vectors after training: GloVe sums them up as the output vector.
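As a tiny sketch of this last point, with stand-in NumPy arrays playing the role of trained parameters (random values here, purely for illustration):

```python
import numpy as np

vocab_size, embed_dim = 5, 4
V = np.random.randn(vocab_size, embed_dim)  # center word vectors v_i (stand-in for trained values)
U = np.random.randn(vocab_size, embed_dim)  # context word vectors u_i (stand-in for trained values)
output_vectors = V + U  # the output vector of each word is the sum of its two vectors
```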
## Interpreting GloVe from the Ratio of Co-occurrence Probabilities

We can also interpret the GloVe model from another perspective. Using the same notation as in :numref:subsec_skipgram-global, let $p_{ij} \stackrel{\mathrm{def}}{=} P(w_j \mid w_i)$ be the conditional probability of generating the context word $w_j$ given $w_i$ as the center word in the corpus. :numref:tab_glove lists several co-occurrence probabilities given words "ice" and "steam" and their ratios based on statistics from a large corpus.
:Word-word co-occurrence probabilities and their ratios from a large corpus (adapted from Table 1 in :cite:Pennington.Socher.Manning.2014)

|$w_k$=|solid|gas|water|fashion|
|:--|:--|:--|:--|:--|
|$p_1=P(w_k\mid \text{ice})$|0.00019|0.000066|0.003|0.000017|
|$p_2=P(w_k\mid \text{steam})$|0.000022|0.00078|0.0022|0.000018|
|$p_1/p_2$|8.9|0.085|1.36|0.96|
🏷️tab_glove
We can observe the following from :numref:tab_glove:
- For a word $w_k$ that is related to "ice" but unrelated to "steam", such as $w_k=\text{solid}$, we expect a larger ratio of co-occurrence probabilities, such as 8.9.
- For a word $w_k$ that is related to "steam" but unrelated to "ice", such as $w_k=\text{gas}$, we expect a smaller ratio of co-occurrence probabilities, such as 0.085.
- For a word $w_k$ that is related to both "ice" and "steam", such as $w_k=\text{water}$, we expect a ratio of co-occurrence probabilities that is close to 1, such as 1.36.
- For a word $w_k$ that is unrelated to both "ice" and "steam", such as $w_k=\text{fashion}$, we expect a ratio of co-occurrence probabilities that is close to 1, such as 0.96.
It can be seen that the ratio
of co-occurrence probabilities
can intuitively express
the relationship between words.
Thus, we can design a function
of three word vectors
to fit this ratio.
For the ratio of co-occurrence probabilities ${p_{ij}}/{p_{ik}}$, with $w_i$ being the center word and $w_j$ and $w_k$ being the context words, we want to fit this ratio using some function $f$:

$$f(\mathbf{u}_j, \mathbf{u}_k, {\mathbf{v}}_i) \approx \frac{p_{ij}}{p_{ik}}.$$
:eqlabel:eq_glove-f
Among many possible designs for $f$ in :eqref:eq_glove-f, we only pick a reasonable choice in the following. Since the ratio of co-occurrence probabilities is a scalar, we require that $f$ be a scalar function, such as $f(\mathbf{u}_j, \mathbf{u}_k, {\mathbf{v}}_i) = f\left((\mathbf{u}_j - \mathbf{u}_k)^\top {\mathbf{v}}_i\right)$. Switching word indices $j$ and $k$ in :eqref:eq_glove-f, it must hold that $f(x)f(-x)=1$, so one possibility is $f(x)=\exp(x)$, i.e.,

$$f(\mathbf{u}_j, \mathbf{u}_k, {\mathbf{v}}_i) = \frac{\exp\left(\mathbf{u}_j^\top {\mathbf{v}}_i\right)}{\exp\left(\mathbf{u}_k^\top {\mathbf{v}}_i\right)} \approx \frac{p_{ij}}{p_{ik}}.$$
Now let us pick $\exp\left(\mathbf{u}_j^\top {\mathbf{v}}_i\right) \approx \alpha p_{ij}$, where $\alpha$ is a constant. Since $p_{ij}=x_{ij}/x_i$, after taking the logarithm on both sides we get $\mathbf{u}_j^\top {\mathbf{v}}_i \approx \log\,\alpha + \log\,x_{ij} - \log\,x_i$.
We may use additional bias terms to fit $- \log\,\alpha + \log\,x_i$, such as the center word bias $b_i$ and the context word bias $c_j$:

$$\mathbf{u}_j^\top \mathbf{v}_i + b_i + c_j \approx \log\,x_{ij}.$$
:eqlabel:eq_glove-square
Measuring the weighted squared error of :eqref:eq_glove-square, we obtain the GloVe loss function in :eqref:eq_glove-loss.
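To make the loss in :eqref:eq_glove-loss concrete, here is a minimal, self-contained PyTorch sketch (not the reference implementation from the GloVe paper) that minimizes it over the nonzero entries of a toy co-occurrence matrix; the matrix values, embedding size, and optimization hyperparameters are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
vocab_size, embed_dim = 4, 8
# Toy symmetric co-occurrence counts x[i, j] (illustrative values).
x = torch.tensor([[0., 10., 3., 0.],
                  [10., 0., 1., 2.],
                  [3., 1., 0., 6.],
                  [0., 2., 6., 0.]])

v = torch.randn(vocab_size, embed_dim, requires_grad=True)  # center word vectors
u = torch.randn(vocab_size, embed_dim, requires_grad=True)  # context word vectors
b = torch.zeros(vocab_size, requires_grad=True)             # center word biases
c = torch.zeros(vocab_size, requires_grad=True)             # context word biases
optimizer = torch.optim.Adam([v, u, b, c], lr=0.05)

def h(counts, cap=100.0, alpha=0.75):
    # Suggested weight function: (x/cap)^alpha below the cap, 1 otherwise.
    return torch.where(counts < cap, (counts / cap) ** alpha, torch.ones_like(counts))

rows, cols = torch.nonzero(x, as_tuple=True)  # only nonzero x_ij contribute, since h(0) = 0
for epoch in range(200):
    optimizer.zero_grad()
    # u_j^T v_i + b_i + c_j for every nonzero (i, j) pair.
    scores = (v[rows] * u[cols]).sum(dim=1) + b[rows] + c[cols]
    loss = (h(x[rows, cols]) * (scores - torch.log(x[rows, cols])) ** 2).sum()
    loss.backward()
    optimizer.step()

output_vectors = v + u  # GloVe sums the two vectors of each word as its output vector
```

Because the toy $x_{ij}$ here is symmetric, the learned center and context vectors play interchangeable roles, matching the discussion above.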
## Summary

- The skip-gram model can be interpreted using global corpus statistics such as word-word co-occurrence counts.
- The cross-entropy loss may not be a good choice for measuring the difference of two probability distributions, especially for a large corpus. GloVe uses squared loss to fit precomputed global corpus statistics.
- The center word vector and the context word vector are mathematically equivalent for any word in GloVe.
- GloVe can be interpreted from the ratio of word-word co-occurrence probabilities.
## Exercises

- If words $w_i$ and $w_j$ co-occur in the same context window, how can we use their distance in the text sequence to redesign the method for calculating the conditional probability $p_{ij}$? Hint: see Section 4.2 of the GloVe paper :cite:Pennington.Socher.Manning.2014.
- For any word, are its center word bias and context word bias mathematically equivalent in GloVe? Why?