# 15.5. Word Embedding with Global Vectors (GloVe)

Global Vectors for Word Representation (GloVe) is an unsupervised learning algorithm for obtaining vector representations of words. GloVe combines the advantages of two major model families in the literature: global matrix factorization and local context window methods.

## Key Concepts

**Global Matrix Factorization**: Methods like Latent Semantic Analysis (LSA) efficiently leverage global corpus statistics by factorizing a matrix of corpus-wide counts (a term-document matrix in the case of LSA), but they perform relatively poorly on word analogy tasks.

**Local Context Window Methods**: Methods like the skip-gram model tend to perform better on analogy tasks but make poor use of corpus-wide statistics, because they train on separate local context windows rather than on global co-occurrence counts.

**GloVe Model**: Combines both approaches by training on global word-word co-occurrence statistics in a way that can produce meaningful linear substructures in the word vector space.

## How GloVe Works

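1. **Co-occurrence Matrix**: Construct a word-word co-occurrence matrix from the corpus, where each entry counts how often one word appears in the context of another
2. **Objective Function**: Define a weighted least-squares regression objective that relates word vectors to the logarithm of the global co-occurrence counts
3. **Training**: Minimize this objective so that the dot product of a word vector and a context vector (plus bias terms) approximates the log co-occurrence count of the pair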
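
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`build_cooccurrence`, `weight`, `glove_loss`), the two-sentence toy corpus, and the all-zero parameters are assumptions made for the example, and the counts are unweighted by distance. The weighting function uses the commonly cited defaults `x_max = 100` and `alpha = 0.75`.

```python
import collections
import math


def build_cooccurrence(corpus, window=2):
    """Count how often each (center, context) word pair co-occurs."""
    counts = collections.defaultdict(float)
    for sentence in corpus:
        for i, center in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[center, sentence[j]] += 1.0
    return counts


def weight(x, x_max=100.0, alpha=0.75):
    """Weighting term: grows as (x / x_max) ** alpha and is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(counts, w, w_ctx, b, b_ctx):
    """Weighted least-squares loss over the nonzero co-occurrence counts."""
    loss = 0.0
    for (i, j), x in counts.items():
        dot = sum(u * v for u, v in zip(w[i], w_ctx[j]))
        loss += weight(x) * (dot + b[i] + b_ctx[j] - math.log(x)) ** 2
    return loss


# Toy corpus (an assumption for illustration only)
corpus = [['the', 'cat', 'sat'], ['the', 'cat', 'ran']]
counts = build_cooccurrence(corpus, window=1)
vocab = sorted({word for sentence in corpus for word in sentence})

# Hypothetical all-zero parameters, used only to evaluate the loss once
w = {word: [0.0, 0.0] for word in vocab}
w_ctx = {word: [0.0, 0.0] for word in vocab}
b = {word: 0.0 for word in vocab}
b_ctx = {word: 0.0 for word in vocab}

print(counts['the', 'cat'])
print(glove_loss(counts, w, w_ctx, b, b_ctx))
```

Only pairs that actually co-occur appear in `counts`, so the loss never evaluates `log(0)`. A real training loop would update `w`, `w_ctx`, `b`, and `b_ctx` with gradient descent instead of evaluating the loss once.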

## Advantages

- Captures global corpus statistics effectively
- Produces meaningful vector space structures
- Computationally efficient, since training only visits the nonzero entries of the co-occurrence matrix
- Good performance on word analogy and similarity tasks

The cells below implement byte pair encoding (BPE) on a toy vocabulary: starting from single characters, they repeatedly merge the most frequent pair of adjacent symbols and then use the resulting symbols to segment new tokens.

In [14]:
import collections

symbols = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
           'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
           '_', '[UNK]']

In [15]:
raw_token_freqs = {'fast_': 4, 'faster_': 3, 'tall_': 5, 'taller_': 4}
token_freqs = {}
for token, freq in raw_token_freqs.items():
    token_freqs[" ".join(list(token))] = freq

token_freqs

{'f a s t _': 4, 'f a s t e r _': 3, 't a l l _': 5, 't a l l e r _': 4}

In [16]:
def get_max_freq_pair(token_freqs):
    pairs = collections.defaultdict(int)
    for token, freq in token_freqs.items():
        symbols = token.split()
        for i in range(len(symbols) - 1):
            # Key of `pairs` is a tuple of two consecutive symbols
            pairs[symbols[i], symbols[i + 1]] += freq
    
    return max(pairs, key=pairs.get)  # Key of `pairs` with the max value

In [17]:
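import re

def merge_symbols(max_freq_pair, token_freqs, symbols):
    merged = ''.join(max_freq_pair)
    # Register the merged pair as a new symbol (mutates `symbols` in place)
    symbols.append(merged)
    # Match the pair only at symbol boundaries, so that merging ('a', 's')
    # cannot touch the tail of a longer symbol, as in 'fa s'
    pattern = re.compile(
        r'(?<!\S)' + re.escape(' '.join(max_freq_pair)) + r'(?!\S)')
    new_token_freqs = dict()
    for token, freq in token_freqs.items():
        new_token = pattern.sub(lambda match: merged, token)
        new_token_freqs[new_token] = freq
    return new_token_freqs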

In [18]:
num_merges = 10
for i in range(num_merges):
    max_freq_pair = get_max_freq_pair(token_freqs)
    token_freqs = merge_symbols(max_freq_pair, token_freqs, symbols)
    print(f'merge #{i + 1}:', max_freq_pair)

merge #1: ('t', 'a')
merge #2: ('ta', 'l')
merge #3: ('tal', 'l')
merge #4: ('f', 'a')
merge #5: ('fa', 's')
merge #6: ('fas', 't')
merge #7: ('e', 'r')
merge #8: ('er', '_')
merge #9: ('tall', '_')
merge #10: ('fast', '_')
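
The first merge in the output above is the result of a tie. The sketch below recomputes the pair counts for the initial toy vocabulary and lists every pair that reaches the maximum; the names `best` and `tied` are introduced here for illustration only.

```python
import collections

# Initial toy vocabulary, split into single-character symbols
token_freqs = {'f a s t _': 4, 'f a s t e r _': 3,
               't a l l _': 5, 't a l l e r _': 4}

# Count every pair of adjacent symbols, weighted by token frequency
pairs = collections.defaultdict(int)
for token, freq in token_freqs.items():
    symbols = token.split()
    for left, right in zip(symbols, symbols[1:]):
        pairs[left, right] += freq

best = max(pairs.values())
tied = [pair for pair, count in pairs.items() if count == best]
print(best, tied)  # 9 [('t', 'a'), ('a', 'l'), ('l', 'l')]
```

Three pairs share the top count of 9. Because `max` returns the first maximal key it encounters and dictionaries preserve insertion order, `('t', 'a')` wins, so the merge order depends on the order in which tokens are stored.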


In [21]:
print(symbols)
print(token_freqs.keys())

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '_', '[UNK]', 'ta', 'tal', 'tall', 'fa', 'fas', 'fast', 'er', 'er_', 'tall_', 'fast_']
dict_keys(['fast_', 'fast er_', 'tall_', 'tall er_'])


In [22]:
def segment_BPE(tokens, symbols):
    outputs = []
    for token in tokens:
        start, end = 0, len(token)
        cur_output = []
        # Segment token with the longest possible subwords from symbols
        while start < len(token) and start < end:
            if token[start: end] in symbols:
                cur_output.append(token[start: end])
                start = end
                end = len(token)
            else:
                end -= 1
        if start < len(token):
            cur_output.append('[UNK]')
        outputs.append(' '.join(cur_output))
    return outputs

In [23]:
tokens = ['tallest_', 'fatter_']
print(segment_BPE(tokens, symbols))

['tall e s t _', 'fa t t er_']
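
Neither example token above triggers the `[UNK]` branch. The following sketch shows what greedy longest-match segmentation does when it meets a character that is not in the symbol list. The function `segment_greedy` is a hypothetical single-token rewrite written for this example, intended to follow the same logic as `segment_BPE`, and the symbol list is copied from the printed output after the ten merges.

```python
import string

# Symbol list copied from the printed output above (after ten merges)
symbols = list(string.ascii_lowercase) + [
    '_', '[UNK]', 'ta', 'tal', 'tall', 'fa', 'fas', 'fast', 'er', 'er_',
    'tall_', 'fast_']


def segment_greedy(token, symbols):
    """Greedy longest-match segmentation of a single token."""
    pieces, start = [], 0
    while start < len(token):
        # Try the longest remaining substring first, then shrink it
        for end in range(len(token), start, -1):
            if token[start:end] in symbols:
                pieces.append(token[start:end])
                start = end
                break
        else:
            # No symbol matches at `start`: give up on the rest of the token
            pieces.append('[UNK]')
            break
    return ' '.join(pieces)


print(segment_greedy('tallest_', symbols))  # tall e s t _
print(segment_greedy('tall!_', symbols))    # tall [UNK]
```

Note that once an unknown character is reached, the remainder of the token (here the trailing `_`) is dropped and replaced by a single `[UNK]`.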
