# Preparing news vectors

We want to use [word2vec] pre-computed word vectors to approximate
the semantic distance between user queries and dictionary definitions.

See Daniel Dacanay, Antti Arppe, and Atticus Harrigan, [Computational Analysis versus Human Intuition: A Critical Comparison of Vector Semantics with Manual Semantic Classification in the Context of Plains Cree][vecpaper1], in 
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages; and 
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, [Efficient Estimation
of Word Representations in Vector Space][eewrvs], in Proceedings of Workshop at
ICLR, 2013.

[eewrvs]: http://arxiv.org/pdf/1301.3781.pdf
[word2vec]: https://code.google.com/archive/p/word2vec/
[vecpaper1]: https://computel-workshop.org/wp-content/uploads/2021/02/2021.computel-1.5.pdf

But first we are going to massage the precomputed vectors into an easier-to-use form, which means doing two things:
  - Use a file format that’s faster to load
  - Save time and space by pruning keys we’ll never query for

## The upstream files

First, we’ll store the files in the `res/vector_models` directory.
Let’s make a variable that points at that.

In [1]:
import os
from pathlib import Path

# jupyter does not expose the filename of the notebook, and
# the kernel working directory appears to be the directory
# containing the first notebook opened in the jupyter session.
def find_project_root(target_filename='Pipfile'):
    """Walk upwards from the current dir, looking for target_filename"""
    start_directory = Path(os.getcwd())
    # Check the starting directory first, then each ancestor up to the
    # filesystem root.
    for directory in (start_directory, *start_directory.parents):
        if (directory / target_filename).exists():
            return directory
    raise FileNotFoundError(
        f'Could not find {target_filename!r} in {start_directory} or any of its parents'
    )

ROOT = find_project_root()
VECTOR_DIR = ROOT / 'CreeDictionary' / 'res' / 'vector_models'

The upstream `GoogleNews-vectors-negative300.bin.gz` file is not checked in here, so you’ll have to get it elsewhere.

In [2]:
!env BLOCK_SIZE="'1" ls -s $VECTOR_DIR

total 1,647,046,656
1,647,046,656 GoogleNews-vectors-negative300.bin.gz


In [3]:
%%time

from gensim.models import KeyedVectors

vectors = VECTOR_DIR / 'GoogleNews-vectors-negative300.bin.gz'

wv = KeyedVectors.load_word2vec_format(vectors, binary=True)

CPU times: user 30.2 s, sys: 2.82 s, total: 33 s
Wall time: 30.8 s


It’s called a keyed vector because it maps keys to vectors.

If we run some basic stats, we see that there are 3 million 300-dimensional vectors.

In [4]:
import numpy as np
def shortprint(a):
    with np.printoptions(threshold=10):
        print(a)
shortprint(wv['hello'])

[-0.05419922  0.01708984 -0.00527954 ... -0.36523438 -0.13769531
 -0.12890625]


In [5]:
len(wv['hello'])

300

In [6]:
len(wv.key_to_index)

3000000

We can query for similar concepts:

In [7]:
wv.similar_by_vector(wv['hello'])

[('hello', 1.0),
 ('hi', 0.6548984050750732),
 ('goodbye', 0.6399056315422058),
 ('howdy', 0.6310956478118896),
 ('goodnight', 0.5920578241348267),
 ('greeting', 0.5855877995491028),
 ('Hello', 0.5842196345329285),
 ("g'day", 0.5754078030586243),
 ('See_ya', 0.5688871741294861),
 ('ya_doin', 0.5643119812011719)]
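
As far as I know, gensim ranks neighbours by cosine similarity, so the scores above should be the cosines between `wv['hello']` and the other vectors. Here is a minimal sketch of that ranking on made-up 2-dimensional vectors; the `toy_vectors` values and the `cosine` and `most_similar` helpers are hypothetical and not part of gensim.

```python
import math

# Made-up 2-d vectors, purely for illustration; the real ones have 300 dimensions.
toy_vectors = {
    'hello': (1.0, 0.2),
    'hi': (0.9, 0.3),
    'goodbye': (0.6, 0.6),
    'train': (0.0, 1.0),
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, vectors, topn=10):
    """Rank every key by cosine similarity to the query vector, best first."""
    scored = [(key, cosine(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

ranked = most_similar(toy_vectors['hello'], toy_vectors)
print(ranked)
```

As in the real output, the query’s own key comes back first with a similarity of 1.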

And then the deep magic is that this vector model appears to capture semantic relationships.

Take the physics away from Einstein, add painting, and what do you get?

In [8]:
wv.similar_by_vector(wv['Einstein'] - wv['physics'] + wv['painting'])

[('painting', 0.6713024973869324),
 ('Picasso', 0.5790765285491943),
 ('Rembrandt', 0.5525932312011719),
 ('Guercino', 0.5175808072090149),
 ('paintings', 0.5061424970626831),
 ('Picasso_Monet', 0.5017005205154419),
 ('Balthus', 0.5001558661460876),
 ('Cezanne', 0.4969138503074646),
 ('Warhol', 0.49574798345565796),
 ('Vincent_Van_Gogh', 0.4956248998641968)]
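
To see why subtracting and adding vectors can land near a sensible answer, here is a toy sketch with hand-picked 3-dimensional vectors. The axes (person, science, art) and all the values are assumptions chosen to make the arithmetic obvious; real embeddings have no such labelled axes.

```python
import math

# Hypothetical vectors with axes (person, science, art).
toy_vectors = {
    'Einstein': (1.0, 1.0, 0.0),
    'physics': (0.0, 1.0, 0.0),
    'painting': (0.0, 0.0, 1.0),
    'Picasso': (1.0, 0.0, 1.0),
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Einstein - physics + painting, computed component by component.
query = tuple(
    e - p + q
    for e, p, q in zip(
        toy_vectors['Einstein'], toy_vectors['physics'], toy_vectors['painting']
    )
)

# Keys ordered from most to least similar to the query vector.
ranked = sorted(
    toy_vectors, key=lambda key: cosine(query, toy_vectors[key]), reverse=True
)
print(query, ranked)
```

In this toy setup the query vector coincides with `Picasso`, so it ranks first; in the real model the match is only approximate, which is why `painting` itself still scores highest above.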

### Faster file format

Let’s use the built-in gensim file format, which saves the vectors to disk as a NumPy array that can be memory-mapped.

In [9]:
import os

REDUCED_FILE = os.fspath(VECTOR_DIR / 'news_vectors.kv')

wv.save(REDUCED_FILE)

In [10]:
!env BLOCK_SIZE="'1" ls -s $VECTOR_DIR

total 5,363,204,096
1,647,046,656 GoogleNews-vectors-negative300.bin.gz
  116,154,368 news_vectors.kv
3,600,003,072 news_vectors.kv.vectors.npy


In [11]:
%%time

wv = KeyedVectors.load(REDUCED_FILE, mmap='r')
with np.printoptions(threshold=10):
    print(wv['hello'])
wv.similar_by_vector(wv['hello'])

[-0.05419922  0.01708984 -0.00527954 ... -0.36523438 -0.13769531
 -0.12890625]
CPU times: user 2.57 s, sys: 1.12 s, total: 3.69 s
Wall time: 1.2 s


[('hello', 1.0),
 ('hi', 0.6548984050750732),
 ('goodbye', 0.6399056315422058),
 ('howdy', 0.6310956478118896),
 ('goodnight', 0.5920578241348267),
 ('greeting', 0.5855877995491028),
 ('Hello', 0.5842196345329285),
 ("g'day", 0.5754078030586243),
 ('See_ya', 0.5688871741294861),
 ('ya_doin', 0.5643119812011719)]

In [12]:
%%time
wv.similar_by_vector(wv['hello'])

CPU times: user 1.75 s, sys: 440 ms, total: 2.19 s
Wall time: 145 ms


[('hello', 1.0),
 ('hi', 0.6548984050750732),
 ('goodbye', 0.6399056315422058),
 ('howdy', 0.6310956478118896),
 ('goodnight', 0.5920578241348267),
 ('greeting', 0.5855877995491028),
 ('Hello', 0.5842196345329285),
 ("g'day", 0.5754078030586243),
 ('See_ya', 0.5688871741294861),
 ('ya_doin', 0.5643119812011719)]

This file is *much* faster to load, ~1 second instead of 30 seconds, but it is also much larger: about 3.7 GB on disk versus 1.6 GB for the compressed original.
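
The size of the `.npy` file is easy to account for. The sketch below multiplies out the raw array size and then rounds up to whole disk blocks; the 128-byte header and 4096-byte block size are assumptions that happen to reproduce the `ls -s` figure, not something checked against the file itself.

```python
N_VECTORS = 3_000_000   # len(wv.key_to_index)
DIMENSIONS = 300        # len(wv['hello'])
FLOAT32_BYTES = 4

raw_bytes = N_VECTORS * DIMENSIONS * FLOAT32_BYTES
print(f"{raw_bytes:,}")

# Assumptions: a 128-byte .npy header, and a filesystem that
# allocates space in 4096-byte blocks.
NPY_HEADER_BYTES = 128
BLOCK_BYTES = 4096

total_bytes = raw_bytes + NPY_HEADER_BYTES
blocks = -(-total_bytes // BLOCK_BYTES)  # ceiling division
allocated_bytes = blocks * BLOCK_BYTES
print(f"{allocated_bytes:,}")
```

Under those assumptions the result matches the 3,600,003,072 bytes reported for `news_vectors.kv.vectors.npy`.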

#### Float precision

Interestingly, if we look inside the original file, two of every four bytes in the dump below are zero: we only started out with 16 bits of precision per value, but they’re being stored as 32-bit floats. We can halve the file size by converting to a 16-bit data type.

In [13]:
!zcat $VECTOR_DIR/GoogleNews-vectors-negative300.bin.gz \
    | head -c 300M | tail -c 256 | hexdump -C

00000000  93 be 00 00 14 3e 00 00  31 3e 00 00 41 3d 00 00  |.....>..1>..A=..|
00000010  b8 3d 00 00 8f be 00 00  32 be 00 00 16 bf 00 00  |.=......2.......|
00000020  08 be 00 00 c9 3e 00 00  9f be 00 00 5e 3c 00 00  |.....>......^<..|
00000030  30 be 00 00 58 bc 00 00  57 3e 00 00 8d 3e 00 00  |0...X...W>...>..|
00000040  a2 be 00 00 ee bd 00 00  e5 be 00 00 8f bd 00 00  |................|
00000050  21 be 00 00 3f be 00 00  74 3e 00 00 05 3f 00 00  |!...?...t>...?..|
00000060  2f 3e 00 00 dd bd 00 00  fa 3d 00 00 f6 3e 00 00  |/>.......=...>..|
00000070  16 be 00 00 5c 3e 00 00  3c 3e 00 00 2d be 00 00  |....\>..<>..-...|
00000080  9d bd 00 00 5a 3e 00 00  fe 3c 00 00 e6 3e 00 00  |....Z>...<...>..|
00000090  24 be 00 00 ec 3e 00 00  59 be 00 00 c9 3d 00 00  |$....>..Y....=..|
000000a0  f7 bd 00 00 07 bf 00 00  9d 3e 00 00 d8 bc 00 00  |.........>......|
000000b0  e9 3d 00 00 20 ba 00 00  4d 3e 00 00 b6 be 00 00  |.=.. ...M>......|
000000c0  ba bc 00 00 a7 

gzip: stdout: Broken pipe

In [14]:
wv.vectors.dtype

dtype('float32')
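
One way to read the hexdump: a little-endian float32 whose low two bytes are zero keeps only its sign bit, 8 exponent bits, and top 7 mantissa bits. The sketch below uses only the standard library to check this for the first component of `wv['hello']`, assuming that component is exactly -111/2048 (consistent with the printed -0.05419922), and shows that this particular value survives a round trip through IEEE half precision.

```python
import struct

# Assumed exact value of wv['hello'][0]; it prints as -0.05419922.
value = -111 / 2048

# Little-endian float32: the two least significant bytes come first.
as_float32 = struct.pack('<f', value)
print(as_float32.hex())

# The 'e' format is IEEE 754 half precision (float16).
as_float16 = struct.pack('<e', value)
round_trip = struct.unpack('<e', as_float16)[0]
print(len(as_float32), len(as_float16), round_trip == value)
```

Half precision has only 5 exponent bits, so values much smaller in magnitude than the ones shown here would lose precision in the conversion.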

In [15]:
wv2 = KeyedVectors.load(REDUCED_FILE, mmap='r')
wv2.vectors = wv2.vectors.astype('float16')
shortprint(wv2['hello'])

[-0.0542   0.01709 -0.00528 ... -0.3652  -0.1377  -0.1289 ]


But, sadly, doing so makes lookups take more than **15x** as long, going from a fraction of a second to multiple seconds. This is because modern CPUs do not generally have built-in 16-bit float operations.

In [16]:
%%time
wv2.similar_by_vector(wv2['hello'])

CPU times: user 3.79 s, sys: 924 ms, total: 4.71 s
Wall time: 2.34 s


[('hello', 1.0000041723251343),
 ('hi', 0.6548804640769958),
 ('goodbye', 0.639906108379364),
 ('howdy', 0.6311008334159851),
 ('goodnight', 0.5920441150665283),
 ('greeting', 0.5855898857116699),
 ('Hello', 0.5842040777206421),
 ("g'day", 0.5753953456878662),
 ('See_ya', 0.5688733458518982),
 ('ya_doin', 0.5643098950386047)]

In [17]:
%%time
wv.similar_by_vector(wv['hello'])

CPU times: user 1.78 s, sys: 591 ms, total: 2.37 s
Wall time: 154 ms


[('hello', 1.0),
 ('hi', 0.6548984050750732),
 ('goodbye', 0.6399056315422058),
 ('howdy', 0.6310956478118896),
 ('goodnight', 0.5920578241348267),
 ('greeting', 0.5855877995491028),
 ('Hello', 0.5842196345329285),
 ("g'day", 0.5754078030586243),
 ('See_ya', 0.5688871741294861),
 ('ya_doin', 0.5643119812011719)]

If we need to be careful with disk space, we could save the vectors on disk as float16 and then do `.astype('float32')` on load, which would only take a few seconds. It would use more disk space than the compressed upstream file, but may be much faster than dealing with a compressed file.

However, anything that’s not `mmap`ing a file gets risky in terms of memory use. A few gigs of data in memory isn’t a big deal for a server with lots of RAM, but (1) if the data isn’t all ready *before* the webserver forks worker processes, 10 copies of a few gigs of data adds up quickly, and (2) it could substantially increase the requirements for developer machines, which might not have many gigabytes of spare RAM.
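
The trade-offs in the last two paragraphs can be put into rough numbers. This is back-of-the-envelope arithmetic only: it ignores file headers and per-process overhead, and it assumes the worst case where every worker ends up with its own private float32 copy.

```python
N_VECTORS = 3_000_000
DIMENSIONS = 300
UPSTREAM_GZ_BYTES = 1_647_046_656  # from the `ls -s` output above

# On disk as float16 (2 bytes per value), in RAM as float32 (4 bytes per value).
float16_on_disk = N_VECTORS * DIMENSIONS * 2
float32_in_ram = N_VECTORS * DIMENSIONS * 4

WORKERS = 10  # the example worker count used in the text
worst_case_ram = WORKERS * float32_in_ram

print(f"float16 on disk:  {float16_on_disk:,} bytes")
print(f"upstream .bin.gz: {UPSTREAM_GZ_BYTES:,} bytes")
print(f"{WORKERS} private copies: {worst_case_ram:,} bytes")
```

So an uncompressed float16 file would be a little larger than the gzipped original, and ten unshared copies would need about 36 GB of RAM.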

So for now I think we’ll stick with the bigger file that can be processed more efficiently both in terms of RAM and CPU.

## Pruning keys

### Keys with punctuation

The file is still quite large. There’s probably a *lot* of stuff in there we will never, ever query for.

For example, what’s the millionth entry?

In [18]:
wv.index_to_key[1_000_000]

'Starwood_Hotels_HOT'

That’s not something we’ll ever query the dictionary for.

What are the top keys?

In [19]:
keys = list(wv.key_to_index.keys())

In [20]:
", ".join(keys[:100])

'</s>, in, for, that, is, on, ##, The, with, said, was, the, at, not, as, it, be, from, by, are, I, have, he, will, has, ####, his, an, this, or, their, who, they, but, $, had, year, were, we, more, ###, up, been, you, its, one, about, would, which, out, can, It, all, also, two, after, first, He, do, time, than, when, We, over, last, new, other, her, people, into, In, our, there, A, she, could, just, years, some, U.S., three, million, them, what, But, so, no, like, if, only, percent, get, did, him, game, back, because, now, #.#, before'

Right away we see that `#`—presumably a placeholder for a number—and `$` are common terms. What keys containing punctuation can we drop?

In [21]:
from collections import Counter
import string

[(char, f"{count:,}")
     for (char, count) in Counter(''.join(keys)).most_common()
     if char not in string.ascii_letters + string.digits][:30]

[('_', '2,881,208'),
 ('#', '261,534'),
 ('-', '202,358'),
 ('.', '187,835'),
 ('=', '29,938'),
 ('/', '24,442'),
 ("'", '17,088'),
 ('@', '10,956'),
 (':', '9,749'),
 ('é', '6,339'),
 (',', '5,517'),
 ('®', '4,268'),
 ('+', '2,530'),
 ('&', '2,493'),
 ('*', '2,350'),
 ('™', '2,206'),
 ('â', '1,647'),
 ('á', '1,483'),
 ('•', '1,428'),
 ('€', '1,324'),
 ('ó', '1,181'),
 ('í', '1,143'),
 ('ü', '1,125'),
 ('ñ', '1,045'),
 ('ö', '993'),
 ('è', '780'),
 ('ä', '599'),
 ('ç', '507'),
 ('е', '480'),
 ('ο', '464')]

We also see some duplication in terms of case; both “it” and “It” appear as keys.

In [22]:
wv.similar_by_key('It')

[('That', 0.8260787129402161),
 ('This', 0.8164012432098389),
 ("It'sa", 0.7155401706695557),
 ('But', 0.6960429549217224),
 ('Of_course', 0.6675450801849365),
 ('And', 0.6665104031562805),
 ('Certainly', 0.650726854801178),
 ("That'sa", 0.6421756148338318),
 ('Obviously', 0.6368812918663025),
 ('Actually', 0.6258060336112976)]

In [23]:
wv.similar_by_key('it')

[('that', 0.6775559782981873),
 ('something', 0.6162784695625305),
 ('just', 0.6107823848724365),
 ('actually', 0.5887327790260315),
 ('It', 0.5808414220809937),
 ('what', 0.5651708245277405),
 ('anyway', 0.5644350647926331),
 ('really', 0.5597794055938721),
 ('so', 0.5579650402069092),
 ('if', 0.5520145297050476)]

The distinction would definitely be useful for some purposes, but our dictionary lowercases all queries on input, so that would be lost on us.

In [24]:
import re
re_double_underscore = re.compile('.*_.*_.*')

def figure_out_items_to_keep():
    # new_key, vector
    to_keep = {}
    
    # The original data does not seem to include frequencies, but we
    # assume that the keys are in frequency order, so we will see
    # the most common term first.
    for key in keys:        
        
        pruned_key = key.lower()
        if pruned_key in to_keep:
            continue
        
        # drop keys with unwanted punctuation
        if any(c in key for c in "$#.=/'@:,®+&*™•"):
            continue
        
        # drop multi-word keys (joined with “_”) that contain uppercase characters
        has_uppercase_char = key != key.lower()
        if has_uppercase_char and '_' in key:
            continue
        
        # Skip items like “Dow_Jones_industrial”
        if re_double_underscore.match(key):
            continue
        
        to_keep[pruned_key] = wv[key]
    return to_keep

items_to_keep = figure_out_items_to_keep()

In [25]:
len(items_to_keep)

930045
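
To make the filtering rules concrete, here is the same logic run on a handful of sample keys. Instead of storing vectors, this sketch records which original key supplied each lowercased entry. The `prune` helper and the `sample` list are hypothetical; `chaotic_jumble_sale` in particular is invented to exercise the double-underscore rule.

```python
import re

UNWANTED = "$#.=/'@:,®+&*™•"
re_double_underscore = re.compile('.*_.*_.*')

def prune(keys_in_frequency_order):
    """Map each surviving lowercased key to the original key that supplied it."""
    kept = {}
    for key in keys_in_frequency_order:
        pruned_key = key.lower()
        if pruned_key in kept:
            continue  # a more frequent casing was already kept
        if any(c in key for c in UNWANTED):
            continue  # unwanted punctuation
        if key != pruned_key and '_' in key:
            continue  # multi-word key containing uppercase characters
        if re_double_underscore.match(key):
            continue  # two or more underscores
        kept[pruned_key] = key
    return kept

sample = ['in', 'The', 'the', '##', 'U.S.', 'New_York', 'commuter_train',
          'chaotic_jumble_sale', 'It', 'it']
kept = prune(sample)
print(kept)
```

Note that whichever casing appears first supplies the entry: in this sample, `the` would get the vector for `The`, and `it` the vector for `It`.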

### Taking a top-$n$ subset

That’s still a lot of keys, and the ones toward the end don’t seem very useful.

In [26]:
from itertools import islice

In [27]:
offset, n = 500_000, 10; print(list(islice(items_to_keep.keys(), offset, offset + n)))

['clscs', '2ab', 'imprecation', 'collectivities', 'ezeani', 'barakah', 'amarteifio', 'glennys', 'ingbretson', 'chaotic_jumble']


At 100,000 keys in, we still seem to have some more common terms:

In [28]:
offset, n = 100_000, 10; print(list(islice(items_to_keep.keys(), offset, offset + n)))

['bronze_sculptures', 'expandable_memory', 'backyard_barbecue', 'matchmakers', 'volkswagens', 'unconverted', 'abilify', 'grovel', 'cannibalistic', 'intimations']


And 250k isn’t too bad either:

In [29]:
offset, n = 250_000, 10; print(list(islice(items_to_keep.keys(), offset, offset + n)))

['louisvillians', 'subjunctive', 'ntgr', 'popenoe', 'mungall', 'panhandles', 'boccio', 'synqor', 'endangered_tigers', 'incentivises']


But, subjectively, the keys around 300k don’t seem too useful:

In [30]:
offset, n = 300_000, 10; print(list(islice(items_to_keep.keys(), offset, offset + n)))

['1l_behind', 'predeployment', 'tuesay', 'refractory_ores', 'gifting_suites', 'skahill', 'sunnites', 'cfml', 'esophageal_cancers', 'arlow']


It’s a pretty arbitrary cut-off, but let’s just take the top 300,000 keys.

In [31]:
threshold = 300_000
new_wv = KeyedVectors(vector_size=wv.vector_size)
new_keys = list(items_to_keep.keys())[:threshold]
new_values = list(items_to_keep.values())[:threshold]
new_wv.add_vectors(new_keys, new_values)
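
The truncation relies on dicts preserving insertion order, so the first `threshold` entries are the most frequent ones. As an alternative sketch, `itertools.islice` can take the first entries without first building two full lists of every key and value; the `take_top_n` helper and the toy data below are hypothetical.

```python
from itertools import islice

def take_top_n(items, n):
    """Return a new dict with the first n entries of items, in insertion order."""
    return dict(islice(items.items(), n))

# Toy stand-in for items_to_keep: key -> vector, most frequent key first.
toy_items = {
    'in': [0.1, 0.2],
    'the': [0.3, 0.4],
    'train': [0.5, 0.6],
    'arlow': [0.7, 0.8],
}

top = take_top_n(toy_items, 3)
print(list(top.keys()))
```

The keys and values of the result can then be passed to `add_vectors` as two parallel lists, as in the cell above.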

## The pruned file

A quick check that things look ok:

In [32]:
new_wv.similar_by_key('hello')

[('hi', 0.6188791394233704),
 ('hiya', 0.5998829007148743),
 ('hey', 0.5955355167388916),
 ('oh', 0.5387828350067139),
 ('dear', 0.5133785605430603),
 ('oooh', 0.5129778981208801),
 ('hooray', 0.509107768535614),
 ('wassup', 0.4983426332473755),
 ('ooh', 0.49617260694503784),
 ('whatcha', 0.4949532151222229)]

Well, that’s disappointingly different, and lower quality, compared to the results from the original model with its uppercase keys, but “hello” is actually a fairly uncommon word in news articles.

In [33]:
wv.key_to_index['hello']

20397

What about a more common word?

In [34]:
wv.key_to_index['train']

2035

In [35]:
new_wv.similar_by_key('train')

[('trains', 0.8081232309341431),
 ('commuter_train', 0.6523401737213135),
 ('locomotive', 0.6395583152770996),
 ('freight_train', 0.6207071542739868),
 ('railway', 0.6071822047233582),
 ('bus', 0.6067739725112915),
 ('rail', 0.5885170102119446),
 ('commuter_trains', 0.5821391344070435),
 ('tram', 0.5750932097434998),
 ('carriages', 0.5699437260627747)]

That seems just fine. Let’s try it for now and revisit it if we run into issues with query quality from not having/trying the uppercase versions, or speed issues from having too many keys.

In [36]:
new_wv.save(REDUCED_FILE)

In [37]:
!env BLOCK_SIZE="'1" ls -s $VECTOR_DIR

total 2,014,588,928
1,647,046,656 GoogleNews-vectors-negative300.bin.gz
    7,540,736 news_vectors.kv
  360,001,536 news_vectors.kv.vectors.npy
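
To put the saving in perspective, here is the arithmetic on the sizes reported by `ls -s` before and after pruning. All figures are copied from the outputs in this notebook.

```python
# Sizes in bytes, as reported by `ls -s` above.
before = {
    'news_vectors.kv': 116_154_368,
    'news_vectors.kv.vectors.npy': 3_600_003_072,
}
after = {
    'news_vectors.kv': 7_540_736,
    'news_vectors.kv.vectors.npy': 360_001_536,
}
UPSTREAM_GZ_BYTES = 1_647_046_656

before_total = sum(before.values())
after_total = sum(after.values())
ratio = before_total / after_total
share_of_upstream = after_total / UPSTREAM_GZ_BYTES

print(f"{after_total:,} bytes after pruning")
print(f"{ratio:.1f}x smaller than the unpruned gensim files")
print(f"{share_of_upstream:.0%} of the size of the compressed upstream file")
```

Keeping a tenth of the keys shrinks the files by roughly a factor of ten, as expected for a fixed-width array.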
