We'll use the Python library gensim: https://radimrehurek.com/gensim/

In [69]:
import gensim

Load a premade word2vec model built on Google News articles.

Download from: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

In [70]:
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True, limit=500000)

We limit the vocabulary to the 500,000 most common words. Even at that size, the tail of the list drifts into nonsense tokens, and a real word that is too obscure wouldn't make a good clue anyway. Here are the top 50 and bottom 50 words by frequency.

In [94]:
# Note: gensim 4.0+ renamed `index2word` to `index_to_key`.
print("Most common:", model.index2word[:50])
print("Least common:", model.index2word[-50:])

Most common: ['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said', 'was', 'the', 'at', 'not', 'as', 'it', 'be', 'from', 'by', 'are', 'I', 'have', 'he', 'will', 'has', '####', 'his', 'an', 'this', 'or', 'their', 'who', 'they', 'but', '$', 'had', 'year', 'were', 'we', 'more', '###', 'up', 'been', 'you', 'its', 'one', 'about', 'would', 'which', 'out']
Least common: ['ideological_affinity', 'Rashtriya_Rifles_RR', 'Pedrotti', 'Frysinger', 'Ralph_Sacco', 'Ryan_Nece', 'Homs_Syria', 'BACK_TO_BACK', 'Nag_Hammadi', 'Dashan', 'Murape', 'Majolica', 'Sundvold', 'Jerryd', 'administered_subcutaneously', 'Pierre_Luc_Gagnon', 'Fedrizzi', 'CD_ROMS', 'Raynham_Mass.', 'NN_Vohra', 'Barraba', 'Delta_Upsilon', 'Roilo_Golez', 'Cindy_Scroggins', 'Iter', 'Ford_Expeditions', 'La_Toussuire', 'Hooksett_NH', 'ITCTransmission', 'wakeskate', 'Fervor', 'SAFT', 'steam_boiler', 'Moskwa', 'Inet_electronic', '2A_1A', 'pituitary_tumor', 'Westernbank', '3DV', 'Supremely', 'Mellars', 'JUDGMENT', 'thinnest_sm
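
One way to screen out entries like these before they show up as clues is a simple surface filter. The sketch below is an assumption on my part, not something gensim provides: `is_plausible_clue` is a hypothetical helper, and its rules (lowercase, purely alphabetic, at least three letters) are just one reasonable guess at what a usable clue looks like.

```python
def is_plausible_clue(word, min_length=3):
    """Hypothetical filter: keep lowercase, purely alphabetic single words."""
    return (
        len(word) >= min_length
        and word.isalpha()   # rejects '####', '</s>', 'Homs_Syria', 'steam_boiler'
        and word.islower()   # rejects 'JUDGMENT' and other capitalized entries
    )

# A handful of entries copied from the two lists above
sample = ['</s>', 'in', 'said', '####', 'Homs_Syria',
          'wakeskate', 'JUDGMENT', 'steam_boiler']
kept = [w for w in sample if is_plausible_clue(w)]
print(kept)  # ['said', 'wakeskate']
```

A filter like this cannot tell an obscure word such as "wakeskate" from a common one, so it complements the frequency cutoff rather than replacing it.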

Here's an example Codenames board. `blue` holds one team's words, `red` holds the other team's, and `assassin` is the assassin word.

In [95]:
board = {
    'blue': ['ambulance', 'hospital', 'spell', 'lock', 'charge', 'tail', 'link', 'cook', 'web'],
    'red': ['cat', 'button', 'pipe', 'pants', 'mount', 'sleep', 'stick', 'file', 'worm'],
    'assassin': 'doctor'
}

We can use gensim to find the 10 words most related to "ambulance" in this word2vec model.

In [55]:
model.similar_by_word('ambulance', topn=10)

[('paramedics', 0.7590752243995667),
 ('ambulances', 0.7493595480918884),
 ('Ambulance', 0.7236292362213135),
 ('paramedic', 0.662133514881134),
 ('Ambulance_paramedics', 0.6315338611602783),
 ('Ambulances', 0.6211477518081665),
 ('LifeFlight_helicopter', 0.6147335171699524),
 ('hospital', 0.6099206209182739),
 ('Paramedics', 0.6081751585006714),
 ('Ambulance_Service', 0.6080097556114197)]

Each entry is a word followed by its cosine similarity to "ambulance." Some of these would be useful clues, "paramedics" for instance, but many are just other forms of the word "ambulance."
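
A rough way to strip out those other forms is a case-insensitive substring test. This is only a sketch: `drop_variants` is a hypothetical helper, and the substring rule is a crude assumption that will also discard compounds such as "Ambulance_Service".

```python
def drop_variants(word, results):
    """Hypothetical helper: drop candidates that contain the query word
    (or are contained in it), ignoring case."""
    target = word.lower()
    kept = []
    for candidate, score in results:
        lowered = candidate.lower()
        if target in lowered or lowered in target:
            continue  # e.g. 'ambulances', 'Ambulance'
        kept.append((candidate, score))
    return kept

# A few rows copied from the output above
results = [
    ('paramedics', 0.7590752243995667),
    ('ambulances', 0.7493595480918884),
    ('Ambulance', 0.7236292362213135),
    ('LifeFlight_helicopter', 0.6147335171699524),
    ('hospital', 0.6099206209182739),
]
print([w for w, _ in drop_variants('ambulance', results)])
# ['paramedics', 'LifeFlight_helicopter', 'hospital']
```

With the model loaded, the same helper could be applied to the live results: `drop_variants('ambulance', model.similar_by_word('ambulance', topn=10))`.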

gensim also lets us directly find the words most similar to a whole group of words at once.

In [96]:
model.most_similar(positive=board['blue'])

[('%_#F########_3v.jsn', 0.5153687000274658),
 ('By_TBT_staff', 0.4811619818210602),
 ('By_HARVEY_SIMPSON', 0.47336331009864807),
 ('try_resubmitting', 0.46592575311660767),
 ('By_Salvatore_Landolina', 0.4655460715293884),
 ('By_Jason_Kaneshiro', 0.4612027108669281),
 ('%_#F########_2v.jsn', 0.45537447929382324),
 ('%_#F########_1v.jsn', 0.4508393406867981),
 ('BY_VINCENT_MAO', 0.4498888850212097),
 ('Visit_BBC_Webwise', 0.4431522786617279)]

As we can see, this produces a lot of nonsense tokens. We can use `restrict_vocab` to limit results to only the top n most common words in the vocabulary.

In [97]:
model.most_similar(
    positive=board['blue'],
    restrict_vocab=50000,
    topn=20
)

[('For_Restrictions', 0.43488097190856934),
 ('bed', 0.39588358998298645),
 ('links', 0.38411831855773926),
 ('hook', 0.38367366790771484),
 ('paramedics', 0.38072746992111206),
 ('emergency', 0.37950167059898376),
 ('jail', 0.3759669065475464),
 ('log', 0.37062549591064453),
 ('intensive_care', 0.3661930561065674),
 ('call', 0.36543411016464233),
 ('webpage', 0.3649423122406006),
 ('tow_truck', 0.3592333197593689),
 ('click', 0.35906946659088135),
 ('cooked', 0.3552851676940918),
 ('care', 0.3537469208240509),
 ('handcuff', 0.35027384757995605),
 ('then', 0.34921103715896606),
 ('stay', 0.3478427529335022),
 ('turn', 0.34607696533203125),
 ('bookmark', 0.3458564579486847)]

This looks much better, and produces some decent clues.  
* "bed", "paramedics", "emergency" all relate to "ambulance" and "hospital." 
* "jail" could relate to "lock" and "charge." 
* "click" to "web" and "link."

But "bed" would also relate to the other team's word "sleep," and "click" to "button." It would be bad to give clues that could point to the opponent's cards.

gensim allows negative examples to be included as well, which helps avoid that.

In [98]:
model.most_similar(
    positive=board['blue'],
    negative=board['red'],
    restrict_vocab=50000
)

[('Hospital', 0.27265793085098267),
 ('ambulances', 0.2605472207069397),
 ('hospitals', 0.24624229967594147),
 ('outpatient', 0.24339225888252258),
 ('inpatient', 0.2404019981622696),
 ('paramedics', 0.23482689261436462),
 ('escort', 0.23161748051643372),
 ('Partnerships', 0.23104971647262573),
 ('Medical_Center', 0.2306305170059204),
 ('telemedicine', 0.22638411819934845)]

I really like the clue "telemedicine." It's non-obvious, but relates to four words: "web," "link," "ambulance" and "hospital." This shows the potential for this method to produce novel clues.
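
For intuition about what the `positive`/`negative` query is doing: as I understand it, gensim scales each input vector to unit length, averages them with the negative words negated, and ranks the vocabulary by similarity to that average. The sketch below is a simplified, pure-Python illustration of that idea and not gensim's actual code. The function names are hypothetical, and the 2-D vectors are made up (the real ones have 300 dimensions).

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def query_vector(positive, negative=()):
    """Average the unit vectors, counting negatives with weight -1."""
    vectors = [unit(v) for v in positive]
    vectors += [[-x for x in unit(v)] for v in negative]
    dims = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]
    return unit(mean)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

# Made-up vectors, purely for illustration
ours = [[1.0, 0.0], [0.0, 1.0]]
theirs = [[0.0, 1.0]]

plain = query_vector(ours)            # points midway between our two words
steered = query_vector(ours, theirs)  # pushed away from the opponent's word

print(cosine(plain, theirs[0]), cosine(steered, theirs[0]))
```

The second similarity is lower than the first: adding the opponent's vector as a negative moves the query away from it, at the cost of also moving away from any of our own words that point in the same direction.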

Suppose the clue was "telemedicine" and those four words were removed from the board, and then the other team got a turn. What might their clues be?

In [99]:
board = {
    'blue': ['spell', 'lock', 'charge', 'tail', 'link'],
    'red': ['cat', 'button', 'pipe', 'pants', 'mount', 'sleep', 'stick', 'file', 'worm'],
    'assassin': 'doctor'
}

model.most_similar(
    positive=board['red'],
    negative=board['blue'],
    restrict_vocab=50000
)

[('pillow', 0.43686941266059875),
 ('bra', 0.3842337727546692),
 ('couch', 0.38342970609664917),
 ('tub', 0.37922778725624084),
 ('closet', 0.36959999799728394),
 ('sofa', 0.36713898181915283),
 ('bathroom', 0.366258829832077),
 ('bed', 0.36348700523376465),
 ('crotch', 0.36245280504226685),
 ('spoon', 0.36179912090301514)]

This appears much less successful. The top words mostly seem to relate to just a single word:
* pillow -> sleep
* bra -> pants
* couch -> sleep? cat?
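
Rather than eyeballing which board words a clue points to, we could check each pairing directly. This is a minimal sketch under stated assumptions: `words_covered` is a hypothetical helper, the `similarity` argument is any function that scores two words (gensim's `model.similarity` would fit), and the 0.3 threshold is an arbitrary guess that would need tuning.

```python
def words_covered(clue, team_words, similarity, threshold=0.3):
    """Hypothetical helper: return the team words whose similarity to
    `clue` is at least `threshold`, strongest first."""
    scored = [(similarity(clue, word), word) for word in team_words]
    return [word for score, word in sorted(scored, reverse=True)
            if score >= threshold]

# Assumed usage with the model and board defined earlier:
# for clue in ['pillow', 'bra', 'couch']:
#     print(clue, '->', words_covered(clue, board['red'], model.similarity))
```

A clue that comes back with only one covered word is no better than naming a synonym, so the length of this list is one possible way to compare candidate clues.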