This notebook explores how word2vec performs on a related-words task, using vectors pre-trained on the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases.

# Playing with pre-trained word embeddings.

Download the pre-trained word vectors (~1.5 GB); this takes about a minute.

In [0]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2019-11-12 16:06:08--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.185.125
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.185.125|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘/root/input/GoogleNews-vectors-negative300.bin.gz’


2019-11-12 16:06:52 (36.0 MB/s) - ‘/root/input/GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



Install gensim, an NLP library that we will use to load the word2vec embeddings.

In [0]:
!pip install gensim
from gensim.models import KeyedVectors



In [0]:
EMBEDDING_FILE = '/root/input/GoogleNews-vectors-negative300.bin.gz' # from above
model = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)



Add logging support.

In [0]:
import logging
from pprint import pprint # pretty print output
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Problem
The goal is to find interesting related words given a set of words, i.e. understand the ‘theme’ and get an effective ranking of the words.

Before we start, let us define an auxiliary function that removes words from the list that are not present in the model's vocabulary.

In [0]:
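# filters words not in the model vocabulary
def filter_words_not_in_vocab(model, list_of_words):
    # KeyedVectors exposes `key_to_index` in gensim >= 4.0 and `vocab` in older releases
    vocab = getattr(model, 'key_to_index', None) or model.vocab
    return [word for word in list_of_words if word in vocab]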

### Metrics for ranking
#### Metric 1 (Cosine-mean similarity): 
This method computes the cosine similarity between the simple mean of the projection weight vectors of the given words and the vector of every word in the model's vocabulary, and returns the top-N most similar words. gensim provides a built-in method, [most_similar()](https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html), for this.
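
To make the computation concrete, here is a minimal pure-Python sketch of the cosine-mean idea on hand-made 3-dimensional vectors. The helper names (`unit`, `cosine_mean_ranking`) and the toy vectors are hypothetical. The sketch assumes that each input vector is unit-normalized before averaging and that the input words themselves are excluded from the results; it illustrates the idea and is not gensim's actual implementation.

```python
import math

def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cosine_mean_ranking(vectors, query_words, topn):
    """Rank vocabulary words by cosine similarity to the mean of the query vectors."""
    normalized = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(normalized.values())))
    # Mean of the unit-normalized query vectors, re-normalized to unit length.
    mean = unit([
        sum(normalized[word][i] for word in query_words) / len(query_words)
        for i in range(dim)
    ])
    # For unit vectors, the dot product equals the cosine similarity.
    scores = [
        (word, sum(a * b for a, b in zip(vec, mean)))
        for word, vec in normalized.items()
        if word not in query_words  # assumption: input words are excluded
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]

# Toy vectors chosen by hand for illustration only.
toy_vectors = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.2, 0.0],
    "tiger": [0.8, 0.3, 0.1],
    "visa": [0.0, 0.1, 1.0],
}

top = cosine_mean_ranking(toy_vectors, ["cat", "dog"], topn=2)
print(top)
```

In this toy setup, `tiger` ranks well above `visa` because its vector points in nearly the same direction as the mean of `cat` and `dog`.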
#### Metric 2 (Frequency count):
This method counts how often each candidate word appears among the top-N nearest neighbours (by cosine similarity) of the individual input words, and returns the candidates that appear most often.


In [0]:
from itertools import chain
from collections import Counter

# this method takes in a list of words and returns top 20 words which are 
# closest to most of the input words. 
def find_highest_frequency(model, list_of_words):
    closest_words = []
    for word in list_of_words:
        words = model.similar_by_word(word, topn=50, restrict_vocab=None)
        words = [x[0] for x in words]
        closest_words = closest_words + words
    freq_count = Counter(chain(closest_words)).most_common(20)
    return [x[0] for x in freq_count]
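
One property of this frequency ranking is worth keeping in mind when reading the results: `Counter.most_common` orders elements with equal counts by first insertion, so candidates that appear only once are ranked in the order of the input words. The sketch below shows this on hand-made neighbour lists. The `toy_neighbours` dictionary and the `rank_by_frequency` helper are hypothetical stand-ins for the output of `similar_by_word`, not part of the notebook.

```python
from collections import Counter

# Hypothetical neighbour lists standing in for model.similar_by_word output.
toy_neighbours = {
    "visa": ["passport", "greencard", "tips"],
    "flight": ["airline", "tips", "runway"],
    "cruise": ["ship", "voyage", "airline"],
}

def rank_by_frequency(neighbours, input_words, topn):
    """Count how often each candidate appears across the neighbour lists."""
    counts = Counter()
    for word in input_words:
        counts.update(neighbours[word])
    # Ties are returned in the order the candidates were first counted.
    return counts.most_common(topn)

ranking = rank_by_frequency(toy_neighbours, ["visa", "flight", "cruise"], topn=4)
print(ranking)
# [('tips', 2), ('airline', 2), ('passport', 1), ('greencard', 1)]
```

The two shared candidates come first. The remaining slots are filled by neighbours of `visa`, the first input word, even though neighbours of `flight` and `cruise` have the same count. When few candidates are shared between input words, the top of the list can therefore be dominated by the neighbours of whichever word comes first.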

### Themes
For the sake of a qualitative analysis, we chose seven popular themes.

#### Theme 1: Animals 
This is the most basic input, where the goal would be to come up with more animal names.


1. Define the animal words list. Input words are taken from a crossword.

In [0]:
words_list = ['galapagos', 'gecko','leopard','octopus','horse','fangs','numbat','egg',
'milli','bee','raccoon','eel','dolphin','dodo','dhole','goat','hyena','seahorse','lions']

In [0]:
words_list = ['Porbandar', 'Putli_Bai', 'Ram', 'Time', 'India', 'Aga', 'Abdul'
              , 'soul', 'Charkha', 'butter', 'lawyer', 'Naidu', 'railway'
              , 'quit', 'laugh', 'water', 'earth', 'evil', 'Dandi']

2. Filter words not in vocabulary.

In [0]:
filtered_words_list = filter_words_not_in_vocab(model, words_list)
words_not_in_vocab = set(words_list) - set(filtered_words_list)
print("Following {} words not in vocab:".format(len(words_not_in_vocab)), words_not_in_vocab)

Following 1 words not in vocab: {'Putli_Bai'}


  


3. Compute most similar words using metric 1.

In [0]:
pprint(model.most_similar(positive=filtered_words_list, topn=20))



[('karma_bhoomi', 0.6687738299369812),
 ('Birwa', 0.6614679098129272),
 ('Lalooji', 0.6578205823898315),
 ('Shastriji', 0.654352068901062),
 ('By_Riyanki_Das', 0.6538019776344299),
 ('Photo_C._Ratheesh', 0.6506340503692627),
 ('Dutt_saab', 0.6500939130783081),
 ('AP_Photo_Bikas', 0.649888277053833),
 ('Subhendu', 0.6484553217887878),
 ('sanyasi', 0.6483239531517029),
 ('Chandra_Bhan', 0.6478111743927002),
 ('Gurdas', 0.647689938545227),
 ('Bhapkar', 0.6473879218101501),
 ('Shivnath', 0.6459715962409973),
 ('Ram_Dev', 0.6449568867683411),
 ('Masterji', 0.6449494361877441),
 ('Narendrabhai', 0.6432783603668213),
 ('Pejavar_Mutt', 0.6429892778396606),
 ('Khan_saab', 0.6411842107772827),
 ('Debaprasad', 0.6405091285705566)]


4. Compute most similar words using metric 2.

In [0]:
pprint(find_highest_frequency(model, filtered_words_list))



['Gujarat',
 'Porbander',
 'Bhavnagar',
 'Valsad',
 'Junagadh',
 'Navsari',
 'Bharuch',
 'Ramanathapuram',
 'Amreli',
 'Bhatkal',
 'Visakhapatnam',
 'Hubli',
 'Alappuzha',
 'Kollam',
 'Veraval',
 'Jamnagar',
 'Valsad_district',
 'Nalgonda',
 'Kolhapur',
 'Bhadrak']


##### Comments
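Overall, the results for the animal list are quite good: the model suggested a variety of other animals, such as turtles, otters, critter and rhinoceros. Note that the outputs shown above come from the second `words_list` cell, which overwrote the animal list before the filtering step ran.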

#### Theme 2: Travel 
Here, the words are related but do not fall in the same class, e.g., “FLIGHT” and “VISA”.


1. Define the words list. Input words are taken from a [crossword](https://amuselabs.com/pmm/crossword?id=3f5ecc40&set=ff4e9d1bdcc8f5a6231b9bf907d68d4c697315c0e869c9d9c19cbc37a9f855ae&test=1).

In [0]:
words_list = ['visa', 'flight', 'cruise', 'quad', 'tip', 'eta', 'agent',
              'day', 'non', 'class', 'inter', 'boat', 'vacation', 'open', 'plan', 'asia'
              ,'air', 'frequent', 'cab', 'guide', 'transit', 'terra']

2. Filter words not in vocabulary.

In [0]:
filtered_words_list = filter_words_not_in_vocab(model, words_list)
words_not_in_vocab = set(words_list) - set(filtered_words_list)
print("Following {} words not in vocab:".format(len(words_not_in_vocab)), words_not_in_vocab)

Following 0 words not in vocab: set()


  


3. Compute most similar words using metric 1.

In [0]:
pprint(model.most_similar(positive=filtered_words_list, topn=20))



[('Reykjavik_Excursions', 0.5865437984466553),
 ('Volcanic_ash_halts', 0.5828834772109985),
 ('Muscat_Fujairah', 0.5724595785140991),
 ('Priced_separately', 0.5717782974243164),
 ('Dense_fog_disrupts', 0.5714766979217529),
 ('China_quarantined_planeloads', 0.5711870193481445),
 ('Complimentary_childcare', 0.5675526857376099),
 ('Fog_disrupts', 0.5640560388565063),
 ('Antimatter_detector', 0.5639382600784302),
 ('Flyaway_Tours', 0.5624248385429382),
 ('KF_OOE', 0.5609842538833618),
 ('ships_Azamara_Journey', 0.5543647408485413),
 ('Complimentary_valet', 0.5532411932945251),
 ('Aegean_Odyssey', 0.551947295665741),
 ('Fabio_complimented', 0.5478428602218628),
 ('deluxe_motorcoach', 0.5466270446777344),
 ('Metro_Trip_Planner', 0.5443207025527954),
 ('Aguadilla_Ponce', 0.5404565930366516),
 ('Qantas_flight_QF###', 0.5404477715492249),
 ('%_#F########_3v.jsn', 0.5395424365997314)]


4. Compute most similar words using metric 2.

In [0]:
pprint(find_highest_frequency(model, filtered_words_list))



['tips',
 'firme',
 'seu',
 'daily',
 'visas',
 'passport',
 'tourist_visa',
 'Visas',
 'tourist_visas',
 'nonimmigrant_visa',
 'permanent_residency',
 'Schengen_visa',
 'Schengen_visas',
 'nonimmigrant_visas',
 'greencard',
 'passports',
 'issuing_visas',
 'asylum',
 'H1B_visa',
 'visa_waiver']


##### Comments
The results from metric 1 are not relevant: the input words are quite generic, and the results look like phrases from news headlines. The results from metric 2 are more relevant, with words like passport and greencard.
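
Some of the metric 1 results above are tokenization debris rather than words (for example `Qantas_flight_QF###`). One possible post-processing step, which this notebook does not apply, is to drop tokens containing characters other than letters, digits and underscores. The following is a minimal sketch under that assumption. The `drop_noisy_tokens` helper and its pattern are hypothetical, and the sample tuples are copied from the output above.

```python
import re

# Assumed heuristic: keep only tokens made of ASCII letters, digits and underscores.
CLEAN_TOKEN = re.compile(r"[A-Za-z0-9_]+")

def drop_noisy_tokens(results):
    """Filter (word, score) pairs, keeping words that match CLEAN_TOKEN entirely."""
    return [(word, score) for word, score in results if CLEAN_TOKEN.fullmatch(word)]

# A few (word, score) pairs taken from the metric 1 output for this theme.
sample = [
    ('Reykjavik_Excursions', 0.5865437984466553),
    ('deluxe_motorcoach', 0.5466270446777344),
    ('Qantas_flight_QF###', 0.5404477715492249),
    ('%_#F########_3v.jsn', 0.5395424365997314),
]

cleaned = drop_noisy_tokens(sample)
print(cleaned)
```

This heuristic is deliberately strict: it would also discard legitimate tokens that contain punctuation, such as file names or contractions, so the pattern may need loosening depending on the theme.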

#### Theme 3: Sports
This is a more diverse domain, where the words might be a specific method of dismissal in cricket or the name of a cricket stadium.

1. Define the words list. Input words are taken from a [crossword](http://cdn3.amuselabs.com/hindu/crossword?id=1e05e717&set=hindu-sport-on).

In [0]:
words_list = ['leg_before_wicket', 'aga', 'tracy', 'in_swing', 'dominic_thiem', 'bars', 'sascha', 'bishan_bedi', 'ryder_cup', 'bradshaw', 'rene_arnoux',
              'ayrton', 'bean', 'bayern_munich', 'detroit', 'targa', 'ted', 'preakness_stakes',
              'vandesar', 'rabada', 'erasmus', 'dan_carter', 'baton', 'hernan', 'zola', 'franck', 'juan_martin', 'rhythmic', 
              'juventus', 'white_shark', 'andres', 'cash', 'one_day', 'umaga', 'elizabeth', 'thistle',
              'agassi', 'wankhede']

2. Filter words not in vocabulary.

In [0]:
filtered_words_list = filter_words_not_in_vocab(model, words_list)
words_not_in_vocab = set(words_list) - set(filtered_words_list)
print("Following {} words not in vocab:".format(len(words_not_in_vocab)), words_not_in_vocab)

Following 21 words not in vocab: {'wankhede', 'one_day', 'preakness_stakes', 'vandesar', 'zola', 'erasmus', 'white_shark', 'agassi', 'sascha', 'hernan', 'leg_before_wicket', 'franck', 'bradshaw', 'ayrton', 'rene_arnoux', 'dan_carter', 'juan_martin', 'dominic_thiem', 'in_swing', 'rabada', 'bishan_bedi'}


  


3. Compute most similar words using metric 1.

In [0]:
pprint(model.most_similar(positive=filtered_words_list, topn=20))



[('JeremyShockey_@', 0.7732208371162415),
 ('nikki', 0.7589539289474487),
 ('dpa_fp', 0.7588638067245483),
 ('newsdesk@afxnews.com_rw', 0.7524714469909668),
 ('SheldenWilliams_@', 0.749648928642273),
 ('dpa_jbp', 0.7483022809028625),
 ('newsdesk@afxnews.com_gh', 0.7479081749916077),
 ('dpa_cp', 0.746633768081665),
 ('newsdesk@afxnews.com_bk', 0.7439806461334229),
 ('dpa_fm', 0.7435202598571777),
 ('dpa_rt', 0.7434886693954468),
 ("o'brien", 0.7432399988174438),
 ('checkauth_root', 0.7429306507110596),
 ('kulhadd', 0.739621639251709),
 ('newsdesk@afxnews.com_fp', 0.7377362251281738),
 ('dpa_amc', 0.7372328639030457),
 ('dpa_dc', 0.7344260215759277),
 ('dpa_si', 0.7329121232032776),
 ('metro_philadelphia', 0.7322756052017212),
 ('juss', 0.7320501804351807)]


4. Compute most similar words using metric 2.

In [0]:
pprint(find_highest_frequency(model, filtered_words_list))



['lori',
 'gilbert',
 'inter_milan',
 'ra',
 'te',
 'ving',
 '¬_tion',
 'ta',
 'bri',
 'ys',
 'danielle',
 'nigel',
 'gordon',
 'jesse',
 'michele',
 'susan',
 'joel',
 'ryan',
 'jensen',
 'mitchell']


##### Comments
Most of the input words (21 of 38) were not in the vocabulary. The top metric 1 results are still loosely relevant: Jeremy Shockey is a well-known American football player, and nikki may refer to Nikki Bella, a professional wrestler. The remaining results are less relevant.
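
The outputs in this notebook contain both `visas` and `Visas`, so vocabulary lookups appear to be case-sensitive, and some lowercase inputs may be missing only because of their casing. A minimal sketch of one way to check this is to try a few case variants of each word before discarding it. The `resolve_case_variant` helper and the `toy_vocab` set are hypothetical; which variants the real model contains is not assumed here.

```python
def resolve_case_variant(word, vocab):
    """Return the first case variant of `word` found in `vocab`, or None."""
    # str.title() capitalizes each underscore-separated part, e.g. 'ryder_cup' -> 'Ryder_Cup'.
    for candidate in (word, word.lower(), word.title(), word.upper()):
        if candidate in vocab:
            return candidate
    return None

# Hypothetical vocabulary used only to exercise the helper.
toy_vocab = {"Agassi", "Ryder_Cup", "detroit", "ETA"}

resolved = {
    word: resolve_case_variant(word, toy_vocab)
    for word in ["agassi", "ryder_cup", "detroit", "eta", "rabada"]
}
print(resolved)
```

With a real model, the second argument would be the vocabulary mapping (for instance `model.key_to_index` in gensim 4.x), and words that resolve to `None` would still be filtered out.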

#### Theme 4: Entertainment
This theme covers mass pop culture, including music, movies and celebrities.

1. Define the words list. Input words are taken from a [crossword](http://www.clarity-media.co.uk/celebrity-crossword-sample.pdf).

In [0]:
words_list = ['rachel', 'rovers', 'the_game', 'inred', 'itch',
              'apple', 'harry', 'fish', 'eared', 'trigger', 'ursula',
              'helena', 'britain', 'screech', 'cedar', 'gorilla',
              'keira', 'aside', 'jesus', 'cross', 'gandalf', 'desperate',
              'piper', 'dinglers', 'aside', 'sherman']

2. Filter words not in vocabulary.

In [0]:
filtered_words_list = filter_words_not_in_vocab(model, words_list)
words_not_in_vocab = set(words_list) - set(filtered_words_list)
print("Following {} words not in vocab:".format(len(words_not_in_vocab)), words_not_in_vocab)

Following 6 words not in vocab: {'the_game', 'dinglers', 'keira', 'helena', 'gandalf', 'inred'}


  


3. Compute most similar words using metric 1.

In [0]:
pprint(model.most_similar(positive=filtered_words_list, topn=20))



[('checkauth_root', 0.6856364011764526),
 ('fukin', 0.6675687432289124),
 ('lubber', 0.6608096361160278),
 ('AdrianneCurry_@', 0.6577449440956116),
 ('samantharonson_@', 0.6558294892311096),
 ('lizzie', 0.6544391512870789),
 ('nikki', 0.6455890536308289),
 ('JeremyShockey_@', 0.642753541469574),
 ('SheldenWilliams_@', 0.6425633430480957),
 ('fookin', 0.636864185333252),
 ('questlove_@', 0.6367624998092651),
 ('yella', 0.6311221122741699),
 ('Hee_hee', 0.6300792694091797),
 ('hev', 0.6293219327926636),
 ('Whadda', 0.6259427070617676),
 ("Ah'm", 0.6252166032791138),
 ('Ooh_ooh', 0.6230376958847046),
 ('willie', 0.6221100687980652),
 ('pish', 0.621886670589447),
 ('sh1t', 0.6218252778053284)]


4. Compute most similar words using metric 2.

In [0]:
pprint(find_highest_frequency(model, filtered_words_list))



['nigel',
 'harvey',
 'jesse',
 'cohen',
 'nikki',
 'derek',
 'michele',
 'denise',
 'louise',
 'becky',
 'dave',
 'whitney',
 'angelina',
 'miley',
 'selena',
 'joan',
 'hitler',
 'asides',
 'apart',
 'up']


##### Comments
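The results from metric 1 don’t make much sense: they are mostly slang, interjections and Twitter-style handles. The results from metric 2 make more sense, as they are mostly first names, including celebrity names such as miley and selena.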

#### Theme 5: Medical
Medical words are quite technical and specialized, so we want to see the performance of the model in complex domains.

1. Define the words list. Input words are taken from a [crossword](https://wordmint.com/public_puzzles/73374).

In [0]:
words_list = ['cell', 'midsagittal', 'diaphragm', 'yellow', 'suppurative',
              'disease', 'ad', 'idiopathic', 'ventro', 'genesis', 'resection',
              'adhesion','pyogenic_infection', 'steato','melanoma',
              'bile_duct', 'digestion', 'adcess', 'meal', 'ascites',
              'anorexia', 'bariatic', 'flatus', 'melena', 'bilirubin',
              'hernia', 'rectostenosis', 'cervical', 'internal_organs', 'anastomosis', 'binding',
              'biopsy']

2. Filter words not in vocabulary.

In [0]:
filtered_words_list = filter_words_not_in_vocab(model, words_list)
words_not_in_vocab = set(words_list) - set(filtered_words_list)
print("Following {} words not in vocab:".format(len(words_not_in_vocab)), words_not_in_vocab)

Following 8 words not in vocab: {'steato', 'adcess', 'ventro', 'midsagittal', 'bariatic', 'melena', 'rectostenosis', 'pyogenic_infection'}


  


3. Compute most similar words using metric 1.

In [0]:
pprint(model.most_similar(positive=filtered_words_list, topn=20))



[('spermatic_cord', 0.7693666815757751),
 ('fibroma', 0.7680847644805908),
 ('hamartoma', 0.7637079954147339),
 ('Intussusception', 0.7620971202850342),
 ('leiomyoma', 0.7620893716812134),
 ('anal_canal', 0.7601139545440674),
 ('transfusion_syndrome', 0.7588125467300415),
 ('lichen_planus', 0.7579723596572876),
 ('carcinoid_tumor', 0.7569707632064819),
 ('hamartomas', 0.7556929588317871),
 ('pre_cancerous_lesion', 0.7542682886123657),
 ('mesenteric', 0.7541995048522949),
 ('leiomyomas', 0.7539600133895874),
 ('mucinous', 0.7507759928703308),
 ('calculi', 0.7505994439125061),
 ('lipomas', 0.7459940910339355),
 ('pannus', 0.74488365650177),
 ('cystic_lesions', 0.743402361869812),
 ('pseudoaneurysm', 0.7431955933570862),
 ('tumoral', 0.7426878213882446)]


4. Compute most similar words using metric 2.

In [0]:
pprint(find_highest_frequency(model, filtered_words_list))



['spermatic_cord',
 'leiomyomas',
 'pericardium',
 'anal_canal',
 'ureteral',
 'uterus',
 'inferior_vena_cava',
 'chest_cavity',
 'lungs',
 'cardiac_tamponade',
 'granulomatous',
 'haematuria',
 'leukoencephalopathy',
 'pericardial_effusions',
 'hemoptysis',
 'cancer',
 'cancers',
 'cervical_cancer',
 'hypoplasia',
 'pericardial_effusion']


##### Comments
The results here are surprisingly good, with many medical terms such as fibroma and intussusception coming up.

#### Theme 6: Himalayas
This is not a complex domain, but a very specialized one, and we want to see how the model will create interesting words around it.

1. Define the words list. Custom input words are chosen.

In [0]:
words_list = ['Himalayas', 'everest', 'nepal', 'Ganges', 'hinduism', 'ranges', 'basin',
              'Ice_age', 'glacier', 'monsoon', 'snow_leopard', 'Bay_of_Bengal']

2. Filter words not in vocabulary.

In [0]:
filtered_words_list = filter_words_not_in_vocab(model, words_list)
words_not_in_vocab = set(words_list) - set(filtered_words_list)
print("Following {} words not in vocab:".format(len(words_not_in_vocab)), words_not_in_vocab)

Following 2 words not in vocab: {'Bay_of_Bengal', 'Ice_age'}


  


3. Compute most similar words using metric 1.

In [0]:
pprint(model.most_similar(positive=filtered_words_list, topn=20))



[('Gaumukh', 0.7075930237770081),
 ('Gangotri_glacier', 0.7022626996040344),
 ('Shivaliks', 0.6688489317893982),
 ('himalayan', 0.6661850214004517),
 ('Khumbu_Glacier', 0.6643978357315063),
 ('Dhauladhar_range', 0.658578634262085),
 ('Pir_Panjal_range', 0.655968427658081),
 ('Garhwal_Himalayas', 0.6536864638328552),
 ('Shivling', 0.6533980369567871),
 ('Eastern_Ghats', 0.6509689092636108),
 ('Gangotri_Glacier', 0.6464487314224243),
 ('Himalaya_Mountains', 0.6434546709060669),
 ('Nubra_valley', 0.6424018144607544),
 ('Karakoram_mountain', 0.640743613243103),
 ('Khumbu_glacier', 0.6397353410720825),
 ('Mansarovar_lake', 0.6384942531585693),
 ('Tsangpo', 0.6374609470367432),
 ('Shivalik_hills', 0.6358055472373962),
 ('Gangetic_plains', 0.6355847716331482),
 ('Changthang', 0.6321537494659424)]


4. Compute most similar words using metric 2.

In [0]:
pprint(find_highest_frequency(model, filtered_words_list))



['Khumbu_glacier',
 'Mera_Peak',
 'Khumbu_Glacier',
 'Shishapangma',
 'glaciers',
 'Mt_Everest',
 'mohammad',
 'sinhala',
 'rivers',
 'Gangotri_glacier',
 'Himalayan_Mountains',
 'Himalaya_Mountains',
 'Himalayan',
 'Himalayan_mountains',
 'Tibetan_plateau',
 'Karakoram_mountains',
 'Himalayan_peaks',
 'Himalaya',
 'Karakoram_mountain',
 'Nanda_Devi']


##### Comments
The results are good, as a variety of related ranges such as Shivaliks and Pir Panjal are recommended. Also, words like Shivling are suggested, which is quite good.

#### Theme 7: Computer Science
Let us see what the model thinks about the old and new tech in computer science :)

1. Define the words list. Input words are taken from a [crossword](https://www.proprofs.com/games/crossword/computer-science/).

In [0]:
words_list = ['RAM', 'backspace', 'LAN', 'keyboard', 'cursor', 'bookmark',
              'Zuckerberg', 'Jobs', 'Babbage', 'bug', 'Eniac',
              'hardware', 'CD_ROM', 'font', 'file', 'virus', 'URL']

2. Filter words not in vocabulary.

In [0]:
filtered_words_list = filter_words_not_in_vocab(model, words_list)
words_not_in_vocab = set(words_list) - set(filtered_words_list)
print("Following {} words not in vocab:".format(len(words_not_in_vocab)), words_not_in_vocab)

Following 0 words not in vocab: set()


  


3. Compute most similar words using metric 1.

In [0]:
pprint(model.most_similar(positive=filtered_words_list, topn=20))



[('exe_file', 0.7292200922966003),
 ('htaccess', 0.7170591354370117),
 ('plist', 0.7134256958961487),
 ('ftp_server', 0.7070534825325012),
 ('XAMPP', 0.706710159778595),
 ('plist_files', 0.7057288885116577),
 ('exe_files', 0.7049909830093384),
 ('System_Configuration_Utility', 0.7033084630966187),
 ('xterm', 0.7014968991279602),
 ('pagefile', 0.6963300704956055),
 ('regedit', 0.6935948133468628),
 ('Server_Admin', 0.6906064748764038),
 ('sdcard', 0.6888519525527954),
 ('Dropbox_folder', 0.6874693632125854),
 ('update.zip', 0.6871559619903564),
 ('DLL_files', 0.6868031024932861),
 ('EXE_file', 0.6861448287963867),
 ('VS.NET', 0.6847630143165588),
 ('taskbar_icon', 0.6846297383308411),
 ('crontab', 0.6844176054000854)]


4. Compute most similar words using metric 2.

In [0]:
pprint(find_highest_frequency(model, filtered_words_list))



['textbox',
 'spacebar',
 'cursor_keys',
 'Ctrl_+_Shift',
 'keyboard_shortcut',
 'Fn_key',
 'hotkey',
 'alphanumeric_keys',
 'numberpad',
 'numeric_pad',
 'numpad',
 'arrow_keys',
 'numeric_keypad',
 'stylus',
 'scroll_wheel',
 'thumbnail_preview',
 'Bookmarks_menu',
 'hotlink',
 'favicon',
 'Ballmer']


##### Comments
The results are good overall, since computer-related terms such as executable files and SD cards are suggested.