<a href="https://colab.research.google.com/github/charu13a/knowledge-games/blob/word2vec/CoOp_crossword2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook shows how to use word2vec to generate words related to the answers of a crossword. The input is the answer words together with their clues, which supply additional context.
Skip to final results [here](https://colab.research.google.com/drive/1Ocb3I-ZmlR3sneE90Bf0cJXMgHKbiJ_A#scrollTo=hHEzyraeT1EX). 

# Crossword words using word2vec

### Setup

Download the pre-trained word vectors (about 1.5 GB); this takes a minute or so.

In [1]:
!wget -P /root/input/ -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2019-10-04 06:23:06--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.24.118
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.24.118|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘/root/input/GoogleNews-vectors-negative300.bin.gz’


2019-10-04 06:23:36 (63.0 MB/s) - ‘/root/input/GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



Install gensim, an NLP library that we will use to load the word2vec embeddings.

In [2]:
!pip install gensim
from gensim.models import KeyedVectors



In [3]:
EMBEDDING_FILE = '/root/input/GoogleNews-vectors-negative300.bin.gz' # from above
model = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)



Add logging support.

In [0]:
import logging
from pprint import pprint # pretty print output
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Problem
The goal is to find interesting related words given a set of words **and clues**, i.e. understand the ‘theme’ and get an effective ranking of the words.

Before we start, let us define an auxiliary function that removes words that are not present in the model's vocabulary.

In [0]:
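# filters out words that are not in the model vocabulary
def filter_words_not_in_vocab(model, list_of_words):
    # `word in model` checks the vocabulary directly; this avoids the
    # `model.wv` and `.vocab` attributes, which newer gensim releases
    # deprecate or remove
    return [word for word in list_of_words if word in model]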

Additionally, we need to remove candidate words that share a root with any input word, or that contain (or are contained in) an input word.

In [0]:
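from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

# Returns a predicate for filter(): True keeps the word, False drops it.
def filter_words_with_same_root(input_words):
  def should_keep(word):
    for x in input_words:
      # drop words that share a stem with, contain, or are contained in an input word
      if stemmer.stem(x) == stemmer.stem(word) or x in word or word in x:
        return False
    return True
  return should_keep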

### Metric for ranking: Frequency count
This method finds the words that occur most often among the top-N nearest neighbours (by cosine similarity) of the input words.


In [0]:
from collections import Counter

# Takes a list of words and returns the `nwords` words (20 by default) that
# appear most often among the 50 nearest neighbours of each input word.
def find_highest_frequency(model, list_of_words, nwords=20):
    closest_words = []
    for word in list_of_words:
        # similar_by_word returns (word, cosine similarity) pairs
        neighbours = model.similar_by_word(word, topn=50, restrict_vocab=None)
        closest_words += [x[0] for x in neighbours]
    freq_count = Counter(closest_words).most_common(nwords)
    return [x[0] for x in freq_count]
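
To see how the frequency-count ranking behaves without loading the 1.5 GB model, here is a minimal sketch. The neighbour table `TOY_NEIGHBOURS` and the helper `rank_by_neighbour_frequency` are hypothetical stand-ins for `model.similar_by_word`, and the words in the table are made up for illustration.

```python
from collections import Counter

# Hypothetical neighbour table standing in for model.similar_by_word();
# the words and their neighbours are made up for illustration.
TOY_NEIGHBOURS = {
    "salt": ["sea", "march", "pepper"],
    "beach": ["sea", "sand", "coast"],
    "march": ["protest", "sea", "rally"],
}

def rank_by_neighbour_frequency(neighbours, input_words, nwords=3):
    """Count how often each candidate appears across all neighbour lists."""
    counts = Counter()
    for word in input_words:
        counts.update(neighbours.get(word, []))
    return counts.most_common(nwords)

toy_ranking = rank_by_neighbour_frequency(TOY_NEIGHBOURS, ["salt", "beach", "march"])
print(toy_ranking[0])  # ('sea', 3): "sea" is a neighbour of all three inputs
```

In this toy table an input word ("march") also shows up as a candidate, because it is a neighbour of another input word. That is one reason the results are passed through the same-root filter afterwards.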

## Gandhi 150

### Results without clues fed to word2vec

1. Define the input words list. Input words are taken from a crossword.

In [0]:
words_list = ['Porbandar', 'Putli_Bai', 'Ram', 'Time', 'India', 'Aga', 'Abdul'
              , 'soul', 'Charkha', 'butter', 'lawyer', 'Naidu', 'railway'
              , 'quit', 'laugh', 'water', 'earth', 'evil', 'Dandi']

2. Filter words not in vocabulary.

In [9]:
filtered_words_list = filter_words_not_in_vocab(model, words_list)
words_not_in_vocab = set(words_list) - set(filtered_words_list)
print("Following {} words not in vocab:".format(len(words_not_in_vocab)), words_not_in_vocab)

Following 1 words not in vocab: {'Putli_Bai'}


  


3. Compute most similar words using our metric.

In [10]:
closest_words = find_highest_frequency(model, filtered_words_list)
closest_words = list(filter(filter_words_with_same_root(words_list), closest_words))
pprint(closest_words)

2019-10-04 06:25:22,318 : INFO : precomputing L2-norms of word weight vectors


['Gujarat',
 'Porbander',
 'Bhavnagar',
 'Valsad',
 'Junagadh',
 'Navsari',
 'Bharuch',
 'Amreli',
 'Bhatkal',
 'Visakhapatnam',
 'Hubli',
 'Alappuzha',
 'Kollam',
 'Veraval',
 'Jamnagar',
 'Valsad_district',
 'Nalgonda',
 'Kolhapur',
 'Bhadrak']


### Adding clues to words.
Next, we want to see whether we can improve the results by adding clue information to the words.
1. Define clue list.

In [0]:
clue_list = ["Gandhi's birthplace", 
             "Gandhi's mother", 
             "Hey : Gandhi's last words", 
             "Gandhi was Magazine's Man of the Year in 1930",
             "Young : A journal published by Gandhi", 
             "Gandhi and Kasturba were jailed at Khan palace",
             "Khan Gaffar Khan was also known as Frontier Gandhi",
             "Mahatma means Great", 
             "The spinning wheel made iconic by Gandhi",
             "The villagers want bread not: quote by Gandhi",
             "Gandhi's profession in South Africa",
             "Sarojni became president of the Indian National Congress after Gandhi",
             "Gandhi was thrown out of the train at Pietermaritzburg Station",
             "Gandhi started the India Movement in 1942",
             "First they ignore you, then they at you, then they fight you, then you win: quote by Gandhi",
             "We may not be God, but we are of God, even as a little drop is of the ocean: quote by Gandhi",
             "provides enough to satify every man's needs, but not every man's greed: quote by Gandhi",
             "Good and are found together: quote by Gandhi",
             "Gandhi led the Salt March to this beach"]

2. Next, add a function that extracts the nouns from a clue. We will use TextBlob, which builds on [nltk](https://en.wikipedia.org/wiki/Natural_Language_Toolkit), a suite of Python libraries and programs for symbolic and statistical natural language processing (NLP) of English text.

In [12]:
from textblob import TextBlob
import nltk
nltk.download('punkt')
!python -m textblob.download_corpora

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [0]:
def extract_nouns(txt):
  return [w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N']
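
The `pos[0] == 'N'` test relies on the Penn Treebank convention that noun tags start with `N` (`NN`, `NNS`, `NNP`, `NNPS`). The sketch below applies the same filter to hand-written tags, so it runs without TextBlob. The tag list `toy_tags` is an assumption about what a tagger might return for the last clue, and `keep_nouns` is a hypothetical helper, not part of the notebook.

```python
# Hand-written (token, tag) pairs standing in for TextBlob(txt).pos_tags;
# the tags are an assumption about what a tagger might return.
toy_tags = [
    ("Gandhi", "NNP"), ("led", "VBD"), ("the", "DT"),
    ("Salt", "NNP"), ("March", "NNP"), ("to", "TO"),
    ("this", "DT"), ("beach", "NN"),
]

def keep_nouns(tagged_tokens):
    """Keep tokens whose tag starts with 'N' (NN, NNS, NNP, NNPS)."""
    return [token for token, tag in tagged_tokens if tag.startswith("N")]

toy_nouns = keep_nouns(toy_tags)
print(toy_nouns)  # ['Gandhi', 'Salt', 'March', 'beach']
```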

In [14]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import RegexpParser
from nltk import Tree

# Define a noun-phrase grammar and parser (raw string, so \w is not treated
# as a string escape). To use it, pass chunk_func=chunker.parse below.
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunker = RegexpParser(NP)

def get_continuous_chunks(text, chunk_func=ne_chunk):
    chunked = chunk_func(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for subtree in chunked:
        if isinstance(subtree, Tree):
            # a chunk (a named entity by default): extend the current run
            current_chunk.append(" ".join([token for token, pos in subtree.leaves()]))
        elif current_chunk:
            # a plain token ends the run; record the run if it is new
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
                current_chunk = []

    # Note: a run that ends the text is never recorded, because runs are
    # only recorded when a plain token follows them.
    return continuous_chunk

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


3. Extract nouns for each clue.

In [15]:
clue_nouns = list(map(extract_nouns, clue_list))
pprint(clue_nouns)

[['Gandhi', 'birthplace'],
 ['Gandhi', 'mother'],
 ['Hey', 'Gandhi', 'words'],
 ['Gandhi', 'Magazine', 'Man', 'Year'],
 ['Young', 'journal', 'Gandhi'],
 ['Gandhi', 'Kasturba', 'Khan', 'palace'],
 ['Khan', 'Gaffar', 'Khan', 'Frontier', 'Gandhi'],
 ['Mahatma', 'Great'],
 ['spinning', 'wheel', 'Gandhi'],
 ['villagers', 'quote', 'Gandhi'],
 ['Gandhi', 'profession', 'South', 'Africa'],
 ['Sarojni', 'president', 'National', 'Congress', 'Gandhi'],
 ['Gandhi', 'train', 'Pietermaritzburg', 'Station'],
 ['Gandhi', 'India', 'Movement'],
 ['quote', 'Gandhi'],
 ['God', 'God', 'drop', 'ocean', 'quote', 'Gandhi'],
 ['man', 'needs', 'man', 'greed', 'quote', 'Gandhi'],
 ['quote', 'Gandhi'],
 ['Gandhi', 'Salt', 'March', 'beach']]


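Let us try to extract multi-word phrases too. With the default `chunk_func=ne_chunk`, these are named entities rather than general noun phrases.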

In [16]:
clue_noun_phrases = list(map(get_continuous_chunks, clue_list))
pprint(clue_noun_phrases)

[['Gandhi'],
 ['Gandhi'],
 ['Gandhi'],
 ['Gandhi', 'Magazine'],
 ['Young'],
 ['Gandhi', 'Kasturba', 'Khan'],
 ['Khan Gaffar Khan'],
 ['Mahatma'],
 [],
 [],
 ['Gandhi'],
 ['Sarojni', 'Indian National Congress'],
 ['Gandhi'],
 ['Gandhi', 'India'],
 [],
 ['God'],
 [],
 ['Good'],
 ['Gandhi', 'Salt']]
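
Several clues above that end in "Gandhi" return an empty list. One likely contributor is the loop in `get_continuous_chunks`: a run of entities is only recorded when a non-entity token follows it, and `current_chunk` is only reset when the entity is new. The sketch below shows one way to flush a trailing run and always reset. It works on a simplified stand-in for the nltk tree (plain tokens are `(token, tag)` tuples, entity subtrees are lists of such tuples), so `collect_entities` and `toy_chunked` are hypothetical and the chunking shown is assumed, not actual `ne_chunk` output.

```python
def collect_entities(chunked):
    """Join consecutive entity subtrees, flushing at the end of the text too.

    `chunked` is a simplified stand-in for the nltk tree: a list whose items
    are either a (token, tag) tuple or a list of such tuples representing
    one named-entity subtree.
    """
    entities = []
    current = []

    def flush():
        if current:
            name = " ".join(current)
            if name not in entities:
                entities.append(name)
            current.clear()  # always reset, even for a repeated entity

    for item in chunked:
        if isinstance(item, list):   # entity subtree: extend the run
            current.append(" ".join(token for token, _tag in item))
        else:                        # plain token: the run ends here
            flush()
    flush()                          # do not lose a run that ends the text
    return entities

# Assumed chunking of "quote by Gandhi", with "Gandhi" as the only entity.
toy_chunked = [("quote", "NN"), ("by", "IN"), [("Gandhi", "NNP")]]
toy_entities = collect_entities(toy_chunked)
print(toy_entities)  # ['Gandhi']
```

Applying the same two changes to `get_continuous_chunks` would change the output listed above, so the recorded results are left as they were produced.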


Since the extracted phrases are sparse and miss several clues entirely, we stick to the plain nouns. Next, we add these nouns to the input words and see how the results change.

In [0]:
flattened_clue_list = [item for sublist in clue_nouns for item in sublist]
added_words_list = words_list + flattened_clue_list
#pprint(added_words_list)

Again, filter out the words not in vocabulary.

In [18]:
filtered_words_list = filter_words_not_in_vocab(model, added_words_list)
words_not_in_vocab = set(added_words_list) - set(filtered_words_list)
print("Following {} words not in vocab:".format(len(words_not_in_vocab)), words_not_in_vocab)

Following 1 words not in vocab: {'Putli_Bai'}


  


Compute most similar words.

In [19]:
closest_words = find_highest_frequency(model, filtered_words_list, nwords=40)
closest_words = list(filter(filter_words_with_same_root(added_words_list), closest_words))
pprint(closest_words)



['salt_satyagraha',
 'Karunanidhi',
 'Gadkari',
 'Joshi',
 'Vinoba',
 'Bapu',
 'Ambedkar',
 'Pandit_Nehru',
 'Swami_Vivekananda',
 'Tagore',
 'Hind_Swaraj',
 'Shri_Guruji',
 'Bhagat_Singh',
 'Advani',
 'Mayawati',
 'Modi',
 'Rahul',
 'Basu',
 'Mamata',
 'Subhash_Chandra_Bose',
 'Ghandi',
 'Vinoba_Bhave']


We can also try merging each word with the nouns from its own clue and asking the model for the single most similar word to each combined list.

In [20]:
# combine each word with the nouns from its clue, dropping out-of-vocabulary words
clues_words_combined = []
for word, nouns in zip(words_list, clue_nouns):
  clues_words_combined.append(filter_words_not_in_vocab(model, [word] + nouns))
# replace each combined list by its single most similar word
closest_words = []
for combined in clues_words_combined:
  words = model.most_similar(positive=combined, topn=1)
  closest_words += [w[0] for w in words]
pprint(closest_words)

  


['Kirti_Mandir',
 'grandmother',
 'Jai_Sri',
 'Life_Expectancy_Hits',
 'Sunil_Khilnani',
 'Bahadur_Shah',
 'Hussain',
 'Mahatma_Gandhi',
 'charkha',
 'Gandhiji',
 'Francois_Joubert',
 'Orissa_Pradesh',
 'railway_station',
 'Gandhiji',
 'quip',
 'Mans_REBELLION',
 'Manners_maketh',
 'Gandhiji',
 'Dandi_march']


We can go one step further and apply the frequency-count ranking to these words.

In [21]:
result_words = find_highest_frequency(model, closest_words, nwords=40)
result_words = list(filter(filter_words_with_same_root(added_words_list), result_words))
# remove potential duplicates
final_words = []
for word in result_words:
  is_similar = False
  for x in final_words:
    if stemmer.stem(x) == stemmer.stem(word) or x in word or word in x:
      is_similar = True
      break
  if not is_similar:
    final_words.append(word)
pprint(final_words)



['Sabarmati_Ashram',
 'Shivaji_Maharaj',
 'satyagraha',
 'Acharya_Vinoba_Bhave',
 'Swami_Vivekananda',
 'Bapu',
 'Ambedkar',
 'Hind_Swaraj',
 'Pandit_Nehru',
 'Sree_Narayana_Guru',
 'Tagore',
 'Bhagat_Singh',
 'Shri_Guruji',
 'Basavanna',
 'Pandit_Jawaharlal_Nehru',
 'Golwalkar']


In [22]:
model.most_similar(positive=["husband", "Kasturba"], negative=["Gandhi"], topn=1)



[('wife', 0.631434440612793)]
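
The query above reads as "Kasturba - Gandhi + husband". As far as I understand it, `most_similar` adds the unit-normalised vectors of the `positive` words, subtracts those of the `negative` words, and ranks the remaining vocabulary by cosine similarity to the result. The sketch below reproduces that idea in plain Python. `TOY_VECTORS` holds made-up 3-D vectors, and `toy_most_similar` is a hypothetical helper that approximates the behaviour; it is not gensim's implementation.

```python
import math

# Made-up 3-D vectors standing in for the 300-D word2vec embeddings.
TOY_VECTORS = {
    "man": (0.0, 1.0, 0.0),
    "woman": (0.0, 0.0, 1.0),
    "king": (1.0, 1.0, 0.0),
    "queen": (1.0, 0.0, 1.0),
    "palace": (1.0, 0.5, 0.5),
}

def normalise(vec):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return tuple(x / length for x in vec)

def toy_most_similar(vectors, positive, negative=()):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    unit = {word: normalise(vec) for word, vec in vectors.items()}
    dims = len(next(iter(unit.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, unit[word])]
    query = normalise(query)
    used = set(positive) | set(negative)
    scores = [
        (word, sum(q * x for q, x in zip(query, vec)))
        for word, vec in unit.items()
        if word not in used  # the query words themselves are excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

toy_result = toy_most_similar(TOY_VECTORS, positive=["king", "woman"], negative=["man"])
print(toy_result[0][0])  # queen
```

With only `positive` words, the same function returns the word closest to their combined direction, which is how the earlier step merges each crossword word with its clue nouns: `toy_most_similar(TOY_VECTORS, positive=["king", "queen"])` ranks "palace" first in this toy table.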

Let us try to establish relationships between words.

In [0]:
import numpy as np
def unit_vector(vector):
    """ Returns the unit vector of the vector.  """
    return vector / np.linalg.norm(vector)

def angle_between(v1, v2):
    """ Returns the angle in radians between vectors 'v1' and 'v2'::

            >>> angle_between((1, 0, 0), (0, 1, 0))
            1.5707963267948966
            >>> angle_between((1, 0, 0), (1, 0, 0))
            0.0
            >>> angle_between((1, 0, 0), (-1, 0, 0))
            3.141592653589793
    """
    v1_u = unit_vector(v1)
    v2_u = unit_vector(v2)
    return np.arccos(np.clip(np.dot(v1_u, v2_u), -1.0, 1.0))

def cosine_similarity(v1, v2):
    """ Returns the cosine of the angle between vectors 'v1' and 'v2'. """
    v1_u = unit_vector(v1)
    v2_u = unit_vector(v2)
    return np.dot(v1_u, v2_u)

In [24]:
a = "China"
b = "Beijing"
a_vector_list = model.most_similar(positive=[a], topn=10)  # neighbours of a
b_vector_list = model.most_similar(positive=[b], topn=10)  # neighbours of b
for v in a_vector_list:
  # solve the analogy a : b :: v : ?
  similar = model.most_similar(positive=[v[0], b], negative=[a], topn=1)
  similar = [x[0] for x in similar]
  for w in b_vector_list:
    # print the pair when the answer is also a neighbour of b
    if w[0] in similar:
      print(v[0], w[0])



Beijing Bejing
Taiwan Taipei
Chinas Bejing
Shanghai Guangzhou
Guangdong Guangzhou
Hong_Kong Shanghai
Shenzhen Guangzhou
