A couple of years back, I stumbled upon Grenville Kleiser's *Fifteen Thousand Useful Phrases*, a book consisting of a list of phrases that Kleiser had collected over the course of his lifetime. Since it was published in 1917, some of the phrases are a bit antiquated, but many of them stood out to me as quite eloquent or impressive, with uses ranging from public speaking to conversation to literary writing.

Though the book was quite useful, I often found it difficult to search through the entire list for the phrase that perfectly described what I wanted to say. Thus, in this project I set out to search the database of phrases using a metric of semantic similarity. That is, given an input word or phrase, I wanted to return a list of similar phrases from this database.

This book is available via Project Gutenberg at www.gutenberg.org/files/18362/18362.txt

We begin by parsing the book and extracting the phrases, using BeautifulSoup and regular expressions.

The phrases we would like are of the form
    `<p id="idxxxx">phrase</p>`
where xxxx is the number of the phrase, starting at 0082 and ending at 15977.

In [1]:
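from bs4 import BeautifulSoup
import re

phrase_filename = 'phrases.html'
with open(phrase_filename, 'r') as file:
    soup = BeautifulSoup(file, 'html.parser')

def is_phrase_tag(tag):
    '''Matches unstyled <p> tags whose id is "id" followed by a number from 82 to 15977.'''
    if tag.name != 'p' or tag.has_attr('style'):
        return False
    # Tags with a missing or non-numeric id are skipped instead of raising an error
    match = re.fullmatch(r'id(\d+)', tag.get('id', ''))
    return match is not None and 82 <= int(match.group(1)) <= 15977

phrase_list = soup.find_all(is_phrase_tag)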

Now we want to extract each phrase from the list of tags and remove the clarifications in square brackets to obtain our final list of phrases. We then save this list to a file for future access.

In [2]:
import pickle

hard_bracket_re = re.compile(r'\[.*?\]')  # non-greedy, so each [note] is removed on its own

def strip(phrase):
    '''
    Removes the clarifying brackets and the surrounding
    whitespace from an input phrase tag
    '''
    out = phrase.get_text()  # unlike .string, this is never None for tags with nested markup
    out = hard_bracket_re.sub('', out)
    return out.strip()

stripped_list = [strip(phrase) for phrase in phrase_list]
with open('parsed_phrases.txt', 'wb') as file_to_write:
    pickle.dump(stripped_list, file_to_write)
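
As a quick illustration of the bracket-stripping step, the sketch below runs the same kind of substitution on a single made-up entry. The sample string and the helper `strip_brackets` are hypothetical and not taken from the book or the code above. It also shows why a non-greedy pattern is the safer choice if an entry ever contains more than one bracketed note.

```python
import re

# Hypothetical entry with two bracketed clarifications (not from the book)
sample = '  abject [miserable] and forlorn [lonely]  '

greedy = re.compile(r'\[.*\]')       # matches from the first '[' to the last ']'
non_greedy = re.compile(r'\[.*?\]')  # matches each '[...]' on its own

def strip_brackets(text, pattern):
    '''Removes bracketed notes, then collapses leftover runs of whitespace.'''
    without_notes = pattern.sub('', text)
    return ' '.join(without_notes.split())

print(strip_brackets(sample, greedy))      # the words between the two notes are lost
print(strip_brackets(sample, non_greedy))  # only the notes themselves are removed
```

With the greedy pattern everything between the first opening bracket and the last closing bracket disappears, including ordinary words that happen to sit between two notes.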

Now we would like to analyze and compare our phrases by meaning. To do so, we use a pre-trained word2vec model released by Google, which maps many of the words found in Google News archives to vectors in a 300-dimensional real vector space, such that words with related meanings lie close together. In order to interface with this model, we use the gensim library.

In [3]:
from gensim import models
googledata_path = 'GoogleNews-vectors-negative300.bin.gz'
googledata_model = models.KeyedVectors.load_word2vec_format(googledata_path, binary=True)

In order to get the vector associated with a word we do the following:

In [4]:
from gensim import parsing
import numpy
word1 = 'happy'
word_vector1 = googledata_model[word1]
word2 = 'sad'
word_vector2 = googledata_model[word2]
word3 = 'oscilloscope'
word_vector3 = googledata_model[word3]
print(word_vector1[:10])

[-0.0005188   0.16015625  0.0016098   0.02539062  0.09912109 -0.0859375
  0.32421875 -0.02172852  0.13476562  0.11035156]


Above we show the first ten of the three hundred values in the vector for 'happy'.

A reasonable metric of similarity between two word vectors is the dot product. Two vectors which point in nearly the same direction will have a large dot product, while vectors which are orthogonal will have a dot product of zero, indicating the absence of a semantic relation. Since our vectors are not necessarily of unit length, we divide the dot product by the product of the two vectors' lengths, which gives the cosine of the angle between them (the cosine similarity).
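
To make the effect of normalizing concrete, here is a minimal sketch on made-up two-dimensional vectors. It is written in plain Python so that it runs without the model; the helper `cosine` and the vectors `a`, `b` and `c` are hypothetical names used only for this illustration.

```python
import math

def cosine(u, v):
    '''Dot product of u and v divided by the product of their lengths.'''
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy 2-dimensional vectors (made up for illustration)
a = [3.0, 4.0]
b = [6.0, 8.0]    # same direction as a, twice as long
c = [-4.0, 3.0]   # orthogonal to a

raw_dot_ab = sum(x * y for x, y in zip(a, b))
print(raw_dot_ab)     # the raw dot product grows with vector length
print(cosine(a, b))   # the normalized score depends only on direction
print(cosine(a, c))   # orthogonal vectors score zero
```

Doubling the length of `b` would double the raw dot product but leave the normalized score unchanged, which is the property we want when comparing meanings.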

In order to extend our similarity metric from words to phrases, we let the vector of a phrase be the arithmetic mean of its constituent word vectors.

In [5]:
def similarity(vector1, vector2):
    '''
    Returns the cosine similarity of two vectors, ranging from -1 to 1,
    where a score near 1 indicates high similarity and a score near or
    less than zero indicates low similarity. Note that a score near -1 does
    not indicate opposite meaning.
    '''
    return numpy.inner(vector1, vector2) / (numpy.linalg.norm(vector1) * numpy.linalg.norm(vector2))

def phrase_vector(phrase):
    '''
    Turns a phrase into a vector formed by taking the arithmetic mean
    of its constituent word vectors. Words missing from the model count
    as zero vectors. Returns None if no word of the phrase is in the model.
    '''
    # Stopwords are kept; to drop them, filter against parsing.preprocessing.STOPWORDS
    phrase_words = phrase.lower().split()
    total = numpy.zeros(300)  # each word in the GoogleNews model is a 300-dimensional vector
    known_words = 0
    for word in phrase_words:
        try:
            total += googledata_model[word]
            known_words += 1
        except KeyError:  # the word is not in the model's vocabulary
            continue
    if known_words == 0:
        return None
    return total / len(phrase_words)
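
The averaging step can be tried out without loading the model. The sketch below uses a hypothetical two-dimensional `toy_model` dictionary and a helper `toy_phrase_vector`, both invented for this illustration, and assumes a non-empty phrase.

```python
# Hypothetical 2-dimensional "model"; the real one has 300 dimensions
toy_model = {
    'old': [1.0, 0.0],
    'wise': [0.0, 1.0],
}

def toy_phrase_vector(phrase, model, dimensions=2):
    '''Mean of the word vectors; unknown words contribute a zero vector.'''
    words = phrase.lower().split()
    total = [0.0] * dimensions
    for word in words:
        vector = model.get(word, [0.0] * dimensions)
        total = [t + x for t, x in zip(total, vector)]
    return [t / len(words) for t in total]

print(toy_phrase_vector('old wise', toy_model))      # both words are known
print(toy_phrase_vector('old wise owl', toy_model))  # 'owl' is unknown
```

An unknown word shortens the mean vector but does not change its direction, so it has no effect on a similarity score that is normalized by vector length.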

Using our similarity metric from above, we see that the words 'happy' and 'sad' are fairly similar, as both are emotions. The word 'oscilloscope', however, is unrelated to 'happy' and thus has a similarity score near zero when compared to it.

In [6]:
print(similarity(word_vector1,word_vector2))
print(similarity(word_vector1,word_vector3))

0.535461
-0.0108541


Now we read the list of fifteen thousand phrases from before and convert the list of phrases into a list of phrase vectors as described above.

In [7]:
phrases_path = 'parsed_phrases.txt'
with open(phrases_path, 'rb') as file_to_read:
    interesting_phrases = pickle.load(file_to_read)
interesting_phrases_vectors = [phrase_vector(phrase) for phrase in interesting_phrases]

Finally, we define a function that returns the phrases from our list of fifteen thousand that are most semantically similar to an input phrase.

In [8]:
def get_comparisons(phrase, number):
    '''
    Returns up to `number` phrases from interesting_phrases that are
    most similar to the input phrase, most similar first.
    '''
    vector_to_compare = phrase_vector(phrase)
    # Score every phrase that has a vector, skipping the ones that are None
    scored = [(index, similarity(vector, vector_to_compare))
              for index, vector in enumerate(interesting_phrases_vectors)
              if vector is not None]
    scored.sort(key=lambda item: -item[1])
    return [interesting_phrases[index] for index, _ in scored[:number]]
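
Since only the top few matches are needed, one possible alternative to sorting the whole list is `heapq.nlargest` from the standard library. This is a sketch rather than what the function above does: the helper `top_matches` is hypothetical, and the scores are made-up numbers attached to four phrases from the book.

```python
import heapq

# Hypothetical precomputed similarity scores, one per phrase
phrases = ['old and decrepit', 'sprightly talk', 'ancient and venerable', 'termagant wife']
scores = [0.71, 0.12, 0.64, 0.05]

def top_matches(phrases, scores, number):
    '''Returns up to `number` highest-scoring phrases, best first.'''
    best = heapq.nlargest(number, zip(scores, phrases))
    return [phrase for _, phrase in best]

print(top_matches(phrases, scores, 2))
```

For a list of roughly fifteen thousand phrases a full sort is already fast, so this mostly matters if the list grows or if queries are run very often.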

Now using the above function, we search through our list of phrases as desired.

In [9]:
print(get_comparisons("old", 30))
print(get_comparisons("skeptical quiet hesitant slow", 30))

  


['old and decrepit', 'childhood, youth, manhood, and age', 'man of iron', 'A man of imperious will', 'juvenile and budding', 'impoverished age', 'A fitful boy full of dreams and hopes', 'A grave man of pretending exterior', 'All embrowned and mossed with age', 'precocious wisdom', 'enfeebled by age', 'guide, philosopher, and friend', 'comely and vivacious', 'problematic age', 'He was giving his youth away by handfuls', 'Beguiled the weary soul of man', 'schooled in self-restraint', 'termagant wife', 'He was born to a lively and intelligent patriotism', 'boyish appreciation', 'senile sensualist', 'sprightly talk', 'ardent and aspiring', 'Dreams that fade and die in the dim west', 'unlettered laborer', 'ancient and venerable', 'vanished centuries', 'Enduring with smiling composure\nthe near presence of people who are distasteful', 'A nameless sadness which is always born of moonlight', 'gaunt and ghastly']
['cautious and reticent', 'apprehensive and anxious', 'slow and sluggish', 'critic

As you can see above, some of the phrases returned are quite useful, and many of them share no words with the query, so they could not have been found with a simple text search.

At the moment, this demonstration uses quite a bit of memory, since the entire word2vec model is loaded at once, and so it is not particularly suitable for inexpensive deployment, though it would be cool to have as a web app.