# Beyond `CTRL+F`
__Implementing semantic search within a document using word embeddings__

<table align="center">
    <tr><th>
        <a href="https://colab.research.google.com/github/fabio-a-oliveira/semantic-search/blob/master/semantic_search.ipynb">
            <img src="https://github.com/fabio-a-oliveira/semantic-search/blob/main/data/colab_logo_32px.png?raw=true">
            <br>Run in Google Colab
        </a>
    </th></tr>
</table>

---
## Introduction

In this notebook, we'll apply _Natural Language Processing (NLP)_ techniques to implement a semantic search within a document. This means that, instead of searching for a literal word or sequence of words (as you would when you use `CTRL+F` in Notepad, Microsoft Word, etc.), we'll be searching for terms, sentences or paragraphs of ___similar meaning___. If you have ever had to search a massive document for a piece of information that you cannot recall exactly, you will agree that this can be extremely useful and time-saving.

In summary, this is what we are going to do:

1. Use ___word embeddings___ to convert every word in the book and the requested sentence to a dense vector representation (in this case, we'll use GloVe embeddings);
2. Apply a ___part-of-speech___ (POS) mask to both the book and the requested sentence: every word that does not belong to a list of relevant _parts-of-speech_ will have the embedding converted to a null vector;
3. Apply a ___bag-of-words___ approach to sentence embedding: average the embeddings of every word in the sentence and get a single vector to represent its semantic content;
4. Apply the _bag-of-words_ sentence embedding and POS filter to the entire book by using a ___sliding window___ with the length of the requested sentence plus a selected margin and averaging the embeddings of words within the window;
5. Calculate the ___cosine distance___ between the requested sentence embedding and the sliding window embeddings;
6. Select the position in the book with the shortest distance to the requested sentence.
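
The steps above can be sketched on a toy scale. The snippet below is only an illustration under stated assumptions: the three-dimensional vectors in `toy_embeddings` are made up (they are not GloVe values), the POS mask of step 2 is left out, and the helper names `embed` and `cosine_distance` are defined here just for the example.

```python
import numpy as np

# Made-up 3-d "embeddings" (illustrative values only, not GloVe)
toy_embeddings = {
    "rich":      np.array([0.9, 0.1, 0.0]),
    "wealthy":   np.array([0.8, 0.2, 0.0]),
    "man":       np.array([0.1, 0.9, 0.0]),
    "gentleman": np.array([0.2, 0.8, 0.0]),
    "rain":      np.array([0.0, 0.1, 0.9]),
    "falls":     np.array([0.0, 0.2, 0.8]),
}

def embed(words):
    # step 1: one row per word; unknown words become null vectors
    return np.array([toy_embeddings.get(w, np.zeros(3)) for w in words])

def cosine_distance(u, v):
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return 1.0 if norm == 0 else 1.0 - np.dot(u, v) / norm

document = "the rain falls today and a wealthy gentleman arrives".split()
query = "rich man".split()

# step 3: bag-of-words embedding of the query
query_vec = embed(query).mean(axis=0)

# steps 4 and 5: average each sliding window and compare it with the query
doc = embed(document)
width = len(query)
distances = [cosine_distance(doc[i:i + width].mean(axis=0), query_vec)
             for i in range(len(doc) - width + 1)]

# step 6: the window with the shortest distance is the match
best = int(np.argmin(distances))
best_match = document[best:best + width]
print(best_match)
```

Although the query shares no word with the document, the window holding "wealthy gentleman" is selected because its averaged vector points in the same direction as the query's.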

To illustrate the concept, we'll download the .txt file of the book _Pride and Prejudice_ from the Project Gutenberg website and we'll show two applications of the technique:

1. We will take a sentence from the book and search the text for several increasingly altered versions of it;
2. We will take several excerpts from the Brazilian Portuguese version of the book, translate them back to English (which results in sentences with equivalent meaning but significantly different wording), and find the corresponding match in the original version.

Finally, a web form is provided so that you can experiment with the technique in any book available in the ___Project Gutenberg___ catalogue of public domain works.

---

## Setup

In this section, we will do all the preparation and definitions necessary for the analysis.

### Install and import libraries

We begin by importing the required libraries. Most of the NLP resources we need are available in the `nltk` (Natural Language Toolkit) package. We also need to install and import the `googletrans` library, which will be useful for translating text from Portuguese to English.

In [1]:
import numpy as np
import pandas as pd
import requests
from os import mkdir, getcwd, chdir, listdir
from os.path import join
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
import pkg_resources
from matplotlib import pyplot as plt
from zipfile import ZipFile
from textwrap import wrap

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
package_list = [pkg for pkg in pkg_resources.working_set]

if 'googletrans' not in [pkg.key for pkg in package_list]:
    ! pip install googletrans==3.1.0a0
import googletrans


I also like to declare a `HOME` variable, so that we have a handle to the starting directory available throughout the analysis.

In [3]:
HOME = getcwd()

### Download and prepare word embeddings and Project Gutenberg catalogue

GloVe word embeddings are the heart and soul of how we capture word meaning. To use them, we download the embeddings from the Stanford NLP website and convert the raw data to a Python dictionary.

In [4]:
if 'GloVe' not in listdir():
    mkdir('GloVe')

if 'glove.6B.zip' not in listdir(join(HOME, 'GloVe')):
    URL_GloVe = 'http://nlp.stanford.edu/data/glove.6B.zip'
    r = requests.get(URL_GloVe).content
    with open(join('GloVe', 'glove.6B.zip'), 'wb') as file:
        file.write(r)

z = ZipFile(join(HOME, 'GloVe','glove.6B.zip'))
z.extractall(join(HOME, 'GloVe'))

In [5]:
embedding_dict = {}

# each line holds a word followed by its 300 vector components
with open(join(HOME, 'GloVe', 'glove.6B.300d.txt'), encoding='utf-8') as file:
    for line in file:
        word, vector = line.split(' ', maxsplit=1)
        embedding_dict[word] = np.array(vector.split(), dtype='float')
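
For reference, each line of the GloVe text file holds a word followed by its vector components, separated by single spaces. The sketch below runs the same parsing logic on two made-up lines instead of the real file; the words and values are invented for illustration, and `toy_dict` is a hypothetical name.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: "<word> <v1> <v2> ... <vn>"
fake_file = io.StringIO("truth 0.1 -0.2 0.3\nwife 0.4 0.5 -0.6\n")

toy_dict = {}
for line in fake_file:
    # split off the word, keep the rest of the line as the vector
    word, vector = line.split(' ', maxsplit=1)
    toy_dict[word] = np.array(vector.split(), dtype=float)

print(toy_dict["truth"])
print(toy_dict["wife"].shape)
```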

We also need to download references to the Project Gutenberg catalogue, containing URLs for each entry in its huge corpus of public domain books.

In [6]:
if 'catalogue' not in listdir():
    mkdir('catalogue')

if 'project_gutenberg_corpus.csv' not in listdir(join(HOME,'catalogue')):
    URL = "https://github.com/fabio-a-oliveira/semantic-search/blob/main/data/project_gutenberg_corpus___2021_05_01.csv?raw=true"
    content = requests.get(URL).content
    with open(join(HOME,'catalogue','project_gutenberg_corpus.csv'), 'wb') as file:
        file.write(content)

catalogue = pd.read_csv(join(HOME, 'catalogue', 'project_gutenberg_corpus.csv'),
                        sep = "|")

### Helper functions

We now define a series of functions that will be useful in performing the analysis repeatedly.

We begin with a function that takes text content as a string and returns a matrix with the embeddings for each word according to a given dictionary.

In this step, most applications use a list of ___stop words___ - frequently occurring words that can be removed from the bag-of-words representation without much loss of meaning. My personal preference is for using a list of allowed parts-of-speech, which gives me more control over what categories of words will be kept or masked out.

In [7]:
def sequence_embedding(content, embedding_dict, 
                       allowed_pos = ['NN','NNS','NNP','NNPS','JJ','RB','VB',
                                      'VBG','VBN','VBP','VBZ','VBD']):

    # basic cleanup
    content = content.replace('\n', ' ').replace('_', "").lower()

    # tokenize content
    tokens = nltk.word_tokenize(content)

    # get mask indicating tokens that are in or out of allowed parts-of-speech
    pos_mask = np.array([int(tag[1] in allowed_pos) 
                         for tag in nltk.pos_tag(tokens)]).reshape((-1,1))

    # get embedding (null vector for out-of-vocabulary tokens) and apply mask
    embedding_dim = len(next(iter(embedding_dict.values())))
    embedding = np.array([embedding_dict[token] 
                          if token in embedding_dict 
                          else np.zeros(embedding_dim) 
                          for token in tokens])
    embedding *= pos_mask

    return embedding
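
To make the masking step concrete, here is a small sketch that applies the same multiplication to a hand-tagged phrase. Two things are assumptions made for the example: the tags are typed in by hand (in the function above they come from `nltk.pos_tag`), and the 2-d vectors are made up.

```python
import numpy as np

allowed_pos = ['NN', 'NNS', 'NNP', 'NNPS', 'JJ', 'RB', 'VB',
               'VBG', 'VBN', 'VBP', 'VBZ', 'VBD']

# (token, tag) pairs written by hand for illustration
tagged = [('a', 'DT'), ('single', 'JJ'), ('man', 'NN'),
          ('in', 'IN'), ('possession', 'NN')]

# made-up 2-d vectors, one row per token
vectors = np.array([[0.5, 0.5],
                    [0.1, 0.9],
                    [0.8, 0.2],
                    [0.3, 0.3],
                    [0.6, 0.4]])

# column vector of 0/1 flags, broadcast across the embedding dimension
pos_mask = np.array([int(tag in allowed_pos)
                     for _, tag in tagged]).reshape((-1, 1))
masked = vectors * pos_mask
print(masked)
```

The rows for "a" and "in" become null vectors, while the others are unchanged. Masked words still count toward the window length when the average is taken, but since cosine distance ignores vector magnitude, this only rescales the window vector and does not change the comparison.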

We also define a function to calculate the _cosine distance_ between embeddings. This function could be easily accessed from the `scipy` package, but since I intend to use the same code to deploy a simple web app I prefer to define it using `numpy` to preclude the need to import an additional library.

In [8]:
def cosine_distance(vec1, vec2):

    dot_prod = np.dot(vec1, vec2)
    norm1 = np.sqrt(np.sum(vec1 ** 2))
    norm2 = np.sqrt(np.sum(vec2 ** 2))

    if norm1 == 0 or norm2 == 0:
        return 1
    else:
        return 1 - dot_prod / (norm1 * norm2)
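
In formula form, for two embedding vectors $u$ and $v$, the function above computes:

$$
d(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

with the convention $d(u, v) = 1$ when either vector is null. The value ranges from 0 (same direction) to 2 (opposite directions). It does not depend on the length of the vectors, only on the angle between them.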

Next, we define a function that calculates the cosine distance between the embedding corresponding to the requested excerpt and the embeddings of a sliding window of words throughout the entire book.

The length of the sliding window corresponds to the length of the given excerpt plus a selected margin. The margin is implemented to account for the fact that sometimes our recollection of a particular excerpt is limited, and the actual corresponding text is probably longer than the provided input.

In [9]:
def sliding_distance(book, excerpt, margin = 0):

    excerpt_length = excerpt.shape[0]
    mvg_avg_width = excerpt_length + margin

    # bag-of-words excerpt embedding
    excerpt_embedding = excerpt.mean(axis=0)

    # moving average of the book embedding, 
    # considering the length of the excerpt + a margin;
    # the range ends at book.shape[0] + 1 so that the window
    # finishing on the last token is included
    mvg_avg_embedding = np.array([book[line-mvg_avg_width:line].mean(axis=0) 
                            for line in range(mvg_avg_width, book.shape[0] + 1)])

    # sliding distance: distance between sliding window and excerpt embeddings
    distance = np.array([cosine_distance(mvg_avg_embedding[line,:], 
                                         excerpt_embedding) 
                         for line in range(mvg_avg_embedding.shape[0])])

    return distance
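
The list comprehension above recomputes each window mean from scratch, which can become slow for long books. One possible alternative (a sketch, not the approach used in this notebook) is to derive all window means from a single cumulative sum. The function name `window_means` is hypothetical, and the random matrix merely stands in for a book embedding.

```python
import numpy as np

def window_means(book, width):
    """Mean of every run of `width` consecutive rows of `book`."""
    # prepend a row of zeros so that csum[i] is the sum of the first i rows
    csum = np.vstack([np.zeros((1, book.shape[1])),
                      np.cumsum(book, axis=0)])
    # the sum of rows [i, i + width) is csum[i + width] - csum[i]
    return (csum[width:] - csum[:-width]) / width

# random stand-in for a book embedding: 50 tokens, 4 dimensions
rng = np.random.default_rng(0)
book = rng.normal(size=(50, 4))
width = 7

fast = window_means(book, width)

# direct computation, for comparison
slow = np.array([book[i:i + width].mean(axis=0)
                 for i in range(book.shape[0] - width + 1)])

print(fast.shape, np.allclose(fast, slow))
```

This returns one mean per window position, `book.shape[0] - width + 1` rows in total. Cumulative sums can accumulate floating-point error over very long inputs, so it is worth checking the result against the direct computation on a sample before relying on it.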

The next function takes a chosen position and sentence length and returns the corresponding excerpt from the book. This will be used after we have applied the `sliding_distance()` function to calculate the distances between the requested excerpt and each possible sentence in the book and applied the `ndarray.argmin()` method to find the position with the smallest distance.

In [10]:
def find_match(reference, match_position, match_length):

    # basic cleanup
    reference = reference.replace('\n', ' ').replace('_', "").lower()

    match = nltk.word_tokenize(reference)
    match = match[match_position : match_position + match_length]
    match = ' '.join(match).replace(' ,',',').replace(' .','.')
    return match

Finally, the `locate_excerpt()` function puts it all together: it receives a desired excerpt and the text of a book as input, embeds both, and returns the matching passage from the book. It also accepts a `margin` argument, which makes it easier to search for sentences that are longer than the provided excerpt.

In [11]:
def locate_excerpt(excerpt, book, margin = 0):
    # embed book and excerpt, and get the excerpt word count
    book_embedding = sequence_embedding(book, embedding_dict)
    excerpt_embedding = sequence_embedding(excerpt, embedding_dict)
    excerpt_word_count = len(nltk.word_tokenize(excerpt))

    # calculate distances
    distances = sliding_distance(book_embedding, 
                                 excerpt_embedding, 
                                 margin = margin)

    # find the best match and return it
    match = find_match(book, distances.argmin(), 
                       excerpt_word_count + margin)
    
    return match

---
## Application

Now that we have everything we need to begin searching for text with similar semantic content inside a document, let's apply the technique to Jane Austen's _Pride and Prejudice_.

In the first application, we'll take a sentence from the book and search the document for several modified versions of it.

In the second, we'll take several sentences from the Brazilian Portuguese translation of the book, translate them back to English (which results in excerpts with similar meaning but very different wordings) and search the book for them.

Finally, you will be able to use a web form to apply the technique to any book publicly available via the Project Gutenberg website.

### Search for excerpts from a given input

We first need to download the book from the Project Gutenberg website and convert every word to its dense vector representation.

In [12]:
URL = "https://www.gutenberg.org/cache/epub/42671/pg42671.txt"
book = str(requests.get(URL).content, encoding='utf-8')
book_embedding = sequence_embedding(book, embedding_dict)

The excerpt we'll be searching for is the first sentence in the book:

> ___"It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."___

We'll begin by searching for the original sentence, just to get a feel for the technique and make sure it works.

In [13]:
excerpt = ("It is a truth universally acknowledged, that a single man " + 
          "in possession of a good fortune, must be in want of a wife.")
locate_excerpt(excerpt, book)

'it is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.'

Well, it is no surprise that the search works. After all, we were searching for the exact same sentence. Let's make it a little more difficult and replace some of the words with synonyms:

> ___It is a fact universally known, that an unmarried man in possession of a vast fortune, must be in need of a wife___

In [14]:
excerpt = ("It is a fact universally known, that an unmarried man in " + 
          "possession of a vast fortune, must be in need of a wife")
locate_excerpt(excerpt, book)

'universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. however little known'

Great! We do have a match with the corresponding original sentence. However, we can see that there is a slight misalignment in the result: the first 4 words are missing, and 3 additional words (or 4 tokens, counting the period) are added to the end. This is a frequent artifact of the response. Since we are working with sliding windows, there is a large overlap between neighboring windows, and results will often be offset by a few words to the right or left.

Now, let's go beyond synonyms and alter the sentence more substantially, including some simplifications:

> ___It is a fact universally known, that a man who is rich and single surely wants a wife.___

Because we are providing an excerpt that is somewhat shorter than the original, adding a margin to increase the length of the sliding window will help in locating the result.

In [15]:
excerpt = ("It is a fact universally known, that a man who is rich " + 
          "and single surely wants a wife")
locate_excerpt(excerpt, book, margin = 10)

'it is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife. however little known'

We have a match again!

Now, let's try an extreme example and provide just the gist of the sentence as an excerpt for the algorithm:

> ___Everyone knows that a rich single man wants a wife.___

In [16]:
excerpt = "Everyone knows that a rich single man wants a wife"
locate_excerpt(excerpt, book, margin = 10)

'as he chooses. nobody wants him to come. though i shall always say that he used my daughter'

That did not go very well. The algorithm found a match that does not correspond to the original sentence we wanted in the book.

However, we can cycle through some of the top results and see if the actual sentence pops up:

In [17]:
# define excerpt
excerpt = "Everyone knows that a rich single man wants a wife"
margin = 10

# embed excerpt and get word count
excerpt_embedding = sequence_embedding(excerpt, embedding_dict)
excerpt_word_count = len(nltk.word_tokenize(excerpt))

# calculate distances
distances = sliding_distance(book_embedding, excerpt_embedding, margin = margin)

# find top 20 results and print them
for i in range(20):
    print("{}) ".format(i+1), end ='')
    print(find_match(book, distances.argsort()[i], 
                     excerpt_word_count + 2*margin))

1) as he chooses. nobody wants him to come. though i shall always say that he used my daughter extremely ill ; and if i was her, i
2) he is such a man ! '' `` yes, yes, they must marry. there is nothing else to be done. but there are two things that
3) is such a man ! '' `` yes, yes, they must marry. there is nothing else to be done. but there are two things that i
4) he chooses. nobody wants him to come. though i shall always say that he used my daughter extremely ill ; and if i was her, i would
5) , that a single man in possession of a good fortune, must be in want of a wife. however little known the feelings or views of such a
6) acknowledged, that a single man in possession of a good fortune, must be in want of a wife. however little known the feelings or views of such
7) a man ! '' `` yes, yes, they must marry. there is nothing else to be done. but there are two things that i want very
8) man ! '' `` yes, yes, they must marry. there is nothing else to be done. but there are two th

Looking carefully at the top 20 results, we see that the original excerpt we were looking for corresponds to entries 5, 6, 15, and 17. There is a lot of overlap between results, so there are actually just 7 different parts in this selection, of which our desired outcome is the third.

It would be fairly simple to write a function that identifies these overlaps and merges them into single results. In that case, our desired excerpt would be the third result on the list.
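
A minimal sketch of such a merge is shown below. It assumes a simple greedy rule: walk through the positions from best to worst and skip any position whose window overlaps one that was already kept. The function name `distinct_matches` and the toy distances are hypothetical.

```python
import numpy as np

def distinct_matches(distances, window_length, top_k=5):
    """Return up to `top_k` window start positions, best first,
    such that no two of the selected windows overlap."""
    kept = []
    for position in np.argsort(distances):
        # two windows of equal length overlap when their start
        # positions are closer than the window length
        if all(abs(position - other) >= window_length for other in kept):
            kept.append(int(position))
        if len(kept) == top_k:
            break
    return kept

# toy distances, one per window start position
toy_distances = np.array([0.9, 0.2, 0.1, 0.3, 0.8,
                          0.7, 0.15, 0.25, 0.95, 0.6])
result = distinct_matches(toy_distances, window_length=3, top_k=3)
print(result)
```

With the variables from the cell above, this could be called as `distinct_matches(distances, excerpt_word_count + margin)`, passing each returned position to `find_match()`.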

### Search for original text from translations

We'll now move to a slightly different application, where instead of searching for sentences manually modified from the original book, we will look for excerpts corresponding to the Brazilian Portuguese translation.

For each of these excerpts, we will use the `googletrans` library to automatically translate them back into English. We will then use our search routine to try and find the corresponding excerpt in the original book. 

You will notice that the translation is rather different from the original text. Although it certainly holds the same meaning, the choice and order of words are remarkably different.

We begin with a few single sentences, starting with the first sentence in the book and then trying a few more:

In [18]:
excerpt_translated = ("É uma verdade universalmente conhecida que um homem " + 
                     "solteiro, possuidor de uma boa fortuna, deve estar " + 
                     "necessitado de esposa.")
excerpt_english = googletrans.Translator().translate(excerpt_translated).text
result = locate_excerpt(excerpt_english, book)

print("Excerpt (pt):")
[print(st) for st in wrap(excerpt_translated)];
print("\nTranslation (en):")
[print(st) for st in wrap(excerpt_english)];
print("\nOriginal (en):")
[print(st) for st in wrap(result)];

Excerpt (pt):
É uma verdade universalmente conhecida que um homem solteiro,
possuidor de uma boa fortuna, deve estar necessitado de esposa.

Translation (en):
It is a universally known truth that a single man, possessing a good
fortune, must be in need of a wife.

Original (en):
is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife


In [19]:
excerpt_translated = ("Mas quando Elizabeth contou que ele ficara em " + 
                     "silêncio, a hipótese não pareceu muito plausível, mesmo "+ 
                     "para Charlotte que a desejava.")
excerpt_english = googletrans.Translator().translate(excerpt_translated).text
result = locate_excerpt(excerpt_english, book)

print("Excerpt (pt):")
[print(st) for st in wrap(excerpt_translated)];
print("\nTranslation (en):")
[print(st) for st in wrap(excerpt_english)];
print("\nOriginal (en):")
[print(st) for st in wrap(result)];

Excerpt (pt):
Mas quando Elizabeth contou que ele ficara em silêncio, a hipótese não
pareceu muito plausível, mesmo para Charlotte que a desejava.

Translation (en):
But when Elizabeth told him he was silent, the hypothesis did not seem
very plausible, even to Charlotte who wanted it.

Original (en):
when elizabeth told of his silence, it did not seem very likely, even
to charlotte 's wishes, to be the case


In [20]:
excerpt_translated = ("Mrs. Gardiner ficou surpreendida e preocupada. Mas " + 
                     "como se aproximavam agora do lugar onde residira na " + 
                     "sua mocidade, ela se entregou toda ao encanto das " + 
                     "suas recordações")
excerpt_english = googletrans.Translator().translate(excerpt_translated).text
result = locate_excerpt(excerpt_english, book)

print("Excerpt (pt):")
[print(st) for st in wrap(excerpt_translated)];
print("\nTranslation (en):")
[print(st) for st in wrap(excerpt_english)];
print("\nOriginal (en):")
[print(st) for st in wrap(result)];

Excerpt (pt):
Mrs. Gardiner ficou surpreendida e preocupada. Mas como se aproximavam
agora do lugar onde residira na sua mocidade, ela se entregou toda ao
encanto das suas recordações

Translation (en):
Mrs. Gardiner was surprised and concerned. But as they now approached
the place where she had resided in her youth, she gave herself over to
the charm of her memories

Original (en):
on. mrs. gardiner was surprised and concerned ; but as they were now
approaching the scene of her former pleasures, every idea gave way to
the charm of recollection ;


In [21]:
excerpt_translated = ("— Acha que eles estão em Londres? — Sim, em que " + 
                     "outro lugar poderiam se esconder?")
excerpt_english = googletrans.Translator().translate(excerpt_translated).text
result = locate_excerpt(excerpt_english, book)

print("Excerpt (pt):")
[print(st) for st in wrap(excerpt_translated)];
print("\nTranslation (en):")
[print(st) for st in wrap(excerpt_english)];
print("\nOriginal (en):")
[print(st) for st in wrap(result)];

Excerpt (pt):
— Acha que eles estão em Londres? — Sim, em que outro lugar poderiam
se esconder?

Translation (en):
- Do you think they're in London? - Yes, where else could they hide?

Original (en):
. '' `` do you suppose them to be in london ? '' `` yes ; where else


In all of these examples, the method was able to identify the correct excerpt from the original text (give or take a few words to the right or left), even though the translations were very different from the original.

Finally, we try a longer excerpt with several sentences:

In [22]:
excerpt_translated = ("Se pudéssemos saber quais eram as dívidas de " + 
                     "Wickham... E com quanto ele dotou nossa irmã... " + 
                     "Saberia exatamente o que Mr. Gardiner fez, pois " + 
                     "Wickham não tem um tostão de seu. A bondade dos nossos "+
                     "tios é uma coisa que nunca poderá ser paga. Eles a " + 
                     "levaram para casa e lhe deram toda a sua proteção e " + 
                     "apoio moral. Isto é um sacrifício que anos de gratidão " +
                     "não podem compensar. Nesse momento, ela está em casa " + 
                     "deles. Se uma tão grande bondade não lhe der a " + 
                     "consciência da falta que praticou, é que ela não " + 
                     "merece nunca ser feliz. Imagina a sua cara quando " + 
                     "chegar diante da minha tia")
excerpt_english = googletrans.Translator().translate(excerpt_translated).text
result = locate_excerpt(excerpt_english, book)

print("Excerpt (pt):")
[print(st) for st in wrap(excerpt_translated)];
print("\nTranslation (en):")
[print(st) for st in wrap(excerpt_english)];
print("\nOriginal (en):")
[print(st) for st in wrap(result)];

Excerpt (pt):
Se pudéssemos saber quais eram as dívidas de Wickham... E com quanto
ele dotou nossa irmã... Saberia exatamente o que Mr. Gardiner fez,
pois Wickham não tem um tostão de seu. A bondade dos nossos tios é uma
coisa que nunca poderá ser paga. Eles a levaram para casa e lhe deram
toda a sua proteção e apoio moral. Isto é um sacrifício que anos de
gratidão não podem compensar. Nesse momento, ela está em casa deles.
Se uma tão grande bondade não lhe der a consciência da falta que
praticou, é que ela não merece nunca ser feliz. Imagina a sua cara
quando chegar diante da minha tia

Translation (en):
If we could know what Wickham's debts were ... And how much he endowed
our sister with ... He would know exactly what Mr. Gardiner did,
because Wickham doesn't have a penny of his own. The kindness of our
uncles is something that can never be paid for. They took her home and
gave her all her protection and moral support. This is a sacrifice
that years of gratitude cannot make up for. 

Again, we see that the match was correctly found in the original text.

### Your turn!

Now you can try searching for whatever you'd like in the Project Gutenberg catalogue of public domain books.

You can provide text for the search in any language, as it will be translated to English before searching the document.

If you can't find the result you expected, try experimenting with the `margin` parameter, which increases the length of the sliding window.

To run a search using the form, just fill it in and hit `CTRL+ENTER` or click on the _play_ icon on the left side of the form.

Before running a search, be sure to run this entire notebook by going to the _Runtime_ menu and clicking _Run all_.

In [23]:
#@title Choose your book, author and excerpt

Author = "Jane Austen" #@param {type:"string"}
Book = "Pride and Prejudice" #@param {type:"string"}
Excerpt = "it is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife." #@param {type:"string"}
Margin = 0 #@param {type:"slider", min:0, max:50, step:1}

excerpt_original = Excerpt
excerpt = googletrans.Translator().translate(excerpt_original)
src_lan = excerpt.src
excerpt = excerpt.text

no_author = False
no_book = False

df = catalogue
df = df.loc[df.Language == 'en']
for name in Author.replace(',',' ').split():
    df = df.loc[df.Author.map(lambda x: name.lower() in x.lower())]
if df.shape[0] == 0:
    no_author = True

for word in Book.split():
    df = df.loc[df.Title.map(lambda x: word.lower() in x.lower())]
if df.shape[0] == 0:
    no_book = True

if no_author:
    print("Author not found")
elif no_book:
    print("Book not found. Here are some available from the same author:\n")
    df = catalogue
    df = df.loc[df.Language == 'en']
    for name in Author.replace(',',' ').split():
        df = df.loc[df.Author.map(lambda x: name.lower() in x.lower())]
    print(df.Title.values)
else:
    URL = df.URL.iloc[0]
    selected_book = str(requests.get(URL).content, encoding='utf-8')
    #selected_book_embedding = sequence_embedding(selected_book, embedding_dict)
    result = locate_excerpt(excerpt, selected_book, margin = Margin)

    print("Book URL: {}\n".format(URL))

    if src_lan == 'en':
        print("Provided excerpt:")
        [print(st) for st in wrap(excerpt_original)];
        print("\nOriginal:")
        [print(st) for st in wrap(result)];
    else:
        print("Provided excerpt ({}):".format(src_lan))
        [print(st) for st in wrap(excerpt_original)];
        print("\nTranslation (en):")
        [print(st) for st in wrap(excerpt)];
        print("\nOriginal (en):")
        [print(st) for st in wrap(result)];

Book URL: https://www.gutenberg.org/files/1342/1342-0.txt

Provided excerpt:
it is a truth universally acknowledged, that a single man in
possession of a good fortune, must be in want of a wife.

Original:
1 it is a truth universally acknowledged, that a single man in
possession of a good fortune, must be in want of a wife


---
## Conclusion

The method proposed in this notebook performed remarkably well at searching a document for an excerpt corresponding in meaning to a provided sentence. The heart and soul of the method, the translation of words into a meaningful vector representation, was provided by the GloVe word embeddings, and most of the NLP tools came from the `nltk` package.

We applied the technique to inputs corresponding to severely modified/simplified versions of a book passage, and also to excerpts from a book translation, both with similarly positive results.

Admittedly, there was some cherry-picking in the presented results, especially in the selection of the `margin` parameter. In an actual application, it might be necessary for the user to try a couple of different values for this parameter or to scroll through a short list of candidate results. However, none of these are more complex than the current `CTRL+F` routine we all have to perform more often than we would like.

With some minor modifications - like merging overlapping results - the technique could easily be applied to search for content in large documents, which is currently an inefficient and time-consuming task.