# Measuring Word Similarity with BERT (English Language Public Domain Poems)

By [The BERT for Humanists](https://melaniewalsh.github.io/BERT-for-Humanists/) Team

How can we measure the similarity of words in a collection of texts? For example, how similar are the words "nature" and "science" in a collection of 16th-20th century English language poems? Do 20th-century poets use the word "science" differently than 16th-century poets? Can we map all the different uses and meanings of the word "nature"?

The short answer is: yes! We can explore all of these questions with BERT, a natural language processing model that has revolutionized the field.

BERT turns words or tokens into vectors: essentially, lists of numbers that act as coordinates in a high-dimensional space (768 dimensions for DistilBERT, rather than the two of a familiar x, y plot). We can then use the geometric similarity between the resulting vectors to represent varying types of similarity between words.
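
To make the idea of geometric similarity concrete, here is a minimal sketch using made-up three-dimensional vectors (real BERT vectors are much longer). The helper name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of this notebook's later code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional vectors, for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a, twice as long
vec_c = [0.0, 0.0, 3.0]   # perpendicular to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)  # close to 1.0: length is ignored
sim_ac = cosine_similarity(vec_a, vec_c)  # 0.0: nothing in common
print(sim_ab, sim_ac)
```

Two vectors that point the same way score close to 1 no matter how long they are, which is why this kind of measure is a natural fit for comparing word vectors.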

## In This Notebook
In this Colab notebook, we will specifically analyze a collection of poems scraped from [Public-Domain-Poetry.com](http://public-domain-poetry.com/) with the [DistilBert model](https://huggingface.co/transformers/model_doc/distilbert.html) and the HuggingFace Python library. DistilBert is a smaller — yet still powerful! — version of BERT. By using the rich representations of words that BERT produces, we will then explore the multivalent meanings of particular words in context and over time.

We hope this notebook will help illustrate how BERT works, how well it works, and how you might use BERT to explore the similarity of words in a collection of texts. It is surprising, for example, that BERT works as well as it does, without any fine-tuning, on poems that were published hundreds of years before the text data it was trained on (Wikipedia pages and self-published novels).

But we also hope that these results will expose some of the limitations and challenges of BERT. We have to disregard poetic line breaks, for example, and we see that BERT has trouble with antiquated words like "thine," which don't show up in its contemporary vocabulary.

In [None]:
#@title BERT Word Vectors: A Preview { display-mode: "form" }
#@title: Hover
import pandas as pd
import altair as alt

url = "https://raw.githubusercontent.com/melaniewalsh/BERT-4-Humanists/main/data/bert-word-nature.csv"
df = pd.read_csv(url, encoding='utf-8')

search_keywords = ['nature', 'science', 'religion', 'art']
color_by = 'word'

alt.Chart(df, title=f"Word Similarity: {', '.join(search_keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    href="link",
    tooltip=['title', 'word', 'poem_title', 'author', 'period']
    ).interactive().properties(
    width=500,
    height=500
)

The plot above displays a preview of our later results. This is what we're working toward!

You can hover over each point to see the instance of each word in context. If you press `Shift` and click on a point, you will be taken to the original poem on Public-Domain-Poetry.com. Try it out!

## **Import necessary Python libraries and modules**

OK, enough introduction! Let's get started.

To use the HuggingFace [`transformers` Python library](https://huggingface.co/transformers/installation.html), we first need to install it with `pip`.

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Then we will import the DistilBertModel and DistilBertTokenizerFast from the Hugging Face `transformers` library. We will also import a handful of other Python libraries and modules.

In [None]:
# For BERT
from transformers import DistilBertTokenizerFast, DistilBertModel

# For data manipulation and analysis
import pandas as pd
pd.options.display.max_colwidth = 200
import numpy as np
from sklearn.decomposition import PCA

# For interactive data visualization
import altair as alt

## **Load text dataset**

Our dataset contains about 30 thousand poems scraped from http://public-domain-poetry.com/. This website hosts a curated collection of poems that have fallen out of copyright, which makes them easier for us to share on the web.
You can find the data in our [GitHub repository](https://github.com/melaniewalsh/BERT-4-Humanists/blob/main/data/public-domain-poetry.csv).

We don't have granular date information about when each poem was published, but we do know the birth dates of most of our authors, which we've used to loosely categorize the poems by time period. The poems in our data range from the Middle Ages to the 20th Century, but most come from the 19th Century. The data features both well-known authors — William Wordsworth, Emily Elizabeth Dickinson, Paul Laurence Dunbar, Walt Whitman, Shakespeare — as well as less well-known authors.

Below we will use the Python library `pandas` to read in our CSV file of poems. It is convenient (especially for Colab notebooks) that `pandas` allows you to read in files directly from the web.

To be clear, however, knowledge of `pandas` is not necessary to use BERT. This is simply how we've chosen to load our data. All you really need  is a list of texts (poems, passages, etc.). You can create this list however you are most comfortable.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd
import numpy as np

df_taming = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/honors thesis/punctuation/taming/df_tamingoftheshrew_sentence_withactbreakdown.csv')
df_tamertamed = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/honors thesis/punctuation/df_tamertamedwithpunctuation.csv')

In [None]:
# Label every row with the play it comes from
# (assigning a single value fills the whole column, so no loop is needed)
df_taming['title'] = 'The Taming of the Shrew'
df_tamertamed['title'] = 'Tamer Tamed'

In [None]:
#@title *Click here to see how you might load a dataset from your own computer*
#from google.colab import files
#uploaded = files.upload()

Let's check to see how many poems are in this dataset:

## **Sample text dataset**

Though we wish we could analyze all the poems in this data, Colab tends to crash if we try to use more than 4-5,000 poems —  even with DistilBert, the smaller version of BERT. This is an important limitation to keep in mind. If you'd like to use more text data, you might consider upgrading to a paid version of Colab (with more memory or GPUs) or using a compute cluster.

To reduce the number of poems, we will take a random sample of 1,000 poems from four different time periods: the 20th Century, 19th Century, 18th Century, and the Early Modern period.
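
One way such a per-period sample could be drawn is with `pandas`' grouped sampling. The sketch below uses a tiny made-up table; the column name `period`, the variable names, and the sample size are assumptions for illustration, not the notebook's actual data.

```python
import pandas as pd

# Toy stand-in for the poem table; the 'period' column name is an assumption
toy_df = pd.DataFrame({
    "text": [f"poem {i}" for i in range(12)],
    "period": ["19th Century"] * 6 + ["20th Century"] * 4 + ["18th Century"] * 2,
})

n_per_period = 2  # the text above describes 1,000 per period

# Draw the same number of rows from every period; random_state makes it repeatable
sample_df = toy_df.groupby("period").sample(n=n_per_period, random_state=0)

period_counts = sample_df["period"].value_counts()
print(period_counts)
```

Note that, without `replace=True`, this kind of sampling raises an error if any group has fewer rows than `n`, so very small periods would need a smaller sample size or separate handling.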

In [None]:
df = pd.concat([df_taming,df_tamertamed])

In [None]:
# Count how many rows come from each play
df['title'].value_counts()

Finally, let's make a list of poems from our Pandas DataFrame.

In [None]:
drama_texts = df['sentence'].tolist()

Let's check how many texts are in our list and then examine the first one:

In [None]:
len(drama_texts)

4075

In [None]:
print(drama_texts[0])

Enter Beggar ( Christopher Sly ) and Hostess 


## **Encode/tokenize text data for BERT**

Next we need to transform our poems into a format that BERT (via Huggingface) will understand. This is called *encoding* or *tokenizing* the data.

We will tokenize the poems with the `tokenizer()` from HuggingFace's `DistilBertTokenizerFast`. Here's what the `tokenizer()` will do:

1. Truncate texts that are longer than 512 tokens and pad shorter texts so that every text in a batch has the same length. If a word is not in BERT's vocabulary, it will be broken up into smaller "word pieces"; every piece after the first is marked with a leading `##`.

2. Add in special tokens to help BERT:
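    - [CLS]: Start token of every document
    - [SEP]: End-of-text marker, which also separates the two texts when a pair of texts is encoded together
    - [PAD]: Padding added at the end of shorter documents so that every document in a batch has the same length (at most 512 tokens)
    - &#35;&#35;: Prefix of a "word piece" that continues the word begun by the previous token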
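
The steps above can be sketched with a toy tokenizer. This is a simplified illustration, not HuggingFace's implementation: the function names, the five-entry vocabulary, and `max_length=8` are all hypothetical, and the real tokenizer also handles punctuation and a vocabulary of tens of thousands of entries.

```python
def toy_wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_encode(text, vocab, max_length=8):
    """Lowercase, split into word pieces, truncate, add special tokens, pad."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(toy_wordpiece(word, vocab))
    tokens = tokens[:max_length - 1] + ["[SEP]"]      # truncate, leaving room for [SEP]
    tokens += ["[PAD]"] * (max_length - len(tokens))  # pad up to max_length
    return tokens

# Hypothetical miniature vocabulary
toy_vocab = {"enter", "beg", "##gar", "and", "hostess"}
toy_tokens = toy_encode("Enter Beggar and Hostess", toy_vocab)
print(toy_tokens)
# ['[CLS]', 'enter', 'beg', '##gar', 'and', 'hostess', '[SEP]', '[PAD]']
```

Because "beggar" is not in the toy vocabulary, it is split into `beg` and `##gar`, mirroring what the real tokenizer does with unfamiliar words.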

Here we will load `DistilBertTokenizerFast` from HuggingFace library, which will help us transform and encode the texts so they can be used with BERT.

In [None]:
from transformers import DistilBertTokenizerFast

In [None]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

The `tokenizer()` will break word tokens into word pieces, truncate to 512 tokens, and add padding and special BERT tokens.

In [None]:
tokenized_poems = tokenizer(drama_texts, truncation=True, padding=True, return_tensors="pt")

Let's examine the first tokenized poem. We can see that the special BERT tokens have been inserted where necessary.

In [None]:
' '.join(tokenized_poems[0].tokens)

'[CLS] enter beg ##gar ( christopher sly ) and hostess [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] ... (output truncated)


## **Load pre-trained BERT model**

Here we will load a pre-trained BERT model. To speed things up we will use a GPU, but using a GPU involves a few extra steps.
The command `.to("cuda")` moves data from regular memory to the GPU's memory.




In [None]:
from transformers import DistilBertModel

In [None]:
model = DistilBertModel.from_pretrained('distilbert-base-uncased').to("cuda")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## **Get BERT word embeddings for each document in a collection**

To get word embeddings for all the words in our collection, we will use a `for` loop.

For each poem in our list `drama_texts`, we will tokenize the poem, and we will extract the vocabulary word ID for each word/token in the poem (to use for later reference). Then we will run the tokenized poem through the BERT model and extract the vectors for each word/token in the poem.

We thus create two big lists for all the poems in our collection — `doc_word_ids` and `doc_word_vectors`.

In [None]:
# List of vocabulary word IDs for all the words in each document (aka each poem)
doc_word_ids = []
# List of word vectors for all the words in each document (aka each poem)
doc_word_vectors = []

# Below we will slice our poem to ignore the first (0th) and last (-1) special BERT tokens
start_of_words = 1
end_of_words = -1

# Below we will index the 0th or first document, which will be the only document, since we're analyzing one poem at a time
first_document = 0

for i, poem in enumerate(drama_texts):
  
    # Here we tokenize each poem with the DistilBERT Tokenizer
    inputs = tokenizer(poem, return_tensors="pt", truncation=True, padding=True)

    # Here we extract the vocabulary word ids for all the words in the poem (the first or 0th document, since we only have one document)
    # We ignore the first and last special BERT tokens
    # We also convert from a Pytorch tensor to a numpy array
    doc_word_ids.append(inputs.input_ids[first_document].numpy()[start_of_words:end_of_words])

    # Here we send the tokenized poems to the GPU
    # The model is already on the GPU, but this poem isn't, so we send it to the GPU
    inputs.to("cuda")
    # Here we run the tokenized poem through the DistilBERT model
    outputs = model(**inputs)

    # We take every element from the first or 0th document, from the 2nd to the 2nd to last position
    # Grabbing the last layer is one way of getting token vectors. There are different ways to get vectors with different pros and cons
    doc_word_vectors.append(outputs.last_hidden_state[first_document,start_of_words:end_of_words,:].detach().cpu().numpy())


Confirm that we have the same number of documents for both the tokens and the vectors:

In [None]:
len(doc_word_ids), len(doc_word_vectors)

(4075, 4075)

In [None]:
doc_word_ids[0], doc_word_vectors[0]

(array([ 4607, 11693,  6843,  1006,  5696, 18230,  1007,  1998, 22566]),
 array([[-0.06293996,  0.40168014, -0.01639411, ..., -0.12127846,
          0.3518933 ,  0.05997372],
        [ 0.17778005, -0.6128298 ,  0.42495412, ..., -0.0715012 ,
          0.2180077 ,  0.49534407],
        [ 0.29525435, -0.31497422,  0.13371837, ..., -0.06018631,
         -0.00849556, -0.15944809],
        ...,
        [ 0.74987924,  0.4591687 , -0.12121965, ...,  0.121236  ,
         -0.49228176,  0.0250312 ],
        [-0.6388959 ,  0.16735034,  0.21905185, ..., -0.03539974,
          0.06519184,  0.30487072],
        [-0.20811436, -0.06832563,  0.264144  , ..., -0.2208066 ,
          0.08278073,  0.1258859 ]], dtype=float32))

## **Concatenate all word IDs/vectors for all documents**

Each element of these lists contains all the tokens/vectors for one document. But we want to concatenate them into two giant collections.

In [None]:
all_word_ids = np.concatenate(doc_word_ids)
all_word_vectors = np.concatenate(doc_word_vectors, axis=0)
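
Once everything is concatenated, a position in the big array no longer says which document it came from. One possible way to recover that is to remember where each document ends. The sketch below uses toy arrays; `doc_ends` and `position_to_doc` are hypothetical names, not part of this notebook.

```python
import numpy as np

# Toy stand-ins for doc_word_ids: three "documents" with 3, 2, and 4 tokens
toy_doc_word_ids = [np.array([11, 12, 13]), np.array([21, 22]), np.array([31, 32, 33, 34])]
toy_all_word_ids = np.concatenate(toy_doc_word_ids)

# Position (in the concatenated array) just past the end of each document
doc_ends = np.cumsum([len(ids) for ids in toy_doc_word_ids])  # [3, 5, 9]

def position_to_doc(position):
    """Return the index of the document that contains a concatenated position."""
    return int(np.searchsorted(doc_ends, position, side="right"))

doc_of_first = position_to_doc(0)   # token 11 belongs to document 0
doc_of_fourth = position_to_doc(3)  # token 21 belongs to document 1
doc_of_last = position_to_doc(8)    # token 34 belongs to document 2
print(doc_of_first, doc_of_fourth, doc_of_last)
```

With the real data, the same idea would let a word position be traced back to a row of the DataFrame, assuming the documents were processed in DataFrame order.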

We want to make comparisons between vectors quickly. One common option is *cosine similarity*, which measures the angle between vectors but ignores their length. We can speed this computation up by setting all the word vectors to have length 1.0: once every vector has unit length, the cosine similarity between two vectors is simply their dot product.

In [None]:
# Calculating the length of each vector (Pythagorean theorem)
row_norms = np.sqrt(np.sum(all_word_vectors ** 2, axis=1))
# Dividing every vector by its length
all_word_vectors /= row_norms[:,np.newaxis]
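
To see why the normalization step above pays off, the following sketch compares the "long way" of computing cosine similarity with a single matrix-vector product on unit-length rows. The random toy vectors and all variable names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(5, 4))   # 5 made-up "word vectors" with 4 dimensions

# Same normalization idea as above: divide each row by its length
unit_vectors = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)

# Cosine similarity of row 0 to every row, the long way...
query = toy_vectors[0]
slow = np.array([
    np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v))
    for v in toy_vectors
])

# ...and the fast way: one matrix-vector product on the unit-length rows
fast = unit_vectors @ unit_vectors[0]

# Positions ordered from most to least similar to row 0
most_similar = np.argsort(-fast)
print(np.allclose(slow, fast), most_similar)
```

Both approaches give the same numbers, but the second needs no per-vector division at query time, which matters when the collection holds tens of thousands of word vectors.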

## **Find all word positions in a collection**

We can use the array `all_word_ids` to find all the places, or *positions*, in the collection where a word appears.

We can find a word's vocab ID in BERT with `tokenizer.vocab` and then check to see where/how many times this ID occurs in `all_word_ids`.

In [None]:
def get_word_positions(words):
  
  """This function accepts a list of words, rather than a single word"""

  # Get word/vocabulary ID from BERT for each word
  word_ids = [tokenizer.vocab[word] for word in words]

  # Find all the positions where the words occur in the collection
  word_positions = np.where(np.isin(all_word_ids, word_ids))[0]

  return word_positions
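
Because the function looks each word up directly in the vocabulary, a word that is not a single vocabulary entry (an antiquated word that gets split into word pieces, for example) would be expected to raise a `KeyError`. A minimal sketch of a more forgiving variant follows; the name `get_word_positions_safe`, the toy vocabulary, and the toy ID array are hypothetical.

```python
import numpy as np

# Hypothetical miniature vocabulary and token-ID array, for illustration only
toy_vocab = {"love": 0, "hate": 1, "thou": 2, "art": 3}
toy_all_word_ids = np.array([2, 3, 0, 1, 0, 2, 0])

def get_word_positions_safe(words, vocab, all_ids):
    """Like get_word_positions(), but skips words missing from the vocabulary."""
    known_ids = [vocab[word] for word in words if word in vocab]
    skipped = [word for word in words if word not in vocab]
    positions = np.where(np.isin(all_ids, known_ids))[0]
    return positions, skipped

positions, skipped = get_word_positions_safe(["love", "thine"], toy_vocab, toy_all_word_ids)
print(positions, skipped)
```

Returning the skipped words alongside the positions makes it obvious when a search term was silently dropped rather than simply absent from the texts.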

Here we'll check to see all the places where the word "love" appears in the collection.

In [None]:
get_word_positions(["love"])

array([ 1142,  1211,  1941,  2013,  3080,  3582,  3591,  3896,  4215,
        4250,  4342,  4643,  4686,  4786,  4974,  5013,  5384,  6601,
        7210,  7334,  7429,  7499,  7847,  8403,  8555,  9523, 10566,
       10659, 11038, 12669, 12963, 12987, 13002, 13083, 14296, 14541,
       15010, 16460, 17357, 19574, 20256, 20346, 20479, 20661, 20720,
       20980, 21712, 21839, 22004, 22172, 22553, 22565, 22569, 24156,
       24176, 24296, 24302, 27189, 27196, 27453, 28658, 29131, 29245,
       30009, 30411, 30483, 30683, 30995, 31250, 31264, 34557, 34613,
       34618, 36829, 36852, 37275, 39393, 39412, 39563, 42180, 42460,
       42578, 42581, 42690, 42840, 42863, 43077, 43276, 43298, 43306,
       44093, 44124, 45225, 45474, 45769, 46092, 46231, 46261, 46420,
       47092, 47522, 47755, 48313, 48774, 50100, 50313, 51204, 51631,
       52411, 53938, 54089, 54260, 54324, 54867, 56582, 56956, 56990,
       57940, 58917])

In [None]:
word_positions = get_word_positions(["love"])

## **Find word from word position**

Nice! Now we know all the positions where the word "love" appears in the collection. But it would be more helpful to know the actual words that appear in context around it. To find these context words, we have to convert position IDs back into words.

In [None]:
# Here we create an array so that we can go backwards from numeric token IDs to words
word_lookup = np.empty(tokenizer.vocab_size, dtype="O")

for word, index in tokenizer.vocab.items():
    word_lookup[index] = word

Now we can use `word_lookup` to find a word based on its position in the collection.

In [None]:
word_positions = get_word_positions(["love"])

for word_position in word_positions:
  print(word_position, word_lookup[all_word_ids[word_position]])

1142 love
1211 love
1941 love
2013 love
3080 love
3582 love
3591 love
3896 love
4215 love
4250 love
4342 love
4643 love
4686 love
4786 love
4974 love
5013 love
5384 love
6601 love
7210 love
7334 love
7429 love
7499 love
7847 love
8403 love
8555 love
9523 love
10566 love
10659 love
11038 love
12669 love
12963 love
12987 love
13002 love
13083 love
14296 love
14541 love
15010 love
16460 love
17357 love
19574 love
20256 love
20346 love
20479 love
20661 love
20720 love
20980 love
21712 love
21839 love
22004 love
22172 love
22553 love
22565 love
22569 love
24156 love
24176 love
24296 love
24302 love
27189 love
27196 love
27453 love
28658 love
29131 love
29245 love
30009 love
30411 love
30483 love
30683 love
30995 love
31250 love
31264 love
34557 love
34613 love
34618 love
36829 love
36852 love
37275 love
39393 love
39412 love
39563 love
42180 love
42460 love
42578 love
42581 love
42690 love
42840 love
42863 love
43077 love
43276 love
43298 love
43306 love
44093 love
44124 love
45225 love
454

We can also look for the 3 words that come before "love" and the 3 words that come after it.

In [None]:
word_positions = get_word_positions(["love"])

for word_position in word_positions:

  # Start the slice 3 tokens before "love" (never before position 0)
  start_pos = max(0, word_position - 3)
  # End the slice 3 tokens after "love" (the end of a slice is exclusive)
  end_pos = word_position + 4

  context_words = word_lookup[all_word_ids[start_pos:end_pos]]
  # Join the words together
  context_words = ' '.join(context_words)
  print(word_position, context_words)

1142 will win my love , he bear
1211 make known her love and then with
1941 dos ##t thou love hawk ##ing thou
2013 dos ##t thou love pictures we will
3080 father ’ s love and leave am
3582 of you both love katherine , because
3591 you well and love you well ,
3896 for i will love thee ne ’
4215 you — their love is not so
4250 yet for the love i bear my
4342 bianca ’ s love ) to labor
4643 it possible that love should of a
4686 the effect of love - in -
4786 the heart if love have touched you
4974 sir if you love the maid ,
5013 master , your love must live a
5384 so well i love luce ##nti ##o
6601 ##nti ##us ’ love , as old
7210 rivals in my love , su ##pp
7334 leisure to make love to her and
7429 rival of my love pet ##ru ##chio
7499 all books of love see that at
7847 to vent our love listen to me
8403 ##nti ##o i love no chi ##ders
8555 s the choice love of sign ##ior
9523 , for your love to her ,
10566 daughter ’ s love , what dowry
10659 is , her love , for that
11038 wen ##ch i lo

Let's make some functions that will help us get the context words around a certain word position for whatever size window (certain number of words before and after) that we want.

The first function `get_context()` will simply return the tokens without cleaning them, and the second function `get_context_clean()` will return the tokens in a more readable fashion.

In [None]:
def get_context(word_id, window_size=10):
  
  """Simply get the tokens that occur before and after word position"""

  start_pos = max(0, word_id - window_size) # The token where we will start the context view
  end_pos = min(word_id + window_size + 1, len(all_word_ids)) # The token where we will end the context view

  # Make a list called tokens and use word_lookup to get the words for the token IDs from the start position to the end position
  tokens = [word_lookup[word] for word in all_word_ids[start_pos:end_pos] ]
  
  context_words = " ".join(tokens)

  return context_words
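
The `max()` and `min()` calls in `get_context()` keep the window inside the collection. The sketch below, with a toy token array and a hypothetical helper `window_bounds`, shows what can go wrong near the start of the collection without that clamping: a negative start index counts from the end of the array.

```python
import numpy as np

toy_tokens = np.array(["a", "b", "c", "d", "e", "f"], dtype="O")

def window_bounds(position, window_size, n_tokens):
    """Clamp a context window so it never runs off either end of the collection."""
    start = max(0, position - window_size)
    end = min(position + window_size + 1, n_tokens)
    return start, end

# Near the start: an unclamped start of 1 - 3 = -2 counts from the END of the array
unclamped = toy_tokens[1 - 3 : 1 + 3 + 1]   # toy_tokens[-2:5] -> ['e']

start, end = window_bounds(1, 3, len(toy_tokens))
clamped = toy_tokens[start:end]             # toy_tokens[0:5] -> ['a', 'b', 'c', 'd', 'e']
print(list(unclamped), list(clamped))
```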

In [None]:
import re

def get_context_clean(word_id, window_size=10):
  
  """Get the tokens that occur before and after word position AND make them more readable"""

  keyword = word_lookup[all_word_ids[word_id]]
  start_pos = max(0, word_id - window_size) # The token where we will start the context view
  end_pos = min(word_id + window_size + 1, len(all_word_ids)) # The token where we will end the context view

  # Make a list called tokens and use word_lookup to get the words for the token IDs from the start position to the end position
  tokens = [word_lookup[word] for word in all_word_ids[start_pos:end_pos] ]
  
  # Make wordpieces slightly more readable
  # This is probably not the most efficient way to clean and correct for weird spacing
  context_words = " ".join(tokens)
  # Rejoin word pieces: drop the space before each "##" marker, then the marker itself
  context_words = re.sub(r'\s+(#)', r'\1', context_words)
  context_words = re.sub(r'##', r'', context_words)
  # Reattach contractions and punctuation (raw strings keep "\s" a regex escape)
  context_words = re.sub(r"\s+'s", "'s", context_words)
  context_words = re.sub(r"\s+'d", "'d", context_words)
  context_words = re.sub(r"\s'er", "'er", context_words)
  context_words = re.sub(r'\s+([-,:?.!;])', r'\1', context_words)
  context_words = re.sub(r'([-\'"])\s+', r'\1', context_words)
  context_words = re.sub(r"\s+'s", "'s", context_words)
  context_words = re.sub(r"\s+'d", "'d", context_words)

  # Bold the keyword by putting asterisks around it
  # re.escape() keeps keywords such as "(" from being read as regex syntax
  escaped_keyword = re.escape(keyword)
  if keyword in context_words:
    context_words = re.sub(rf"\b{escaped_keyword}\b", f"**{keyword}**", context_words)
    context_words = re.sub(rf"\b({escaped_keyword}[esdtrlying]+)\b", r"**\1**", context_words)

  return context_words
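
As a possible alternative to regular expressions for the word-piece step, the pieces can be glued back together before the tokens are joined into a string. This is a minimal sketch; `merge_word_pieces` is a hypothetical helper, and the toy token list is taken from the context output shown earlier.

```python
def merge_word_pieces(tokens):
    """Glue '##' continuation pieces back onto the token before them."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]   # strip the ## marker and append to the previous word
        else:
            words.append(token)
    return words

toy_tokens = ["so", "well", "i", "love", "luce", "##nti", "##o"]
merged = merge_word_pieces(toy_tokens)
print(" ".join(merged))
# so well i love lucentio
```

Working on the token list rather than the joined string avoids accidentally touching `#` characters that were part of the original text.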

To visualize the search keyword even more easily, we're going to import a couple of Python modules that will allow us to output text with bolded words and other styling. Here we will make a function `print_md()` that will allow us to print with Markdown styling.

In [None]:
from IPython.display import Markdown, display

def print_md(string):
    display(Markdown(string))

In [None]:
word_positions = get_word_positions(["love"])

for word_position in word_positions:

  print_md(f"<br> {word_position}:  {get_context_clean(word_position)} <br>")

<br> 1142:  tell him from me, as he will win my **love**, he bear himself with honorable action, such as <br>

<br> 1211:  humble wife may show her duty and make known her **love** and then with kind embracements, tempting kisses, <br>

<br> 1941:  studded all with gold and pearl dost thou **love** hawking thou hast hawks will soar above <br>

<br> 2013:  fleeter than the roe servingman dost thou **love** pictures we will fetch thee straight adonis painted by <br>

<br> 3080:  of great italy, and by my father ’ s **love** and leave am armed with his goodwill and thy good <br>

<br> 3582:  a husband for the elder if either of you both **love** katherine, because i know you well and **love** you <br>

<br> 3591:  both **love** katherine, because i know you well and **love** you well, leave shall you have to court her <br>

<br> 3896:  lease thee, good bianca, for i will **love** thee ne ’ er the less, my girl, <br>

<br> 4215:  good here ’ s none will hold you — their **love** is not so great, hortensio, <br>

<br> 4250:  ’ s dough on both sides farewell yet for the **love** i bear my sweet bianca, if i can by <br>

<br> 4342:  fair mistress and be happy rivals in bianca ’ s **love** ) to labor and effect one thing specially what ’ <br>

<br> 4643:  , sir, tell me, is it possible that **love** should of a sudden take such hold o tranio <br>

<br> 4686:  i stood looking on, i found the effect of **love**-in-idleness, and now in plain <br>

<br> 4786:  you now affection is not rated from the heart if **love** have touched you, naught remains but so: <br>

<br> 4974:  trance — i pray, awake, sir if you **love** the maid, bend thoughts and wits to achieve her <br>

<br> 5013:  father rid his hands of her, master, your **love** must live a maid at home, and therefore has <br>

<br> 5384:  to be lucentio, because so well i **love** lucentio tranio, be so, because <br>

<br> 6601:  she as foul as was florentius ’ **love**, as old as sibyl, and as <br>

<br> 7210:  more, suitors to her and rivals in my **love**, supposing it a thing impossible, for <br>

<br> 7334:  device at least, have leave and leisure to make **love** to her and unsuspected court her by <br>

<br> 7429:  grumio, it is the rival of my **love** petruchio, stand by awhile petruchio <br>

<br> 7499:  ll have them very fairly bound, all books of **love** see that at any hand, and see you read <br>

<br> 7847:  o, ’ tis now no time to vent our **love** listen to me, and if you speak me fair <br>

<br> 8403:  hand, i pray, as lucentio i **love** no chiders, sir biondello, let <br>

<br> 8555:  ’ ll know: that she ’ s the choice **love** of signior gremio that she ’ s <br>

<br> 9523:  dance barefoot on her wedding day and, for your **love** to her, lead apes in hell talk not to <br>

<br> 10566:  tell me, if i get your daughter ’ s **love**, what dowry shall i have with her to wife <br>

<br> 10659:  special thing is well obtained, that is, her **love**, for that is all in all why, that <br>

<br> 11038:  world, it is a lusty wench i **love** her ten times more than ere i did o <br>

<br> 12669:  that in a twink she won me to her **love** o, you are novices ’ tis a world <br>

<br> 12963:  , as lucentio and i am one that **love** bianca more than words can witness or your thoughts can <br>

<br> 12987:  lucentio youngling, thou canst not **love** so dear as i, as lucentio gray <br>

<br> 13002:  , as lucentio graybeard, thy **love** doth freeze, as lucentio but thin <br>

<br> 13083:  daughter greatest dower shall have my bianca ’ s **love** say, signior gremio, what can <br>

<br> 14296:  geia tellus, disguised thus to get your **love**, hic steterat, and that luce <br>

<br> 14541:  for my life the knave doth court my **love** pedascule, i ’ ll watch you better <br>

<br> 15010:  methinks he looks as though he were in **love** yet if thy thoughts, bianca, be so humble <br>

<br> 16460:  and lucentio exit but, sir, to **love** concerneth us to add her father ’ s liking <br>

<br> 17357:  treat me how you can now, if you **love** me, stay grumio, my horse a <br>

<br> 19574:  from the dresser and serve it thus to me that **love** it not there, take it to you, trench <br>

<br> 20256:  i read that i profess, the art to **love**, as cambio and may you prove, sir <br>

<br> 20346:  litio, as lucentio o despiteful **love**, unconstant womankind i tell thee <br>

<br> 20479:  so contented, forswear bianca and her **love** forever, as lucentio see how they kiss <br>

<br> 20661:  not their beauteous looks, shall win my **love**, and so i take my leave, in resolution <br>

<br> 20720:  i have ta ’ en you napping, gentle **love**, and have forsworn you with horte <br>

<br> 20980:  if he were the right vincentio take in your **love**, and then let me alone lucentio and <br>

<br> 21712:  these wants, he does it under name of perfect **love**, as who should say, if i should sleep <br>

<br> 21839:  piece of beef and mustard a dish that i do **love** to feed upon ay, but the mustard is <br>

<br> 22004:  uck up thy spirits look cheerfully upon me here, **love**, thou seest how diligent i am <br>

<br> 22172:  nsio prepare to eat and now, my honey **love**, will we return unto thy father ’ s house <br>

<br> 22553:  a bauble, a silken pie i **love** thee well in that thou lik ’ st it <br>

<br> 22565:  well in that thou lik ’ st it not **love** me, or **love** me not, i like the <br>

<br> 22569:  lik ’ st it not **love** me, or **love** me not, i like the cap, and it <br>

<br> 24156:  o made me acquainted with a weighty cause of **love** between your daughter and himself and, for the good <br>

<br> 24176:  good report i hear of you, and for the **love** he beareth to your daughter and she to him <br>

<br> 24296:  it is your son lucentio here doth **love** my daughter, and she loveth him, or <br>

<br> 24302:  o here doth **love** my daughter, and she loveth him, or both dissemble deeply their <br>

<br> 27189:  cambio cambio is changed into lucentio **love** wrought these miracles bianca ’ s **love** made me exchange <br>

<br> 27196:  lucentio **love** wrought these miracles bianca ’ s **love** made me exchange my state with tranio, while <br>

<br> 27453:  thee a kiss she kisses him now pray thee, **love**, stay she kisses him is not this well come <br>

<br> 28658:  bodes marry, peace it bodes, and **love**, and quiet life, an awful rule, and <br>

<br> 29131:  and craves no other tribute at thy hands but **love**, fair looks, and true obedience — too little <br>

<br> 29245:  , and sway when they are bound to serve, **love**, and obey why are our bodies soft and weak <br>

<br> 30009:  s a good fellow, and on my word i **love** him: but to think a fit match for this <br>

<br> 30411:  if your affections be not made of words i **love** you, and you know how dearly rowland, <br>

<br> 30483:  why should we do our honest and our hearty **love** such wrong, to overrun our fortunes then you flat <br>

<br> 30683:  high priest among the jews: his money rowland oh **love** forgive me, what faith hast thou why, <br>

<br> 30995:  hang itself, should i but cross it for pure **love** to the matter i must hatch it nay never look <br>

<br> 31250:  fears and modest blushes, view me, and **love** example here is your sister here is the brave old <br>

<br> 31264:  your sister here is the brave old man ’ s **love** that **loves** the young man ay and hold thee <br>

<br> 34557:  credit you maria, come down, and let your **love** confirm it stay there sir, that bargain ’ s <br>

<br> 34613:  stock a kingdom why this is a riddle: i **love** you, and i **love** you not it is so <br>

<br> 34618:  is a riddle: i **love** you, and i **love** you not it is so: and till your own <br>

<br> 36829:  ear and exit how ’ s this i do not **love** these favors: save you the devil take thee — <br>

<br> 36852:  him by th ’ nose oh there ’ s a **love** token for you: thank me now i ’ ll <br>

<br> 37275:  violent neither — it may be out of her earnest **love**, there grew a longing ( as you know women <br>

<br> 39393:  her i do beseech you, even for **love** sake — i will rowland she may sooner count the <br>

<br> 39412:  count the good i have thought her, our old **love** and our friendship, shed one true tear, mean <br>

<br> 39563:  that i grant her out of my free and liberal **love**, a pardon, which you and all men else <br>

<br> 42180:  repentance, and undoing can win her **love**, i ’ ll make a shift for one when <br>

<br> 42460:  counsel i shall hang first i ’ ll no more **love**, that ’ s certain, ’ tis a bane <br>

<br> 42578:  him of a man, and could be brought to **love**, and **love** a woman, ’ twould make <br>

<br> 42581:  man, and could be brought to **love**, and **love** a woman, ’ twould make his head ache <br>

<br> 42690:  wilt thou rowland as ’ tis to be in **love** and why for virtue sake and why for virtue ’ <br>

<br> 42840:  best a swabber, if thou canst **love** so near to keep thy making, yet thou wil <br>

<br> 42863:  thy language why o tranio, those things in **love**, ne ’ er talk as we do, no <br>

<br> 43077:  io wilt thou rowland, certain ne ’ er **love** again i think so, certain, and if i <br>

<br> 43276:  ’ th ’ wager that ’ s all one **love** you as much, or more, than she now <br>

<br> 43298:  you ’ tis a good hearing, let ’ em **love**: ten pound more, i never **love** that woman <br>

<br> 43306:  ’ em **love**: ten pound more, i never **love** that woman there it is; and so an hundred <br>

<br> 44093:  pray you tell me one thing truly; do you **love** her i would i did not, upon that condition <br>

<br> 44124:  , her modesty required a little violence some women **love** to struggle she had it, and so much that <br>

<br> 45225:  , nor obedience in way of duty, but of **love**, and credit; all i expect is but a <br>

<br> 45474:  pleasure now you perceive him sophocles i **love** thee above thy vanity, thou faithless creature would <br>

<br> 45769:  doting had i not wife enough to turn my **love** too did i want vexation, or any <br>

<br> 46092:  still; i would yet remember you give him his **love** wench; the young man has employment for ’ <br>

<br> 46231:  must do so sometimes, and oftentimes; **love** were too serious else a witty woman had you **loved** <br>

<br> 46261:  ; and i had **loved** you so: you may **love** worse sir, but that is not material i shall <br>

<br> 46420:  sorrow may induce me to forgive you, but never **love** again; if i stay longer, i have lost <br>

<br> 47092:  ll to the lodge; some that are kind and **love** me, i know will visit me petruchio <br>

<br> 47522:  your office and what he wants, if money, **love**, or labor, or any way may win it <br>

<br> 47755:  you were sir look to yourselves, and if you **love** your lives, open the door, and fly me <br>

<br> 48313:  enter moroso and petronius that i do **love** her, is without all question, and most extremely <br>

<br> 48774:  tis handsome, and i know moreover i am to **love** her for ’ t now you come to me nay <br>

<br> 50100:  falser than the devil, i cannot choose but **love** it what do i know but those that came to <br>

<br> 50313:  willinger the law commands me to do it, **love** commands me and my own duty charges me heaven bless <br>

<br> 51204:  for ever never to be recalled: i know you **love** me, mad till you have enjoyed me; i <br>

<br> 51631:  fear the gallows keep thee there still and you **love** rowland say if i say not i am sure i <br>

<br> 52411:  ’ ll no by-blowes if you can **love** her do, if you can hate her, or <br>

<br> 53938:  thus i blow off, the care i took to **love** her, like this point i untie, and <br>

<br> 54089:  uses, and now i am for travel now i **love** you, and now i see you are a man <br>

<br> 54260:  ard urge my strong tie upon you: but i **love** you, and all the world shall know it, <br>

<br> 54324:  ability, and strength of judgement, than any private **love**, or wanton kisses go worthy man, and <br>

<br> 54867:  t fool me in again not i sir, i **love** you better, take your time, and pleasure i <br>

<br> 56582:  but farewell that, we must be wiser cousin **love** must not leave us to the world: have you <br>

<br> 56956:  must deliver it there livia, and a better **love** light on thee, i can no more to this <br>

<br> 56990:  set me up; there rowland, all thy old **love** back: and may a new to come exceed mine <br>

<br> 57940:  monument of what i have had, thou all the **love** now left me, and now lost, let me <br>

<br> 58917:  you dare you kiss me thus i begin my new **love** once again with all my heart once again maria o <br>

Here we make a list of all the context views for our keyword.

In [None]:
word_positions = get_word_positions(["love"])

keyword_contexts = []
keyword_contexts_tokens = []

for position in word_positions:

  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

## **Get word vectors and reduce them with PCA**

Finally, we don't just want to *read* all the instances of "love" in the collection; we want to *measure* the similarity of all the instances of "love."

To measure similarity between all the instances of "love," we will take the vectors for each instance and then use PCA to reduce each 768-dimensional vector to the 2 dimensions that capture the most variation.

In [None]:
from sklearn.decomposition import PCA

word_positions = get_word_positions(["love"])

pca = PCA(n_components=2)

pca.fit(all_word_vectors[word_positions,:].T)

PCA(n_components=2)

Then, for convenience, we will put these PCA results into a Pandas DataFrame, which we will use to generate an interactive plot.

In [None]:
df_2 = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens})
df_2.head()

Unnamed: 0,x,y,context,tokens
0,-0.095638,-0.081126,"tell him from me, as he will win my **love**, he bear himself with honorable action, such as","tell him from me , as he will win my love , he bear himself with honorable action , such as"
1,-0.089068,-0.119902,"humble wife may show her duty and make known her **love** and then with kind embracements, tempting kisses,","humble wife may show her duty and make known her love and then with kind embrace ##ments , tempting kisses ,"
2,-0.094583,0.053891,studded all with gold and pearl dost thou **love** hawking thou hast hawks will soar above,stud ##ded all with gold and pearl dos ##t thou love hawk ##ing thou has ##t hawks will so ##ar above
3,-0.094898,0.009634,fleeter than the roe servingman dost thou **love** pictures we will fetch thee straight adonis painted by,fleet ##er than the roe serving ##man dos ##t thou love pictures we will fetch thee straight ad ##onis painted by
4,-0.089672,-0.060323,"of great italy, and by my father ’ s **love** and leave am armed with his goodwill and thy good","of great italy , and by my father ’ s love and leave am armed with his goodwill and thy good"
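
The PCA cell earlier in this section fits PCA on the transposed matrix and reads the coordinates from `pca.components_`. For comparison, a more conventional route to 2-D coordinates is to center the vectors and project each one onto the top two principal directions. The sketch below does that with NumPy on random stand-in vectors (not the notebook's data); `project_to_2d` and `toy_vectors` are hypothetical names introduced only for illustration.

```python
import numpy as np

def project_to_2d(vectors):
    """Center the rows of `vectors` and project them onto the top two principal directions."""
    X = np.asarray(vectors, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

# Stand-in for 10 occurrences of a keyword, each a 768-dimensional vector (assumption: random data)
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(10, 768))

coords = project_to_2d(toy_vectors)
print(coords.shape)
```

With this approach each row of `coords` is one occurrence of the word, so the two columns can be used as `x` and `y` in the same kind of DataFrame.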


## **Match context with original text and metadata** 

It's helpful (and fun!) to know where each instance of a word actually comes from: which play, which speaker, which act. The easiest method we've found for matching a bit of context with its original line and metadata is to 1) add a tokenized version of each line to our original Pandas DataFrame, 2) check whether the context shows up in a line, and 3) if so, grab that line's metadata.

In [None]:
# Tokenize all the poems
tokenized_poems = tokenizer(drama_texts, truncation=True, padding=True, return_tensors="pt")

# Get a list of all the tokens for each poem
all_tokenized_poems = []
for i in range(len(tokenized_poems['input_ids'])):
  all_tokenized_poems.append(' '.join(tokenized_poems[i].tokens))

# Add them to the original DataFrame
df['tokens'] = all_tokenized_poems

In [None]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,index,id,speaker,direction_type,position,act,scene,text,gender,...,power_agent,power_equal,power_theme,speaker_gender,title,character_list,sd_gender,male_gender_count,female_gender_count,tokens
0,0,0,stg-0000,,entrance,0,induction,induction,Enter Beggar ( Christopher Sly ) and Hostess .,,...,0.0,0.0,0.0,,The Taming of the Shrew,,,,,[CLS] enter beg ##gar ( christopher sly ) and hostess [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD...
1,1,1,sp-0001,Sly,,1,induction,induction,"I’ll feeze you , in faith .",M,...,0.0,0.0,0.0,M,The Taming of the Shrew,,,,,"[CLS] i ’ ll fee ##ze you , in faith [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA..."
2,2,2,sp-0002,Hostess,,2,induction,induction,"A pair of stocks , you rogue !",F,...,0.0,0.0,0.0,F,The Taming of the Shrew,,,,,"[CLS] a pair of stocks , you rogue [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]..."


In [None]:
df

Unnamed: 0.1,Unnamed: 0,index,id,speaker,direction_type,position,act,scene,text,gender,...,power_agent,power_equal,power_theme,speaker_gender,title,character_list,sd_gender,male_gender_count,female_gender_count,tokens
0,0,0,stg-0000,,entrance,0,induction,induction,Enter Beggar ( Christopher Sly ) and Hostess .,,...,0.0,0.0,0.0,,The Taming of the Shrew,,,,,[CLS] enter beg ##gar ( christopher sly ) and hostess [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD...
1,1,1,sp-0001,Sly,,1,induction,induction,"I’ll feeze you , in faith .",M,...,0.0,0.0,0.0,M,The Taming of the Shrew,,,,,"[CLS] i ’ ll fee ##ze you , in faith [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA..."
2,2,2,sp-0002,Hostess,,2,induction,induction,"A pair of stocks , you rogue !",F,...,0.0,0.0,0.0,F,The Taming of the Shrew,,,,,"[CLS] a pair of stocks , you rogue [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]..."
3,3,3,sp-0003,Sly,,3,induction,induction,"You’re a baggage ! The Slys are no rogues . Look in the chronicles . We came in with Richard Conqueror . Therefore , paucas pallabris , let the world slide . Sessa !",M,...,0.0,0.0,0.0,M,The Taming of the Shrew,,,,,[CLS] you ’ re a baggage [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PA...
4,4,3,sp-0003,Sly,,3,induction,induction,"You’re a baggage ! The Slys are no rogues . Look in the chronicles . We came in with Richard Conqueror . Therefore , paucas pallabris , let the world slide . Sessa !",M,...,0.0,0.0,0.0,M,The Taming of the Shrew,,,,,[CLS] the sly ##s are no rogue ##s [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1721,1721,1417,sp-3396,Rowland,,1417,5,4,There shall not want my labor sir : your money ; Here’s one has undertaken .,,...,0.0,0.0,2.0,M,Tamer Tamed,[],[],0.0,0.0,[CLS] there shall not want my labor sir : your money ; here ’ s one has undertaken [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]...
1722,1722,1418,sp-3398,Tranio,,1418,5,4,"Well , I’ll trust her , And glad I have so good a pawn .",,...,1.0,0.0,1.0,M,Tamer Tamed,[],[],0.0,0.0,"[CLS] well , i ’ ll trust her , and glad i have so good a pawn [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [..."
1723,1723,1419,sp-3400,Rowland,,1419,5,4,I’ll watch ye .,,...,1.0,0.0,0.0,M,Tamer Tamed,[],[],0.0,0.0,[CLS] i ’ ll watch ye [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] ...
1724,1724,1420,sp-3401,Petruchio,,1420,5,4,"Let’s in , and drink of all hands , and be jovial : I have my colt again , and now she carries ; And Gentlemen , whoever marries next , Let him be sure he keep him to his Text .",,...,1.0,1.0,0.0,M,Tamer Tamed,[],[],0.0,0.0,"[CLS] let ’ s in , and drink of all hands , and be jo ##vial : i have my colt again , and now she carries ; and gentlemen , whoever marries next , let him be sure he keep him to his text [SEP] [PA..."


In [None]:
df_2

Unnamed: 0,x,y,context,tokens
0,-0.095638,-0.081126,"tell him from me, as he will win my **love**, he bear himself with honorable action, such as","tell him from me , as he will win my love , he bear himself with honorable action , such as"
1,-0.089068,-0.119902,"humble wife may show her duty and make known her **love** and then with kind embracements, tempting kisses,","humble wife may show her duty and make known her love and then with kind embrace ##ments , tempting kisses ,"
2,-0.094583,0.053891,studded all with gold and pearl dost thou **love** hawking thou hast hawks will soar above,stud ##ded all with gold and pearl dos ##t thou love hawk ##ing thou has ##t hawks will so ##ar above
3,-0.094898,0.009634,fleeter than the roe servingman dost thou **love** pictures we will fetch thee straight adonis painted by,fleet ##er than the roe serving ##man dos ##t thou love pictures we will fetch thee straight ad ##onis painted by
4,-0.089672,-0.060323,"of great italy, and by my father ’ s **love** and leave am armed with his goodwill and thy good","of great italy , and by my father ’ s love and leave am armed with his goodwill and thy good"
...,...,...,...,...
114,-0.089260,-0.056783,"but farewell that, we must be wiser cousin **love** must not leave us to the world: have you","but farewell that , we must be wise ##r cousin love must not leave us to the world : have you"
115,-0.093804,-0.041754,"must deliver it there livia, and a better **love** light on thee, i can no more to this","must deliver it there liv ##ia , and a better love light on thee , i can no more to this"
116,-0.095528,-0.084268,"set me up; there rowland, all thy old **love** back: and may a new to come exceed mine","set me up ; there rowland , all thy old love back : and may a new to come exceed mine"
117,-0.094409,-0.068817,"monument of what i have had, thou all the **love** now left me, and now lost, let me","monument of what i have had , thou all the love now left me , and now lost , let me"


In [None]:
def find_original_poem(rows):

  """This function checks to see whether the context tokens show up in a tokenized line of the plays,
  and if so, returns the title, speaker gender, speaker, and act for that line"""

  # Remove the bold markers, then take a short character slice of the context to search for
  text = rows['tokens'].replace('**', '')
  text = text[55:70]

  # Find every line whose tokens contain the slice (plain substring match, not a regex)
  matches = df[df['tokens'].str.contains(text, regex=False)]

  if len(matches) > 0:
    # Use the first matching line and look up its metadata by column name
    row = matches.iloc[0]
    return row['title'], row['speaker_gender'], row['speaker'], row['act']
  else:
    return None, None, None, None

In [None]:
df_2[['title', 'speaker_gender', 'speaker', 'act']] = df_2.apply(find_original_poem, axis='columns', result_type='expand')

In [None]:
df_2

Unnamed: 0,x,y,context,tokens,title,speaker_gender,speaker,act
0,-0.095638,-0.081126,"tell him from me, as he will win my **love**, he bear himself with honorable action, such as","tell him from me , as he will win my love , he bear himself with honorable action , such as",The Taming of the Shrew,M,Lord,induction
1,-0.089068,-0.119902,"humble wife may show her duty and make known her **love** and then with kind embracements, tempting kisses,","humble wife may show her duty and make known her love and then with kind embrace ##ments , tempting kisses ,",The Taming of the Shrew,M,Lord,induction
2,-0.094583,0.053891,studded all with gold and pearl dost thou **love** hawking thou hast hawks will soar above,stud ##ded all with gold and pearl dos ##t thou love hawk ##ing thou has ##t hawks will so ##ar above,,,,
3,-0.094898,0.009634,fleeter than the roe servingman dost thou **love** pictures we will fetch thee straight adonis painted by,fleet ##er than the roe serving ##man dos ##t thou love pictures we will fetch thee straight ad ##onis painted by,,,,
4,-0.089672,-0.060323,"of great italy, and by my father ’ s **love** and leave am armed with his goodwill and thy good","of great italy , and by my father ’ s love and leave am armed with his goodwill and thy good",The Taming of the Shrew,M,Lucentio,1
...,...,...,...,...,...,...,...,...
114,-0.089260,-0.056783,"but farewell that, we must be wiser cousin **love** must not leave us to the world: have you","but farewell that , we must be wise ##r cousin love must not leave us to the world : have you",Tamer Tamed,F,Livia,1
115,-0.093804,-0.041754,"must deliver it there livia, and a better **love** light on thee, i can no more to this","must deliver it there liv ##ia , and a better love light on thee , i can no more to this",Tamer Tamed,M,Rowland,1
116,-0.095528,-0.084268,"set me up; there rowland, all thy old **love** back: and may a new to come exceed mine","set me up ; there rowland , all thy old love back : and may a new to come exceed mine",Tamer Tamed,F,Livia,1
117,-0.094409,-0.068817,"monument of what i have had, thou all the **love** now left me, and now lost, let me","monument of what i have had , thou all the love now left me , and now lost , let me",Tamer Tamed,M,Rowland,3
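
Rows 2 and 3 in the output above come back without metadata. One plausible explanation, which we have not verified against the full data, is that the 15-character slice `text[55:70]` straddles the boundary between two rows of `df`, so it never appears inside any single tokenized line. The toy sketch below illustrates this failure mode; `match_context` and the two example lines are made up for illustration and use shorter slices than the notebook does.

```python
def match_context(context_tokens, tokenized_lines, start, end):
    """Return the index of the first line containing the character slice, or None."""
    probe = context_tokens.replace('**', '')[start:end]
    for i, line in enumerate(tokenized_lines):
        if probe in line:
            return i
    return None

# Two hypothetical tokenized lines, stored separately as in the DataFrame
tokenized_lines = [
    "[CLS] dos ##t thou love hawk ##ing [SEP]",
    "[CLS] thou has ##t hawks will so ##ar above [SEP]",
]

# The slice falls entirely inside the first line, so it is matched
inside = "dos ##t thou **love** hawk ##ing"
found = match_context(inside, tokenized_lines, 8, 20)

# The slice runs from the end of the first line into the second, so nothing matches
straddling = "thou love hawk ##ing thou has ##t hawks"
missed = match_context(straddling, tokenized_lines, 10, 26)

print(found, missed)
```

Checking `df_2['title'].isna().sum()` after the `apply` step is a quick way to see how many contexts went unmatched.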


In [None]:
df_2.to_csv('df_2_t+t2.csv')

In [None]:
df_2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/honors thesis/df_2_t+t2.csv')

In [None]:
# Normalize the play title so both spellings share one label
df_2.loc[df_2['title'] == 'Taming of the Shrew', 'title'] = 'The Taming of the Shrew'

In [None]:
df_2

Unnamed: 0.1,Unnamed: 0,x,y,context,tokens,word,title,speaker_gender,speaker,act,Unnamed: 10
0,0,-0.069706,0.155902,me tonight player so please your lordship to accept our **duty** with all my heart this fellow i remember since once,me tonight player so please your lordship to accept our duty with all my heart this fellow i remember since once,duty,The Taming of the Shrew,M,Lord/?,,
1,1,-0.086176,-0.025224,"tell him from me, as he will win my **love**, he bear himself with honorable action, such as","tell him from me , as he will win my love , he bear himself with honorable action , such as",love,The Taming of the Shrew,M,Lord,induction,
2,2,-0.069524,0.168127,"noble ladies unto their lords, by them accomplished such **duty** to the drunkard let him do with soft low","noble ladies unto their lords , by them accomplished such duty to the drunk ##ard let him do with soft low",duty,The Taming of the Shrew,M,Lord,induction,
3,3,-0.071334,0.156779,wherein your lady and your humble wife may show her **duty** and make known her love and then with kind embrace,wherein your lady and your humble wife may show her duty and make known her love and then with kind embrace,duty,The Taming of the Shrew,M,Lord,induction,
4,4,-0.081141,-0.001660,"humble wife may show her duty and make known her **love** and then with kind embracements, tempting kisses,","humble wife may show her duty and make known her love and then with kind embrace ##ments , tempting kisses ,",love,The Taming of the Shrew,M,Lord,induction,
...,...,...,...,...,...,...,...,...,...,...,...
155,155,-0.085015,-0.021282,"must deliver it there livia, and a better **love** light on thee, i can no more to this","must deliver it there liv ##ia , and a better love light on thee , i can no more to this",love,Tamer Tamed,M,Rowland,1,
156,156,-0.086660,-0.013430,"set me up; there rowland, all thy old **love** back: and may a new to come exceed mine","set me up ; there rowland , all thy old love back : and may a new to come exceed mine",love,Tamer Tamed,F,Livia,1,
157,157,-0.084693,-0.033413,"monument of what i have had, thou all the **love** now left me, and now lost, let me","monument of what i have had , thou all the love now left me , and now lost , let me",love,Tamer Tamed,M,Rowland,3,
158,158,-0.078671,-0.016835,you dare you kiss me thus i begin my new **love** once again with all my heart once again maria o,you dare you kiss me thus i begin my new love once again with all my heart once again maria o,love,Tamer Tamed,F,Maria,,


## **Plot word embeddings**

Lastly, we will plot the word vectors from this DataFrame with the Python data viz library [Altair](https://altair-viz.github.io/gallery/scatter_tooltips.html).

In [None]:
import altair as alt

In [None]:
alt.Chart(df_2,title="Word Similarity: Love").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y", color='title',
    # The categories that show up in the hover tooltip
    tooltip=['title', 'speaker_gender', 'speaker', 'act', 'context']
    ).interactive().properties(
    width=500,
    height=500
)




In [None]:
alt.Chart(df_2,title="Word Similarity: Love").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y", color='speaker_gender',
    # The categories that show up in the hover tooltip
    tooltip=['title', 'speaker_gender', 'speaker', 'act', 'context']
    ).interactive().properties(
    width=500,
    height=500
)




In [None]:
# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
df_plot = df_2.rename(columns={'title': 'poem_title', 'context': 'title'})

alt.Chart(df_plot,title="Word Similarity: Love").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y", color='poem_title',
    # The categories that show up in the hover tooltip
    tooltip=['poem_title', 'speaker_gender', 'speaker', 'act', 'title']
    ).interactive().properties(
    width=500,
    height=500
)

In [None]:
alt.Chart(df_2,title="Word Similarity: Love").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y", 
    # The categories that show up in the hover tooltip
    tooltip=['title', 'speaker_gender', 'speaker', 'act', 'context']
    ).interactive().properties(
    width=500,
    height=500
)



## **Plot word embeddings from keywords (all at once!)**

We can put the code from the previous few sections into a single cell and plot the BERT word embeddings for any list of words. Let's look at words such as "love," "obedience," "serve," "duty," "husband," and "wife."

In [None]:
# List of keywords that you want to compare
keywords = ['love', 'obedience', 'serve','hit','strike','bound','servant','husband','wife','wives','duty','husbands','modest','virginity']

# How to color the points in the plot: 'word', 'poem_title', or 'speaker_gender'
color_by = 'word'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_2 = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens, "word": words})

# Match original text and metadata
df_2[['title','speaker_gender', 'speaker', 'act']] = df_2.apply(find_original_poem, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
df_2 = df_2.rename(columns={'title': 'poem_title', 'context': 'title'})


# Make the plot
alt.Chart(df_2, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    tooltip=['title','speaker_gender', 'speaker', 'act', 'poem_title']
    ).interactive().properties(
    width=500,
    height=500
)
    
# do the same but separate out the two texts?

In [None]:
# List of keywords that you want to compare
keywords = ['love', 'obedience', 'serve','hit','strike','bound','servant','husband','wife','duty','modest','virginity']

# How to color the points in the plot: 'word', 'poem_title', or 'speaker_gender'
color_by = 'word'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_2 = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens, "word": words})

# Match original text and metadata
df_2[['title', 'speaker_gender', 'speaker', 'act']] = df_2.apply(find_original_poem, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
df_2 = df_2.rename(columns={'title': 'poem_title', 'context': 'title'})


# Make the plot
alt.Chart(df_2, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    tooltip=['title','speaker_gender', 'speaker', 'act', 'poem_title']
    ).interactive().properties(
    width=500,
    height=500
)
    
# do the same but separate out the two texts?

In [None]:
# List of keywords that you want to compare
keywords = ['love', 'obedience', 'serve','hit','strike','bound','servant','husband','wife','wives','duty']

# How to color the points in the plot: 'word', 'poem_title', or 'speaker_gender'
color_by = 'poem_title'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_2 = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens, "word": words})

# Match original text and metadata
df_2[['title', 'speaker_gender', 'speaker', 'act']] = df_2.apply(find_original_poem, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
df_2 = df_2.rename(columns={'title': 'poem_title', 'context': 'title'})


# Make the plot
alt.Chart(df_2, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    tooltip=['title','speaker_gender', 'speaker', 'act', 'poem_title']
    ).interactive().properties(
    width=500,
    height=500
)
    
# do the same but separate out the two texts?

In [None]:
# List of keywords that you want to compare
keywords = ['love', 'serve', 'equal','duty']

# How to color the points in the plot: 'word', 'poem_title', or 'speaker_gender'
color_by = 'speaker_gender'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_2 = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens, "word": words})

# Match original text and metadata
df_2[['title', 'speaker_gender', 'speaker', 'act']] = df_2.apply(find_original_poem, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
df_2 = df_2.rename(columns={'title': 'poem_title', 'context': 'title'})


# Make the plot
alt.Chart(df_2, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    tooltip=['title','speaker_gender', 'speaker', 'act', 'poem_title']
    ).interactive().properties(
    width=500,
    height=500
)
    
# TODO: do the same but separate out the two texts?

In [None]:
# List of keywords that you want to compare
keywords = ['love', 'serve', 'equal','duty']

# How to color the points in the plot. Other columns of df_2, such as "speaker_gender" or "speaker", also work
color_by = 'word'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_2 = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens, "word": words})

# Match original text and metadata
df_2[['title', 'speaker_gender', 'speaker', 'act']] = df_2.apply(find_original_poem, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
df_2 = df_2.rename(columns={'title': 'poem_title', 'context': 'title'})


# Make the plot
alt.Chart(df_2, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    tooltip=['title','speaker_gender', 'speaker', 'act', 'poem_title']
    ).interactive().properties(
    width=500,
    height=500
)
    
# TODO: do the same but separate out the two texts?

Let's examine the words "nature," "religion," "science," and "art" again but this time color the points by their time period.

In [None]:
# List of keywords that you want to compare
keywords = ['nature', 'religion', 'science', 'art']

# How to color the points in the plot
color_by = 'period' 

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens, "word": words})
# Match original text and metadata
df[['title', 'author', 'period', 'link']] = df.apply(find_original_poem, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
df = df.rename(columns={'title': 'poem_title', 'context': 'title'})

# Make the plot
alt.Chart(df, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color=color_by,
    href="link",
    tooltip=['title', 'word', 'poem_title', 'author', 'period']
    ).interactive().properties(
    width=500,
    height=500
)

IndexError: ignored

Let's compare the words "mean," "thin," "average," and "cruel."

In [None]:
# List of keywords that you want to compare
keywords = ['mean', 'thin', 'average', 'cruel']

# How to color the points in the plot. The other option is "period" for time period
color_by = 'word'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens, "word": words})
# Match original text and metadata
df[['title', 'author', 'period', 'link']] = df.apply(find_original_poem, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
df = df.rename(columns={'title': 'poem_title', 'context': 'title'})

# Make the plot
alt.Chart(df, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    href="link",
    tooltip=['title', 'word', 'poem_title', 'author', 'period']
    ).interactive().properties(
    width=500,
    height=500
)
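
The cells above fit PCA on the transposed matrix and read the 2-D coordinates out of `pca.components_`. For comparison, here is a minimal sketch of the more conventional row-wise formulation, in which each row is one word occurrence and is projected onto the top two principal directions. It uses NumPy's SVD on random stand-in data rather than scikit-learn and the real `all_word_vectors`. The function name `project_2d` and the toy array are hypothetical, and the resulting coordinates are not expected to match the ones plotted above, because the two approaches center the data differently.

```python
import numpy as np

def project_2d(vectors):
    """Project row vectors onto their top two principal components (PCA via SVD)."""
    # Center each feature (column) so the components capture variance, not the mean
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates of every row along the first two directions
    return centered @ vt[:2].T

# Toy stand-in for all_word_vectors[word_positions, :]: 50 occurrences, 768 dimensions
rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(50, 768))

coords = project_2d(toy_vectors)
print(coords.shape)  # one (x, y) pair per word occurrence
```

With this layout, `coords[:, 0]` and `coords[:, 1]` could fill the "x" and "y" columns of the DataFrame directly.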

Let's compare the words "head," "heart," "eye," "arm," and "leg."

In [None]:
# List of keywords that you want to compare
keywords = ['head', 'heart', 'eye', 'arm', 'leg']

# How to color the points in the plot. The other option is "period" for time period
color_by = 'word'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_2 = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens, "word": words})
# Match original text and metadata (column names chosen to match the tooltip and href fields used in the chart below)
df_2[['title', 'author', 'period', 'link']] = df_2.apply(find_original_poem, axis='columns', result_type='expand')

# Rename columns so that the context shows up as the "title" in the tooltip (bigger and bolded)
df_2 = df_2.rename(columns={'title': 'poem_title', 'context': 'title'})

# Make the plot
alt.Chart(df_2, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    href="link",
    tooltip=['title', 'word', 'poem_title', 'author', 'period']
    ).interactive().properties(
    width=500,
    height=500
)

In [None]:
# List of keywords that you want to compare
keywords = ['ring']

# How to color the points in the plot. Other columns of df_2, such as "speaker_gender" or "speaker", also work
color_by = 'word'

# Get all word positions
word_positions = get_word_positions(keywords)

# Get all contexts around the words
keyword_contexts = []
keyword_contexts_tokens = []
words = []

for position in word_positions:
  words.append(word_lookup[all_word_ids[position]])
  keyword_contexts.append(get_context_clean(position))
  keyword_contexts_tokens.append(get_context(position))

# Reduce word vectors with PCA
pca = PCA(n_components=2)
pca.fit(all_word_vectors[word_positions,:].T)

# Make a DataFrame with PCA results
df_2 = pd.DataFrame({"x": pca.components_[0,:], "y": pca.components_[1,:],
                   "context": keyword_contexts, "tokens": keyword_contexts_tokens, "word": words})
# Match original text and metadata
df_2[['title', 'speaker_gender', 'speaker', 'act']] = df_2.apply(find_original_poem, axis='columns', result_type='expand')

# Make the plot
alt.Chart(df_2, title=f"Word Similarity: {', '.join(keywords).title()}").mark_circle(size=200).encode(
    alt.X('x',
        scale=alt.Scale(zero=False)
    ), y="y",
    color= color_by,
    tooltip=['title', 'speaker_gender', 'speaker', 'act','context']
    ).interactive().properties(
    width=500,
    height=500
)

Write the "ring" results to a CSV file.

In [None]:
# Save the DataFrame built for "ring" in the cell above
df_2.to_csv('bert-word-ring.csv', index=False, encoding='utf-8')

## **Find word similarity from a specific word position**

We can also search *all* of the vectors for the tokens that are most similar to a query word as it is used in one particular context.

In [None]:
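def get_nearest(query_vector, n=100):
  # Dot products equal cosine similarities only if the vectors are unit-length (assumed here)
  cosines = all_word_vectors.dot(query_vector)
  # Sort positions from most to least similar and keep the top n
  ordering = np.flip(np.argsort(cosines))
  return ordering[:n]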

To do so, we need to find the specific word position of our desired search keyword.

In [None]:
word_positions = get_word_positions(['bank'])

for word_position in word_positions:

  print_md(f"<br> {word_position}: {get_context_clean(word_position)} <br>")

> 897288: , the defendant discovered a widow with gold in the bank and the plaintiff was left in the cold. an

In [None]:
keyword_position = 897288

In [None]:
contexts = [get_context_clean(token_id) for token_id in get_nearest(all_word_vectors[keyword_position,:])]

for context in contexts:
  print_md(context)
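
`get_nearest` ranks positions by dot product, which matches cosine similarity only when every row of `all_word_vectors` has length 1. As a minimal sketch on made-up data (the array `toy_vectors` and the helper `nearest_by_cosine` are hypothetical, not part of the notebook), the same search can be written with the normalization made explicit:

```python
import numpy as np

def nearest_by_cosine(vectors, query_vector, n=3):
    """Return the row indices of the n vectors most similar to query_vector."""
    # Scale every row (and the query) to unit length so dot product == cosine similarity
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query_vector / np.linalg.norm(query_vector)
    cosines = unit_rows @ unit_query
    # argsort is ascending, so reverse it to put the most similar rows first
    return np.argsort(cosines)[::-1][:n]

toy_vectors = np.array([
    [1.0, 0.0],    # row 0: the query itself
    [10.0, 1.0],   # row 1: long vector, nearly the same direction as row 0
    [0.0, 1.0],    # row 2: orthogonal to row 0
    [-1.0, 0.0],   # row 3: opposite direction
])

nearest = nearest_by_cosine(toy_vectors, toy_vectors[0], n=3)
print(nearest)  # [0 1 2]
```

The query's own row comes back first, since every vector has cosine similarity 1 with itself. Row 1 ranks second despite being ten times longer, which is the point of normalizing: only direction matters.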