## Exploring the Embedding and Similarity Data

In an effort to be more Pythonic, I am separating the schema setup from the data analysis side of this project. I know this isn't exactly elite in terms of best coding practices, but I'm at least making an effort :).

I may return to the old script, or create something similarly structured, to work in the realm of paraphrase, but I have not yet decided on that.

A note on file sizes: `EMB_sentences_0` and `EMB_sentences_1` are quite large (slightly under 300 MB). I wanted to save the embeddings somehow, because calculating the sentence embeddings is the most time-consuming part of the process, but files of this size may present a memory problem. The `EMB_documents` and `EMB_sentences_2` files are much smaller, but using them requires, at a minimum, `EMB_sentences_1` for reference.
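If memory does become a problem, one option is to skip the `embedding` column whenever only the sentence text and match metadata are needed. This is a minimal sketch, assuming the column names that appear in the tables further down; `load_without_embeddings` is a hypothetical helper and not part of the original scripts.

```python
import pandas as pd

def load_without_embeddings(path):
    """Read a sentences CSV but skip the large `embedding` column."""
    return pd.read_csv(
        path,
        index_col=['doc_index', 'sent_num'],
        # a callable keeps every column except the one we want to drop
        usecols=lambda col: col != 'embedding',
    )

# hypothetical usage, assuming the file sits next to the notebook:
# sents_trimmed = load_without_embeddings('EMB_sentences_1.csv')
```

The trade-off is that the embeddings then have to be reloaded separately if any similarity needs to be recalculated.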

All documents are Early Access editions of *Series 1: Speeches* of the Julian Bond Papers and can be viewed on the [Julian Bond Papers Project](https://bondpapersproject.org/) website.

In [2]:
# import packages

import pandas as pd
import numpy as np

import re
import itertools

import os

import time

Note: loading in the data correctly requires noting the MultiIndex of the `sentences` tables.

In [16]:
# out of curiosity, we're going to time this
start = time.time()

documents = pd.read_csv('EMB_documents.csv', index_col='index')

sents_raw = pd.read_csv('EMB_sentences_0.csv', index_col = ['doc_index', 'sent_num'])

sents_trimmed = pd.read_csv('EMB_sentences_1.csv', index_col = ['doc_index', 'sent_num'])

match_reference = pd.read_csv('EMB_sentences_2.csv', index_col = ['doc_index', 'sent_num'])

end = time.time()

print(f'Runtime: {round(end-start, 3)} seconds.')

Runtime: 6.696 seconds.


Well, it appears that reading is a lot faster than writing. Makes sense.

In [20]:
print(len(documents))
documents.head()

357


| index | ID | Title | Document Body | Year |
| --- | --- | --- | --- | --- |
| 0 | 670 | Undated Speech concerning Conditions of Black ... | [This speech includes pages with many differen... | 1969.0 |
| 1 | 667 | Speeches making observations about the recent ... | Now that the nation's voters — at least, 54% o... | 1972.0 |
| 2 | 666 | Speeches making observations about the recent ... | Now that the nation's voters — at least, 54% o... | 1972.0 |
| 3 | 665 | Speeches making observations about the recent ... | 1\nNow that the nations voters — at least, 54%... | 1972.0 |
| 4 | 663 | Speech about the upcoming presidential electio... | The election approaching on November seventh i... | 1972.0 |


In [22]:
print(len(sents_raw))
sents_raw.head()

45850


| doc_index | sent_num | sent_str | embedding | embeddings_id | matches_counter | matches_indices |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | [This speech includes pages with many differen... | [-6.23956360e-02 1.35172205e-02 4.58183549e-... | 0 | 0 | [] |
| 0 | 1 | We need to discover who is, and who isn't viol... | [-2.08952613e-02 -8.53944570e-03 2.95615457e-... | 1 | 5 | [2569, 2691, 3562, 11712, 40059] |
| 0 | 2 | Violence is black children going to school for... | [ 1.94369000e-03 -6.33634580e-03 3.50463949e-... | 2 | 6 | [2570, 2692, 3563, 10633, 11713, 40060] |
| 0 | 3 | Violence is 30 million hungry stomachs in the ... | [ 3.12327943e-03 -1.15781054e-02 4.14918065e-... | 3 | 6 | [2571, 2697, 3568, 10634, 11714, 40061] |
| 0 | 4 | Violence is having black people represent a di... | [-1.76533626e-03 2.68031936e-02 1.42182084e-... | 4 | 2 | [2572, 11715] |


I already fear that the embeddings have been messed up by the write/read process: after the round trip through CSV, the `embedding` column comes back as strings rather than arrays. I will look into whether I can convert them back into float32 tensors easily. Since we're separating out these two processes, though, this isn't something I'm terribly concerned about in this notebook.
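For reference, here is one way the conversion back to float32 might look. It is only a sketch: it assumes each cell holds the printed form of a one-dimensional NumPy array (space-separated numbers inside square brackets, possibly wrapped over several lines) and that the printout was not abbreviated with `...`, which NumPy does by default for arrays of more than 1000 elements. `parse_embedding` is a hypothetical helper, and the numbers in the example are made up.

```python
import numpy as np

def parse_embedding(text):
    """Turn a string like '[ 1.0e-02 -2.0e-03]' back into a float32 vector."""
    # split() with no arguments also handles the newlines NumPy inserts
    values = text.strip().strip('[]').split()
    return np.array([float(v) for v in values], dtype=np.float32)

# made-up values, only to show the round trip
example = parse_embedding('[ 0.5 -0.25\n  1.0e-02]')
print(example.dtype, example.shape)
```

Applied to the real data, this would be something like `sents_raw['embedding'].apply(parse_embedding)`, followed by `np.stack` if a single matrix is needed. The values are only recovered to the precision that was printed, so they may not be bit-identical to the original tensors.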

In [26]:
print(len(sents_trimmed))
sents_trimmed.head()

44761


| doc_index | sent_num | sent_str | embedding | embeddings_id | matches_counter | matches_indices |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | We need to discover who is, and who isn't viol... | [-2.08952613e-02 -8.53944570e-03 2.95615457e-... | 1 | 5 | [2569, 2691, 3562, 11712, 40059] |
| 0 | 2 | Violence is black children going to school for... | [ 1.94369000e-03 -6.33634580e-03 3.50463949e-... | 2 | 6 | [2570, 2692, 3563, 10633, 11713, 40060] |
| 0 | 3 | Violence is 30 million hungry stomachs in the ... | [ 3.12327943e-03 -1.15781054e-02 4.14918065e-... | 3 | 6 | [2571, 2697, 3568, 10634, 11714, 40061] |
| 0 | 4 | Violence is having black people represent a di... | [-1.76533626e-03 2.68031936e-02 1.42182084e-... | 4 | 2 | [2572, 11715] |
| 0 | 5 | Violence is a country where properrty counts m... | [-1.40759943e-03 -1.43564390e-02 4.52807471e-... | 5 | 6 | [2573, 2698, 3569, 10635, 11716, 40062] |


In [28]:
print(len(match_reference))
match_reference.head()

8978


| doc_index | sent_num | sent_str | embeddings_id | matches_counter | matches_indices |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | We need to discover who is, and who isn't viol... | 1 | 5 | [2569, 2691, 3562, 11712, 40059] |
| 0 | 2 | Violence is black children going to school for... | 2 | 6 | [2570, 2692, 3563, 10633, 11713, 40060] |
| 0 | 3 | Violence is 30 million hungry stomachs in the ... | 3 | 6 | [2571, 2697, 3568, 10634, 11714, 40061] |
| 0 | 4 | Violence is having black people represent a di... | 4 | 2 | [2572, 11715] |
| 0 | 5 | Violence is a country where properrty counts m... | 5 | 6 | [2573, 2698, 3569, 10635, 11716, 40062] |


Everything appears to have worked correctly here in terms of reading in the data.
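One thing the `head()` output cannot show is that list-valued columns such as `matches_indices` come back from a CSV as strings, not Python lists. A small sketch of how they could be parsed and sanity-checked against `matches_counter`, using a toy frame that copies two rows from the `match_reference` output above; `parse_index_list` is a hypothetical helper.

```python
import ast
import pandas as pd

def parse_index_list(text):
    """Convert a string such as '[2572, 11715]' into a list of ints."""
    return ast.literal_eval(text)

# toy frame mirroring two rows of match_reference
toy = pd.DataFrame({
    'matches_counter': [5, 2],
    'matches_indices': ['[2569, 2691, 3562, 11712, 40059]', '[2572, 11715]'],
})
toy['matches_list'] = toy['matches_indices'].apply(parse_index_list)

# each counter should equal the length of its parsed list
counts_agree = (toy['matches_list'].apply(len) == toy['matches_counter']).all()
print(counts_agree)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.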

### Analysis using the match_reference table

In [38]:
match_reference.sort_values(by='matches_counter', ascending=False).head(20)

| doc_index | sent_num | sent_str | embeddings_id | matches_counter | matches_indices |
| --- | --- | --- | --- | --- | --- |
| 21 | 85 | I believe that armies, and navies are at the b... | 2631 | 34 | [3111, 11748, 12157, 12380, 13501, 13586, 1366... |
| 21 | 82 | I believe that all men, black and brown and wh... | 2628 | 33 | [11746, 12154, 12377, 13499, 13584, 13666, 142... |
| 81 | 55 | "I believe in Liberty for all men; the space t... | 11749 | 30 | [12158, 13502, 13587, 14264, 16307, 18414, 196... |
| 38 | 202 | "He forgets that the clouds also bring life an... | 4507 | 27 | [4916, 5380, 5477, 5813, 6431, 6684, 6706, 769... |
| 84 | 112 | Finally, I believe in patience - patience with... | 12159 | 27 | [12384, 14266, 16309, 18416, 19615, 20467, 214... |
| 38 | 201 | "In every cloud he beholds a destructive storm... | 4506 | 27 | [4915, 5379, 5476, 5812, 6430, 6683, 6705, 769... |
| 164 | 163 | We live in years, swift, flying, transient years. | 25521 | 26 | [25686, 25781, 26449, 27614, 28545, 28639, 288... |
| 164 | 161 | It may be a great physical segregation of the ... | 25519 | 26 | [25684, 25780, 26448, 27612, 28543, 28637, 288... |
| 164 | 164 | We hold the possible future in our hands, not ... | 25522 | 26 | [25687, 25782, 26450, 27615, 28546, 28640, 288... |
| 84 | 109 | I believe in the Prince of Peace. | 12156 | 26 | [12379, 14262, 16305, 18412, 19611, 20463, 214... |


In [46]:
top_20 = match_reference.sort_values(by='matches_counter', ascending=False).head(20)
top_20.iloc[:,0].to_list()

['I believe that armies, and navies are at the bottom the tinsel and braggadoa?cio of oppression and wrong; and I believe that the wicked conquest of weaker and darker nations by nations white and stronger but foreshadows the death of that stength.\n"',
 'I believe that all men, black and brown and white\xa0,\xa0are brothers, varying, through tTime and oOpportunity, in form and gift and feature, but differing in no essential particular, and alike in soul and in the possibility of infinite development.',
 '"I believe in Liberty for all men; the space to stretch their arms and their souls; the right to breathe and the right to vote, the freedom to choose their friends, enjoy the sunshine...uncursed by color; thinking, dreaming, working as they will in a Kingdom of God and love..."\n####',
 '"He forgets that the clouds also bring life and hope, that lightning purifies the atmosphere, that shadow and darkness prepare for sunshine and growth, and that hardships and adversity nerve the race,

In [48]:
documents.iloc[164,:]

ID                                                             344
Title            Speech concerning the Reagan Administration an...
Document Body    As I begin, let me state for the record that a...
Year                                                        1983.0
Name: 164, dtype: object

Things that instantly stand out:
1. Parts of the Du Bois quote ("Yes, plain, blunt complaint..." and "To press the matter of stopping the curtailment of our political rights...") that Bond loves to use.
2. The "I believe" refrain, which is very rhetorically powerful and used in many speeches ("I believe in the Prince of Peace", "I believe that armies and navies are at bottom...", "I believe in pride of race and lineage itself...", "I believe that all men... are brothers...").
3. The quote from Horace Mann Bond ("The pessimist from his corner looks out on the world of wickedness...", "In every cloud he beholds a destructive storm...", "He forgets that the clouds also bring life and hope...").
4. Doc Index 164 (PJB ID: 344) sentences 161-164 are all frequently reused (all in the top 25). I am unfamiliar with this particular paragraph, though.
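To follow up on a passage like the one in document 164, it helps to pull the sentences it matches. The sketch below assumes that each entry in `matches_indices` is an `embeddings_id` value, which the tables suggest but which I have not confirmed against the schema script. `matched_sentences` is a hypothetical helper, and the toy data is made up purely to show the shape of the lookup.

```python
import ast
import pandas as pd

def matched_sentences(sentences, doc_index, sent_num):
    """Return the rows of `sentences` listed as matches of one sentence."""
    raw = sentences.loc[(doc_index, sent_num), 'matches_indices']
    # the column holds strings after a CSV round trip, lists otherwise
    ids = ast.literal_eval(raw) if isinstance(raw, str) else list(raw)
    # assumption: each entry in matches_indices is an embeddings_id value
    return sentences[sentences['embeddings_id'].isin(ids)]

# toy data with made-up sentences
toy = pd.DataFrame(
    {
        'sent_str': ['first', 'second', 'third'],
        'embeddings_id': [10, 11, 12],
        'matches_indices': ['[12]', '[]', '[10]'],
    },
    index=pd.MultiIndex.from_tuples(
        [(0, 0), (0, 1), (1, 0)], names=['doc_index', 'sent_num']
    ),
)
print(matched_sentences(toy, 0, 0))
```

With the real data, a call such as `matched_sentences(sents_trimmed, 164, 161)` should return the other places that sentence turns up, and the `doc_index` level of the result can then be looked up in `documents` to get titles and years.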

The next step is finding a visualization that helps surface these full reused paragraphs. After that, I need to think about the best way to build a timeline.
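Before settling on a visualization, a tabular first pass could group frequently matched sentences into runs of adjacent `sent_num` values within a document, since a reused paragraph should show up as several neighbouring sentences that all have matches. This is a rough sketch and not a finished method: `consecutive_runs` and its `min_matches` and `min_length` thresholds are hypothetical, and the toy frame reuses five (doc_index, sent_num, matches_counter) rows from the top-matches table above.

```python
import pandas as pd

def consecutive_runs(frame, min_matches=1, min_length=2):
    """Group frequently matched sentences into runs of adjacent sent_num values."""
    hits = frame[frame['matches_counter'] >= min_matches].reset_index()
    hits = hits.sort_values(['doc_index', 'sent_num'])
    # a new run starts when the document changes or a sentence number is skipped
    new_run = (hits['doc_index'] != hits['doc_index'].shift()) | (
        hits['sent_num'] != hits['sent_num'].shift() + 1
    )
    hits['run_id'] = new_run.cumsum()
    runs = hits.groupby('run_id').agg(
        doc_index=('doc_index', 'first'),
        first_sent=('sent_num', 'min'),
        last_sent=('sent_num', 'max'),
        length=('sent_num', 'count'),
        mean_matches=('matches_counter', 'mean'),
    )
    return runs[runs['length'] >= min_length].reset_index(drop=True)

# five rows taken from the top-matches table
toy = pd.DataFrame(
    {'matches_counter': [33, 34, 26, 26, 26]},
    index=pd.MultiIndex.from_tuples(
        [(21, 82), (21, 85), (164, 161), (164, 163), (164, 164)],
        names=['doc_index', 'sent_num'],
    ),
)
runs = consecutive_runs(toy, min_matches=20)
print(runs)
```

On the toy rows, only sentences 163 and 164 of document 164 are adjacent, so they form the single run that is returned. On the full `match_reference` table, sorting the result by `length` or `mean_matches` would be one way to rank candidate paragraphs.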

In [53]:
# to do tomorrow, perhaps. it's a snow day today.