## Exploring the Embedding and Similarity Data

In an effort to be more Pythonic, I am going to separate the schema setup from the data analysis part of this project. I know this isn't exactly elite in terms of best coding practices, but I'm at least making an effort :).

I do think I may return to the old script, or at least create something similarly structured, to work in the realm of paraphrase - but I have not yet decided on that.

A note on file sizes: `EMB_sentences_0` and `EMB_sentences_1` are quite large (slightly under 300 MB). I wanted to capture the embeddings somehow, because the most time-consuming part of the process is calculating those sentence embeddings, but files of this size may present a memory problem. The `EMB_documents` and `EMB_sentences_2` files are much smaller, but require, at a minimum, `EMB_sentences_1` for reference.

All documents are Early Access editions of *Series 1: Speeches* of the Julian Bond Papers and can be viewed on the [Julian Bond Papers Project](https://bondpapersproject.org/) website.

In [1]:
# import packages

import pandas as pd
import numpy as np

import re
import itertools

import os

import time

Note: loading in the data correctly requires noting the MultiIndex of the `sentences` tables.

In [3]:
# out of curiosity, we're going to time this
start = time.time()

documents = pd.read_csv('EMB_documents.csv', index_col='index')

match_reference = pd.read_csv('EMB_sentences_2.csv', index_col = ['doc_index', 'sent_num'])

end = time.time()

print(f'Runtime: {round(end-start, 3)} seconds.')

Runtime: 0.251 seconds.
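
One thing to watch with the CSV route: list-like columns such as `matches_indices` come back from `read_csv` as plain strings. Below is a minimal sketch, on a made-up two-row CSV rather than the real file, of restoring the MultiIndex and parsing that column back into lists with `ast.literal_eval`. The `toy_csv` contents and the `toy_reference` name are illustrative only.

```python
import ast
import io

import pandas as pd

# Toy stand-in for EMB_sentences_2.csv; the rows here are illustrative only.
toy_csv = io.StringIO(
    'doc_index,sent_num,sent_str,embeddings_id,matches_counter,matches_indices\n'
    '0,1,"We need to discover",1,2,"[2569, 2691]"\n'
    '0,2,"Violence is",2,1,"[2570]"\n'
)

toy_reference = pd.read_csv(
    toy_csv,
    index_col=['doc_index', 'sent_num'],               # rebuild the MultiIndex
    converters={'matches_indices': ast.literal_eval},  # "[1, 2]" -> [1, 2]
)

print(toy_reference.index.names)
print(toy_reference.loc[(0, 1), 'matches_indices'])
```

Without the `converters` argument, each cell in `matches_indices` would be a string like `"[2569, 2691]"`, which matters for any later lookup by matched id.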


In [5]:
# pickling is much faster and preserves the tensors better for later operations

import pickle

# use context managers so the file handles are closed after loading
with open('EMB_sentences_0.pickle', 'rb') as f:
    sents_raw = pickle.load(f)
with open('EMB_sentences_1.pickle', 'rb') as f:
    sents_trimmed = pickle.load(f)

In [7]:
sents_raw.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,sent_str,embedding,embeddings_id,matches_counter,matches_indices
doc_index,sent_num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0,[This speech includes pages with many differen...,"[-0.062395636, 0.0135172205, 0.045818355, -0.0...",0,0,[]
0,1,"We need to discover who is, and who isn't viol...","[-0.020895261, -0.008539446, 0.029561546, -0.0...",1,5,"[2569, 2691, 3562, 11712, 40059]"
0,2,Violence is black children going to school for...,"[0.00194369, -0.006336346, 0.035046395, -0.005...",2,6,"[2570, 2692, 3563, 10633, 11713, 40060]"
0,3,Violence is 30 million hungry stomachs in the ...,"[0.0031232794, -0.011578105, 0.041491807, -0.0...",3,6,"[2571, 2697, 3568, 10634, 11714, 40061]"
0,4,Violence is having black people represent a di...,"[-0.0017653363, 0.026803194, 0.014218208, -0.0...",4,2,"[2572, 11715]"


In [9]:
%%time
import torch
torch.tensor(sents_raw['embedding'][2][0])

CPU times: total: 2.33 s
Wall time: 3.05 s


tensor([-4.4908e-02,  2.3612e-02,  1.3178e-02, -6.0455e-02,  8.8216e-03,
         6.7232e-02,  8.3950e-02,  5.3449e-02,  2.2854e-02,  1.9364e-02,
         7.0205e-03,  1.3355e-02, -3.4186e-02,  3.5040e-02,  2.5868e-02,
         2.5888e-02, -3.3355e-02,  1.8082e-02, -7.4325e-02,  3.7689e-02,
         8.9366e-03, -3.9614e-02, -1.1368e-04, -5.4953e-02,  5.4183e-02,
         3.7754e-02,  8.4528e-03, -2.6866e-02, -6.3302e-02, -1.9547e-01,
         3.0149e-02, -5.5925e-02,  5.6353e-02, -4.3717e-02, -3.5180e-02,
        -2.9530e-02, -2.9026e-02,  4.5263e-02, -1.8706e-02,  5.1937e-02,
         2.7005e-03,  1.2154e-02, -5.5534e-02, -2.4826e-02, -5.6881e-02,
        -4.9746e-02, -1.7618e-02,  5.1076e-03,  7.8162e-04, -3.3806e-02,
         1.7438e-03, -3.7259e-02,  1.1834e-02,  1.5286e-03,  4.5147e-02,
         6.8336e-02,  4.0694e-02,  5.0808e-02,  4.3747e-02,  1.7325e-02,
         1.5956e-02,  4.9982e-02, -2.2688e-01,  7.8995e-02,  3.7270e-03,
         5.9969e-02, -5.4758e-02, -1.4392e-02, -4.9

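Conversion to tensor is actually pretty fast. Interesting. (Note that the `%%time` figure above also covers `import torch`, which likely accounts for most of the roughly 3 seconds.)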

In [11]:
print(len(documents))
documents.head()

357


Unnamed: 0_level_0,ID,Title,Document Body,Year
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,670,Undated Speech concerning Conditions of Black ...,[This speech includes pages with many differen...,1969.0
1,667,Speeches making observations about the recent ...,"Now that the nation's voters — at least, 54% o...",1972.0
2,666,Speeches making observations about the recent ...,"Now that the nation's voters — at least, 54% o...",1972.0
3,665,Speeches making observations about the recent ...,"1\nNow that the nations voters — at least, 54%...",1972.0
4,663,Speech about the upcoming presidential electio...,The election approaching on November seventh i...,1972.0


In [13]:
print(len(sents_raw))
sents_raw.head()

45850


Unnamed: 0_level_0,Unnamed: 1_level_0,sent_str,embedding,embeddings_id,matches_counter,matches_indices
doc_index,sent_num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0,[This speech includes pages with many differen...,"[-0.062395636, 0.0135172205, 0.045818355, -0.0...",0,0,[]
0,1,"We need to discover who is, and who isn't viol...","[-0.020895261, -0.008539446, 0.029561546, -0.0...",1,5,"[2569, 2691, 3562, 11712, 40059]"
0,2,Violence is black children going to school for...,"[0.00194369, -0.006336346, 0.035046395, -0.005...",2,6,"[2570, 2692, 3563, 10633, 11713, 40060]"
0,3,Violence is 30 million hungry stomachs in the ...,"[0.0031232794, -0.011578105, 0.041491807, -0.0...",3,6,"[2571, 2697, 3568, 10634, 11714, 40061]"
0,4,Violence is having black people represent a di...,"[-0.0017653363, 0.026803194, 0.014218208, -0.0...",4,2,"[2572, 11715]"


In [15]:
sents_raw['sent_str'][0][0]

'[This speech includes pages with many different typesets.]'

**fixed 3/9 - kind of**

I already fear that the embeddings have been messed up by the write/read process. Will look into whether I can convert them back into float32 tensors easily. But since we're separating out these two processes, this isn't something I'm terribly concerned about in this notebook.

After 3/9 - pickling the dataframes makes my life easier (I don't have a particular interest in exporting this whole thing, and it's not like it's 3NF, so I can't do SQL stuff with it). However, the embeddings are saved as numpy arrays, which I always knew was going to be a problem. But they are saved, and in a more workable form.
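
For later similarity work, a column of per-row numpy arrays can be stacked into one float32 matrix. This is a minimal sketch on a toy frame, assuming every embedding has the same length. The 4-dimensional vectors and the `toy_sents` and `emb_matrix` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sents_raw: each cell of 'embedding' holds a 1-D numpy array.
toy_sents = pd.DataFrame({
    'sent_str': ['first sentence', 'second sentence', 'third sentence'],
    'embedding': [
        np.array([0.1, 0.2, 0.3, 0.4]),
        np.array([0.0, 0.5, 0.5, 0.0]),
        np.array([0.9, 0.1, 0.0, 0.0]),
    ],
})

# Stack the per-row arrays into a single (n_sentences, dim) float32 matrix.
emb_matrix = np.stack(toy_sents['embedding'].to_list()).astype(np.float32)

print(emb_matrix.shape, emb_matrix.dtype)
print(f'{emb_matrix.nbytes} bytes in memory')
```

If a tensor is needed afterwards, `torch.from_numpy(emb_matrix)` wraps the same memory without copying it.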

In [53]:
print(len(sents_trimmed))
sents_trimmed.head()

45850


Unnamed: 0_level_0,Unnamed: 1_level_0,sent_str,embedding,embeddings_id,matches_counter,matches_indices
doc_index,sent_num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,0,[This speech includes pages with many differen...,"[-0.062395636, 0.0135172205, 0.045818355, -0.0...",0,0,[]
0,1,"We need to discover who is, and who isn't viol...","[-0.020895261, -0.008539446, 0.029561546, -0.0...",1,5,"[2569, 2691, 3562, 11712, 40059]"
0,2,Violence is black children going to school for...,"[0.00194369, -0.006336346, 0.035046395, -0.005...",2,6,"[2570, 2692, 3563, 10633, 11713, 40060]"
0,3,Violence is 30 million hungry stomachs in the ...,"[0.0031232794, -0.011578105, 0.041491807, -0.0...",3,6,"[2571, 2697, 3568, 10634, 11714, 40061]"
0,4,Violence is having black people represent a di...,"[-0.0017653363, 0.026803194, 0.014218208, -0.0...",4,2,"[2572, 11715]"


In [11]:
print(len(match_reference))
match_reference.head()

8978


Unnamed: 0_level_0,Unnamed: 1_level_0,sent_str,embeddings_id,matches_counter,matches_indices
doc_index,sent_num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,1,"We need to discover who is, and who isn't viol...",1,5,"[2569, 2691, 3562, 11712, 40059]"
0,2,Violence is black children going to school for...,2,6,"[2570, 2692, 3563, 10633, 11713, 40060]"
0,3,Violence is 30 million hungry stomachs in the ...,3,6,"[2571, 2697, 3568, 10634, 11714, 40061]"
0,4,Violence is having black people represent a di...,4,2,"[2572, 11715]"
0,5,Violence is a country where properrty counts m...,5,6,"[2573, 2698, 3569, 10635, 11716, 40062]"


Everything appears to have worked correctly here in terms of reading in the data.

### Analysis using the match_reference table

In [17]:
match_reference.sort_values(by='matches_counter', ascending=False).head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,sent_str,embeddings_id,matches_counter,matches_indices
doc_index,sent_num,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
21,85,"I believe that armies, and navies are at the b...",2631,34,"[3111, 11748, 12157, 12380, 13501, 13586, 1366..."
21,82,"I believe that all men, black and brown and wh...",2628,33,"[11746, 12154, 12377, 13499, 13584, 13666, 142..."
81,55,"""I believe in Liberty for all men; the space t...",11749,30,"[12158, 13502, 13587, 14264, 16307, 18414, 196..."
38,202,"""He forgets that the clouds also bring life an...",4507,27,"[4916, 5380, 5477, 5813, 6431, 6684, 6706, 769..."
84,112,"Finally, I believe in patience - patience with...",12159,27,"[12384, 14266, 16309, 18416, 19615, 20467, 214..."
38,201,"""In every cloud he beholds a destructive storm...",4506,27,"[4915, 5379, 5476, 5812, 6430, 6683, 6705, 769..."
164,163,"We live in years, swift, flying, transient years.",25521,26,"[25686, 25781, 26449, 27614, 28545, 28639, 288..."
164,161,It may be a great physical segregation of the ...,25519,26,"[25684, 25780, 26448, 27612, 28543, 28637, 288..."
164,164,"We hold the possible future in our hands, not ...",25522,26,"[25687, 25782, 26450, 27615, 28546, 28640, 288..."
84,109,I believe in the Prince of Peace.,12156,26,"[12379, 14262, 16305, 18412, 19611, 20463, 214..."


In [46]:
top_20 = match_reference.sort_values(by='matches_counter', ascending=False).head(20)
top_20.iloc[:,0].to_list()

['I believe that armies, and navies are at the bottom the tinsel and braggadoa?cio of oppression and wrong; and I believe that the wicked conquest of weaker and darker nations by nations white and stronger but foreshadows the death of that stength.\n"',
 'I believe that all men, black and brown and white\xa0,\xa0are brothers, varying, through tTime and oOpportunity, in form and gift and feature, but differing in no essential particular, and alike in soul and in the possibility of infinite development.',
 '"I believe in Liberty for all men; the space to stretch their arms and their souls; the right to breathe and the right to vote, the freedom to choose their friends, enjoy the sunshine...uncursed by color; thinking, dreaming, working as they will in a Kingdom of God and love..."\n####',
 '"He forgets that the clouds also bring life and hope, that lightning purifies the atmosphere, that shadow and darkness prepare for sunshine and growth, and that hardships and adversity nerve the race,

In [48]:
documents.iloc[164,:]

ID                                                             344
Title            Speech concerning the Reagan Administration an...
Document Body    As I begin, let me state for the record that a...
Year                                                        1983.0
Name: 164, dtype: object

Things that instantly stand out:
1. Parts of the Du Bois quote ("Yes, plain, blunt complaint..." and "To press the matter of stopping the curtailment of our political rights...") that Bond loves to use
2. The "I believe" refrain, which is very rhetorically powerful and used in many speeches ("I believe in the Prince of Peace", "I believe that armies and navies are at bottom...", "I believe in pride of race and lineage itself...", "I believe that all men... are brothers...")
3. The quote from Horace Mann Bond ("The pessimist from his corner looks out on the world of wickedness...", "In every cloud he beholds a destructive storm...", "He forgets that the clouds also bring life and hope...")
4. Doc Index 164 (PJB ID: 344): sentences 161-164 are all frequently reused (all in the top 25). I am unfamiliar with this particular paragraph, though.

The next step is finding a visualization that helps to find these full paragraphs. And then I need to think about the best ways to create a timeline and such.
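
As a starting point for the timeline, one option is to map each matched sentence back to its document and year. This is a rough sketch on toy tables, assuming the values in `matches_indices` are `embeddings_id` values and that `doc_index` lines up with the index of `documents`. Both assumptions are my reading of the tables above, not something verified here, and `match_years` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-ins for the real tables; the values are made up.
toy_documents = pd.DataFrame(
    {'Title': ['Speech A', 'Speech B', 'Speech C'], 'Year': [1969.0, 1972.0, 1983.0]}
)
toy_sents = pd.DataFrame({
    'doc_index': [0, 0, 1, 2],
    'sent_num': [0, 1, 0, 0],
    'embeddings_id': [0, 1, 2, 3],
    'matches_indices': [[2, 3], [], [0, 3], [0, 2]],
}).set_index(['doc_index', 'sent_num'])


def match_years(sents, documents, embeddings_id):
    """Return the sorted years of the documents that reuse a given sentence."""
    # Look up doc_index by embeddings_id instead of by (doc_index, sent_num).
    id_to_doc = sents.reset_index().set_index('embeddings_id')['doc_index']
    # Assumed: matches_indices holds a list of embeddings_id values.
    matched_ids = sents.loc[sents['embeddings_id'] == embeddings_id, 'matches_indices'].iloc[0]
    matched_docs = id_to_doc.loc[matched_ids].tolist()
    return sorted(documents.loc[matched_docs, 'Year'].tolist())


print(match_years(toy_sents, toy_documents, 0))
```

The resulting list of years per sentence could then feed whatever plot ends up working best for the timeline.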

In [17]:
# TO DO:

# go back to the other script and convert embeddings to strings before saving them so they don't get messed up
# email Lucian

In [105]:
test = match_reference.reset_index(drop=True)
test = test[test['matches_counter'] >= 10]
test

Unnamed: 0,sent_str,embeddings_id,matches_counter,matches_indices
31,4\nHe will continue to set the budget and name...,96,11,"[279, 398, 521, 627, 738, 42801, 42881, 43330,..."
36,Our ideal is a country where every American ge...,101,10,"[284, 410, 532, 638, 754, 1228, 12826, 12926, ..."
37,"""Our reality needs no full recital here.",102,10,"[285, 412, 533, 639, 755, 1229, 12827, 12927, ..."
39,"In sum, we know that our society is not functi...",104,10,"[287, 415, 535, 641, 757, 1231, 12829, 12929, ..."
40,"""But if we solve the greatest of our ills,"" th...",105,10,"[288, 417, 536, 642, 758, 1232, 12830, 12930, ..."
...,...,...,...,...
7527,"Lower level administrators--the mayors, counci...",33716,11,"[34050, 34177, 39229, 39337, 39961, 40718, 410..."
7528,"For them, the New Federalism promised to be ma...",33717,12,"[34051, 34178, 39230, 39338, 39962, 40719, 410..."
7529,"It promised new money for all, money to pave t...",33718,14,"[34052, 34179, 39231, 39339, 39963, 40720, 410..."
7530,The result was to shift burden and responsibil...,33720,14,"[34054, 34181, 39233, 39341, 39965, 40722, 410..."


In [107]:
groups = list(test['embeddings_id'])

# Collect runs of consecutive embeddings_id values.
# NOTE: an id is appended only when the NEXT id is consecutive, so the last
# id of each run is left out (e.g. 25522 never joins the 25519-25521 run).
group_counter = 0
groups_dict = {}
for i in range(len(groups) - 1):
    if groups[i + 1] - groups[i] == 1:
        groups_dict.setdefault(group_counter, []).append(groups[i])
    else:
        group_counter += 1
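
Because the loop above only appends an id when the next id is consecutive, each run comes out one id short. Here is a hedged alternative that keeps the final id, using the `itertools.groupby` index-difference trick and assuming the ids are sorted and unique. `consecutive_runs` and `example_ids` are illustrative names, not part of the original script.

```python
from itertools import groupby


def consecutive_runs(ids, min_len=3):
    """Return runs of consecutive integers (length >= min_len) from a sorted, unique list."""
    runs = []
    # Within a run of consecutive ids, (value - position) stays constant.
    for _, grp in groupby(enumerate(ids), key=lambda pair: pair[1] - pair[0]):
        run = [value for _, value in grp]
        if len(run) >= min_len:
            runs.append(run)
    return runs


# Made-up ids: only 131-134 form a run of at least three.
example_ids = [96, 101, 102, 104, 105, 131, 132, 133, 134, 200]
print(consecutive_runs(example_ids))
```

Note that switching to this version would change which groups pass the `len(v) >= 3` filter, so the counts shown below would not carry over unchanged.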

In [109]:
mod_groups = {k:v for k, v in groups_dict.items() if len(v)>= 3}
len(mod_groups)

14

In [111]:
max(mod_groups, key=lambda k: len(mod_groups[k]))

129

In [113]:
mod_groups

{6: [131, 132, 133],
 12: [444, 445, 446],
 15: [1175, 1176, 1177],
 16: [1186, 1187, 1188],
 17: [1191, 1192, 1193, 1194, 1195, 1196, 1197, 1198],
 26: [4489, 4490, 4491, 4492, 4493],
 27: [4505, 4506, 4507, 4508, 4509, 4510],
 46: [7090, 7091, 7092],
 66: [10860, 10861, 10862],
 67: [10872, 10873, 10874],
 85: [14258, 14259, 14260],
 114: [25519, 25520, 25521],
 129: [27600,
  27601,
  27602,
  27603,
  27604,
  27605,
  27606,
  27607,
  27608,
  27609,
  27610],
 140: [29074, 29075, 29076]}

In [117]:
test = test.set_index('embeddings_id')

In [119]:
repeated_paragraphs = []
for ids in mod_groups.values():
    # join the sentences of each run into one paragraph
    para = ' '.join(test.loc[x, 'sent_str'] for x in ids)
    repeated_paragraphs.append(para)

In [131]:
for i, para in enumerate(repeated_paragraphs, start=1):
    print(f'Quote {i}', '\n', para, '\n')

Quote 1 
 The nation could adopt, and strive for, a policy of full employment. Equal opportunity, both racially and sexually, can stop being the rhetoric of campaigns and platforms and become the reality of the present. Through public service employment, increased economic growth, increases in wage minimums and in guaranteeing social insurance by radically altering public assistance, every American can be guaranteed an income. 

Quote 2 
 We will select a new Congress in 1972 as well. These for the most part must be new men and women, not the tired old faces of the past. It should be a Congress that would reject Nixon's family destruction plan, that would say "no" to more war, "no" to freezes on wages with no freezes on profits, "no" to secret government, and "no" to preventive detention and no-knock justice. 

Quote 3 
 It must be someone who will put teeth into the demands for a decent life. Since the present President took office, we have spent billions
15 7
more on war, over two mi

In [139]:
with open('bond_repeated_phrases.txt', 'w', encoding='utf-8') as file:
    for i in range(len(repeated_paragraphs)):
        file.write(f'Quote {i+1}' + '\n\n' + repeated_paragraphs[i] + '\n\n')