# Embeddings Comparison

See which articles each model ranks as most similar to a given article, based on the embeddings from:
- Longformer
- MiniLM-L6-v2


Whichever model performs better according to:
1. Relevance of semantic similarity
2. Speed of doing comparisons
3. Breadth in the distribution of the embeddings (scores should show that things are similar AND different)

will go on SageMaker, or at least will be used in Alexandria.

Answers:
1. Longformer's returned articles are MUCH better on anecdotal semantic similarity (what gets returned is actually reasonably relevant)
  - If specificity isn't what we are after, both models do reasonably well on general similarity (the Wiki pages for Unsupervised/Supervised learning each return their counterpart)
2. For computation speed, MiniLM is much faster at computing embeddings for the same amount of text (the timing omits the averaging operation, but that doesn't add much)
3. Distribution of similarities
   - Longformer's computed similarities are all very high; roughly all of them are > 0.95
   - MiniLM's similarities have a wider distribution (roughly -0.1 to 0.50), but nothing is flagged as being "very similar"
   - What seems to be on display here is a tradeoff between bias and variance: Longformer is biased towards everything being similar but is still accurate once we account for that bias (i.e. take the few highest values), while MiniLM has high variance, with a larger breadth of values, but is not accurate in the way we want
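
To make the distribution point concrete, here is a minimal sketch of how the spread of pairwise cosine similarities could be summarised. It is dependency-free rather than using the notebook's tensors, and the helper names (`cosine`, `similarity_spread`) and the toy vectors are assumptions for illustration, not data from this experiment.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similarity_spread(embeddings):
    """Summarise the cosine similarities over every distinct pair of documents."""
    sims = [cosine(a, b) for a, b in combinations(embeddings, 2)]
    return {"min": min(sims), "max": max(sims), "mean": mean(sims), "std": pstdev(sims)}

# Toy vectors (made up): all point in nearly the same direction, so every
# similarity is high and the spread is small.
narrow = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.1], [1.0, 1.2, 1.0]]
# Toy vectors (made up): directions differ a lot, so the similarities spread out.
wide = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

print(similarity_spread(narrow))
print(similarity_spread(wide))
```

A narrow spread does not by itself make the ranking useless: if the ordering of the scores is still meaningful, taking the top few matches works even when every score is high.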

<BR>

## Results
- Longformer is much slower (larger model); generating embeddings for the full set of summaries took:
  - LF  = 67.8 seconds
  - MLM = 8.3 seconds (8.16x faster)

- Length of Embedding
  - LF  = 768 (2x the MiniLM)
  - MLM = 384

- Max Sequence Length: how many tokens each model can fit into a single pass (for generating embeddings)
  - LF  = 4096 (32x the MiniLM)
  - MLM = 128
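
For reference, a minimal sketch of how wall-clock timings like the ones above could be collected. The `timed` helper and the stand-in `workload` function are hypothetical; in the notebook the timed call would be the embedding request for each model.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def workload(n):
    """Stand-in for an embedding request (hypothetical): just burns some CPU."""
    return sum(i * i for i in range(n))

_, slow = timed(workload, 2_000_000)
_, fast = timed(workload, 200_000)
print(f"slow: {slow:.3f}s    fast: {fast:.3f}s    speedup: {slow / fast:.2f}x")
```

A single run is noisy, so repeating each measurement a few times and taking the median would give a more trustworthy ratio.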


Tested the similarity across the following documents:
1. https://thesephist.com/posts/focus/ -- index 56
2. https://arxiv.org/pdf/2002.08910.pdf -- index 54
3. https://networkencyclopedia.com/terminal/ -- index 46

- The models return different similar documents, and Longformer's results seem much more relevant than MiniLM's: its matches share the same specific, low-level topic rather than just sitting in the same general space. The article at index 54 showcases this well: Longformer returns relevant/similar NLP papers, whereas MiniLM flags the general Supervised and Unsupervised Learning Wiki pages.


<BR>
<BR>

In [98]:
# deletable, used to map the indices of these articles (run after the data is read in if needed; not otherwise required)
eval_articles = ["https://thesephist.com/posts/focus/",
"https://arxiv.org/pdf/2002.08910.pdf",
"https://networkencyclopedia.com/terminal/"]

# Iterating a dict yields its keys (the URLs) in insertion order
for i, c_url in enumerate(data):
    if c_url in eval_articles:
        print(c_url, i)

https://networkencyclopedia.com/terminal/ 46
https://arxiv.org/pdf/2002.08910.pdf 54
https://thesephist.com/posts/focus/ 56


In [5]:
# Load in Test Data
import json

# The context manager closes the file even if json.load raises
with open("test_data.json") as file:
    data = json.load(file)

# for link in data:
    # print(link)

len(data.keys()) #62 articles to work with

62

In [84]:
data['https://www.docker.com/resources/what-container/']["SUMMARY"]
data

{'https://www.docker.com/resources/what-container/': {'TIMESTAMP': 1663780513676,
  'TITLE': 'What is a Container? - Docker',
  'SUMMARY': ['- A container is a standard unit of software that packages up code and all its dependencies',
   '- Docker containers run on Docker Engine and are lightweight and secure',
   '- Docker is the industry standard for containers for development, deployment and deployment of software in the cloud',
   '- Docker container images are a lightweight, lightweight package of software to run in any environment'],
  'KEYPHRASES': ['Standardized Units',
   'containerized software',
   'containers',
   'Docker containers',
   'OS system kernel',
   'default isolation capabilities',
   'Containers',
   'Docker',
   'Docker container image',
   'Docker Engine']},
 'https://stackoverflow.com/questions/14096721/how-to-add-file-to-a-previous-commit': {'TIMESTAMP': 1663783594180,
  'TITLE': 'git - How to add file to a previous commit? - Stack Overflow',
  'SUMMARY': [

In [87]:
# Generate Embeddings with the Transformers API (batch style)
import requests


# Sample Text for Request (batch)
def format_payload(text):
    payload = json.dumps({
        # "input_text": text #for singular request
        "batch_text": text
    })
    return payload

headers = {
'Content-Type': 'application/json'
}

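# Format Text for Request Payload
def get_text(data, etype=None):
    # LF: one string per document
    if etype == "doc":
        return " ".join(data["SUMMARY"])
    # MLM: list of summary sentences
    elif etype == "sent":
        return data["SUMMARY"]
    # Fail loudly instead of silently returning None
    raise ValueError(f"etype must be 'doc' or 'sent', got {etype!r}")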

# Get Doc Text to Evaluate Sims
def get_doc(index):
    doc_url = list(data.keys())
    doc_url = doc_url[index]
    title = data[doc_url]["TITLE"]
    txt = data[doc_url]["SUMMARY"]
    return title, txt

In [None]:
# LongFormer
endpoint = '/e-lf'
url = "http://192.168.1.26:5003" + endpoint
lf_list = []

for link in data:
    txt = get_text(data[link], "doc")
    lf_list.append(txt)

payload = format_payload(lf_list)

# POST the Request
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)

lf_embeddings = response.json()


In [67]:
# MiniLM-L6-v2
endpoint = '/e-mlm'
url = "http://192.168.1.26:5003" + endpoint
mlm_embeddings = []

for link in data:
    txt = get_text(data[link], "sent") #getting each individual summary string here (diff len from Longformer) -- will avg
    payload = format_payload(txt)
    response = requests.request("POST", url, headers=headers, data=payload)
    mlm_embeddings.append(response.json()[0])

# mlm_embeddings = response.json() #get in above instead


# Calculate Average Embedding Across "sentence" dimension
import torch

for i in range(len(mlm_embeddings)):
    sents = mlm_embeddings[i] #sentence embeddings for one document
    tst = torch.tensor(sents)
    # print(tst.shape) #[sentence_len, 384]
    tst = torch.mean(tst, dim=0)
    # print(tst.shape) #[384] --> average embedding per document
    mlm_embeddings[i] = tst.tolist()
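
The averaging step above is mean pooling over the sentence dimension. As a minimal, dependency-free sketch of the same operation (the `mean_pool` name and the toy vectors are assumptions for illustration):

```python
def mean_pool(sentence_embeddings):
    """Average a list of equal-length sentence vectors into one document vector."""
    if not sentence_embeddings:
        raise ValueError("need at least one sentence embedding")
    n = len(sentence_embeddings)
    # zip(*rows) walks the vectors dimension by dimension.
    return [sum(dim) / n for dim in zip(*sentence_embeddings)]

# Toy 3-dimensional "sentence embeddings" (made up; the real ones have 384 dimensions).
doc = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(mean_pool(doc))  # [2.0, 3.0, 4.0]
```

Note that this weights every summary sentence equally, regardless of its length.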


In [None]:
# len(lf_embeddings[0][0]) #deprecated, popped out the 62 doc embeddings
len(mlm_embeddings[0])
lf_embeddings[0]
# lf_embeddings = lf_embeddings[0]

mlm_embeddings

In [152]:
# Investigate Similarities (based on positional embedding)

len(lf_embeddings)  #62 after popping out the first dimension (API returns extra list)
len(mlm_embeddings) #62 docs (average embedding across all sentence embeddings)

lf_data = torch.tensor(lf_embeddings)
mlm_data = torch.tensor(mlm_embeddings)
print("Matrix Shapes:", lf_data.shape, mlm_data.shape, sep="    ")


def top_k_articles(i, t, k):
    """Return the k articles most similar to article i as [index, cosine similarity] pairs."""
    article_to_match = t[i]
    sims = {} #similarity of every other article to article i

    for j in range(len(t)):
        # Same article, don't compute sim
        if i == j:
            continue
        # Compute similarity for all else
        sim = torch.nn.functional.cosine_similarity(article_to_match, t[j], dim=0)
        sims[j] = sim.item() #add actual float, not tensor

    ranked = sorted(sims.items(), key=lambda x: x[1], reverse=True) #sorted in descending order
    # Slicing returns every match when k exceeds the number of other articles
    return [[j, sim] for j, sim in ranked[:k]]



Matrix Shapes:    torch.Size([62, 768])    torch.Size([62, 384])
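
As a point of comparison, the same top-k lookup can be sketched without tensors using `heapq.nlargest`, which avoids sorting the full list when `k` is small. The `cosine` and `top_k` names and the toy vectors are assumptions for illustration, not part of the notebook.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(i, vectors, k):
    """Return up to k (index, similarity) pairs most similar to vectors[i], best first."""
    scored = ((j, cosine(vectors[i], v)) for j, v in enumerate(vectors) if j != i)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy 2-dimensional vectors (made up): index 1 is closest to index 0,
# index 2 is orthogonal to it, index 3 points the opposite way.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print([j for j, _ in top_k(0, vectors, 2)])  # [1, 2]
```

With only 62 documents the difference from a full sort is negligible; this matters more as the collection grows.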


In [190]:
top_k_articles(45, mlm_data, 3)
top_k_articles(46, lf_data, 3)
top_k_articles(46, mlm_data, 60)

[[38, 0.49174046516418457],
 [52, 0.4324840307235718],
 [44, 0.4226316511631012],
 [35, 0.40380775928497314],
 [18, 0.39104241132736206],
 [48, 0.37757188081741333],
 [39, 0.37700143456459045],
 [0, 0.3722068667411804],
 [25, 0.36621779203414917],
 [40, 0.3467150628566742],
 [12, 0.32266533374786377],
 [53, 0.312508761882782],
 [54, 0.31151333451271057],
 [20, 0.309421181678772],
 [28, 0.30819910764694214],
 [8, 0.28799891471862793],
 [34, 0.2843751311302185],
 [57, 0.2709887623786926],
 [6, 0.2483585923910141],
 [26, 0.23561148345470428],
 [33, 0.23407180607318878],
 [24, 0.2323741316795349],
 [11, 0.2299753725528717],
 [37, 0.22632792592048645],
 [3, 0.21640968322753906],
 [49, 0.2153039276599884],
 [55, 0.2120894193649292],
 [61, 0.2111295461654663],
 [19, 0.2110472321510315],
 [7, 0.20394273102283478],
 [60, 0.20004187524318695],
 [14, 0.18269197642803192],
 [59, 0.18240144848823547],
 [22, 0.17757660150527954],
 [21, 0.17493432760238647],
 [27, 0.15213561058044434],
 [51, 0.145545

In [203]:
# Articles: [46, 54, 56] <-- to test for

article_index = 46
article_index = 56
article_index = 54
article_index = 11

t, v = get_doc(article_index)
print(f"Main Doc Index- {article_index}    Title- {t}    Summary- {v[0]}", end="\n\n")

k_best = top_k_articles(article_index, lf_data, 3)  #LongFormer
k_best = top_k_articles(article_index, mlm_data, 3) #MiniLM-L6-v2

kbi = [i for i, v in k_best]
for i, v in k_best:
    r, t = get_doc(i) #62 documents (0-61 indexing)
    print(f"     Doc Index- {i}    Score- {v}\n     Title- {r}    Summary- {t[0]}")

# get_doc(61) #62 documents (0-61 indexing)

Main Doc Index- 11    Title- Supervised learning - Wikipedia    Summary- - supervised learning is the task of learning a function that maps an input to an output based on example pairs

     Doc Index- 42    Score- 0.7780314683914185
     Title- 9 Temporal-Difference Learning    Summary- - This chapter introduces a reinforcement learning method called Temporal-Difference (TD) learning
     Doc Index- 12    Score- 0.7635361552238464
     Title- Unsupervised learning - Wikipedia    Summary- - unsupervised learning is a type of algorithm that learns patterns from untagged data.Unlike supervised learning where data is tagged by an expert, unsupervised methods exhibit self-organization
     Doc Index- 37    Score- 0.730943500995636
     Title- An Intuitive Explanation of Policy Gradient | by Adrien Lucas Ecoffet | Towards Data Science    Summary- - this is the first in a series of tutorials on the policy gradient family of reinforcement learning algorithms


In [178]:
kbi[2]
get_doc(34)
get_doc(54)

('2002.08910',
 ['- pre-trained neural language models can be used to answer natural language queries without access to external context or knowledge',
  '- Adam Roberts, Noam Shazeer and Colin Raffel present a new approach to pre-tuning T5',
  '- They show that the models can internalize an implicit knowledge base for natural language questions'])