## Sentence Transformers (specifically stsb-mpnet-base-v2)
As far as I can tell, this is the best sentence-level embedding model on Hugging Face for semantic similarity.
But we'll see if it's good enough for the real world.


In [None]:
!pip install sentence-transformers

In [12]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted", "Kastan is a fun programmer"]

model = SentenceTransformer('sentence-transformers/stsb-mpnet-base-v2')
embeddings = model.encode(sentences)
# print(embeddings)
print("embeddings.shape:", embeddings.shape)

# Raw dot products (not cosine similarity), so the scores are not bounded to [-1, 1]
score01 = embeddings[0] @ embeddings[1]  # 5.535633 in the run below
score02 = embeddings[0] @ embeddings[2]  # 0.17808995 in the run below

print(score01, score02) # the first two are closer than the first and third

embeddings.shape: (3, 768)
5.535633 0.17808995
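
The scores above are raw dot products, so they depend on vector length as well as direction. A minimal sketch of cosine similarity, which normalizes the length away, is below. It uses small made-up vectors in place of the real 768-dimensional embeddings, and `cosine_similarity` is a hypothetical helper, not part of sentence-transformers.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b after scaling both to unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model.encode(...) output (made up for illustration).
toy = [
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],  # same direction as row 0, twice the length
    [0.0, 0.0, 3.0],  # orthogonal to row 0
]

def dot(a, b):
    """Plain dot product, for comparison with the normalized score."""
    return sum(x * y for x, y in zip(a, b))

print("dot:", dot(toy[0], toy[1]), "cosine:", cosine_similarity(toy[0], toy[1]))
print("dot:", dot(toy[0], toy[2]), "cosine:", cosine_similarity(toy[0], toy[2]))
```

With the real embeddings the call would be `cosine_similarity(embeddings[0], embeddings[1])`. sentence-transformers also provides `util.cos_sim`, which is probably the better choice in practice.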


## DocQuery
This does pure question-to-text lookup: it extracts an answer span from the document instead of generating new text.
I like that because the answers are hopefully more factual.

This implementation also works directly with PDFs, which makes it easy to bring in all kinds of new data!

In [None]:
!pip install docquery

In [None]:
!brew install poppler

In [38]:
from docquery import document, pipeline
p = pipeline('document-question-answering')

# using shorter PDF for quick testing
doc = document.load_document("../notes/Student_Notes_short.pdf")

# use full pdf for real testing
# doc = document.load_document("../data-generator/notes/Student Notes.pdf")

In [40]:
questions = [
  "What are boolean logic operations?", 
  "What is overflow?",
  ]

for q in questions:
  print(q, p(question=q, **doc.context))

What are boolean logic operations? {'score': 0.9949806332588196, 'answer': 'notational conventions and tools that we use to express general functions on bits.', 'word_ids': [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44], 'page': 10}
What is overflow? {'score': 0.04521600529551506, 'answer': 'negative pattern for -8 in 4-bit 2’s complement.', 'word_ids': [134, 135, 136, 137, 138, 139, 140, 141], 'page': 5}
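
The second answer above comes back with a score of about 0.05 and does not really define overflow, so it may be worth dropping low-confidence answers before using them. A minimal sketch, assuming the result dicts carry the `score` and `answer` keys shown in the output. `MIN_SCORE` and `confident_answer` are hypothetical names, and the 0.5 cutoff is an arbitrary starting point rather than a recommended value.

```python
MIN_SCORE = 0.5  # hypothetical cutoff; tune it against your own documents

def confident_answer(result, min_score=MIN_SCORE):
    """Return the answer text, or None when the score is below min_score."""
    if result["score"] < min_score:
        return None
    return result["answer"]

# Results copied (abridged) from the output above.
results = {
    "What are boolean logic operations?": {
        "score": 0.9949806332588196,
        "answer": "notational conventions and tools that we use to express general functions on bits.",
        "page": 10,
    },
    "What is overflow?": {
        "score": 0.04521600529551506,
        "answer": "negative pattern for -8 in 4-bit 2’s complement.",
        "page": 5,
    },
}

for question, result in results.items():
    print(question, "->", confident_answer(result))
```

In the notebook, the same helper could wrap the pipeline call directly: `confident_answer(p(question=q, **doc.context))`.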


In [4]:
!docquery scan "What is the invoice number?" https://templates.invoicehome.com/invoice-template-us-neat-750px.png

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
image-classification is already registered. Overwriting pipeline for task image-classification...
2022-10-01 15:35:02,786 INFO: Loading https://templates.invoicehome.com/invoice-template-us-neat-750px.png
2022-10-01 15:35:02,786 INFO: Done loading files. Loading pipeline...

In [15]:

import json
import re
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
import torch

# Load the section-level question/context pairs
with open('../gpt-3/GPT-3_section_level.json') as f:
    data = json.load(f)

# top_k sets how many answers you want the pipeline to return (each with a score)
# model = SentenceTransformer('sentence-transformers/stsb-mpnet-base-v2')
model = "deepset/roberta-base-squad2"
pipe = pipeline('question-answering', model=model, tokenizer=model, max_answer_len=128, top_k=5)
all_retrieve_data = []
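for item in data:
    # Strip "\nQ" markers ('.' is a regex wildcard, so it also removes the character after Q)
    question = re.sub('\nQ.', '', item['questions'])
    # Flatten the context onto a single line
    context = item['positive_ctxs']['text'].replace('\n', ' ')
    if not question or not context:
        continue  # skip entries that are missing either field
    # With top_k=5 this is a list of up to five answer dicts
    retrieval = pipe(question=question, context=context)
    all_retrieve_data.append(retrieval)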

with open('section_level_retrieval.json', 'w', encoding='utf-8') as f:
    json.dump(all_retrieve_data, f, ensure_ascii=False, indent=4) 

# Returns something like this:
# [{'score': 0.47350358963012695, 'start': 20, 'end': 28, 'answer': 'textbook'},
#  {'score': 0.1505853682756424,
#   'start': 20,
#   'end': 41,
#   'answer': 'textbook and in class'},
#  {'score': 0.041666436940431595,
#   'start': 16,
#   'end': 28,
#   'answer': 'the textbook'}]
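
Since each question comes back with several scored candidates, a small helper can pick the top one. A minimal sketch, assuming each candidate is a dict with a `score` key as in the sample above. `best_answer` is a hypothetical name, and the single-dict branch covers the case where the pipeline returns one answer instead of a list.

```python
def best_answer(candidates):
    """Return the highest-scoring candidate from one question's results."""
    if isinstance(candidates, dict):  # a single answer rather than a list
        return candidates
    return max(candidates, key=lambda c: c["score"])

# Sample results copied from the comment above.
sample = [
    {"score": 0.47350358963012695, "start": 20, "end": 28, "answer": "textbook"},
    {"score": 0.1505853682756424, "start": 20, "end": 41, "answer": "textbook and in class"},
    {"score": 0.041666436940431595, "start": 16, "end": 28, "answer": "the textbook"},
]

print(best_answer(sample)["answer"])
```

Applied to the saved results, this would be something like `[best_answer(r) for r in all_retrieve_data]`.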

