## Sentence Transformers (specifically stsb-mpnet-base-v2)
This is arguably the best sentence-level embedding model on Hugging Face.
The open question is whether it is good enough for real-world use.


In [None]:
!pip install sentence-transformers rank_bm25

In [3]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted", "Kastan is a fun programmer"]

model = SentenceTransformer('sentence-transformers/stsb-mpnet-base-v2')
embeddings = model.encode(sentences)
# print(embeddings)
print("embeddings.shape:", embeddings.shape)

# Raw dot products (not cosine similarity), so the scores are not bounded by 1
score01 = embeddings[0] @ embeddings[1]
score02 = embeddings[0] @ embeddings[2]

print(score01, score02) # the first two are closer than the first and third

Ignored unknown kwarg option direction
embeddings.shape: (3, 768)
5.535631 0.17808935
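
The scores printed above are raw dot products, which is why the first one is larger than 1. Below is a minimal sketch of cosine similarity, which rescales each score into the range [-1, 1] so that pairs are easier to compare. The toy vectors are made up and stand in for real embeddings, and `cosine_similarity` is a hypothetical helper, not something the notebook defines.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b after scaling both vectors to unit length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for 768-d sentence embeddings (made up for illustration)
v0 = [3.0, 4.0, 0.0]
v1 = [6.0, 8.0, 0.0]   # same direction as v0, twice the length
v2 = [0.0, 0.0, 5.0]   # orthogonal to v0

raw_dot_01 = sum(x * y for x, y in zip(v0, v1))
print("dot:", raw_dot_01, "cosine:", cosine_similarity(v0, v1))
print("cosine with orthogonal vector:", cosine_similarity(v0, v2))
```

In the notebook itself, `util.cos_sim(embeddings[0], embeddings[1])` from `sentence_transformers` computes the same quantity on the real embeddings.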


# DocQuery
DocQuery does extractive question answering: it looks up the answer span in the document text rather than generating one.
I like that because the answers should be more factual.

This implementation also works directly with PDFs, which makes it easy to bring in all kinds of new data.

In [None]:
!pip install docquery

In [None]:
!brew install poppler

In [38]:
from docquery import document, pipeline
p = pipeline('document-question-answering')

# using shorter PDF for quick testing
doc = document.load_document("../notes/Student_Notes_short.pdf")

# use full pdf for real testing
# doc = document.load_document("../data-generator/notes/Student Notes.pdf")

In [40]:
questions = [
  "What are boolean logic operations?", 
  "What is overflow?",
  ]

for q in questions:
  print(q, p(question=q, **doc.context))

What are boolean logic operations? {'score': 0.9949806332588196, 'answer': 'notational conventions and tools that we use to express general functions on bits.', 'word_ids': [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44], 'page': 10}
What is overflow? {'score': 0.04521600529551506, 'answer': 'negative pattern for -8 in 4-bit 2’s complement.', 'word_ids': [134, 135, 136, 137, 138, 139, 140, 141], 'page': 5}
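
The second answer above has a score of about 0.045, which suggests the model did not find a good span for that question. Below is a minimal sketch of discarding low-confidence answers. The `min_score` threshold of 0.5 and the helper name `confident_answer` are assumptions for illustration, and the result dicts are abbreviated copies of the output above.

```python
def confident_answer(result, min_score=0.5):
    """Return the answer text if its score clears the threshold, else None."""
    if result["score"] >= min_score:
        return result["answer"]
    return None

# Abbreviated copies of the two results printed above
results = {
    "What are boolean logic operations?": {
        "score": 0.9949806332588196,
        "answer": "notational conventions and tools that we use to express general functions on bits.",
        "page": 10,
    },
    "What is overflow?": {
        "score": 0.04521600529551506,
        "answer": "negative pattern for -8 in 4-bit 2’s complement.",
        "page": 5,
    },
}

for question, result in results.items():
    print(question, "->", confident_answer(result))
```

A sensible threshold depends on the document and the model, so it is worth tuning on a handful of questions with known answers.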


In [4]:
!docquery scan "What is the invoice number?" https://templates.invoicehome.com/invoice-template-us-neat-750px.png

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
document-question-answering is already registered. Overwriting pipeline for task document-question-answering...
image-classification is already registered. Overwriting pipeline for task image-classification...
2022-10-01 15:35:02,786 INFO: Loading https://templates.invoicehome.com/invoice-template-us-neat-750px.png
2022-10-01 15:35:02,786 INFO: Done loading files. Loading pipeline...

In [None]:

import json
import re
from transformers import pipeline

with open('GPT-3_paragraph_level.json') as f:
    data = json.load(f)

# Extractive QA model fine-tuned on SQuAD 2.0
model_name = "deepset/roberta-base-squad2"
# top_k sets how many answers you want the pipeline to return (each with a score)
pipe = pipeline('question-answering', model=model_name, tokenizer=model_name, max_answer_len=128, top_k=5)

all_retrieve_data = []
for item in data:
    # The '.' is unescaped, so this strips "\nQ" plus any one following character
    question = re.sub('\nQ.', '', item['questions'])
    context = re.sub('\n', ' ', item['positive_ctxs']['text'])
    # Skip entries with an empty question or context
    if not question or not context:
        continue
    retrieval = pipe(question=question, context=context)
    all_retrieve_data.append(retrieval)

with open('section_level_retrieval.json', 'w', encoding='utf-8') as f:
    json.dump(all_retrieve_data, f, ensure_ascii=False, indent=4)

# Returns something like this:
# [{'score': 0.47350358963012695, 'start': 20, 'end': 28, 'answer': 'textbook'},
#  {'score': 0.1505853682756424,
#   'start': 20,
#   'end': 41,
#   'answer': 'textbook and in class'},
#  {'score': 0.041666436940431595,
#   'start': 16,
#   'end': 28,
#   'answer': 'the textbook'}]
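
The three example answers above are overlapping spans from the same region of the context (characters 16 to 41). Below is a minimal sketch of collapsing such overlaps so that each region contributes only its highest-scoring answer. `dedupe_overlapping` is a hypothetical helper, not part of the `transformers` API.

```python
def dedupe_overlapping(answers):
    """Greedily keep answers, best score first, skipping any that overlap an already kept span."""
    kept = []
    for cand in sorted(answers, key=lambda a: a["score"], reverse=True):
        # Half-open spans [start, end) overlap when each starts before the other ends
        if not any(cand["start"] < k["end"] and k["start"] < cand["end"] for k in kept):
            kept.append(cand)
    return kept

# The example pipeline output shown above
answers = [
    {"score": 0.47350358963012695, "start": 20, "end": 28, "answer": "textbook"},
    {"score": 0.1505853682756424, "start": 20, "end": 41, "answer": "textbook and in class"},
    {"score": 0.041666436940431595, "start": 16, "end": 28, "answer": "the textbook"},
]

deduped = dedupe_overlapping(answers)
print([a["answer"] for a in deduped])
```

Because all three spans overlap the top-scoring one, only `'textbook'` survives here.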

### Wikipedia Retrieval

In [2]:
# input a question, use semantic search to find the relevant passages in the Simple English Wikipedia
# crossencoder: cross-encoder/ms-marco-MiniLM-L-6-v2 

import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")


# We use the bi-encoder to encode all passages, so that we can use it with semantic search
bi_encoder = SentenceTransformer('sentence-transformers/stsb-mpnet-base-v2')
bi_encoder.max_seq_length = 256     # Truncate long passages to 256 tokens
top_k = 32                          # Number of passages we want to retrieve with the bi-encoder

# The bi-encoder retrieves top_k passages. We use a cross-encoder to re-rank the results list to improve the quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# As dataset, we use Simple English Wikipedia. Compared to the full English Wikipedia, it has only
# about 170k articles. We split these articles into paragraphs and encode them with the bi-encoder

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

passages = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        #Add all paragraphs
        #passages.extend(data['paragraphs'])

        #Only add the first paragraph
        passages.append(data['paragraphs'][0])

print("Passages:", len(passages))

# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)


Passages: 169597


Batches:   0%|          | 0/5300 [00:00<?, ?it/s]

Ignored unknown kwarg option direction
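
Encoding the whole corpus takes several minutes, so it may be worth caching the result between notebook sessions. Below is a minimal sketch using `pickle`. The helper name `load_or_encode`, the cache file name, and the stand-in `fake_encode` function are all assumptions for illustration. In the notebook, `encode_fn` would be something like `lambda p: bi_encoder.encode(p, convert_to_tensor=True)`.

```python
import os
import pickle
import tempfile

def load_or_encode(passages, encode_fn, cache_path):
    """Return cached embeddings if the cache matches `passages`, else encode and cache."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            cached = pickle.load(f)
        # Only reuse the cache when it was built from the same passages
        if cached["passages"] == passages:
            return cached["embeddings"]
    embeddings = encode_fn(passages)
    with open(cache_path, "wb") as f:
        pickle.dump({"passages": passages, "embeddings": embeddings}, f)
    return embeddings

# Stand-in encoder that counts its calls (hypothetical; replaces the real model here)
calls = []
def fake_encode(texts):
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

cache_file = os.path.join(tempfile.mkdtemp(), "corpus_embeddings.pkl")
first = load_or_encode(["a", "bb"], fake_encode, cache_file)
second = load_or_encode(["a", "bb"], fake_encode, cache_file)
print(first == second, "encoder calls:", len(calls))
```

Whether tensors stored on a GPU round-trip cleanly depends on your setup, so treat this as a starting point rather than a drop-in solution.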

In [4]:
# This function searches all Wikipedia passages for ones that answer the query.
# It returns a dict mapping the query to its top-5 re-ranked passages.
def search(query):
    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Keep the query on the same device as the corpus (works with or without a GPU)
    question_embedding = question_embedding.to(corpus_embeddings.device)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Sort by cross-encoder score and keep the top-5 passages
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    output = []
    for hit in hits[0:5]:
        output.append(passages[hit['corpus_id']].replace("\n", " "))
    return {query: output}
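
The two-stage flow in `search` (a cheap first pass over the whole corpus, then a more expensive re-ranking of the survivors) can be illustrated without downloading any model. Below is a minimal sketch with toy scoring functions: `cheap_score` and `careful_score` are hypothetical stand-ins for the bi-encoder and the cross-encoder, and `retrieve_and_rerank` is not part of `sentence-transformers`.

```python
def retrieve_and_rerank(query, corpus, cheap_score, careful_score, top_k=3, final_k=2):
    """Stage 1: keep the top_k passages by cheap_score. Stage 2: re-rank them by careful_score."""
    candidates = sorted(corpus, key=lambda p: cheap_score(query, p), reverse=True)[:top_k]
    reranked = sorted(candidates, key=lambda p: careful_score(query, p), reverse=True)
    return reranked[:final_k]

def cheap_score(query, passage):
    # Toy first-stage score: number of distinct words shared with the query
    return len(set(query.lower().split()) & set(passage.lower().split()))

def careful_score(query, passage):
    # Toy second-stage score: shared words divided by passage length,
    # so short, focused passages are preferred
    return cheap_score(query, passage) / len(passage.split())

# Made-up mini corpus for illustration
corpus = [
    "the capital of ohio is columbus",
    "capital punishment is legal in some of the united states",
    "washington is the capital of the united states",
    "cats are small animals",
]

top = retrieve_and_rerank("capital of the united states", corpus, cheap_score, careful_score)
print(top)
```

The first stage alone cannot separate the "capital punishment" passage from the one about Washington, since both share the same query words. The second stage breaks the tie, which is the role the cross-encoder plays in the real pipeline.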
        

In [None]:
import json
import re

with open('GPT-3_paragraph_level.json') as f:
    data = json.load(f)

all_retrieve_data = []
for item in data:
    question = re.sub('\nQ.', '', item['questions'])
    # Skip empty questions before running the (expensive) search
    if not question:
        continue
    retrieved_passage = search(query=question)
    all_retrieve_data.append(retrieved_passage)

with open('wiki_retrieval_paragraph.json', 'w', encoding='utf-8') as f:
    json.dump(all_retrieve_data, f, ensure_ascii=False, indent=4)

In [5]:
search(query = "What is the capital of the United States?")

Input question: What is the capital of the United States?
Top-3 lexical search (BM25) hits
	13.316	Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states. The federal government (including the United States military) also uses capital punishment.
	11.434	Ohio is one of the 50 states in the United States. Its capital is Columbus. Columbus also is the largest city in Ohio.
	11.179	Nevada is one of the United States' states. Its capital is Carson City. Other big cities are Las Vegas and Reno.
Ignored unknown kwarg option direction

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.571	A capital district, capital region, or capital territory is a district in which the capital of a; state, country or settlement is located.
	0.565	A capital city (or capital town or just capital) is a city or town, specified by law or 

In [6]:
search(query="When is Chinese New Year")

Input question: When is Chinese New Year
Top-3 lexical search (BM25) hits
	18.743	Chinese New Year, known in China as the SpringFestival and in Singapore as the LunarNewYear, is a holiday on and around the new moon on the first day of the year in the traditional Chinese calendar. This calendar is based on the changes in the moon and is only sometimes changed to fit the seasons of the year based on how the Earth moves around the sun. Because of this, Chinese New Year is never on January1. It moves around between January21 and February20.
	18.527	New Year in Japan is one of the most important festivals. Unlike the Chinese New Year, it is held on January 1.
	15.789	The CCTV New Year's Gala (Simplified Chinese: 中国中央电视台春节联欢晚会; Traditional Chinese: 中國中央電視台春節聯歡晚會; Pinyin: "Zhōngguó zhōngyāng diànshìtái chūnjié liánhuān wǎnhuì") is a Chinese New Year special produced by China Central Television. It was presented by Zhao Zhongxiang.
Ignored unknown kwarg option direction