## 0. Set up environment

In [2]:
import os
import fitz
import requests
import re

from openai import OpenAI

with open('openai_api_key.txt', 'r') as file:
    os.environ['OPENAI_API_KEY'] = file.read().strip()

## 1. Basic Workflow

In [3]:
def read_pdf(file_path):
    # Read PDF and extract text
    text = ""
    with fitz.open(file_path) as doc:
        for page in doc:
            text += page.get_text() + "\n"
    return text

def process_text(text):
    # Clean and segment text
    
    # Remove everything from the Acknowledgments heading onward
    ack_index = text.lower().find("acknowledgments\n")
    if ack_index != -1:
        text = text[:ack_index]
    
    # Remove any lines that are entirely whitespace
    cleaned_text = "\n".join([line for line in text.split('\n') if line.strip()])
    
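    # Rejoin lines that were wrapped mid-sentence:
    # first, a newline followed by a lowercase letter becomes a space
    subsentences = re.sub(r'\n(?=[a-z])', ' ', cleaned_text)
    # then, a newline not preceded by '.', '!' or '?' becomes a space
    sentences = re.sub(r'(?<=[^\.!?])\n', ' ', subsentences)
    
    # Split on the remaining newlines; each piece is treated as a paragraph
    paragraphs = sentences.split('\n')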
    
    return paragraphs

def query_openai_api(question, context):
    """
    Sends a question and its context to the OpenAI chat completions API.
    
    The API key is read from the OPENAI_API_KEY environment variable.
    
    :param question: The question to ask about the context.
    :param context: The context (e.g., a paragraph from the PDF) related to the question.
    :return: The text of the model's reply.
    """
    client = OpenAI()

    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", 
             "content": "You are an assistant skilled in analyzing and summarizing academic articles."},
            {"role": "user", "content": context},
            {"role": "user", "content": question}
        ],
        temperature=0.2,
        max_tokens=2000
    )

    return completion.choices[0].message.content
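
To see what the two regular expressions in `process_text` do, here is a minimal sketch that applies the same steps to a short made-up string. The helper name `join_wrapped_lines` and the sample text are assumptions for illustration only; they are not part of the notebook.

```python
import re

def join_wrapped_lines(text):
    # Hypothetical helper mirroring the cleaning steps in process_text
    # Drop lines that are empty or whitespace-only
    cleaned = "\n".join(line for line in text.split("\n") if line.strip())
    # A newline followed by a lowercase letter is treated as a wrapped line
    cleaned = re.sub(r'\n(?=[a-z])', ' ', cleaned)
    # A newline not preceded by '.', '!' or '?' is also treated as a wrapped line
    cleaned = re.sub(r'(?<=[^.!?])\n', ' ', cleaned)
    # Whatever newlines remain mark the segment boundaries
    return cleaned.split("\n")

# Made-up sample text with hard line wraps and a blank line
sample = (
    "Text embeddings are commonly\n"
    "evaluated on a small set.\n"
    "\n"
    "We introduce\n"
    "MTEB.\n"
    "It spans 8 tasks."
)

paragraphs = join_wrapped_lines(sample)
print(paragraphs)
# ['Text embeddings are commonly evaluated on a small set.', 'We introduce MTEB.', 'It spans 8 tasks.']
```

Because blank lines are removed before the joins, the only surviving boundaries are newlines that follow sentence-ending punctuation. A "paragraph" here can therefore be as short as a single sentence.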
    

In [39]:
filepath = 'import/MTEB.pdf'

text = read_pdf(filepath)
paras = process_text(text)

to_save = "\n\n".join(paras)
with open('data/MTEB.txt', 'w', encoding='utf-8') as file:
    file.write(to_save)

print(paras[:3])
print(paras[-3:])

['MTEB: Massive Text Embedding Benchmark Niklas Muennighoff1, Nouamane Tazi1, Loïc Magne1, Nils Reimers2* 1Hugging Face 2cohere.ai 1firstname@hf.co 2info@nils-reimers.de Abstract Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art em- beddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the ﬁeld difﬁcult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we intro- duce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks cov- ering a total of 58 datasets and 112 languages.', 'Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date.', 'We ﬁnd that no particular text embedding method dominates across all tasks. This suggests that the 

### Sample Question

In [32]:
question = "What is the title of the paper?"
query_openai_api(question, paras[0])

'The title of the paper is "MTEB: Massive Text Embedding Benchmark".'
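
The request above sends three messages: a system prompt, then the context, then the question. As a rough sketch, the message list can be built and inspected on its own, without calling the API. The helper name `build_messages` and the short context string are assumptions for illustration; the notebook builds the list inline inside `query_openai_api`.

```python
def build_messages(question, context):
    # Hypothetical helper: same message order as in query_openai_api
    return [
        {"role": "system",
         "content": "You are an assistant skilled in analyzing and summarizing academic articles."},
        {"role": "user", "content": context},    # the paragraph to read
        {"role": "user", "content": question},   # the question about it
    ]

# Placeholder context; in the notebook this would be paras[0]
messages = build_messages(
    "What is the title of the paper?",
    "MTEB: Massive Text Embedding Benchmark",
)

for message in messages:
    print(message["role"], "->", message["content"])
```

Printing the list this way makes it easy to check what the model will see before spending tokens on a request.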

## 2. Llama-Index Workflow

In [33]:
# More complex questions require retrieving the relevant chunks of text
# by semantic textual similarity (STS). Let's use llama-index for this.
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext
from llama_index.node_parser import SimpleNodeParser

# Load data
documents = SimpleDirectoryReader("data").load_data()

# Parse nodes
node_parser = SimpleNodeParser.from_defaults(chunk_size=512, chunk_overlap=10)  # Could implement paragraph chunking instead
service_context = ServiceContext.from_defaults(node_parser=node_parser)

# Index
index = VectorStoreIndex.from_documents(
    documents, service_context=service_context
)

query_engine = index.as_query_engine(similarity_top_k=5)  # Return the top 5 most similar chunks

def answer(question):
    response = query_engine.query(question)
    print(response)
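
The cell above chunks the text with an overlap, embeds the chunks, and retrieves the most similar ones for each question. The following toy sketch illustrates that idea in plain Python. It is not how llama-index works internally: it assumes whitespace-separated words instead of tokens and word counts instead of a real embedding model, and the names `chunk_words`, `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
from collections import Counter

def chunk_words(text, chunk_size, chunk_overlap):
    # Split text into chunks of chunk_size words; neighbours share chunk_overlap words
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def embed(text):
    # Toy embedding (assumption): lowercase word counts
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(question, chunks, k):
    # Rank chunks by similarity to the question and keep the best k
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]

# Made-up sample text
text = ("mteb spans eight embedding tasks across many datasets and languages "
        "no single model dominates every task")
chunks = chunk_words(text, chunk_size=6, chunk_overlap=2)
best = top_k("which model dominates", chunks, k=2)
print(best)
# ['model dominates every task', 'and languages no single model dominates']
```

The overlap means a phrase that straddles a chunk boundary ("model dominates") still appears whole in at least one chunk, which is the usual reason for setting `chunk_overlap` above zero.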

### Prompt Engineering

In [48]:
answer("What's the name of this paper? Note that the title of this paper is usually the title at the beginning of the text.")

The name of this paper is "Massive Text Embedding Benchmark".


In [49]:
answer("What are the full names of the authors of this paper? Note that the authors are the names immediately following the paper title.")

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers


In [50]:
answer("Can you summarize key takeaways from this paper?")

The paper discusses the evaluation of different embedding models on various NLP tasks. It highlights the strengths and weaknesses of different models for tasks such as retrieval, clustering, and semantic textual similarity (STS). The benchmark used in the evaluation, called MTEB, consists of 58 datasets across 8 task types. The paper emphasizes the importance of benchmarks and evaluation frameworks in driving NLP progress. It also mentions the insufficiency of existing benchmarks like SentEval and introduces the USEB benchmark that focuses on reranking tasks. The paper aims to make it easy to reproduce the results by providing versioning at the dataset and software level. Overall, the paper provides insights into the performance of different embedding models and aims to simplify the process of selecting the right model for specific NLP tasks.
