# Week 1
This week's objective is to develop and test benchmarks. These will give us a baseline to compare our methods against.


Objectives:
- Find example PDF / text data.
- Set up LangChain Recursive Text Splitter.
- Set up fixed-length token splitter.
- Set up Chroma DB.
- Create pipeline of: Text + Chunker -> Chroma Store

# Example Text Data
To start off simple, I copied a recent BBC News article about the Effective Accelerationist Grimes.

In [5]:
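def read_txt_file(file_path):
    """Read a text file and return its contents as a single string."""
    # Explicit encoding avoids platform-dependent defaults
    with open(file_path, 'r', encoding='utf-8') as file:
        data = file.read()
    return data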

# Test the function with the news.txt file
news_data = read_txt_file('../data/news.txt')
print(news_data[:200]+'...')


Coachella: Grimes apologises for technical difficulties

Mon 15 Apr

BBC NEWS

Grimes has apologised for "major technical difficulties" during her Coachella DJ set.

Fans watched the singer scream in ...


# Set up LangChain Recursive Text Splitter

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the Recursive Text Splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)

# Split the news_data using the splitter
split_text = splitter.split_text(news_data)

# Print the first 5 splits
print(split_text[:5])



['Coachella: Grimes apologises for technical difficulties\n\nMon 15 Apr\n\nBBC NEWS', 'BBC NEWS\n\nGrimes has apologised for "major technical difficulties" during her Coachella DJ set.', 'Fans watched the singer scream in frustration after a string of problems - such as songs playing at', 'as songs playing at double-speed - marred the second half of her festival slot.', 'Posting on X, the singer said it was "one of the first times" she had "outsourced essential']
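
The repeated fragments in the output above ('BBC NEWS', 'as songs playing at') come from `chunk_overlap=20`. As a rough illustration of how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of plain fixed-size character windowing. It is not LangChain's algorithm, which also splits on separators such as newlines and spaces, and `split_fixed` is a hypothetical helper name.

```python
def split_fixed(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a larger overlap produces more chunks for the same text.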


In [None]:
# from langchain_experimental.text_splitter import SemanticChunker
# from langchain_openai.embeddings import OpenAIEmbeddings

# Set up LangChain fixed-length token splitter

In [15]:
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)

texts = splitter.split_text(news_data)

# Print the first 5 token-based splits (held in `texts`, not the earlier `split_text`)
print(texts[:5])


# Setting up Chroma

In [17]:
import chromadb
chroma_client = chromadb.PersistentClient(path="../data/chroma_db")

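# get_or_create_collection avoids an error if the collection already exists when the cell is re-run
collection = chroma_client.get_or_create_collection(name="chuck_1")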


# Retrieval Precision
The ARAGOG paper used Tonic Validate for this metric. I have taken Tonic's prompt so that our implementation is identical (except that we can use models beyond GPT-3.5).
Ref: https://github.com/TonicAI/tonic_validate/blob/main/tonic_validate/utils/llm_calls.py

In [10]:
def get_retrieval_precision_prompt(question, context):
    main_message = ("Considering the following question and context, determine whether the context "
                    "is relevant for answering the question. If the context is relevant for "
                    "answering the question, respond with true. If the context is not relevant for "
                    "answering the question, respond with false. Respond with either true or false "
                    "and no additional text.")

    main_message += f"\nQUESTION: {question}\n"
    main_message += f"CONTEXT: {context}\n"

    return main_message

In [11]:
# Testing the function get_retrieval_precision_prompt
question = "What is the capital of France?"
context = "France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower."

print(get_retrieval_precision_prompt(question, context))


Considering the following question and context, determine whether the context is relevant for answering the question. If the context is relevant for answering the question, respond with true. If the context is not relevant for answering the question, respond with false. Respond with either true or false and no additional text.
QUESTION: What is the capital of France?
CONTEXT: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.



In [40]:
import os
from openai import OpenAI

OPENAI_API_KEY = os.getenv('OPENAI_CHROMA_API_KEY')

client = OpenAI(api_key=OPENAI_API_KEY)

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)

ChatCompletionMessage(content="In the realm of coding's elegant dance,\nLies a concept, a mystical trance,\nRecursion, a method cleverly devised,\nWhere functions call upon themselves, disguised.\n\nLike a mirror reflecting its own reflection,\nRecursion invokes a deep connection,\nTo tackle problems with beauty and grace,\nIn a recursive embrace, they boldly embrace.\n\nA function that calls itself, again and again,\nUnraveling complexity, breaking the chain,\nEach iteration peeling layers of the maze,\nUntil the solution emerges, in a recursive blaze.\n\nWith elegance and power, recursion weaves,\nA tapestry of solutions, that it achieves,\nInfinite loops, a danger to beware,\nYet in skilled hands, recursion is rare.\n\nSo embrace the loop of self-reference,\nLet recursion be your ally, your guidance,\nIn the world of coding, a concept profound,\nRecursion's magic forever unbound.", role='assistant', function_call=None, tool_calls=None)


In [6]:
import os
import anthropic

ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_CHROMA_API_KEY')

client = anthropic.Anthropic(
    api_key=ANTHROPIC_API_KEY,
)

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1000,
    temperature=0.0,
    system="Respond only in Yoda-speak.",
    messages=[
        {"role": "user", "content": "How are you today?"}
    ]
)

print(message.content)

[TextBlock(text='*clears throat and speaks in a croaky voice* Hmm, well I am today, young Padawan. The Force, strong in me it flows. Yes, hmmm.', type='text')]


In [18]:
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1000,
    temperature=0.0,
    system=get_retrieval_precision_prompt(question, context),
    messages=[
        {"role": "user", "content": "Is this CONTEXT relevant?"}
    ]
)

print(message.content)

[TextBlock(text='true', type='text')]
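
To use this judgment in a metric, the text reply has to become a boolean. A minimal sketch, assuming the model follows the prompt and answers with only "true" or "false" (possibly with stray whitespace, casing, or punctuation); `parse_relevance` is a hypothetical helper, not part of Tonic Validate or the Anthropic SDK.

```python
def parse_relevance(response_text: str) -> bool:
    """Map a 'true'/'false' reply to a bool, failing loudly on anything else."""
    # Tolerate surrounding whitespace, quotes, a trailing full stop, and casing
    cleaned = response_text.strip().strip(".\"'").lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    raise ValueError(f"Unexpected relevance reply: {response_text!r}")


print(parse_relevance("true"))
print(parse_relevance(" False.\n"))
```

With the response shown above, this would be called as `parse_relevance(message.content[0].text)`. Raising on unexpected replies, rather than defaulting to `False`, keeps malformed answers from silently lowering the score.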


In [21]:
import json

# Open the json file and read it
with open('../eval_questions/eval_data.json', 'r') as file:
    data = json.load(file)

# Print the data to verify it's been read correctly
print(len(data['questions']))


107


Thoughts:

Currently using:
- ARAGOG's dataset (small; they added all other papers, arXiv papers)
- ARAGOG's questions (107 questions)
- Tonic Validate's prompt for Retrieval Precision

# PDF to Text to Chroma DB

In [23]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../papers_for_questions/bert.pdf")
pages = loader.load_and_split()

In [35]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the Recursive Text Splitter
splitter = RecursiveCharacterTextSplitter(
    # chunk_size=1024,
    # chunk_overlap=256,
)

# Split the loaded PDF pages using the splitter
split_text = splitter.split_documents(pages)

In [None]:
split_text[:5]

import tiktoken

# Count the number of tokens in each page_content
def num_tokens_from_string(string: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding("cl100k_base")
    num_tokens = len(encoding.encode(string))
    return num_tokens

for page in split_text:
    print(num_tokens_from_string(page.page_content))
    # print(page.page_content[:200]+'...')



In [44]:
split_text[1].metadata

{'source': '../papers_for_questions/bert.pdf', 'page': 1}

In [45]:
split_text[1].page_content

'word based only on its context. Unlike left-to-\nright language model pre-training, the MLM ob-\njective enables the representation to fuse the left\nand the right context, which allows us to pre-\ntrain a deep bidirectional Transformer. In addi-\ntion to the masked language model, we also use\na “next sentence prediction” task that jointly pre-\ntrains text-pair representations. The contributions\nof our paper are as follows:\n• We demonstrate the importance of bidirectional\npre-training for language representations. Un-\nlike Radford et al. (2018), which uses unidirec-\ntional language models for pre-training, BERT\nuses masked language models to enable pre-\ntrained deep bidirectional representations. This\nis also in contrast to Peters et al. (2018a), which\nuses a shallow concatenation of independently\ntrained left-to-right and right-to-left LMs.\n• We show that pre-trained representations reduce\nthe need for many heavily-engineered task-\nspeciﬁc architectures. BERT is the ﬁr

In [41]:
import chromadb

import chromadb.utils.embedding_functions as embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key=OPENAI_API_KEY,
                model_name="text-embedding-3-small"
            )

chroma_client = chromadb.PersistentClient(path="../data/chroma_db")

collection = chroma_client.get_or_create_collection(name="chuck_1", embedding_function=openai_ef)

In [52]:
collection.count()

24

In [50]:
documents = [chunk.page_content for chunk in split_text]
metadatas = [chunk.metadata for chunk in split_text]
ids = [str(i) for i in range(len(split_text))]


In [51]:
collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
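
The ids above are plain positions ('0', '1', ...), so they would repeat if a second PDF were chunked the same way and added to this collection. A minimal sketch of building the same three lists with a per-source prefix; `Chunk` is a hypothetical stand-in for the LangChain document objects in `split_text`, and `build_chroma_payload` and the "bert" prefix are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_chroma_payload(chunks, id_prefix: str) -> dict:
    """Build the documents/metadatas/ids lists, with ids made unique per source."""
    return {
        "documents": [chunk.page_content for chunk in chunks],
        "metadatas": [chunk.metadata for chunk in chunks],
        "ids": [f"{id_prefix}-{i}" for i in range(len(chunks))],
    }


example_chunks = [
    Chunk("first chunk", {"page": 0}),
    Chunk("second chunk", {"page": 0}),
]
payload = build_chroma_payload(example_chunks, id_prefix="bert")
print(payload["ids"])
```

In the notebook this could be used as `collection.add(**build_chroma_payload(split_text, id_prefix="bert"))`, since `collection.add` takes `documents`, `metadatas`, and `ids` as keyword arguments.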

In [55]:
collection.query(query_texts=["What are the two main tasks BERT is pre-trained on?"], n_results=5)

{'ids': [['3', '20', '0', '6', '4']],
 'distances': [[0.7218702708002468,
   0.7597797400597308,
   0.7711573382687963,
   0.7771485522310896,
   0.8290271472604436]],
 'metadatas': [[{'page': 2, 'source': '../papers_for_questions/bert.pdf'},
   {'page': 13, 'source': '../papers_for_questions/bert.pdf'},
   {'page': 0, 'source': '../papers_for_questions/bert.pdf'},
   {'page': 4, 'source': '../papers_for_questions/bert.pdf'},
   {'page': 3, 'source': '../papers_for_questions/bert.pdf'}]],
 'embeddings': None,
 'documents': [['BERT BERT \nE[CLS] E1 E[SEP] ... ENE1’... EM’\nC\nT1\nT[SEP] ...\n TN\nT1’...\n TM’\n[CLS] Tok 1 [SEP] ... Tok NTok 1 ... TokM \nQuestion Paragraph Start/End Span \nBERT \nE[CLS] E1 E[SEP] ... ENE1’... EM’\nC\nT1\nT[SEP] ...\n TN\nT1’...\n TM’\n[CLS] Tok 1 [SEP] ... Tok NTok 1 ... TokM \nMasked Sentence A Masked Sentence B \nPre-training Fine-Tuning NSP Mask LM Mask LM \nUnlabeled Sentence A and B Pair SQuAD \nQuestion Answer Pair NER MNLI Figure 1: Overall pre-tr
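
The query result is a dict of parallel lists, nested one list per query text. A minimal sketch that flattens one query's hits into per-hit records, which may be convenient when each returned document has to be passed to the relevance prompt. `flatten_query_result` is a hypothetical helper, not part of Chroma, and the example dict below is made-up data that only mirrors the shape of the output above.

```python
def flatten_query_result(result: dict, query_index: int = 0) -> list[dict]:
    """Turn Chroma's parallel result lists into one record per retrieved chunk."""
    ids = result["ids"][query_index]
    distances = result["distances"][query_index]
    metadatas = result["metadatas"][query_index]
    documents = result["documents"][query_index]
    return [
        {"id": id_, "distance": distance, "metadata": metadata, "document": document}
        for id_, distance, metadata, document in zip(ids, distances, metadatas, documents)
    ]


# Made-up data in the same shape as a query result with one query text and two hits
example_result = {
    "ids": [["3", "20"]],
    "distances": [[0.72, 0.76]],
    "metadatas": [[{"page": 2}, {"page": 13}]],
    "embeddings": None,
    "documents": [["text of chunk 3", "text of chunk 20"]],
}

hits = flatten_query_result(example_result)
for hit in hits:
    print(hit["id"], hit["distance"], hit["metadata"]["page"])
```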

Current Issue:

Retrieval Precision expects a COMPLETE RAG system and measures the number of relevant contexts divided by the TOTAL number of returned contexts.

The issue with this is that we'd ideally always retrieve a fixed N contexts. The metric should not punish a retriever that returns all of the relevant context when little relevant context exists. Nor should it reward a retriever that returns only one relevant context when plenty exists.
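
A small worked example of this concern, assuming precision is computed as relevant retrieved contexts over total retrieved contexts and that N is fixed at 5. The two judgment lists are made up for illustration, and `retrieval_precision` is a hypothetical helper rather than Tonic Validate's implementation.

```python
def retrieval_precision(judgments: list[bool]) -> float:
    """Fraction of retrieved contexts judged relevant (0.0 when nothing was retrieved)."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)


# Question A: the corpus holds a single relevant chunk and the retriever finds it.
found_the_only_relevant_chunk = [True, False, False, False, False]

# Question B: the corpus holds many relevant chunks but only one is retrieved.
found_one_of_many_relevant_chunks = [True, False, False, False, False]

print(retrieval_precision(found_the_only_relevant_chunk))
print(retrieval_precision(found_one_of_many_relevant_chunks))
```

Both questions receive the same score even though the first retrieval could not have done better and the second missed most of the relevant material, because the metric has no knowledge of how many relevant chunks exist in the corpus.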