# Week 1
This week's goal is to build and test a benchmark pipeline. This gives us a baseline to compare our later methods against.


Objectives:
- Find example PDF / text data.
- Set up the LangChain recursive text splitter.
- Set up a fixed-length token splitter.
- Set up Chroma DB.
- Create the pipeline: Text + Chunker -> Chroma Store.
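
The last objective can be sketched end to end as below. This is a minimal sketch, not the final pipeline: `store_chunks`, the `source` metadata field, and the id scheme are hypothetical names chosen for illustration, and the chunker is passed in as a plain callable so either splitter can be swapped in. It only assumes a collection object exposing Chroma's `add(ids=..., documents=..., metadatas=...)` method; `InMemoryCollection` is a stand-in so the example runs without a database.

```python
from typing import Callable, List


def store_chunks(collection, text: str, chunker: Callable[[str], List[str]],
                 source: str = "news.txt") -> int:
    """Chunk `text` and add the chunks to a Chroma-style collection.

    `collection` is assumed to expose Chroma's `add(ids, documents, metadatas)`.
    Returns the number of chunks stored.
    """
    chunks = [c for c in chunker(text) if c.strip()]  # drop empty chunks
    if not chunks:
        return 0
    collection.add(
        ids=[f"{source}-{i}" for i in range(len(chunks))],  # hypothetical id scheme
        documents=chunks,
        metadatas=[{"source": source, "chunk_index": i} for i in range(len(chunks))],
    )
    return len(chunks)


class InMemoryCollection:
    """Stand-in for a Chroma collection, used only to exercise the sketch."""

    def __init__(self):
        self.rows = {}

    def add(self, ids, documents, metadatas):
        for id_, doc, meta in zip(ids, documents, metadatas):
            self.rows[id_] = (doc, meta)


demo = InMemoryCollection()
n_stored = store_chunks(demo, "First paragraph.\n\nSecond paragraph.",
                        chunker=lambda t: t.split("\n\n"))
print(n_stored, sorted(demo.rows))
```

With the real objects from this notebook, the call would look like `store_chunks(collection, news_data, splitter.split_text)`.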

# Example Text Data
To start simple, I copied a recent BBC News article about the effective accelerationist Grimes.

In [5]:
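def read_txt_file(file_path):
    """Read a text file and return its full contents as a string."""
    # An explicit encoding avoids platform-dependent decoding of the article text
    with open(file_path, 'r', encoding='utf-8') as file:
        data = file.read()
    return data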

# Test the function with the news.txt file
news_data = read_txt_file('../data/news.txt')
print(news_data[:200]+'...')


Coachella: Grimes apologises for technical difficulties

Mon 15 Apr

BBC NEWS

Grimes has apologised for "major technical difficulties" during her Coachella DJ set.

Fans watched the singer scream in ...


# Set up LangChain Recursive Text Splitter

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the Recursive Text Splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)

# Split the news_data using the splitter
split_text = splitter.split_text(news_data)

# Print the first 5 splits
print(split_text[:5])



['Coachella: Grimes apologises for technical difficulties\n\nMon 15 Apr\n\nBBC NEWS', 'BBC NEWS\n\nGrimes has apologised for "major technical difficulties" during her Coachella DJ set.', 'Fans watched the singer scream in frustration after a string of problems - such as songs playing at', 'as songs playing at double-speed - marred the second half of her festival slot.', 'Posting on X, the singer said it was "one of the first times" she had "outsourced essential']
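
The output shows the effect of `chunk_overlap=20`: where a chunk had to be cut mid-sentence, the next chunk repeats the tail of the previous one. The helper below is a small sketch for checking this; `shared_overlap` is a hypothetical name, not part of LangChain, and the three strings are copied from the output above.

```python
def shared_overlap(previous: str, current: str) -> str:
    """Return the longest suffix of `previous` that is also a prefix of `current`."""
    for size in range(min(len(previous), len(current)), 0, -1):
        if previous.endswith(current[:size]):
            return current[:size]
    return ""


chunks = [
    'BBC NEWS\n\nGrimes has apologised for "major technical difficulties" during her Coachella DJ set.',
    "Fans watched the singer scream in frustration after a string of problems - such as songs playing at",
    "as songs playing at double-speed - marred the second half of her festival slot.",
]

# Compare each chunk with the one that follows it
for prev, cur in zip(chunks, chunks[1:]):
    print(repr(shared_overlap(prev, cur)))
```

The second pair shares 19 characters rather than exactly 20, which is consistent with the splitter keeping whole space-separated words up to the overlap limit instead of cutting mid-word. The first pair shares nothing because the split fell on a sentence boundary.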


In [None]:
# from langchain_experimental.text_splitter import SemanticChunker
# from langchain_openai.embeddings import OpenAIEmbeddings

# Set up LangChain Fixed-Length Splitter

In [15]:
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=0
)

texts = splitter.split_text(news_data)

# Print the first 5 splits produced by the token-based splitter
# (the original cell printed `split_text`, the recursive splitter's output, by mistake)
print(texts[:5])
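
`CharacterTextSplitter.from_tiktoken_encoder` splits on a separator first and only uses token counts when merging the pieces, so the resulting chunks are not strictly fixed-length. For a baseline where every chunk holds exactly `chunk_size` tokens (except possibly the last), a sliding window over the token list is enough. The sketch below assumes the text has already been tokenized; `token_windows` is a hypothetical helper, and whitespace tokens stand in for `tiktoken` ids purely to keep the example self-contained.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def token_windows(tokens: Sequence[T], chunk_size: int, chunk_overlap: int = 0) -> List[List[T]]:
    """Cut `tokens` into windows of `chunk_size`, each sharing `chunk_overlap` tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


# Whitespace tokens are a stand-in (assumption); real token ids would come from a tokenizer.
words = "Grimes has apologised for major technical difficulties during her set".split()
windows = token_windows(words, chunk_size=4, chunk_overlap=1)
print(windows)
```

With `tiktoken`, the same function could be fed `tiktoken.get_encoding("cl100k_base").encode(news_data)`, decoding each window back to text afterwards.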


# Setting up Chroma

In [17]:
import chromadb
chroma_client = chromadb.PersistentClient(path="../data/chroma_db")

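# get_or_create_collection avoids an error when this cell is re-run against the persisted DB
collection = chroma_client.get_or_create_collection(name="chuck_1")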
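
Once chunks are stored, retrieval goes through `collection.query(query_texts=[...], n_results=k)`, which returns a dictionary whose values are lists of lists, one inner list per query text. The helper below is a small sketch for flattening the hits of a single query; `first_query_hits` is a hypothetical name, and the sample dictionary is hand-written to mirror that shape rather than real output from this notebook.

```python
from typing import Dict, List, Tuple


def first_query_hits(result: Dict[str, list]) -> List[Tuple[str, str, float]]:
    """Return (id, document, distance) triples for the first query in a Chroma-style result."""
    ids = result["ids"][0]
    documents = result["documents"][0]
    distances = result["distances"][0]
    return list(zip(ids, documents, distances))


# Hand-written example in the shape returned by collection.query (not real output).
sample_result = {
    "ids": [["news.txt-1", "news.txt-3"]],
    "documents": [["Grimes has apologised ...", "Posting on X, the singer said ..."]],
    "distances": [[0.21, 0.48]],
}

hits = first_query_hits(sample_result)
for chunk_id, document, distance in hits:
    print(f"{distance:.2f}  {chunk_id}  {document}")
```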


# Retrieval Precision
The ARAGOG paper used Tonic Validate for this metric. I have taken Tonic's prompt so that our implementation is identical (except that we can use models beyond GPT-3.5).
Ref: https://github.com/TonicAI/tonic_validate/blob/main/tonic_validate/utils/llm_calls.py

In [10]:
def get_retrieval_precision_prompt(question, context):
    main_message = ("Considering the following question and context, determine whether the context "
                    "is relevant for answering the question. If the context is relevant for "
                    "answering the question, respond with true. If the context is not relevant for "
                    "answering the question, respond with false. Respond with either true or false "
                    "and no additional text.")

    main_message += f"\nQUESTION: {question}\n"
    main_message += f"CONTEXT: {context}\n"

    return main_message

In [11]:
# Testing the function get_retrieval_precision_prompt
question = "What is the capital of France?"
context = "France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower."

print(get_retrieval_precision_prompt(question, context))


Considering the following question and context, determine whether the context is relevant for answering the question. If the context is relevant for answering the question, respond with true. If the context is not relevant for answering the question, respond with false. Respond with either true or false and no additional text.
QUESTION: What is the capital of France?
CONTEXT: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower.



In [None]:
import os
from openai import OpenAI

OPENAI_API_KEY = os.getenv('OPENAI_CHROMA_API_KEY')

client = OpenAI(api_key=OPENAI_API_KEY)

completion = client.chat.completions.create(
  model="gpt-3.5-turbo",
  messages=[
    {"role": "system", "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."},
    {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
  ]
)

print(completion.choices[0].message)

In [6]:
import os
import anthropic

ANTHROPIC_API_KEY = os.getenv('ANTHROPIC_CHROMA_API_KEY')

client = anthropic.Anthropic(
    api_key=ANTHROPIC_API_KEY,
)

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1000,
    temperature=0.0,
    system="Respond only in Yoda-speak.",
    messages=[
        {"role": "user", "content": "How are you today?"}
    ]
)

print(message.content)

[TextBlock(text='*clears throat and speaks in a croaky voice* Hmm, well I am today, young Padawan. The Force, strong in me it flows. Yes, hmmm.', type='text')]


In [18]:
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1000,
    temperature=0.0,
    system=get_retrieval_precision_prompt(question, context),
    messages=[
        {"role": "user", "content": "Is this CONTEXT relevant?"}
    ]
)

print(message.content)

[TextBlock(text='true', type='text')]
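
The judge answers with a bare `true` or `false` per context, so retrieval precision for one question is the fraction of retrieved contexts judged relevant. The sketch below shows one way to turn the raw replies into that score; `parse_relevance` and `retrieval_precision` are hypothetical helper names, and raising on any reply other than `true`/`false` is a design assumption rather than something taken from Tonic Validate.

```python
from typing import Iterable


def parse_relevance(reply: str) -> bool:
    """Map the judge's reply ('true' / 'false', any casing or padding) to a bool."""
    cleaned = reply.strip().strip(".").lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    raise ValueError(f"Unexpected judge reply: {reply!r}")


def retrieval_precision(replies: Iterable[str]) -> float:
    """Fraction of retrieved contexts the judge marked as relevant."""
    verdicts = [parse_relevance(r) for r in replies]
    if not verdicts:
        return 0.0  # assumption: no retrieved context counts as zero precision
    return sum(verdicts) / len(verdicts)


# With the Anthropic client above, one reply would be read as message.content[0].text.
score = retrieval_precision(["true", "False", " true\n", "false"])
print(score)
```

Failing loudly on malformed replies keeps a chatty model from silently skewing the metric; a more lenient parser is an equally valid choice.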


In [21]:
import json

# Open the json file and read it
with open('../eval_questions/eval_data.json', 'r') as file:
    data = json.load(file)

# Print the data to verify it's been read correctly
print(len(data['questions']))


107
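
With the questions loaded, the benchmark reduces to a loop: retrieve contexts for each question, ask the judge about each context, and average the per-question precision. The sketch below keeps the retriever and the judge as injected callables so it runs without API keys; `evaluate_retrieval` and both stub functions are hypothetical, and it assumes each entry of `data['questions']` is a plain question string, which the notebook does not confirm.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple


def evaluate_retrieval(
    questions: List[str],
    retrieve: Callable[[str], List[str]],
    judge: Callable[[str, str], bool],
) -> Tuple[Dict[str, float], float]:
    """Return per-question retrieval precision and the mean over all questions."""
    per_question: Dict[str, float] = {}
    for question in questions:
        contexts = retrieve(question)
        verdicts = [judge(question, context) for context in contexts]
        per_question[question] = sum(verdicts) / len(verdicts) if verdicts else 0.0
    overall = mean(list(per_question.values())) if per_question else 0.0
    return per_question, overall


def stub_retrieve(question: str) -> List[str]:
    # Stand-in for a Chroma query: always returns the same two chunks
    return ["Paris, its capital, is famed for its fashion houses.",
            "Grimes has apologised for technical difficulties."]


def stub_judge(question: str, context: str) -> bool:
    # Keyword stand-in for the LLM judge, used only to make the sketch runnable
    return "capital" in question.lower() and "capital" in context.lower()


per_question, overall = evaluate_retrieval(
    ["What is the capital of France?", "Who headlined the festival?"],
    stub_retrieve,
    stub_judge,
)
print(per_question, overall)
```

In the real run, `retrieve` would wrap the Chroma collection and `judge` would send `get_retrieval_precision_prompt(question, context)` to the chosen model.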
