# Context Length Experiment

This notebook tests how well ultra-long-context LLMs can perform reasoning tasks across their full context length.

## Observation

Needle-in-a-haystack (NIH) tests measure an LLM's ability to find factoids that have been sprinkled randomly into the context. For example, a test might insert a sentence like "Pineapple is the best pizza topping" somewhere in the collected Paul Graham essays, then ask the LLM what the best pizza topping is. This is great for measuring how well the LLM can recall facts from its context, but it doesn't measure how well the LLM can reason over 1M tokens of context.
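For illustration, here is a minimal sketch of how a needle might be planted in a haystack. The function name `insert_needle`, the `depth` parameter, and the choice to snap to a paragraph break are all assumptions made for this example. They are not taken from any particular NIH benchmark.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert `needle` into `haystack` at a relative depth in [0.0, 1.0].

    The needle is placed at the next paragraph break at or after the target
    offset so that no sentence is split (an illustrative choice).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    target = int(len(haystack) * depth)
    # Find the next paragraph break at or after the target offset.
    break_at = haystack.find("\n\n", target)
    if break_at == -1:
        # No later paragraph break exists, so append the needle at the end.
        return haystack + "\n\n" + needle
    return haystack[:break_at] + "\n\n" + needle + haystack[break_at:]


haystack = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
needle = "Pineapple is the best pizza topping."
print(insert_needle(haystack, needle, depth=0.5))
```

A recall test would then ask the model for the best pizza topping and check whether the needle text comes back.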

This experiment will test that ability. We will fill the context window up gradually and measure how well the LLM can answer questions that require it to understand the entire context. 
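Filling the context gradually can be sketched as building ever-longer prefixes of the document list. The helper names below are hypothetical, and the four-characters-per-token figure is only a rough assumption for sizing, not a real tokenizer.

```python
def cumulative_contexts(documents: list[str], step: int) -> list[str]:
    """Return contexts built from the first `step`, `2 * step`, ... documents."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return ["\n\n".join(documents[:end]) for end in range(step, len(documents) + 1, step)]


def rough_token_count(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return len(text) // 4


toy_documents = [f"Essay {i}: " + "word " * 50 for i in range(6)]
for context in cumulative_contexts(toy_documents, step=2):
    # Each context contains the previous one plus `step` more documents.
    print(rough_token_count(context))
```

Each context is a strict extension of the one before it, so any drop in accuracy can be attributed to the added length rather than to different content.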

## Hypothesis

My guess is that the LLM will perform these tasks with high accuracy at short context lengths, and that accuracy will drop off after a certain point. I also expect that beyond that threshold the LLM will still be close to accurate every time, but no longer 100% accurate.


## Experiments

We evaluate accuracy by choosing tasks that we can validate programmatically. That means we can't just ask for an analysis, because we won't be able to discretely validate the result. We also want to avoid adding confounding variables, so we won't include tests that gauge, for example, the LLM's ability to count or to call functions. With that in mind, I thought of a few experiments:

1. Parse Titles - We ask the LLM for a list of the titles of all of the essays, in order. We can write some regex scripts to parse that information ourselves and validate precision, recall, and order correctness.
2. Parse Quotes - We ask the LLM for a list of all of the quotes that Paul Graham includes in his essays. We can then write regex to parse out everything wrapped in quotes (or `blockquotes`) and validate precision and recall.
3. Ordered Instructions - We write a set of step-by-step instructions, break those steps up, and sprinkle them randomly through the essays while keeping them in order. We then ask the LLM to assemble the instructions in the correct order and validate the accuracy.
4. Unordered Instructions - Same as the previous experiment, but with the order of the steps mixed.
5. Parse Links - We ask the LLM for all of the href links in the essays and validate its output against the links we parse out ourselves.
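The two instruction experiments could be set up roughly as follows. This is only a sketch: `sprinkle_steps`, `order_accuracy`, the toy essays, and the toy steps are all hypothetical names and data made up for illustration. Position-wise matching is just one possible way to score order.

```python
import random


def sprinkle_steps(
    essays: list[str], steps: list[str], shuffle: bool = False, seed: int = 0
) -> tuple[list[str], list[str]]:
    """Insert each step after a randomly chosen essay.

    Returns the combined documents and the steps in the order they were placed.
    With shuffle=False the steps keep their original order (experiment 3);
    with shuffle=True they are mixed first (experiment 4).
    """
    if len(steps) > len(essays):
        raise ValueError("need at least one essay per step")
    rng = random.Random(seed)
    placed = list(steps)
    if shuffle:
        rng.shuffle(placed)
    # Pick distinct essay positions and sort them so steps land in `placed` order.
    positions = sorted(rng.sample(range(len(essays)), k=len(placed)))
    step_at = dict(zip(positions, placed))
    documents = []
    for i, essay in enumerate(essays):
        documents.append(essay)
        if i in step_at:
            documents.append(step_at[i])
    return documents, placed


def order_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of expected positions where the predicted step matches."""
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


toy_essays = [f"Essay {i} text." for i in range(5)]
toy_steps = ["Step 1: boil water.", "Step 2: add pasta.", "Step 3: drain."]
documents, placed = sprinkle_steps(toy_essays, toy_steps)
context = "\n\n".join(documents)
```

In both variants the expected answer is the original `toy_steps` order. The shuffled variant only changes where each step appears in the context.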

### Utils

In [None]:
import re


def split_essays():
    """Split the Paul Graham essays"""
    with open("./paul_graham_essay.txt", "r") as file:
        essay_text = file.read()

    essays = []
    lines = []
    # Regex to match titles formatted as "Month Year"
    title_pattern = re.compile(r'^(January|February|March|April|May|June|July|August|September|October|November|December) \d{4}$')

    for line in essay_text.split('\n'):
        if title_pattern.match(line.strip()):
            # If we find a title and have collected lines for an essay, save the essay
            if lines:
                essays.append("\n".join(lines).strip())
                lines = []
        lines.append(line)

    # Add the last essay collected, if any
    if lines:
        essays.append("\n".join(lines).strip())

    return essays

essays = split_essays()

In [None]:
print(essays[-1])
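To see how the "Month Year" pattern drives the split, here is a small self-contained sketch on a toy string. `split_on_titles` and the sample text are made up for this illustration. It mirrors the logic above but takes the text as an argument instead of reading a file.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)
# A title line consists solely of a month name and a four-digit year.
TITLE_PATTERN = re.compile(rf"^({MONTHS}) \d{{4}}$")


def split_on_titles(text: str) -> list[str]:
    """Start a new chunk at every line that is exactly 'Month Year'."""
    chunks, lines = [], []
    for line in text.split("\n"):
        if TITLE_PATTERN.match(line.strip()) and lines:
            chunks.append("\n".join(lines).strip())
            lines = []
        lines.append(line)
    if lines:
        chunks.append("\n".join(lines).strip())
    return chunks


sample = (
    "March 2008\n\nFirst essay body.\n\n"
    "May 2021\n\nSecond essay body, written in May 2021."
)
chunks = split_on_titles(sample)
print(len(chunks))
```

Because the pattern is anchored at both ends, a date mentioned mid-sentence does not start a new essay. A body line that consists only of a month and year would, so the split is worth spot-checking on the real file.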

### Parse Titles Experiment

In [None]:
%pip install -qU langsmith

#### Dataset

We need to create a dataset whose examples have a list of essay indexes as the input and the expected titles as the output.

In [None]:
from langsmith.schemas import Example

def parse_titles(essays: list[str]):
    titles = []
    for essay in essays:
        title = essay.split("\n")[0]
        titles.append(title)
    return titles


def create_titles_dataset(essays=essays, chunks=8):
    examples = []
    for index in range(0, len(essays), chunks):
        # All essay indexes before the current index (the current index itself is excluded)
        indexes = list(range(index))
        essays_to_parse = [essays[i] for i in indexes]

        # index == 0 yields no essays, so skip it
        if not indexes:
            continue

        examples.append({"inputs": {"indexes": indexes}, "outputs": {"titles": parse_titles(essays=essays_to_parse)}})

    return examples

In [None]:
examples = create_titles_dataset()
len(examples)


In [None]:
examples[0]

Now let's upload them to a dataset in LangSmith.

In [None]:
from dotenv import load_dotenv
from langsmith import Client

load_dotenv()

def upload_title_dataset():
    client = Client()
    dataset_name = "Context Experiment - Titles"

    # Storing inputs in a dataset lets us
    # run chains and LLMs over a shared set of examples.
    # The dataset is assumed to already exist; uncomment this block to create it on the first run.
    # dataset = client.create_dataset(
    #     dataset_name=dataset_name,
    #     description="Experiment to test the ability of the LLM to recall titles from essays.",
    # )

    # Store each example's position as metadata so it can be filtered on later
    for i, example in enumerate(examples):
        client.create_example(
            inputs=example["inputs"],
            outputs=example["outputs"],
            dataset_name=dataset_name,
            metadata={"index": i}
        )

    # return dataset

upload_title_dataset()

Okay, now let's write the predict function.

In [None]:
%pip install -qU langchain-openai langchain-google-vertexai

In [None]:
from langchain_openai import ChatOpenAI
from langchain.pydantic_v1 import BaseModel, Field
from langchain.schema.messages import SystemMessage, HumanMessage
from typing import List


class TitleSchema(BaseModel):
    """A list of ALL titles, including duplicates, in the order that they appear in the response."""

    titles: List[str] = Field(
        description="A list of ALL titles, including duplicates, in the order that they appear in the response."
    )


def llm_parse_titles(text: str):
    """Use an LLM to parse the titles from the LLM output to an array"""
    llm = ChatOpenAI(model="gpt-4o", temperature=0).with_structured_output(TitleSchema)
    system_prompt = """The user was tasked with analyzing and writing a list of titles that they found in some essays. \
Your job is to parse the user's response and return a list of titles that they mentioned. Be very exact! The integrity of your \
response is very important. If you are off by a single character, you will be penalized."""
    response: TitleSchema = llm.invoke([SystemMessage(content=system_prompt), HumanMessage(content=text)])
    return response
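As a cheap cross-check on the LLM-based parser, a plain regex can pull the items out of a numbered list. This is a swapped-in deterministic technique, not what the notebook uses for scoring. The helper `parse_numbered_list` and the sample response are hypothetical. The sketch assumes the model really does answer with lines of the form `1. title` or `1) title`.

```python
import re

# Matches lines such as "1. February 1993" or "2) March 2008".
NUMBERED_ITEM = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")


def parse_numbered_list(text: str) -> list[str]:
    """Extract the item text from every line that looks like a numbered entry."""
    items = []
    for line in text.splitlines():
        match = NUMBERED_ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items


sample_response = "Here are the titles:\n1. February 1993\n2. March 2008\n3. March 2008\n"
print(parse_numbered_list(sample_response))
# → ['February 1993', 'March 2008', 'March 2008']
```

Comparing this output with the LLM parser's output on the same response is one way to spot cases where the parsing step, rather than the model under test, introduced an error.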



Now let's write the evaluations.

In [None]:
from langsmith.schemas import Example, Run


def title_precision(root_run: Run, example: Example) -> dict:
    """The LLM's ability to only recall real titles from the expected list"""
    output_titles: list[str] = (root_run.outputs or {}).get("titles", [])
    expected_titles: list[str] = example.outputs["titles"]

    if not output_titles:
        return {"score": 0, "key": "precision", "comment": "No output titles provided"}

    # Match against a copy so a repeated output title only counts as often as it is expected
    remaining_titles = expected_titles.copy()
    score = 0
    false_positives = []
    for llm_title in output_titles:
        if llm_title in remaining_titles:
            score += 1
            remaining_titles.remove(llm_title)
        else:
            false_positives.append(llm_title)

    final_score = score / len(output_titles)
    comment = f"Titles not included in the example: {', '.join(false_positives)}"
    return {"score": final_score, "key": "precision", "comment": comment}


def title_recall(root_run: Run, example: Example) -> dict:
    """The LLM's ability to recall all real titles from the expected list"""
    output_titles: list[str] = root_run.outputs.get("titles", [])
    expected_titles: list[str] = example.outputs["titles"]

    total_expected_titles = len(expected_titles)
    output_titles_copy = output_titles.copy()
    score = 0
    missed_titles = []

    for expected_title in expected_titles:
        if expected_title in output_titles_copy:
            score += 1
            output_titles_copy.remove(expected_title)  # Remove the title to account for duplicates
        else:
            missed_titles.append(expected_title)

    final_score = score / total_expected_titles if total_expected_titles else 0.0
    comment = f"Missed titles from the expected list: {', '.join(missed_titles)}"
    return {"score": final_score, "key": "recall", "comment": comment}


def title_order(root_run: Run, example: Example) -> dict:
    """The LLM's ability to order the titles correctly"""
    output_titles: list[str] = (root_run.outputs or {}).get("titles", [])
    expected_titles: list[str] = example.outputs["titles"]
    score = 0
    out_of_order_title = None

    for index, title in enumerate(expected_titles):
        # Stop at the first expected title that is missing or mismatched
        if index >= len(output_titles) or title.lower() != output_titles[index].lower():
            out_of_order_title = title
            break
        score += 1

    # Share of expected titles matched in order before the first mismatch.
    # Dividing by the expected count keeps a short but correct output from scoring 1.0.
    final_score = score / len(expected_titles) if expected_titles else 0.0
    comment = (
        f"First title out of order: {out_of_order_title}"
        if out_of_order_title
        else "All titles are in the correct order"
    )
    return {"score": final_score, "key": "order", "comment": comment}
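Because titles are just "Month Year", the same title can legitimately appear more than once, so it helps to think of precision and recall over multisets. The sketch below is one way to compute duplicate-aware scores on toy data. `multiset_precision_recall` and the sample lists are hypothetical and independent of the LangSmith evaluators.

```python
from collections import Counter


def multiset_precision_recall(predicted: list[str], expected: list[str]) -> tuple[float, float]:
    """Precision and recall where each title counts at most as often as it is expected."""
    # Counter intersection keeps the minimum count per title.
    overlap = sum((Counter(predicted) & Counter(expected)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(expected) if expected else 0.0
    return precision, recall


expected = ["March 2008", "March 2008", "May 2021", "July 2010"]
predicted = ["March 2008", "May 2021", "June 1999"]
precision, recall = multiset_precision_recall(predicted, expected)
print(precision, recall)
```

Here two of the three predicted titles are real (precision 2/3), and two of the four expected titles were found (recall 0.5), since the second "March 2008" and "July 2010" were missed.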

Okay, this seems good. Let's put it all together into an eval.

In [None]:
%pip install -qU google-generativeai langchain-google-genai

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langsmith import evaluate, Client

client = Client()
dataset_id = "e4c55781-52d9-47e7-a1bb-fcdac719d838"

dataset = client.read_dataset(dataset_id=dataset_id)


class GeminiPredictor:
    def __init__(
        self,
        llm=ChatGoogleGenerativeAI(
            model="gemini-1.5-flash", temperature=0, google_api_key="**"
        ),
        prefix="Flash",
    ):
        self.llm = llm
        self.prefix = prefix

    def predict_titles(self, inputs: dict):
        print("PREDICTING...")
        indexes: list[int] = inputs["indexes"]

        essays_for_context = [essays[i] for i in indexes]
        essay_str = "\n\n".join(essays_for_context)

        system_prompt = f"""You are a very thorough and detailed analyzer of Paul Graham essays. Your task is to analyze the \
provided essays and ONLY return a single list of titles in the order that they appear. You can tell which lines are titles because they \
contain ONLY a month and year followed by 2 new lines. For example: 'February 1993\n\n'.

Return a numbered list of all of the titles you have found. \
You will be graded on precision and recall so be sure to include ALL of the titles and in the correct order. Some duplicates are expected, \
just be sure to be as accurate as possible!


<PAUL GRAHAM ESSAYS>
{essay_str}
</PAUL GRAHAM ESSAYS>"""
        response = self.llm.invoke(system_prompt)
        titles = llm_parse_titles(response.content)
        return {"response": response.content, "titles": titles.titles}

    def evaluate(self, splits: list[str] = ["tiny"], max_concurrency: int = 1, repetitions=1):
        examples = client.list_examples(dataset_name="Context Experiment - Titles", splits=splits)
        for example in examples:
            essay_count = len(example.inputs["indexes"])
            evaluate(
                self.predict_titles,
                data=client.list_examples(
                    dataset_name="Context Experiment - Titles", splits=splits, metadata={"index": example.metadata["index"]}
                ),
                evaluators=[title_precision, title_recall, title_order],
                experiment_prefix=f"{self.prefix}-{essay_count}_Essays",
                max_concurrency=max_concurrency,
                num_repetitions=repetitions,
            )


# response = GeminiPredictor().predict_titles({"indexes": [0, 1]})
# response

In [None]:
predictor = GeminiPredictor()
predictor.evaluate(splits=["small5"])


In [None]:
total_examples = client.list_examples(dataset_name="Context Experiment - Titles")
for example in total_examples:
    print(example.metadata["index"])

In [None]:
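# Run a single prediction over the first two essays so there is a response to inspect
response = predictor.predict_titles({"indexes": [0, 1]})
print(response["titles"][0])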

In [None]:
from langsmith import evaluate, Client
from langchain_google_genai import ChatGoogleGenerativeAI

client = Client()

dataset_id = "e4c55781-52d9-47e7-a1bb-fcdac719d838"

dataset = client.read_dataset(dataset_id=dataset_id)

examples = client.list_examples(dataset_name=dataset.name, splits=["single"])

evaluate(
    predictor.predict_titles,
    data=examples,
    evaluators=[title_precision, title_recall, title_order],
    experiment_prefix="flash",
    max_concurrency=2,
)