# Replicate Results

This notebook demonstrates how to replicate our results for:
- **Generating representative benchmarks** with public datasets using the English subset of the [multilingual Wikipedia dataset](https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-queries)

- **Aligning our LLM Judge** for document quality filtering with [Weights and Biases](https://wandb.ai/site/) data

The rest of our results can be replicated by swapping in a different dataset and embedding model.

## 1. Setup

### 1.1 Install & Import

Install the necessary packages.

In [None]:
!pip install -r requirements.txt

Import modules.

In [11]:
%load_ext autoreload
%autoreload 2

import json
import chromadb
import pandas as pd
import numpy as np
import datasets
from openai import OpenAI as OpenAIClient
from anthropic import Anthropic as AnthropicClient
from functions.llm import *
from functions.embed import *
from functions.chroma import *
from functions.evaluate import *
from functions.utils import *
from functions.visualize import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### 1.2 Load API Keys

To use Chroma Cloud, sign up for a Chroma Cloud account [here](https://www.trychroma.com/) and create a new database. To use local Chroma instead, skip the Chroma Cloud values and set only `OPENAI_API_KEY` and `CLAUDE_API_KEY`.

In [None]:
# Chroma Cloud
CHROMA_TENANT = "YOUR CHROMA TENANT ID"
X_CHROMA_TOKEN = "YOUR CHROMA API KEY"
DATABASE_NAME = "YOUR CHROMA DATABASE NAME"

# Embedding Model
OPENAI_API_KEY = "YOUR OPENAI API KEY"

# LLM
CLAUDE_API_KEY = "YOUR CLAUDE API KEY"

### 1.3 Set Clients

Initialize the clients.

In [None]:
chroma_client = chromadb.HttpClient(
  ssl=True,
  host='api.trychroma.com',
  tenant=CHROMA_TENANT,
  database=DATABASE_NAME,
  headers={
    'x-chroma-token': X_CHROMA_TOKEN
  }
)

# If you want to use the local Chroma instead, uncomment the following line:
# chroma_client = chromadb.Client()

openai_client = OpenAIClient(api_key=OPENAI_API_KEY)
anthropic_client = AnthropicClient(api_key=CLAUDE_API_KEY)

### 1.4 Load Data

We'll use the English subset of the [multilingual Wikipedia dataset](https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-queries).

We use the `test` split for this demonstration, which contains:
- 1500 queries
- 1500 query-document relevance judgments (qrels)
- 13500 corpus documents

First, we'll load the queries, documents, and query-document relevance judgments.

In [None]:
wiki_queries = datasets.load_dataset("ellamind/wikipedia-2023-11-retrieval-multilingual-queries", "en")["test"].to_pandas()
wiki_corpus = datasets.load_dataset("ellamind/wikipedia-2023-11-retrieval-multilingual-corpus", "en")["test"].to_pandas()
wiki_qrels = datasets.load_dataset("ellamind/wikipedia-2023-11-retrieval-multilingual-qrels", "en")["test"].to_pandas()

For this specific dataset, the query-document relevance judgments include distractors, indicated by scores of 0.5, and target matches, indicated by scores of 1.0.

We'll filter the query-document relevance judgments to only include target matches. Then, we'll combine the queries, documents, and query-document relevance judgments into a single dataframe for convenience.

In [None]:
wiki_qrels = wiki_qrels[wiki_qrels["score"] == 1.0]

wiki_qrels = combined_datasets_dataframes(wiki_queries, wiki_corpus, wiki_qrels)

## 2. Embed Corpus & Store in Chroma

### 2.1 Embed

In [None]:
wiki_corpus_ids = wiki_corpus["_id"].tolist()
wiki_corpus_texts = wiki_corpus["text"].tolist()

wiki_corpus_embeddings = openai_embed_in_batches(
    openai_client=openai_client, 
    texts=wiki_corpus_texts, 
    model="text-embedding-3-small"
)

wiki_corpus_lookup = {
    id: {
        "text": text,
        "embedding": embedding
    } for id, text, embedding in zip(wiki_corpus_ids, wiki_corpus_texts, wiki_corpus_embeddings)
}
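The batching pattern behind this step can be sketched without any API calls. This is a minimal illustration, not the notebook's `openai_embed_in_batches`: the function name `embed_in_batches`, the `embed_batch` callable, and the default `batch_size` are assumptions made for the example.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        embeddings.extend(embed_batch(batch))
    return embeddings


# With the OpenAI client, embed_batch could look like this (assumption):
#   lambda batch: [
#       item.embedding
#       for item in openai_client.embeddings.create(
#           model="text-embedding-3-small", input=batch
#       ).data
#   ]


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Offline stand-in so the sketch runs without an API key."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

Passing the embedding call in as a function keeps the batching logic testable offline and makes it easy to swap embedding models.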

### 2.2 Create & Add to Chroma Collection

In [None]:
wiki_collection = chroma_client.get_or_create_collection(
    name="wiki-text-embedding-3-small",
    metadata={"hnsw:space": "cosine"}
)

collection_add_in_batches(
    collection=wiki_collection, 
    ids=wiki_corpus_ids, 
    texts=wiki_corpus_texts, 
    embeddings=wiki_corpus_embeddings
)
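Large inserts are usually split into slices so that no single `add` call grows too big. The sketch below shows one way to do that; `add_in_batches`, `RecordingCollection`, and the default `batch_size` of 1000 are hypothetical and are not the notebook's `collection_add_in_batches`. It relies only on the `ids`, `documents`, and `embeddings` keyword arguments of a Chroma collection's `add` method.

```python
from typing import List, Sequence


def add_in_batches(
    collection,
    ids: Sequence[str],
    texts: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    batch_size: int = 1000,
) -> int:
    """Add records to a Chroma-style collection in slices; returns the count added."""
    added = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        batch_ids = list(ids[start:end])
        collection.add(
            ids=batch_ids,
            documents=list(texts[start:end]),
            embeddings=[list(vec) for vec in embeddings[start:end]],
        )
        added += len(batch_ids)
    return added


class RecordingCollection:
    """Offline stand-in that mimics only the `add` keyword arguments."""

    def __init__(self) -> None:
        self.batch_sizes: List[int] = []

    def add(self, ids, documents, embeddings) -> None:
        self.batch_sizes.append(len(ids))


demo = RecordingCollection()
total = add_in_batches(
    demo, ["a", "b", "c"], ["x", "y", "z"], [[0.1], [0.2], [0.3]], batch_size=2
)
```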

## 3. Naive Query Generation

We will demonstrate that LLMs have memorized a substantial portion of public benchmarks, which limits their ability to reliably generate unseen queries.

### 3.1 Generate Queries

Generate 1500 queries, only including the document as context.

We batch the LLM calls for efficiency:
- IDs are converted to match Anthropic's batch ID format
- Batch processing status can be viewed through [Anthropic's Console](https://console.anthropic.com/workspaces/default/batches)

In [None]:
corpus_ids_qrels = wiki_qrels["corpus-id"].tolist()
corpus_texts_qrels = wiki_qrels["corpus-text"].tolist()

ids_for_batching = [clean_id_for_batching(id) for id in corpus_ids_qrels]

naive_queries_batch_id = create_naive_query_batch(
    client=anthropic_client,
    model="claude-3-5-sonnet-20241022",
    documents=corpus_texts_qrels,
    ids=ids_for_batching
)
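The ID conversion can be pictured as a reversible escape. This minimal sketch assumes batch request IDs may only contain letters, digits, `_`, and `-`, which is our understanding of Anthropic's `custom_id` format and is not verified here. `clean_id` and `revert_id` are hypothetical stand-ins for `clean_id_for_batching` and `revert_id_from_batching`, whose real implementations are not shown in this notebook. The sketch does not enforce any length limit.

```python
import re


def clean_id(raw: str) -> str:
    """Escape every byte that is not an ASCII letter or digit as `_` plus two hex digits."""
    out = []
    for byte in raw.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append(f"_{byte:02x}")
    return "".join(out)


def revert_id(cleaned: str) -> str:
    """Invert clean_id by decoding each `_xx` escape back to its byte."""
    raw = bytearray()
    i = 0
    while i < len(cleaned):
        if cleaned[i] == "_":
            raw.append(int(cleaned[i + 1:i + 3], 16))
            i += 3
        else:
            raw.append(ord(cleaned[i]))
            i += 1
    return raw.decode("utf-8")


example = "20231101.en_123_4"
cleaned = clean_id(example)
is_safe = re.fullmatch(r"[A-Za-z0-9_-]+", cleaned) is not None
```

Because the escape is reversible, the original corpus IDs can be restored after the batch results come back, which is what makes the later merge on `corpus-id` possible.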

Retrieve the generated queries once the batch is complete and merge with `wiki_qrels`.

In [None]:
naive_queries_df = retrieve_batch(
    client=anthropic_client,
    batch_id=naive_queries_batch_id
)

naive_queries_df["id"] = naive_queries_df["id"].apply(revert_id_from_batching)

wiki_qrels = wiki_qrels.merge(naive_queries_df, left_on="corpus-id", right_on="id", how="left")
wiki_qrels.rename(columns={"query": "naively-generated-query"}, inplace=True)
wiki_qrels.drop(columns=["id"], inplace=True)
wiki_qrels.head()

### 3.2 Compare with Ground Truth Queries

Embed the ground truth queries and generated queries.

In [None]:
ground_truth_queries = wiki_qrels["query-text"].tolist()
query_ids = wiki_qrels["query-id"].tolist()
naive_queries = wiki_qrels["naively-generated-query"].tolist()

ground_truth_query_embeddings = openai_embed_in_batches(
    openai_client=openai_client, 
    texts=ground_truth_queries, 
    model="text-embedding-3-small"
)

naive_query_embeddings = openai_embed_in_batches(
    openai_client=openai_client, 
    texts=naive_queries, 
    model="text-embedding-3-small"
)

ground_truth_query_lookup = {
    id: {
        "text": text,
        "embedding": embedding
    } for id, text, embedding in zip(query_ids, ground_truth_queries, ground_truth_query_embeddings)
}

naive_query_lookup = {
    id: {
        "text": text,
        "embedding": embedding
    } for id, text, embedding in zip(query_ids, naive_queries, naive_query_embeddings)
}

Here, we plot the cosine similarity scores between each ground truth query and its corresponding generated query:

In [None]:
naive_query_comparison = score_query_query(
    qrels=wiki_qrels, 
    query_embeddings_1=ground_truth_query_lookup, 
    query_embeddings_2=naive_query_lookup,
    column_name="naive-query-score"
)

plot_single_distribution(
    df=naive_query_comparison, 
    column="naive-query-score", 
    title="Ground Truth vs Naively Generated Queries", 
    xlabel="Cosine Similarity", 
    ylabel="Normalized Frequency"
)
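For reference, the per-pair score is a cosine similarity between the two embeddings stored under the same query ID. The sketch below is a plain-Python illustration over lookups shaped like `ground_truth_query_lookup`; `cosine_similarity` and `score_pairs` are hypothetical names, and the notebook's `score_query_query` may be implemented differently.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_pairs(lookup_1: Dict[str, dict], lookup_2: Dict[str, dict]) -> Dict[str, float]:
    """Score each query ID present in both lookups."""
    return {
        query_id: cosine_similarity(
            lookup_1[query_id]["embedding"], lookup_2[query_id]["embedding"]
        )
        for query_id in lookup_1
        if query_id in lookup_2
    }


toy_truth = {
    "q1": {"text": "a", "embedding": [3.0, 4.0]},
    "q2": {"text": "b", "embedding": [1.0, 0.0]},
}
toy_generated = {
    "q1": {"text": "a", "embedding": [3.0, 4.0]},
    "q2": {"text": "c", "embedding": [0.0, 1.0]},
}
scores = score_pairs(toy_truth, toy_generated)
```

A score near 1.0 means the generated query is close to the ground truth query in embedding space, so a cluster of scores at the top of the range suggests near-verbatim reproduction.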

Sorting by similarity score shows that some generated queries are identical to their ground truth queries:

In [None]:
naive_query_comparison.sort_values(by="naive-query-score", ascending=False, inplace=True)

for i, row in naive_query_comparison.head(10).iterrows():
    print(f"Score: {row['naive-query-score']:.4f}")
    print(f"Original Query: {row['query-text']}")
    print(f"Generated Query: {row['naively-generated-query']}")
    print("-" * 80)

## 4. Distinct Query Generation

Since models have memorized these public benchmarks, we will generate unseen queries by explicitly prompting the model to generate a distinct query.

Then, we will demonstrate that these newly generated distinct queries are also representative of the ground truth dataset.

### 4.1 Generate Queries

We generate 1500 queries, now including both the ground truth query and its corpus document as context.

In [None]:
distinct_batch_id = create_distinct_query_batch(
    client=anthropic_client,
    model="claude-3-5-sonnet-20241022",
    documents=corpus_texts_qrels,
    ids=ids_for_batching,
    queries=ground_truth_queries
)

In [None]:
distinct_queries_df = retrieve_batch(
    client=anthropic_client,
    batch_id=distinct_batch_id
)
distinct_queries_df["id"] = distinct_queries_df["id"].apply(revert_id_from_batching)

wiki_qrels = wiki_qrels.merge(distinct_queries_df, left_on="corpus-id", right_on="id", how="left")
wiki_qrels.rename(columns={"query": "distinct-generated-query"}, inplace=True)
wiki_qrels.drop(columns=["id"], inplace=True)

### 4.2 Compare with Ground Truth Queries

In [None]:
distinct_queries = wiki_qrels["distinct-generated-query"].tolist()

distinct_query_embeddings = openai_embed_in_batches(
    openai_client=openai_client, 
    texts=distinct_queries, 
    model="text-embedding-3-small"
)

distinct_query_lookup = {
    id: {
        "text": text,
        "embedding": embedding
    } for id, text, embedding in zip(query_ids, distinct_queries, distinct_query_embeddings)
}

We plot the cosine similarity scores between each ground truth query and its corresponding generated (distinct) query, and compare with our previous plot:

In [None]:
distinct_query_comparison = score_query_query(
    qrels=wiki_qrels, 
    query_embeddings_1=ground_truth_query_lookup, 
    query_embeddings_2=distinct_query_lookup,
    column_name="distinct-query-score"
)

plot_overlaid_distribution(
    df_1=naive_query_comparison, 
    df_2=distinct_query_comparison, 
    column_1="naive-query-score", 
    column_2="distinct-query-score", 
    title="Distinct vs Ground Truth Queries", 
    xlabel="Cosine Similarity", 
    ylabel="Normalized Frequency"
)

We can look at the most similar query-query scores and see that no identical queries have been generated:

In [None]:
distinct_query_comparison.sort_values(by="distinct-query-score", ascending=False, inplace=True)

for i, row in distinct_query_comparison.head(10).iterrows():
    print(f"Score: {row['distinct-query-score']:.4f}")
    print(f"Original Query: {row['query-text']}")
    print(f"Generated Query: {row['distinct-generated-query']}")
    print("-" * 80)

Compare retrieval metrics for the ground truth queries and the distinct generated queries:

In [None]:
wiki_metrics = evaluate_and_visualize(
    ground_truth_query_dict=ground_truth_query_lookup,
    generated_query_dict=distinct_query_lookup,
    corpus_embeddings_dict=wiki_corpus_lookup,
    qrels=wiki_qrels,
    collection=wiki_collection,
    dataset_name="Wikipedia (English)",
    model_name="text-embedding-3-small"
)
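The metrics reported by `evaluate_and_visualize` are not listed in this notebook. As an illustration of the kind of comparison involved, here is a sketch of Recall@k, a common retrieval metric, computed once per query set so the two values can be compared. The function `recall_at_k`, the toy vectors, and the data shapes are all assumptions for the example.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(
    query_embeddings: Dict[str, Sequence[float]],
    corpus_embeddings: Dict[str, Sequence[float]],
    qrels: List[Tuple[str, str]],
    k: int,
) -> float:
    """Fraction of (query_id, target_doc_id) pairs whose target ranks in the top k.

    Ranks by dot product, which matches cosine ranking when the corpus
    vectors are unit length (as they are in the toy data below).
    """
    hits = 0
    for query_id, target_id in qrels:
        query = query_embeddings[query_id]
        ranked = sorted(
            corpus_embeddings,
            key=lambda doc_id: sum(
                q * d for q, d in zip(query, corpus_embeddings[doc_id])
            ),
            reverse=True,
        )
        if target_id in ranked[:k]:
            hits += 1
    return hits / len(qrels)


toy_corpus = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
toy_queries = {"q1": [0.9, 0.1], "q2": [0.8, 0.2]}
toy_qrels = [("q1", "d1"), ("q2", "d2")]

recall_1 = recall_at_k(toy_queries, toy_corpus, toy_qrels, k=1)
recall_2 = recall_at_k(toy_queries, toy_corpus, toy_qrels, k=2)
```

If generated queries are representative, a metric like this should come out similar whether it is computed with the ground truth queries or with the generated ones.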

## 5. Align LLM Judge

We used labeled documents from [wandbot](https://github.com/wandb/wandbot), a technical support bot for Weights & Biases' AI developer tools.

These documents were manually labeled `true` or `false` based on whether they are suitable for generating relevant queries.

### 5.1 Load in Data

In [14]:
with open("data/wandb_human_labels.json", "r") as f:
    human_labeled_documents = json.load(f)

with open("data/wandb_docs.json", "r") as f:
    all_documents = json.load(f)

In [15]:
labeled_ids = list(human_labeled_documents.keys())
labeled_documents = [all_documents[id] for id in labeled_ids]

### 5.2 Set Baseline Criteria

In [29]:
relevance_v1 = "The document is relevant and contains information that users would search for in the context of a question-answering bot for Weights & Biases. It should address topics that are useful to machine learning practitioners."

completeness_v1 = "The document is complete, meaning it provides comprehensive information to answer queries rather than merely serving as an introduction."

clarity_v1 = "The document contains clear ideas and is comprehensible."

criteria_v1 = [relevance_v1, completeness_v1, clarity_v1]
criteria_labels_v1 = ["relevant", "complete", "clear"]

### 5.3 Get LLM Labels

In [None]:
batch_v1_id = create_document_filter_batch(
    client=anthropic_client,
    documents=labeled_documents,
    ids=labeled_ids,
    criteria=criteria_v1,
    criteria_labels=criteria_labels_v1
)

In [None]:
batch_v1 = retrieve_document_filter_batch(
    client=anthropic_client,
    batch_id=batch_v1_id
)

We take our LLM-labeled data and compare it with our manual labeling.

`criteria_threshold` indicates the number of criteria that must be met for a document to be considered "good quality".
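The thresholding logic can be sketched as follows. This assumes a document passes when at least `threshold` criteria are met, and that LLM judgements map each document ID to a dict of boolean criterion results. Both the data shapes and the helper names `meets_threshold` and `alignment_rate` are assumptions, not the internals of `llm_vs_human`.

```python
from typing import Dict, Sequence


def meets_threshold(
    judgement: Dict[str, bool], criteria_labels: Sequence[str], threshold: int
) -> bool:
    """True when at least `threshold` of the listed criteria are met."""
    met = sum(bool(judgement[label]) for label in criteria_labels)
    return met >= threshold


def alignment_rate(
    llm_judgements: Dict[str, Dict[str, bool]],
    human_labels: Dict[str, bool],
    criteria_labels: Sequence[str],
    threshold: int,
) -> float:
    """Share of documents where the thresholded LLM verdict equals the human label."""
    agree = sum(
        meets_threshold(llm_judgements[doc_id], criteria_labels, threshold)
        == bool(human_labels[doc_id])
        for doc_id in human_labels
    )
    return agree / len(human_labels)


toy_llm = {
    "a": {"relevant": True, "complete": True, "clear": False},
    "b": {"relevant": True, "complete": False, "clear": False},
}
toy_human = {"a": True, "b": True}
rate = alignment_rate(toy_llm, toy_human, ["relevant", "complete", "clear"], threshold=2)
```

In the toy data, document `a` meets two criteria and matches the human label, while `b` meets only one and does not, so half of the documents align.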

In [30]:
llm_vs_human(
    llm_judgements=batch_v1,
    human_judgements=human_labeled_documents,
    documents_mapping=all_documents,
    criteria_labels=criteria_labels_v1,
    criteria_threshold=2
)

{'relevant': 0.652, 'complete': 0.528, 'clear': 0.508}
Documents aligned with Human Judgement: 114, 45.6%


Number of documents meeting threshold: 163
Number of documents 100% aligned: 23
Number of documents 0% aligned: 14
Aligned: ## Challenges and Limitations of Q-learning
* Slow Convergence and High Computational Requirements - Q-learning can take significant time to converge, especially in complex environments. It may require substantial computational resources, making it less feasible for real-time applications.
* Curse of Dimensionality - The performance of Q-learning can deteriorate in high-dimensional state and action spaces, leading to increased computational complexity and reduced efficiency.
* Lack of Generalisation - Q-learning tends to focus on specific states and actions, potentially leading to difficulties in generalizing learned policies to new, unseen environments and increased susceptibility to overfitting.
* Exploration vs. Exploitation Trade-off - Striking the right

### 5.4 Iterate

Based on the baseline results, we iterate on our criteria.

In [22]:
relevance_v2 = """
    The document is relevant and something that users would search for considering the following context: 
    We are building a question-answering bot designed specifically for Weights & Biases, an AI developer for training, fine-tuning, and managing models.
    Any information that would be useful to a user working in machine learning is considered as relevant.
    """

completeness_v2 = "The document is complete, meaning that it contains useful information to answer queries and does not only serve as an introduction to the main content that users may be looking for."

intent_v2 = "The document would be relevant in the use case of a user working in machine learning, who may be seeking help or learn more about Weights & Biases or machine learning in general."

criteria_v2 = [relevance_v2, completeness_v2, intent_v2]
criteria_labels_v2 = ["relevant", "complete", "intent"]

In [None]:
batch_v2_id = create_document_filter_batch(
    client=anthropic_client,
    documents=labeled_documents,
    ids=labeled_ids,
    criteria=criteria_v2,
    criteria_labels=criteria_labels_v2
)

In [24]:
batch_v2 = retrieve_document_filter_batch(
    client=anthropic_client,
    batch_id=batch_v2_id
)

In [28]:
llm_vs_human(
    llm_judgements=batch_v2,
    human_judgements=human_labeled_documents,
    documents_mapping=all_documents,
    criteria_labels=criteria_labels_v2,
    criteria_threshold=2
)

{'relevant': 0.64, 'complete': 0.656, 'intent': 0.568}
Documents aligned with Human Judgement: 188, 75.2%


Number of documents meeting threshold: 159
Number of documents 100% aligned: 65
Number of documents 0% aligned: 8
Aligned: ## Data Preparation and Annotation
### Annotating a Summarization Dataset for Fine-Tuning

There is a given format required by ChatGPT in order to fine-tune the model on. This format includes 3 sections:  

* System: This is the prompt that you will pass to ChatGPT. In our case, the prompt would be “GPT is a great and to-the-point dialogue summarization tool.”
* User: This is the question asked to the model. In our case, it would be the text that we are required to summarize.
* Assistant: This is the answer that our model would return. In this case, it would be a brief summary of the text.


Aligned: ## Applications of Q-learning
* The agent can be the program controlling a robot. In this scenario, the agent observes the environment (the real world) through s

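Overall LLM vs Human alignment improves from 45.6% (114 documents) to 75.2% (188 documents). Among the criteria shared by both versions, the `complete` score rises from 0.528 to 0.656, while `relevant` stays roughly level (0.652 to 0.64).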