In [17]:
%load_ext autoreload
%autoreload 2

## The data

We will be using our Weights and Biases documentation for this demo.

In [273]:
import json
import wandb
import dotenv
import pathlib
import nest_asyncio
import pandas as pd
from pprint import pprint

dotenv.load_dotenv()
nest_asyncio.apply()

In [3]:
import weave

WANDB_PROJECT = "llamaindex-weave"  # project name, as shown in the output below

weave.init(WANDB_PROJECT)

Logged in as Weights & Biases user: ayut.
View Weave data at https://wandb.ai/ayut/llamaindex-weave/weave


<weave.weave_client.WeaveClient at 0x17c372050>

# Build a Simple RAG Pipeline

Start of cycle

- simple iteration over prompts
- evaluation by looking at the generated response

In [274]:
docs_dir = "data/wandb_docs"

docs_dir = pathlib.Path(docs_dir)
docs_files = sorted(docs_dir.rglob("*.md"))

print(f"Number of files: {len(docs_files)}\n")
print("First 5 files:\n{files}".format(files="\n".join(map(str, docs_files[:5]))))

Number of files: 380

First 5 files:
data/wandb_docs/guides/app/features/anon.md
data/wandb_docs/guides/app/features/custom-charts/intro.md
data/wandb_docs/guides/app/features/custom-charts/walkthrough.md
data/wandb_docs/guides/app/features/intro.md
data/wandb_docs/guides/app/features/notes.md


A Document is a generic container around any data source, for instance a PDF, an API output, or data retrieved from a database. Documents can be constructed manually or created automatically via LlamaIndex data loaders. By default, a Document stores text along with some other attributes.

In [14]:
from llama_index.core import Document

documents = []
for file in docs_files:
    content = file.read_text()
    documents.append(
        Document(
            text=content,
            metadata = {
                "source": str(file.relative_to(docs_dir)),
                "raw_tokens": len(content.split()),
            },
        )
    )

print(f"Number of documents: {len(documents)}\n")
pprint(documents[:2])

Number of documents: 380

 Document(id_='9a4e995d-bb36-4250-bbef-e4ddb9ed52a4', embedding=None, metadata={'source': 'guides/app/features/custom-charts/intro.md', 'raw_tokens': 1904}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='---\nslug: /guides/app/features/custom-charts\ndisplayed_sidebar: default\n---\n\nimport Tabs from \'@theme/Tabs\';\nimport TabItem from \'@theme/TabItem\';\n\n# Custom Charts\n\nUse **Custom Charts** to create charts that aren\'t possible right now in the default UI. Log arbitrary tables of data and visualize them exactly how you want. Control details of fonts, colors, and tooltips with the power of [Vega](https://vega.github.io/vega/).\n\n* **What\'s possible**: Read the[ launch announcement →](https://wandb.ai/wandb/posts/reports/Announcing-the-W-B-Machine-Learning-Visualization-IDE--VmlldzoyNjk3Nzg)\n* **Code**: Try a live example in a[ hosted notebook →](https://tiny.cc/custom-charts)\n* **Video**: Watch a quick [w

Node parsers are a simple abstraction that takes a list of documents and chunks them into Node objects, such that each node is a specific chunk of the parent document. When a document is broken into nodes, all of its attributes are inherited by the child nodes.
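To make the chunking idea concrete, here is a minimal stdlib-only sketch of splitting with overlap while copying the parent's metadata onto each chunk. It is not how `SentenceSplitter` is implemented (that splitter also respects sentence boundaries and counts tokenizer tokens); the `Chunk` class, `split_with_overlap` function, the whitespace tokenization, and the `guides/example.md` source are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a node: a piece of text plus inherited metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_with_overlap(text, metadata, chunk_size=8, chunk_overlap=2):
    """Split whitespace tokens into overlapping windows and copy parent metadata."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        # Each chunk gets its own copy of the parent's metadata.
        chunks.append(Chunk(" ".join(window), dict(metadata)))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


doc_text = " ".join(f"w{i}" for i in range(20))
chunks = split_with_overlap(doc_text, {"source": "guides/example.md"})
for chunk in chunks:
    print(chunk.metadata["source"], "|", chunk.text)
```

With 20 tokens, a window of 8, and an overlap of 2, consecutive chunks share their boundary tokens, which is the same role `CHUNK_OVERLAP` plays in the cell that follows.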

In [275]:
from llama_index.core.node_parser import SentenceSplitter

CHUNK_SIZE = 512
CHUNK_OVERLAP = 128

splitter = SentenceSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)

nodes = splitter.get_nodes_from_documents(documents, show_progress=True)

print(f"Number of nodes/chunks: {len(nodes)}\n")
pprint(nodes[:2])

Parsing nodes: 100%|██████████| 380/380 [00:00<00:00, 429.17it/s]

Number of nodes/chunks: 1440

[TextNode(id_='38ab098e-e0b5-4211-adfa-dc13d6a55673', embedding=None, metadata={'source': 'guides/app/features/anon.md', 'raw_tokens': 470}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='76e3ea77-8b0d-4085-bf13-88f98348294c', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'source': 'guides/app/features/anon.md', 'raw_tokens': 470}, hash='2c3d7c77bbae8091e7fbf90aee7787ddeae36603e3bae1ec5933373d0a620160'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='04efbeac-55cf-42e5-8b7f-f417edfcb5ed', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='be818713d44407b6cf58555b610298971b306f85d65a46669862ac4691fef5e9')}, text='---\ndescription: Log and visualize data without a W&B account\ndisplayed_sidebar: default\n---\n\n# Anonymous Mode\n\nAre you publishing code that you want anyone to be able to run easily? Use Anonymous Mode to let someone run your code, see a W&B 




In [276]:
from llama_index.embeddings.openai import OpenAIEmbedding

embedding = OpenAIEmbedding(
    model="text-embedding-3-small",
)

Vector stores accept a list of Node objects and build an index from them. By default, the index is kept entirely in memory.
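As a rough mental model, an in-memory vector index can be thought of as a list of embeddings that is scanned at query time. The sketch below assumes cosine similarity and a brute-force scan; `InMemoryVectorIndex`, `toy_index`, and the three-dimensional toy vectors are hypothetical and stand in for real embeddings. It is not the `VectorStoreIndex` implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorIndex:
    """Toy index: stores (node_id, embedding) pairs in a plain Python list."""

    def __init__(self):
        self._entries = []

    def add(self, node_id, embedding):
        self._entries.append((node_id, list(embedding)))

    def top_k(self, query_embedding, k=2):
        """Return the k most similar node ids with their scores, best first."""
        scored = [
            (cosine_similarity(query_embedding, embedding), node_id)
            for node_id, embedding in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(node_id, score) for score, node_id in scored[:k]]


toy_index = InMemoryVectorIndex()
toy_index.add("logging", [1.0, 0.0, 0.0])
toy_index.add("artifacts", [0.0, 1.0, 0.0])
toy_index.add("sweeps", [0.7, 0.7, 0.0])

results = toy_index.top_k([1.0, 0.1, 0.0], k=2)
print(results)
```

The `k` argument here plays the role that `similarity_top_k` plays for the retriever configured later in the notebook.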

In [277]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex(
    nodes[:200], # using small number for demonstration purposes
    embed_model=embedding,
    show_progress=True,
    insert_batch_size=128,
)

Generating embeddings: 100%|██████████| 128/128 [00:05<00:00, 23.70it/s]
Generating embeddings: 100%|██████████| 72/72 [00:03<00:00, 22.10it/s]


In [278]:
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer
from llama_index.llms.openai import OpenAI

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

llm = OpenAI(
    model="gpt-3.5-turbo",
    temperature=0.0,
    max_tokens=4096,
)

response_synthesizer = get_response_synthesizer(
    llm=llm,
    response_mode="compact",
)

In [279]:
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

In [280]:
response = query_engine.query("what can we log with Weights and Biases?")

In [281]:
print(response.response)

You can log various metrics and visualizations with Weights and Biases, such as custom histograms, Precision-Recall curves, and different types of smoothing algorithms like exponential moving average, Gaussian smoothing, and running average.


- You change a few things in the pipeline above, eyeball the responses, and repeat until the pipeline is good enough to package in a script.

# Better pipeline with `weave.Model`

In [282]:
from rag import SimpleRAGPipeline

rag_pipeline = SimpleRAGPipeline()
rag_pipeline.build_query_engine()

Parsing nodes: 100%|██████████| 380/380 [00:00<00:00, 429.21it/s]


Downloaded Chroma Index to: data/chroma_db


In [283]:
response = rag_pipeline.predict("what can we log with Weights and Biases?")
pprint(response["response"])

🍩 https://wandb.ai/ayut/llamaindex-weave/r/call/33ccecf9-2bbd-465f-a9df-f07bb5167af3
('With Weights and Biases, you can log various aspects of your machine '
 'learning experiments. This includes training and validation metrics, system '
 'metrics, and hyperparameters for every experiment. You can also use it for '
 'model and dataset versioning, and project collaboration. The platform '
 'supports tracking experiments, visualizing predictions, tuning '
 'hyperparameters, tracking models and datasets, registering models, and '
 'iterating on LLMs. It is compatible with popular ML frameworks and libraries '
 'such as PyTorch, PyTorch Lightning, HuggingFace Transformers, Tensorflow, '
 'Keras, and XGBoost.')


In [284]:
rag_pipeline = SimpleRAGPipeline(chat_llm="gpt-4o")
rag_pipeline.build_query_engine()

Parsing nodes: 100%|██████████| 380/380 [00:00<00:00, 417.12it/s]


Downloaded Chroma Index to: data/chroma_db


In [285]:
response = rag_pipeline.predict("what can we log with Weights and Biases?")
pprint(response["response"])

🍩 https://wandb.ai/ayut/llamaindex-weave/r/call/547b7ca8-c897-44e7-b09e-c0316f549c38
('With Weights and Biases (W&B), you can log a variety of elements essential '
 'for machine learning experiment tracking and project collaboration. Here are '
 'the key items you can log:\n'
 '\n'
 '1. **Training and Validation Metrics**: Using the `WandbMetricsLogger` '
 'callback, you can log metrics such as accuracy, loss, and other custom '
 'metrics during the training and validation phases of your machine learning '
 'models.\n'
 '\n'
 '2. **System Metrics**: Alongside training and validation metrics, W&B can '
 'also log system metrics like GPU usage, CPU usage, memory consumption, and '
 'other hardware-related metrics to help you monitor the resource utilization '
 'of your experiments.\n'
 '\n'
 '3. **Hyperparameters**: W&B allows you to track hyperparameters for every '
 'experiment, which is crucial for reproducibility. You can log '
 'hyperparameters using a simple Python `dict` or more a

# Evaluate the RAG Pipeline

## Eval dataset

In [25]:
eval_rows = []
with open("data/eval_subset.jsonl", "r") as f:
    # One JSON object per line; skip blank lines.
    for line in f:
        if line.strip():
            eval_rows.append(json.loads(line))

# create a weave dataset
eval_dataset = weave.Dataset(name="eval_data_subset", rows=eval_rows)

# Publish the dataset
weave.publish(eval_dataset)

# Retrieve the dataset
eval_dataset = weave.ref('eval_data_subset').get()

📦 Published to https://wandb.ai/ayut/llamaindex-weave/weave/objects/eval_data_subset/versions/HJgMIvtD3UMFwDouX8BUp1KJgKfUTtUwJdXS3VjzpMI


In [286]:
dict(eval_dataset.rows[0])

{'question': "Hey I have a question about using wandb with fastapi in a prod environment. is it recommended to initialize wandb within a specific route function, ie\n\n`@app.route('/')\ndef my_function():\n    wandb.init(...)\n`\nor should i initialize beforehand:\n\n`wandb.init(...)\n@app.route('/')\ndef my_function():\n    ...`\n\nI'm getting a long list of log items in the console and many of them are empty.",
 'answer': 'When integrating `wandb` (Weights & Biases) with a FastAPI application, it\'s important to consider the nature of your application and how you\'re using `wandb`. FastAPI is an asynchronous web framework, and initializing `wandb` within a route function could lead to multiple initializations if the route is hit multiple times, which is not ideal.\n\nHere are some guidelines to help you decide where to place the `wandb.init()` call:\n\n1. **One-time Initialization**: If you need to track metrics across the entire application lifecycle and not just within a specific r

## Metric(s) / Scoring Function (Scorer)

In [287]:
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import CorrectnessEvaluator, EvaluationResult

In [288]:
judge_llm = OpenAI("gpt-4-1106-preview")
evaluator = CorrectnessEvaluator(llm=judge_llm)

In [289]:
row = dict(eval_dataset.rows[0])
row

{'question': "Hey I have a question about using wandb with fastapi in a prod environment. is it recommended to initialize wandb within a specific route function, ie\n\n`@app.route('/')\ndef my_function():\n    wandb.init(...)\n`\nor should i initialize beforehand:\n\n`wandb.init(...)\n@app.route('/')\ndef my_function():\n    ...`\n\nI'm getting a long list of log items in the console and many of them are empty.",
 'answer': 'When integrating `wandb` (Weights & Biases) with a FastAPI application, it\'s important to consider the nature of your application and how you\'re using `wandb`. FastAPI is an asynchronous web framework, and initializing `wandb` within a route function could lead to multiple initializations if the route is hit multiple times, which is not ideal.\n\nHere are some guidelines to help you decide where to place the `wandb.init()` call:\n\n1. **One-time Initialization**: If you need to track metrics across the entire application lifecycle and not just within a specific r

In [290]:
response = rag_pipeline.predict(row["question"])
print(response["response"])

🍩 https://wandb.ai/ayut/llamaindex-weave/r/call/6f22cbe6-0096-4596-a596-2b4dea9dc224
When integrating Weights & Biases (W&B) with FastAPI in a production environment, it is generally recommended to initialize W&B before defining your route functions. This approach ensures that the W&B run is started as early as possible, capturing all relevant logs and metrics from the beginning of your application’s execution. 

Here’s why initializing W&B beforehand is beneficial:

1. **Early Logging**: Initializing W&B early ensures that any output in your console, including error messages, are logged as part of the W&B run. This can be crucial for debugging and monitoring purposes.
2. **Single Initialization**: By initializing W&B once at the start, you avoid multiple initializations which can lead to redundant runs and cluttered logs.
3. **Consistency**: It provides a consistent and centralized place for configuration and initialization, making the code easier to maintain and understand.

Here’s h

In [291]:
result = evaluator.evaluate(
    query=row["question"],
    response=response["response"],
    reference=row["answer"],
)

In [292]:
pprint(result.__dict__)

{'contexts': None,
 'feedback': 'The generated answer is relevant and provides a correct '
             'explanation and guidelines for integrating Weights & Biases '
             '(W&B) with FastAPI, which aligns well with the reference answer. '
             'It suggests initializing W&B before defining route functions, '
             'which is a good practice to avoid multiple initializations and '
             'cluttered logs. The answer also includes a code example that '
             'demonstrates the recommended practice. The explanation of the '
             'benefits of early logging, single initialization, and '
             'consistency adds value to the answer. However, it does not '
             'explicitly mention the potential issues with asynchronous '
             'environments or the use of `run.finish()` in per-request '
             'initializations, which were points covered in the reference '
             'answer. Despite this, the generated answer is still fully 

https://wandb.ai/ayut/llamaindex-weave/weave/traces

### Customizing `CorrectnessEvaluator` for our use case

https://github.com/run-llama/llama_index/blob/29ece9b058f6b9a1cf29bc723ed4aa3a39879ad5/llama-index-core/llama_index/core/evaluation/correctness.py

In [293]:
SYSTEM_TEMPLATE = """You are a Weights & Biases support expert tasked with evaluating the correctness of answers to questions asked by users to a technical support chatbot.

You are given the following information:
- a user query,
- a reference answer,
- a generated answer.

Your job is to judge the relevance and correctness of the generated answer.
Your score has to be between 1 and 3, where 1 is the worst and 3 is the best.

Output your final verdict by strictly following JSON format:
{{
    "reason": <<Provide a brief explanation for your decision here>>,
    "score": <<Provide a score as per the above guidelines>>,
    "decision": <<Provide your final decision here, either 'correct', or 'incorrect'>>
}}
"""

In [294]:
USER_TEMPLATE = """
## User Query
{query}

## Reference Answer
{reference_answer}

## Generated Answer
{generated_answer}
"""

In [36]:
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole

EVALUATOR_TEMPLATE = ChatPromptTemplate(
    message_templates=[
        ChatMessage(role=MessageRole.SYSTEM, content=SYSTEM_TEMPLATE),
        ChatMessage(role=MessageRole.USER, content=USER_TEMPLATE),
    ]
)

In [295]:
import asyncio
from ragas.llms.json_load import json_loader


async def safe_parse_eval_response(eval_response, passing_decision):
    try:
        eval_response_dict = await json_loader.safe_load(
            eval_response, llm=OpenAI()
        )
        score = eval_response_dict.get("score")
        reasoning = eval_response_dict.get("reason")
        decision = eval_response_dict.get("decision") == passing_decision

    except Exception as e:
        print(e)
        print(eval_response)
        score = 0
        reasoning = "Unable to parse response"
        decision = False
    return decision, reasoning, score


class WandbCorrectnessEvaluator(CorrectnessEvaluator):
    async def aevaluate(
        self,
        query = None,
        response = None,
        contexts = None,
        reference = None,
        sleep_time_in_seconds: int = 0,
        **kwargs,
    ) -> EvaluationResult:
        await asyncio.sleep(sleep_time_in_seconds)

        if query is None or response is None or reference is None:
            print(query, response, reference, flush=True)
            raise ValueError("query, response, and reference must be provided")

        eval_response = await self._llm.apredict(
            prompt=self._eval_template,
            query=query,
            generated_answer=response,
            reference_answer=reference,
        )

        passing, reasoning, score = await safe_parse_eval_response(
            eval_response, "correct"
        )

        return EvaluationResult(
            query=query,
            response=response,
            passing=passing,
            score=score,
            feedback=reasoning,
        )
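The parser above delegates JSON recovery to `json_loader.safe_load`, which may call an LLM. As a dependency-free alternative, the sketch below parses a judge reply with the standard library only and repairs one specific slip, an unquoted `decision` value. The function name `parse_judge_verdict` and the repair rule are assumptions for illustration, not part of LlamaIndex or ragas.

```python
import json
import re

FAILED_VERDICT = (False, "Unable to parse response", 0)


def parse_judge_verdict(raw, passing_decision="correct"):
    """Return (passing, reason, score) from a judge reply, or a failing default."""
    # Keep only the outermost {...} span in case the judge added surrounding text.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    candidate = match.group(0) if match else raw
    try:
        verdict = json.loads(candidate)
    except json.JSONDecodeError:
        # Repair an unquoted decision value, e.g. "decision": correct
        repaired = re.sub(
            r'("decision"\s*:\s*)(correct|incorrect)\b', r'\1"\2"', candidate
        )
        try:
            verdict = json.loads(repaired)
        except json.JSONDecodeError:
            return FAILED_VERDICT
    if not isinstance(verdict, dict):
        return FAILED_VERDICT
    passing = verdict.get("decision") == passing_decision
    return passing, verdict.get("reason"), verdict.get("score")


print(parse_judge_verdict('{"reason": "ok", "score": 3, "decision": "correct"}'))
print(parse_judge_verdict('Verdict:\n{"reason": "ok", "score": 3, "decision": correct}'))
print(parse_judge_verdict("no JSON here"))
```

A deterministic fallback like this keeps parse failures cheap and visible; anything it cannot repair is scored as failing, mirroring the `except` branch in `safe_parse_eval_response`.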

In [296]:
correctness_evaluator = WandbCorrectnessEvaluator(
    llm=judge_llm,
    eval_template=EVALUATOR_TEMPLATE,
)

In [297]:
result = correctness_evaluator.evaluate(
    query=row["question"],
    response=response["response"],
    reference=row["answer"],
)

In [298]:
pprint(result.dict())

{'contexts': None,
 'feedback': 'The generated answer is relevant and correct. It provides a '
             'clear explanation of why initializing W&B before defining route '
             "functions is beneficial, aligning with the reference answer's "
             'guidelines. It also includes an example of how to structure the '
             'FastAPI application with W&B initialization, which is helpful '
             'for the user.',
 'invalid_reason': None,
 'invalid_result': False,
 'pairwise_source': None,
 'passing': True,
 'query': 'Hey I have a question about using wandb with fastapi in a prod '
          'environment. is it recommended to initialize wandb within a '
          'specific route function, ie\n'
          '\n'
          "`@app.route('/')\n"
          'def my_function():\n'
          '    wandb.init(...)\n'
          '`\n'
          'or should i initialize beforehand:\n'
          '\n'
          '`wandb.init(...)\n'
          "@app.route('/')\n"
          'def my_f

## Evaluate using `weave.Evaluation`

Decorating the scoring function with `weave.op()` traces every call to it, recording the inputs and outputs of each call.
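Conceptually, a tracing decorator wraps a function and records what went in and what came out. The sketch below is a simplified, synchronous illustration of that idea using only the standard library; `traced`, `TRACE_LOG`, and `exact_match` are hypothetical names, and this is not how `weave.op()` is implemented (Weave also handles async functions and sends traces to its backend).

```python
import functools
import inspect

TRACE_LOG = []  # hypothetical in-memory stand-in for a tracing backend


def traced(fn):
    """Record each call's named inputs and its return value."""
    signature = inspect.signature(fn)

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Map positional and keyword arguments onto parameter names.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        output = fn(*args, **kwargs)
        TRACE_LOG.append(
            {"op": fn.__name__, "inputs": dict(bound.arguments), "output": output}
        )
        return output

    return wrapper


@traced
def exact_match(answer: str, model_output: dict) -> dict:
    """Toy scorer: does the response equal the reference after trimming?"""
    return {"exact_match": answer.strip() == model_output["response"].strip()}


exact_match("42", {"response": "42 "})
print(TRACE_LOG[-1])
```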

In [43]:
@weave.op()
async def get_answer_correctness(
    question: str,
    answer: str,
    model_output: dict
) -> dict:
    result = await correctness_evaluator.aevaluate(
        query=question,
        response=model_output["response"],
        reference=answer,
    )
    return {
        "answer_correctness": result.dict()["passing"],
        "feedback": result.dict()["feedback"],
    }

In [299]:
evaluation = weave.Evaluation(
    dataset=eval_dataset, scorers=[get_answer_correctness]
)
evaluation

Evaluation(name='eval_data_subset-evaluation', description=None, dataset=Dataset(name='eval_data_subset', description=None, rows=<weave.table.Table object at 0x32c7fe7a0>), scorers=[Op(get_answer_correctness)], preprocess_model_input=None, trials=1)

In [300]:
await evaluation.evaluate(rag_pipeline)

🍩 https://wandb.ai/ayut/llamaindex-weave/r/call/4dc425c7-952e-41e6-8622-e4e72ca1fba1


{'get_answer_correctness': {'answer_correctness': {'true_count': 13,
   'true_fraction': 0.65}},
 'model_latency': {'mean': 17.07438259124756}}
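The summary above reports boolean scorer outputs as a count and a fraction. A `true_count` of 13 with a `true_fraction` of 0.65 implies 20 evaluated rows. The sketch below reproduces that arithmetic; `summarize_boolean_scores` is a hypothetical helper, not Weave's aggregation code, and the per-row values are made up to match the totals.

```python
def summarize_boolean_scores(row_scores):
    """Aggregate per-row pass/fail values into a count and a fraction."""
    true_count = sum(1 for passed in row_scores if passed)
    total = len(row_scores)
    return {
        "true_count": true_count,
        "true_fraction": true_count / total if total else 0.0,
    }


# Illustrative data only: 13 passing rows out of 20.
per_row_passing = [True] * 13 + [False] * 7
summary = summarize_boolean_scores(per_row_passing)
print(summary)
```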

# Let's try aligning the Judge LLM (`gpt-4o`)

Judge alignment is nothing more than improving the system prompt of the LLM you are using as an evaluator. 

It means:

- adding instructions to the system prompt that capture what you want to check:
    - Is the tone correct?
    - Is there a gender reference that you don't want?
    - Are there code snippets?
    - It can mean anything!

- adding (few-shot) examples to the system prompt for edge cases, or for cases where you feel the judge is not doing a good job.

Both methods require going back and forth, with tooling in place. The tooling should:

- allow you to give feedback
- allow you to programmatically pull in feedback
- allow you to compare two or more evaluations
- allow you to adapt to a new technique
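Both alignment moves boil down to string assembly: appending scoring guidelines and appending worked examples to the judge's system prompt. Below is a minimal sketch, assuming the prompt is later filled in with `str.format`-style templating, so literal braces in examples must be doubled. `build_judge_system_prompt`, `escape_braces`, and the example content are hypothetical.

```python
def escape_braces(text):
    """Double literal braces so format-style templating treats them as text."""
    return text.replace("{", "{{").replace("}", "}}")


def build_judge_system_prompt(base_instructions, guidelines=(), few_shot_examples=()):
    """Append scoring guidelines and worked examples to a base system prompt."""
    parts = [base_instructions.strip()]
    if guidelines:
        parts.append("\nFollow these guidelines for scoring:")
        parts.extend(f"- {escape_braces(rule)}" for rule in guidelines)
    for number, example in enumerate(few_shot_examples, start=1):
        block = (
            f"\nExample {number}\n"
            f"Question:\n{example['question']}\n"
            f"Reference Answer:\n{example['reference_answer']}\n"
            f"Generated Answer:\n{example['generated_answer']}\n"
            f"Score:\n{example['score']}\n"
            f"Feedback:\n{example['feedback']}"
        )
        parts.append(escape_braces(block))
    return "\n".join(parts)


prompt = build_judge_system_prompt(
    "You are a support expert judging answer correctness.",
    guidelines=["Give a score of 1 if the generated answer contradicts the reference."],
    few_shot_examples=[{
        "question": "How do I log a dict of metrics?",
        "reference_answer": "Call wandb.log({'loss': 0.1}).",
        "generated_answer": "Use wandb.log({'loss': 0.1}) inside the training loop.",
        "score": 3,
        "feedback": "Matches the reference and answers the question.",
    }],
)
print(prompt.format())
```

Escaping, rather than deleting, the braces keeps code snippets inside the examples intact once the template is rendered.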

### Align the criteria with better instructions

In [59]:
from eval import evaluator

Parsing nodes: 100%|██████████| 380/380 [00:00<00:00, 437.86it/s]


Downloaded Chroma Index to: data/chroma_db


In [60]:
await evaluator.evaluate(rag_pipeline)

🍩 https://wandb.ai/ayut/llamaindex-weave/r/call/4a72b196-7fe8-4392-9c7f-3d5d53bc4bd8


{'get_answer_correctness': {'answer_correctness': {'true_count': 19,
   'true_fraction': 0.95}},
 'compare_length_within_95_percentile': {'length_within_95_percentile': {'true_count': 4,
   'true_fraction': 0.2}},
 'check_code_block_presence': {'code_block_presense': {'true_count': 17,
   'true_fraction': 0.85},
  'has_code_block_gt': {'true_count': 16, 'true_fraction': 0.8},
  'has_code_block_gen': {'true_count': 19, 'true_fraction': 0.95}},
 'check_bullet_points_presence': {'bullet_points_presense': {'true_count': 19,
   'true_fraction': 0.95},
  'has_bullet_points_gt': {'true_count': 18, 'true_fraction': 0.9},
  'has_bullet_points_gen': {'true_count': 19, 'true_fraction': 0.95}},
 'model_latency': {'mean': 15.491464567184448}}
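The extra scorers in this summary come from the `eval` module, which is not shown in this notebook. As one plausible reading of the `check_code_block_presence` keys, the sketch below flags whether the reference and the generated answer each contain a fenced code block and whether they agree. This is an assumption about the intent, not the module's actual code; only the key names (including the `presense` spelling) are copied from the output above.

```python
FENCE = "`" * 3  # a markdown code fence, built this way to keep this block readable


def check_code_block_presence_sketch(answer: str, model_output: dict) -> dict:
    """Hypothetical scorer: do reference and generated answers agree on having code?"""
    has_code_block_gt = FENCE in answer
    has_code_block_gen = FENCE in model_output["response"]
    return {
        # Key spelling copied from the evaluation output.
        "code_block_presense": has_code_block_gt == has_code_block_gen,
        "has_code_block_gt": has_code_block_gt,
        "has_code_block_gen": has_code_block_gen,
    }


reference = f"Initialize once:\n{FENCE}python\nwandb.init()\n{FENCE}"
generated = {"response": "Initialize W&B once at startup."}
scores = check_code_block_presence_sketch(reference, generated)
print(scores)
```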

Update the system prompt with better instructions:

diff:

"""
Follow these guidelines for scoring:
- Your score has to be between 1 and 3, where 1 is the worst and 3 is the best.
- If the generated answer is not correct in comparison to the reference, you should give a score of 1.
- If the generated answer is correct in comparison to the reference but contains mistakes, you should give a score of 2.
- If the generated answer is correct in comparison to the reference and completely answers the user's query, you should give a score of 3.

"""

In [61]:
from eval import evaluator

await evaluator.evaluate(rag_pipeline)

Parsing nodes: 100%|██████████| 380/380 [00:00<00:00, 453.89it/s]


Downloaded Chroma Index to: data/chroma_db


🍩 https://wandb.ai/ayut/llamaindex-weave/r/call/39a684e9-31b7-477f-8445-8ce0ce4b20e2


{'get_answer_correctness': {'answer_correctness': {'true_count': 18,
   'true_fraction': 0.9}},
 'compare_length_within_95_percentile': {'length_within_95_percentile': {'true_count': 1,
   'true_fraction': 0.05}},
 'check_code_block_presence': {'code_block_presense': {'true_count': 18,
   'true_fraction': 0.9},
  'has_code_block_gt': {'true_count': 16, 'true_fraction': 0.8},
  'has_code_block_gen': {'true_count': 18, 'true_fraction': 0.9}},
 'check_bullet_points_presence': {'bullet_points_presense': {'true_count': 18,
   'true_fraction': 0.9},
  'has_bullet_points_gt': {'true_count': 18, 'true_fraction': 0.9},
  'has_bullet_points_gen': {'true_count': 20, 'true_fraction': 1.0}},
 'model_latency': {'mean': 13.661026763916016}}

### Align the criteria with few-shot examples

In [99]:
# Get the evaluation calls that have feedback
client = weave.init("ayut/llamaindex-weave")

Logged in as Weights & Biases user: ayut.
View Weave data at https://wandb.ai/ayut/llamaindex-weave/weave


In [181]:
thumbs_down = client.feedback(reaction="👎")
calls = thumbs_down.refs().calls()

In [136]:
calls

[TraceObject(Call(op_name='weave:///ayut/llamaindex-weave/op/Evaluation.predict_and_score:NmwfShfFmgAhDGLXrF6Xn02T9MIAsCXBUcifCjyKpOM', trace_id='1f32e8ff-e8f4-4479-a71b-56775f04b269', project_id='ayut/llamaindex-weave', parent_id='27025708-d381-436a-b1ab-c65c7e74e84e', inputs={'self': ObjectRef(entity='ayut', project='llamaindex-weave', name='eval_data_subset-evaluation', digest='VdRdiqyuDGeDJqixF5QCdDdv5WkjPqqU3VXLFAfLuPM', extra=[]), 'model': ObjectRef(entity='ayut', project='llamaindex-weave', name='SimpleRAGPipeline', digest='b1OEC1BzWkLgEFqj0aubhncBXHiSTmAC9PPXiEYLvXg', extra=[]), 'example': ObjectRef(entity='ayut', project='llamaindex-weave', name='eval_data_subset', digest='HJgMIvtD3UMFwDouX8BUp1KJgKfUTtUwJdXS3VjzpMI', extra=['attr', 'rows', 'id', 'QPl3QIGcbj06cGvsYqvMJvFMIlUPXyBrKehVVUGp4Vo'])}, id='8e3defe8-334f-41db-b3b1-539a1c47a0d3', output={'model_output': {'response': "Yes, you can copy runs from one project to another. However, since you have used all the free tracked h

In [198]:
len(calls)

10

In [209]:
few_shot_calls = []

for call in calls:
    if call.feedback.feedbacks:
        for feedback in call.feedback.feedbacks:
            # Keep calls annotated with a "few-shot" note.
            if (
                feedback.feedback_type == "wandb.note.1"
                and feedback.payload["note"] == "few-shot"
            ):
                few_shot_calls.append(call)

In [270]:
len(few_shot_calls)

2

In [267]:
few_shot_prompt = ""

for call in few_shot_calls:
    question = call.inputs["example"]["question"]
    reference_answer = call.inputs["example"]["answer"]
    generated_answer = call.output["model_output"]["response"]
    # These calls were flagged with 👎, so the judge's original verdict is flipped.
    judged_correct = call.output["scores"]["get_answer_correctness"]["answer_correctness"]
    score = 1 if judged_correct else 3
    feedback = call.output["scores"]["get_answer_correctness"]["feedback"]

    few_shot_prompt += f"Question: \n{question}\n Reference Answer: \n{reference_answer}\n Generated Answer: \n{generated_answer}\n Score: \n{score}\n Feedback: \n{feedback}\n"

# Strip braces so the examples do not clash with the prompt template's placeholders.
few_shot_prompt = few_shot_prompt.replace("{", "").replace("}", "")

In [268]:
few_shot_prompt

'Question: \nI\'m getting Internal Server Error for url: https://api.wandb.ai/graphql when trying to download an artifact with the following code \n```\nprocessed_data_at = run.use_artifact(f\'params.PROCESSED_DATA_AT:latest\')\nprocessed_dataset_dir = Path(processed_data_at.download())\n```\n Reference Answer: \nAn "Internal Server Error" typically indicates that there\'s a problem on the server side of the application you\'re trying to interact with—in this case, the W&B (Weights & Biases) API. Here are some steps you can take to troubleshoot and potentially resolve the issue:\n\n1. **Check W&B Status**: Before diving into debugging, check if W&B is experiencing any known issues by visiting their status page. If there\'s a known outage or issue, it might be the cause of the error, and you\'ll have to wait until it\'s resolved.\n\n2. **Retry the Request**: Sometimes, intermittent server issues can cause this error. Try running your code again after a short wait to see if the issue res

In [269]:
from eval import evaluator

await evaluator.evaluate(rag_pipeline)

Parsing nodes: 100%|██████████| 380/380 [00:00<00:00, 452.08it/s]


Downloaded Chroma Index to: data/chroma_db


🍩 https://wandb.ai/ayut/llamaindex-weave/r/call/52df3995-4bbe-4304-8483-8ca83e626e5d


{'get_answer_correctness': {'answer_correctness': {'true_count': 20,
   'true_fraction': 1.0}},
 'compare_length_within_95_percentile': {'length_within_95_percentile': {'true_count': 1,
   'true_fraction': 0.05}},
 'check_code_block_presence': {'code_block_presense': {'true_count': 17,
   'true_fraction': 0.85},
  'has_code_block_gt': {'true_count': 16, 'true_fraction': 0.8},
  'has_code_block_gen': {'true_count': 19, 'true_fraction': 0.95}},
 'check_bullet_points_presence': {'bullet_points_presense': {'true_count': 18,
   'true_fraction': 0.9},
  'has_bullet_points_gt': {'true_count': 18, 'true_fraction': 0.9},
  'has_bullet_points_gen': {'true_count': 20, 'true_fraction': 1.0}},
 'model_latency': {'mean': 14.592189240455628}}

In [271]:
rag_pipeline = SimpleRAGPipeline()
rag_pipeline.build_query_engine()

Parsing nodes: 100%|██████████| 380/380 [00:00<00:00, 461.64it/s]


Downloaded Chroma Index to: data/chroma_db


In [272]:
from eval import evaluator

await evaluator.evaluate(rag_pipeline)

'OpenAI' object has no attribute 'agenerate_text'
{
    "reason": "The generated answer provides a clear and relevant approach to organizing W&B runs during hyperparameter tuning with cross-validation. It suggests using the `group` parameter in `wandb.init()` to group related runs, which directly addresses the user's need to organize runs by cross-validation folds or hyperparameter sets. Additionally, it mentions using the W&B dashboard to sort and analyze runs, and it includes an example of integrating Optuna with W&B, which is pertinent since the user is using Optuna for hyperparameter optimization. The answer is practical and actionable, covering the key aspects of the user's query.",
    "score": 3,
    "decision": correct
}


🍩 https://wandb.ai/ayut/llamaindex-weave/r/call/491be5e0-15a5-4aa8-9225-3836b4490e0f


{'get_answer_correctness': {'answer_correctness': {'true_count': 14,
   'true_fraction': 0.7}},
 'compare_length_within_95_percentile': {'length_within_95_percentile': {'true_count': 0,
   'true_fraction': 0.0}},
 'check_code_block_presence': {'code_block_presense': {'true_count': 11,
   'true_fraction': 0.55},
  'has_code_block_gt': {'true_count': 16, 'true_fraction': 0.8},
  'has_code_block_gen': {'true_count': 11, 'true_fraction': 0.55}},
 'check_bullet_points_presence': {'bullet_points_presense': {'true_count': 11,
   'true_fraction': 0.55},
  'has_bullet_points_gt': {'true_count': 18, 'true_fraction': 0.9},
  'has_bullet_points_gen': {'true_count': 9, 'true_fraction': 0.45}},
 'model_latency': {'mean': 17.090890181064605}}