# Step 6 (Optional): Use Langfuse to trace and track evaluations

This notebook uses Langfuse to trace the RAG application and track evaluation results. Running it requires a Langfuse API key. Langfuse offers a free tier with 50,000 observations per month (as of June 2024), and this notebook uses approximately 50. To get a key, sign up at https://cloud.langfuse.com, create a new project, and create a new API key in the project settings.

In [1]:
%run 01-llm-app-setup.ipynb

Documents before chunking: 5
Documents after chunking: 35




 = Plain maskray = 
 
 The plain maskray or brown stingray ( Neotrygon annotata ) is a species of stingray in the family Dasyatidae . It is found in shallow , soft-bottomed habitats off northern Australia . Reaching 24 cm ( 9.4 in ) in width , this species has a diamond-shaped , grayish green pectoral fin disc . Its short , whip-like tail has alternating black and white bands and fin folds above and below . There are short rows of thorns on the back and the base of the tail , but otherwise the s


The plain maskray is found in the continental shelf of northern Australia, from the Wellesley Islands in Queensland to the Bonaparte Archipelago in Western Australia, including the Gulf of Carpentaria and the Timor and Arafura Seas. There are also unsubstantiated reports that its range extends to southern Papua New Guinea.

Source: Plain maskray
Relevant Snippet: "The plain maskray inhabits the continental shelf of northern Australia from the Wellesley Islands in Queensland to the Bonaparte Archipelago in Western Australia, including the Gulf of Carpentaria and the Timor and Arafura Seas. There are unsubstantiated reports that its range extends to southern Papua New Guinea."


## Set your keys if you didn't put them in your `.env` file

In [2]:
# Get keys for your project from https://cloud.langfuse.com
# Set them here if they are not already set in .env
# import os
# os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
# os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com"  # US region; use the host that matches your project
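
Before initializing the client, it can help to confirm that the keys are actually present. The following is a minimal sketch, not part of the original notebook: `missing_langfuse_keys` is a hypothetical helper, and it assumes that only the public and secret keys are required while `LANGFUSE_HOST` is optional.

```python
import os

# Assumption: only these two variables are strictly required;
# LANGFUSE_HOST is treated as optional in this sketch.
REQUIRED_KEYS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_langfuse_keys(env=None):
    """Return the required Langfuse variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_langfuse_keys()
if missing:
    print("Missing Langfuse settings:", ", ".join(missing))
```

If the list is non-empty, set the variables in `.env` or in the cell above before continuing.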

In [4]:
# import
from langfuse import Langfuse
import openai
 
# Initialize the client; it reads the LANGFUSE_* variables from the environment
langfuse = Langfuse()

## Create dataset

In [5]:
import pandas as pd

gen_dataset = pd.read_csv("generated_qa.csv").fillna("")

In [6]:
dataset_name = "RAG QA Dataset"

In [8]:
langfuse.create_dataset(name=dataset_name)

# Upload each generated question/answer pair to Langfuse as a dataset item
for _, row in gen_dataset.iterrows():
    langfuse.create_dataset_item(
        dataset_name=dataset_name,
        # any Python object or value
        input=row["question"],
        # any Python object or value, optional
        expected_output={
            "ground_truth": row["ground_truth"],
            "ground_truth_context": row["ground_truth_context"],
        },
    )
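
To make the shape of each dataset item explicit, the mapping from a CSV row to the arguments used above can be sketched without calling Langfuse at all. This is an illustration only: `row_to_item_kwargs` is a hypothetical helper, the sample row contains placeholder text rather than real data, and the column names are assumed to match those read from `generated_qa.csv`.

```python
import csv
import io


def row_to_item_kwargs(row, dataset_name):
    """Build the keyword arguments for one dataset item from a CSV row."""
    return {
        "dataset_name": dataset_name,
        "input": row["question"],
        "expected_output": {
            # Empty cells become "" here, mirroring fillna("") in the notebook
            "ground_truth": row.get("ground_truth") or "",
            "ground_truth_context": row.get("ground_truth_context") or "",
        },
    }


# Placeholder CSV content, for illustration only
sample_csv = io.StringIO(
    "question,ground_truth,ground_truth_context\n"
    "placeholder question,placeholder answer,\n"
)
items = [
    row_to_item_kwargs(row, "RAG QA Dataset")
    for row in csv.DictReader(sample_csv)
]
print(items[0])
```

Each resulting dictionary could then be passed as `langfuse.create_dataset_item(**kwargs)`, which keeps the upload loop and the row mapping separate.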

## Setup custom evaluators

In [9]:
%run 03-metrics-definition.ipynb

In [27]:
def lf_context_correctness(output, expected_output):
    ground_truth_context = expected_output["ground_truth_context"]
    retrieved_contexts = output["context"] or []
    return context_correctness(ground_truth_context, retrieved_contexts)


def lf_ground_truth_context_rank(output, expected_output):
    ground_truth_context = expected_output["ground_truth_context"]
    retrieved_contexts = output["context"] or []
    return ground_truth_context_rank(ground_truth_context, retrieved_contexts)


def lf_context_rougel_score(output, expected_output):
    ground_truth_context = expected_output["ground_truth_context"]
    retrieved_contexts = output["context"] or []
    return context_rougel_score(ground_truth_context, retrieved_contexts)
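
The three wrappers above share the same body and differ only in the metric they call. One way to avoid the repetition is a small factory, sketched below. `make_lf_metric` and `exact_match_hit` are hypothetical names introduced for illustration; the stand-in metric is not one of the metrics defined in `03-metrics-definition.ipynb`.

```python
def make_lf_metric(metric_fn):
    """Wrap a (ground_truth_context, retrieved_contexts) metric so that it
    accepts the app output and the dataset item's expected output."""

    def lf_metric(output, expected_output):
        ground_truth_context = expected_output["ground_truth_context"]
        retrieved_contexts = output["context"] or []
        return metric_fn(ground_truth_context, retrieved_contexts)

    lf_metric.__name__ = f"lf_{metric_fn.__name__}"
    return lf_metric


# Stand-in metric, for illustration only
def exact_match_hit(ground_truth_context, retrieved_contexts):
    return 1.0 if ground_truth_context in retrieved_contexts else 0.0


lf_exact_match_hit = make_lf_metric(exact_match_hit)
print(lf_exact_match_hit({"context": ["a", "b"]}, {"ground_truth_context": "a"}))
```

With this in place, a wrapper such as `lf_context_correctness` could be written as `make_lf_metric(context_correctness)`, assuming the metric takes the same two arguments.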

## Run evaluation

In [41]:
from datetime import datetime


def run_my_custom_llm_app(question):
    """Query the RAG engine and log the call to Langfuse as a generation."""
    generation_start_time = datetime.now()

    model_output = openai_query_engine.query(question)
    response = model_output.response
    # Keep the retrieved chunks so the context metrics can score them
    context = [node.text for node in model_output.source_nodes]
    formatted_output = {
        "output": response,
        "context": context,
    }

    langfuse_generation = langfuse.generation(
        name="rag-chain-qa",
        input=question,
        output=formatted_output,
        model="gpt-3.5-turbo",
        start_time=generation_start_time,
        end_time=datetime.now(),
    )

    return formatted_output, langfuse_generation

In [42]:
dataset = langfuse.get_dataset(dataset_name)

for item in dataset.items:
    completion, langfuse_generation = run_my_custom_llm_app(item.input)

    item.link(langfuse_generation, "Exp 1")

    langfuse_generation.score(
        name="context_correctness",
        value=lf_context_correctness(completion, item.expected_output)
        )
    langfuse_generation.score(
        name="context_rank",
        value=lf_ground_truth_context_rank(completion, item.expected_output)
        )
    langfuse_generation.score(
        name="context_rougel_score",
        value=lf_context_rougel_score(completion, item.expected_output)
        )

# Send any queued events before inspecting the results in the Langfuse UI
langfuse.flush()
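
The loop above sends every score to Langfuse, where the per-run aggregates can be inspected. For a quick local check, the same values could also be collected in a list and averaged. The sketch below assumes the scores are numeric and that a metric may return `None` when it cannot be computed; `summarize_scores` is a hypothetical helper and the sample values are placeholders, not results from this notebook.

```python
from collections import defaultdict
from statistics import mean


def summarize_scores(records):
    """Average each named score across records, skipping missing values."""
    by_name = defaultdict(list)
    for record in records:
        for name, value in record.items():
            if value is not None:
                by_name[name].append(float(value))
    return {name: mean(values) for name, values in by_name.items()}


# Placeholder values, for illustration only
sample_records = [
    {"context_correctness": 1, "context_rank": 1},
    {"context_correctness": 0, "context_rank": None},
]
summary = summarize_scores(sample_records)
print(summary)
```

In the evaluation loop, one record per dataset item could be appended with the same names passed to `langfuse_generation.score`.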
    

## Go to https://cloud.langfuse.com/ to see the traces and evaluation results

![Screenshot of Langfuse as of July 12, 2024](langfuse.png)