# Step 6 (Optional): Use Langfuse to trace and track the evaluation

This notebook uses Langfuse to trace the RAG chain and track the evaluation results. You need a Langfuse API key to run it. Langfuse offers a free tier with 50,000 observations per month (as of Feb 2024); this notebook uses approximately 100. To get a key, sign up at https://cloud.langfuse.com, create a new project, and create a new API key in the settings.

In [1]:
%run 01-llm-app-setup.ipynb

## Set your keys if you didn't put them in your ".env" file

In [2]:
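# import os

# # Get the keys for your project from https://cloud.langfuse.com
# os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-..."
# os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-..."
# # The host must match your project's region:
# # https://cloud.langfuse.com (EU) or https://us.cloud.langfuse.com (US)
# os.environ["LANGFUSE_HOST"] = "https://us.cloud.langfuse.com"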
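Before creating the client, it can help to confirm that the variables are actually set. The sketch below is one way to do that with the standard library only. `missing_langfuse_keys` is a hypothetical helper, not part of the Langfuse SDK, and it treats all three variables as expected simply because the cell above sets all three.

```python
import os

# Variable names taken from the cell above; treating all three as expected is an assumption
EXPECTED_KEYS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_langfuse_keys(env=None):
    """Hypothetical helper: return the expected variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in EXPECTED_KEYS if not env.get(key)]


missing = missing_langfuse_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

If the list is non-empty, set the missing values in your ".env" file or in the cell above before continuing.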

In [3]:
from langfuse import Langfuse

# Reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST from the environment
langfuse = Langfuse()

## Create dataset

In [4]:
import pandas as pd

gen_dataset = pd.read_csv("generated_qa.csv").fillna("")

In [5]:
dataset_name = "RAG QA Dataset"

In [6]:
langfuse.create_dataset(name=dataset_name)

# Upload each generated QA pair to Langfuse as a dataset item
for _, row in gen_dataset.iterrows():
    langfuse.create_dataset_item(
        dataset_name=dataset_name,
        # any Python object or value
        input=row["question"],
        # any Python object or value, optional
        expected_output={
            "ground_truth": row["ground_truth"],
            "ground_truth_context": row["ground_truth_context"],
        },
    )

## Setup custom evaluators

In [7]:
%run 03-metrics-definition.ipynb

In [8]:
def lf_context_correctness(output, expected_output):
    ground_truth_context = expected_output["ground_truth_context"]
    retrieved_contexts = output["context"] or []
    return context_correctness(ground_truth_context, retrieved_contexts)


def lf_ground_truth_context_rank(output, expected_output):
    ground_truth_context = expected_output["ground_truth_context"]
    retrieved_contexts = output["context"] or []
    return ground_truth_context_rank(ground_truth_context, retrieved_contexts)


def lf_context_rougel_score(output, expected_output):
    ground_truth_context = expected_output["ground_truth_context"]
    retrieved_contexts = output["context"] or []
    return context_rougel_score(ground_truth_context, retrieved_contexts)
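To make the expected data shapes concrete, here is a small sketch of one wrapper called on hand-written data. `toy_context_correctness` is a hypothetical stand-in for the real metric defined in `03-metrics-definition.ipynb` (which may compute something different), and the sketch assumes that the retrieved contexts are plain strings.

```python
def toy_context_correctness(ground_truth_context, retrieved_contexts):
    """Hypothetical stand-in metric: 1 if the ground-truth context was retrieved, else 0."""
    return int(ground_truth_context in retrieved_contexts)


def lf_toy_context_correctness(output, expected_output):
    # Same wrapper pattern as the cell above: unpack the fields, then call the metric
    ground_truth_context = expected_output["ground_truth_context"]
    retrieved_contexts = output["context"] or []  # treat None as "nothing retrieved"
    return toy_context_correctness(ground_truth_context, retrieved_contexts)


# Made-up example data, for illustration only
expected_output = {
    "ground_truth": "Paris",
    "ground_truth_context": "Paris is the capital of France.",
}
hit = {"context": ["Berlin is the capital of Germany.", "Paris is the capital of France."]}
miss = {"context": None}

hit_score = lf_toy_context_correctness(hit, expected_output)
miss_score = lf_toy_context_correctness(miss, expected_output)
print(hit_score, miss_score)
```

The `or []` fallback is what lets the wrapper return a score instead of raising an error when the chain returns no context.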

## Run evaluation

In [9]:
from datetime import datetime


def run_my_custom_llm_app(question):
    """Run the RAG chain on one question and log the call to Langfuse as a generation."""
    start_time = datetime.now()

    out = rag_chain.invoke(question)

    # Record input, output, and timing so the call shows up in Langfuse
    langfuse_generation = langfuse.generation(
        name="rag-chain-qa",
        input=question,
        output=out,
        model="gpt-3.5-turbo",
        start_time=start_time,
        end_time=datetime.now(),
    )

    return out, langfuse_generation

In [10]:
dataset = langfuse.get_dataset(dataset_name)

for item in dataset.items:
    completion, langfuse_generation = run_my_custom_llm_app(item.input)

    # Attach this generation to the dataset item under the run name "Exp 1"
    item.link(langfuse_generation, "Exp 1")

    # Score the retrieval with each custom evaluator
    langfuse_generation.score(
        name="context_correctness",
        value=lf_context_correctness(completion, item.expected_output),
    )
    langfuse_generation.score(
        name="context_rank",
        value=lf_ground_truth_context_rank(completion, item.expected_output),
    )
    langfuse_generation.score(
        name="context_rougel_score",
        value=lf_context_rougel_score(completion, item.expected_output),
    )

# Send any queued events to Langfuse before viewing the results
langfuse.flush()
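For a quick sanity check without opening the UI, the per-item scores can also be averaged locally. The sketch below assumes you collect each item's scores in a dict inside the loop. `summarize_scores` and the sample values are hypothetical and not part of the Langfuse SDK.

```python
from statistics import mean


def summarize_scores(per_item_scores):
    """Hypothetical helper: average each metric over a list of {metric_name: value} dicts."""
    names = sorted({name for scores in per_item_scores for name in scores})
    return {
        name: mean(scores[name] for scores in per_item_scores if name in scores)
        for name in names
    }


# Made-up values for illustration only, not results from this notebook
sample = [
    {"context_correctness": 1, "context_rank": 1},
    {"context_correctness": 0, "context_rank": 3},
]
summary = summarize_scores(sample)
print(summary)
```

Langfuse remains the source of truth for the run; a local summary like this is only a convenience while iterating.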
    

## Go to https://cloud.langfuse.com/ to see the traces and evaluation scores

![](langfuse.png)