## Hands-on 01: LLM Application Trace

Install the required libraries.

In [1]:
%pip install -r ../requirements.txt

Collecting aiohappyeyeballs==2.4.0 (from -r ../requirements.txt (line 1))
  Downloading aiohappyeyeballs-2.4.0-py3-none-any.whl.metadata (5.9 kB)
Collecting aiohttp==3.10.5 (from -r ../requirements.txt (line 2))
  Downloading aiohttp-3.10.5-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (7.5 kB)
Collecting aiosignal==1.3.1 (from -r ../requirements.txt (line 3))
  Downloading aiosignal-1.3.1-py3-none-any.whl.metadata (4.0 kB)
Collecting altair==5.4.1 (from -r ../requirements.txt (line 4))
  Downloading altair-5.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting annotated-types==0.7.0 (from -r ../requirements.txt (line 5))
  Downloading annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB)
Collecting anyio==4.4.0 (from -r ../requirements.txt (line 6))
  Downloading anyio-4.4.0-py3-none-any.whl.metadata (4.6 kB)
Collecting asttokens==2.4.1 (from -r ../requirements.txt (line 7))
  Downloading asttokens-2.4.1-py2.py3-none-any.whl.metadata (5.2 kB)
Collecting attrs==2

Load the environment variables needed for this hands-on from `../.env`.

In [3]:
import os
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())

endpoint = "http://langfuse-server:3000"
public_key = os.getenv("PUBLIC_KEY")
secret_key = os.getenv("SECRET_KEY")
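
`os.getenv` returns `None` when a variable is missing, so a typo in `.env` only surfaces later as an authentication error. A minimal sketch of an early check, assuming the two variable names used above; `missing_env_vars` is a hypothetical helper, not part of python-dotenv or Langfuse.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping so the check can be tried without a real .env file.
example_env = {"PUBLIC_KEY": "pk-example", "SECRET_KEY": ""}
print(missing_env_vars(["PUBLIC_KEY", "SECRET_KEY"], example_env))  # ['SECRET_KEY']
```

In the notebook, calling `missing_env_vars(["PUBLIC_KEY", "SECRET_KEY"])` right after `load_dotenv` would list whatever still needs to be set.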

Initialize the Langfuse client.

In [4]:
from langfuse import Langfuse

langfuse = Langfuse(
    public_key=public_key,
    secret_key=secret_key,
    host=endpoint
)

### Evaluating the output

Fetch the list of generations to be judged by LLM as a Judge. Here, generations created within the last 24 hours are treated as the evaluation targets.
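
The timestamps Langfuse returns are timezone-aware UTC values, while a naive `datetime.now()` carries no timezone and may be interpreted differently depending on the environment. A minimal sketch of computing the start of the 24-hour window as an explicit UTC value; `window_start` is a hypothetical helper introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone


def window_start(hours, now=None):
    """Return the timezone-aware UTC instant `hours` before `now`."""
    current = datetime.now(timezone.utc) if now is None else now
    return current - timedelta(hours=hours)


# A fixed "now" makes the result reproducible.
fixed_now = datetime(2024, 10, 7, 12, 0, tzinfo=timezone.utc)
print(window_start(24, fixed_now).isoformat())  # 2024-10-06T12:00:00+00:00
```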

In [5]:
import datetime
from pprint import pprint

generations = langfuse.get_generations(
    from_start_time=datetime.datetime.now() - datetime.timedelta(hours=24),
)

pprint(f"Fetched {len(generations.data)} generations.")
pprint(f"{generations.data[0].__dict__}")

'Fetched 1 generations.'
("{'id': 'bd64f942-97f9-48a8-9f99-0fa00dc538bb', 'trace_id': "
 "'0fd4b94d-bb90-488a-90dd-2531091976f0', 'type': 'GENERATION', 'name': "
 "'ChatOpenAI', 'start_time': datetime.datetime(2024, 10, 6, 16, 8, 22, 9000, "
 "tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2024, 10, 6, "
 "16, 8, 25, 212000, tzinfo=datetime.timezone.utc), 'completion_start_time': "
 "None, 'model': 'gpt-4o-mini', 'model_parameters': {'max_tokens': 1024, "
 "'temperature': '0.7'}, 'input': [{'role': 'user', 'content': "
 "'\\n以下の質問に答えてください。\\n\\n## 質問\\nカルビクッパ\\n'}], 'version': None, 'metadata': "
 "{'tags': ['seq:step:3'], 'ls_provider': 'openai', 'ls_max_tokens': 1024, "
 "'ls_model_name': 'gpt-4o-mini', 'ls_model_type': 'chat', 'ls_temperature': "
 "0.7}, 'output': {'role': 'assistant', 'content': "
 "'カルビクッパは、韓国の料理の一つで、カルビ（牛肉のあばら肉）を使ったスープご飯です。通常、煮込んだカルビをスープにし、白ご飯を加えて煮込むことで、風味豊かな料理に仕上げます。スープは辛さの調整が可能で、コチュジャンや唐辛子粉を使って辛味を加えることが一般的です。\\n\\nカルビクッパは、栄養バランスが良く、温かいので、寒い季節にもぴった

Implement the evaluation function. This hands-on uses the LangChain Evaluator.

In [6]:
from langchain.evaluation.loading import load_evaluator
from langchain.evaluation.schema import EvaluatorType

def load_evaluator_by_criteria_key(key: str):
    # Use OpenAI unless a Cohere API key is configured.
    if os.getenv("COHERE_API_KEY") is None:
        from langchain_openai.chat_models import ChatOpenAI
        openai_api_key = os.getenv("OPENAI_API_KEY")
        llm = ChatOpenAI(api_key=openai_api_key, model="gpt-4o-mini")
    else:
        from langchain_cohere.chat_models import ChatCohere
        cohere_api_key = os.getenv("COHERE_API_KEY")
        llm = ChatCohere(cohere_api_key=cohere_api_key, model="command-r-plus")

    evaluator = load_evaluator(
        evaluator=EvaluatorType.CRITERIA,
        llm=llm,
        criteria=key
    )
    return evaluator

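Set the evaluation criteria. This hands-on uses the following three:

- conciseness: whether the answer is concise and to the point
- coherence: whether the answer is well structured and organized
- harmfulness: whether the answer is harmful, offensive, or inappropriate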

In [7]:
criterias = [
    "conciseness",
    "coherence",
    "harmfulness",
]

Run the LLM-based evaluation on the generations from the last 24 hours.

In [8]:
def execute_evaluation_and_scoring():
    # Build each evaluator once instead of once per generation.
    evaluators = {key: load_evaluator_by_criteria_key(key=key) for key in criterias}
    for generation in generations.data:
        for key, evaluator in evaluators.items():
            result = evaluator.evaluate_strings(
                prediction=generation.output,
                input=generation.input
            )
            pprint(result)
            # Attach the verdict to the generation as a Langfuse score.
            langfuse.score(
                name=f"llm-as-a-judge-{key}",
                trace_id=generation.trace_id,
                observation_id=generation.id,
                value=result.get("score"),
                comment=result.get("reasoning")
            )

execute_evaluation_and_scoring()

* 'allow_population_by_field_name' has been renamed to 'populate_by_name'
* 'smart_union' has been removed


{'reasoning': 'Step-by-step reasoning: \n'
              'The submission provides a clear and concise description of what '
              "'Kalbi-kukpa' is, including the key ingredients, cooking "
              'method, and cultural context. It also mentions the flexibility '
              "of spice customization and the dish's nutritional value. There "
              'is no unnecessary information or repetition, and the '
              'explanation is focused and to the point. \n'
              '\n'
              'Conclusion: The submission meets the criterion of conciseness.\n'
              '\n'
              'Answer: Y',
 'score': 1,
 'value': 'Y'}
{'reasoning': 'Step-by-step reasoning: \n'
              'The submission is structured as a single paragraph that '
              "responds to the user's question. It provides a coherent and "
              "concise description of 'Kalbi-kukbab', including its main "
              'ingredients, preparation method, and cultural context. 
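
The result dictionaries printed above carry a `value` of `Y` or `N` together with a numeric `score`. A minimal sketch of that mapping, assuming the judge ends its reply with an `Answer: Y` or `Answer: N` line as in the output shown; `verdict_to_score` is a hypothetical helper written for illustration, not LangChain's own parser.

```python
def verdict_to_score(text):
    """Map a judge reply whose last non-empty line is a Y/N verdict to a result dict."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    # Accept both "Answer: Y" and a bare "Y" on the final line.
    verdict = lines[-1].removeprefix("Answer:").strip().upper() if lines else ""
    score = {"Y": 1, "N": 0}.get(verdict)  # None when no clear verdict is found
    return {
        "reasoning": text.strip(),
        "value": verdict if score is not None else None,
        "score": score,
    }


reply = "The submission is focused and to the point.\n\nAnswer: Y"
print(verdict_to_score(reply)["score"])  # 1
```

Under this convention a score of 1 for `harmfulness` would mean the answer was judged harmful, so a higher score is not better for every criterion.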

### (Optional) Evaluating the input

Fetch the list of traces to be judged by LLM as a Judge. Here, prompts entered within the last 24 hours are treated as the targets.

In [9]:
import datetime
from pprint import pprint

traces = langfuse.get_traces(
    from_timestamp=datetime.datetime.now() - datetime.timedelta(hours=24),
    tags=["app"]
)

pprint(f"Fetched {len(traces.data)} traces.")
pprint(f"{traces.data[0].__dict__}")

'Fetched 3 traces.'
("{'id': '0fd4b94d-bb90-488a-90dd-2531091976f0', 'timestamp': "
 'datetime.datetime(2024, 10, 6, 16, 8, 21, 419000, '
 "tzinfo=datetime.timezone.utc), 'name': 'Ask the BigBaBy', 'input': 'カルビクッパ', "
 "'output': "
 "'カルビクッパは、韓国の料理の一つで、カルビ（牛肉のあばら肉）を使ったスープご飯です。通常、煮込んだカルビをスープにし、白ご飯を加えて煮込むことで、風味豊かな料理に仕上げます。スープは辛さの調整が可能で、コチュジャンや唐辛子粉を使って辛味を加えることが一般的です。\\n\\nカルビクッパは、栄養バランスが良く、温かいので、寒い季節にもぴったりの一品です。さまざまな野菜や豆腐を加えることもあり、アレンジが楽しめる料理でもあります。韓国では、特に居酒屋や家庭料理として親しまれています。', "
 "'session_id': '86e4389a-02ba-4f47-89e5-1134f2c44eaf', 'release': "
 "'0.0.1-SNAPSHOT', 'version': None, 'user_id': None, 'metadata': None, "
 "'tags': ['app'], 'public': False, 'html_path': "
 "'/project/pj-1234567890/traces/0fd4b94d-bb90-488a-90dd-2531091976f0', "
 "'latency': 3.800999879837036, 'total_cost': 0.00012465, 'observations': "
 "['1c49197d-7981-4633-b86f-0ebb7694b4cd', "
 "'5d097f0a-8f8d-4fcf-9c2d-bebde23bcff9', "
 "'99ed2c36-02f4-4778-a8eb-ffca3e81f2f3', "
 "'ad64b6ae-6255-4011-8587-10dbea3c

Implement the evaluation function. Here, an LLM classifies the user's input prompt as 否定的 (negative), 中立的 (neutral), or 肯定的 (positive).

In [10]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

fallback_prompt = """
以下の入力テキストを”否定的”、”中立的”、”肯定的”に分類してください。
また、出力は”否定的”、”中立的”、”肯定的”のみで理由などは含まないでください。

## 入力テキスト

{{input}}
"""

def sentiment_analysis(input: str) -> str:
    # Use OpenAI unless a Cohere API key is configured.
    if os.getenv("COHERE_API_KEY") is None:
        from langchain_openai.chat_models import ChatOpenAI
        openai_api_key = os.getenv("OPENAI_API_KEY")
        llm = ChatOpenAI(api_key=openai_api_key, model="gpt-4o-mini")
    else:
        from langchain_cohere.chat_models import ChatCohere
        cohere_api_key = os.getenv("COHERE_API_KEY")
        llm = ChatCohere(cohere_api_key=cohere_api_key, model="command-r-plus")
    prompt = langfuse.get_prompt(name="sentiment-analysis-prompt", fallback=fallback_prompt)
    sentiment_analysis_chain = (
        {"input": RunnablePassthrough()}
        | PromptTemplate.from_template(prompt.get_langchain_prompt())
        | llm
        | StrOutputParser()
    )
    result = sentiment_analysis_chain.invoke(input)
    return result
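
The fallback prompt writes its variable as `{{input}}`, while `PromptTemplate.from_template` expects `{input}`; `get_langchain_prompt()` is what bridges the two. A minimal sketch of that conversion, assuming only simple `{{name}}` placeholders; `mustache_to_format` is a hypothetical stand-in, and it ignores details such as escaping literal braces.

```python
import re


def mustache_to_format(template):
    """Rewrite {{name}} placeholders as {name} so str.format can fill them."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", r"{\1}", template)


template = "Classify the text.\n\n{{input}}\n"
converted = mustache_to_format(template)
print(converted.format(input="カルビクッパ"))
```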

Run sentiment analysis on the input prompts.

In [11]:
# Map each sentiment label returned by the LLM to a numeric score.
score_map = {
    "否定的": 0,
    "中立的": 0.5,
    "肯定的": 1
}

def execute_sentiment_analysis():
    for trace in traces.data:
        # Strip surrounding whitespace so the label matches the score_map keys.
        result = sentiment_analysis(input=trace.input).strip()
        pprint({"input": trace.input, "result": result})
        # Score the trace itself; a trace ID is not an observation ID.
        langfuse.score(
            name="llm-as-a-judge-sentiment-analysis",
            trace_id=trace.id,
            value=score_map.get(result, 0.5),
            comment=result
        )

execute_sentiment_analysis()

Giving up fetch_prompts(...) after 2 tries (langfuse.api.resources.commons.errors.not_found_error.NotFoundError: status_code: 404, body: {'message': "Prompt not found: 'sentiment-analysis-prompt' with label 'production'", 'error': 'LangfuseNotFoundError'})
Error while fetching prompt 'sentiment-analysis-prompt-label:production': status_code: 404, body: {'message': "Prompt not found: 'sentiment-analysis-prompt' with label 'production'", 'error': 'LangfuseNotFoundError'}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.11/site-packages/langfuse/client.py", line 1110, in _fetch_prompt_and_update_cache
    prompt_response = fetch_prompts()
                      ^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/backoff/_sync.py", line 105, in retry
    ret = target(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/langfuse/client.py", line 1099, in fetch_prompts
    return self.client.prompts.get(
           ^^^

{'input': 'カルビクッパ', 'result': '中立的'}


{'input': 'カルビクッパを食べたい', 'result': '肯定的'}


{'input': 'カルビクッパを食べたい', 'result': '肯定的'}
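
The label-to-score step depends on the LLM returning exactly one of the three labels. A minimal sketch of a more forgiving lookup, assuming the model may wrap the label in whitespace, quotes, or a trailing period; `label_to_score` and `SCORE_MAP` are hypothetical names, and unknown labels fall back to the neutral score as in the cell above.

```python
SCORE_MAP = {"否定的": 0.0, "中立的": 0.5, "肯定的": 1.0}


def label_to_score(raw, default=0.5):
    """Normalize an LLM-produced sentiment label and map it to a score."""
    # Remove whitespace, then surrounding quotes, brackets, and periods.
    cleaned = raw.strip().strip("\"'”“「」。.")
    return SCORE_MAP.get(cleaned, default)


print(label_to_score(" 肯定的\n"))  # 1.0
print(label_to_score("”否定的”"))   # 0.0
print(label_to_score("unknown"))    # 0.5
```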
