The aim of this course is to Find metrics to Evaluate LLMs 
<img source="">
[](./img/metrics.png)

This code will:
1. Explore the dataset of LLM prompts and responses named **chats.csv** that we’ll use throughout this course.
2. Get a fast demo overview of all the techniques showcased in greater detail in later lessons.



# Agenda 

1. **Introduction and Dataset Exploration**
   - Initially, the code aims to import necessary libraries and explore a dataset containing prompts and responses from a large language model (LLM). The dataset is provided in a CSV file named **chats.csv**. The first few rows of the dataset are displayed to get an overview of its structure and contents.

2. **Setup and explore whylogs and langkit**
   - The code then sets up tools for analyzing the dataset. It uses `whylogs` for logging and analyzing dataset statistics and `langkit` for specific metrics related to large language models. The dataset is logged using these tools to track various aspects like prompt-response relevance.

3. **Analyzing Prompt-Response Relevance**
   - The relevance of the LLM's responses to the prompts is analyzed. This involves using custom functions to visualize and identify how relevant each response is to its corresponding prompt.

4. **Data Leakage Analysis**
   - The code performs an analysis to detect any patterns in the prompts and responses that might lead to data leakage, which could compromise the model's integrity or reveal sensitive information.

5. **Toxicity Analysis**
   - To ensure the content's appropriateness, the code analyzes the dataset for toxicity in both prompts and responses.

6. **Injection Analysis**
   - This part of the code checks for any injections in the dataset. Injections can be malicious inputs or errors that could potentially harm the model or skew the analysis.

7. **Evaluating Specific Cases**
   - Finally, the code filters and evaluates specific cases within the dataset. For instance, it looks at responses containing apologies or prompts exceeding a certain length, assessing these subsets for quality or other criteria.



# Introduction and Dataset Exploration

Next, we load a CSV file into a DataFrame. A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

In [4]:
import helpers  # This is likely a custom module for specific helper functions.
import pandas as pd  # Pandas is a popular data manipulation library.

chats = pd.read_csv("./chats.csv")  # Loading the dataset from 'chats.csv' into 'chats' DataFrame.
chats.head(5)  # Displaying the first 5 rows of the DataFrame for a quick preview.

pd.set_option('display.max_colwidth', None)  # Setting pandas to display full content of each column.
chats.head(5)  # Displaying the first 5 rows again, this time showing full column content.


ModuleNotFoundError: No module named 'helpers'

# Setup and explore whylogs and langkit

Setup and Explore whylogs and langkit: These are tools for logging and analyzing data.

In [None]:
import whylogs as why  # Importing the whylogs library for logging dataset statistics.
why.init("whylabs_anonymous")  # Initializing whylogs with an anonymous configuration.

from langkit import llm_metrics  # Importing a module from langkit for LLM metrics.
schema = llm_metrics.init()  # Initializing a metrics schema for logging.

result = why.log(chats, name="LLM chats dataset", schema=schema)  # Logging the chats dataset with the defined schema.


ModuleNotFoundError: No module named 'whylogs'

# Analyzing Prompt-Response Relevance

Using custom functions to visualize and analyze the relevance of responses to prompts.

In [None]:
from langkit import input_output  # Importing a module from langkit for input-output analysis.

# Visualizing a specific metric related to the relevance of responses to prompts.
helpers.visualize_langkit_metric(chats, "response.relevance_to_prompt")

# Displaying critical queries regarding response relevance.
helpers.show_langkit_critical_queries(chats, "response.relevance_to_prompt")

# Data Leakage Analysis

Identifying patterns in prompts and responses that might lead to data leakage.

In [None]:
from langkit import regexes  # Importing a module for regex operations.

# Visualizing metrics to identify patterns in prompts and responses.
helpers.visualize_langkit_metric(chats, "prompt.has_patterns")
helpers.visualize_langkit_metric(chats, "response.has_patterns")

# Toxicity Analysis

Evaluating prompts and responses for toxic content.

In [None]:
from langkit import toxicity  # Importing a module for toxicity analysis.

# Visualizing metrics related to toxicity in prompts and responses.
helpers.visualize_langkit_metric(chats, "prompt.toxicity")
helpers.visualize_langkit_metric(chats, "response.toxicity")

# Injection Analysis
Checking for injections in the data, which can be malicious inputs or errors.

In [None]:
from langkit import injections  # Importing a module for injection analysis.

# Visualizing metrics related to injections in the data.
helpers.visualize_langkit_metric(chats, "injection")
helpers.show_langkit_critical_queries(chats, "injection")


# Evaluating Specific Cases
Filtering and evaluating the dataset based on specific criteria.

In [None]:
helpers.evaluate_examples()  # Evaluating some examples from the dataset.

# Filtering chats where the response contains 'Sorry' and evaluating them.
filtered_chats = chats[chats["response"].str.contains("Sorry")]
helpers.evaluate_examples(filtered_chats)

# Filtering chats where the prompt length is greater than 250 characters and evaluating them.
filtered_chats = chats[chats["prompt"].str.len() > 250]
helpers.evaluate_examples(filtered_chats)