# Open-ended dataset conversation starters for regression testing

The dataset we use to sample conversation starters is available at:
https://huggingface.co/datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k

This notebook reads the dataset, samples 100 random rows, and extracts the second user message from each conversation. It then strips leading and trailing double quotes, removes backslashes, and saves the cleaned messages to a CSV file, one message per line.


In [1]:
import polars as pl

splits = {
    "train_sft": "data/train_sft-00000-of-00001.parquet",
    "test_sft": "data/test_sft-00000-of-00001.parquet",
}
df = pl.read_parquet(
    "hf://datasets/HuggingFaceTB/everyday-conversations-llama3.1-2k/"
    + splits["train_sft"]
)

# Flatten the user messages of all conversations into one list and show the second one
user_msgs = [
    msg["content"] for row in df["messages"] for msg in row if msg.get("role") == "user"
]
if len(user_msgs) > 1:
    print(user_msgs[1])
else:
    print("Less than two user messages found.")

I'm trying to track my expenses. Can you help me with that?


In [2]:
def remove_quotes_strip(text: str) -> str:
    """Remove leading and trailing double quotes.

    Args:
        text: Input string that may have quotes

    Returns:
        String with quotes stripped
    """
    return text.strip('"')

In [3]:
def remove_backslash(text: str) -> str:
    """Remove backslashes from a string.

    Args:
        text: Input string that may have backslashes

    Returns:
        String with backslashes removed
    """
    return text.replace("\\", "")
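
A quick check of how the two helpers compose may be useful here. The sketch below redefines them so that it runs on its own; `clean_message` and the sample string are hypothetical, introduced only for illustration.

```python
def remove_quotes_strip(text: str) -> str:
    """Remove leading and trailing double quotes."""
    return text.strip('"')


def remove_backslash(text: str) -> str:
    """Remove backslashes from a string."""
    return text.replace("\\", "")


def clean_message(text: str) -> str:
    """Hypothetical helper: strip outer quotes first, then drop backslashes."""
    return remove_backslash(remove_quotes_strip(text))


# Hypothetical raw value: wrapped in double quotes, with escaped inner quotes
raw = '"Can you say \\"hi\\" in French?"'
cleaned = clean_message(raw)
print(cleaned)
```

Note that `str.strip('"')` removes every leading and trailing double quote, not just one pair, and leaves inner quotes untouched.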

In [4]:
# Pick 100 random rows (fixed seed for reproducibility)
sdf = df.sample(n=100, seed=8)

# Collect the cleaned second user message from each sampled conversation
second_user_msgs = []
for row in sdf.iter_rows(named=True):
    user_msgs = [msg["content"] for msg in row["messages"] if msg.get("role") == "user"]
    if len(user_msgs) > 1:
        # Strip outer quotes and backslashes before storing the message
        second_user_msgs.append(remove_backslash(remove_quotes_strip(user_msgs[1])))
    else:
        # No second user message: keep a None placeholder instead of
        # passing None to the string helpers, which would raise an error
        second_user_msgs.append(None)

In [5]:
second_user_msgs

["I'm putting together a basic first aid kit. What are some essential items I should include?",
 "I'm looking for some new music to listen to. What's a good movie soundtrack?",
 "I'm looking for a new laptop, can you tell me about some popular e-commerce platforms where I can buy one?",
 "I'm learning about the water cycle in school. What is transpiration?",
 "I'm trying to get to the city center. What's the fastest way to get there?",
 "I'm looking for some new recipes to try out, do you have any family recipes you can share?",
 "I'm thinking of buying a bike to ride to work. Are there bike lanes in my city?",
 "I'm looking for ways to make my home more energy-efficient. What are some good options for appliances?",
 'What season is it right now?',
 "I'm getting ready for summer and I want to know what I should do to get my home ready.",
 "I'm going to a meeting in the city center, but I'm worried about traffic congestion. What's the current traffic situation like?",
 "I'm planning a v

In [6]:
from typing import List, Optional


def save_list_to_csv(data: List[Optional[str]], filename: str) -> None:
    """Save a list of strings to a text file, one item per line.

    Args:
        data: List of strings to save
        filename: Output filename
    """
    with open(filename, "w", newline="", encoding="utf-8") as file:
        for item in data:
            if item:  # Skip None and empty strings
                file.write(item + "\n")
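
To see which entries the writer keeps, here is a minimal round-trip sketch. It redefines the function so that it runs on its own and writes to a temporary directory; the sample data and the file name `starters.csv` are hypothetical.

```python
import os
import tempfile
from typing import List, Optional


def save_list_to_csv(data: List[Optional[str]], filename: str) -> None:
    """Save a list of strings to a text file, one item per line."""
    with open(filename, "w", newline="", encoding="utf-8") as file:
        for item in data:
            if item:  # Skip None and empty strings
                file.write(item + "\n")


# Hypothetical sample data, including entries that should be skipped
samples = ["What season is it right now?", None, "", "Is it going to rain today?"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "starters.csv")  # hypothetical file name
    save_list_to_csv(samples, path)
    # Read the file back as plain lines to inspect what was written
    with open(path, encoding="utf-8") as file:
        lines = file.read().splitlines()

print(lines)
```

Because the format is one item per line, a message that contains a newline would come back as two separate lines.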

In [7]:
from hpms.loading.constants import DATASET_3_CONVERSATION_STARTERS

save_list_to_csv(second_user_msgs, DATASET_3_CONVERSATION_STARTERS)

In [8]:
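# Load the saved conversation starters with polars; with "|" as the separator,
# commas inside a message do not split it into several columns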
newdf = pl.read_csv(DATASET_3_CONVERSATION_STARTERS, has_header=False, separator="|")

# Print number of unique messages
print(f"Number of unique messages: {newdf.n_unique()}")

# Print any messages that occur more than once
print(newdf.get_column("column_1").value_counts().filter(pl.col("count") > 1))
newdf

Number of unique messages: 100
shape: (0, 2)
┌──────────┬───────┐
│ column_1 ┆ count │
│ ---      ┆ ---   │
│ str      ┆ u32   │
╞══════════╪═══════╡
└──────────┴───────┘


column_1
str
"""I'm putting together a basic f…"
"""I'm looking for some new music…"
"""I'm looking for a new laptop, …"
"""I'm learning about the water c…"
"""I'm trying to get to the city …"
…
"""I want to start meal planning …"
"""I need some help with meal pla…"
"""I'm flying to New York tomorro…"
"""I need help planning a meal fo…"

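
The uniqueness check above can also be cross-checked without polars. This is a minimal sketch using only the standard library, assuming the file holds one message per line; `find_duplicates` and the sample lines are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def find_duplicates(lines: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper: map each repeated line to its number of occurrences."""
    counts = Counter(lines)
    return {line: n for line, n in counts.items() if n > 1}


# Hypothetical sample lines; in practice these would be read from the saved file
sample_lines = [
    "What season is it right now?",
    "Is it going to rain today?",
    "What season is it right now?",
]

unique_count = len(set(sample_lines))
duplicates = find_duplicates(sample_lines)
print(unique_count, duplicates)
```

An empty dictionary corresponds to the empty `value_counts` table shown in the output above.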