<div style="border-left: 4px solid #4a90d9; padding-left: 12px;">

### 📂 Load Simulation Data

Load raw JSON simulation files and convert them into a structured DataFrame.

**Functions:** [`load_json_files`](../src/symtrain/data/loader.py) · [`create_transcript_dataframe`](../src/symtrain/data/loader.py)

</div>


In [2]:
import sys
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"
sys.path.insert(0, "../src")

# Load simulation data
from symtrain.data import load_json_files, create_transcript_dataframe

json_list = load_json_files("../data/raw")
df = create_transcript_dataframe(json_list)
df

| | name | transcript |
|---|---|---|
| 0 | Copy OSA - Parity 2.0 1 | TRAINEE: Thank you for calling Northwestern Mu... |
| 1 | Return Request (Bed Frame/Toppers/Misc) | TRAINEE: Thank you for calling Zinus, this is ... |
| 2 | Startek_Compassion- Gift Donation. BEG | SYM: In this simulation, you will assist the c... |
| 3 | Sales U - Closing Conclusion | SYM: It’s closing time for this module.\nSYM: ... |
| 4 | Startek_Telco_Visual_BEG_Demo | TRAINEE: Thank you for calling Xfinity Mobile,... |
| 5 | Arise_Carnival DS - Sales Inquiries - Booking ... | TRAINEE: Carnival Fun Ships, this is [Agent Na... |
| 6 | H21 Health insurance coverage question – Step ... | SYM: In this simulation, you will learn how to... |
| 7 | TRAINEE STARTS HERE: How did you do? View your... | SYM: Congratulations on running your first sym... |
| 8 | H21 Health insurance coverage question – Step ... | SYM: You have already opened the call and veri... |
| 9 | UCSL eApp – State Specifics | SYM: In this sym, we will review how the UCSL ... |
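
The JSON schema of the raw files is not shown in this notebook, so the following is only a sketch of how such files could be flattened into `name`/`transcript` rows. The field names `name`, `turns`, `speaker`, and `text`, and the helpers `load_json_dir` and `to_rows`, are hypothetical and are not taken from `symtrain.data`.

```python
import json
from pathlib import Path


def load_json_dir(directory):
    """Read every *.json file in `directory` (sorted by file name) into a list of dicts."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            records.append(json.load(fh))
    return records


def to_rows(records):
    """Flatten each record into a name plus a 'SPEAKER: text' transcript (assumed schema)."""
    rows = []
    for rec in records:
        lines = [f"{turn['speaker']}: {turn['text']}" for turn in rec.get("turns", [])]
        rows.append({"name": rec.get("name", ""), "transcript": "\n".join(lines)})
    return rows
```

Under these assumptions, passing `to_rows(load_json_dir("../data/raw"))` to a DataFrame constructor would yield the same two columns as the output above.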


<div style="border-left: 4px solid #4a90d9; padding-left: 12px;">

### 🔢 Generate Embeddings

Create a 768-dimensional embedding vector for each transcript using DistilBERT.

**Functions:** [`embed_dataframe_column`](../src/symtrain/embeddings/distilbert.py)

</div>


In [3]:
# Generate embeddings for all transcripts
from symtrain.embeddings import embed_dataframe_column

df = embed_dataframe_column(df, "transcript")

Generated 38 embeddings → 'transcript_emb' (dim=768)
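
DistilBERT emits one vector per token, so producing a single vector per transcript requires a pooling step. How `embed_dataframe_column` pools is not stated here. Mean pooling over non-padding tokens is one common choice, sketched below in pure Python on toy 3-dimensional vectors. The helper `mean_pool` is hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three real tokens and one padding token, 3 dimensions each.
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)  # [2.0, 2.0, 1.0]; the padding row is excluded
```

With a real model the same averaging would run over 768 dimensions per token.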


<div style="border-left: 4px solid #4a90d9; padding-left: 12px;">

### 🎯 Transformer-Based Clustering

Group simulations into five clusters by running K-means on the transcript embeddings.

**Functions:** [`cluster_embeddings`](../src/symtrain/clustering/kmeans.py) · [`print_cluster_summary`](../src/symtrain/clustering/kmeans.py)

</div>


In [4]:
# Approach 1: Transformer-based clustering
from symtrain.clustering import cluster_embeddings, print_cluster_summary

df = cluster_embeddings(df, "transcript_emb", n_clusters=5)
print_cluster_summary(df, "name", "transcript_emb_cluster")


=== Cluster 0 ===
  - Sales U - Closing Conclusion
  - TRAINEE STARTS HERE: How did you do? View your Playback!
  - UCSL eApp – State Specifics

=== Cluster 1 ===
  - Startek_Compassion- Gift Donation. BEG
  - Startek_Telco_Visual_BEG_Demo 
  - Arise_Carnival DS - Sales Inquiries - Booking created ~ BEG

=== Cluster 2 ===
  - H21 Health insurance coverage question – Step 1 (Opening)
  - H21 Health insurance coverage question – Step 2 (Gather Information)
  - H21 Health insurance coverage question – (Real human voice)

=== Cluster 3 ===
  - Zinus Knowledge Check 6 Version 2

=== Cluster 4 ===
  - Copy OSA - Parity 2.0 1
  - Return Request (Bed Frame/Toppers/Misc)
  - OSA - Parity 2.0
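
K-means alternates between two steps: assign every vector to its nearest centroid, then move each centroid to the mean of its members. The sketch below implements that loop (Lloyd's algorithm) in pure Python on toy 2-D points. It illustrates the technique only. The internals of `cluster_embeddings` are not shown in this notebook, and `kmeans` here is a hypothetical stand-in.

```python
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Two well-separated blobs of three points each.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = kmeans(points, k=2)
```

Cluster numbers are arbitrary: only the grouping is meaningful, which is why the summary above is best read per group rather than by cluster index.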


<div style="border-left: 4px solid #4a90d9; padding-left: 12px;">

### 🤖 LLM-Based Categorization

Use Ollama to generate natural language categories for each transcript.

**Functions:** [`categorize_with_ollama`](../src/symtrain/llm/categorize.py)

</div>


In [5]:
# Approach 2: LLM-based categorization
from symtrain.llm import categorize_with_ollama

# Apply to dataframe
df["llm_category"] = df["transcript"].apply(categorize_with_ollama)
df[["name", "llm_category"]]

| | name | llm_category |
|---|---|---|
| 0 | Copy OSA - Parity 2.0 1 | Insurance Claims |
| 1 | Return Request (Bed Frame/Toppers/Misc) | Returns |
| 2 | Startek_Compassion- Gift Donation. BEG | Account Issues |
| 3 | Sales U - Closing Conclusion | Sales Training |
| 4 | Startek_Telco_Visual_BEG_Demo | Account Issues |
| 5 | Arise_Carnival DS - Sales Inquiries - Booking ... | Booking |
| 6 | H21 Health insurance coverage question – Step ... | Insurance Claims |
| 7 | TRAINEE STARTS HERE: How did you do? View your... | Training & Onboarding |
| 8 | H21 Health insurance coverage question – Step ... | Insurance Claims |
| 9 | UCSL eApp – State Specifics | Insurance Application |
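
Free-text model replies often carry quotes, trailing punctuation, or an explanation after the label, so a categorization step usually pairs a constrained prompt with some cleanup. The sketch below shows one way to do that. The prompt wording and the names `build_category_prompt`, `clean_label`, and `categorize` are assumptions, not the contents of `categorize_with_ollama`. The model call is passed in as a plain callable so the example runs without an Ollama server.

```python
def build_category_prompt(transcript, max_chars=2000):
    """Ask for a short category label; long transcripts are truncated."""
    snippet = transcript[:max_chars]
    return (
        "Assign one short category (2-3 words) to this call-center "
        "training transcript. Reply with the category only.\n\n"
        f"Transcript:\n{snippet}\n\nCategory:"
    )


def clean_label(raw):
    """Keep the first line of the reply and strip quotes, periods, and spaces."""
    lines = raw.strip().splitlines()
    first_line = lines[0] if lines else ""
    return first_line.strip(" \"'.")


def categorize(transcript, generate):
    """`generate` is any callable that maps a prompt string to the model's reply."""
    return clean_label(generate(build_category_prompt(transcript)))


def fake_llm(prompt):
    """Stand-in for a real model call, used only to make the sketch runnable."""
    return '"Returns."\nThe customer wants to return an item.'


label = categorize("TRAINEE: Thank you for calling...", fake_llm)  # 'Returns'
```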


<div style="border-left: 4px solid #4a90d9; padding-left: 12px;">

### 💾 Export Processed Data

Save the enriched DataFrame with embeddings, clusters, and categories to CSV.

</div>


In [6]:
df.to_csv("../data/processed/df.csv", index=False)
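
CSV stores every cell as text, so a vector-valued column such as `transcript_emb` does not come back as numbers when the file is reloaded. One way to keep it recoverable is to serialize each vector as JSON before writing and parse it on read. The sketch below shows that round trip with the standard `csv` and `json` modules on toy 3-dimensional vectors. The row contents are made up for illustration, and the same encode/decode step could be applied to a DataFrame column.

```python
import csv
import io
import json

rows = [
    {"name": "sim_a", "transcript_emb": [0.1, 0.2, 0.3]},
    {"name": "sim_b", "transcript_emb": [0.4, 0.5, 0.6]},
]

# Write: encode each vector as a JSON string so it occupies a single CSV cell.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "transcript_emb"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "transcript_emb": json.dumps(row["transcript_emb"])})

# Read: decode the JSON string back into a list of floats.
buffer.seek(0)
restored = [
    {**row, "transcript_emb": json.loads(row["transcript_emb"])}
    for row in csv.DictReader(buffer)
]
```

A binary format that preserves array columns would avoid the parsing step altogether, at the cost of a file that is no longer human-readable.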

<div style="border-left: 4px solid #4a90d9; padding-left: 12px;">

### 🔍 Similarity Search

Find simulations most similar to a query using cosine similarity.

**Functions:** [`find_similar`](../src/symtrain/search/similarity.py)

</div>


In [7]:
# Similarity search
from symtrain.search import find_similar

# Example: Find simulations related to payment issues
find_similar(
    df,
    "transcript_emb",
    """Hi, I ordered a shirt last week and paid with my American Express card. I need to update the
payment method because there is an issue with that card. Can you help me?""",
)

| | name | similarity |
|---|---|---|
| 36 | BN_CANCEL_RELEASED_ORDER_ADV | 0.895486 |
| 37 | C2 Car insurance claim - FNOL – Step 1 (Opening) | 0.879906 |
| 23 | Startek_Walmart - Demo - 2025 | 0.877854 |
| 16 | BN_ORDER_LOST_REPLACE_ADV | 0.877653 |
| 6 | H21 Health insurance coverage question – Step ... | 0.877434 |
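
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction and ignores magnitude. The sketch below ranks a toy corpus against a query vector in pure Python. The helpers `cosine_similarity` and `top_k` are illustrative and are not the implementation of `find_similar`, which presumably also embeds the query text first.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query_vec, named_vectors, k=5):
    """Return the k (name, score) pairs with the highest cosine similarity."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in named_vectors]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


corpus = [("refund", [1.0, 0.0]), ("claim", [0.0, 1.0]), ("order", [1.0, 1.0])]
ranked = top_k([1.0, 0.2], corpus, k=2)  # "refund" first, then "order"
```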


<div style="border-left: 4px solid #4a90d9; padding-left: 12px;">

### 🧠 Few-Shot Learning Pipeline

Test step generation on six sample customer requests using RAG-based few-shot prompting.

**Functions:** [`generate_steps_with_ollama`](../src/symtrain/llm/few_shot.py)

</div>
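
In RAG-based few-shot prompting, the stored simulations most similar to the request are retrieved and placed in the prompt as worked examples before the model is asked for its answer. The sketch below covers only prompt assembly and reply validation, and assumes retrieval has already happened. The names `build_few_shot_prompt` and `parse_reply`, the example fields `request` and `answer`, and the prompt wording are hypothetical. The keys `category`, `reason`, and `steps` mirror the output printed by this notebook.

```python
import json


def build_few_shot_prompt(query, examples):
    """Place retrieved examples ahead of the new request and ask for JSON."""
    parts = [
        "You write agent procedures for customer requests.",
        'Reply with JSON containing the keys "category", "reason", and "steps".',
        "",
    ]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Example {i} request: {ex['request']}")
        parts.append(f"Example {i} answer: {json.dumps(ex['answer'])}")
        parts.append("")
    parts.append(f"Request: {query}")
    parts.append("Answer:")
    return "\n".join(parts)


def parse_reply(reply):
    """Return the parsed reply, or None if it is not a JSON object with the expected keys."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not {"category", "reason", "steps"} <= set(data):
        return None
    return data


examples = [{
    "request": "I want to return a bed frame.",
    "answer": {"category": "Returns", "reason": "Customer requests a return.",
               "steps": ["Verify the order.", "Issue a return label."]},
}]
prompt = build_few_shot_prompt("Hi, can you help me file a claim?", examples)
```

Validating the reply before storing it guards against the model returning prose or partial JSON instead of the requested object.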


In [9]:
# Task 5: Few-shot learning pipeline
import json
from symtrain.llm import generate_steps_with_ollama

test_inputs = {
    "test_1": "Hi, I ordered a shirt last week and paid with my American Express card. I need to update the payment method because there is an issue with that card. Can you help me?",
    "test_2": "Hi, I need to update the payment method for one of my recent orders. Can you help me with that?",
    "test_3": "Hi, I am Sam. I was in a car accident this morning and need to file an insurance claim. Can you help me?",
    "test_4": "Hi, can you help me file a claim?",
    "test_5": "Hi, I recently ordered a book online. Can you give me an update on the order status?",
    "test_6": "Hi, I have been waiting for two weeks for the book I ordered. What is going on with it? Can you give me an update?",
}

# Run on all test inputs
results = {}
for test_id, query in test_inputs.items():
    results[test_id] = generate_steps_with_ollama(query, df)

# Display results
for test_id, result in results.items():
    print(f"\n=== {test_id} ===")
    print(json.dumps(result, indent=2))


=== test_1 ===
{
  "category": "Payment Issues",
  "reason": "The customer wants to change the payment method for a recent order due to a problem with their current card.",
  "steps": [
    "Greet the customer and verify their identity (name, address, order number, or email).",
    "Locate the specific order in the system using the provided order details.",
    "Confirm the current payment method (American Express) and note the issue reported.",
    "Ask for the new payment method details (card type, number, expiration date, CVV) and verify that it is an accepted form of payment.",
    "Update the payment method in the order management system, ensuring the new card is set as the primary payment.",
    "If the order has already been charged to the old card, initiate a refund to that card and re\u2011process the payment on the new card, or apply the new payment method to any pending charges.",
    "Provide the customer with a confirmation number or updated order summary reflecting the n