# Week 4 - Systematically Improving Your RAG Application

> **Prerequisites**: Please make sure that you've run `1. Generate Dataset.ipynb` to generate the dataset that we'll be using in this notebook. It will also help you get familiar with the data that we're working with in this case study.

In this notebook, we'll look at how we can use topic modeling to identify clusters of related queries that we can then manually inspect.

## Why This Matters

When handling thousands of daily user queries, manually reviewing each conversation isn't practical. Instead of guessing what users are asking about, topic modeling helps us automatically group similar questions together. 

By understanding these patterns and combining them with data like query volume and user satisfaction, we can prioritize improvements where they matter most. This is especially true when we have domain experts who can help us validate our topics and come up with explicit categories for them.

## What You'll Learn

Through this hands-on analysis of synthetic Klarna support queries, you'll discover how to:

1. Set Up Topic Modeling

- Configure BERTopic for query analysis
- Choose appropriate clustering parameters
- Prepare data for pattern discovery


2. Analyze Query Clusters

- Identify common themes in user questions
- Measure topic importance by volume
- Find patterns in user frustration


3. Act on Insights

- Target high-impact improvements
- Monitor topic changes over time
- Validate topic quality

By the end of this notebook, you'll have a good understanding of how to use the `BERTopic` library to generate topics, manually inspect some of them, and then use them to improve your RAG system. It's important to stress that topic modeling is just a way to arrive at these explicit categories. You should always validate the topics with a domain expert to make sure they cover everything you need.


## Generating Topics with BERTopic


We use BERTopic because it provides a few benefits:

- It has a modular architecture that allows us to swap out embedding models, clustering algorithms and dimensionality reduction techniques easily
- It has built-in visualisation and analysis tools
- It offers a large number of extensions that we can use to guide the model towards the topics that we care about



In this section, we'll walk you through how to generate topics from our dataset. In order to ensure our analysis is reproducible, we'll be fixing the random state of the `UMAP` algorithm as seen below.

### Generating Our Topics

When generating topics, you might get different results from the ones below because the dimensionality reduction and clustering algorithms are stochastic. Operating systems add further variation because of differences in how specific calculations and randomness are implemented, among other factors.

To get around this issue, we'll export the results to a single `.csv` file a few cells down so that we can load it for data analysis. Each row will have:

- `question` : This is the user query that we've generated previously
- `Topic` : This is the topic id that the query has been assigned to
- `answer` : This is the generated answer that we created
- `category` : This is the category of the question
- `subcategory` : This is the specific subcategory
- `satisfied` : This is a binary value representing whether the user indicated that their query was answered. We'll simulate it further down.

This ensures that when we start thinking about the categories for our classifier down the line, the analysis is reproducible on your local machine and you can get a sense of what topic analysis looks like. We've tried to limit the variation by fixing the parameters for UMAP and HDBSCAN as seen below.

Because the clusters can shift between runs, topic modeling should be used as a way to discover explicit categories of user queries that we're handling badly rather than as an absolute ground-truth label.

In [3]:
import json
import pandas as pd

with open("./data/cleaned.jsonl", "r") as f:
    questions = [json.loads(line) for line in f]
    docs = [item["question"] for item in questions]

df = pd.DataFrame(questions)
df.drop(columns=["citations"], inplace=True)
df.drop(columns=["sources"], inplace=True)
df.head()

Unnamed: 0,question,answer,category,subcategory
0,"How can I close my Klarna account, and what ha...","To close your Klarna account, you must settle ...",Account & settings,Manage account
1,What happens after I report my item being retu...,"When a return is reported, the invoice is paus...",Delivery & returns,Returns
2,Why was my payment declined for klarna today? ...,A person might be unable to complete a purchas...,Declined purchase,Declined Purchase
3,why do I always need to verify my banking deta...,Customers need to verify their details to ensu...,Fraud & security,Data protection
4,any issues using the virtual vs the physical k...,Your virtual card will be available in the app...,Products & services,Klarna Card


In [4]:
from hdbscan import HDBSCAN
from umap import UMAP
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Configure embedding model for better representation
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

umap_model = UMAP(
    n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=30
)

# HDBSCAN parameters for clustering
# min_cluster_size: minimum number of documents needed to form a cluster
# min_samples: controls how conservative clustering is (higher means more outliers)
# metric: distance metric applied to the UMAP-reduced embeddings
# cluster_selection_method: "eom" (excess of mass) tends to favour larger clusters
hdbscan_model = HDBSCAN(
    min_cluster_size=10,
    min_samples=8,
    metric="euclidean",
    cluster_selection_method="eom",
)

# BERTopic parameters
topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    nr_topics="auto",
)

# Fit the model and transform documents to topics
topics, probs = topic_model.fit_transform(docs)

# Display topic information
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,47,-1_for_to_klarna_the,"[for, to, klarna, the, this, but, my, can, car...",[I paid $200 for a Nike jacket from Macy's two...
1,0,36,0_klarna_declined_my_in,"[klarna, declined, my, in, ve, for, was, used,...",[Tried buying a $150 pair of Nike sneakers 2 d...
2,1,31,1_payment_to_the_my,"[payment, to, the, my, for, pay, due, card, is...",[I ordered a pair of Nike sneakers for $120 th...
3,2,26,2_my_klarna_details_and,"[my, klarna, details, and, account, email, cal...",[I got a call from someone who claimed to be a...
4,3,15,3_store_the_it_my,"[store, the, it, my, refund, back, ago, days, ...",[Ordered a Samsung Galaxy Tablet three days ag...
5,4,13,4_ago_the_shoes_what,"[ago, the, shoes, what, days, sneakers, ordere...",[I ordered a pair of Adidas sneakers costing $...


In [3]:
# Get representative docs and sample items for each topic
for topic_id in topic_model.get_topic_info().Topic:
    if topic_id == -1:
        continue

    print(f"Topic {topic_id}:")

    # Get representative docs for this specific topic
    topic_docs = topic_model.get_representative_docs(topic_id)

    for i, doc in enumerate(topic_docs, 1):
        print(f"{i}. {doc}")

    # Get sample items from this topic using get_document_info
    document_info = topic_model.get_document_info(docs)
    topic_items = document_info[document_info["Topic"] == topic_id]["Document"].tolist()
    if topic_items:
        print("\nSample items:")
        # Get up to 10 sample items
        sample_items = topic_items[:10]

        for i, item in enumerate(sample_items, 1):
            print(f"{i}. {item}")

    print("\n" + "-" * 50 + "\n")

Topic 0:
1. Tried buying a $150 pair of Nike sneakers 2 days ago with Klarna, but it was declined. I've used Klarna before without an issue. What's going on?
2. Klarna declined my purchase of a ¥150,000 designer kotatsu from MUJI in Kyoto. Very disappointed as I've used Klarna successfully at Tokyu Hands in Ikebukuro before. Winter is coming and I need this for my apartment in Nakano. Can you explain why my payment was rejected despite my good payment history?
3. I'm really confused about why Klarna declined my Nike Air Max 270 purchase for €140. I've used Klarna many times before here in Germany with no issues. I followed all the verification steps in the popup, but still can't complete my order on Nike.de.

Sample items:
1. Why was my payment declined for klarna today? It's worked without any issues before though
2. any spots in downtown Tribeca that I can use klarna at? Need to pick up a new gift 
3. Can I use Klarna for my monthly Vogue magazine subscription?
4. any stores that acc

If we manually review these topics, clear patterns emerge in user queries:

- Topic 0: Users experiencing unexpectedly declined payments despite having used Klarna successfully before or having sufficient funds.

- Topic 1: Users requesting to modify payment methods, schedules, or installment plans after making a purchase.

- Topic 2: Users reporting suspicious activities, potential fraud, or seeking help with account security and privacy concerns.

- Topic 3: Users facing difficulties with refunds for returned items, including delays and confusion about refund methods.

- Topic 4: Users needing assistance with order fulfillment issues such as delayed deliveries, wrong items, or merchant disputes.

Let's now simulate user satisfaction scores for these topics.

## Assigning Synthetic User Satisfaction Scores

We want to look at our topics in closer detail to see if there are classes of queries that we're performing badly on. This helps us understand how well we're serving our users' queries. To do so, we'll look at each topic in terms of its query volume and user satisfaction score.

We'll generate user satisfaction scores synthetically for each topic. Each query is assigned a binary outcome of 1 or 0, representing whether the user was satisfied or not. The outcome is determined by drawing a number uniformly from [0, 1) and comparing it against a probability that we set for the query's topic.

To demonstrate different combinations of query volume and satisfaction, we've picked the probabilities below arbitrarily for each topic. Once the scores are assigned, we can segment our queries by topic and see how well we're doing on each. In a real system, the user satisfaction score is very important because it helps us understand whether the system is retrieving relevant documents to answer our users' queries.

Therefore, when it comes to building a real product, you should start collecting user satisfaction scores early so that you can use them to iteratively improve your system over time.


In [47]:
probabilities = {
    -1: 0.95,
    0: 0.9,
    1: 0.2,
    2: 0.88,
    3: 0.85,
    4: 0.3,
}
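
Because each topic only has a few dozen queries, the observed satisfaction rate per topic will drift from these probabilities. A rough way to gauge that spread is the standard error of a proportion, `sqrt(p * (1 - p) / n)`. The sketch below combines the probabilities above with the topic counts reported by `get_topic_info()`; it is a back-of-the-envelope check rather than part of the original analysis, and the variable names are illustrative.

```python
import math

# Target probabilities (from the cell above) and topic volumes (from get_topic_info()).
probabilities = {-1: 0.95, 0: 0.9, 1: 0.2, 2: 0.88, 3: 0.85, 4: 0.3}
volumes = {-1: 47, 0: 36, 1: 31, 2: 26, 3: 15, 4: 13}

# Standard error of an observed proportion: sqrt(p * (1 - p) / n).
standard_errors = {
    topic: math.sqrt(p * (1 - p) / volumes[topic])
    for topic, p in probabilities.items()
}

for topic, se in standard_errors.items():
    print(
        f"Topic {topic}: p={probabilities[topic]:.2f}, "
        f"n={volumes[topic]}, standard error ~ {se:.3f}"
    )
```

For the smallest topics the standard error is above 0.1, so an observed rate that sits noticeably away from its target probability is not surprising.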

In [48]:
topic_df = topic_model.get_document_info(docs)
topic_df.head()

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,"How can I close my Klarna account, and what ha...",2,2_my_klarna_details_and,"[my, klarna, details, and, account, email, cal...",[I got a call from someone who claimed to be a...,my - klarna - details - and - account - email ...,1.0,False
1,What happens after I report my item being retu...,3,3_store_the_it_my,"[store, the, it, my, refund, back, ago, days, ...",[Ordered a Samsung Galaxy Tablet three days ag...,store - the - it - my - refund - back - ago - ...,1.0,False
2,Why was my payment declined for klarna today? ...,0,0_klarna_declined_my_in,"[klarna, declined, my, in, ve, for, was, used,...",[Tried buying a $150 pair of Nike sneakers 2 d...,klarna - declined - my - in - ve - for - was -...,1.0,False
3,why do I always need to verify my banking deta...,2,2_my_klarna_details_and,"[my, klarna, details, and, account, email, cal...",[I got a call from someone who claimed to be a...,my - klarna - details - and - account - email ...,0.750559,False
4,any issues using the virtual vs the physical k...,-1,-1_for_to_klarna_the,"[for, to, klarna, the, this, but, my, can, car...",[I paid $200 for a Nike jacket from Macy's two...,for - to - klarna - the - this - but - my - ca...,0.0,False


In [49]:
# Join topic_df with original df on Document/question to get categories
results_df = topic_df.merge(df, left_on="Document", right_on="question", how="left")
results_df = results_df[["question", "Topic", "answer", "category", "subcategory"]]
results_df.head()

Unnamed: 0,question,Topic,answer,category,subcategory
0,"How can I close my Klarna account, and what ha...",2,"To close your Klarna account, you must settle ...",Account & settings,Manage account
1,What happens after I report my item being retu...,3,"When a return is reported, the invoice is paus...",Delivery & returns,Returns
2,Why was my payment declined for klarna today? ...,0,A person might be unable to complete a purchas...,Declined purchase,Declined Purchase
3,why do I always need to verify my banking deta...,2,Customers need to verify their details to ensu...,Fraud & security,Data protection
4,any issues using the virtual vs the physical k...,-1,Your virtual card will be available in the app...,Products & services,Klarna Card


In [58]:
import numpy as np

np.random.seed(21)

# Generate satisfaction scores based on topic probabilities
results_df["satisfied"] = results_df["Topic"].apply(
    lambda x: 1 if np.random.uniform() < probabilities[x] else 0
)

results_df

Unnamed: 0,question,Topic,answer,category,subcategory,satisfied
0,"How can I close my Klarna account, and what ha...",2,"To close your Klarna account, you must settle ...",Account & settings,Manage account,1
1,What happens after I report my item being retu...,3,"When a return is reported, the invoice is paus...",Delivery & returns,Returns,1
2,Why was my payment declined for klarna today? ...,0,A person might be unable to complete a purchas...,Declined purchase,Declined Purchase,1
3,why do I always need to verify my banking deta...,2,Customers need to verify their details to ensu...,Fraud & security,Data protection,1
4,any issues using the virtual vs the physical k...,-1,Your virtual card will be available in the app...,Products & services,Klarna Card,1
...,...,...,...,...,...,...
163,I purchased a pair of Nike Air Max sneakers fo...,1,You can view your current payment plan through...,Payments,Make & manage payments,0
164,I just purchased an iPhone 14 Pro for $999 thr...,1,You can easily pay off your Dyson vacuum early...,Payments,Make & manage payments,0
165,used my debit card for a $300 Nike purchase wi...,1,You have not been charged twice. What you're s...,Payments,Make & manage payments,0
166,Bought a Samsung Galaxy S21 for $799 on Klarna...,-1,If your payment is not registered by the last ...,Payments,Make & manage payments,1


In [59]:
round(float(results_df["satisfied"].sum() / len(results_df)), 2)

0.75

Now that we've done this analysis, let's save it to a file in our `data` folder so that you can reproduce the analysis below!

In [60]:
results_df.to_csv("./data/results_df.csv", index=False)

In [61]:
results_df = pd.read_csv("./data/results_df.csv")

## Analyzing Problematic Clusters

If we look at the overall satisfaction score, we might think that we're doing reasonably well because the average is 0.75. But we'll see that this isn't the case once we look at the topics in closer detail.

In [62]:
# Calculate mean satisfaction and volume per topic
topic_stats = (
    results_df.groupby("Topic")
    .agg(
        {
            "satisfied": lambda x: round(x.mean(), 2),
            "question": "count",  # Count number of questions per topic
        }
    )
    .rename(columns={"question": "volume"})
)

topic_stats

Topic,satisfied,volume
-1,0.98,47
0,0.97,36
1,0.16,31
2,0.92,26
3,0.93,15
4,0.15,13
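
To see how the aggregate hides the weak topics, the overall score can be recomputed as a volume-weighted average of the per-topic scores. The sketch below hardcodes the rounded values from the table above instead of reading `results_df`, so the result is approximate; the variable names are illustrative.

```python
# Per-topic (satisfied, volume) values copied from the table above.
topic_stats = {
    -1: (0.98, 47),
    0: (0.97, 36),
    1: (0.16, 31),
    2: (0.92, 26),
    3: (0.93, 15),
    4: (0.15, 13),
}

total_volume = sum(volume for _, volume in topic_stats.values())
weighted_sum = sum(score * volume for score, volume in topic_stats.values())
overall = weighted_sum / total_volume

# Share of all queries that fall into the two weakest topics (1 and 4).
weak_share = (topic_stats[1][1] + topic_stats[4][1]) / total_volume

print(f"overall satisfaction: {overall:.2f}")
print(f"share of queries in topics 1 and 4: {weak_share:.2f}")
```

Roughly a quarter of the queries sit in topics with satisfaction around 0.15, which an average of about 0.75 masks.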


Let's quickly revisit our original matrix to decide what to prioritize. 

![](./assets/matrix.png)

1. Maintain - Topic 0 fits here because it combines an above-average satisfaction score of 0.97 with the highest query volume (36). This is a successful area that should be preserved.
2. Danger Zone - Topic 1 fits here because it has very low satisfaction (0.16) but above-average volume (31), suggesting users frequently encounter payment detail management issues that are poorly addressed.
3. High ROI - Topics 2 and 3 fit here because they maintain high satisfaction scores (0.92 and 0.93) with moderate to lower volumes (26 and 15), representing areas where your approaches are working well.
4. Low ROI - Topic 4 fits here because it has both below-average satisfaction (0.15) and the lowest volume (13), suggesting order fulfillment issues may need a cost-benefit analysis to determine whether continued effort is warranted.

We don't analyse Topic -1 here because it consists of queries that couldn't be assigned to any topic, so it's best to set it aside for now.
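
The quadrant assignment above can be sketched as a small rule. The thresholds below (`VOLUME_THRESHOLD` and `SATISFACTION_THRESHOLD`) are assumptions chosen so that the rule reproduces the labels used in this notebook; they are not taken from the matrix itself, and in practice you would pick cut-offs that suit your own traffic.

```python
# Hypothetical cut-offs, chosen only to reproduce the labels used above.
VOLUME_THRESHOLD = 30
SATISFACTION_THRESHOLD = 0.5

# (satisfied, volume) per topic, copied from the table above; Topic -1 is excluded.
topic_stats = {
    0: (0.97, 36),
    1: (0.16, 31),
    2: (0.92, 26),
    3: (0.93, 15),
    4: (0.15, 13),
}


def quadrant(satisfied: float, volume: int) -> str:
    """Map a topic to a quadrant label using the two thresholds."""
    high_satisfaction = satisfied >= SATISFACTION_THRESHOLD
    high_volume = volume >= VOLUME_THRESHOLD
    if high_satisfaction:
        return "Maintain" if high_volume else "High ROI"
    return "Danger Zone" if high_volume else "Low ROI"


labels = {topic: quadrant(*stats) for topic, stats in topic_stats.items()}
for topic, label in labels.items():
    print(f"Topic {topic}: {label}")
```

Writing the rule down makes the cut-offs explicit, which is useful when a domain expert wants to challenge where a borderline topic such as Topic 2 belongs.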

## Digging Deeper into Problematic Topics

Of the topics above, the one we want to dig into first is Topic 1, which sits in the danger zone. Let's take a look at the queries in this topic and see if we can understand why users are dissatisfied.

We'll do so in the following 4 steps:

1. First we'll take a look and aggregate the queries in this topic by category to see if there are specific categories that are causing the most issues within the topic
2. We'll then sample from the queries in this topic and see if we can understand what sort of queries are being asked
3. We'll then see if there are specific categories or query intents that are causing the most issues within the topic itself by manually inspecting the queries
4. Lastly, we'll brainstorm some potential solutions that we can use to address the issues we're seeing

In [63]:
total_items = results_df[results_df["Topic"] == 1].shape[0]
# Calculate mean satisfaction and volume per category within Topic 1
topic_stats = (
    results_df[results_df["Topic"] == 1]
    .groupby("category")
    .agg(
        {
            "satisfied": lambda x: round(x.mean(), 2),
            "question": "count",  # Count number of questions per category
        }
    )
    .rename(columns={"question": "volume"})
)

topic_stats["pct"] = round(topic_stats["volume"] / total_items, 3)
topic_stats

category,satisfied,volume,pct
Account & settings,0.0,2,0.065
Payments,0.11,19,0.613
Products & services,0.3,10,0.323
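
Since `satisfied` is a mean of 0/1 values, multiplying each category's volume by `1 - satisfied` gives an estimate of how many unsatisfied queries it contributes. The sketch below uses the rounded values from the table above, so the counts are approximate; the variable names are illustrative.

```python
# (satisfied, volume) per category within Topic 1, copied from the table above.
category_stats = {
    "Account & settings": (0.0, 2),
    "Payments": (0.11, 19),
    "Products & services": (0.3, 10),
}

# Estimated number of unsatisfied queries per category.
unsatisfied = {
    category: round(volume * (1 - satisfied))
    for category, (satisfied, volume) in category_stats.items()
}
total_unsatisfied = sum(unsatisfied.values())

# Print categories from the largest contributor to the smallest.
for category, count in sorted(unsatisfied.items(), key=lambda kv: -kv[1]):
    share = count / total_unsatisfied
    print(f"{category}: ~{count} unsatisfied ({share:.0%} of the topic's unsatisfied queries)")
```

Payments contributes roughly two thirds of the unsatisfied queries in this topic, so it is a natural place to start reading individual queries.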


In [66]:
import textwrap

query_df = results_df[results_df["Topic"] == 1]
# query_df = query_df[query_df["category"].isin(["Products & services", "Payments"])]
query_df = query_df[query_df["satisfied"] == 0]

for i, item in enumerate(query_df["question"].tolist()[:50]):
    wrapped_text = textwrap.fill(item, width=80)
    print(f"{i+1}. {wrapped_text}\n")

1. I bought a new pair of Adidas UltraBoost sneakers for $135 using Klarna's 'Pay
later in 30 days', Are there any additional fees that I might be charged? Just
wanted to confirm again.

2. I want to cancel my one-time card

3. I bought Nike Air Max shoes for $200 with Klarna and the payment is due
tomorrow. Are there any interest or fees that I'll have to pay with my plan?

4. I just purchased a Dyson V11 vacuum for $599 on your app three days ago using
the installment plan, but I'd prefer to pay it all at once now. Can I switch
from installments to full payment directly through the app?

5. I made a $130 purchase for Nike shoes using a one-time Klarna card two days ago,
but now I need to return them. How do I cancel this payment method since it was
a one-time card? The store won't process my return without the original payment
method

6. Just got a Peloton bike for $165 a couple of days back and my payment is due
tomorrow.  I need to change my bank account to a different account due 

While the majority of the queries focus on payment methods and changes, we can break down user intent into several distinct categories:

1. Payment Method Changes: Users frequently need to switch payment methods after making a purchase. This highlights a need for clearer instructions on how to update payment information before due dates, as many express confusion about locating this option in the app.
2. Early Payoff Requests: Many users want to pay their balance in full rather than continuing with installments. This reflects a desire for flexibility in payment schedules and suggests that the option to pay early may not be prominently displayed or intuitive to find.
3. Payment Due Date Concerns: Several queries express anxiety about upcoming payment deadlines and potential consequences. This indicates an opportunity to improve notification systems and provide clearer information about payment timing.
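
As a rough first pass before building a proper classifier, intents like these could be tagged with keyword rules. The sketch below is only illustrative: the intent names mirror the list above, but the keyword lists, the `tag_intents` helper and the example queries (adapted from the output above) are assumptions made up for this example. A real system would need expert-reviewed definitions for each category.

```python
from typing import List

# Hypothetical keyword rules; these lists are illustrative, not exhaustive.
INTENT_KEYWORDS = {
    "payment_method_change": ["change my bank", "different card", "different account"],
    "early_payoff": ["pay it all", "pay off", "full payment", "pay early"],
    "due_date_concern": ["due tomorrow", "due date", "payment is due"],
}


def tag_intents(query: str) -> List[str]:
    """Return every intent whose keywords appear in the query (case-insensitive)."""
    text = query.lower()
    return [
        intent
        for intent, keywords in INTENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


examples = [
    "I'd prefer to pay it all at once now. Can I switch from installments to full payment?",
    "My payment is due tomorrow. Are there any interest or fees that I'll have to pay?",
    "I need to change my bank account to a different account before my payment is due tomorrow",
    "I want to cancel my one-time card",
]
for example in examples:
    print(tag_intents(example) or ["unlabeled"], "-", example)
```

Queries that match no rule stay unlabeled, which gives a quick list of cases to review by hand when refining the categories.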

More importantly, we can see that many of these issues aren't just issues with retrieval or our language model. They're pain points that we can address with better UX or policies that target these recurring patterns.
A better prompt wouldn't solve this, but targeted interventions would. We wouldn't have been able to identify these issues and trends without query segmentation.

## Saving Our Model

Now that we've trained our model, we want to save it so that we can use it in our production application as an online topic model. We'd ideally also do some batch inference offline to run these topics on previous time periods to see how well we're doing over time.

We can do so by saving the model with `safetensors` serialization. `BERTopic` recommends saving the model this way since it's a relatively safe format and produces small files for use in production.

In [68]:
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"

# Save the model
topic_model.save(
    "./models/topic_model",
    serialization="safetensors",
    save_ctfidf=True,
    save_embedding_model=embedding_model,
)

# Load the model
topic_model = BERTopic.load("./models/topic_model")
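
Once the saved model has assigned topics to queries from earlier periods, tracking each topic's share of traffic per period is one simple way to monitor change over time. The sketch below assumes the batch step has already produced `(week, topic_id)` pairs; the `assignments` data and the `topic_share_by_period` helper are hypothetical and only illustrate the bookkeeping.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def topic_share_by_period(
    assignments: Iterable[Tuple[str, int]],
) -> Dict[str, Dict[int, float]]:
    """Return, for each period, the fraction of queries assigned to each topic."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for period, topic_id in assignments:
        counts[period][topic_id] += 1
    return {
        period: {topic: n / sum(counter.values()) for topic, n in counter.items()}
        for period, counter in counts.items()
    }


# Made-up (week, topic_id) pairs standing in for batch inference output.
assignments = [
    ("2024-W01", 0), ("2024-W01", 1), ("2024-W01", 1), ("2024-W01", 2),
    ("2024-W02", 1), ("2024-W02", 1), ("2024-W02", 1), ("2024-W02", 4),
]

shares = topic_share_by_period(assignments)
for period in sorted(shares):
    rounded = {topic: round(share, 2) for topic, share in sorted(shares[period].items())}
    print(period, rounded)
```

A topic whose share grows from one period to the next, like Topic 1 in this toy data, is a candidate for closer inspection.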

We can also push our model to the Hugging Face Hub relatively easily. This allows us to share it and use it in other projects if necessary.

In [21]:
topic_model.push_to_hf_hub("ivanleomk/rag-topic-model")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/ivanleomk/rag-topic-model/commit/cc4eb16013a554f9df569aef3ae50f33e0d9eefb', commit_message='Add BERTopic model', commit_description='', oid='cc4eb16013a554f9df569aef3ae50f33e0d9eefb', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ivanleomk/rag-topic-model', endpoint='https://huggingface.co', repo_type='model', repo_id='ivanleomk/rag-topic-model'), pr_revision=None, pr_num=None)

## Conclusion

In this notebook, we've looked at a potential workflow for understanding query patterns in your RAG application. First, we started from synthetic queries generated by carefully varying personas, intents, and question types. Then, using BERTopic, we discovered meaningful clusters of related questions, revealing that payment-related queries were particularly problematic.

Looking ahead, we'll transform these insights into actionable improvements by building a classifier that can tag and categorize queries in real time. We'll also demonstrate how to collaborate with domain experts through a simple `.yaml` file, which lets us define new categories and reuse them in our prompt.

This topic modeling approach will be particularly valuable when we cover metadata parsing in Week 5 and tool calling in Week 6. By combining topic modeling with these techniques, we'll be able to build more specialized tooling that routes and handles different types of user queries and improves our system systematically over time.