# RAG-Based Topic Extraction

This notebook implements a Retrieval-Augmented Generation (RAG) pipeline for extracting topics from call transcripts. It uses Whisper for transcription, SBERT for embeddings, and Annoy for approximate nearest-neighbour vector search.

## Phase 1: Data Preparation

### 1. Transcription
Convert raw audio files into text transcripts using Whisper ASR.

In [1]:
!pip install openai-whisper



In [2]:
import whisper
from pathlib import Path

# Load Whisper model
model = whisper.load_model("medium")

def transcribe_audio(audio_path: str) -> str:
    result = model.transcribe(audio_path)
    return result["text"]

# Example usage
audio_file = "sample_data/34712515_09994039074_20250116144348_out.mp3"
transcript = transcribe_audio(audio_file)
print(transcript[:500])  # preview first 500 characters

 Hello, who is that? Sir, check your WhatsApp. I have sent you the high message on your WhatsApp, sir. Okay, okay. Okay, send me the model number and... Okay, sir? Okay, sir. Okay, sir. Okay.
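
The cell above transcribes a single file. For a folder of recordings, a small wrapper along the following lines may help. This is a minimal sketch, not part of the original notebook: `transcribe_folder`, its `pattern` parameter, and the idea of passing the transcriber in as `transcribe_fn` are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Dict


def transcribe_folder(
    folder: str,
    transcribe_fn: Callable[[str], str],
    pattern: str = "*.mp3",
) -> Dict[str, str]:
    """Map each matching audio file name to its transcript.

    Hypothetical helper: `transcribe_fn` is any function that takes a file
    path and returns text, for example `transcribe_audio` defined above.
    """
    transcripts = {}
    # sorted() keeps the processing order stable across runs
    for path in sorted(Path(folder).glob(pattern)):
        # Whisper output often starts with a leading space, so strip it
        transcripts[path.name] = transcribe_fn(str(path)).strip()
    return transcripts
```

With the Whisper model loaded as above, a call such as `transcribe_folder("sample_data", transcribe_audio)` would return one transcript per `.mp3` file in that folder.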


### 2. Chunking
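Split the text into overlapping passages so that context is preserved across chunk boundaries. The code below uses 60-token windows with a 50-token overlap, so each new chunk starts 10 tokens after the previous one. Here the input is the concatenated call summaries from `dataset_67.json`.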

In [20]:
import json
import pandas as pd
with open('sample_data/dataset_67.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
" ".join(list(df['summary']))

"The call involves a brief exchange where one party instructs the other to check a message sent via WhatsApp. The conversation is focused on confirming the receipt of a message and requesting a model number. The call involves a discussion about a laptop issue and a request to make a call to a specific number. The conversation includes instructions on how to proceed with the call and mentions the use of a normal call instead of a WhatsApp call. The call involves a request for the customer to send a picture of a QR code related to a forgotten password issue. The technical support team plans to visit the customer's house after receiving the QR code. The call discusses the setup and configuration of a P2P device, including how to connect it using buttons and the necessary settings for proper operation. Instructions are given for resetting the devices and ensuring they are correctly aligned for network connectivity. The call involves a customer service interaction where the customer is faci

In [3]:
!pip install nltk



In [3]:
import nltk

In [4]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [21]:
from nltk.tokenize import word_tokenize

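def chunk_text(text, chunk_size=60, overlap=50):
    """Split text into overlapping windows of word-level tokens."""
    if overlap >= chunk_size:
        # A non-positive stride would never advance and would loop forever
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = word_tokenize(text)
    chunks = []
    start = 0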
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk = tokens[start:end]
        chunks.append(" ".join(chunk))
        if end == len(tokens):
            break
        start += chunk_size - overlap
    return chunks

# Example chunking
chunks = chunk_text(" ".join(list(df['summary'])))
print(f"Generated {len(chunks)} chunks of up to 60 tokens each.")

Generated 245 chunks of up to 60 tokens each.
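
The chunk count follows directly from the window size and overlap. The sketch below is an illustration rather than part of the original notebook: `expected_chunk_count` and `sliding_windows` are hypothetical helpers, and `sliding_windows` works on a plain list instead of NLTK tokens so that it runs without extra dependencies. It mirrors the stepping logic of `chunk_text` and checks the closed-form count against it.

```python
import math


def expected_chunk_count(n_tokens: int, chunk_size: int = 60, overlap: int = 50) -> int:
    """Number of windows a sliding-window chunker with these settings produces."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if n_tokens == 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    # One initial window, plus one more for each stride needed to reach the end
    return math.ceil((n_tokens - chunk_size) / stride) + 1


def sliding_windows(tokens, chunk_size=60, overlap=50):
    """Plain-list stand-in for chunk_text, used only to check the formula."""
    stride = chunk_size - overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += stride
    return windows


tokens = [f"t{i}" for i in range(95)]
windows = sliding_windows(tokens)
print(len(windows), expected_chunk_count(len(tokens)))
```

With a stride of 10 and a window of 60, neighbouring chunks share 50 of their 60 tokens. This is why near-duplicate chunks tend to appear next to each other in retrieval results; a smaller overlap would reduce that redundancy at the cost of less shared context.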


### 3. Embedding
Use a SentenceTransformer model (e.g., all-MiniLM-L6-v2) to turn each chunk into a 384-dimensional vector.

In [22]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Load SBERT model
sbert_model = SentenceTransformer("all-MiniLM-L6-v2")

# Compute embeddings for chunks
embeddings = sbert_model.encode(chunks, convert_to_numpy=True)
print("Embeddings shape:", embeddings.shape)

Embeddings shape: (245, 384)


### 4. Indexing
Store embeddings in an Annoy index for fast k-NN similarity search.

In [10]:
!pip install annoy

Collecting annoy
  Downloading annoy-1.17.3.tar.gz (647 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: annoy
  Building wheel for annoy (setup.py) ... done
  Created wheel for annoy: filename=annoy-1.17.3-cp311-cp311-linux_x86_64.whl size=553322 sha256=207807182391f3d905aba08f15563bf33845ddd2ee22cde85b6be380152d81fb
  Stored in directory: /root/.cache/pip/wheels/33/e5/58/0a3e34b92bedf09b4c57e37a63ff395ade6f6c1099ba59877c
Successfully built annoy
Installing collected packages: annoy
Successfully installed annoy-1.17.3


In [23]:
from annoy import AnnoyIndex

vector_dim = embeddings.shape[1]
index = AnnoyIndex(vector_dim, metric='angular')

# Build Annoy index
for i, vec in enumerate(embeddings):
    index.add_item(i, vec)
index.build(50)  # 50 trees; more trees improve accuracy but enlarge the index
index.save("topic_index.ann")

print("Annoy index built and saved.")

Annoy index built and saved.
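
The index uses `metric='angular'`. Annoy defines this as the Euclidean distance between the normalized vectors, which equals sqrt(2 * (1 - cos(u, v))) and ranges from 0 (same direction) to 2 (opposite direction). The sketch below shows the relationship in pure Python; the function names are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular_distance(u, v):
    """Annoy-style angular distance: sqrt(2 * (1 - cos(u, v)))."""
    # max() guards against tiny negative values from floating-point rounding
    return math.sqrt(max(0.0, 2.0 * (1.0 - cosine_similarity(u, v))))


def angular_to_cosine(distance):
    """Invert the formula above: cos = 1 - d^2 / 2."""
    return 1.0 - distance ** 2 / 2.0


u, v = [1.0, 0.0], [1.0, 1.0]
d = angular_distance(u, v)
print(round(d, 4), round(angular_to_cosine(d), 4))
```

For instance, a reported distance of 0.9348 (as in the query results later in this notebook) corresponds to a cosine similarity of roughly 0.563.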


## Phase 2: Query Execution

When a user submits a query, embed it with the same SBERT model used for the chunks, then retrieve the top-k most similar chunks from the index.
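
Annoy returns approximate neighbours. For a corpus this small (245 chunks), an exact search is cheap and can serve as a baseline for sanity-checking the approximate results. The following is a minimal sketch under that assumption: `top_k_exact` is a hypothetical helper, written in pure Python so it runs without the notebook's dependencies, and it assumes no vector has zero length.

```python
import math


def top_k_exact(query_vec, vectors, k=10):
    """Return (index, angular_distance) pairs for the k nearest vectors."""

    def unit(v):
        # Assumes a non-zero vector; a zero vector would divide by zero
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = unit(query_vec)
    scored = []
    for i, vec in enumerate(vectors):
        # Euclidean distance between unit vectors equals the angular distance
        scored.append((i, math.dist(q, unit(vec))))
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
for idx, dist in top_k_exact([1.0, 0.1], toy_vectors, k=2):
    print(idx, round(dist, 4))
```

In the notebook, passing `query_vec` and `embeddings` (converted to lists, or iterated as NumPy rows) would give the exact ranking to compare against the ids returned by `index.get_nns_by_vector`.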

In [25]:
# Reload the saved index from disk
index = AnnoyIndex(vector_dim, metric='angular')
index.load("topic_index.ann")

# Example query
query = "send your camera configuration"

# Embed query
query_vec = sbert_model.encode([query], convert_to_numpy=True)[0]

# Retrieve top 10 chunks
k = 10
ids, distances = index.get_nns_by_vector(query_vec, k, include_distances=True)

print("Top matched chunks:")
for idx, dist in zip(ids, distances):
    print(f"Chunk #{idx} (distance: {dist:.4f}):")
    print(chunks[idx])
    print("---")

Top matched chunks:
Chunk #152 (distance: 0.9348):
on troubleshooting and configuring camera settings . The participants discussed how to change modes and presets on the camera , and there was some confusion about the current settings and how to update them . The conversation also touched on the need for a demo and the successful update of the camera firmware . The call primarily discusses technical issues
---
Chunk #153 (distance: 0.9616):
how to change modes and presets on the camera , and there was some confusion about the current settings and how to update them . The conversation also touched on the need for a demo and the successful update of the camera firmware . The call primarily discusses technical issues related to a device , including problems with night vision
---
Chunk #28 (distance: 0.9717):
camera . The customer is seeking assistance to enable night vision and configure the IP camera settings . The call involved a discussion about handling camera updates and timezone iss