# Text Chunking, Embedding, and Vector Store Indexing

In [3]:
import pandas as pd

# Load the filtered complaints and keep only the columns needed for chunking
df = pd.read_csv("../data/processed/filtered_complaints.csv")
df = df[['Complaint ID', 'Mapped Product', 'cleaned_narrative']].dropna()


In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 300
chunk_overlap = 50

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

chunks = []

# Split each narrative and carry its complaint ID and product along with every chunk
for _, row in df.iterrows():
    text = row['cleaned_narrative']
    chunked_texts = text_splitter.split_text(text)
    for chunk in chunked_texts:
        chunks.append({
            'chunk': chunk,
            'complaint_id': row['Complaint ID'],
            'product': row['Mapped Product']
        })

# Convert to DataFrame
chunk_df = pd.DataFrame(chunks)

In [3]:
chunk_df.head()

   chunk                                              complaint_id  product
0  a xxxx xxxx card was opened under my name by a...  14069121      Credit card
1  and immediately closed the card however they h...  14069121      Credit card
2  i made the mistake of using my wellsfargo debi...  14061897      Savings account
3  i went into the branch and was told they could...  14061897      Savings account
4  i waited a few days and got a letter stating m...  14061897      Savings account


In [6]:
chunk_df.to_csv('../data/processed/text_chunks.csv', index=False)


# Load Model and Encode Chunks

In [7]:

# Preview the content
print(chunk_df.head())


                                               chunk  complaint_id  \
0  a xxxx xxxx card was opened under my name by a...      14069121   
1  and immediately closed the card however they h...      14069121   
2  i made the mistake of using my wellsfargo debi...      14061897   
3  i went into the branch and was told they could...      14061897   
4  i waited a few days and got a letter stating m...      14061897   

           product  
0      Credit card  
1      Credit card  
2  Savings account  
3  Savings account  
4  Savings account  


In [16]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from chromadb import PersistentClient

# Load your chunk data CSV
text_chunk = pd.read_csv("../data/processed/text_chunks.csv")

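# Function to sample at most max_samples chunks from one product group
def sample_chunks(group, max_samples):
    return group.sample(n=min(len(group), max_samples), random_state=42)

# Sample chunks from each product; concatenating the per-group samples avoids the
# DataFrameGroupBy.apply deprecation warning and keeps the 'product' column
max_per_product = 1250  # Adjust this number to get closer to 5000 total
target_size = min(5000, len(text_chunk))
subset = pd.concat(
    [sample_chunks(group, max_per_product) for _, group in text_chunk.groupby('product')]
)

# If the total is less than the target, top up once from rows not yet selected
# (the original index is still intact here, so sampled rows can be excluded;
# a loop with a fixed random_state would redraw the same rows every pass)
if len(subset) < target_size:
    remaining = text_chunk.drop(index=subset.index)
    additional_samples = remaining.sample(n=target_size - len(subset), random_state=42)
    subset = pd.concat([subset, additional_samples])

# Limit to the target size if necessary and shuffle
subset = subset.sample(n=target_size, random_state=42).reset_index(drop=True)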

# Initialize Chroma client (persistent storage)
client = PersistentClient(path="../vector_store")
collection = client.get_or_create_collection("complaints")

# Load embedding model
print("Loading embedding model...")
model = SentenceTransformer("all-MiniLM-L6-v2")

# Prepare data
texts = subset['chunk'].tolist()
ids = [f"{row['complaint_id']}_{i}" for i, row in subset.iterrows()]
metadatas = subset[['complaint_id', 'product']].to_dict(orient='records')

# Generate embeddings with progress bar
print("Generating embeddings...")
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True).tolist()

# Add data to Chroma vector store
print("Saving to Chroma vector store...")
collection.add(
    documents=texts,
    embeddings=embeddings,
    ids=ids,
    metadatas=metadatas
)

print("✅ Embeddings and metadata successfully stored in vector_store.")




Loading embedding model...
Generating embeddings...



Saving to Chroma vector store...
✅ Embeddings and metadata successfully stored in vector_store.


# Task 2: Chunking and Embedding Report

## Introduction
In this task, I split the cleaned complaint narratives into overlapping chunks and generated a vector embedding for each chunk. These embeddings will be used for the retrieval step of my Retrieval-Augmented Generation (RAG) pipeline.

## Data Preparation
### Loading the Dataset
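I began by loading the chunk data from `text_chunks.csv`. Each row holds one chunk produced by splitting a cleaned narrative with `RecursiveCharacterTextSplitter` (`chunk_size=300`, `chunk_overlap=50`), together with the complaint ID and product it came from.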
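
To make the effect of the notebook's `chunk_size=300` and `chunk_overlap=50` settings concrete, here is a minimal sketch of overlapping character windows. It is a simplification: `window_chunks` is a hypothetical helper that cuts at fixed offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so real chunk boundaries will differ.

```python
def window_chunks(text: str, chunk_size: int = 300, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Illustrative input only: 700 characters of repeating letters
sample = "".join(chr(97 + i % 26) for i in range(700))
pieces = window_chunks(sample)
print([len(p) for p in pieces])
print(pieces[0][-50:] == pieces[1][:50])
```

The last 50 characters of one window are repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.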

### Sampling
Given that the dataset contained over 78,000 chunks, I sampled a subset of 5,000 chunks to keep embedding time manageable. To keep the product categories balanced, I drew up to 1,250 chunks from each category, topped up with randomly chosen chunks if the total fell short, and capped the result at 5,000.
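
The capped per-category sampling can be illustrated without pandas. This is a sketch under assumptions, not the notebook's code: `sample_per_category` is a hypothetical helper, and the toy rows and the cap of 4 are made up for the example.

```python
import random
from collections import defaultdict


def sample_per_category(rows, key, max_per_category, seed=42):
    """Return at most max_per_category randomly chosen rows from each category."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sampled = []
    for category in sorted(groups):
        members = groups[category]
        k = min(len(members), max_per_category)  # small categories are kept whole
        sampled.extend(rng.sample(members, k))
    return sampled


# Toy data: one large category and one small one
rows = [{"product": "Credit card", "chunk": f"c{i}"} for i in range(10)]
rows += [{"product": "Savings account", "chunk": f"s{i}"} for i in range(3)]

subset = sample_per_category(rows, "product", max_per_category=4)
print(len(subset))
```

The cap limits how much a large category can dominate the sample, while categories smaller than the cap contribute every chunk they have.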

## Embedding Generation
### Model Selection
I chose the `sentence-transformers/all-MiniLM-L6-v2` model for generating embeddings. It is a compact model that produces 384-dimensional vectors, which keeps encoding fast and storage small while still capturing sentence-level meaning well enough for similarity search.

### Generating Embeddings
I encoded each text chunk with the selected model in batches of 32. The resulting vectors are what make similarity search over the chunks possible in the subsequent RAG pipeline.
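
Similarity search ranks stored vectors by how close they are to a query vector. Below is a minimal sketch using cosine similarity, one common measure, on toy 3-dimensional vectors standing in for real embeddings. The vector values and the `top_k` helper are made up for illustration, and the vector store may be configured with a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy "embeddings": chunk_a and chunk_b point in nearly the same direction
vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]
print(top_k(query, vectors, k=2))
```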

## Storing in Vector Store
### Vector Store Initialization
I used ChromaDB as the vector store, with a persistent client writing to `../vector_store` and a collection named `complaints`, so the embeddings can be reloaded in later tasks without being recomputed.

### Saving Embeddings
I stored each embedding alongside its chunk text and metadata (complaint ID and product category), under an ID of the form `<complaint_id>_<row index>`. This makes it possible to trace any retrieved chunk back to its original complaint.
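
The bookkeeping behind this step can be sketched in plain Python. The sketch mirrors the notebook's ID scheme of complaint ID plus row index; `build_records` and `source_complaint_id` are hypothetical helper names, and the chunk texts are placeholders.

```python
def build_records(rows):
    """Return parallel lists of ids, documents, and metadata for a list of chunk rows."""
    ids, documents, metadatas = [], [], []
    for position, row in enumerate(rows):
        # Appending the row position keeps IDs unique when a complaint has several chunks
        ids.append(f"{row['complaint_id']}_{position}")
        documents.append(row["chunk"])
        metadatas.append({"complaint_id": row["complaint_id"], "product": row["product"]})
    return ids, documents, metadatas


def source_complaint_id(record_id):
    """Recover the complaint ID from an ID of the form '<complaint_id>_<position>'."""
    return record_id.rsplit("_", 1)[0]


# Placeholder rows shaped like the notebook's chunk table
rows = [
    {"complaint_id": 14069121, "product": "Credit card", "chunk": "first chunk"},
    {"complaint_id": 14069121, "product": "Credit card", "chunk": "second chunk"},
    {"complaint_id": 14061897, "product": "Savings account", "chunk": "third chunk"},
]
ids, documents, metadatas = build_records(rows)
print(ids)
print([source_complaint_id(i) for i in ids])
```

The three lists line up index by index, which is the shape expected by the `documents`, `ids`, and `metadatas` arguments of the `collection.add` call in the notebook.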

## Conclusion
By chunking the data and generating embeddings, I laid the groundwork for my RAG pipeline. The embeddings are now stored in a vector database, ready for retrieval and further analysis in upcoming tasks.

## Future Work
The next steps involve implementing the retrieval and generation logic, evaluating the effectiveness of my pipeline, and refining my prompt engineering to enhance response quality.