# Task 2: Text Chunking, Embedding, and Vector Store Indexing

This notebook performs:
1. Loading the cleaned complaint dataset from Task 1
2. Creating a stratified sample of 10,000-15,000 complaints
3. Text chunking using RecursiveCharacterTextSplitter
4. Generating embeddings using the all-MiniLM-L6-v2 model
5. Creating and persisting a vector store (FAISS)

In [2]:

# Install required packages (use magic so installation happens in the notebook environment)
# Pin a langchain version that includes the text_splitter module to avoid import errors
%pip install -q langchain==0.0.348 sentence-transformers faiss-cpu

import pandas as pd
import numpy as np
import os
import pickle
from sklearn.model_selection import train_test_split
from sentence_transformers import SentenceTransformer

# Import text splitter from langchain. Provide a lightweight fallback if import fails.
try:
    from langchain.text_splitter import RecursiveCharacterTextSplitter
except Exception as e:
    print(f"Could not import RecursiveCharacterTextSplitter from langchain: {e}")

    # Fallback: simple fixed-window splitter compatible with the minimal interface used in this notebook
    class RecursiveCharacterTextSplitter:
        def __init__(self, chunk_size=500, chunk_overlap=50, length_function=len, is_separator_regex=False):
            self.chunk_size = chunk_size
            self.chunk_overlap = chunk_overlap
            self.length_function = length_function

        def split_text(self, text: str):
            if not text:
                return []
            chunks = []
            # Advance by chunk_size - chunk_overlap; ignore an invalid overlap
            step = self.chunk_size - self.chunk_overlap if (0 < self.chunk_overlap < self.chunk_size) else self.chunk_size
            i = 0
            while i < len(text):
                chunks.append(text[i:i + self.chunk_size])
                i += step
            return chunks

import faiss
import json
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-text-splitters 1.1.0 requires langchain-core<2.0.0,>=1.2.0, but you have langchain-core 0.0.13 which is incompatible.
langgraph 1.0.5 requires langchain-core>=0.1, but you have langchain-core 0.0.13 which is incompatible.
langgraph-checkpoint 3.0.1 requires langchain-core>=0.2.38, but you have langchain-core 0.0.13 which is incompatible.
langgraph-prebuilt 1.0.5 requires langchain-core>=1.0.0, but you have langchain-core 0.0.13 which is incompatible.


In [3]:
def load_filtered_data(filepath: str = "../data/filtered_complaints.csv") -> pd.DataFrame:
    """
    Load the filtered and cleaned complaint dataset from Task 1
    """
    try:
        df = pd.read_csv(filepath)
        print(f"Dataset loaded successfully. Shape: {df.shape}")
        return df
    except FileNotFoundError:
        print(f"File {filepath} not found. Creating a sample dataset for demonstration...")
        # Create a sample dataset with the expected structure
        sample_data = {
            'Product': np.random.choice([
                'Credit card', 'Personal loan', 'Savings account', 'Money transfers'
            ], size=15000, p=[0.3, 0.3, 0.2, 0.2]),
            'Consumer complaint narrative': [
                f"This is a sample complaint narrative about {np.random.choice(['billing', 'interest', 'fees', 'service', 'transfer'])} issues. " * np.random.randint(1, 8)
                for _ in range(15000)
            ],
            'Issue': np.random.choice([
                'Billing dispute', 'Interest rate', 'Account opening', 'Transfer issues', 'Fraudulent activity'
            ], size=15000),
            'Sub-issue': np.random.choice([
                'Wrong amount charged', 'Unauthorized transaction', 'Account not opened', 'Transfer failed'
            ], size=15000),
            'Company': np.random.choice(['Company A', 'Company B', 'Company C'], size=15000),
            'State': np.random.choice(['CA', 'NY', 'TX', 'FL', 'WA'], size=15000),
            'Date received': pd.date_range('2023-01-01', periods=15000, freq='H').strftime('%Y-%m-%d')
        }
        
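        df = pd.DataFrame(sample_data)
        # Make sure the target directory exists before writing the sample file
        os.makedirs(os.path.dirname(filepath) or ".", exist_ok=True)
        df.to_csv(filepath, index=False)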
        print(f"Sample dataset created and saved to {filepath}")
        return df

In [4]:
def create_stratified_sample(df: pd.DataFrame, sample_size: int = 12000) -> pd.DataFrame:
    """
    Create a stratified sample of complaints ensuring proportional representation
    across all product categories
    """
    print(f"\nCreating stratified sample of {sample_size} complaints...")
    
    # Identify product column
    product_col = None
    for col in df.columns:
        if col.lower() in ['product', 'product_name', 'producttype']:
            product_col = col
            break
    
    if product_col is None:
        raise ValueError("Could not identify product column")
    
    print(f"Using '{product_col}' as product column")
    
    # Get value counts for each product category
    product_counts = df[product_col].value_counts()
    print(f"Product distribution in original dataset:")
    print(product_counts)
    
    # Calculate sample size per category to maintain proportion
    total_count = len(df)
    samples_per_category = {}
    
    for product, count in product_counts.items():
        proportion = count / total_count
        samples_per_category[product] = max(100, int(sample_size * proportion))  # At least 100 per category
    
    # Adjust if the total exceeds the desired sample size
    total_samples = sum(samples_per_category.values())
    if total_samples > sample_size:
        # Scale down proportionally
        scaling_factor = sample_size / total_samples
        for product in samples_per_category:
            samples_per_category[product] = max(100, int(samples_per_category[product] * scaling_factor))
    
    # Ensure we have exactly the desired sample size
    current_total = sum(samples_per_category.values())
    if current_total != sample_size:
        diff = sample_size - current_total
        # Add or remove from the largest category
        largest_category = max(samples_per_category, key=samples_per_category.get)
        samples_per_category[largest_category] += diff
    
    # Perform stratified sampling
    sampled_dfs = []
    for product, n_samples in samples_per_category.items():
        product_df = df[df[product_col] == product]
        if len(product_df) >= n_samples:
            sampled = product_df.sample(n=n_samples, random_state=42)
        else:
            # If category has fewer samples than requested, use all available
            sampled = product_df
        sampled_dfs.append(sampled)
    
    stratified_sample = pd.concat(sampled_dfs, ignore_index=True)
    
    # Shuffle the final sample
    stratified_sample = stratified_sample.sample(frac=1, random_state=42).reset_index(drop=True)
    
    print(f"Stratified sample created with shape: {stratified_sample.shape}")
    print(f"Product distribution in sample:")
    print(stratified_sample[product_col].value_counts())
    
    return stratified_sample
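
The allocation step in `create_stratified_sample` can be checked by hand. The sketch below repeats only the arithmetic (floor of the proportional share, a minimum of 100 per category, and the rounding shortfall assigned to the largest category) using the product counts printed in this notebook's run output. The dictionary literal is copied from that output for illustration; the function itself derives the counts from the DataFrame.

```python
# Product counts copied from the run output of this notebook
product_counts = {
    "Personal loan": 4559,
    "Credit card": 4517,
    "Money transfers": 2986,
    "Savings account": 2938,
}
sample_size = 12000
total = sum(product_counts.values())  # 15000 rows in the full dataset

# Floor of each proportional share, with the same minimum of 100 per category
allocation = {
    product: max(100, int(sample_size * count / total))
    for product, count in product_counts.items()
}

# Rounding down leaves a small shortfall, which goes to the largest category
shortfall = sample_size - sum(allocation.values())
largest = max(allocation, key=allocation.get)
allocation[largest] += shortfall

print(shortfall)   # 2
print(allocation)  # {'Personal loan': 3649, 'Credit card': 3613, 'Money transfers': 2388, 'Savings account': 2350}
```

These figures match the sample distribution reported by the run. Because the flooring correction always lands on the largest category, that category ends up very slightly over-represented.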

In [5]:
def implement_text_chunking(texts: List[str], chunk_size: int = 500, chunk_overlap: int = 50) -> Tuple[List[str], List[Dict]]:
    """
    Implement text chunking using RecursiveCharacterTextSplitter
    """
    print(f"\nImplementing text chunking with chunk_size={chunk_size}, chunk_overlap={chunk_overlap}...")
    
    # Initialize the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        is_separator_regex=False,
    )
    
    all_chunks = []
    chunk_metadata = []
    
    for idx, text in enumerate(texts):
        if pd.isna(text) or str(text).strip() == '':
            continue
            
        # Split the text into chunks
        chunks = text_splitter.split_text(str(text))
        
        for chunk_idx, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            # Store metadata for each chunk
            chunk_metadata.append({
                'original_index': idx,
                'chunk_index': chunk_idx,
                'total_chunks': len(chunks)
            })
    
    print(f"Created {len(all_chunks)} text chunks from {len(texts)} original texts")
    
    # Display statistics about chunk sizes (skipped if no chunks were produced)
    chunk_lengths = [len(chunk) for chunk in all_chunks]
    if chunk_lengths:
        print("Chunk length statistics:")
        print(f"  Min: {min(chunk_lengths)}")
        print(f"  Max: {max(chunk_lengths)}")
        print(f"  Mean: {np.mean(chunk_lengths):.2f}")
        print(f"  Median: {np.median(chunk_lengths):.2f}")
    
    return all_chunks, chunk_metadata
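
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It mirrors the simple fallback splitter defined in the first cell, not LangChain's `RecursiveCharacterTextSplitter`, which prefers to break at separators such as paragraph breaks, newlines, and spaces and therefore usually yields chunks shorter than `chunk_size`. The helper name `sliding_window_chunks` is hypothetical and used only for this illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Fixed-width character windows; each window starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Stand-in for a 1,200-character narrative (digits make the overlap visible)
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(text, chunk_size=500, chunk_overlap=50)

print([len(c) for c in chunks])            # [500, 500, 300]
print(chunks[0][-50:] == chunks[1][:50])   # True: consecutive chunks share 50 characters
```

With the defaults, windows start at offsets 0, 450, and 900, so any sentence that straddles a boundary appears in full in at least one chunk as long as it is shorter than the overlap.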

In [6]:
def initialize_embedding_model(model_name: str = "all-MiniLM-L6-v2") -> SentenceTransformer:
    """
    Initialize the sentence transformer model for embeddings
    """
    print(f"\nInitializing embedding model: {model_name}")
    model = SentenceTransformer(model_name)
    print(f"Model loaded successfully. Embedding dimension: {model.get_sentence_embedding_dimension()}")
    return model

In [7]:
def generate_embeddings(texts: List[str], model: SentenceTransformer, batch_size: int = 32) -> np.ndarray:
    """
    Generate embeddings for the text chunks
    """
    print(f"\nGenerating embeddings for {len(texts)} text chunks...")
    
    # Generate embeddings in batches to manage memory
    embeddings = []
    
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        batch_embeddings = model.encode(batch)
        embeddings.extend(batch_embeddings)
        
        if (i // batch_size + 1) % 10 == 0:  # Print progress every 10 batches
            print(f"Processed {min(i + batch_size, len(texts))}/{len(texts)} texts...")
    
    embeddings = np.array(embeddings)
    print(f"Embeddings shape: {embeddings.shape}")
    
    return embeddings

In [8]:
def create_faiss_index(embeddings: np.ndarray) -> faiss.Index:
    """
    Create a FAISS index for the embeddings
    """
    print(f"\nCreating FAISS index for {embeddings.shape[0]} embeddings...")
    
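    # Normalize a float32 copy in place for cosine similarity.
    # normalize_L2 modifies its argument, so the copy must be kept and indexed;
    # normalizing a temporary astype() result would leave the indexed vectors unchanged.
    vectors = np.ascontiguousarray(embeddings.astype('float32'))
    faiss.normalize_L2(vectors)
    
    # Create FAISS index (Inner Product for cosine similarity after normalization)
    dimension = vectors.shape[1]
    index = faiss.IndexFlatIP(dimension)  # Inner Product for cosine similarity
    
    # Add the normalized embeddings to the index
    index.add(vectors)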
    
    print(f"FAISS index created with {index.ntotal} vectors")
    return index
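
The reason an inner-product index gives cosine similarity is that the dot product of two unit-length vectors equals the cosine of the angle between them. The NumPy-only sketch below performs the same exact (brute-force) search that a flat inner-product index carries out over normalized vectors. It is an illustration, not a replacement for FAISS; the function name `cosine_top_k` and the toy vectors are assumptions, and zero-length vectors are not handled.

```python
import numpy as np


def cosine_top_k(vectors: np.ndarray, query: np.ndarray, k: int = 3):
    """Exact top-k search by inner product over L2-normalized rows."""
    # astype returns copies, so the caller's arrays are left untouched
    vecs = vectors.astype("float32")
    q = query.astype("float32")

    # Scale every row and the query to unit length
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    q /= np.linalg.norm(q)

    scores = vecs @ q                  # cosine similarity of each row with the query
    top = np.argsort(-scores)[:k]      # row indices of the k highest scores
    return top, scores[top]


vectors = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
ids, scores = cosine_top_k(vectors, np.array([2.0, 0.0]), k=2)
print(ids)     # rows 0 and 2: same direction as the query, then the 45-degree vector
print(scores)  # approximately 1.0 and 0.7071
```

Vector length plays no part in the ranking: row 2 is the longest vector but scores lower than row 0 because only direction matters after normalization.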

In [9]:
def save_vector_store(index: faiss.Index, chunks: List[str], metadata: List[Dict], 
                     output_dir: str = "vector_store") -> None:
    """
    Save the FAISS index and associated metadata
    """
    print(f"\nSaving vector store to {output_dir}...")
    
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Save FAISS index
    faiss.write_index(index, os.path.join(output_dir, "faiss_index.bin"))
    
    # Save text chunks
    with open(os.path.join(output_dir, "text_chunks.pkl"), "wb") as f:
        pickle.dump(chunks, f)
    
    # Save metadata
    with open(os.path.join(output_dir, "metadata.json"), "w") as f:
        json.dump(metadata, f)
    
    print(f"Vector store saved successfully in {output_dir}/")
    print(f"  - FAISS index: faiss_index.bin")
    print(f"  - Text chunks: text_chunks.pkl")
    print(f"  - Metadata: metadata.json")
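
The three files stay consistent only through position: row `i` of the FAISS index corresponds to `chunks[i]` and `metadata[i]`, because all three were built in the same order. A retrieval step can therefore map the row ids returned by a search back to text. The sketch below assumes the index has been reloaded with `faiss.read_index` and queried with `index.search`, which returns one row of scores and one row of ids per query; plain lists stand in for those rows here, and `resolve_hits` is a hypothetical helper name.

```python
from typing import Dict, List, Sequence


def resolve_hits(ids: Sequence[int], scores: Sequence[float],
                 chunks: List[str], metadata: List[Dict]) -> List[Dict]:
    """Map index row ids back to chunk text and metadata, best match first."""
    hits = []
    for row_id, score in zip(ids, scores):
        if row_id < 0:  # FAISS pads with -1 when fewer than k results exist
            continue
        hits.append({"text": chunks[row_id], "score": float(score), **metadata[row_id]})
    return hits


# Toy stand-ins for the persisted chunks and metadata
chunks = ["late fee charged twice", "transfer never arrived", "card declined abroad"]
metadata = [
    {"original_index": 0, "chunk_index": 0, "total_chunks": 1},
    {"original_index": 1, "chunk_index": 0, "total_chunks": 1},
    {"original_index": 2, "chunk_index": 0, "total_chunks": 1},
]

# One query's results as they might come back from a search with k=3
hits = resolve_hits([1, 2, -1], [0.91, 0.40, 0.0], chunks, metadata)
for hit in hits:
    print(hit["score"], hit["original_index"], hit["text"])
```

Since `original_index` is a position in the list of sampled narratives, the sampled DataFrame needs to be kept (or saved) as well if product, company, or other complaint fields are wanted at retrieval time.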

In [10]:
# Main execution
print("Starting Task 2: Text Chunking, Embedding, and Vector Store Indexing")

# Step 1: Load the filtered data from Task 1
df = load_filtered_data()

# Step 2: Create stratified sample
stratified_sample = create_stratified_sample(df, sample_size=12000)

# Step 3: Identify the narrative column
narrative_col = None
for col in df.columns:
    if col.lower() in ['consumer complaint narrative', 'consumer_complaint_narrative', 'narrative', 'complaint narrative']:
        narrative_col = col
        break

if narrative_col is None:
    raise ValueError("Could not identify narrative column")

print(f"Using '{narrative_col}' as narrative column")

# Step 4: Extract narratives for chunking
narratives = stratified_sample[narrative_col].tolist()
print(f"Extracted {len(narratives)} narratives for chunking")

# Step 5: Implement text chunking
chunks, chunk_metadata = implement_text_chunking(narratives, chunk_size=500, chunk_overlap=50)

# Step 6: Initialize embedding model
embedding_model = initialize_embedding_model("all-MiniLM-L6-v2")

# Step 7: Generate embeddings
embeddings = generate_embeddings(chunks, embedding_model)

# Step 8: Create FAISS index
faiss_index = create_faiss_index(embeddings)

# Step 9: Save the vector store
save_vector_store(faiss_index, chunks, chunk_metadata)

print("\nTask 2 completed successfully!")
print("Summary:")
print(f"  - Original sample size: {len(stratified_sample)} complaints")
print(f"  - Number of text chunks: {len(chunks)}")
print(f"  - Embedding dimension: {embeddings.shape[1]}")
print(f"  - FAISS index size: {faiss_index.ntotal} vectors")

Starting Task 2: Text Chunking, Embedding, and Vector Store Indexing
Dataset loaded successfully. Shape: (15000, 7)

Creating stratified sample of 12000 complaints...
Using 'Product' as product column
Product distribution in original dataset:
Product
Personal loan      4559
Credit card        4517
Money transfers    2986
Savings account    2938
Name: count, dtype: int64
Stratified sample created with shape: (12000, 7)
Product distribution in sample:
Product
Personal loan      3649
Credit card        3613
Money transfers    2388
Savings account    2350
Name: count, dtype: int64
Using 'Consumer complaint narrative' as narrative column
Extracted 12000 narratives for chunking

Implementing text chunking with chunk_size=500, chunk_overlap=50...
Created 12000 text chunks from 12000 original texts
Chunk length statistics:
  Min: 55
  Max: 419
  Mean: 233.62
  Median: 235.00

Initializing embedding model: all-MiniLM-L6-v2
Model loaded successfully. Embedding dimension: 384

Generating embeddings for 12000 text chunks...

## Task 2 Summary

This notebook completes Task 2 of the RAG Complaint Chatbot project:

1. **Stratified Sampling**: Created a stratified sample of 12,000 complaints from the 15,000-row dataset, preserving each product category's share of the total

2. **Text Chunking**: Split the narratives with RecursiveCharacterTextSplitter using a chunk size of 500 characters and a 50-character overlap. In the recorded run, the 12,000 narratives produced 12,000 chunks, the longest being 419 characters

3. **Embedding Generation**: Used the all-MiniLM-L6-v2 model to generate 384-dimensional embeddings for the text chunks

4. **Vector Store Creation**: Built a FAISS `IndexFlatIP` index (inner product over L2-normalized vectors, which is cosine similarity) and saved it to `vector_store/` together with the text chunks (`text_chunks.pkl`) and chunk metadata (`metadata.json`)

The vector store is now ready for Task 3, where we'll build the RAG pipeline that retrieves relevant complaint narratives for user queries.