# Chunking & Embeddings

Chunking and embedding are foundational steps in building a retrieval-augmented generation (RAG) system. Chunking divides text into smaller parts that can be retrieved and processed individually, while embedding converts these chunks into numerical vectors that represent their semantic meaning. Together, these steps enable efficient retrieval and give the answer-generation stage focused context to work with.

### Chunking

Chunking involves splitting long documents into smaller, coherent segments of text, which can be retrieved and processed more effectively. This prevents large documents from being treated as a single retrieval unit, improving the system's ability to pinpoint relevant information.

#### Benefits of Chunking

- **Improved Retrieval Accuracy**: Smaller chunks make it easier to retrieve precise and relevant information for a query.
- **Reduced Memory Overhead**: Smaller segments keep each embedding request and each retrieved context short, so less text has to be processed per query.
- **Facilitates Parallel Processing**: Each chunk can be embedded independently, so batches of chunks can be processed concurrently.
- **Enhanced Contextual Relevance**: Smaller chunks carry less unrelated text, so a retrieved chunk is more likely to be on topic.

#### Strategy
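For our pipeline, we use the `RecursiveCharacterTextSplitter` from LangChain. It keeps each chunk within a character limit and tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters), so that chunks break at natural boundaries where possible.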

**Parameters**
- **Chunk Size**: The maximum number of characters in each chunk. A larger chunk size includes more context but risks including irrelevant information.
- **Chunk Overlap**: The number of overlapping characters between consecutive chunks. Overlap ensures continuity of context across chunks.
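
To make the two parameters concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a simplified stand-in and not LangChain's actual algorithm: `RecursiveCharacterTextSplitter` additionally prefers to break at separators, which this version ignores. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Hypothetical helper
    for illustration only; it ignores sentence and paragraph boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With a chunk size of 10 and an overlap of 3, each window starts 7 characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next.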

### Embeddings

Embeddings transform text into high-dimensional numerical vectors that encode semantic meaning. These embeddings are used to calculate similarity between queries and chunks, enabling effective retrieval of relevant contexts.
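
The source does not state which similarity measure is used downstream. Cosine similarity is a common choice for text embeddings and is assumed here purely for illustration. For a query embedding $\mathbf{q}$ and a chunk embedding $\mathbf{c}$, both of dimension $d$:

$$
\operatorname{sim}(\mathbf{q}, \mathbf{c})
= \frac{\mathbf{q} \cdot \mathbf{c}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{c} \rVert}
= \frac{\sum_{i=1}^{d} q_i c_i}{\sqrt{\sum_{i=1}^{d} q_i^2} \, \sqrt{\sum_{i=1}^{d} c_i^2}}
$$

The value lies between $-1$ and $1$; higher values indicate that the two vectors point in more similar directions.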

#### Models
We use two OpenAI embedding models in our experiments, accessed through Azure OpenAI:

1. `text-embedding-3-large`

    - A large, general-purpose embedding model capable of capturing nuanced semantic meaning (3072-dimensional vectors by default).

    - Suitable for complex queries requiring detailed contextual understanding.

2. `text-embedding-ada-002`

    - An earlier-generation model that produces smaller, 1536-dimensional vectors, which makes storage and similarity search cheaper.

    - Ideal for scalable applications with a large number of documents or queries.

### Experiments

We experiment with different chunking strategies and embedding models to find the combinations that perform best. The configurations vary chunk size, overlap, and embedding model:

| **Chunk Size (chars)** | **Overlap (chars)** | **Embedding Model**      | **Description**                                                                 |
|------------------------|---------------------|--------------------------|---------------------------------------------------------------------------------|
| 500                    | 50                  | `text-embedding-3-large` | Small chunks with overlap to ensure continuity; tested with the larger model.   |
| 500                    | 50                  | `text-embedding-ada-002` | Same as above, but using the smaller model for comparison.                      |
| 1000                   | 0                   | `text-embedding-ada-002` | Larger chunks with no overlap to assess performance on non-redundant segments.  |
| 1000                   | 100                 | `text-embedding-3-large` | Larger chunks with overlap, paired with the larger model.                       |
| 2500                   | 300                 | `text-embedding-3-large` | Very large chunks with substantial overlap, aimed at preserving context.        |

The generated embeddings for each strategy are saved as pickle files for future retrieval and evaluation. These embeddings form the foundation for later stages of the RAG pipeline, including retrieval and answer generation.
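
One way to keep the five runs traceable is to encode the table as data and derive each output filename from it. The sketch below reuses the filename pattern from the saving step later in this notebook; the `EXPERIMENTS` list and the `embedding_filename` helper are hypothetical names introduced only for illustration.

```python
# Hypothetical encoding of the experiment table: (chunk size, overlap, model).
EXPERIMENTS = [
    (500, 50, "text-embedding-3-large"),
    (500, 50, "text-embedding-ada-002"),
    (1000, 0, "text-embedding-ada-002"),
    (1000, 100, "text-embedding-3-large"),
    (2500, 300, "text-embedding-3-large"),
]


def embedding_filename(chunk_size: int, overlap: int, model: str) -> str:
    """Build the pickle path for one configuration (same pattern as the notebook)."""
    model_tag = model.replace("-", "_")
    return f"../embeddings/recursive_{chunk_size}_chunksize_{overlap}_overlap_{model_tag}.pkl"


filenames = [embedding_filename(*config) for config in EXPERIMENTS]
for name in filenames:
    print(name)
```

Because every parameter appears in the path, each configuration maps to its own file and no run overwrites another.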

## Imports

In [1]:
import pandas as pd
import pickle
import time
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import AzureOpenAI

import credentials

from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import math

In [2]:
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2023-12-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

## Data Preparation and Chunking

Here we load the preprocessed data and chunk the texts with the defined parameters.

In [3]:
CHUNK_SIZE = 2500
CHUNK_OVERLAP = 300

In [4]:
def chunk_text(dataframe, text_column, chunk_size=1000, chunk_overlap=100):
    """Split each row's text into chunks and return one row per chunk."""
    # Initialize RecursiveCharacterTextSplitter with dynamic parameters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    # Split text in the specified column; assign() returns a copy,
    # so the caller's DataFrame is left unmodified
    chunked_df = dataframe.assign(
        content_chunks=dataframe[text_column].apply(text_splitter.split_text)
    )

    # Flatten the DataFrame so that each chunk gets its own row
    chunked_df = chunked_df.explode('content_chunks').reset_index(drop=True)

    return chunked_df

In [5]:
data = pd.read_csv('../data/processed/cleantech_processed.csv')
chunked_data = chunk_text(data, text_column='content', chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

print("Number of articles in the dataset: ", len(data))
print("Amount of chunks: ", len(chunked_data))

chunked_data.head()

Number of articles in the dataset:  9593
Amount of chunks:  21632


Unnamed: 0,title,date,content,domain,url,content_chunks
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,Qatar Petroleum (QP) is targeting aggressive c...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum (QP) is targeting aggressive c...
1,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,Qatar Petroleum (QP) is targeting aggressive c...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,"in the Phase 1 trains. Exxon Mobil, Royal Dutc..."
2,India Launches Its First 700 MW PHWR,2021-01-15,Nuclear Power Corp of India Ltd (NPCIL) synchr...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp of India Ltd (NPCIL) synchr...
3,India Launches Its First 700 MW PHWR,2021-01-15,Nuclear Power Corp of India Ltd (NPCIL) synchr...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,dropped out of the project and another two off...
4,New Chapter for US-China Energy Trade,2021-01-20,New US President Joe Biden took office this we...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,New US President Joe Biden took office this we...


Now that we have chunked the texts into smaller segments, the next step is to pass these chunks through an embedding model to obtain their vector representations. The embedding model maps the textual information into high-dimensional vector spaces, where semantic similarities and relationships are preserved.

## Embedding Creation

The code below generates embeddings for the text chunks in batches while handling rate-limiting errors. It splits the input into batches of size `BATCH_SIZE` and processes them concurrently using `ThreadPoolExecutor` with up to `MAX_WORKERS` threads. The `embed_batch` function sends each batch to the embedding model and returns its embeddings; on a rate-limiting error it waits `RETRY_DELAY` seconds and tries again, for up to three attempts in total. If all attempts fail, or a different error occurs, it returns `None` for every text in the batch. The `embed_in_batches` function coordinates this process: it submits the batches to the thread pool and writes each result back at its batch's original position, so the final list lines up with the input chunks. The pipeline is designed to scale and to tolerate API rate limits.

In [None]:
EMBEDDING_MODEL = "text-embedding-3-large"
BATCH_SIZE = 50
MAX_WORKERS = 8  # Number of threads
RETRY_DELAY = 60  # Retry delay in seconds for rate-limiting errors

# Function to generate embeddings for a batch of texts with retry logic
def embed_batch(text_batch, model, retries=3):
    attempts = 0
    while attempts < retries:
        try:
            response = client.embeddings.create(input=text_batch, model=model)
            return [item.embedding for item in response.data]
        except Exception as e:
            if "rate limit" in str(e).lower() or "429" in str(e):
                print(f"Rate limit exceeded. Retrying in {RETRY_DELAY} seconds...")
                time.sleep(RETRY_DELAY)  # Wait before retrying
                attempts += 1
            else:
                print(f"Error embedding batch: {e}")
                return [None] * len(text_batch)
    print("Failed to embed after multiple attempts.")
    return [None] * len(text_batch)

# Function to process batches in parallel with rate limiting
def embed_in_batches(data, model=EMBEDDING_MODEL, batch_size=BATCH_SIZE, max_workers=MAX_WORKERS):
    total_chunks = len(data)
    num_batches = math.ceil(total_chunks / batch_size)
    
    # Create batches
    batches = [data[i * batch_size: (i + 1) * batch_size] for i in range(num_batches)]
    embeddings = [None] * total_chunks  # Placeholder for embeddings

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {}
        for i, batch in enumerate(batches):
            futures[executor.submit(embed_batch, batch, model)] = i
        
        # Process results as they complete
        for future in tqdm(as_completed(futures), total=len(futures), desc="Embedding in Parallel"):
            batch_index = futures[future]
            try:
                batch_embeddings = future.result()
                embeddings[batch_index * batch_size: (batch_index + 1) * batch_size] = batch_embeddings
            except Exception as e:
                print(f"Error processing batch {batch_index}: {e}")
    
    return embeddings

# Prepare data for embedding generation
texts_to_embed = chunked_data["content_chunks"].tolist()

# Run batch embedding generation with rate limiting
chunked_data['embeddings'] = embed_in_batches(
    texts_to_embed, 
    model=EMBEDDING_MODEL,
    batch_size=BATCH_SIZE,
    max_workers=MAX_WORKERS
)

# Construct the filename dynamically
filename = f"../embeddings/recursive_{CHUNK_SIZE}_chunksize_{CHUNK_OVERLAP}_overlap_{EMBEDDING_MODEL.replace('-', '_')}.pkl"

# Save embeddings and corresponding chunked data to a file
os.makedirs(os.path.dirname(filename), exist_ok=True)  # ensure the output directory exists
with open(filename, 'wb') as f:
    pickle.dump(chunked_data, f)

print(f"Embeddings and chunked data saved to {filename}")

Embedding in Parallel:   1%|          | 4/433 [01:03<2:34:04, 21.55s/it]

Rate limit exceeded. Retrying in 60 seconds...
Rate limit exceeded. Retrying in 60 seconds...
Rate limit exceeded. Retrying in 60 seconds...

[... similar progress and rate-limit lines omitted ...]

Embedding in Parallel:  42%|████▏     | 183/433 [55:37<1:25:25, 20.50s/it]

Failed to embed after multiple attempts.
Rate limit exceeded. Retrying in 60 seconds...

[... similar lines omitted; "Failed to embed after multiple attempts." appears six times in the full output ...]

Embedding in Parallel: 100%|██████████| 433/433 [2:07:27<00:00, 17.66s/it]
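
The retry loop in `embed_batch` waits a fixed `RETRY_DELAY` after every rate-limit error, so threads that hit the limit together also retry together. A common alternative, not used in this notebook, is exponential backoff with random jitter. The sketch below is a generic illustration under stated assumptions: `call_with_backoff`, `RateLimited`, and `flaky_call` are hypothetical names, the API call is simulated, and the tiny delays are chosen only so the example runs quickly.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a provider's rate-limit error (hypothetical)."""


def call_with_backoff(func, max_retries=5, base_delay=0.01, max_delay=1.0):
    """Call func(), retrying on RateLimited with exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # The delay ceiling doubles each attempt, is capped, and the actual
            # wait is randomized so parallel workers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))


calls = {"count": 0}


def flaky_call():
    """Fail twice with RateLimited, then succeed (simulated API call)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited("simulated 429")
    return "ok"


result = call_with_backoff(flaky_call)
print(result, "after", calls["count"], "calls")
```

Raising after the final attempt, rather than returning placeholder values, makes a failed batch explicit to the caller; whether that suits this pipeline depends on how failures should be handled downstream.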


The `chunked_data` DataFrame is structured so that each row represents an individual **chunk** of text derived from the original documents, with a corresponding **embedding vector** stored in the `embeddings` column. Each embedding vector is a dense list of floating-point numbers that encapsulates the semantic meaning of the text chunk.
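
Because `embed_batch` returns `None` for every text in a batch that fails, some rows may lack an embedding after a run. A quick check before saving can reveal this. The sketch below works on a plain list standing in for the `embeddings` column; the helper `missing_indices` and the sample values are hypothetical.

```python
def missing_indices(embeddings):
    """Return the positions whose embedding is None (the batch failed)."""
    return [i for i, emb in enumerate(embeddings) if emb is None]


# Toy stand-in for chunked_data['embeddings'].tolist(); values are made up.
embeddings = [[0.1, 0.2], None, [0.3, 0.4], None]

missing = missing_indices(embeddings)
print(f"{len(missing)} of {len(embeddings)} chunks have no embedding: {missing}")
```

The texts at these positions could then be passed to `embed_in_batches` again, for example with fewer workers to stay under the rate limit.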

These embeddings are critical for the RAG system, as they allow efficient similarity searches and retrieval tasks by representing the content of each text chunk in a high-dimensional vector space. When a query is embedded, the RAG system can quickly locate the most relevant text chunks by identifying embedding vectors in the database that are closest in meaning. This approach enables the system to retrieve contextually relevant information by comparing semantic relationships, streamlining the entire retrieval process.
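
The lookup described above can be sketched as a brute-force nearest-neighbour search. This assumes cosine similarity as the measure of closeness, which the source does not specify. The function `top_k` and the toy 3-dimensional vectors are illustrative only; real embeddings from the models used here have more than a thousand dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, chunk_embeddings, k=2):
    """Return indices of the k chunks most similar to the query, best first."""
    scored = [
        (cosine_similarity(query, emb), idx)
        for idx, emb in enumerate(chunk_embeddings)
        if emb is not None  # skip chunks whose embedding failed
    ]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


# Made-up chunk embeddings; the last entry simulates a failed batch.
chunk_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    None,
]
query = [1.0, 0.05, 0.0]
best = top_k(query, chunk_embeddings, k=2)
print(best)
```

A linear scan like this is enough for a small experiment; dedicated vector indexes become relevant as the number of chunks grows.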

We dynamically construct the filename based on the chunk size, overlap, and embedding model to ensure clarity and traceability in the saved file. Using `pickle`, we save the embeddings along with the corresponding chunked data to a file for later use. This approach allows us to efficiently store and retrieve preprocessed data, reducing the need for re-computation.
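
Loading a saved file in a later stage mirrors the save step. The sketch below round-trips a small dictionary through a temporary file instead of the real DataFrame, so it runs without the dataset; the record contents and the example filename are made up. Note that `pickle.load` can execute arbitrary code while loading, so it should only be used on files from a trusted source.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the chunked DataFrame with its embeddings column.
record = {
    "content_chunks": ["first chunk", "second chunk"],
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "recursive_500_chunksize_50_overlap_example.pkl")

    # Save: same pattern as the notebook's pickle.dump call.
    with open(path, "wb") as f:
        pickle.dump(record, f)

    # Load: restores an equal object without recomputing embeddings.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == record)
```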