# Chunking & Embeddings

In this notebook, we split the article texts into chunks and vectorize the chunks with two different embedding models. We use the following chunking strategies (sizes in characters):
- 1000 chunk size with 100 overlap
- 500 chunk size with 50 overlap
- 2500 chunk size with 300 overlap

We will use the following models:
- `text-embedding-ada-002`
- `text-embedding-3-large`

The resulting embeddings are saved as a pickle file for later use.
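
If every chunking strategy is paired with every model, this gives six configurations. A minimal sketch of enumerating them, assuming each run is saved under the naming pattern used in the save cell at the end of this notebook (`CHUNKING_STRATEGIES`, `EMBEDDING_MODELS`, and `output_name` are illustrative names, not part of the notebook):

```python
from itertools import product

# Chunking strategies listed above: (chunk size, overlap)
CHUNKING_STRATEGIES = [(1000, 100), (500, 50), (2500, 300)]
EMBEDDING_MODELS = ["text-embedding-ada-002", "text-embedding-3-large"]

def output_name(chunk_size, chunk_overlap, model):
    """Build a file name following the pattern used when saving the pickle."""
    return (
        f"recursive_{chunk_size}_chunksize_{chunk_overlap}_overlap_"
        f"{model.replace('-', '_')}.pkl"
    )

# Pair every chunking strategy with every embedding model
configs = list(product(CHUNKING_STRATEGIES, EMBEDDING_MODELS))
for (chunk_size, chunk_overlap), model in configs:
    print(output_name(chunk_size, chunk_overlap, model))
```

The cells below run one configuration at a time, selected through `CHUNK_SIZE`, `CHUNK_OVERLAP`, and `EMBEDDING_MODEL`.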

## Imports and Setup

In [82]:
import pandas as pd
import time
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from openai import AzureOpenAI
import openai

import credentials

from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import math

In [83]:
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2023-12-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

## Data Preparation and Chunking

Here we load the preprocessed data and chunk the texts with the defined parameters.

In [84]:
data = pd.read_csv('../data/processed/cleantech_processed.csv')
data.head()

print("Number of articles in the dataset: ", len(data))

Unnamed: 0,title,date,content,domain,url
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,Qatar Petroleum (QP) is targeting aggressive c...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1,India Launches Its First 700 MW PHWR,2021-01-15,Nuclear Power Corp of India Ltd (NPCIL) synchr...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
2,New Chapter for US-China Energy Trade,2021-01-20,New US President Joe Biden took office this we...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,The slow pace of Japanese reactor restarts con...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,Two of New York City's largest pension funds s...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...


In [86]:
CHUNK_SIZE = 2500
CHUNK_OVERLAP = 300
EMBEDDING_MODEL = "text-embedding-3-large"

In [87]:
def chunk_text(dataframe, text_column, chunk_size=1000, chunk_overlap=100):
    """Split each text in `text_column` into chunks and return one row per chunk."""
    # Initialize RecursiveCharacterTextSplitter with the given parameters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    # Work on a copy so the caller's DataFrame is not modified
    chunked_df = dataframe.copy()
    # Split the text in the specified column into a list of chunks per row
    chunked_df['content_chunks'] = chunked_df[text_column].apply(text_splitter.split_text)

    # Flatten so that each chunk gets its own row
    chunked_df = chunked_df.explode('content_chunks').reset_index(drop=True)

    return chunked_df

chunked_data = chunk_text(data, text_column='content', chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
chunked_data.head()

Unnamed: 0,title,date,content,domain,url,content_chunks
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,Qatar Petroleum (QP) is targeting aggressive c...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum (QP) is targeting aggressive c...
1,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,Qatar Petroleum (QP) is targeting aggressive c...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,"in the Phase 1 trains. Exxon Mobil, Royal Dutc..."
2,India Launches Its First 700 MW PHWR,2021-01-15,Nuclear Power Corp of India Ltd (NPCIL) synchr...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp of India Ltd (NPCIL) synchr...
3,India Launches Its First 700 MW PHWR,2021-01-15,Nuclear Power Corp of India Ltd (NPCIL) synchr...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,dropped out of the project and another two off...
4,New Chapter for US-China Energy Trade,2021-01-20,New US President Joe Biden took office this we...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,New US President Joe Biden took office this we...


In [88]:
len(chunked_data)

21632
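
To make the roles of chunk size and overlap concrete, here is a simplified fixed-window chunker. It is only an illustration: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its chunk boundaries will generally differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last three characters of one chunk reappear at the start of the next.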

Now that we have chunked the texts into smaller segments, the next step is to pass these chunks through an embedding model to obtain their vector representations. The embedding model maps each chunk to a point in a high-dimensional vector space in which semantically similar texts lie close together.

## Embedding Creation

The code below generates embeddings for the chunks in batches and handles rate-limiting errors. It splits the input into batches of `BATCH_SIZE` texts and processes them concurrently using a `ThreadPoolExecutor` with up to `MAX_WORKERS` threads. The `embed_batch` function sends one batch to the embedding model. On a rate-limiting error it waits `RETRY_DELAY` seconds and tries again, for up to three attempts; on any other error, or once the attempts are used up, it returns `None` for every text in the batch. The `embed_in_batches` function submits the batches to the thread pool and writes each result back at its original position, so the returned list lines up with the input chunks.
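
The retry logic in this notebook waits a fixed delay between attempts. A common alternative, shown here only as a sketch and not as what this notebook does, is exponential backoff with jitter, so that parallel workers do not all retry at the same moment. `call_with_backoff`, `TransientError`, and the injected `sleep` and `rng` parameters are hypothetical names chosen for the example.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a rate-limit error raised by an API client."""

def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep, rng=random.random):
    """Call fn(); on TransientError wait base_delay * 2**attempt plus jitter, then retry.

    Returns fn()'s result, or re-raises the last TransientError once
    all retries are exhausted.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == retries:
                raise
            delay = base_delay * (2 ** attempt)
            sleep(delay + rng() * delay)  # jitter of up to 100% of the delay

# Demo with a fake call that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientError("rate limit")
    return "ok"

# Record the waits instead of sleeping, and switch jitter off for the demo.
waits = []
result = call_with_backoff(flaky, base_delay=1.0, sleep=waits.append, rng=lambda: 0.0)
print(result, waits)
```

Injecting `sleep` and `rng` keeps the demo fast and deterministic; in real use the defaults would apply.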

In [89]:
BATCH_SIZE = 50
MAX_WORKERS = 8  # Number of threads
RETRY_DELAY = 60  # Retry delay in seconds for rate-limiting errors

# Function to generate embeddings for a batch of texts with retry logic
def embed_batch(text_batch, model, retries=3):
    attempts = 0
    while attempts < retries:
        try:
            response = client.embeddings.create(input=text_batch, model=model)
            return [item.embedding for item in response.data]
        except Exception as e:
            if "rate limit" in str(e).lower() or "429" in str(e):
                print(f"Rate limit exceeded. Retrying in {RETRY_DELAY} seconds...")
                time.sleep(RETRY_DELAY)  # Wait before retrying
                attempts += 1
            else:
                print(f"Error embedding batch: {e}")
                return [None] * len(text_batch)
    print("Failed to embed after multiple attempts.")
    return [None] * len(text_batch)

# Function to process batches in parallel with rate limiting
def embed_in_batches(data, model=EMBEDDING_MODEL, batch_size=BATCH_SIZE, max_workers=MAX_WORKERS):
    total_chunks = len(data)
    num_batches = math.ceil(total_chunks / batch_size)
    
    # Create batches
    batches = [data[i * batch_size: (i + 1) * batch_size] for i in range(num_batches)]
    embeddings = [None] * total_chunks  # Placeholder for embeddings

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {}
        for i, batch in enumerate(batches):
            futures[executor.submit(embed_batch, batch, model)] = i
        
        # Process results as they complete
        for future in tqdm(as_completed(futures), total=len(futures), desc="Embedding in Parallel"):
            batch_index = futures[future]
            try:
                batch_embeddings = future.result()
                embeddings[batch_index * batch_size: (batch_index + 1) * batch_size] = batch_embeddings
            except Exception as e:
                print(f"Error processing batch {batch_index}: {e}")
    
    return embeddings

# Prepare data for embedding generation
texts_to_embed = chunked_data["content_chunks"].tolist()

# Run batch embedding generation with rate limiting
chunked_data['embeddings'] = embed_in_batches(
    texts_to_embed, 
    model=EMBEDDING_MODEL,
    batch_size=BATCH_SIZE,
    max_workers=MAX_WORKERS
)

Embedding in Parallel:   0%|          | 0/433 [00:00<?, ?it/s]

Rate limit exceeded. Retrying in 60 seconds...
[...]

Embedding in Parallel:   6%|▋         | 28/433 [13:49<4:40:38, 41.58s/it]

Failed to embed after multiple attempts.
[...]

Embedding in Parallel:  37%|███▋      | 159/433 [54:20<2:07:04, 27.83s/it]

Error embedding batch: Request timed out.
[...]

Embedding in Parallel: 100%|██████████| 433/433 [2:11:11<00:00, 18.18s/it]
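
Batches that fail in `embed_batch` come back as lists of `None`, so some rows of the `embeddings` column can be empty after a run in which "Failed to embed after multiple attempts." or "Error embedding batch" was printed. A minimal sketch of locating those rows and re-embedding only them; `reembed_missing` and the stand-in `fake_embed` function are hypothetical.

```python
def reembed_missing(texts, embeddings, embed_fn, batch_size=50):
    """Fill None entries in embeddings by re-embedding the matching texts.

    embed_fn takes a list of texts and returns one vector (or None) per text.
    Returns the number of entries that are still missing afterwards.
    """
    missing = [i for i, emb in enumerate(embeddings) if emb is None]
    for start in range(0, len(missing), batch_size):
        idx_batch = missing[start:start + batch_size]
        vectors = embed_fn([texts[i] for i in idx_batch])
        for i, vec in zip(idx_batch, vectors):
            embeddings[i] = vec
    return sum(emb is None for emb in embeddings)

# Stand-in embedder for the demo: maps each text to a 1-dimensional vector.
def fake_embed(batch):
    return [[float(len(t))] for t in batch]

texts = ["a", "bb", "ccc", "dddd"]
embeddings = [[1.0], None, None, [4.0]]
still_missing = reembed_missing(texts, embeddings, fake_embed, batch_size=1)
print(embeddings, still_missing)
```

In the notebook, `embed_fn` could be a small wrapper such as `lambda batch: embed_batch(batch, EMBEDDING_MODEL)`, applied to `texts_to_embed` and the list form of the `embeddings` column.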


In [90]:
print("Dimension of the embeddings: ", len(chunked_data['embeddings'][0]))

Dimension of the embeddings:  3072


The `chunked_data` DataFrame is structured so that each row represents an individual **chunk** of text derived from the original documents, with a corresponding **embedding vector** stored in the `embeddings` column. Each embedding vector is a dense list of floating-point numbers (1536 dimensions for the `text-embedding-ada-002` model and 3072 dimensions for the `text-embedding-3-large` model) that encodes the semantic meaning of the text chunk. Rows that belong to a batch that failed in `embed_batch` hold `None` instead of a vector.

These embeddings are central to the RAG system because they make similarity search possible: each text chunk is represented as a point in a high-dimensional vector space. When a query is embedded with the same model, the system retrieves the most relevant chunks by finding the stored vectors that lie closest to the query vector, so retrieval is based on semantic similarity rather than exact keyword matches.
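
As an illustration of the retrieval step described above, the following sketch ranks chunk vectors by cosine similarity to a query vector. It uses tiny hand-made vectors and plain Python; a real system would typically use a vector database or a numerical library, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, chunk_vectors, k=2):
    """Return indices of the k chunk vectors most similar to the query."""
    scores = [cosine_similarity(query, vec) for vec in chunk_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Three toy "chunk embeddings" and one toy "query embedding"
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query = [0.2, 1.0, 0.0]
best = top_k(query, chunk_vectors, k=2)
print(best)
```

The query points mostly along the second axis, so the third chunk vector ranks first.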

In [91]:
import pickle

# Construct the filename dynamically
filename = f"../embeddings/recursive_{CHUNK_SIZE}_chunksize_{CHUNK_OVERLAP}_overlap_{EMBEDDING_MODEL.replace('-', '_')}.pkl"

# Save embeddings and corresponding chunked data to a file
with open(filename, 'wb') as f:
    pickle.dump(chunked_data, f)

print(f"Embeddings and chunked data saved to {filename}")

Embeddings and chunked data saved to ../data/experiments/recursive_2500_chunksize_300_overlap_text_embedding_3_large.pkl
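
A saved file can be checked by loading it back. The sketch below does a round trip with the standard-library `pickle` module on a small stand-in object written to a temporary file; the record layout and file name are illustrative only. For the DataFrame saved above, `pd.read_pickle(filename)` would be the analogous call. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for the chunked data: a list of records with text and embedding.
records = [
    {"content_chunks": "first chunk", "embeddings": [0.1, 0.2, 0.3]},
    {"content_chunks": "second chunk", "embeddings": [0.4, 0.5, 0.6]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_embeddings.pkl")
    # Write the object to disk ...
    with open(path, "wb") as f:
        pickle.dump(records, f)
    # ... and read it back to confirm nothing was lost
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == records)
```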


Next step: a metric-based assessment of the embedding results on artificially created chunk-query pairs (intrinsic evaluation).