# Creating the table

- **`id`**: A `bigint` primary key that auto-increments, giving each stored chunk a unique identifier.

- **`content`**: A `longtext` column for a chunk of the document's main content.

- **`v`**: A `vector` column that stores a 768-dimension embedding of the chunk. This enables similarity search over the document's content.

- **`metadata`**: A `JSON` column for flexible storage of additional document attributes (this example stores only the document name, but you can include other attributes as needed).

In [11]:
%%sql
DROP TABLE IF EXISTS embeddings;
CREATE TABLE `embeddings` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `content` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
  `v` vector(768),
  `metadata` JSON,
  PRIMARY KEY (`id`)
);
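
Because the `v` column is declared as `vector(768)`, each embedding should have exactly 768 elements, and a vector of any other length is expected to be rejected on insert. The following is a minimal sketch of a guard that could run before an insert. The names `to_vector_literal` and `EXPECTED_DIM` are hypothetical and not part of SingleStore or this notebook; the JSON array format matches the `json.dumps(vector)` call used in the insert code later in this notebook.

```python
import json

EXPECTED_DIM = 768  # assumption: must match the vector(768) column in the DDL


def to_vector_literal(vector, expected_dim=EXPECTED_DIM):
    """Return a JSON array string for `vector`, or raise if the length is wrong."""
    values = [float(x) for x in vector]
    if len(values) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} dimensions, got {len(values)}"
        )
    return json.dumps(values)


# Example: a zero vector of the right length serializes without error.
literal = to_vector_literal([0.0] * 768)
```

If you switch to a model with a different output size, both the DDL and `EXPECTED_DIM` would need to change together.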

In [12]:
!pip install boto3 sentence_transformers pdfplumber langchain --quiet

In [14]:
import boto3
import pdfplumber
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import io
from concurrent.futures import ThreadPoolExecutor
import singlestoredb as s2
import json
import numpy as np

In [16]:
# initialize S3 client
session = boto3.Session(
    aws_access_key_id='x', # replace with your access_key
    aws_secret_access_key='y', # replace with your secret_access_key
    aws_session_token='z' # replace with your session_token
)

s3_client = session.client('s3')

# set bucket name
bucket_name = 'example-vecs'

In [17]:
# List all PDF files in the bucket and save to variable
response = s3_client.list_objects_v2(Bucket=bucket_name)
pdf_keys = [obj['Key'] for obj in response.get('Contents', []) if obj['Key'].endswith('.pdf')]
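
A single `list_objects_v2` call returns at most 1,000 keys, so the listing above can miss files in a larger bucket. As a hedged sketch, the boto3 paginator for `list_objects_v2` can be used to walk every page. The helper name `list_pdf_keys` is hypothetical; it accepts any client object that exposes `get_paginator`, such as the `s3_client` created above.

```python
def list_pdf_keys(s3_client, bucket_name):
    """Collect every key ending in .pdf, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name):
        # Pages for an empty bucket (or prefix) have no 'Contents' entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".pdf"):
                keys.append(obj["Key"])
    return keys


# Usage with the client and bucket defined above (not executed here):
# pdf_keys = list_pdf_keys(s3_client, bucket_name)
```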

## Embedding Model
Here, we define our embedding model. This example uses an open-source model, but you can swap in any embedding model of your choice. Just make sure the vector dimension in the table DDL matches the model's output dimension.

In [None]:
# load a pre-trained model for embeddings
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

## Chunking Methods

Chunking is a preprocessing step for RAG that breaks large documents into smaller, manageable pieces of text. Smaller chunks generally allow more precise retrieval but may lack broader context. Larger chunks provide more context but can introduce noise and reduce the specificity of what is retrieved. Finding a good chunk size usually takes experimentation and depends on factors such as the nature of the content, the end user's familiarity with the subject matter, and the specific requirements of the application.

### 1. Fixed-Size Chunking
- **Description**: Divides data into chunks of a predetermined size.
- **Pros**:
  - Simple to implement.
  - Efficient for uniform data.
- **Cons**:
  - May split meaningful data units.
  - Inefficient for variable-length data.
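
A minimal sketch of fixed-size chunking, measured in characters. The function name `fixed_size_chunks` is hypothetical and used only for illustration.

```python
def fixed_size_chunks(text, chunk_size):
    """Split `text` into consecutive pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


pieces = fixed_size_chunks("abcdefghij", 4)
print(pieces)  # ['abcd', 'efgh', 'ij']
```

Note how the last piece is shorter, and how a cut can land in the middle of a word or sentence.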

### 2. Semantic Chunking
- **Description**: Divides data based on semantic boundaries (e.g., sentences or paragraphs).
- **Pros**:
  - Preserves meaningful data units.
  - Useful for natural language processing.
- **Cons**:
  - Computationally expensive.
  - Requires language understanding.
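
As a rough sketch of the sentence-boundary variant, the code below groups whole sentences into chunks under a character limit. It assumes a naive rule for sentence ends (`.`, `!`, or `?` followed by whitespace), which will mis-split abbreviations; real semantic chunkers use more robust language tooling. The name `sentence_chunks` is hypothetical.

```python
import re


def sentence_chunks(text, max_chars):
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    # Naive boundary rule: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Chunking splits text. Each chunk is embedded. Retrieval uses the vectors."
print(sentence_chunks(sample, 50))
# ['Chunking splits text. Each chunk is embedded.', 'Retrieval uses the vectors.']
```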

### 3. Overlapping Chunking
- **Description**: Creates chunks with overlapping sections to ensure continuity.
- **Pros**:
  - Maintains context between chunks.
  - Reduces boundary issues.
- **Cons**:
  - Increases data redundancy.
  - Higher storage and processing requirements.
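
A minimal sketch of overlapping chunking with a sliding character window. The name `overlapping_chunks` is hypothetical; its `chunk_size` and `overlap` arguments play the same role as the `chunk_size` and `chunk_overlap` settings used later in this notebook, although the splitter used there works differently internally.

```python
def overlapping_chunks(text, chunk_size, overlap):
    """Split `text` into `chunk_size`-character windows sharing `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


print(overlapping_chunks("abcdefghij", 4, 2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the redundancy listed under the cons above.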

### 4. Dynamic Chunking
- **Description**: Adjusts chunk size based on data characteristics or processing needs.
- **Pros**:
  - Flexible and adaptive.
  - Optimizes performance for varying data.
- **Cons**:
  - Complex to implement.
  - Requires real-time analysis.
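
Dynamic chunking can take many forms. The sketch below shows one simple interpretation, assumed here for illustration only: the chunk size is derived from the document length so that each document yields roughly the same number of chunks, within fixed bounds. The function names and default values are hypothetical.

```python
def dynamic_chunk_size(text_length, target_chunks=20, min_size=200, max_size=2000):
    """Pick a chunk size that yields about `target_chunks` chunks, within bounds."""
    if text_length <= 0:
        return min_size
    ideal = -(-text_length // target_chunks)  # ceiling division
    return max(min_size, min(max_size, ideal))


def dynamic_chunks(text, **kwargs):
    """Split `text` using a chunk size chosen from its length."""
    size = dynamic_chunk_size(len(text), **kwargs)
    return [text[i:i + size] for i in range(0, len(text), size)]


print(dynamic_chunk_size(1_000))    # 200  (clamped to min_size)
print(dynamic_chunk_size(10_000))   # 500
print(dynamic_chunk_size(100_000))  # 2000 (clamped to max_size)
```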


Here we use `RecursiveCharacterTextSplitter` with a chunk overlap, which makes it an overlapping chunking method. It splits text recursively on a list of separators until each piece fits within a fixed chunk size measured in characters, and it repeats up to a specified number of characters between neighboring chunks. This keeps chunks manageable while preserving context across chunk boundaries.
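
To illustrate the recursive idea only, here is a simplified sketch that tries a coarse separator first and falls back to finer ones for pieces that are still too long. This is not LangChain's implementation: unlike `RecursiveCharacterTextSplitter`, it drops the separators, does not merge small neighboring pieces, and adds no overlap. The name `recursive_split` and the separator order are assumptions for this example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` so every piece is at most `chunk_size` characters.

    Tries the first separator; any piece that is still too long is split
    again with the remaining, finer separators.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, chunk_size, rest))
    return pieces


print(recursive_split("alpha beta\n\ngamma delta epsilon", 12))
# ['alpha beta', 'gamma', 'delta', 'epsilon']
```

The first paragraph fits within the limit and is kept whole, while the second is too long and falls through to splitting on spaces.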

In [None]:
# initialize text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=50)

In [19]:
def process_pdf(pdf_key):
    """
    Processes a PDF by downloading it from S3, extracting its text, splitting the text into chunks, vectorizing each chunk, 
    and storing the results in SingleStore.
    """
    connection = None
    chunks = []  # defined up front so the final return works even if an earlier step fails
    try:
        # Download PDF from S3
        pdf_object = s3_client.get_object(Bucket=bucket_name, Key=pdf_key)
        pdf_content = pdf_object['Body'].read()

        # Extract text from PDF using pdfplumber
        with pdfplumber.open(io.BytesIO(pdf_content)) as pdf:
            full_text = ''
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    full_text += page_text.encode('utf-8', errors='replace').decode('utf-8', errors='ignore')

        # Split text into chunks
        chunks = text_splitter.split_text(full_text)

        # Vectorize each chunk
        vecs = [model.encode(chunk).tolist() for chunk in chunks]  # Convert to list

        # Connect to the database
        connection = s2.connect(**db_config)
        cursor = connection.cursor()

        # Insert each chunk and vector into the database
        for chunk, vector in zip(chunks, vecs):
            metadata = {'pdf_key': pdf_key}  # using document name as metadata. Add to this as needed
            insert_query = """
            INSERT INTO embeddings (content, v, metadata)
            VALUES (%s, %s, %s)
            """
            cursor.execute(insert_query, (chunk, json.dumps(vector), json.dumps(metadata)))

        # Commit the transaction
        connection.commit()

    except Exception as e:
        print(f"Error processing {pdf_key}: {e}")
    finally:
        if connection:
            connection.close()

    return pdf_key, len(chunks)

In [20]:
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(process_pdf, pdf_key): pdf_key for pdf_key in pdf_keys}
    for future in futures:
        pdf_key = futures[future]
        try:
            key, num_chunks = future.result()
            print(f"Processed {key}: {num_chunks} chunks")
        except Exception as e:
            print(f"Error processing {pdf_key}: {e}")

Processed 1706.03762v7.pdf: 19 chunks
Processed Adaptive Uncertainty Quantification for Scenarioba.pdf: 17 chunks
Processed AudioInsight Detecting Social Contexts Relevant to.pdf: 24 chunks
Processed Catastrophic Goodhart regularizing RLHF with KL di.pdf: 24 chunks
Processed ChatQA 2 Bridging the Gap to Proprietary LLMs in L.pdf: 17 chunks
Processed CheckEval A Checklistbased Approach for Evaluating.pdf: 18 chunks
Processed Coarsegraining bistability with the Martini force .pdf: 15 chunks
Processed Combining Gradient Information and Primitive Direc.pdf: 25 chunks
Processed Computing ground states of spin2 BoseEinstein cond.pdf: 30 chunks
Processed Conformal Thresholded Intervals for Efficient Regr.pdf: 18 chunks
Processed Contrastive Learning with Counterfactual Explanati.pdf: 26 chunks
Processed DEAL Disentangle and Localize Conceptlevel Explana.pdf: 32 chunks
Processed DEPICT DiffusionEnabled Permutation Importance for.pdf: 43 chunks
Processed DataCentric Human Preference Optimizatio
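
The loop above reads results in submission order, so a slow early file delays the messages for files that finished sooner. As an optional variation, `concurrent.futures.as_completed` yields each future as soon as it finishes. The sketch below uses a hypothetical stand-in, `fake_process`, in place of `process_pdf` so that it runs without S3 or a database.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_process(key):
    """Stand-in for process_pdf; returns the key and a made-up chunk count."""
    return key, len(key)


keys = ["a.pdf", "bb.pdf", "ccc.pdf"]
results = {}
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {executor.submit(fake_process, key): key for key in keys}
    # Futures are yielded in completion order, not submission order.
    for future in as_completed(futures):
        key = futures[future]
        try:
            _, num_chunks = future.result()
            results[key] = num_chunks
        except Exception as exc:
            print(f"Error processing {key}: {exc}")
```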

In [37]:
%%sql
select count(*) from embeddings;

count(*)
1741


In [4]:
%%sql
select metadata from embeddings limit 5;

metadata
{'pdf_key': 'Catastrophic Goodhart regularizing RLHF with KL di.pdf'}
{'pdf_key': 'AudioInsight Detecting Social Contexts Relevant to.pdf'}
{'pdf_key': 'Conformal Thresholded Intervals for Efficient Regr.pdf'}
{'pdf_key': 'DataCentric Human Preference Optimization with Rat.pdf'}
{'pdf_key': 'Evaluating the Reliability of SelfExplanations in .pdf'}
