# Pushing our embeddings to a vector database

For our vector database, we chose Pinecone, primarily because it is a hosted service available online (we plan to extend this work into a Streamlit application that demonstrates the functionality of our system) and because it works with text-embedding-ada-002 embeddings without any extra configuration. It also offers a free tier.

In [1]:
from dotenv import load_dotenv
from pinecone import Pinecone
import os
import json
import pandas as pd
import uuid

In [2]:
load_dotenv()

True

In [3]:
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]

In [4]:
pc = Pinecone(api_key=PINECONE_API_KEY)

### Read JSON files with embeddings

In [None]:
lectures_embeddings_df = pd.read_json("../data/subset/lectures_chunks.json", orient="records", lines=False)
exercises_embeddings_df = pd.read_json("../data/subset/exercises_chunks.json", orient="records", lines=False)
notebooks_embeddings_df = pd.read_json("../data/subset/notebooks_chunks.json", orient="records", lines=False)
qas_embeddings_df = pd.read_json("../data/subset/qas_chunks.json", orient="records", lines=False)
references_embeddings_df = pd.read_json("../data/subset/references_chunks.json", orient="records", lines=False)

### Create data to upsert into Pinecone

Each vector gets a random UUID4 as its ID, and the metadata dictionary is built from the corresponding chunk's columns.

In [None]:
def prepare_data_for_pinecone(df):
    pinecone_data = []
    for _, row in df.iterrows():
        embedding = list(row['embedding'])  # Ensure embedding is in list format
        # Prepare the metadata dictionary with the relevant columns
        metadata = {
            "file_type": row['file_type'],
            "file_name": row['file_name'],
            "marker": str(row['marker']),
            "sub_marker": str(row['sub_marker']),
            "first_10_words": row['first_10_words']
        }
        pinecone_data.append({
            'id': str(uuid.uuid4()),
            'values': embedding,
            'metadata': metadata
        })
    return pinecone_data

In [None]:
lectures_pc_data = prepare_data_for_pinecone(lectures_embeddings_df)
exercises_pc_data = prepare_data_for_pinecone(exercises_embeddings_df)
notebooks_pc_data = prepare_data_for_pinecone(notebooks_embeddings_df)
qas_pc_data = prepare_data_for_pinecone(qas_embeddings_df)
references_pc_data = prepare_data_for_pinecone(references_embeddings_df)

### Send to Pinecone

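Run the cells below only once. Because every vector gets a fresh random UUID, running an upsert again does not overwrite the existing vectors; it adds duplicate copies of the same chunks to the index.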
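One way to make repeated upserts harmless, which is not what this notebook does, is to derive each ID deterministically from the chunk's metadata, so that the same chunk always maps to the same ID and an upsert overwrites rather than duplicates. The sketch below uses `uuid.uuid5` from the standard library. The namespace string and the assumption that `file_type`, `file_name`, `marker`, and `sub_marker` together identify a chunk uniquely are ours, not something the notebook guarantees.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "nlp-material-chunks")

# Assumption: these four fields together identify a chunk uniquely.
ID_FIELDS = ("file_type", "file_name", "marker", "sub_marker")


def deterministic_chunk_id(metadata):
    """Return a stable UUID5 string built from the identifying metadata fields."""
    key = "|".join(str(metadata[field]) for field in ID_FIELDS)
    return str(uuid.uuid5(CHUNK_NAMESPACE, key))
```

With this helper, `'id': str(uuid.uuid4())` in `prepare_data_for_pinecone` could become `'id': deterministic_chunk_id(metadata)`. If two chunks ever share all four fields, they would collide and one would overwrite the other, so the field choice needs checking against the real data.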

In [None]:
# index_name = "subset"
index_name = "nlp-material-full"
index = pc.Index(name=index_name)

In [None]:
index.upsert(vectors=lectures_pc_data)

In [None]:
index.upsert(vectors=exercises_pc_data)

In [None]:
index.upsert(vectors=notebooks_pc_data)  # Failed: the request was too large

This failed because the request as a whole was too large: the individual chunks were not especially big, but there were a great many of them.

In [None]:
index.upsert(vectors=qas_pc_data)

In [None]:
index.upsert(vectors=references_pc_data)  # Failed: the message length was too large

This failed because a few chunks were very large (particularly those from HTML files and reference PDFs).
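To see which records drive a failure like this, it can help to rank them by approximate serialized size. The sketch below is a diagnostic we are adding for illustration, not part of the original notebook. It measures each record as UTF-8 JSON, which is only a proxy for what the client actually sends over the wire, and the function names are hypothetical.

```python
import json


def record_size_bytes(record):
    """Approximate the size of one record by serializing it to UTF-8 JSON."""
    return len(json.dumps(record).encode("utf-8"))


def largest_records(records, top_n=5):
    """Return (size_in_bytes, id) pairs for the top_n largest records, largest first."""
    sizes = [(record_size_bytes(record), record["id"]) for record in records]
    return sorted(sizes, reverse=True)[:top_n]
```

Calling `largest_records(references_pc_data)` would list the heaviest records, and summing `record_size_bytes` over a whole list gives a rough figure for the size of a single unbatched request.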

These two failures prompted the following changes:
<ul>
    <li>Use OpenAI's tiktoken to tokenize the text during chunking, so that we get an accurate estimate of the size of each chunk.</li>
    <li>Add a regex to our PDF parser that strips the odd "\<latexit>" tag that appears when converting a few LaTeX PDFs (mostly reference papers). Its content turned out to be a continuous string of 256 characters (possibly some hash) that held no value and only inflated the size of the chunks.</li>
    <li>Batch the data to be upserted.</li>
</ul>
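The batching step can be sketched as follows. This is a minimal illustration rather than the code we ended up using: the helper names and the default `batch_size` of 100 are assumptions, and the right batch size depends on the vector dimension and the amount of metadata per record. The only Pinecone call it relies on is `index.upsert(vectors=...)`, as used above.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(index, vectors, batch_size=100):
    """Upsert vectors one batch at a time and return the number of vectors sent."""
    total_sent = 0
    for batch in iter_batches(vectors, batch_size):
        index.upsert(vectors=batch)  # One request per batch keeps each request small
        total_sent += len(batch)
    return total_sent
```

For example, `upsert_in_batches(index, notebooks_pc_data)` would replace the single `index.upsert(vectors=notebooks_pc_data)` call that failed. Batching only bounds the number of records per request, so very large individual chunks still need to be reduced during chunking.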