Load the document

TODO: Do we want to add text splitting? e.g.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50
)
text_chunks_docx = text_splitter.split_text("\n".join(text_chunks))

EXERCISE OVERVIEW

In this exercise, we'll step through:
A. Chunking a document into paragraphs.
B. Storing the raw text of every paragraph in our database.
C. Retrieving those paragraphs from the database, embedding (vectorizing) each one, and storing the result in the vector store (pgvector on Postgres, in our case).
D. Acting as a user and sending in a query, which is vectorized by the same tool we used to vectorize our paragraphs.
E. Finding and retrieving the vectors that are most similar to our query.
F. Submitting the text chunks behind those similar vectors, along with the user query, to our LLM and printing the response.


NOTE: When you set up directories for labs, you may have to run this command to add the virtual environment as a kernel (otherwise things still work, but you will get some warnings).
python -m ipykernel install --user --name=myenv --display-name "Python (myenv)"


Step 1 - Import docx library for Python and instantiate a document object

In [21]:
from docx import Document as DocxDocument
doc = DocxDocument("Jupyter_Notebook_Info.docx")

Step 2 - Extract and chunk into paragraphs

In [22]:
doc_chunks = []
for para in doc.paragraphs:
    # Skip empty or whitespace-only paragraphs
    if para.text.strip():
        doc_chunks.append(para.text)
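
The paragraph filter above can also be wrapped in a small helper so it can be tried without a .docx file. This is only a sketch: `extract_paragraphs` is a hypothetical name, and `SimpleNamespace` objects stand in for python-docx paragraphs (only their `.text` attribute is used). Unlike the loop above, this version also strips leading and trailing whitespace from the stored text.

```python
from types import SimpleNamespace


def extract_paragraphs(paragraphs):
    """Return the stripped text of every non-empty paragraph."""
    return [p.text.strip() for p in paragraphs if p.text.strip()]


# Stand-ins for doc.paragraphs (assumption: only .text is needed)
fake_paragraphs = [
    SimpleNamespace(text="Installing Jupyter"),
    SimpleNamespace(text="   "),
    SimpleNamespace(text="  Run the notebook server.  "),
]
chunks = extract_paragraphs(fake_paragraphs)
print(chunks)
```

With a real document, the call would be `extract_paragraphs(doc.paragraphs)`.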

Step 3 (First time only)

Make sure your env file has the proper database configuration info.

If you have not yet run create_db_python.py, take a minute to browse it and then run it.
This creates your relational database tables.

Then run create_pgvector_table.py. This creates the vector store table, mmr_vector.
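
Before running either script, it may help to confirm that the settings read later by get_connection() are present. The sketch below is not part of the lab files: `missing_db_settings` is a hypothetical helper, and the example values are placeholders. The setting names are the ones passed to os.getenv() in Step 4.

```python
import os

# Names read by get_connection() in Step 4
REQUIRED_DB_SETTINGS = ("dbname", "dbuser", "dbpassword", "dbhost", "dbport")


def missing_db_settings(env=None):
    """Return the names of required settings that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_DB_SETTINGS if not env.get(name)]


# Placeholder values for illustration only
example_env = {"dbname": "labs", "dbuser": "student", "dbhost": "localhost"}
missing = missing_db_settings(example_env)
print(missing)
```

Calling `missing_db_settings()` with no argument checks the real environment, so it is only meaningful after the env file has been loaded.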



Step 4 - Connect to the database
(The database should now be created.)

In [23]:
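import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()  # Read the database settings from the env file


def get_connection():
    """Open a new database connection using the settings from the environment."""
    try:
        return psycopg2.connect(
            dbname=os.getenv("dbname"),
            user=os.getenv("dbuser"),
            password=os.getenv("dbpassword"),
            host=os.getenv("dbhost"),
            port=os.getenv("dbport"),
        )
    except psycopg2.Error as error:
        # Report the problem, then re-raise so callers never get an undefined connection
        print(f"Error: {error}")
        raise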



Step 5 - Store the chunks in the database

In [25]:
from psycopg2 import sql

try:
    conn = get_connection()
    with conn.cursor() as cursor:
        # Insert the text of each chunk into the text_chunks table
        for chunk in doc_chunks:
            cursor.execute(
                sql.SQL("INSERT INTO text_chunks (text) VALUES (%s) RETURNING pk"),
                [chunk],
            )
            pk = cursor.fetchone()[0]  # (Optional) capture the returned primary key
    
            conn.commit()  # Commit the transaction after each insert
except Exception as e:
    print(f"Error inserting chunks: {e}")
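
The cell above commits after every insert. One alternative, sketched below, is to insert all chunks in a single transaction and roll back if any insert fails, so the table never ends up with a partial document. `insert_chunks` and the two stub classes are hypothetical; the stubs only exist so the sketch runs without a database. With a real connection, pass the result of get_connection().

```python
def insert_chunks(conn, chunks):
    """Insert every chunk in one transaction and return the new primary keys."""
    pks = []
    cursor = conn.cursor()
    try:
        for chunk in chunks:
            cursor.execute(
                "INSERT INTO text_chunks (text) VALUES (%s) RETURNING pk",
                [chunk],
            )
            pks.append(cursor.fetchone()[0])
        conn.commit()  # One commit for the whole batch
    except Exception:
        conn.rollback()  # Leave the table unchanged if any insert fails
        raise
    finally:
        cursor.close()
    return pks


class StubCursor:
    """Hypothetical stand-in that hands out increasing primary keys."""

    def __init__(self):
        self.next_pk = 0

    def execute(self, query, params):
        self.next_pk += 1

    def fetchone(self):
        return (self.next_pk,)

    def close(self):
        pass


class StubConnection:
    """Hypothetical stand-in that counts commits and rollbacks."""

    def __init__(self):
        self.committed = 0
        self.rolled_back = 0
        self._cursor = StubCursor()

    def cursor(self):
        return self._cursor

    def commit(self):
        self.committed += 1

    def rollback(self):
        self.rolled_back += 1


stub = StubConnection()
new_pks = insert_chunks(stub, ["first paragraph", "second paragraph"])
print(new_pks, stub.committed)
```

The trade-off is that a failure late in a large document discards the earlier inserts as well, which is usually what you want when the chunks only make sense together.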

Step 6 - Vectorization
A. Retrieve Current Chunks from Database
B. Create embedding of each chunk
C. Store vector of embedding in pgvector (vector store)

In [35]:
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
import os

load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")
# Retrieve the chunks that have not been vectorized yet
conn = get_connection()
with conn.cursor() as cursor:
    cursor.execute(
        sql.SQL("SELECT pk, text FROM text_chunks WHERE is_vectorized = FALSE"),
    )
    current_chunks = cursor.fetchall()  # List of (pk, text) tuples

# Get the embedding model
openai_embedding = OpenAIEmbeddings(model="text-embedding-3-small", api_key=openai_key)

# Generate the embedding for each chunk, keyed by the chunk's primary key
vector_dict = {}
for chunk_id, text in current_chunks:
    content = openai_embedding.embed_query(text)
    # Convert the embedding values to floats (ensures compatibility with storage formats)
    vector_dict[chunk_id] = [float(x) for x in content]

# Add the vectorized content to the vector store
with conn.cursor() as cursor:
    chunk_type = "text"  # Only working with text right now, so hard-coding chunk_type = "text"

    for cid, vec in vector_dict.items():
        cursor.execute(
            sql.SQL("INSERT INTO mmr_vector (vector, chunk_type, chunk_id) VALUES (%s, %s, %s)"),
            [vec, chunk_type, cid],
        )
        # Flag the chunk only after its vector has been inserted, so a failed
        # embedding call cannot leave a chunk marked as vectorized without a vector
        cursor.execute(
            sql.SQL("UPDATE text_chunks SET is_vectorized = TRUE WHERE pk = %s"),
            (cid,),
        )

conn.commit()
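
In the cell above, psycopg2 sends each embedding as a PostgreSQL array, and the insert relies on pgvector converting that array to a vector. An alternative, sketched here, is to pass the vector in pgvector's text form (for example '[0.1,0.2]') and cast it explicitly in the SQL with %s::vector. `to_pgvector_literal` is a hypothetical helper, not part of either library.

```python
def to_pgvector_literal(values):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


literal = to_pgvector_literal([0.25, -1, 3])
print(literal)
```

The string can then be bound as an ordinary parameter, which keeps the SQL independent of how the driver adapts Python lists.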

Step 7 - Vectorize Incoming Query 

In [36]:




query = "Where should I point my web browser after Jupyter is running?"
# query = "What is the command to install jupyter?"
# query = "You can execute a cell by clicking on it and pressing what?"
# query = "What City and State had the highest temperature?"

vectorized_query = openai_embedding.embed_query(query)

Step 8 - Find Similar Vectors

pgvector distance operators:
<->: Euclidean (L2) distance between two vectors, i.e. the "straight-line" distance between them in multi-dimensional space.
<=>: Cosine distance (1 minus cosine similarity). It is often preferred for high-dimensional data because it depends on the angle between vectors rather than their magnitude. The query below converts it back to a similarity with 1 - (vector <=> query).
<#>: Negative inner product. The inner product multiplies corresponding elements and sums the results; pgvector returns its negative, so multiply by -1 to recover the inner product.
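
To make the three operators concrete, here is a small pure-Python sketch of the quantities they compute. The function names and the two example vectors are made up for illustration; pgvector does this work inside the database.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the quantity behind <->."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Sum of element-wise products; <#> returns the negative of this."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 minus cosine similarity, the quantity behind <=>."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1 - inner_product(a, b) / (norm_a * norm_b)


a = [3.0, 4.0]
b = [4.0, 3.0]
print(l2_distance(a, b))      # corresponds to a <-> b
print(cosine_distance(a, b))  # corresponds to a <=> b
print(-inner_product(a, b))   # corresponds to a <#> b
```

For all three operators, smaller values mean closer vectors, which is why a similarity score has to be derived (for example 1 minus the cosine distance) before sorting in descending order.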

In [40]:
#print(f"vectorized query: {vectorized_query[:5]}")
top_k = 3
conn = get_connection()
with conn.cursor() as cur:
    cur.execute(
        sql.SQL(
            """SELECT pk, chunk_type, chunk_id, 1 - (vector <=> %s::VECTOR) AS similarity
               FROM mmr_vector
               ORDER BY similarity DESC
               LIMIT %s"""
        ),
        [vectorized_query, top_k],
    )
    rows = cur.fetchall()
    similar_chunk_ids = []
    if rows:
        for row in rows:
            similar_chunk_ids.append(row[2])
    else:
        print("No results found.")
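
What the query above asks the database to do can be sketched in plain Python: score every stored vector against the query by cosine similarity and keep the top_k best. The names and the tiny two-dimensional vectors below are hypothetical; real embeddings have many more dimensions.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunk_ids(query_vec, stored, k):
    """stored is a list of (chunk_id, vector); return the k most similar chunk ids."""
    best = heapq.nlargest(
        k, stored, key=lambda item: cosine_similarity(query_vec, item[1])
    )
    return [chunk_id for chunk_id, _ in best]


# Made-up vectors for illustration
stored_vectors = [
    (1, [1.0, 0.0]),
    (2, [0.0, 1.0]),
    (3, [0.9, 0.1]),
    (4, [-1.0, 0.0]),
]
ids = top_k_chunk_ids([1.0, 0.0], stored_vectors, k=2)
print(ids)
```

Doing the ranking in SQL avoids pulling every vector into Python, which matters once the table holds more than a handful of rows.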


      

Step 9 - Get the text chunks of the closest matches

In [41]:
conn = get_connection()
with conn.cursor() as cur:
    print(f"similar_chunk_ids:{similar_chunk_ids}")
    similar_context = []
    for chunk_id in similar_chunk_ids:
        cur.execute(
            sql.SQL("""SELECT text FROM text_chunks where pk = %s"""),
            [chunk_id],
        )
        row = cur.fetchone()  # Fetch only one row for the current chunk_id
        if row is not None:  # Skip ids that no longer have a matching chunk
            similar_context.append(row[0])



similar_chunk_ids:[2, 47, 92]
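
The cell above runs one query per chunk id. A possible alternative is a single query using = ANY(%s), to which psycopg2 can bind a Python list. Because the database may return the rows in any order, they then need to be put back into similarity order. The sketch below shows that reordering step; `FETCH_CHUNKS_SQL` and `order_by_similarity` are hypothetical names, and the rows are made up.

```python
# Assumed single-query form; would be run as cur.execute(FETCH_CHUNKS_SQL, [similar_chunk_ids])
FETCH_CHUNKS_SQL = "SELECT pk, text FROM text_chunks WHERE pk = ANY(%s)"


def order_by_similarity(rows, ranked_ids):
    """Arrange (pk, text) rows to follow ranked_ids, skipping ids with no row."""
    text_by_pk = dict(rows)
    return [text_by_pk[pk] for pk in ranked_ids if pk in text_by_pk]


# Rows as the database might return them, in arbitrary order
rows_from_db = [(92, "chunk c"), (2, "chunk a"), (47, "chunk b")]
ordered = order_by_similarity(rows_from_db, [2, 47, 92, 500])
print(ordered)
```

Here id 500 has no matching row and is silently skipped, so the result holds only the three chunks that exist.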


Step 10 - Retrieval
Submit the retrieved text chunks to the LLM, along with the query, to generate a response.
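
The cell for this step interpolates the similar_context list directly into the prompt, so the model sees Python list syntax and any duplicate chunks. A minimal sketch of one alternative is shown here: remove exact duplicates and number the remaining chunks. `format_context` and `build_messages` are hypothetical helpers, and the prompt wording is only an example.

```python
def format_context(chunks):
    """Drop exact duplicates (keeping first-seen order) and number the rest."""
    unique_chunks = list(dict.fromkeys(chunks))
    return "\n".join(
        f"[{i}] {text}" for i, text in enumerate(unique_chunks, start=1)
    )


def build_messages(query, chunks):
    """Build the system and user messages for a chat completion request."""
    system_prompt = (
        "You are an assistant for question-answering tasks. "
        "Use only the retrieved context below to answer the question, "
        "in at most three sentences.\n\n"
        f"Context:\n{format_context(chunks)}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": query},
    ]


sample_chunk = (
    "Once Jupyter is running, point your web browser at "
    "http://localhost:8888 to start using Jupyter notebooks."
)
messages = build_messages(
    "Where should I point my web browser after Jupyter is running?",
    [sample_chunk, sample_chunk],
)
print(messages[0]["content"])
```

The resulting list can be passed as the messages argument of client.chat.completions.create().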

In [42]:
from openai import OpenAI
from pprint import pprint

# Show the similar content retrieved
for sc in similar_context:
    pprint(f"CONTEXT ITEM:{sc}")

# Format the prompt
prompt = f"""You are an assistant for question-answering tasks. Use only 
the following pieces of retrieved context to answer the 
question. Use 3 sentences maximum to keep your answer concise. Here's a query: 
{query} and here is the retrieved context: {similar_context}. Again,
base your answer only on the retrieved context."""

# Call the OpenAI Chat Completions API (client-based interface)
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": query},
    ],
)

# Extract and print the response
print("\n\n")
pprint(f"RESPONSE:{response.choices[0].message.content.strip()}")

('CONTEXT ITEM:Once Jupyter is running, point your web browser at '
 'http://localhost:8888 to start using Jupyter notebooks. If everything worked '
 'correctly, you should see a screen like this, showing all available Jupyter '
 'notebooks in the current directory:')
('CONTEXT ITEM:Once Jupyter is running, point your web browser at '
 'http://localhost:8888 to start using Jupyter notebooks. If everything worked '
 'correctly, you should see a screen like this, showing all available Jupyter '
 'notebooks in the current directory:')
('CONTEXT ITEM:Once Jupyter is running, point your web browser at '
 'http://localhost:8888 to start using Jupyter notebooks. If everything worked '
 'correctly, you should see a screen like this, showing all available Jupyter '
 'notebooks in the current directory:')



('RESPONSE:After Jupyter is running, you should point your web browser at '
 'http://localhost:8888 to start using Jupyter notebooks. If everything worked '
 'correctly, you will see a scree