# Loading Data into PGVector

The goal of this notebook is to show how to load embeddings from `unstructured` outputs into a Postgres database with the `pgvector` extension installed.
The [Postgres documentation](https://www.postgresql.org/docs/15/tutorial-install.html) has instructions on how to install the Postgres database.
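See [the `pgvector` repo](https://github.com/pgvector/pgvector) for information on how to install `pgvector`. Once installed, the extension has to be enabled in each database that uses it by running `CREATE EXTENSION vector;`.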

Postgres with `pgvector` is helpful because it combines the capabilities of a vector database with the structured information available in a traditional RDBMS. In this example, we'll show how to:

- Load `unstructured` outputs into `pgvector`.
- Conduct a similarity search conditioned on a metadata field.
- Conduct a similarity search with a decayed score that biases results toward more recent information.

## Set Up the Postgres Database

First, we'll get everything set up for the Postgres database. We'll use `sqlalchemy` as
an ORM for defining the table and performing queries.

In [1]:
from sqlalchemy import (
    create_engine,
    ARRAY,
    Column,
    Integer,
    String,
    Float,
    DateTime,
    func,
    text,
)
from pgvector.sqlalchemy import Vector
from sqlalchemy.orm import declarative_base, sessionmaker

In [2]:
# Dimension of the embedding vectors (1536 for OpenAI's text-embedding-ada-002).
ADA_TOKEN_COUNT = 1536

In [3]:
connection_string = "postgresql://localhost:5432/postgres"
engine = create_engine(connection_string)

Base = declarative_base()

In [4]:
class Element(Base):
    __tablename__ = "unstructured_elements"

    id = Column(Integer, primary_key=True)
    embedding = Column(Vector(ADA_TOKEN_COUNT))
    text = Column(String)
    category = Column(String)
    filename = Column(String)
    date = Column(DateTime)
    sent_to = Column(ARRAY(String))
    sent_from = Column(ARRAY(String))
    subject = Column(String)

In [5]:
Base.metadata.create_all(engine)

In [6]:
Session = sessionmaker(bind=engine)
session = Session()

## Preprocess Documents with Unstructured

Next, we'll preprocess data (in this case emails) using the `partition_email` function from `unstructured`. We'll also use the `OpenAIEmbeddings` class from `langchain` to create embeddings from the text. The embeddings will be used for similarity search after we've loaded the documents into the database.

In [7]:
import datetime
import os

from langchain.embeddings.openai import OpenAIEmbeddings
from unstructured.partition.email import partition_email

In [8]:
EXAMPLE_DOCS_DIRECTORY = "../../example-docs"

In [9]:
elements = []
for f in os.listdir(EXAMPLE_DOCS_DIRECTORY):
    if not f.endswith(".eml"):
        continue

    filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, f)
    elements.extend(partition_email(filename=filename))

In [10]:
embedding_function = OpenAIEmbeddings()

In [11]:
for element in elements:
    element.embedding = embedding_function.embed_query(element.text)

## Load the Documents into Postgres

Now that we've preprocessed the documents, we're ready to load the results into the database. We'll do this by creating objects with `sqlalchemy` using the schema we defined earlier and then running an insert command.

In [12]:
items_to_add = []
for element in elements:
    items_to_add.append(
        Element(
            text=element.text,
            category=element.category,
            embedding=element.embedding,
            filename=element.metadata.filename,
            date=element.metadata.get_date(),
            sent_to=element.metadata.sent_to,
            sent_from=element.metadata.sent_from,
            subject=element.metadata.subject,
        )
    )
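
Because the `embedding` column is declared with a fixed dimension, it can help to verify the vectors before inserting them. The following is a minimal sketch of such a check; `check_dimensions` is a hypothetical helper, and the short vectors in the example stand in for real 1536-dimensional embeddings.

```python
def check_dimensions(embeddings, expected_dim):
    """Return the indices of embeddings that are missing or have the wrong length."""
    bad = []
    for index, embedding in enumerate(embeddings):
        if embedding is None or len(embedding) != expected_dim:
            bad.append(index)
    return bad


# Toy example with expected_dim=3; in the notebook this would be ADA_TOKEN_COUNT.
sample = [[0.1, 0.2, 0.3], [0.1, 0.2], None, [0.4, 0.5, 0.6]]
bad_indices = check_dimensions(sample, expected_dim=3)
print(bad_indices)
```

In the notebook, this could be called as `check_dimensions([e.embedding for e in elements], ADA_TOKEN_COUNT)` before adding the items to the session.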

In [13]:
session.add_all(items_to_add)
session.commit()

## Query the Database

Finally, we're ready to query the database. The results from similarity search can be used for retrieval augmented generation, as described in the [`langchain` docs](https://docs.langchain.com/docs/use-cases/qa-docs). First, we'll run a query conditioned on metadata: we'll search for similar items, but restrict the results to narrative text elements. You can also perform this operation using the [`pgvector` vectorstore](https://github.com/hwchase17/langchain/blob/master/langchain/vectorstores/pgvector.py) in `langchain`.

In [14]:
vector = embedding_function.embed_query("email")

In [16]:
query = (
    session.query(Element)
    .filter(Element.category == "NarrativeText")
    .order_by(Element.embedding.l2_distance(vector))
    .limit(5)
)

for element in query:
    print(element.id, element.text)

1 This is a test email to use for unit tests.
13 This is a test email to use for unit tests.
5 This is a test email to use for unit tests.
9 The unstructured logo is attached to this email.
19 It includes:
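
To make the ordering concrete, here is a minimal pure-Python sketch of what the query above computes: filter rows on `category`, then sort by Euclidean (L2) distance to the query vector. The three-dimensional vectors and the `rows` list are made-up stand-ins for the real 1536-dimensional embeddings, and `nearest` is a hypothetical helper, not part of `pgvector` or `sqlalchemy`.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(rows, query_vector, category, limit=5):
    """Keep rows in the given category, then return the `limit` closest ones."""
    candidates = [row for row in rows if row["category"] == category]
    candidates.sort(key=lambda row: l2_distance(row["embedding"], query_vector))
    return candidates[:limit]


# Toy data (hypothetical); real embeddings have 1536 dimensions.
rows = [
    {"id": 1, "category": "NarrativeText", "embedding": [1.0, 0.0, 0.0]},
    {"id": 2, "category": "Title", "embedding": [0.9, 0.1, 0.0]},
    {"id": 3, "category": "NarrativeText", "embedding": [0.0, 1.0, 0.0]},
    {"id": 4, "category": "NarrativeText", "embedding": [0.8, 0.2, 0.0]},
]

results = nearest(rows, query_vector=[1.0, 0.0, 0.0], category="NarrativeText", limit=2)
print([row["id"] for row in results])
```

In the notebook, the same filter-then-sort work is done inside Postgres, so only the top rows are returned to Python.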


Next, we'll run a similarity search, but add a decay function that biases the results toward the most recent documents. This can be helpful if you want to run retrieval augmented generation, but are concerned about passing outdated information into the LLM. In this case, the score is `exp(-decay_rate * age_in_days * distance)`, so it falls off exponentially as a document gets older or as its embedding gets farther from the query, and we sort by the score in descending order.

In [17]:
vector = embedding_function.embed_query("violets")
decay_rate = 0.10

In [18]:
query = (
    session.query(
        Element,
        Element.text,
        func.exp(
            text(f"-{decay_rate} * EXTRACT(DAY FROM (NOW() - date))")
            * Element.embedding.l2_distance(vector)
        ).label("decay_score"),
    )
    .order_by(text("decay_score DESC"))
    .limit(5)
)

for element in query:
    print(f"{element.decay_score} -  {element.text}")

0.0050977532596662945 -  Violets are blue
0.001773595479626926 -  Violets are blue
0.001773595479626926 -  Violets are blue
0.0011421532895244265 -  Roses are red
0.00029501066142995373 -  Roses are red
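
The score in the query above can be reproduced in plain Python to see how age and distance interact. This sketch assumes the same formula as the SQL expression, `exp(-decay_rate * age_in_days * distance)`; the `decay_score` helper and the sample values are hypothetical and only for illustration.

```python
import math


def decay_score(distance, age_in_days, decay_rate=0.10):
    """Mirror of the SQL expression: exp(-decay_rate * age_in_days * distance)."""
    return math.exp(-decay_rate * age_in_days * distance)


# Same distance, different ages: the newer document scores higher.
recent = decay_score(distance=0.5, age_in_days=10)
older = decay_score(distance=0.5, age_in_days=100)

# Same age, different distances: the closer document scores higher.
close = decay_score(distance=0.2, age_in_days=30)
far = decay_score(distance=0.6, age_in_days=30)

print(recent > older, close > far)
```

One consequence of this formula is that a document with an age of zero days gets a score of 1 regardless of its distance. If same-day documents are common, it may be worth combining age and distance differently, for example by adding an offset to the age.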
