In [1]:
%pip install --upgrade --quiet pgvector

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install --upgrade --quiet langchain-openai

Note: you may need to restart the kernel to use updated packages.


In [3]:
%pip install --upgrade --quiet psycopg2-binary
%pip install --upgrade --quiet tiktoken

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [4]:
import getpass

In [5]:
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores.pgvector import PGVector
from langchain_community.embeddings.ollama import OllamaEmbeddings

In [6]:
loader = TextLoader("../paul_graham_MIT.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OllamaEmbeddings()
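
To make the chunking step above more concrete, here is a minimal sketch of greedy, separator-based chunking in plain Python. It is a simplification for illustration, not `CharacterTextSplitter`'s actual implementation; the helper name `split_into_chunks` and the sample text are hypothetical.

```python
def split_into_chunks(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> list[str]:
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it is a single oversized piece
            # that is kept whole rather than cut mid-paragraph.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = split_into_chunks(sample, chunk_size=40)
print(chunks)
```

With `chunk_overlap=0`, as in the cell above, consecutive chunks share no text, which is why the sketch has no overlap handling.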

In [8]:
CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver="psycopg2",
    host="localhost",
    port=5432,
    database="postgres",
    user="postgres",
    password="password"
)
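
The connection string is a SQLAlchemy-style database URL. As a rough sketch of its shape (assuming the helper simply interpolates the parameters), the hypothetical function `build_connection_url` below assembles one by hand. The percent-encoding of the user and password is an addition of this sketch, useful when credentials contain characters such as `@` or `/`.

```python
from urllib.parse import quote_plus


def build_connection_url(driver, host, port, database, user, password):
    """Assemble a SQLAlchemy-style Postgres URL (hypothetical helper, for illustration)."""
    # Percent-encode credentials so characters like "@" or "/" cannot break the URL.
    return (
        f"postgresql+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


url = build_connection_url(
    driver="psycopg2",
    host="localhost",
    port=5432,
    database="postgres",
    user="postgres",
    password="password",
)
print(url)  # postgresql+psycopg2://postgres:password@localhost:5432/postgres
```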

## Similarity Search with Cosine Distance (Default)

### Install Postgres with the pgvector extension
https://github.com/pgvector/pgvector?tab=readme-ov-file#docker



In [12]:
COLLECTION_NAME = "paul_graham_mit_speech"

db = PGVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING
)

In [13]:
query = "What did Paul Graham says about startups?"
docs_with_store = db.similarity_search_with_score(query)

In [14]:
for doc, score in docs_with_store:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.42734237077305814
They at least were in Boston. What if they'd been in Nebraska, like Evan Williams was at their age? Someone wrote recently that the drawback of Y Combinator was that you had to move to participate. It couldn't be any other way. The kind of conversations we have with founders, we have to have in person. We fund a dozen startups at a time, and we can't be in a dozen places at once. But even if we could somehow magically save people from moving, we wouldn't. We wouldn't be doing founders a favor by letting them stay in Nebraska. Places that aren't startup hubs are toxic to startups. You can tell that from indirect evidence. You can tell how hard it must be to start a startup in Houston or Chicago or Miami from the microscopically small number, per capita, that succeed there. I don't know exactly what's suppressing all the startups in these towns—probably a hundred subtle little thi
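
The score printed with each result is a distance, so lower values mean a closer match. As a rough illustration of the two metrics most often used for this, the sketch below computes Euclidean and cosine distance in plain Python. The vectors are made up for the example and are far smaller than real embeddings.

```python
import math


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]  # points the same way, but is twice as long
orthogonal = [0.0, 1.0]

print(euclidean_distance(query_vec, same_direction))  # 1.0
print(cosine_distance(query_vec, same_direction))     # 0.0
print(cosine_distance(query_vec, orthogonal))         # 1.0
```

Cosine distance ignores vector length and compares direction only, whereas Euclidean distance is sensitive to both.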

## Maximal Marginal Relevance Search (MMR)

Maximal marginal relevance (MMR) optimizes for both similarity to the query and diversity among the selected documents.
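
The trade-off can be sketched as a greedy loop: at each step, pick the candidate whose relevance to the query, minus its similarity to what has already been selected, is highest. The code below is a minimal plain-Python illustration, not the library's implementation; `mmr_select`, the toy vectors, and the weighting parameter `lambda_mult` (1.0 means pure relevance, lower values favor diversity) are assumptions made for this sketch.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.1],   # most similar to the query
    [1.0, 0.11],  # near-duplicate of document 0
    [0.8, -0.6],  # less similar, but points in a different direction
]

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more distinct document; with `lambda_mult=1.0` the selection reduces to plain similarity ranking.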

In [16]:
docs_with_score = db.max_marginal_relevance_search_with_score(query)

In [17]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.42734237077305814
They at least were in Boston. What if they'd been in Nebraska, like Evan Williams was at their age? Someone wrote recently that the drawback of Y Combinator was that you had to move to participate. It couldn't be any other way. The kind of conversations we have with founders, we have to have in person. We fund a dozen startups at a time, and we can't be in a dozen places at once. But even if we could somehow magically save people from moving, we wouldn't. We wouldn't be doing founders a favor by letting them stay in Nebraska. Places that aren't startup hubs are toxic to startups. You can tell that from indirect evidence. You can tell how hard it must be to start a startup in Houston or Chicago or Miami from the microscopically small number, per capita, that succeed there. I don't know exactly what's suppressing all the startups in these towns—probably a hundred subtle little thi

## Working With VectorStore
Often, we want to work with an existing vector store.

In [18]:
store = PGVector(
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings
)

In [20]:
# Add Documents
new_doc_content = "This could be an example of a chunked content"
store.add_documents([Document(page_content=new_doc_content)])

['e276b864-c6f0-11ee-a7df-00d8614f24cb']
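
`add_documents` returns one ID per added document. The value above has the form of a version 1 (time-based) UUID. The sketch below generates IDs of the same form with the standard library; whether the vector store creates its IDs in exactly this way is an assumption.

```python
import uuid

texts = ["This could be an example of a chunked content", "foo"]

# One time-based (version 1) UUID string per text.
ids = [str(uuid.uuid1()) for _ in texts]
print(ids)

# The ID returned above parses as a version 1 UUID.
returned_id = uuid.UUID("e276b864-c6f0-11ee-a7df-00d8614f24cb")
print(returned_id.version)  # 1
```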

In [21]:
docs_with_score = store.similarity_search_with_score(new_doc_content)

In [24]:
docs_with_score[0]

(Document(page_content='This could be an example of a chunked content'),
 0.1877201492750784)

In [25]:
new_doc_content = "foo"
store.add_documents([Document(page_content=new_doc_content)])

['1bfdbc54-c6f1-11ee-a7df-00d8614f24cb']

In [26]:
docs_with_score = store.similarity_search_with_score(new_doc_content)

In [28]:

docs_with_score

[(Document(page_content='foo'), 0.11790256479371342),
 (Document(page_content="What goes wrong with young founders is that they build stuff that looks like class projects. It was only recently that we figured this out ourselves. We noticed a lot of similarities between the startups that seemed to be falling behind, but we couldn't figure out how to put it into words. Then finally we realized what it was: they were building class projects.\n\nBut what does that really mean? What's wrong with class projects? What's the difference between a class project and a real startup? If we could answer that question it would be useful not just to would-be startup founders but to students in general, because we'd be a long way toward explaining the mystery of the so-called real world.\n\nThere seem to be two big things missing in class projects: (1) an iterative definition of a real problem and (2) intensity.", metadata={'source': '../paul_graham_MIT.txt'}),
  0.6977502053868228),
 (Document(page_co

In [None]:
%pip install --upgrade --quiet sentence_transformers > /dev/null
from langchain_community.embeddings import HuggingFaceEmbeddings

In [None]:
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

In [34]:
store = PGVector(
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings
)

In [35]:
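# Note: this still queries `db`, which was built with the Ollama embeddings,
# so the results below match the earlier run rather than the new `store`.
docs_with_score = db.similarity_search_with_score(new_doc_content)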

In [37]:
docs_with_score

[(Document(page_content='foo'), 0.11790256479371342),
 (Document(page_content="What goes wrong with young founders is that they build stuff that looks like class projects. It was only recently that we figured this out ourselves. We noticed a lot of similarities between the startups that seemed to be falling behind, but we couldn't figure out how to put it into words. Then finally we realized what it was: they were building class projects.\n\nBut what does that really mean? What's wrong with class projects? What's the difference between a class project and a real startup? If we could answer that question it would be useful not just to would-be startup founders but to students in general, because we'd be a long way toward explaining the mystery of the so-called real world.\n\nThere seem to be two big things missing in class projects: (1) an iterative definition of a real problem and (2) intensity.", metadata={'source': '../paul_graham_MIT.txt'}),
  0.6977502053868228),
 (Document(page_co

In [39]:
docs_with_score = db.max_marginal_relevance_search_with_score("foo")
docs_with_score

[(Document(page_content='foo'), 0.11790256479371342),
 (Document(page_content='The first is probably unavoidable. Class projects will inevitably solve fake problems. For one thing, real problems are rare and valuable. If a professor wanted to have students solve real problems, he\'d face the same paradox as someone trying to give an example of whatever "paradigm" might succeed the Standard Model of physics. There may well be something that does, but if you could think of an example you\'d be entitled to the Nobel Prize. Similarly, good new problems are not to be had for the asking.', metadata={'source': '../paul_graham_MIT.txt'}),
  0.7172553867702852),
 (Document(page_content="So I'd be skeptical of classes and books. The way to learn about startups is by watching them in action, preferably by working at one. How do you do that as an undergrad? Probably by sneaking in through the back door. Just hang around a lot and gradually start doing things for them. Most startups are (or should 

In [41]:
query = "What did Paul Graham says about startups?"
docs_with_store = db.similarity_search_with_score(query)
docs_with_store

[(Document(page_content="They at least were in Boston. What if they'd been in Nebraska, like Evan Williams was at their age? Someone wrote recently that the drawback of Y Combinator was that you had to move to participate. It couldn't be any other way. The kind of conversations we have with founders, we have to have in person. We fund a dozen startups at a time, and we can't be in a dozen places at once. But even if we could somehow magically save people from moving, we wouldn't. We wouldn't be doing founders a favor by letting them stay in Nebraska. Places that aren't startup hubs are toxic to startups. You can tell that from indirect evidence. You can tell how hard it must be to start a startup in Houston or Chicago or Miami from the microscopically small number, per capita, that succeed there. I don't know exactly what's suppressing all the startups in these towns—probably a hundred subtle little things—but something must be. [2]", metadata={'source': '../paul_graham_MIT.txt'}),
  0