# Vector Databases in LangChain
* Notebook by Adam Lang
* Date: 7/8/2024

# Overview
* In this notebook we will go over how to use Vector Databases with LangChain.
* This includes "CRUD" actions (create, read, update, delete).

## Install dependencies

In [1]:
!pip install langchain==0.2.0
!pip install langchain-openai==0.1.7
!pip install langchain-community==0.2.0

Collecting langchain==0.2.0
  Downloading langchain-0.2.0-py3-none-any.whl (973 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain==0.2.0)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB)
Collecting langchain-core<0.3.0,>=0.2.0 (from langchain==0.2.0)
  Downloading langchain_core-0.2.11-py3-none-any.whl (337 kB)
Collecting langchain-text-splitters<0.3.0,>=0.2.0 (from langchain==0.2.0)
  Downloading langchain_text_splitters-0.2.2-py3-none-any.whl (25 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain==0.2.0)
  Downloading langsmith-0.1.84-py3-none-any.whl (127 kB)
Collecting marshmallow<4.0.0,>=

## Vector DB install - Chroma DB and LangChain Wrapper
* We will be using **ChromaDB** which is an open source Vector database.
* Key features of ChromaDB include:
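   * Storage backends have changed over time: early releases used DuckDB for standalone use or ClickHouse for scalability, while releases from 0.4 onward (including the 0.5.x line installed here) persist data with SQLite.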
   * SDKs for Python and JavaScript/TypeScript.
   * Focuses on simplicity, speed, and enabling analysis.
* Important technical considerations with ChromaDB:
1. You have to create a collection, which is similar to a table in a relational database.
   * **Note: By default, Chroma converts all text into embeddings using `all-MiniLM-L6-v2`, but you can configure the collection to use another embedding model.**
   * `all-MiniLM-L6-v2` is one of the most popular open-source sentence transformer models and is also available on HuggingFace.
2. Add text documents to the newly created collection with metadata and a unique ID; this allows indexing.
   * When a collection receives text, it automatically converts it into an embedding.
3. You can query collections by text or embeddings to receive semantically similar documents in return.
   * Results can also be filtered using metadata.
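
The collection workflow described above (create a collection, add texts with IDs and metadata, query by similarity, filter by metadata) can be sketched in plain Python. This is a conceptual toy, not Chroma's implementation or API: `ToyCollection`, `embed`, and `cosine` are hypothetical names, and the word-count "embedding" is only a stand-in for a real model such as `all-MiniLM-L6-v2`.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyCollection:
    """Hypothetical in-memory collection keyed by document ID."""

    def __init__(self):
        self.rows = {}  # id -> (text, metadata, embedding)

    def add(self, ids, texts, metadatas):
        # Text is converted to an embedding at insert time.
        for doc_id, text, meta in zip(ids, texts, metadatas):
            self.rows[doc_id] = (text, meta, embed(text))

    def query(self, text, k=1, where=None):
        """Return the k most similar (id, score) pairs, optionally filtered by metadata."""
        q = embed(text)
        candidates = [
            (doc_id, cosine(q, emb))
            for doc_id, (_, meta, emb) in self.rows.items()
            if where is None or all(meta.get(key) == val for key, val in where.items())
        ]
        return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]


collection = ToyCollection()
collection.add(
    ids=["a", "b"],
    texts=["ski resorts in southern vermont", "beaches along lake champlain"],
    metadatas=[{"region": "south"}, {"region": "north"}],
)
top = collection.query("vermont ski resorts", k=1)
filtered = collection.query("vermont ski resorts", k=1, where={"region": "north"})
```

Here `top` picks document `a` on similarity alone, while the `where` filter restricts the search to documents whose metadata matches, so `filtered` can only return `b`.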

In [2]:
## langchain wrapper for chroma
!pip install langchain-chroma

Collecting langchain-chroma
  Downloading langchain_chroma-0.1.2-py3-none-any.whl (9.3 kB)
Collecting chromadb<0.6.0,>=0.4.0 (from langchain-chroma)
  Downloading chromadb-0.5.3-py3-none-any.whl (559 kB)
Collecting fastapi<1,>=0.95.2 (from langchain-chroma)
  Downloading fastapi-0.111.0-py3-none-any.whl (91 kB)
Collecting chroma-hnswlib==0.7.3 (from chromadb<0.6.0,>=0.4.0->langchain-chroma)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb<0.6.0,>=0.4.0->langchain-chroma)
  Downloading uvicorn-0.30.1-py3-none-any.whl

## OpenAI API key

In [3]:
from getpass import getpass

OPENAI_KEY = getpass('Please enter your OPENAI key: ')

Please enter your OPENAI key: ··········


### Environment Variable Setup

In [4]:
import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY
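
As a possible variation on the two cells above, the prompt can be skipped when the key is already present in the environment. This is a minimal sketch: `ensure_api_key` is a hypothetical helper, and the `env` and `prompt` parameters exist only so the function can be exercised without typing a real key.

```python
import os
from getpass import getpass


def ensure_api_key(name="OPENAI_API_KEY", env=os.environ, prompt=getpass):
    """Return the key stored under `name`, prompting only when it is missing."""
    if not env.get(name):
        env[name] = prompt(f"Please enter your {name}: ")
    return env[name]


# Demonstration with a plain dict and a fake prompt instead of real input.
demo_env = {}
ensure_api_key(env=demo_env, prompt=lambda _message: "sk-demo")
```

In the notebook itself, calling `ensure_api_key()` with no arguments would prompt once and leave `os.environ['OPENAI_API_KEY']` set for later cells.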

## Create 5 experimental documents
* We will use text from the Visit New England website: https://www.visit-vermont.com/northern/

In [10]:
documents = [
    'Central Vermont is home to the Green Mountain National Forest, which welcomes visitors for leaf-peeping during fall foliage ; snow sports; photography, hiking, camping, and more.',
    'Southern Vermont has it all: ski resorts, shopping, beautiful village greens, cozy inns, fall foliage, and a scattering of historic covered bridges. Ski areas -- Stratton, Bromley, Magic Mountain, and Mount Snow -- offer winter activities and many have warm-weather fun for kids and families. ',
    'Northern Vermont presents contrasts: in the northwest corner is Lake Champlain and Burlington, a cosmopolitan university city of dynamic cultural attractions. Opposite is the Northeast Kingdom, a forested area of proud small towns, outdoor sports and natural grandeur. Jay Peak, has a fabulous ski area an indoor water park. Stowe is a 200-year-old village near Mount Mansfield, Vermonts highest peak. Stowe has an elegant character reminiscent of European mountain resorts, with celebrations all year and family-friendly deals. ',
    'Vermont is a place, but also a state of mind that revels in outdoor beauty, good food, and the preservation of unique small towns and cities. Visitors love Vermont for its colorful fall foliage; ski resorts including Killington, Smugglers Notch, Ascutney, Stowe, and Mount Snow; mountain vistas; and attractions for families and kids. Although Vermont is an inland state, there are wonderful beaches along Lake Champlain, along with many lakes, streams, and swimming holes.',
    'The term Northeast Kingdom was a distinction first mentioned by former Governor George Aiken in the late 1940s to refer to the natural majestic beauty of the area, which follows the northern reaches of the Connecticut River alongside New Hampshire from St. Johnsbury, VT to the Canadian border, then east to the eastern edge of the Green Mountains at Jay Peak. Sometimes abbreviated to the NEK (en-ee-kay), the Northeast Kingdom is made up of over 2,000 square miles of the most untouched and beautifully scenic land in the United States. There are lovely towns to explore in each part of the NEK. In the Eastern Trek: St. Johnsbury, Lyndon, Burke, and Island Pond, which is also known as the snowmobile capital of Vermont. ',
]

### OpenAI Embedding Models
* We can simply access the OpenAI embedding models here using LangChain.
* Some of the OpenAI models include:
1. `text-embedding-3-small`
2. `text-embedding-3-large`

In [11]:
# import embeddings via langchain
from langchain_openai import OpenAIEmbeddings

# instantiate embeddings
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

## ChromaDB Connection
* Here we connect to the Vector DB.
* The first cell below contains a (commented-out) command that deletes the vector DB if it already exists.
* The second cell creates the vector DB and persists it on disk.
  * We initialize a connection to a Chroma vector DB client.
  * To save to disk, we pass `persist_directory`, the directory where the data should be stored, when initializing the client.

In [12]:
## this will delete the Vector DB if it already exists
# !rm -rf ./chroma_db
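
The shell command above only works where `rm` is available. A portable alternative, sketched here with the standard library, is `shutil.rmtree`. The helper name `reset_persist_dir` is hypothetical, and the sketch assumes it is called before any Chroma client has opened the directory.

```python
import shutil
from pathlib import Path


def reset_persist_dir(path="./chroma_db"):
    """Delete the persistence directory if present; return True if it existed."""
    target = Path(path)
    existed = target.is_dir()
    # ignore_errors=True makes this a no-op when the directory is absent.
    shutil.rmtree(target, ignore_errors=True)
    return existed
```

Like the `rm -rf` line, this is destructive, so it is probably best left commented out until a fresh start is actually wanted.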

In [13]:
## Create vector DB if not already created
from langchain_chroma import Chroma

# create vector DB of documents and embeddings
chroma_db = Chroma(collection_name='search_docs',
                   embedding_function=openai_embed_model,
                   persist_directory="./chroma_db")

We will use sample documents.

In [14]:
documents

['Central Vermont is home to the Green Mountain National Forest, which welcomes visitors for leaf-peeping during fall foliage ; snow sports; photography, hiking, camping, and more.',
 'Southern Vermont has it all: ski resorts, shopping, beautiful village greens, cozy inns, fall foliage, and a scattering of historic covered bridges. Ski areas -- Stratton, Bromley, Magic Mountain, and Mount Snow -- offer winter activities and many have warm-weather fun for kids and families. ',
 'Northern Vermont presents contrasts: in the northwest corner is Lake Champlain and Burlington, a cosmopolitan university city of dynamic cultural attractions. Opposite is the Northeast Kingdom, a forested area of proud small towns, outdoor sports and natural grandeur. Jay Peak, has a fabulous ski area an indoor water park. Stowe is a 200-year-old village near Mount Mansfield, Vermonts highest peak. Stowe has an elegant character reminiscent of European mountain resorts, with celebrations all year and family-frie

## Create Document IDs
* We do this so we can uniquely identify each document.
* If we don't supply document IDs, the LangChain Chroma wrapper generates them for us (random UUIDs).

In [15]:
# Create custom IDs using list comprehension
ids = ['doc_'+str(i) for i in range(len(documents))]
ids

['doc_0', 'doc_1', 'doc_2', 'doc_3', 'doc_4']
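
Positional IDs such as `doc_0` depend on the order of the list. One alternative, shown here only as a sketch, is to derive each ID from the text itself so that the same text always maps to the same ID, which makes duplicates easy to spot. `content_ids` is a hypothetical helper, and the 12-character digest prefix is an arbitrary choice.

```python
import hashlib


def content_ids(texts, prefix="doc_"):
    """Derive a stable ID for each text from a SHA-256 digest of its content."""
    return [prefix + hashlib.sha256(t.encode("utf-8")).hexdigest()[:12] for t in texts]


sample = ["Central Vermont", "Southern Vermont", "Central Vermont"]
sample_ids = content_ids(sample)
# The first and third entries share an ID because their text is identical.
```

The trade-off is readability: `doc_3` is easier to type than a hash, which is likely why sequential IDs suit a small walkthrough like this one.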

Check the Vector DB to see if it's empty.

In [16]:
chroma_db.get()

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

Looks like it is empty, let's add documents to the Vector DB.

## Add documents to Vector DB
* 1. Create embeddings.
   * We take the document text we created above, pass it through the OpenAI embedding model we instantiated, and add it to the Chroma Vector DB.
   * If our documents were in the LangChain `Document` format we could use the function `add_documents` instead, but they are plain strings, so we use `add_texts`.
* 2. Add embeddings to Vector DB with indexes.

In [17]:
# add documents and custom ids to vector DB
chroma_db.add_texts(texts=documents, ids=ids)

['doc_0', 'doc_1', 'doc_2', 'doc_3', 'doc_4']

Check Vector DB to see documents have been added and indexed.

In [18]:
chroma_db.get()

{'ids': ['doc_0', 'doc_1', 'doc_2', 'doc_3', 'doc_4'],
 'embeddings': None,
 'metadatas': [None, None, None, None, None],
 'documents': ['Central Vermont is home to the Green Mountain National Forest, which welcomes visitors for leaf-peeping during fall foliage ; snow sports; photography, hiking, camping, and more.',
  'Southern Vermont has it all: ski resorts, shopping, beautiful village greens, cozy inns, fall foliage, and a scattering of historic covered bridges. Ski areas -- Stratton, Bromley, Magic Mountain, and Mount Snow -- offer winter activities and many have warm-weather fun for kids and families. ',
  'Northern Vermont presents contrasts: in the northwest corner is Lake Champlain and Burlington, a cosmopolitan university city of dynamic cultural attractions. Opposite is the Northeast Kingdom, a forested area of proud small towns, outdoor sports and natural grandeur. Jay Peak, has a fabulous ski area an indoor water park. Stowe is a 200-year-old village near Mount Mansfield, 

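Run search queries on the Vector DB. `similarity_search_with_score` returns each matching document together with a distance score; a lower score means a closer match.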

In [19]:
# query 1
query = "Tell me about the Northeast Kingdom."
docs = chroma_db.similarity_search_with_score(query=query,k=1)
docs

[(Document(page_content='The term Northeast Kingdom was a distinction first mentioned by former Governor George Aiken in the late 1940s to refer to the natural majestic beauty of the area, which follows the northern reaches of the Connecticut River alongside New Hampshire from St. Johnsbury, VT to the Canadian border, then east to the eastern edge of the Green Mountains at Jay Peak. Sometimes abbreviated to the NEK (en-ee-kay), the Northeast Kingdom is made up of over 2,000 square miles of the most untouched and beautifully scenic land in the United States. There are lovely towns to explore in each part of the NEK. In the Eastern Trek: St. Johnsbury, Lyndon, Burke, and Island Pond, which is also known as the snowmobile capital of Vermont. '),
  0.49111950946367544)]

In [20]:
# query 2
query = "Where is Stowe Mountain located?"
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

[(Document(page_content='Northern Vermont presents contrasts: in the northwest corner is Lake Champlain and Burlington, a cosmopolitan university city of dynamic cultural attractions. Opposite is the Northeast Kingdom, a forested area of proud small towns, outdoor sports and natural grandeur. Jay Peak, has a fabulous ski area an indoor water park. Stowe is a 200-year-old village near Mount Mansfield, Vermonts highest peak. Stowe has an elegant character reminiscent of European mountain resorts, with celebrations all year and family-friendly deals. '),
  0.8995435859606269)]

In [23]:
# query 3
query = "Where is leaf peeping popular?"
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

[(Document(page_content='Central Vermont is home to the Green Mountain National Forest, which welcomes visitors for leaf-peeping during fall foliage ; snow sports; photography, hiking, camping, and more.'),
  1.0454452153900495)]
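
The scores returned above are distances, so smaller values indicate closer matches. As a rough aid to reading them, the sketch below converts the three scores to cosine similarities. It rests on two assumptions that are not stated in this notebook: that the collection uses Chroma's default `l2` space, which reports squared Euclidean distance, and that the OpenAI embeddings are unit length. Under those assumptions, squared distance d² and cosine similarity are related by d² = 2 - 2·cos. `cosine_from_squared_l2` is a hypothetical helper.

```python
def cosine_from_squared_l2(distance):
    """Convert a squared L2 distance between unit-length vectors to cosine similarity."""
    return 1.0 - distance / 2.0


# Scores copied from the three query outputs above.
scores = {
    "Northeast Kingdom": 0.49111950946367544,
    "Stowe": 0.8995435859606269,
    "leaf peeping": 1.0454452153900495,
}
similarities = {name: round(cosine_from_squared_l2(d), 3) for name, d in scores.items()}
```

Read this way, the Northeast Kingdom query is the strongest match of the three and the leaf-peeping query the weakest, even though each returned the document one would expect.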

## Add more documents to Vector DB
* You can add new documents to a Vector DB at any time.

In [24]:
new_documents = [
    'Greater Portland, on Casco Bay, is Maine’s largest city and a delightful year-round tourist destination for vacationers who love urban settings. ',
    'Mount Monadnock is believed to be the most frequently climbed mountain in the world. Henry David Thoreau, Ralph Waldo Emerson, Mark Twain, and others hiked this mountain and wrote about it as a symbol of spiritual and environmental awareness. ',
    'The gateway city to the Katahdin Moosehead is Bangor, the Lumber Capital of the World, where visitors are greeted by a large statue of Paul Bunyan. The Maine Forest and Logging Museum is an authentic reconstruction of a logging and milling community.',
    'Families from all over the world come to the majestic White Mountains of New Hampshire to be thrilled by the ride up on the historic Mount Washington Cog Railway. The dramatic ascent to the summit of the Northeast’s highest peak features amazing views of lush forests, deep chasms and the steepest railroad trestle in the world in Jacob’s Ladder.',
    'Killington’s nickname in parts of the skiing world is The Beast of the East. It is the biggest ski resort in New England, and it encompasses seven peaks, 200 trails, and five terrain parks. Vertical drop is 3,050 feet. Natural snow averages 250 inches a year, and artificial snowmaking covers 600 acres of skiable area. Naturally, with such vast spaces and resources, Killington offers a big variety of ski trails, from easy learning areas to steep gladed trails and everything in between.',
]

### Create new ids for new docs

In [25]:
new_ids = ['doc_'+str(i+len(ids)) for i in range(len(new_documents))]
new_ids

['doc_5', 'doc_6', 'doc_7', 'doc_8', 'doc_9']
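
Offsetting by `len(ids)` works as long as `ids` still lists every stored document. If documents have been deleted, or the `ids` variable has been reassigned (as it is later in this notebook), the offset can collide with an existing ID. A more defensive sketch derives the next number from the IDs actually in the store. `next_ids` is a hypothetical helper that assumes IDs follow the `doc_<number>` pattern used here.

```python
def next_ids(existing_ids, count, prefix="doc_"):
    """Return `count` new IDs numbered after the highest existing `prefix<N>` ID."""
    numbers = [
        int(doc_id[len(prefix):])
        for doc_id in existing_ids
        if doc_id.startswith(prefix) and doc_id[len(prefix):].isdigit()
    ]
    start = max(numbers) + 1 if numbers else 0
    return [f"{prefix}{n}" for n in range(start, start + count)]


example = next_ids(["doc_0", "doc_1", "doc_2", "doc_3", "doc_4"], 5)
```

In the notebook this could be called as `next_ids(chroma_db.get()['ids'], len(new_documents))`, since `get()` returns the stored IDs under the `'ids'` key.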

### Add new docs with ids to Vector DB

In [26]:
chroma_db.add_texts(texts=new_documents, ids=new_ids)

['doc_5', 'doc_6', 'doc_7', 'doc_8', 'doc_9']

In [28]:
## get the docs
chroma_db.get()

{'ids': ['doc_0',
  'doc_1',
  'doc_2',
  'doc_3',
  'doc_4',
  'doc_5',
  'doc_6',
  'doc_7',
  'doc_8',
  'doc_9'],
 'embeddings': None,
 'metadatas': [None, None, None, None, None, None, None, None, None, None],
 'documents': ['Central Vermont is home to the Green Mountain National Forest, which welcomes visitors for leaf-peeping during fall foliage ; snow sports; photography, hiking, camping, and more.',
  'Southern Vermont has it all: ski resorts, shopping, beautiful village greens, cozy inns, fall foliage, and a scattering of historic covered bridges. Ski areas -- Stratton, Bromley, Magic Mountain, and Mount Snow -- offer winter activities and many have warm-weather fun for kids and families. ',
  'Northern Vermont presents contrasts: in the northwest corner is Lake Champlain and Burlington, a cosmopolitan university city of dynamic cultural attractions. Opposite is the Northeast Kingdom, a forested area of proud small towns, outdoor sports and natural grandeur. Jay Peak, has a f

In [29]:
## new query
query = "What is Maine's largest city?"
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

[(Document(page_content='Greater Portland, on Casco Bay, is Maine’s largest city and a delightful year-round tourist destination for vacationers who love urban settings. '),
  0.6029868341188629)]

## Updating documents in Vector DB
* When building applications, you will want to be able to update data in the vector DB.
* Because every document has an ID, we can update it in place: pass the IDs and the replacement `Document` objects to the function `update_documents`, as shown below.

In [31]:
## lets get a document we want to update
chroma_db.get(['doc_8'])

{'ids': ['doc_8'],
 'embeddings': None,
 'metadatas': [None],
 'documents': ['Families from all over the world come to the majestic White Mountains of New Hampshire to be thrilled by the ride up on the historic Mount Washington Cog Railway. The dramatic ascent to the summit of the Northeast’s highest peak features amazing views of lush forests, deep chasms and the steepest railroad trestle in the world in Jacob’s Ladder.'],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [49]:
from langchain_core.documents import Document

# id of docs to update
ids = ['doc_8']
# Updating/adding text to document
texts = ['Families from all over the world come to the majestic White Mountains of New Hampshire to be thrilled by the ride up on the historic Mount Washington Cog Railway. The dramatic ascent to the summit of the Northeast’s highest peak features amazing views of lush forests, deep chasms and the steepest railroad trestle in the world in Jacob’s Ladder. The Cog is the first mountain-climbing cog railway in the world. With an average grade of 25% (some sections approach nearly 38%), it’s also the second steepest! ']

# Create a LangChain document object to update this document
# `doc_id` is used instead of `id` to avoid shadowing Python's built-in id()
documents = [Document(page_content=text, metadata={'doc': doc_id})
              for doc_id, text in zip(ids, texts)]

#print it
documents

[Document(metadata={'doc': 'doc_8'}, page_content='Families from all over the world come to the majestic White Mountains of New Hampshire to be thrilled by the ride up on the historic Mount Washington Cog Railway. The dramatic ascent to the summit of the Northeast’s highest peak features amazing views of lush forests, deep chasms and the steepest railroad trestle in the world in Jacob’s Ladder. The Cog is the first mountain-climbing cog railway in the world. With an average grade of 25% (some sections approach nearly 38%), it’s also the second steepest! ')]
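
One thing to watch in the cell above is that `zip` silently stops at the shorter list, so a missing text or ID would go unnoticed. The sketch below adds a length check before pairing. `pair_updates` is a hypothetical helper, and plain dictionaries stand in for LangChain `Document` objects so the example runs without any third-party package.

```python
def pair_updates(ids, texts):
    """Pair each ID with its new text, refusing mismatched lengths."""
    if len(ids) != len(texts):
        raise ValueError(f"{len(ids)} ids but {len(texts)} texts")
    return [
        {"page_content": text, "metadata": {"doc": doc_id}}
        for doc_id, text in zip(ids, texts)
    ]


pairs = pair_updates(["doc_8"], ["Updated text about the Cog Railway."])
```

Each dictionary carries the same two fields the notebook passes to `Document`, so the pairs could be turned into documents with `Document(**pair)` before calling `update_documents`.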

In [50]:
# update
chroma_db.update_documents(ids=ids, documents=documents)

In [51]:
# get the updated id
chroma_db.get(['doc_8'])

{'ids': ['doc_8'],
 'embeddings': None,
 'metadatas': [{'doc': 'doc_8'}],
 'documents': ['Families from all over the world come to the majestic White Mountains of New Hampshire to be thrilled by the ride up on the historic Mount Washington Cog Railway. The dramatic ascent to the summit of the Northeast’s highest peak features amazing views of lush forests, deep chasms and the steepest railroad trestle in the world in Jacob’s Ladder. The Cog is the first mountain-climbing cog railway in the world. With an average grade of 25% (some sections approach nearly 38%), it’s also the second steepest! '],
 'uris': None,
 'data': None,
 'included': ['metadatas', 'documents']}

In [54]:
# query
query = "How steep is the cog railway?"
docs = chroma_db.similarity_search_with_score(query=query, k=1)
docs

[(Document(metadata={'doc': 'doc_8'}, page_content='Families from all over the world come to the majestic White Mountains of New Hampshire to be thrilled by the ride up on the historic Mount Washington Cog Railway. The dramatic ascent to the summit of the Northeast’s highest peak features amazing views of lush forests, deep chasms and the steepest railroad trestle in the world in Jacob’s Ladder. The Cog is the first mountain-climbing cog railway in the world. With an average grade of 25% (some sections approach nearly 38%), it’s also the second steepest! '),
  0.8441089728223838)]

## Deleting documents in Vector DBs
* To delete a document in Chroma you need the document ID.
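
If the ID is not known in advance, it can be looked up from the dictionary that `chroma_db.get()` returns, whose `'ids'` and `'documents'` lists are parallel (as the outputs in this notebook show). The sketch below searches that structure for a phrase. `find_ids_containing` is a hypothetical helper, and `sample_result` is a shortened stand-in for a real `get()` result.

```python
def find_ids_containing(result, needle):
    """Return IDs whose document text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [
        doc_id
        for doc_id, text in zip(result["ids"], result["documents"])
        if needle in text.lower()
    ]


# Shortened stand-in for the dictionary returned by chroma_db.get().
sample_result = {
    "ids": ["doc_5", "doc_9"],
    "documents": ["Greater Portland is on Casco Bay.", "Killington is a ski resort."],
}
to_delete = find_ids_containing(sample_result, "killington")
```

The resulting list has the same shape as the argument passed to `chroma_db.delete` in the next cell.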

In [55]:
## pick an id to delete
chroma_db.delete(['doc_9'])

In [56]:
## get result
chroma_db.get()

{'ids': ['doc_0',
  'doc_1',
  'doc_2',
  'doc_3',
  'doc_4',
  'doc_5',
  'doc_6',
  'doc_7',
  'doc_8'],
 'embeddings': None,
 'metadatas': [None,
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  {'doc': 'doc_8'}],
 'documents': ['Central Vermont is home to the Green Mountain National Forest, which welcomes visitors for leaf-peeping during fall foliage ; snow sports; photography, hiking, camping, and more.',
  'Southern Vermont has it all: ski resorts, shopping, beautiful village greens, cozy inns, fall foliage, and a scattering of historic covered bridges. Ski areas -- Stratton, Bromley, Magic Mountain, and Mount Snow -- offer winter activities and many have warm-weather fun for kids and families. ',
  'Northern Vermont presents contrasts: in the northwest corner is Lake Champlain and Burlington, a cosmopolitan university city of dynamic cultural attractions. Opposite is the Northeast Kingdom, a forested area of proud small towns, outdoor sports and natural grandeur. Jay Pe

## Load Vector DB from disk
* Once the DB is saved to disk, you can load it at any time, connect, and run queries.

In [57]:
# load from disk
db = Chroma(persist_directory="./chroma_db",
            embedding_function=openai_embed_model,
            collection_name='search_docs')

query = "How steep is the cog railway?"
docs = db.similarity_search_with_score(query=query, k=1)
docs

[(Document(metadata={'doc': 'doc_8'}, page_content='Families from all over the world come to the majestic White Mountains of New Hampshire to be thrilled by the ride up on the historic Mount Washington Cog Railway. The dramatic ascent to the summit of the Northeast’s highest peak features amazing views of lush forests, deep chasms and the steepest railroad trestle in the world in Jacob’s Ladder. The Cog is the first mountain-climbing cog railway in the world. With an average grade of 25% (some sections approach nearly 38%), it’s also the second steepest! '),
  0.8440755621232334)]
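
The load step uses the same constructor call that created the store earlier, so a mistyped `persist_directory` may produce a new, empty collection rather than an error (an assumption worth checking against your installed version). A small guard, sketched here with the standard library, can confirm that the directory holds data before loading. `has_persisted_data` is a hypothetical helper.

```python
from pathlib import Path


def has_persisted_data(path="./chroma_db"):
    """Return True if `path` is a directory containing at least one entry."""
    target = Path(path)
    return target.is_dir() and any(target.iterdir())
```

A check such as `if not has_persisted_data("./chroma_db"): ...` placed before the load would flag a wrong path early. It only confirms that files exist, not that they form a valid Chroma store.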