<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M3_3_LanceDB_v2_Solutions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![](https://lancedb.github.io/lancedb/assets/ecosystem-illustration.png)

LanceDB is an open-source vector database built on persistent storage, which simplifies the retrieval, filtering, and management of embeddings.

In [None]:
!pip install lancedb --q
!pip install pypdf --q
# !pip install -qqq chromadb==0.4.10 --progress-bar off
!pip install -qqq langchain==0.0.299 --progress-bar off
!pip install -qqq sentence_transformers==2.2.2 --progress-bar off


In [None]:
import lancedb

uri = "/content/data/sample-lancedb"
db = lancedb.connect(uri)

In [None]:
table = db.create_table("my_table",
                        data=[{"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
                              {"vector": [5.9, 26.5], "item": "bar", "price": 20.0}])

In [None]:
result = table.search([100, 100]).limit(2).to_list()

In [None]:
result

[{'vector': [5.900000095367432, 26.5],
  'item': 'bar',
  'price': 20.0,
  '_distance': 14257.0595703125},
 {'vector': [3.0999999046325684, 4.099999904632568],
  'item': 'foo',
  'price': 10.0,
  '_distance': 18586.421875}]

# Implementing a Vector Database for Documents

In [None]:
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.document_loaders import PyPDFLoader
from langchain.llms import HuggingFaceHub
from langchain.embeddings import HuggingFaceEmbeddings

# PyPDFLoader returns one Document per PDF page
loader = PyPDFLoader("/content/attention.pdf")

docs = loader.load()
len(docs)

15

The PDF we're loading is the original Transformer paper, "Attention Is All You Need". Let's see how we can use the RecursiveCharacterTextSplitter to split the document into smaller chunks:

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = text_splitter.split_documents(docs)
len(texts)

47
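
To picture what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch: a fixed-size character window that slides forward by `chunk_size - chunk_overlap` characters. This is an assumption-laden illustration, not how RecursiveCharacterTextSplitter works internally (the real splitter first tries to break on separators such as paragraphs, lines, and spaces); `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk.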

Splitting the document into chunks is required because an LLM can only look at a limited number of tokens at once (4096 for Llama 2). Next, we'll use the HuggingFaceEmbeddings class to create embeddings for the chunks:

In [None]:
import lancedb
from langchain.vectorstores import LanceDB
from langchain.embeddings import HuggingFaceEmbeddings


uri = "/content/data/paper-lancedb-"
db = lancedb.connect(uri)

In [None]:
# We will use HuggingFace embeddings
embeddings = HuggingFaceEmbeddings()

In [None]:
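# Seed the table with one placeholder row so LanceDB can infer the schema
table = db.create_table(
    "paper_table",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)

# Embed chunks 5 to 19 and add them to the table
docsearch = LanceDB.from_documents(texts[5:20], embeddings, connection=table)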

In [None]:
retriever = docsearch.as_retriever(search_kwargs={'k': 2})

In [None]:
retriever

In [None]:
texts[0].page_content

In [None]:
# texts[0] itself was not indexed (only texts[5:20] were), so this returns the closest stored rows
result = table.search(embeddings.embed_query(texts[0].page_content)).limit(2).to_list()

In [None]:
result[0].keys()

## Exercise 1: Create a LanceDB for Two Papers and Load Each into a Table

In [None]:
table = db.create_table(
    "paper_table_1",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)

docsearch = LanceDB.from_documents(texts[5:10], embeddings, connection=table)

In [None]:
table = db.create_table(
    "paper_table_2",
    data=[
        {
            "vector": embeddings.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)

docsearch = LanceDB.from_documents(texts[10:15], embeddings, connection=table)
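
The exercise asks for one table per paper; the cells above stand in for that with two slices of the same paper. A minimal sketch of preparing rows per table, in the same `vector` / `text` / `id` layout as the placeholder row, might look like the following. `build_records`, `fake_embed`, and the sample strings are hypothetical; in the notebook, `embed_fn` would be `embeddings.embed_query` and each list of rows would be passed as `data=` to `db.create_table`.

```python
def build_records(chunks, embed_fn, id_prefix):
    """Turn text chunks into rows with vector, text, and id columns."""
    return [
        {"vector": embed_fn(chunk), "text": chunk, "id": f"{id_prefix}-{i}"}
        for i, chunk in enumerate(chunks)
    ]

def fake_embed(text):
    # Stand-in embedding for illustration only: [length, number of spaces].
    return [float(len(text)), float(text.count(" "))]

# Hypothetical chunks for two different papers, keyed by table name.
paper_chunks = {
    "paper_table_1": ["first chunk of paper one", "second chunk"],
    "paper_table_2": ["only chunk of paper two"],
}

records_by_table = {
    name: build_records(chunks, fake_embed, id_prefix=name)
    for name, chunks in paper_chunks.items()
}

for name, records in records_by_table.items():
    print(name, [r["id"] for r in records])
```

Prefixing each `id` with the table name keeps row identifiers unique across papers, which helps if results from both tables are later merged.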

# ChromaDB

![](https://images.datacamp.com/image/upload/v1693482377/image4_7b6910cd7c.png)

In the spirit of using free tools, we keep using the free embedding models from HuggingFace. We'll use the Chroma database to store (cache) the embeddings and make them easy to search.

A vector store like this is what a RetrievalQA chain would query to combine an LLM with the database; here we only build the store and search it:

In [None]:
!pip install -qqq chromadb==0.4.10 --progress-bar off

In [None]:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(texts, embeddings, persist_directory="db")
results = db.similarity_search("Transformer models", k=2)
print(results[0].page_content)

In [None]:
results[0]

## Exercise 2: Create a ChromaDB for Two Papers and Load Each into a Collection

In [None]:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(texts[5:10], embeddings, persist_directory="db_1", collection_name='paper_1')
results = db.similarity_search("Transformer models", k=2)
print(results[0].page_content)

In [None]:
from langchain.vectorstores import Chroma

# Use a different slice for the second collection, mirroring Exercise 1
db = Chroma.from_documents(texts[10:15], embeddings, persist_directory="db_1", collection_name='paper_2')
results = db.similarity_search("Transformer models", k=2)
print(results[0].page_content)