<a target="_blank" href="https://colab.research.google.com/github/UpstageAI/cookbook/blob/main/Solar-Fullstack-LLM-101/05_1_ChromaDB.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# 05-1.ChromaDB

## Overview  
In this exercise, we use ChromaDB to embed documents and build a vector store. We then create a Retriever object that searches the stored documents by semantic similarity to a query.
 
## Purpose of the Exercise
The purpose of this exercise is to show how to generate embeddings with the Solar Embedding API and store them in a vector store. By the end of this tutorial, you will be able to create a Retriever object and search the vector store for the documents most relevant to a query.



## Keyword vs. Semantic Search
![Vector](https://blog.dataiku.com/hs-fs/hubfs/dftt%202.webp?width=1346&height=632&name=dftt%202.webp)

from https://blog.dataiku.com/semantic-search-an-overlooked-nlp-superpower

![Emb_search](figures/emb_search.png)

from https://sreent.medium.com/llms-embeddings-and-vector-search-d4bd9362df56
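
The difference between the two search styles can be sketched in a few lines of plain Python. This is only an illustration: the three-dimensional vectors below are hand-made stand-ins for real embeddings (an assumption, not output of the Solar Embedding API), and `keyword_overlap` and `cosine_similarity` are hypothetical helper names. Keyword search scores texts by shared words, while semantic search scores them by how close their vectors are.

```python
import math


def keyword_overlap(query, text):
    """Count the lowercase words that the query and the text share."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", hand-made for illustration only.
toy_vectors = {
    "fix a software bug": [0.9, 0.1, 0.0],
    "repair defects in code": [0.8, 0.2, 0.1],
    "visit Korea in spring": [0.0, 0.1, 0.9],
}

query = "fix a software bug"
for text, vector in toy_vectors.items():
    overlap = keyword_overlap(query, text)
    similarity = cosine_similarity(toy_vectors[query], vector)
    print(f"{text!r}: shared words={overlap}, cosine={similarity:.3f}")
```

With these toy vectors, "repair defects in code" shares no words with the query yet sits close to it in vector space, which is the case keyword search misses.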

In [1]:
! pip3 install -qU  markdownify  langchain langchain-chroma langchain-upstage rank_bm25 python-dotenv

In [3]:
# @title set API key
import os
import getpass
from pprint import pprint
import warnings

warnings.filterwarnings("ignore")

from IPython import get_ipython

if "google.colab" in str(get_ipython()):
    # Running in Google Colab. Please set the UPSTAGE_API_KEY in the Colab Secrets
    from google.colab import userdata
    os.environ["UPSTAGE_API_KEY"] = userdata.get("UPSTAGE_API_KEY")
else:
    # Running locally. Please set the UPSTAGE_API_KEY in the .env file
    from dotenv import load_dotenv

    load_dotenv()

if "UPSTAGE_API_KEY" not in os.environ:
    os.environ["UPSTAGE_API_KEY"] = getpass.getpass("Enter your Upstage API key: ")


In [22]:
from langchain_chroma import Chroma
from langchain_upstage import UpstageEmbeddings
from langchain.docstore.document import Document

from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

sample_text = [
    "Korea is a beautiful country to visit in the spring.",
    "The best time to visit Korea is in the fall.",
    "Best way to find bug is using unit test.",
    "Python is a great programming language for beginners.",
    "Sung Kim is a great teacher.",
]

splits = RecursiveCharacterTextSplitter().create_documents(sample_text)

print(splits)

vectorstore = Chroma.from_documents(
    documents=splits,
    ids=[doc.page_content for doc in splits],
    embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
)

[Document(page_content='Korea is a beautiful country to visit in the spring.'), Document(page_content='The best time to visit Korea is in the fall.'), Document(page_content='Best way to find bug is using unit test.'), Document(page_content='Python is a great programming language for beginners.'), Document(page_content='Sung Kim is a great teacher.')]


In [23]:
# Check whether a text is already stored.
# This works because each document's ID is its own page content.
def is_in_vectorstore(vectorstore, text):
    search_results = vectorstore.get(ids=[text])
    return bool(search_results and search_results["ids"])

In [24]:
is_in_vectorstore(vectorstore, "Hello, new sentence")

False

In [25]:
is_in_vectorstore(vectorstore, splits[0].page_content)

True
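
Using the full page content as the ID is simple, but IDs then grow as long as the chunks themselves. One possible alternative, sketched below, is to derive a fixed-length ID from a hash of the text. This is not what the notebook does: `content_id` and `drop_duplicate_texts` are hypothetical helpers, and the choice of SHA-256 is an assumption. Only the standard library `hashlib` module is used.

```python
import hashlib


def content_id(text: str) -> str:
    """Return a deterministic, fixed-length ID derived from the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_duplicate_texts(texts):
    """Keep the first occurrence of each text, preserving the input order."""
    seen = set()
    unique = []
    for text in texts:
        cid = content_id(text)
        if cid not in seen:
            seen.add(cid)
            unique.append(text)
    return unique


texts = [
    "Korea is a beautiful country to visit in the spring.",
    "Sung Kim is a great teacher.",
    "Korea is a beautiful country to visit in the spring.",  # repeated on purpose
]
unique_texts = drop_duplicate_texts(texts)
ids = [content_id(text) for text in unique_texts]
print(len(unique_texts), "unique texts")
print(ids[0][:12], "...")
```

If you adopt this scheme, pass the hashed IDs as `ids=` when indexing and look up `content_id(text)` rather than the raw text in `is_in_vectorstore`. Removing repeats before indexing also avoids submitting the same ID twice in one batch, which the store may reject.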

In [8]:
from langchain_upstage import UpstageLayoutAnalysisLoader


# 1. Load
layzer = UpstageLayoutAnalysisLoader("pdfs/kim-tse-2008.pdf", output_type="html")
# For better memory efficiency, lazy_load() yields documents page by page instead of loading them all at once.
docs = layzer.load()  # or layzer.lazy_load()

In [9]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

# 2. Split
text_splitter = RecursiveCharacterTextSplitter.from_language(
    chunk_size=1000, chunk_overlap=100, language=Language.HTML
)
splits = text_splitter.split_documents(docs)
print("Splits:", len(splits))

Splits: 125
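
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window splitter in plain Python. It is a sketch, not the algorithm `RecursiveCharacterTextSplitter` implements (the real splitter prefers to cut at separators such as HTML tags rather than at fixed offsets), and `chunk_text` is a hypothetical helper name.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content that
    straddles a boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
for chunk in chunk_text(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

A larger overlap lowers the chance that a sentence is cut off from its context, at the cost of producing more chunks to embed and store.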


In [10]:
from langchain_chroma import Chroma

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=UpstageEmbeddings(model="solar-embedding-1-large"),
)
retriever = vectorstore.as_retriever()


unique_splits = [
    split for split in splits if not is_in_vectorstore(vectorstore, split.page_content)
]
print(len(unique_splits))

# 3. Embed and index only the splits that are not stored yet
if len(unique_splits) > 0:
    vectorstore = Chroma.from_documents(
        ids=[split.page_content for split in unique_splits],
        persist_directory="./chroma_db",
        documents=unique_splits,
        embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
    )

125


In [13]:
from langchain_chroma import Chroma

vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=UpstageEmbeddings(model="solar-embedding-1-large"),
)
retriever = vectorstore.as_retriever()

unique_splits = [
    split for split in splits if not is_in_vectorstore(vectorstore, split.page_content)
]
print(len(unique_splits))

# 3. Embed and index (skipped on this second run because every split is already stored)
if len(unique_splits) > 0:
    vectorstore = Chroma.from_documents(
        ids=[split.page_content for split in unique_splits],
        persist_directory="./chroma_db",
        documents=unique_splits,
        embedding=UpstageEmbeddings(model="solar-embedding-1-large"),
    )

0


In [15]:
search_result = retriever.invoke("How to find problems in code?")
print(search_result[0].page_content[:100])

<p id='13' style='font-size:16px'>introduced bugs immediately. Several bug-finding techni-<br>ques c
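
Because the loader was run with `output_type="html"`, the retrieved chunks still contain markup. For display purposes, the tags can be stripped with the standard library alone. The sketch below is one possible approach, not part of the original notebook; `html_to_text` and `_TextExtractor` are hypothetical names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and turn <br> tags into spaces."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


fragment = "<p id='13' style='font-size:16px'>introduced bugs immediately.<br>Several</p>"
print(html_to_text(fragment))
```

In the notebook you could call `html_to_text(search_result[0].page_content)` before printing. Note that words hyphenated across a `<br>` are not rejoined by this sketch.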
