## Build a GenAI Application
We will build a sample app that scrapes the content of a website, stores its embeddings in a vector database, and uses prompt engineering to get responses from the database through an LLM.

In Part 1, we build the vector database with Chroma and store the embeddings of the scraped content. The steps are:

Step0: Initialize the embedding model

Step1: Data Ingestion: Scrape the website content using WebBaseLoader and BeautifulSoup.

Step2: Text Splitting: Split the web content into small document chunks using RecursiveCharacterTextSplitter.

Step3: Build Vectorstore: Build the vector DB using the embedding model and Chroma. Save the DB to local storage.

In [1]:
## Read Environment Variables from .env file
import os
from dotenv import load_dotenv
load_dotenv()
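# load_dotenv() already copies the .env entries into os.environ, so only verify the key.
# (Assigning os.getenv(...) back into os.environ raises a TypeError when the key is unset.)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")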

In [2]:
## Step0: Initialize the LLM and the embedding model
from utility.llm_factory import LLMFactory
llm = LLMFactory.get_llm('openai')

from utility.embedding_factory import EmbeddingFactory

embedding_model = EmbeddingFactory.get_llm('openai')
embedding_model

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7fd23683b290>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7fd236841dd0>, model='text-embedding-3-large', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)
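
The `utility.embedding_factory` module is not shown in this notebook. As a rough illustration only, a factory of this kind could be a small registry that maps a provider name to a builder function. Everything in the sketch below (`EmbeddingFactorySketch`, `register`, `_build_openai_embeddings`) is a hypothetical stand-in, not the author's implementation; only the `get_llm` method name and the `text-embedding-3-large` model name are taken from the cell and output above.

```python
from typing import Callable, Dict


class EmbeddingFactorySketch:
    """Hypothetical stand-in for utility.embedding_factory.EmbeddingFactory."""

    # Maps a provider name to a zero-argument function that builds the model.
    _builders: Dict[str, Callable[[], object]] = {}

    @classmethod
    def register(cls, provider: str, builder: Callable[[], object]) -> None:
        # Provider names are stored lower-cased so lookups are case-insensitive.
        cls._builders[provider.lower()] = builder

    @classmethod
    def get_llm(cls, provider: str) -> object:
        # The method name mirrors the call used in the notebook.
        try:
            builder = cls._builders[provider.lower()]
        except KeyError:
            raise ValueError(f"Unknown embedding provider: {provider!r}") from None
        return builder()


def _build_openai_embeddings() -> object:
    # Imported lazily so the sketch loads even if langchain_openai is not installed.
    from langchain_openai import OpenAIEmbeddings

    # Model name taken from the repr printed above.
    return OpenAIEmbeddings(model="text-embedding-3-large")


EmbeddingFactorySketch.register("openai", _build_openai_embeddings)
```

With a registry like this, supporting another provider would only require one more `register` call, and an unknown provider name fails with a clear `ValueError`.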

In [3]:
! pip install beautifulsoup4 # needed for web scraping



In [4]:
## Step1: Data Ingestion using Web Loader
### Scraping a Web Page
from langchain_community.document_loaders import WebBaseLoader

# WebBaseLoader fetches each URL and parses the HTML with BeautifulSoup.
# No parse filter is passed here, so the full page text (including navigation) is loaded.
loader = WebBaseLoader(
    web_paths=("https://docs.smith.langchain.com/administration/tutorials/manage_spend",)
)

docs = loader.load()
docs

USER_AGENT environment variable not set, consider setting it to identify your requests.


[Document(metadata={'source': 'https://docs.smith.langchain.com/administration/tutorials/manage_spend', 'title': 'Optimize tracing spend on LangSmith | 🦜️🛠️ LangSmith', 'description': 'Before diving into this content, it might be helpful to read the following:', 'language': 'en'}, page_content='\n\n\n\n\nOptimize tracing spend on LangSmith | 🦜️🛠️ LangSmith\n\n\n\n\n\n\n\n\nSkip to main contentWe are growing and hiring for multiple roles for LangChain, LangGraph and LangSmith. Join our team!API ReferenceRESTPythonJS/TSSearchRegionUSEUGo to AppGet StartedObservabilityEvaluationPrompt EngineeringDeployment (LangGraph Platform)AdministrationTutorialsOptimize tracing spend on LangSmithHow-to GuidesSetupConceptual GuideSelf-hostingPricingReferenceCloud architecture and scalabilityAuthz and AuthnAuthentication methodsdata_formatsEvaluationDataset transformationsRegions FAQsdk_referenceChangelogCloud architecture and scalabilityAuthz and AuthnAuthentication methodsdata_formatsEvaluationDataset
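
The output above contains a lot of navigation text ("Skip to main content", menu labels) and is preceded by a `USER_AGENT` warning. A possible way to address both is sketched below: `WebBaseLoader` accepts `bs_kwargs`, which it forwards to BeautifulSoup, so a `SoupStrainer` can restrict parsing to selected tags. The helper name `build_filtered_loader`, the `"article"` tag, and the user-agent string are assumptions for illustration; inspect the page's HTML to find the right tag or class. The warning is emitted when `langchain_community` is first imported, so the variable only helps if it is set before that import (for example in the `.env` file).

```python
import os


def build_filtered_loader(url: str, tag: str = "article"):
    """Return a WebBaseLoader that only parses elements matching `tag`.

    The default tag is an assumption; adjust it to the page being scraped.
    """
    # Set USER_AGENT before langchain_community is imported in this process.
    # The value is an arbitrary placeholder.
    os.environ.setdefault("USER_AGENT", "genai-tutorial-notebook")

    # Imported lazily so this sketch can be defined without the packages installed.
    import bs4
    from langchain_community.document_loaders import WebBaseLoader

    return WebBaseLoader(
        web_paths=(url,),
        # parse_only tells BeautifulSoup to keep only the matching elements.
        bs_kwargs={"parse_only": bs4.SoupStrainer(tag)},
    )
```

Calling `build_filtered_loader(url).load()` would then return documents whose `page_content` is limited to the matched elements, which usually gives cleaner chunks in the next step.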

In [None]:
## Step2: Text Splitting
### Using RecursiveCharacterTextSplitter to split the documents into manageable chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

# Note: the input here is a list of Document objects, not just strings.
document_chunks = text_splitter.split_documents(docs)

document_chunks[:2]  # Display the first two chunks

[Document(metadata={'source': 'https://docs.smith.langchain.com/administration/tutorials/manage_spend', 'title': 'Optimize tracing spend on LangSmith | 🦜️🛠️ LangSmith', 'description': 'Before diving into this content, it might be helpful to read the following:', 'language': 'en'}, page_content='Optimize tracing spend on LangSmith | 🦜️🛠️ LangSmith'),
 Document(metadata={'source': 'https://docs.smith.langchain.com/administration/tutorials/manage_spend', 'title': 'Optimize tracing spend on LangSmith | 🦜️🛠️ LangSmith', 'description': 'Before diving into this content, it might be helpful to read the following:', 'language': 'en'}, page_content='Skip to main contentWe are growing and hiring for multiple roles for LangChain, LangGraph and LangSmith. Join our team!API ReferenceRESTPythonJS/TSSearchRegionUSEUGo to AppGet StartedObservabilityEvaluationPrompt EngineeringDeployment (LangGraph Platform)AdministrationTutorialsOptimize tracing spend on LangSmithHow-to GuidesSetupConceptual GuideSelf-
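
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one tries separators such as paragraph and line breaks before falling back to smaller units); it only shows how a fixed window with overlap walks over the text. The function name `split_with_overlap` is made up for this sketch.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 50-character overlap plays in the cell above: text that straddles a boundary still appears intact in at least one chunk.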

In [6]:
# Test the embedding model: embed the text of every chunk
embeddings = embedding_model.embed_documents([doc.page_content for doc in document_chunks])
embeddings[:2]  # Display the first two embeddings

[[0.01623169332742691,
  0.017965668812394142,
  -0.013592668808996677,
  -0.008585289120674133,
  -0.004436437506228685,
  6.125740765128285e-05,
  0.02256704494357109,
  0.030213449150323868,
  -0.04114171862602234,
  0.026728583499789238,
  0.02950294315814972,
  0.029621360823512077,
  -0.009160460904240608,
  -0.006250767037272453,
  0.03681101277470589,
  0.027963511645793915,
  -0.010276971384882927,
  -0.025764325633645058,
  0.037487685680389404,
  -0.03809669241309166,
  0.02972286194562912,
  -0.03931470215320587,
  -0.02442789636552334,
  0.01767808198928833,
  -0.02916460670530796,
  -0.014988306909799576,
  -0.02053702622652054,
  0.03549150004982948,
  -0.02501998469233513,
  -0.00804395042359829,
  0.009050501510500908,
  0.005755949765443802,
  -0.017280537635087967,
  -0.02579815872013569,
  0.028995439410209656,
  -0.02950294315814972,
  0.05349100008606911,
  -0.014480802230536938,
  -0.026221079751849174,
  0.020959947258234024,
  -0.03694634512066841,
  -0.0198941
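
Each chunk is now a long list of floats. A vector store retrieves relevant chunks by comparing such vectors, and one common measure is cosine similarity. The sketch below is a plain-Python illustration with made-up helper names (`cosine_similarity`, `top_k`) and toy two-dimensional vectors; the metric a given Chroma collection actually uses depends on its configuration, so treat this as a conceptual example rather than a description of Chroma's internals.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 2) -> List[int]:
    """Indices of the k vectors most similar to `query`, best first."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real embeddings.
toy_vectors = [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0]]
print(top_k([1.0, 0.0], toy_vectors, k=2))
# → [2, 1]
```

The same idea applies to the real embeddings: the query text is embedded with the same model, and the chunks whose vectors are closest to the query vector are returned.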

In [7]:
## Step3: Vector Store Creation
### Using Chroma to create a vector store from the document chunks

# Chroma lives in langchain_community; the langchain.vectorstores path is deprecated.
from langchain_community.vectorstores import Chroma
vector_store_db = Chroma.from_documents(
    document_chunks,
    embedding_model,
    collection_name="ragtest1_collection",
    persist_directory="./_data/chroma_db"
)
# Check the number of documents in the vector store.
# Note: _collection is a private attribute and may change between versions.
print(f"Number of documents in the vector store: {vector_store_db._collection.count()}")

Number of documents in the vector store: 37
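
Because the store was created with a `persist_directory`, a later session can reopen it instead of re-embedding the page. The helper below is a minimal sketch of that, assuming the same collection name and directory as in the cell above and the `Chroma` class from `langchain_community`; the function name `query_persisted_store` and the default `k` are made up for illustration.

```python
def query_persisted_store(embedding_model, query: str, k: int = 3):
    """Reopen the persisted Chroma collection and return the k most similar chunks.

    `embedding_model` must be the same kind of model that built the store,
    otherwise the query vector will not be comparable to the stored vectors.
    """
    # Imported lazily so this sketch can be defined without the package installed.
    from langchain_community.vectorstores import Chroma

    store = Chroma(
        collection_name="ragtest1_collection",   # same name as used when building
        persist_directory="./_data/chroma_db",   # same directory as used when building
        embedding_function=embedding_model,
    )
    # similarity_search embeds the query and returns matching Document objects.
    return store.similarity_search(query, k=k)
```

For example, `query_persisted_store(embedding_model, "How can I reduce tracing spend?")` would return up to three `Document` chunks from the scraped page.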
