### Dependencies
`langchain`  
`langchain-community`  
`langchain-huggingface`  
`langchain-openai`  
`langchain-chroma`  
`langchain-text-splitters`  



### Load data from a URL  
https://python.langchain.com/docs/integrations/document_loaders/web_base/

Document loaders are objects that load data from a source and return a list of `Document` objects.  
A `Document` is an object with some `page_content` (a `str`) and `metadata` (a `dict`).  
https://python.langchain.com/docs/how_to/#document-loaders
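
To make the shape of a Document concrete, here is a minimal sketch using only the standard library. `SimpleDocument` and `load_from_strings` are hypothetical stand-ins invented for illustration, not LangChain classes, and the example.com URLs are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two fields described above."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_from_strings(pages: dict) -> list:
    """Toy 'loader': turn {source: text} into a list of documents."""
    return [
        SimpleDocument(page_content=text, metadata={"source": source})
        for source, text in pages.items()
    ]


toy_docs = load_from_strings({
    "https://example.com/a": "First page text.",
    "https://example.com/b": "Second page text.",
})
print(toy_docs[0].metadata["source"])
print(toy_docs[1].page_content)
```

A real loader does the fetching and parsing; the returned list has the same shape, one object per page with text plus a metadata dict.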

In [1]:
import os

# Set USER_AGENT before importing the loader. If it is set after the import, langchain_community
# warns: "USER_AGENT environment variable not set, consider setting it to identify your requests."
os.environ['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'

from langchain_community.document_loaders import WebBaseLoader


#### Load a single page

In [2]:
loader = WebBaseLoader("https://www.home0001.com/how-it-works")
data = loader.load()
print(data)

[Document(metadata={'source': 'https://www.home0001.com/how-it-works', 'title': 'Learn how to own your home and live anywhere | Home0001', 'description': 'Flexible Living Fully Furnished Homes For Sale | Home0001', 'language': 'en'}, page_content="Learn how to own your home and live anywhere | Home0001HOME0001MenuHOME0001MenuHomes:0001: Lower East SideStudioStudio Max1 Bedroom0001: Bed-Stuy1 Bedroom2 Bedroom0001: Echo Park TownhousesTownhouse Type ATownhouse Type B0001: Echo Park BungalowsBungalow 1Bungalow 20001: Peckham1 BEDROOM0001: Hackney2 BEDROOM0001: Schöneberg1 BEDROOM0001 GDSHow It WorksContact UsLegalPrices:Fiat CryptoHomes:0001: Lower East SideStudioStudio Max1 Bedroom0001: Bed-Stuy1 Bedroom2 Bedroom0001: Echo Park TownhousesTownhouse Type ATownhouse Type B0001: Echo Park BungalowsBungalow 1Bungalow 20001: Peckham1 BEDROOM0001: Hackney2 BEDROOM0001: Schöneberg1 BEDROOM0001 homes are fully equipped, part of a global network, and uniquely simple to buy and own.HOW IT WORKS:Buy

#### Load multiple pages

In [None]:
loader_multiple_pages = WebBaseLoader(["https://www.home0001.com/how-it-works", "https://www.home0001.com/legal"])
docs = loader_multiple_pages.load()
print(docs)
print(docs[1].page_content)

[Document(metadata={'source': 'https://www.home0001.com/how-it-works', 'title': 'Learn how to own your home and live anywhere | Home0001', 'description': 'Flexible Living Fully Furnished Homes For Sale | Home0001', 'language': 'en'}, page_content="Learn how to own your home and live anywhere | Home0001HOME0001MenuHOME0001MenuHomes:0001: Lower East SideStudioStudio Max1 Bedroom0001: Bed-Stuy1 Bedroom2 Bedroom0001: Echo Park TownhousesTownhouse Type ATownhouse Type B0001: Echo Park BungalowsBungalow 1Bungalow 20001: Peckham1 BEDROOM0001: Hackney2 BEDROOM0001: Schöneberg1 BEDROOM0001 GDSHow It WorksContact UsLegalPrices:Fiat CryptoHomes:0001: Lower East SideStudioStudio Max1 Bedroom0001: Bed-Stuy1 Bedroom2 Bedroom0001: Echo Park TownhousesTownhouse Type ATownhouse Type B0001: Echo Park BungalowsBungalow 1Bungalow 20001: Peckham1 BEDROOM0001: Hackney2 BEDROOM0001: Schöneberg1 BEDROOM0001 homes are fully equipped, part of a global network, and uniquely simple to buy and own.HOW IT WORKS:Buy

### Pre-process data


#### Chunk, split and store the data

TBD: it's important to figure out the right chunk size later on

We use RecursiveCharacterTextSplitter, which will recursively split the document using common separators like new lines until each chunk is the appropriate size.  
This is the recommended text splitter for generic text use cases.

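Passing `add_start_index=True` to the splitter preserves the character index at which each split Document starts within the original Document, as the metadata attribute `start_index`. The cell below does not set it, so the metadata printed there has no `start_index` key.  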
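
The recursive idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it ignores `chunk_overlap`, and the function names `recursive_split` and `with_start_index` are hypothetical. It tries coarse separators first, falls back to finer ones for pieces that are still too long, and then records where each chunk starts in the original text.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters, preferring coarse separators."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


def with_start_index(text, chunks):
    """Attach the character offset of each chunk, similar in spirit to start_index."""
    out, cursor = [], 0
    for chunk in chunks:
        start = text.find(chunk, cursor)
        out.append({"page_content": chunk, "metadata": {"start_index": start}})
        cursor = start + len(chunk)
    return out


sample = "Buy.\n\nMove in.\n\nSwap homes with other members.\n\nTotal control."
toy_chunks = recursive_split(sample, chunk_size=20)
toy_splits = with_start_index(sample, toy_chunks)
for split in toy_splits:
    print(split["metadata"]["start_index"], repr(split["page_content"]))
```

The short paragraphs are merged into one chunk, while the long paragraph is broken at spaces because no paragraph or line break fits within 20 characters.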

Next we need to index our text chunks so that we can search over them at runtime. The most common way to do this is to embed the contents of each document split and insert these embeddings into a vector database (or vector store). To search over the splits, we take a text query, embed it, and run a similarity search to find the stored splits whose embeddings are most similar to the query embedding. The simplest similarity measure is cosine similarity: the cosine of the angle between the query embedding and each stored embedding (both are high-dimensional vectors).
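
As a rough illustration of the search step, the sketch below computes cosine similarity by hand and ranks a toy index. The three-dimensional vectors, the texts in `toy_index`, and the helper `top_k` are made up for the example; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """cos(angle) = dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-d "embeddings" keyed by the text they represent.
toy_index = {
    "homes are fully furnished": [0.9, 0.1, 0.0],
    "legal notices and disclaimers": [0.0, 0.2, 0.9],
    "swap homes with other members": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedded query


def top_k(query, index, k):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


print(top_k(query_vector, toy_index, k=2))
```

A vector store does the same ranking, usually with an approximate nearest-neighbour index so it does not have to score every stored vector.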

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# set up the splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# split the docs
splits = text_splitter.split_documents(docs)
# create a vector database with the splits
vectorstore = Chroma.from_documents(
    documents=splits, 
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    # persist_directory="./chroma_langchain_db",  # Where to save data locally, remove if not necessary
)

print(len(splits))
print(len(splits[12].page_content))
print(splits[12].metadata)

99
990
{'source': 'https://www.home0001.com/legal', 'title': 'Legal Notices for 0001 homes live flexibly own your home', 'description': 'Own the perfect home.', 'language': 'en'}


### Retrieve

A Retriever is an interface that returns relevant Documents from an index based on a string query.  

The most common type of Retriever is the VectorStoreRetriever, which uses the similarity search capabilities of a vector store to facilitate retrieval.  
Any VectorStore can easily be turned into a Retriever with `VectorStore.as_retriever()`.
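
The relationship between a store and its retriever can be sketched with two toy classes. `ToyVectorStore` and `ToyRetriever` are hypothetical and only mimic the shape of the interface; word overlap (Jaccard similarity) stands in for embedding similarity so the example runs without a model or API key.

```python
class ToyVectorStore:
    """Hypothetical store: holds texts and ranks them by word overlap with the query."""

    def __init__(self, texts):
        self.texts = list(texts)

    @staticmethod
    def _tokens(text):
        return set(text.lower().split())

    def similarity_search(self, query, k=4):
        q = self._tokens(query)

        def score(text):
            t = self._tokens(text)
            union = q | t
            return len(q & t) / len(union) if union else 0.0  # Jaccard similarity

        return sorted(self.texts, key=score, reverse=True)[:k]

    def as_retriever(self, k=4):
        return ToyRetriever(self, k)


class ToyRetriever:
    """Hypothetical retriever: a string query in, a list of relevant texts out."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query):
        return self.store.similarity_search(query, k=self.k)


toy_store = ToyVectorStore([
    "every home is fully furnished",
    "legal notices for all property information",
    "members can swap homes for free",
])
toy_retriever = toy_store.as_retriever(k=1)
print(toy_retriever.invoke("is the home furnished"))
```

The retriever adds no search logic of its own; it fixes the search settings once so that callers only pass a query string.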

In [5]:
# Retrieve and generate using the relevant snippets of the site.
# retriever = vectorstore.as_retriever()

retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.invoke("What is home0001?")

print(len(retrieved_docs))
print(retrieved_docs[0].page_content)

6
of buying a home and streamlined every part of it. You can spend time in your new home to see how it feels and then buy it online instantly, safely, and securely. We’ll guide you through each step, from choosing the right home and getting financing to completing the purchase online at your own pace.Move in.Each new 0001 home is fully equipped with all the furniture, appliances, and home essentials you’ll need. Developed in collaboration with world-class architects, every detail has been thoughtfully designed, so you can truly move in with nothing but your suitcase.Swap.HOME0001 is a peer-to-peer housing collective. Members of the collective get access to 0001 homes in other locations, so you can spend time in other places for free while making your home available to other members while you’re away.Total control.You own your home outright. You're 100% in control, just like owning any other home. It’s your call whether you make your home available to other members of the collective, an

Other retrieval techniques include:  
- MultiQueryRetriever generates variants of the input question to improve retrieval hit rate.
- MultiVectorRetriever instead generates variants of the embeddings, also in order to improve retrieval hit rate.
- Maximal marginal relevance selects for relevance and diversity among the retrieved documents to avoid passing in duplicate context.
- Documents can be filtered during vector store retrieval using metadata filters, such as with a Self Query Retriever.
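
Of the techniques listed above, maximal marginal relevance is easy to sketch by hand. The version below is a minimal illustration with made-up two-dimensional vectors, not the vector store's own implementation. The parameter name `lambda_mult` follows the name LangChain uses for this trade-off; the function `mmr` and everything else is an assumption for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy.

    score = lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, already selected)
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


toy_query = [1.0, 0.0]
toy_vectors = [
    [1.0, 0.10],   # 0: most relevant
    [1.0, 0.12],   # 1: near-duplicate of 0
    [0.8, -0.60],  # 2: less relevant, but different from 0
]
print(mmr(toy_query, toy_vectors, k=2))                   # balanced: skips the near-duplicate
print(mmr(toy_query, toy_vectors, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is the plain similarity ranking; lower values push the selection toward diversity.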

### Generate 

In [6]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()

# print(example_messages)
print(example_messages[0].content)



You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


In [7]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [8]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# print(format_docs(docs))

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

for chunk in rag_chain.stream("What is Home0001?"):
    print(chunk, end="", flush=True)
print()  # end the streamed line so the next answer starts on its own line

print(rag_chain.invoke("can i rent an apartment?"))

Home0001 is a housing collective that offers fully furnished homes for sale, allowing buyers to own their homes outright and participate in a community where they can share and access homes in various locations. Members have the flexibility to make their homes available to others and can swap homes without paying nightly rates. The homes are designed with meticulous attention to detail and come equipped with all necessary furnishings and appliances.
Yes, you can rent an apartment through HOME0001. Each home is fully furnished and equipped, making it easy to move in. You also have the option to swap homes with other members in the network.


In [9]:
# cleanup
vectorstore.delete_collection()

In [None]:
# from collections import defaultdict

# website_text = defaultdict(str)

# for doc in docs:
#     url = doc.metadata["source"]
#     website_text[url] += f"{doc.page_content}\n"

# print(website_text)
# website_text = dict(website_text)
# print(website_text)

defaultdict(<class 'str'>, {'https://www.home0001.com/how-it-works': "Learn how to own your home and live anywhere | Home0001HOME0001MenuHOME0001MenuHomes:0001: Lower East SideStudioStudio Max1 Bedroom0001: Bed-Stuy1 Bedroom2 Bedroom0001: Echo Park TownhousesTownhouse Type ATownhouse Type B0001: Echo Park BungalowsBungalow 1Bungalow 20001: Peckham1 BEDROOM0001: Hackney2 BEDROOM0001: Schöneberg1 BEDROOM0001 GDSHow It WorksContact UsLegalPrices:Fiat CryptoHomes:0001: Lower East SideStudioStudio Max1 Bedroom0001: Bed-Stuy1 Bedroom2 Bedroom0001: Echo Park TownhousesTownhouse Type ATownhouse Type B0001: Echo Park BungalowsBungalow 1Bungalow 20001: Peckham1 BEDROOM0001: Hackney2 BEDROOM0001: Schöneberg1 BEDROOM0001 homes are fully equipped, part of a global network, and uniquely simple to buy and own.HOW IT WORKS:Buy.We’ve reinvented the typical months-long ordeal of buying a home and streamlined every part of it. You can spend time in your new home to see how it feels and then buy it online

### Embedding Models

In [None]:
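# Assumption: `embed` and `documents` came from a cell not shown here; stand-ins are defined below.
embed = OpenAIEmbeddings(model="text-embedding-3-large")
documents = splits  # chunks produced by the text splitter above
single_text = documents[1].page_content
vector = embed.embed_query(single_text)
print(vector[:3])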

[0.008912574, -0.009420217, 0.004011392]


Many more embedding models and providers are available, such as Mistral and Ollama.

### Vector Databases

In [11]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore.from_documents(docs, OpenAIEmbeddings())

# query = "is furniture included?"

results = vector_store.similarity_search(query="furniture", k=5)

for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

* Legal Notices for 0001 homes live flexibly own your homeHOME0001MenuHOME0001MenuHomes:0001: Lower East SideStudioStudio Max1 Bedroom0001: Bed-Stuy1 Bedroom2 Bedroom0001: Echo Park TownhousesTownhouse Type ATownhouse Type B0001: Echo Park BungalowsBungalow 1Bungalow 20001: Peckham1 BEDROOM0001: Hackney2 BEDROOM0001: Schöneberg1 BEDROOM0001 GDSHow It WorksContact UsLegalPrices:Fiat CryptoHomes:0001: Lower East SideStudioStudio Max1 Bedroom0001: Bed-Stuy1 Bedroom2 Bedroom0001: Echo Park TownhousesTownhouse Type ATownhouse Type B0001: Echo Park BungalowsBungalow 1Bungalow 20001: Peckham1 BEDROOM0001: Hackney2 BEDROOM0001: Schöneberg1 BEDROOMLegalALL MATERIAL PRESENTED HEREIN IS INTENDED FOR INFORMATION PURPOSES ONLY. WHILE THIS INFORMATION IS BELIEVED TO BE CORRECT, IT IS REPRESENTED SUBJECT TO ERRORS, OMISSIONS, CHANGES OR WITHDRAWAL WITHOUT NOTICE. ALL PROPERTY INFORMATION, INCLUDING, BUT NOT LIMITED TO SQUARE FOOTAGE, ROOM COUNT, NUMBER OF BEDROOMS AND THE SCHOOL DISTRICT IN PROPERTY 

TBD: add local files and more granular examples