This is a starter notebook for the project. Import the libraries you need; the ones available in this workspace are listed in its requirements.txt file.


# Project: Personalized Real Estate Agent

An example AI agent built with Python, LangChain, a vector database, and OpenAI's API.


## Step 1: Synthetic Data Generation

Use an LLM to generate at least 10 real estate listings,
which serve as the data source stored in the vector database.


In [1]:
# Import Python Packages

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, NonNegativeInt
from typing import List
from random import sample 
from langchain.document_loaders.csv_loader import CSVLoader 




In [2]:

# Step 1.1: Initialize OpenAI

from langchain_openai import ChatOpenAI
from dotenv import load_dotenv
import os

# Load variables from a local .env file (if present) before reading the key
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

model_name = "gpt-3.5-turbo"
llm = ChatOpenAI(model=model_name, temperature=0, api_key=OPENAI_API_KEY)
#llm = OpenAI(model_name = model_name, temperature=0.0)



In [3]:

# Step 1.2: Define data model for parser

class RealEstate(BaseModel):
    title: str = Field(description="The name or title of a house")
    bedroom: int = Field(description="Number of bedroom for a house")
    bathroom: int = Field(description="Number of bathroom for a house")
    garage: int = Field(description="Number of garage for a house")
    price_usd: int = Field(description="The price of a house in USD")
    size_sqft: int = Field(description="The size of a house in square feet") 
    description: str = Field(description="The 200-word description of a house")
    neighborhood: str = Field(description="The brief summary or name of the neighborhood for the house")
    neighborhood_details: str = Field(description="The 200-word description of the neighborhood")


parser = PydanticOutputParser(pydantic_object=RealEstate)
print(parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"title": {"description": "The name or title of a house", "title": "Title", "type": "string"}, "bedroom": {"description": "Number of bedroom for a house", "title": "Bedroom", "type": "integer"}, "bathroom": {"description": "Number of bathroom for a house", "title": "Bathroom", "type": "integer"}, "garage": {"description": "Number of garage for a house", "title": "Garage", "type": "integer"}, "price_usd": {"description": "The price of a house in USD", "title": "Price Usd", "type": "integer"}, "size_sqft": {"description": "The siz
```

In [28]:

# Step 1.3: Ask the LLM to generate a list of real estate listings and save it to a CSV file

from langchain_core.prompts import ChatPromptTemplate
import pandas as pd
import os.path

HOUSE_FILE_NAME_CSV = "Listings.csv"
HOUSE_FILE_NAME_TXT = "Listings.txt"

# Wrapper model so the LLM returns all listings in one JSON object under "houses"
class RealEstateList(BaseModel):
    houses: List[RealEstate] = Field(description="The list of generated houses")

list_parser = PydanticOutputParser(pydantic_object=RealEstateList)

def generate_house_list():
    question = """
    Generate 11 houses which are currently for sale in the US market.
    Each house should include these properties:
    title,
    number of bedrooms,
    number of bathrooms,
    number of garages,
    price (integer in USD),
    size (integer in square feet),
    description of the house with at least 200 words,
    neighborhood,
    description of the neighborhood with at least 100 words
    """

    # json_mode needs the expected JSON schema spelled out in the prompt itself
    structured_llm = llm.with_structured_output(RealEstateList, method="json_mode")
    listings = structured_llm.invoke(question + "\n\n" + list_parser.get_format_instructions())
    return listings

def save_house_list_to(file_name: str, houses: List[RealEstate]):
    # Convert each Pydantic object to a dict so every field becomes a CSV column
    df = pd.DataFrame([house.model_dump() for house in houses])
    df.to_csv(file_name)


# If the house list file is not there, generate it with the LLM
if not os.path.isfile(HOUSE_FILE_NAME_CSV):
    houses = generate_house_list()
    save_house_list_to(HOUSE_FILE_NAME_CSV, houses.houses)


print("House data source is ready!")


House data source is ready!
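
Because the listings come from an LLM, it may be worth checking the saved file before building the vector database. The sketch below is one possible sanity check using only the standard library. It assumes the CSV columns are named after the `RealEstate` fields from Step 1.2; `check_listings`, `REQUIRED_COLUMNS`, and `INTEGER_COLUMNS` are hypothetical names introduced here for illustration.

```python
import csv

# Assumption: column names match the RealEstate fields defined in Step 1.2
REQUIRED_COLUMNS = [
    "title", "bedroom", "bathroom", "garage", "price_usd",
    "size_sqft", "description", "neighborhood", "neighborhood_details",
]
INTEGER_COLUMNS = ["bedroom", "bathroom", "garage", "price_usd", "size_sqft"]


def check_listings(csv_lines, min_rows=10):
    """Return a list of problems found in the listings CSV (empty if none).

    csv_lines can be an open file or any iterable of text lines.
    """
    reader = csv.DictReader(csv_lines)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        # Without the expected columns the row checks below are meaningless
        return [f"missing columns: {missing}"]

    problems = []
    rows = list(reader)
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} listings, found {len(rows)}")
    for i, row in enumerate(rows):
        for col in INTEGER_COLUMNS:
            value = (row[col] or "").strip()
            if not value.isdigit():
                problems.append(f"row {i}: {col} is not a non-negative integer: {value!r}")
    return problems
```

In the notebook this could be called as `check_listings(open(HOUSE_FILE_NAME_CSV, newline=""))`; an empty list means the file passed every check.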



## Step 2: Semantic Search



In [36]:

# Step 2.1: Create a vector database from the house list

from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings

DATABASE_FILE = "listings_chroma_db"

# Create the embedding function up front so both branches below can use it
embeddings = OpenAIEmbeddings()

# Chroma persists to a directory, so check for a directory rather than a file
if not os.path.isdir(DATABASE_FILE):
    loader = CSVLoader(file_path=HOUSE_FILE_NAME_CSV)
    docs = loader.load()

    splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    split_docs = splitter.split_documents(docs)

    db = Chroma.from_documents(split_docs, embeddings, persist_directory=DATABASE_FILE)
else:
    db = Chroma(persist_directory=DATABASE_FILE, embedding_function=embeddings)

print(db)


<langchain_community.vectorstores.chroma.Chroma object at 0x1254204c0>
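
To make the idea behind semantic search concrete, here is a minimal, standard-library sketch of ranking documents by vector similarity. It is an illustration only: the three-dimensional vectors are hypothetical stand-ins for real embeddings, cosine similarity is just one common choice of measure (the metric Chroma actually applies depends on its configuration), and `cosine_similarity` and `top_k` are names invented for this example. In the notebook itself, the lookup would normally go through the vector store, for example `db.similarity_search(query, k=...)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=2):
    """Return the k (name, score) pairs most similar to the query, best first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings"; real embeddings have far more dimensions
toy_listings = {
    "downtown loft": [0.9, 0.1, 0.0],
    "suburban family home": [0.1, 0.9, 0.2],
    "lakeside cabin": [0.0, 0.3, 0.9],
}
query = [0.2, 0.8, 0.1]  # stands in for an embedded buyer preference

results = top_k(query, toy_listings, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The query vector points in nearly the same direction as the "suburban family home" vector, so that listing ranks first.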



## Step 3: Augmented Response Generation
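
One way this step could be approached is to combine the buyer's preferences with the listings retrieved in Step 2 into a single prompt for the LLM. The sketch below shows only the prompt-assembly part, under stated assumptions: `build_augmented_prompt` is a hypothetical helper, the prompt wording is illustrative, and the sample inputs are made up rather than taken from the generated data.

```python
def build_augmented_prompt(buyer_preferences, retrieved_listings):
    """Combine buyer preferences and retrieved listing texts into one prompt."""
    listing_block = "\n\n".join(
        f"Listing {i}:\n{text}"
        for i, text in enumerate(retrieved_listings, start=1)
    )
    return (
        "You are a real estate agent. Using only the facts in the listings "
        "below, recommend the best match and explain why it fits the buyer.\n\n"
        f"Buyer preferences:\n{buyer_preferences}\n\n"
        f"Listings:\n{listing_block}"
    )


# Hypothetical inputs; in the notebook the listing texts would come from the
# documents returned by the vector database search
prompt = build_augmented_prompt(
    "Three bedrooms, a quiet neighborhood, close to schools.",
    [
        "title: Example Family Home\nbedroom: 3\nneighborhood: Example Park",
        "title: Example City Loft\nbedroom: 1\nneighborhood: Example Downtown",
    ],
)
print(prompt)
```

Telling the model to rely only on the supplied listings is a design choice meant to keep the personalized description from altering factual details such as price or room counts.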
