This is a starter notebook for the project, you'll have to import the libraries you'll need, you can find a list of the ones available in this workspace in the requirements.txt file in this workspace. 

# Synthetic Data Generation

### Criteria 1 - Generating Real Estate Listings with an LLM
The submission must demonstrate using a Large Language Model (LLM) to generate at least 10 diverse and realistic real estate listings containing facts about the real estate.

In [1]:
!pip install pandas
!pip install lancedb

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable


In [2]:
# import necessary libraries
import openai
import os
import pandas as pd
import numpy as np
import lancedb

# OpenAI key here.
openai.api_key = "my key"

In [3]:
# Define a function to generate real estate listings

def generate_real_estate_listings():
    prompt = """
    Generate 15 diverse and realistic real estate listings containing facts about the real estate, such as:
    - Location (city, suburb)
    - Type of the real estate (apartment, house, townhouse, etc.)
    - Sale price
    - Number of bedrooms
    - Number of bathrooms
    - Garage space (number of cars)
    - Proximity to public transport, schools, shops
    - Neighborhood vibes

    Each listing should be unique and realistic.
    """

    response = openai.ChatCompletion.create(
      model="gpt-3.5-turbo",
      messages=[
            {"role": "system", "content": "You are an expert in generating real estate listings."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=2500,
        n=1,
        stop=None,
        temperature=0.5,
    )

    listings = response.choices[0].message['content']
    return listings

listings = generate_real_estate_listings()
listings

'1. Location: Los Angeles, CA\n   Type: Luxury House\n   Sale Price: $2,500,000\n   Bedrooms: 5\n   Bathrooms: 4\n   Garage Space: 3 cars\n   Proximity: Close to public transport, top-rated schools, and upscale shops\n   Neighborhood Vibes: Exclusive gated community with stunning views of the city\n\n2. Location: Brooklyn, NY\n   Type: Brownstone Townhouse\n   Sale Price: $1,200,000\n   Bedrooms: 3\n   Bathrooms: 2\n   Garage Space: Street parking\n   Proximity: Walking distance to subway stations, schools, and trendy boutiques\n   Neighborhood Vibes: Historic charm with a vibrant arts and culinary scene\n\n3. Location: Austin, TX\n   Type: Modern Loft\n   Sale Price: $600,000\n   Bedrooms: 2\n   Bathrooms: 2\n   Garage Space: 1 car\n   Proximity: Easy access to public transport, local schools, and eclectic shops\n   Neighborhood Vibes: Hip and lively urban neighborhood with live music venues\n\n4. Location: Miami, FL\n   Type: Beachfront Condo\n   Sale Price: $800,000\n   Bedrooms: 2\

In [4]:
# Parse the listings ready for embedding

def parse_listings(listings):
    listings_list = listings.strip().split("\n\n")  
    parsed_listings = []
    
    for listing in listings_list:
        listing_dict = {}
        for line in listing.split("\n"):
            if ": " in line:
                key, value = line.split(": ", 1)
                # Remove numbering and leading hyphens from keys
                key = key.split(".")[-1].strip().lstrip("-").strip()
                listing_dict[key] = value.strip()
        parsed_listings.append(listing_dict)
    
    return parsed_listings

listings_data = parse_listings(listings)
listings_data

[{'Location': 'Los Angeles, CA',
  'Type': 'Luxury House',
  'Sale Price': '$2,500,000',
  'Bedrooms': '5',
  'Bathrooms': '4',
  'Garage Space': '3 cars',
  'Proximity': 'Close to public transport, top-rated schools, and upscale shops',
  'Neighborhood Vibes': 'Exclusive gated community with stunning views of the city'},
 {'Location': 'Brooklyn, NY',
  'Type': 'Brownstone Townhouse',
  'Sale Price': '$1,200,000',
  'Bedrooms': '3',
  'Bathrooms': '2',
  'Garage Space': 'Street parking',
  'Proximity': 'Walking distance to subway stations, schools, and trendy boutiques',
  'Neighborhood Vibes': 'Historic charm with a vibrant arts and culinary scene'},
 {'Location': 'Austin, TX',
  'Type': 'Modern Loft',
  'Sale Price': '$600,000',
  'Bedrooms': '2',
  'Bathrooms': '2',
  'Garage Space': '1 car',
  'Proximity': 'Easy access to public transport, local schools, and eclectic shops',
  'Neighborhood Vibes': 'Hip and lively urban neighborhood with live music venues'},
 {'Location': 'Miami, F

## Semantic Search

### Criteria 2- Creating a Vector Database and Storing Listings¶


The project must demonstrate the creation of a vector database and successfully storing real estate listing embeddings within it. The database should effectively store and organize the embeddings generated from the LLM-created listings.

In [5]:
# Create embeddings

def create_embeddings(texts):
    embeddings = []
    for text in texts:
        response = openai.Embedding.create(
            input=text,
            model="text-embedding-ada-002"
        )
        embeddings.append(response['data'][0]['embedding'])
    return embeddings

In [6]:
# Create the listings data in the specified format with vectors
listings_with_vectors = []
for listing in listings_data:
    listing_text = ' '.join([f"{key}: {value}" for key, value in listing.items()])
    vector = create_embeddings([listing_text])[0]
    listing['vector'] = vector
    listings_with_vectors.append(listing)

listings_with_vectors

[{'Location': 'Los Angeles, CA',
  'Type': 'Luxury House',
  'Sale Price': '$2,500,000',
  'Bedrooms': '5',
  'Bathrooms': '4',
  'Garage Space': '3 cars',
  'Proximity': 'Close to public transport, top-rated schools, and upscale shops',
  'Neighborhood Vibes': 'Exclusive gated community with stunning views of the city',
  'vector': [-0.007277491502463818,
   0.007855069823563099,
   -0.003648372134193778,
   -0.025901196524500847,
   -0.00019082157814409584,
   -0.007245403714478016,
   -0.010499097406864166,
   -0.0038344808854162693,
   -0.022307373583316803,
   -0.0007019185577519238,
   -0.006757670547813177,
   0.011821110732853413,
   -0.003077210858464241,
   -0.020433450117707253,
   0.005040978547185659,
   -0.013592352159321308,
   0.030650176107883453,
   -0.017661072313785553,
   0.009619894437491894,
   0.009780332446098328,
   -0.0333712138235569,
   -0.011256366968154907,
   -0.008310715667903423,
   -0.0018723176326602697,
   -0.007328832056373358,
   -0.01515823230147

### Criteria 3 - Semantic Search of Listings Based on Buyer Preferences
The application must include a functionality where listings are semantically searched based on given buyer preferences. The search should return listings that closely match the input preferences

In [7]:
# Initialize LanceDB
db = lancedb.connect("~/.lancedb")

In [8]:
table = db.create_table('my_listing',listings_with_vectors, mode="overwrite")
db["my_listing"].head()

[2024-06-10T08:03:11Z WARN  lance::dataset] No existing dataset at /home/student/.lancedb/my_listing.lance, it will be created


pyarrow.Table
Location: string
Type: string
Sale Price: string
Bedrooms: string
Bathrooms: string
Garage Space: string
Proximity: string
Neighborhood Vibes: string
vector: fixed_size_list<item: float>[1536]
  child 0, item: float
----
Location: [["Los Angeles, CA","Brooklyn, NY","Austin, TX","Miami, FL","Denver, CO"]]
Type: [["Luxury House","Brownstone Townhouse","Modern Loft","Beachfront Condo","Mountain Chalet"]]
Sale Price: [["$2,500,000","$1,200,000","$600,000","$800,000","$1,000,000"]]
Bedrooms: [["5","3","2","2","4"]]
Bathrooms: [["4","2","2","2.5","3"]]
Garage Space: [["3 cars","Street parking","1 car","Valet parking","2 cars"]]
Proximity: [["Close to public transport, top-rated schools, and upscale shops","Walking distance to subway stations, schools, and trendy boutiques","Easy access to public transport, local schools, and eclectic shops","Steps away from the beach, schools, and upscale dining options","Near public transportation, top-rated schools, and outdoor recreation"]]


In [9]:
table.to_pandas().head(1)

Unnamed: 0,Location,Type,Sale Price,Bedrooms,Bathrooms,Garage Space,Proximity,Neighborhood Vibes,vector
0,"Los Angeles, CA",Luxury House,"$2,500,000",5,4,3 cars,"Close to public transport, top-rated schools, ...",Exclusive gated community with stunning views ...,"[-0.0072774915, 0.00785507, -0.0036483721, -0...."


In [11]:
# Collect buyer preference
questions = [
    "What location do you prefer?",
    "What type of property are you looking for, i.e., apartment, house, villa, etc",
    "What is your budget so we can match the sale price?",
    "How many bedrooms are you looking for?",
    "How many bathrooms are you looking for?",
    "How many car space are you looking for?",
    "Which amenities would you like, i.e. public transport, parks, shops, schooles, etc", 
    "What kind of neighborhood is your prefernce?",
  ]

# Collect answers interactively
answers = []
print("Please answer the following questions to specify your preferences:")

for question in questions:
    answer = input(question + " ")
    answers.append(answer)

# Combine the answers into a single query variable
query = {
    "Location": answers[0],
    "Type": answers[1],
    "Sale price": answers[2],
    "Bedrooms": answers[3],
    "Bathrooms": answers[4],
    "Garage space": answers[5],
    "Proximity": answers[6],
    "Neighborhood vibes": answers[7]
}

# Display the collected preferences
print("\nYour preferences have been collected as follows:")
print(query)

Please answer the following questions to specify your preferences:
What location do you prefer? Los Angeles
What type of property are you looking for, i.e., apartment, house, villa, etc house
What is your budget so we can match the sale price? under 3 million
How many bedrooms are you looking for? minimum 4
How many bathrooms are you looking for? minimum 3
How many car space are you looking for? 3 or more
Which amenities would you like, i.e. public transport, parks, shops, schooles, etc close to top-rating schools
What kind of neighborhood is your prefernce? gated community with good views

Your preferences have been collected as follows:
{'Location': 'Los Angeles', 'Type': 'house', 'Sale price': 'under 3 million', 'Bedrooms': 'minimum 4', 'Bathrooms': 'minimum 3', 'Garage space': '3 or more', 'Proximity': 'close to top-rating schools', 'Neighborhood vibes': 'gated community with good views'}


In [12]:
# parse buyer preference

def reformat_query(query):
    formatted_query = {
        'Location': f"{query['Location']}",
        'Type': f"{query['Type']}",
        'Sale Price': f"{query['Sale price']}",
        'Bedrooms': f"{query['Bedrooms']}",
        'Bathrooms': f"{query['Bathrooms']}",
        'Garage': f"{query['Garage space']}",
        'Proximity': f"{query['Proximity']}",
        'Neighborhood Vibes': query['Neighborhood vibes']
    }
    return formatted_query

formatted_query = reformat_query(query)
texts = [
    formatted_query['Location'],
    formatted_query['Type'],
    formatted_query['Sale Price'],
    formatted_query['Bedrooms'],
    formatted_query['Bathrooms'],
    formatted_query['Garage'],
    formatted_query['Proximity'],
    formatted_query['Neighborhood Vibes']
]

texts

['Los Angeles',
 'house',
 'under 3 million',
 'minimum 4',
 'minimum 3',
 '3 or more',
 'close to top-rating schools',
 'gated community with good views']

In [13]:
# embed buyer preference
emb_query = create_embeddings(texts)[0]

In [17]:
# Semantic serch based on buyer preference

result = table.search(emb_query).metric("cosine").limit(2).to_pandas()
result

Unnamed: 0,Location,Type,Sale Price,Bedrooms,Bathrooms,Garage Space,Proximity,Neighborhood Vibes,vector,_distance
0,"Los Angeles, CA",Luxury House,"$2,500,000",5,4,3 cars,"Close to public transport, top-rated schools, ...",Exclusive gated community with stunning views ...,"[-0.0072774915, 0.00785507, -0.0036483721, -0....",0.191506
1,"San Francisco, CA",Victorian House,"$3,000,000",6,4,1 car,"Walking distance to public transportation, top...",Iconic architecture in a trendy and upscale ur...,"[-0.0018483576, -0.0035691294, 0.0020299219, -...",0.2475


In [18]:
# drop vector and distance

result=result.iloc[:,:-2]
result

Unnamed: 0,Location,Type,Sale Price,Bedrooms,Bathrooms,Garage Space,Proximity,Neighborhood Vibes
0,"Los Angeles, CA",Luxury House,"$2,500,000",5,4,3 cars,"Close to public transport, top-rated schools, ...",Exclusive gated community with stunning views ...
1,"San Francisco, CA",Victorian House,"$3,000,000",6,4,1 car,"Walking distance to public transportation, top...",Iconic architecture in a trendy and upscale ur...


## Augmented Response Generation

### Criteria 4 - Logic for Searching and Augmenting Listing Descriptions

The project must demonstrate a logical flow where buyer preferences are used to search and then augment the description of real estate listings. The augmentation should personalize the listing without changing factual information.

### Criteria 5 - Use of LLM for Generating Personalized Descriptions

The submission must utilize an LLM to generate personalized descriptions for the real estate listings based on buyer preferences. The descriptions should be unique, appealing, and tailored to the preferences provided.

In [23]:
# Convert search result to sentences
def row_to_sentence(row):
    
    sentence = (f"Location: {row['Location']}, Type: {row['Type']}, Sale Price: {row['Sale Price']}, "
                f"Bedrooms: {row['Bedrooms']}, Bathrooms: {row['Bathrooms']}, Garage Space: {row['Garage Space']}, "
                f"Proximity: {row['Proximity']}, Neighborhood Vibes: {row['Neighborhood Vibes']}.")
    return sentence

In [24]:
# Rewrite sentence using LLM
def rewrite_sentence(sentence):
    response = openai.Completion.create(
        model='gpt-3.5-turbo-instruct',  
        prompt=f"Rewrite the following sentence as if you are an experienced property sales agent with excellent customer services skills, to make it more engaging and descriptive. Do not use Welcome to to start your sentence:\n\n{sentence}",
        max_tokens=400,
        temperature=0.2,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    rewritten_sentence = response.choices[0].text.strip()
    return rewritten_sentence

# Apply the functions to each row
result['Description Sentence'] = result.apply(row_to_sentence, axis=1)
result['Rewritten Sentence'] = result['Description Sentence'].apply(rewrite_sentence)

message_for_1st_on_list = "We are delighted to recommend this property to you. We are confident it aligns perfectly with your preferences and requirements:"
message_for_2nd_on_list = "Although this property may not fully match your criteria, we believe it has unique features that are worth considering:"
closing_message = "We understand that purchasing property can be a stressful journey. Rest assured, we are here to support you every step of the way. If you are satisfied with our recommendations, please let us know. Should you need further options, we are more than happy to find additional properties for you."

# Print the resulting DataFrame
print(message_for_1st_on_list)
print(result['Rewritten Sentence'].iloc[0])
print("\n")

print(message_for_2nd_on_list)
print(result['Rewritten Sentence'].iloc[1])
print("\n")

print(closing_message)

We are delighted to recommend this property to you. We are confident it aligns perfectly with your preferences and requirements:
"Welcome to the luxurious lifestyle of Los Angeles, where this stunning 5-bedroom, 4-bathroom luxury house awaits you in an exclusive gated community with breathtaking views of the city. With a sale price of $2,500,000 and a spacious 3-car garage, this property offers the perfect combination of elegance and convenience. Not to mention, its prime location close to public transport, top-rated schools, and upscale shops adds to the already desirable neighborhood vibes. Don't miss out on the opportunity to call this prestigious address your new home."


Although this property may not fully match your criteria, we believe it has unique features that are worth considering:
"Step into the heart of San Francisco's vibrant and sought-after neighborhood, where the iconic Victorian architecture meets the trendy and upscale urban lifestyle. This stunning 6-bedroom, 4-bat