# Real Estate Data Generation

# 1. Data Generation Prompt Build

Listing sites such as [House Match](https://search.housematch.com/) show which features a real estate record should have. For example:

```
4 Bedrooms 2 Bathrooms 2,018 Size sqft 9,583.2 Lot sqft
Single Family Detached • Built in 1959 • $180/SqFt • 1 day on site
Fabulous pool home on a large fenced lot, with four bedrooms and two bathrooms with ceramic tile floors throughout the living space and luxury vinyl in the bedrooms. Upgrades throughout include a recently updated hall bath, remodeled kitchen, and new floors; the pool was resurfaced in 2018; the roof and the electrical were updated in 2018; interior paint was updated in 2020; a new range, dishwasher, microwave, and pool pavers were installed in 2023; the exterior was repainted in 2024. Huge backyard, new vinyl fence (2023), huge pool, and plenty of entertaining areas. All bedrooms have ceiling fans. The Great Room leads to the kitchen and dining room with sliding glass doors that lead to the pool area. The seller is offering a home warranty. No HOA, No CDD, and low taxes make this a perfect home for any buyer. It is located near I-75, minutes to retail, shopping, The Grove, Krates at the Grove, Tampa Premium Outlet Mall, Medical, and access to Downtown Tampa, Orlando, and the sunny beaches.
```
Based on this sample and my own experience, I think a House Match rental record should have:

- Location
- House type (SFR, mansion, etc.; if a mansion, on which floor)
- Floor area and layout (LDK?)
- Rental price
- Building info
    - Building year
    - Structure (wood, concrete, steel, etc.)
- Living related
    - Nearest supermarket
    - Nearest subway station
- A detailed description (furniture included, air conditioner, etc.)
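The feature list above can be written down as a typed record. The sketch below is only an illustration: the class name `RentalListing` and all field names are hypothetical, not part of the notebook's pipeline.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RentalListing:
    """Hypothetical record mirroring the feature list above."""
    location: str                     # e.g. "Tokyo, Shibuya"
    house_type: str                   # e.g. "SFR" or "Mansion"
    square_m: float                   # floor area in square meters
    layout: str                       # e.g. "1LDK"
    rental_price: str                 # kept as text, e.g. "¥150,000/month"
    building_year: int
    structure: str                    # "wood", "concrete", "steel", ...
    nearest_supermarket: Optional[str] = None
    nearest_subway_station: Optional[str] = None
    description: str = ""             # free-text detail, filled in later


example = RentalListing(
    location="Tokyo, Shibuya",
    house_type="Mansion",
    square_m=50.0,
    layout="1LDK",
    rental_price="¥150,000/month",
    building_year=2005,
    structure="concrete",
)
record = asdict(example)  # plain dict, ready for a DataFrame row
```

Keeping the optional fields at the end lets a record be created before the living-related details and description are known.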

In [1]:
import os
import dotenv

# load the OpenAI credentials from a .env file if one can be found
if (env_file := dotenv.find_dotenv()):
    dotenv.load_dotenv(env_file)
else:
    print(".env file not found! Please set OPENAI_API_KEY first")

I split prompt generation into two steps. Fields such as location and house type are short strings that each represent a class, so they can be used directly as features. The detailed description is a longer free-text document that merges these fields and also adds information from other points of view.

So I first generate dict-like data that can be parsed into a structured schema, then generate a detailed description from those basic features, and finally merge everything together.

The prompt was already tested in the ChatGPT web application and refined through prompt engineering in GPTs. Because it may run many times, I want it to use as few tokens as possible.
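The two-step plan can be sketched independently of any LLM library. In the example below, `build_record`, `fake_features`, and `fake_description` are hypothetical names, and the two `fake_*` functions are deterministic stand-ins for the model calls.

```python
from typing import Callable, Dict


def build_record(
    location: str,
    gen_features: Callable[[str], Dict[str, str]],
    gen_description: Callable[[Dict[str, str]], str],
) -> Dict[str, str]:
    """Generate features, derive a description from them, then merge both."""
    features = gen_features(location)            # step 1: short structured fields
    description = gen_description(features)      # step 2: free text based on step 1
    return {**features, "description": description}  # step 3: merge


# Stand-ins for the two model calls (assumptions, for illustration only).
def fake_features(location: str) -> Dict[str, str]:
    return {"location": location, "layout": "1LDK", "price": "¥150,000/month"}


def fake_description(features: Dict[str, str]) -> str:
    return f"A {features['layout']} home in {features['location']} for {features['price']}."


record = build_record("Tokyo, Shibuya", fake_features, fake_description)
print(record["description"])
```

Passing the generators as arguments keeps the merge logic testable without spending tokens.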

## 1.1 generate basic house information pipeline

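Build the output schema first. The code below uses LangChain's `ResponseSchema` with `StructuredOutputParser` rather than a Pydantic model.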

In [35]:
# House Attribute Model
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

response_schemas = [
    ResponseSchema(name="name", description="the name of the house"),
    ResponseSchema(name="year", description="the year the house was built"),
    ResponseSchema(name="location", description="the location of the house, including city name and ward name"),
    ResponseSchema(name="layout", description="the layout of the house, e.g. 1LDK, 1DK, 2LDK"),
    ResponseSchema(name="price", description="the rental price of the house, formatted like ¥150,000/month."),
    ResponseSchema(name="description", description="""other auxiliary information about the house,
                   such as whether pets are allowed, whether smoking is allowed, whether free Wi-Fi is provided,
                   whether there is an elevator, the name of the nearest subway station with the time and method
                   to reach it, and the time to reach the nearest supermarket.
                   """),
]
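As far as I understand its format instructions, `StructuredOutputParser` asks the model to reply with a JSON snippet inside a Markdown code fence and then reads the listed keys from it. The following is a minimal standard-library sketch of that idea, not LangChain's actual implementation; `parse_fenced_json` and `EXPECTED_KEYS` are hypothetical names.

```python
import json
import re

FENCE = "`" * 3  # built this way to avoid writing a literal fence inside this example
EXPECTED_KEYS = ["name", "year", "location", "layout", "price", "description"]


def parse_fenced_json(text: str) -> dict:
    """Extract the first fenced JSON block from a reply and check the expected keys."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text  # fall back to the raw reply
    data = json.loads(payload)
    missing = [key for key in EXPECTED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data


reply = (
    "Here is the listing:\n"
    + FENCE + "json\n"
    + '{"name": "Cozy House", "year": "2015", "location": "Yokohama, Kanagawa", '
    + '"layout": "1LDK", "price": "¥150,000/month", "description": "Pet friendly."}\n'
    + FENCE
)
parsed = parse_fenced_json(reply)
```

A reply that is not valid JSON or lacks a key raises an exception, which is why the generation loop later needs to tolerate failed batches.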


In [36]:
from langchain.prompts import PromptTemplate
from langchain_openai import OpenAI
from langchain_community.chat_models import ChatOpenAI

model = OpenAI(max_tokens=2048, temperature=0.9)  # chat models seem more precise than completion-type models

parser = StructuredOutputParser.from_response_schemas(response_schemas)
parser_prompt = parser.get_format_instructions()

# Asking for N records in a single prompt is unreliable: ChatGPT may lose
# track of how many records it has to generate when N is large.
# N around 10 works fine, but with N = 20 it returns anywhere from
# 6 to 25 records at a time, so this prompt asks for one house per call.
prompt = PromptTemplate(
    template="""Generate a house located in {location} for rental.\n{format_instructions}\n""",
    input_variables=["location"],
    input_types={"location": str},
    partial_variables={"format_instructions": parser_prompt},
)

chain = prompt | model | parser

In [37]:
# check generation sample
# test
output = chain.batch([{"location":"tokyo"}, {"location":"yokohama"}])
output

[{'name': 'Modern Tokyo House in Shibuya',
  'year': '2018',
  'location': 'Shibuya Ward, Tokyo',
  'layout': '3LDK',
  'price': '¥250,000/month',
  'description': 'This modern house was built in 2018 and is located in the bustling Shibuya Ward of Tokyo. It features a spacious 3LDK layout, perfect for families or roommates. Pets and smoking are not allowed in the house. Free wifi is provided for tenants. The nearest subway station is a 5-minute walk away and the nearest supermarket can be reached in 10 minutes on foot.'},
 {'name': 'Cozy Yokohama House',
  'year': '2015',
  'location': 'Yokohama, Kanagawa',
  'layout': '1LDK',
  'price': '¥150,000/month',
  'description': 'This modern house is located in the heart of Yokohama, a bustling city known for its beautiful parks, shopping, and cultural attractions. The house is pet-friendly and smoking is allowed on the balcony. Free wifi is provided and there is an elevator for easy access. The nearest subway station is a 5-minute walk away 
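Because the parser returns every field as a plain string, a quick sanity check on each record can catch off-format values early. The sketch below is one possible check; the regular expressions, the accepted layout suffixes, and the year bounds are my assumptions, not rules from the notebook.

```python
import re

PRICE_RE = re.compile(r"^¥(\d{1,3}(?:,\d{3})*)/month$")   # e.g. ¥150,000/month
LAYOUT_RE = re.compile(r"^\d+(?:LDK|DK|K|R)$")            # e.g. 1LDK, 2DK, 1R


def validate(record: dict) -> list:
    """Return the names of fields that look wrong (empty list if none)."""
    problems = []
    if not PRICE_RE.match(str(record.get("price", ""))):
        problems.append("price")
    if not LAYOUT_RE.match(str(record.get("layout", ""))):
        problems.append("layout")
    year = str(record.get("year", ""))
    if not (year.isdigit() and 1900 <= int(year) <= 2100):  # assumed plausible range
        problems.append("year")
    return problems


good = {"year": "2018", "layout": "3LDK", "price": "¥250,000/month"}
bad = {"year": "n/a", "layout": "studio", "price": "$800"}
print(validate(good), validate(bad))
```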

In [41]:
# try up to 400 batches of 5 and stop once 200 records are collected
from time import sleep
from tqdm import tqdm
from numpy import random
cities = ['tokyo', 'yokohama', 'kawasaki', 'chiba', 'saitama', 'fujisawa']

target, max_tries = 200, 400
tried = 0
generated_data = []

with tqdm(total=target) as pbar:
    while len(generated_data) < target and tried < max_tries:
        tried += 1
        try:
            sampled_cities = random.choice(cities, 5, replace=True)
            batch = [{"location": c} for c in sampled_cities]
            output = chain.batch(batch)
            generated_data += output
            pbar.update(len(output))
            sleep(0.1)
        except Exception as e:
            # a failed batch (for example unparseable output) is reported and retried
            tqdm.write(f"batch {tried} failed: {e}")


100%|██████████| 200/200 [04:06<00:00,  1.23s/it]
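The collection loop can be reduced to a small reusable helper that stops either when enough records exist or when a retry budget is spent. This is a sketch with hypothetical names (`collect`, `flaky_batch`); the flaky generator only simulates parsing failures.

```python
import logging
from typing import Callable, List


def collect(generate_batch: Callable[[], List[dict]], target: int, max_attempts: int) -> List[dict]:
    """Call generate_batch until `target` records exist or attempts run out."""
    results: List[dict] = []
    attempts = 0
    while len(results) < target and attempts < max_attempts:
        attempts += 1
        try:
            results.extend(generate_batch())
        except Exception as exc:  # log and retry instead of failing silently
            logging.warning("attempt %d failed: %s", attempts, exc)
    return results[:target]


calls = {"n": 0}


def flaky_batch() -> List[dict]:
    """Stand-in for chain.batch: every third call fails."""
    calls["n"] += 1
    if calls["n"] % 3 == 0:
        raise ValueError("unparseable model output")
    return [{"location": "tokyo"} for _ in range(5)]


data = collect(flaky_batch, target=20, max_attempts=10)
```

The attempt cap matters because a loop that swallows every exception would otherwise run forever if the API key or the prompt is broken.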


In [11]:
import pandas as pd

# each generated record is a flat dict, so the list converts directly into one row per record
df = pd.DataFrame(generated_data)
df.to_csv("../data/chatgpt_house_match_data.csv", index=False)

## 1.2 generated data cleaning and EDA

In [12]:
import pandas as pd

df = pd.read_csv("../data/chatgpt_house_match_data.csv")
df.head()

| | price | location_city | location_ward | type_type | type_square | type_layout | type_layout_details | built_info_year | built_info_structure | live_info_nearest_supermarket | live_info_nearest_subway | live_info_subwaystation_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | $800 | Tokyo City | Shinjuku Ward | Apartment | 40 square meters | 1LDK | Living room: 20 square meters | 2010 | Concrete | 10 minutes by walk | 5 minutes by bus | Shinjuku Station |
| 1 | $900 | Tokyo City | Minato Ward | Apartment | 45 | 1LDK | Living room: 20 sqm | 2010 | Concrete | 10 minutes by walk | 5 minutes by bus | Shinagawa Station |
| 2 | $800 | Tokyo City | Shinjuku | Apartment | 50 sqm | 1LDK | Living room: 20 sqm | 2000 | Concrete | 10 minutes by walk | 5 minutes by bus | Shinjuku Station |
| 3 | $800 | Tokyo City | Minato Ward | Apartment | 50 sqm | 1LDK | Living room: 20 sqm | 2005 | Concrete | 10 minutes by walk | 5 minutes by bus | Shinagawa Station |
| 4 | $1000 | Tokyo City | Shibuya Ward | Apartment | 60 sqm | 1LDK | 20 sqm living room | 2010 | Concrete | 10 minutes by bus | 5 minutes by walk | Shibuya Station |
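The preview shows the same quantity written several ways (`40 square meters`, `45`, `50 sqm`) and prices stored as text. One way to normalize them is to extract the number with a regular expression. The helpers below are a sketch with hypothetical names; they assume a single currency and square meters throughout.

```python
import re
from typing import Optional


def parse_area_sqm(text) -> Optional[float]:
    """Take the first number from values like '40 square meters', '45', or '50 sqm'."""
    match = re.search(r"\d+(?:\.\d+)?", str(text))
    return float(match.group()) if match else None


def parse_price(text) -> Optional[int]:
    """Drop currency symbols and thousands separators from '$800' or '¥150,000/month'."""
    match = re.search(r"\d[\d,]*", str(text))
    return int(match.group().replace(",", "")) if match else None


areas = [parse_area_sqm(v) for v in ["40 square meters", "45", "50 sqm", float("nan")]]
prices = [parse_price(v) for v in ["$800", "$1000", "¥150,000/month"]]
```

With pandas these can be applied column-wise, for example `df["type_square"].map(parse_area_sqm)`.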


In [19]:
df.loc[df.location_city.isna()]

| | price | location_city | location_ward | type_type | type_square | type_layout | type_layout_details | built_info_year | built_info_structure | live_info_nearest_supermarket | live_info_nearest_subway | live_info_subwaystation_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157 | | | | | | | | 0 | | | | |


In [13]:
df.location_city.unique()

array(['Tokyo City', nan], dtype=object)

In [15]:
df.type_layout.unique()

array(['1LDK', '2LDK', '1DK', nan], dtype=object)

In [14]:
df.location_ward.unique()

array(['Shinjuku Ward', 'Minato Ward', 'Shinjuku', 'Shibuya Ward',
       'Shibuya', 'Minato', 'Chiyoda', 'Chiyoda Ward', nan], dtype=object)
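The ward values above mix two spellings of the same place, such as `Shinjuku Ward` and `Shinjuku`. A small normalizer can collapse them. This is a sketch: `normalize_ward` is a hypothetical helper, and handling the `-ku` suffix is an assumption, since that spelling does not appear in the output above.

```python
def normalize_ward(name):
    """Strip a trailing ' Ward' or '-ku' so both spellings map to one value."""
    if not isinstance(name, str):  # leave NaN and other non-strings untouched
        return name
    cleaned = name.strip()
    for suffix in (" ward", "-ku"):
        if cleaned.lower().endswith(suffix):
            cleaned = cleaned[: -len(suffix)]
            break
    return cleaned.strip()


wards = ["Shinjuku Ward", "Minato Ward", "Shinjuku", "Shibuya Ward",
         "Shibuya", "Minato", "Chiyoda", "Chiyoda Ward"]
unique_wards = sorted({normalize_ward(w) for w in wards})
```

Applied to the DataFrame, this would be `df["location_ward"].map(normalize_ward)`.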

It seems ChatGPT 3.5 sometimes generates faulty data. For example, Yokohama and Kawasaki are both cities, so their relation is not city <-> ward. Mapping such entries to Kanagawa, the prefecture both cities belong to, should make the data more accurate.

In [95]:
# map Yokohama -> Kanagawa
fix_ = {
    "Yokohama, Minato": "Kanagawa, Yokohama Minato",
    "Yokohama, Nishi": "Kanagawa, Yokohama Nishi",
    "Yokohama, Kanagawa": "Kanagawa, Yokohama City",
    "Yokohama, Kawasaki": "Kanagawa, Kawasaki",
}
# apply the fixes to the combined "city, ward" strings in `location`;
# a separate `city` column does not exist at this point
df["location"] = df["location"].replace(fix_)
df.head()

In [87]:
# split "city, ward" into two columns, trimming the space after the comma
parts = df["location"].str.split(",", n=1, expand=True)
df["city"] = parts[0].str.strip()
df["ward"] = parts[1].str.strip()
df.head()

| | location | price | type_type | type_square | type_layout | type_layout_details | built_info_year | built_info_structure | live_info_nearest_supermarket | live_info_nearest_subway | city | ward |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tokyo, Shibuya | $800 | apartment | 50 sqm | 1LDK | 20 sqm living room | 2005 | concrete | 10 minutes by walk | 5 minutes by bus | Tokyo | Shibuya |
| 1 | Yokohama, Minato | $1000 | condominium | 80 sqm | 2LDK | 30 sqm living room | 1998 | steel | 15 minutes by walk | 10 minutes by bus | Yokohama | Minato |
| 2 | Chiba, Urayasu | $1200 | house | 120 sqm | 3LDK | 40 sqm living room | 2010 | wood | 5 minutes by walk | 20 minutes by bus | Chiba | Urayasu |
| 3 | Tokyo, Shinjuku | $900 | apartment | 60 sqm | 1DK | 25 sqm living room | 1990 | concrete | 20 minutes by walk | 15 minutes by bus | Tokyo | Shinjuku |
| 4 | Yokohama, Nishi | $1100 | condominium | 90 sqm | 2LDK | 35 sqm living room | 2002 | steel | 10 minutes by walk | 10 minutes by bus | Yokohama | Nishi |
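An alternative to rewriting individual strings is to keep the city as generated and add the prefecture as a third level. The sketch below assumes a hand-written lookup for the six cities sampled earlier; `PREFECTURE_OF_CITY` and `split_location` are hypothetical names, and the second part of the string is called `area` because the model does not always return a real ward there.

```python
PREFECTURE_OF_CITY = {
    "tokyo": "Tokyo",
    "yokohama": "Kanagawa",
    "kawasaki": "Kanagawa",
    "fujisawa": "Kanagawa",
    "chiba": "Chiba",
    "saitama": "Saitama",
}


def split_location(location: str) -> dict:
    """Split 'City, Area' and look up the prefecture of the city (empty if unknown)."""
    parts = [part.strip() for part in location.split(",", 1)]
    city = parts[0]
    area = parts[1] if len(parts) > 1 else ""
    prefecture = PREFECTURE_OF_CITY.get(city.lower(), "")
    return {"prefecture": prefecture, "city": city, "area": area}


rows = [split_location(loc) for loc in ["Tokyo, Shibuya", "Yokohama, Nishi", "Chiba, Urayasu"]]
```

This keeps the original city name available for later filtering while still grouping Yokohama, Kawasaki, and Fujisawa under Kanagawa.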


## 1.3 generate description pipeline