# Real Estate Data Generation

# 1. Data Generation Prompt Build

I looked at a house-listing website, e.g. [House Match](https://search.housematch.com/), to collect the features that a real estate record should have.

```
4 Bedrooms 2 Bathrooms 2,018 Size sqft 9,583.2 Lot sqft
Single Family Detached • Built in 1959 • $180/SqFt • 1 day on site
Fabulous pool home on a large fenced lot, with four bedrooms and two bathrooms with ceramic tile floors throughout the living space and luxury vinyl in the bedrooms. Upgrades throughout include a recently updated hall bath, remodeled kitchen, and new floors; the pool was resurfaced in 2018; the roof and the electrical were updated in 2018; interior paint was updated in 2020; a new range, dishwasher, microwave, and pool pavers were installed in 2023; the exterior was repainted in 2024. Huge backyard, new vinyl fence (2023), huge pool, and plenty of entertaining areas. All bedrooms have ceiling fans. The Great Room leads to the kitchen and dining room with sliding glass doors that lead to the pool area. The seller is offering a home warranty. No HOA, No CDD, and low taxes make this a perfect home for any buyer. It is located near I-75, minutes to retail, shopping, The Grove, Krates at the Grove, Tampa Premium Outlet Mall, Medical, and access to Downtown Tampa, Orlando, and the sunny beaches.
```
Based on this sample and my own experience, I think a House Match rental listing should have:

- Location
- House type (SFR, mansion, etc.; if a mansion, which floor)
- Area & layout (LDK?)
- Rental price
- Building info
    - building year
    - structure (wood, concrete, steel, etc.)
- Living-related
    - Nearest supermarket
    - Nearest subway station
- Some detailed description (furnished, air conditioner, etc.)

In [1]:
import os
import dotenv

# Load environment variables (e.g. the OpenAI API key) from a .env file, if one exists.
if (env_file := dotenv.find_dotenv()) != "":
    dotenv.load_dotenv(env_file)
else:
    print(".env file not found! Please set OPENAI_API_KEY first")

I divide prompt generation into two steps. Location, house type, etc. are short strings that each represent a class, so they can be used directly as features. The detailed description is a more complicated document that merges all of this information and also adds new information from other points of view.

So I first generate dict-like data that can be parsed by the Pydantic output parser, then generate a detailed description based on those basic features, and finally merge everything together.

The prompt itself was already tested in the ChatGPT web application and refined through prompt engineering in GPTs. Because the prompt may run many times, I do not want it to use too many tokens.
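
The two-step flow described above can be sketched without any model call. This is only an illustration of the data flow: `generate_features`, `generate_description`, and `build_record` are hypothetical names, and both steps are stubbed with fixed values where the notebook would call the LLM.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for the two LLM calls; neither name comes from the notebook.
def generate_features() -> Dict[str, str]:
    """Step 1: return short, class-like attributes (a fixed stub here)."""
    return {"city": "Tokyo City", "ward": "Shinjuku", "layout": "1LDK", "price": "$800"}

def generate_description(features: Dict[str, str]) -> str:
    """Step 2: write free text from the features (a template stands in for the model)."""
    return (f"A {features['layout']} rental in {features['ward']}, "
            f"{features['city']}, for {features['price']} per month.")

def build_record(
    step1: Callable[[], Dict[str, str]] = generate_features,
    step2: Callable[[Dict[str, str]], str] = generate_description,
) -> Dict[str, str]:
    """Run both steps and merge the description into the feature dict."""
    features = step1()
    return {**features, "description": step2(features)}

record = build_record()
print(record["description"])
```

Keeping the two steps as separate callables means either one can be swapped for a real chain later without touching the merge logic.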

## 1.1 Generate basic house information pipeline

Build the Pydantic model first.

In [5]:
# House Attribute Model
from typing import List
from langchain_core.pydantic_v1 import BaseModel, Field

class HouseType(BaseModel):
    type: str = Field(description="type of this house, e.g. apartment, condominium, etc.")
    square: str = Field(description="the floor area of this house.")
    layout: str = Field(description="layout of this house, e.g. 1LDK, 2LDK, 1DK, etc.")
    layout_details: str = Field(description="""details about the layout of this house,
                      e.g. the floor area of the living room.""")

class BuildInfo(BaseModel):
    year: int = Field(description="""the year the house was built,
                      should be later than 1980 and earlier than 2024.""")
    structure: str = Field(description="""the structure of this house,
                           one of the following: wood, steel, concrete.""")

class LivingInfo(BaseModel):
    nearest_supermarket: str = Field(description="""how long and by which method
                                     it takes to get to the nearest supermarket.
                                     The method is either by bus or by walk.""")
    nearest_subway: str = Field(description="""how long and by which method
                                     it takes to get to the nearest subway station.
                                     The method is either by bus or by walk.""")
    subwaystation_name: str = Field(description="""the name of the subway station,
                                    should be in the ward the house belongs to.""")

class Location(BaseModel):
    city: str
    ward: str

class House(BaseModel):
    location: Location = Field(description="""
                          the location of this house, with two attributes,
                          city and ward. city is one of: Tokyo City, Yokohama City, Chiba City;
                          ward is a ward in that city.
                          """)
    type: HouseType
    built_info: BuildInfo
    live_info: LivingInfo
    price: str = Field(description="""the monthly rent for this house, in dollars;
                       roughly $800 for a 1LDK in Tokyo,
                       adjusted based on the location, type, built info, and live info.""")
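
The `description` strings above are instructions to the model, not validation rules, so nothing stops a generated record from violating them. A minimal sketch of a post-generation check, written in plain Python over the dict form of a record, might look like this. `check_house`, `ALLOWED_STRUCTURES`, and `ALLOWED_CITIES` are hypothetical names introduced for illustration; the limits are copied from the field descriptions.

```python
from typing import Any, Dict, List

# Allowed values copied from the Field descriptions; the names are hypothetical.
ALLOWED_STRUCTURES = {"wood", "steel", "concrete"}
ALLOWED_CITIES = {"Tokyo City", "Yokohama City", "Chiba City"}

def check_house(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    year = record["built_info"]["year"]
    if not 1980 < year < 2024:
        problems.append(f"year out of range: {year}")
    structure = record["built_info"]["structure"].strip().lower()
    if structure not in ALLOWED_STRUCTURES:
        problems.append(f"unexpected structure: {structure}")
    city = record["location"]["city"]
    if city not in ALLOWED_CITIES:
        problems.append(f"unexpected city: {city}")
    return problems

good = {"location": {"city": "Tokyo City", "ward": "Shinjuku"},
        "built_info": {"year": 2005, "structure": "concrete"}}
bad = {"location": {"city": "Tokyo City", "ward": "Shinjuku Ward"},
       "built_info": {"year": 1995, "structure": "reinforced concrete"}}
print(check_house(good))
print(check_house(bad))
```

Records that fail the check could be dropped or regenerated, depending on how strict the dataset needs to be.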


In [6]:
from langchain.prompts import PromptTemplate
from langchain_community.chat_models import ChatOpenAI
from langchain.output_parsers import PydanticOutputParser

model = ChatOpenAI(temperature=0.9)  # the chat model seems more precise than the completion type

parser = PydanticOutputParser(pydantic_object=House)
parser_prompt = parser.get_format_instructions()

# Asking for N records in one prompt does not yield exactly num_of_data records,
# because ChatGPT may lose track of how many it needs to generate when N is big.
# N around 10 works fine, but when N reaches 20, it generates
# 6 ~ 25 records at a time.
# Note: the template below does not reference num_of_data, so each call yields one House.
prompt = PromptTemplate(
    template="""Generate house information for rental.\n{format_instructions}\n""",
    input_variables=["num_of_data"],
    input_types={"num_of_data": int},
    partial_variables={"format_instructions": parser_prompt},
)

chain = prompt | model | parser

In [7]:
# check generation sample
import json
# test
output = chain.invoke({"num_of_data":1})
print(json.dumps(json.loads(output.json()), indent=2))

{
  "location": {
    "city": "Tokyo City",
    "ward": "Shinjuku"
  },
  "type": {
    "type": "apartment",
    "square": "60",
    "layout": "1LDK",
    "layout_details": "20 square meters living room"
  },
  "built_info": {
    "year": 2005,
    "structure": "concrete"
  },
  "live_info": {
    "nearest_supermarket": "10 minutes by walk",
    "nearest_subway": "5 minutes by walk",
    "subwaystation_name": "Shinjuku Station"
  },
  "price": "$1000"
}


In [9]:
# try up to 400 times, collect as much data as we can
from time import sleep
from tqdm import tqdm

generated_data = []
MAX_GENERATION_LIMIT = 400
TARGET_COUNT = 200

tried = 0

with tqdm(total=TARGET_COUNT) as pbar:
    # use "<" so that at most MAX_GENERATION_LIMIT attempts are made
    while tried < MAX_GENERATION_LIMIT and len(generated_data) < TARGET_COUNT:
        tried += 1
        try:
            output = chain.invoke({"num_of_data": 1})
        except Exception:
            # generation or parsing failed; skip this attempt and retry
            continue
        generated_data.append(output)
        pbar.update(1)
        sleep(0.1)


 70%|██████▉   | 139/200 [14:09<06:08,  6.03s/it]

In [68]:
import numpy as np
flatten_data: List[House] = np.array(generated_data).ravel().tolist()
len(flatten_data)

200

In [73]:
import json
import pandas as pd
rows = []
for house_obj in flatten_data:
    raw_attr = json.loads(house_obj.json())
    house_attr = raw_attr.copy()
    # flatten one level of nesting: {"type": {"layout": ...}} -> {"type_layout": ...}
    for key, val in raw_attr.items():
        if isinstance(val, dict):
            house_attr.pop(key)
            for sub_key, sub_val in val.items():
                house_attr[key + '_' + sub_key] = sub_val
    rows.append(house_attr)

# build the frame once instead of concatenating row by row
df = pd.DataFrame(rows)
df.to_csv("../data/chatgpt_house_match_data.csv", index=False)
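
The flattening step can also be written as a small standalone helper, which makes it easy to test on a single record. This is a sketch rather than the notebook's code: `flatten` is a hypothetical name, it recurses to any depth, and it keeps keys in their original order, so the column order may differ from the cell above.

```python
from typing import Any, Dict

def flatten(record: Dict[str, Any], sep: str = "_", prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one level, joining parent and child keys with `sep`."""
    flat: Dict[str, Any] = {}
    for key, val in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(val, dict):
            # recurse into nested dicts, carrying the joined key as the new prefix
            flat.update(flatten(val, sep=sep, prefix=name))
        else:
            flat[name] = val
    return flat

sample = {
    "location": {"city": "Tokyo City", "ward": "Shinjuku"},
    "type": {"type": "apartment", "square": "60", "layout": "1LDK"},
    "built_info": {"year": 2005, "structure": "concrete"},
    "price": "$1000",
}
print(flatten(sample))
```

pandas also provides `pd.json_normalize(records, sep="_")`, which performs a similar flattening over a list of dicts.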

## 1.2 Generated data cleaning and EDA

In [74]:
import pandas as pd

df = pd.read_csv("../data/chatgpt_house_match_data.csv")
df.head()

| Unnamed: 0 | price | location_city | location_ward | type_type | type_square | type_layout | type_layout_details | built_info_year | built_info_structure | live_info_nearest_supermarket | live_info_nearest_subway | live_info_subwaystation_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | $800 | Tokyo City | Shinjuku | Apartment | 50 sqm | 1LDK | 10 sqm living room | 2005 | Steel | 10 minutes by walk | 5 minutes by bus | Shinjuku Station |
| 1 | $800 | Tokyo City | Shinjuku | apartment | 50sqm | 1LDK | Living room: 25sqm, Bedroom: 12sqm, Kitchen: 6... | 2005 | steel | 10 minutes by walk | 5 minutes by walk | Shinjuku Station |
| 2 | $850 | Tokyo City | Shinjuku Ward | Apartment | 40m^2 | 1LDK | 20m^2 living room | 1995 | reinforced concrete | 10 minutes by bus | 15 minutes by walk | Yoyogi Station |
| 3 | $800 | Tokyo City | Shinjuku Ward | Apartment | 600 sq. ft. | 1LDK | 300 sq. ft. living room | 2010 | steel | 10 minutes by walk | 5 minutes by bus | Shinjuku Station |
| 4 | $800 | Tokyo City | Shibuya Ward | apartment | 40 square meters | 1LDK | 20 square meters living room | 2010 | concrete | 10 minutes by bus | 5 minutes by walk | Shibuya Station |


It seems ChatGPT 3.5 may generate faulty data. For example, Yokohama is a city and Kawasaki is also a city, so their relation is not city <-> ward. Mapping such rows to Kanagawa, the prefecture that contains both, may make the data more accurate.

In [95]:
# map yokohama -> kanagawa
fix_ = {
    "Yokohama, Minato": "Kanagawa, Yokohama Minato",
    "Yokohama, Nishi": "Kanagawa, Yokohama Nishi",
    "Yokohama, Kanagawa": "Kanagawa, Yokohama City",
    "Yokohama, Kawasaki": "Kanagawa, Yokosuka",
}
df.city = df.city.replace({"Yokohama":'Kanagawa'})
df.head()

AttributeError: 'DataFrame' object has no attribute 'city'

In [87]:
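# split "City, Ward" into two columns, trimming the space left after the comma
data = np.vstack(df.location.apply(lambda x: [part.strip() for part in x.split(",")]).values)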
data = pd.DataFrame(data, columns=["city", "ward"])
df[["city", "ward"]] = data
df.head()

| Unnamed: 0 | location | price | type_type | type_square | type_layout | type_layout_details | built_info_year | built_info_structure | live_info_nearest_supermarket | live_info_nearest_subway | city | ward |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Tokyo, Shibuya | $800 | apartment | 50 sqm | 1LDK | 20 sqm living room | 2005 | concrete | 10 minutes by walk | 5 minutes by bus | Tokyo | Shibuya |
| 1 | Yokohama, Minato | $1000 | condominium | 80 sqm | 2LDK | 30 sqm living room | 1998 | steel | 15 minutes by walk | 10 minutes by bus | Yokohama | Minato |
| 2 | Chiba, Urayasu | $1200 | house | 120 sqm | 3LDK | 40 sqm living room | 2010 | wood | 5 minutes by walk | 20 minutes by bus | Chiba | Urayasu |
| 3 | Tokyo, Shinjuku | $900 | apartment | 60 sqm | 1DK | 25 sqm living room | 1990 | concrete | 20 minutes by walk | 15 minutes by bus | Tokyo | Shinjuku |
| 4 | Yokohama, Nishi | $1100 | condominium | 90 sqm | 2LDK | 35 sqm living room | 2002 | steel | 10 minutes by walk | 10 minutes by bus | Yokohama | Nishi |
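
The remapping and the city/ward split can be combined into one per-row function. The sketch below is one possible way to do it, not the notebook's method: `split_location` and `FIX` are hypothetical names, and the mapping entries are copied unchanged from the `fix_` dict in the cell above.

```python
from typing import Dict, Tuple

# Entries copied from the notebook's fix_ dict; the name FIX is hypothetical.
FIX = {
    "Yokohama, Minato": "Kanagawa, Yokohama Minato",
    "Yokohama, Nishi": "Kanagawa, Yokohama Nishi",
    "Yokohama, Kanagawa": "Kanagawa, Yokohama City",
    "Yokohama, Kawasaki": "Kanagawa, Yokosuka",
}

def split_location(location: str, fix: Dict[str, str] = FIX) -> Tuple[str, str]:
    """Apply the remapping if one exists, then split 'City, Ward' at the first comma."""
    normalized = " ".join(location.split())  # collapse stray whitespace
    normalized = fix.get(normalized, normalized)
    city, _, ward = normalized.partition(",")
    return city.strip(), ward.strip()

print(split_location("Tokyo, Shibuya"))
print(split_location("Yokohama, Minato"))
```

Applied row by row, for instance with `df.location.apply(split_location)`, this would produce the `city` and `ward` values in a single pass.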


## 1.3 Generate description pipeline