# Airbnb Toronto Data Pre-processing & ETL

This notebook implements the end-to-end data pipeline to create the final modeling dataset from raw InsideAirbnb snapshots.

**Objective:** Load all monthly listings snapshots and the full reviews history, clean features, engineer the `estimated_occupancy_rate` sample weight, and produce a single, model-ready `listing-month` panel.

### 0. Setup & Data Loading

In [126]:
import pandas as pd
import numpy as np
import os
import glob

# --- Configuration ---
# Parent directory containing the 'listings-YY-MM.csv' files and '{CITY}-reviews-detailed...csv'
CITY = "toronto"
INPUT_DATA_DIR = os.path.expanduser(f"~/Downloads/insideairbnb/{CITY}") 
OUTPUT_DATA_DIR = os.path.expanduser(f"../data/{CITY}")
OUTPUT_FILENAME = f"{CITY}_dataset_oct_17.parquet"

# Configure pandas display
pd.options.display.max_columns = 100

# --- Load All Monthly Listings Snapshots ---
listings_files = sorted(glob.glob(os.path.join(INPUT_DATA_DIR, 'listings-*.csv')))
if not listings_files:
    raise FileNotFoundError(f"No 'listings-*.csv' files found in {INPUT_DATA_DIR}")

print(f"Found {len(listings_files)} monthly listings files. Loading and concatenating...")

dfs = []
for file in listings_files:
    # low_memory=False reads each file in a single pass, so column dtypes are
    # inferred from the whole column (avoids mixed-dtype warnings in raw CSVs)
    df = pd.read_csv(file, low_memory=False)
    dfs.append(df)

raw_listings_df = pd.concat(dfs, ignore_index=True)
print(f"Successfully loaded {len(raw_listings_df):,} total listing records.")

# --- Load Full Reviews History ---
reviews_path = os.path.join(INPUT_DATA_DIR, f'{CITY}-reviews-detailed-insideairbnb.csv')
print(f"Loading reviews from: {os.path.basename(reviews_path)}...")
try:
    raw_reviews_df = pd.read_csv(reviews_path)
    print(f"Successfully loaded {len(raw_reviews_df):,} reviews.")
except FileNotFoundError:
    raise FileNotFoundError(f"Could not find reviews file at: {reviews_path}")

# Display samples
print("\nListings Sample:")
display(raw_listings_df.head(2))
print("\nReviews Sample:")
display(raw_reviews_df.head(2))

# Display column info
print("\nListings DataFrame Info:")
print(raw_listings_df.info())
print("\nReviews DataFrame Info:")
print(raw_reviews_df.info())

Found 12 monthly listings files. Loading and concatenating...
Successfully loaded 259,178 total listing records.
Loading reviews from: toronto-reviews-detailed-insideairbnb.csv...
Successfully loaded 634,104 reviews.

Listings Sample:


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d
0,1419,https://www.airbnb.com/rooms/1419,20241005023627,2024-10-05,previous scrape,Beautiful home in amazing area!,"This large, family home is located in one of T...",The apartment is located in the Ossington stri...,https://a0.muscache.com/pictures/76206750/d643...,1565,https://www.airbnb.com/users/show/1565,Alexandra,2008-08-08,"Vancouver, Canada","I live in Vancouver, Canada with my husband an...",,,,f,https://a0.muscache.com/im/pictures/user/7aeea...,https://a0.muscache.com/im/pictures/user/7aeea...,Commercial Drive,1.0,1.0,"['email', 'phone']",t,t,Neighborhood highlights,Little Portugal,,43.6459,-79.42423,Entire home,Entire home/apt,10,,3 baths,5.0,,"[""Smoke alarm"", ""Wifi"", ""Kitchen"", ""Essentials...",,28,730,28.0,28.0,730.0,730.0,28.0,730.0,,t,0,0,0,0,2024-10-05,6,0,0,2015-07-19,2017-08-07,5.0,5.0,5.0,5.0,5.0,5.0,5.0,,f,1,1,0,0,0.05,,,,
1,8077,https://www.airbnb.com/rooms/8077,20241005023627,2024-10-05,previous scrape,Downtown Harbourfront Private Room,Guest room in a luxury condo with access to al...,,https://a0.muscache.com/pictures/11780344/141c...,22795,https://www.airbnb.com/users/show/22795,Kathie & Larry,2009-06-22,"Toronto, Canada",My husband and I have been airbnb host for alm...,,,,f,https://a0.muscache.com/im/pictures/user/9a077...,https://a0.muscache.com/im/pictures/user/9a077...,Harbourfront,2.0,3.0,"['email', 'phone']",t,f,,Waterfront Communities-The Island,,43.6408,-79.37673,Private room in rental unit,Private room,2,,1.5 baths,,,"[""Free parking on premises"", ""Smoke alarm"", ""W...",,180,365,180.0,180.0,365.0,365.0,180.0,365.0,,,0,0,0,0,2024-10-05,169,0,0,2009-08-20,2013-08-27,4.84,4.81,4.89,4.87,4.9,4.92,4.83,,f,2,1,1,0,0.92,,,,



Reviews Sample:


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,1419,38924112,2015-07-19,11308465,Marcela,Having the opportunity of arriving to Alexandr...
1,1419,44791978,2015-08-29,9580285,Marco,We have no enough words to describe how beauty...



Listings DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259178 entries, 0 to 259177
Data columns (total 79 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   id                                            259178 non-null  int64  
 1   listing_url                                   259178 non-null  object 
 2   scrape_id                                     259178 non-null  int64  
 3   last_scraped                                  259178 non-null  object 
 4   source                                        259178 non-null  object 
 5   name                                          259178 non-null  object 
 6   description                                   253815 non-null  object 
 7   neighborhood_overview                         119863 non-null  object 
 8   picture_url                                   259178 non-null  object 
 9   host_id               
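
Because every snapshot is concatenated into one frame, it can help to record which file each row came from and confirm that each file covers a single scrape month. The following is a minimal sketch on toy in-memory frames; the file names, the `source_file` column, and the `scrape_month` column are illustrative assumptions, not part of the pipeline above.

```python
import pandas as pd

# Toy stand-ins for two monthly snapshot files (hypothetical data).
snapshots = {
    "listings-24-10.csv": pd.DataFrame(
        {"id": [1, 2], "last_scraped": ["2024-10-05", "2024-10-05"]}
    ),
    "listings-24-11.csv": pd.DataFrame(
        {"id": [1, 3], "last_scraped": ["2024-11-07", "2024-11-08"]}
    ),
}

# Tag every row with its source file before concatenating.
frames = [df.assign(source_file=name) for name, df in snapshots.items()]
combined = pd.concat(frames, ignore_index=True)

# Derive a year-month period from the scrape date.
combined["scrape_month"] = pd.to_datetime(combined["last_scraped"]).dt.to_period("M")

# Number of distinct scrape months per file; a value above 1 would hint at
# a snapshot that straddles two months or a mislabeled file.
months_per_file = combined.groupby("source_file")["scrape_month"].nunique()
print(months_per_file)
```

A check like this is cheap to run once after loading and makes later month-based filters easier to trust.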

### 1. Remove unnecessary columns

In [127]:
cols_to_keep = [
    'id',
    'host_id',
    'name',
    'description',
    'host_is_superhost',
    'neighbourhood_cleansed',
    'latitude',
    'longitude',
    'property_type',
    'room_type',
    'accommodates',
    'bathrooms',
    'bedrooms',
    'beds',
    'amenities',
    'minimum_nights',
    'review_scores_rating',  #float
    'review_scores_accuracy',  #float
    'review_scores_cleanliness',  #float
    'review_scores_checkin',  #float
    'review_scores_communication',  #float
    'review_scores_location',  #float
    'review_scores_value',  #float
    'last_scraped',
    'price'
    ]

listings_df = raw_listings_df[cols_to_keep].copy()
print(f"\nReduced listings DataFrame to {len(listings_df.columns)} columns.")
print(listings_df.info())


Reduced listings DataFrame to 25 columns.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259178 entries, 0 to 259177
Data columns (total 25 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           259178 non-null  int64  
 1   host_id                      259178 non-null  int64  
 2   name                         259178 non-null  object 
 3   description                  253815 non-null  object 
 4   host_is_superhost            249001 non-null  object 
 5   neighbourhood_cleansed       259178 non-null  object 
 6   latitude                     259178 non-null  float64
 7   longitude                    259178 non-null  float64
 8   property_type                259178 non-null  object 
 9   room_type                    259178 non-null  object 
 10  accommodates                 259178 non-null  int64  
 11  bathrooms                    190841 non-null  float64
 12  bedrooms       
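
Selecting a fixed column list raises a `KeyError` if any snapshot schema lacks one of the names. As a hedged illustration (toy data, with `wanted`, `missing`, and `present` as hypothetical names), a guard of the following shape reports absent columns before subsetting:

```python
import pandas as pd

# Toy frame standing in for the concatenated raw listings.
raw = pd.DataFrame(
    {"id": [1], "price": ["$10.00"], "listing_url": ["https://example.com"]}
)

wanted = ["id", "price", "beds"]

# Split the wish list into columns that exist and columns that do not.
missing = [col for col in wanted if col not in raw.columns]
present = [col for col in wanted if col in raw.columns]

if missing:
    print(f"Columns not found in the raw data: {missing}")

# Copy so later assignments do not touch the raw frame.
subset = raw[present].copy()
```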

### 2. Convert the scrape date to month (1-12)

In [128]:
# Parse last_scraped and keep only the calendar month (1-12); the year is discarded
listings_df['month'] = pd.to_datetime(listings_df['last_scraped'], errors='coerce').dt.month

# Drop the last_scraped column as it's no longer needed
listings_df = listings_df.drop(columns=['last_scraped'])
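
Keeping only the calendar month means two scrapes taken in the same month of different years become indistinguishable. The sketch below, on made-up dates, contrasts `.dt.month` with a year-month period; the period variant is shown as an alternative, not as what the notebook does.

```python
import pandas as pd

# Two valid scrape dates a year apart plus one unparseable value.
scraped = pd.to_datetime(
    pd.Series(["2024-10-05", "2025-10-03", "not a date"]), errors="coerce"
)

# Month only: both valid dates collapse to 10; the bad value becomes NaN.
month_only = scraped.dt.month

# Year-month period: the two October scrapes stay distinct.
year_month = scraped.dt.to_period("M")

print(month_only.tolist())
print(year_month.tolist())
```

With twelve consecutive snapshots the month alone is unambiguous, but the distinction matters as soon as the input window exceeds one year.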

### 3. Clean price column, drop outliers, add price-per-person and log1p of both

In [129]:
# Convert prices to float
listings_df['price'] = listings_df['price'].replace(r'[\$,]', '', regex=True).astype(float)

# Drop rows with a missing price (the column is already float at this point)
listings_df = listings_df.dropna(subset=['price'])

# Add price_per_person column
listings_df['price_per_person'] = listings_df['price'] / listings_df['accommodates']

# Drop the bottom 1% and top 2% of price_per_person to remove outliers
lower_bound = listings_df['price_per_person'].quantile(0.01)
upper_bound = listings_df['price_per_person'].quantile(0.98)
# .copy() keeps the later column assignments from triggering SettingWithCopyWarning
listings_df = listings_df[(listings_df['price_per_person'] >= lower_bound) & (listings_df['price_per_person'] <= upper_bound)].copy()

# Add log1p transformed columns
listings_df['log_price'] = np.log1p(listings_df['price'])
listings_df['log_price_per_person'] = np.log1p(listings_df['price_per_person'])

# Print info and a sample
print("\nUpdated Listings DataFrame Info:")
print(listings_df.info())
print("\nListings DataFrame Sample with New Columns:")
display(listings_df.head(5))


Updated Listings DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 185198 entries, 2 to 259177
Data columns (total 28 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           185198 non-null  int64  
 1   host_id                      185198 non-null  int64  
 2   name                         185198 non-null  object 
 3   description                  182441 non-null  object 
 4   host_is_superhost            177113 non-null  object 
 5   neighbourhood_cleansed       185198 non-null  object 
 6   latitude                     185198 non-null  float64
 7   longitude                    185198 non-null  float64
 8   property_type                185198 non-null  object 
 9   room_type                    185198 non-null  object 
 10  accommodates                 185198 non-null  int64  
 11  bathrooms                    185083 non-null  float64
 12  bedrooms                     

Unnamed: 0,id,host_id,name,description,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,month,price_per_person,log_price,log_price_per_person
2,26654,113345,"World Class @ CN Tower, convention centre, The...","CN Tower, TIFF Bell Lightbox, Metro Convention...",t,Waterfront Communities-The Island,43.64608,-79.39032,Entire condo,Entire home/apt,4,1.0,1.0,2.0,"[""Iron"", ""Building staff"", ""City skyline view""...",28,4.79,4.79,4.79,4.64,4.76,4.86,4.67,155.0,10,38.75,5.049856,3.68261
6,40701,175687,Bright Beaches loft close to Queen and the lake,Highly walkable neighborhood. Close to the lak...,f,The Beaches,43.67239,-79.28858,Entire rental unit,Entire home/apt,2,1.0,1.0,1.0,"[""Window AC unit"", ""Stove"", ""Iron"", ""Coffee ma...",28,4.75,4.88,4.38,4.88,4.63,4.88,4.5,85.0,10,42.5,4.454347,3.772761
7,44452,195095,Yonge & Bloor Studio Skyline,,t,Rosedale-Moore Park,43.67193,-79.3859,Entire rental unit,Entire home/apt,2,1.0,1.0,1.0,"[""Iron"", ""Hot water"", ""Carbon monoxide alarm"",...",28,4.18,4.51,4.03,4.79,4.84,4.95,4.23,126.0,10,63.0,4.844187,4.158883
8,45399,195095,Fountain View Studio - Eaton center,"Open Space studio style, Big windows, calm & r...",t,Bay Street Corridor,43.66123,-79.38336,Entire condo,Entire home/apt,3,1.0,0.0,1.0,"[""Iron"", ""Hot water"", ""Coffee maker: drip coff...",28,4.17,4.49,3.97,4.63,4.69,4.92,4.21,146.0,10,48.666667,4.990433,3.905334
9,45893,195095,Yonge & Bloor Lakeview Master BR,,t,Rosedale-Moore Park,43.6718,-79.38488,Private room in rental unit,Private room,1,1.0,1.0,1.0,"[""Iron"", ""Coffee maker"", ""Carbon monoxide alar...",28,4.4,4.29,4.0,4.52,4.81,4.86,4.24,90.0,10,90.0,4.51086,4.51086
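
To make the price steps concrete, here is a small sketch that runs the same sequence (strip currency formatting, drop missing prices, compute price per person, trim by quantiles, take `log1p`) on five invented rows. The data and the `toy`/`trimmed` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "price": ["$1,250.00", "$85.00", np.nan, "$40.00", "$120.00"],
        "accommodates": [2, 2, 4, 4, 3],
    }
)

# Strip "$" and thousands separators, then cast to float.
toy["price"] = toy["price"].str.replace(r"[\$,]", "", regex=True).astype(float)

# Rows without a price cannot be used as targets.
toy = toy.dropna(subset=["price"]).copy()

toy["price_per_person"] = toy["price"] / toy["accommodates"]

# Trim below the 1st and above the 98th percentile of price per person.
lower = toy["price_per_person"].quantile(0.01)
upper = toy["price_per_person"].quantile(0.98)
trimmed = toy[toy["price_per_person"].between(lower, upper)].copy()

# log1p compresses the right tail of the price distribution.
trimmed["log_price"] = np.log1p(trimmed["price"])
print(trimmed)
```

On a sample this small, quantile interpolation always excludes the minimum and maximum, so only the $85 and $120 rows survive; on the full dataset the cut removes roughly 3% of rows.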


### 4. Keep only listings with at least one review, drop rows with NaNs, keep only listings that appear in at least 5 months

In [130]:
# Compare IDs between listings_df and raw_reviews_df
listings_ids = set(listings_df['id'].unique())
reviews_ids = set(raw_reviews_df['listing_id'].unique())

common_ids = listings_ids & reviews_ids
only_in_listings = listings_ids - reviews_ids
only_in_reviews = reviews_ids - listings_ids

print(f"Total unique IDs in listings: {len(listings_ids)}")
print(f"Total unique IDs in reviews: {len(reviews_ids)}")
print(f"Common IDs: {len(common_ids)}")
print(f"IDs only in listings: {len(only_in_listings)}")
print(f"IDs only in reviews: {len(only_in_reviews)}")

# Optionally, display some samples
print("\nSample common IDs:", list(common_ids)[:5])
print("Sample only in listings:", list(only_in_listings)[:5])
print("Sample only in reviews:", list(only_in_reviews)[:5])

# Keep only common IDs in listings and reviews
common_listings_df = listings_df[listings_df['id'].isin(common_ids)]
common_reviews_df = raw_reviews_df[raw_reviews_df['listing_id'].isin(common_ids)]

# Drop every row that has a NaN in any column
common_listings_df = common_listings_df.dropna()

# Keep only listings that appear at least 5 times
common_listings_df = common_listings_df[common_listings_df.groupby('id')['id'].transform('size') >= 5]

# Display info after filtering
print("\nFiltered Listings DataFrame Info:")
print(common_listings_df.info())
print("\nFiltered Reviews DataFrame Info:")
print(common_reviews_df.info())

Total unique IDs in listings: 26686
Total unique IDs in reviews: 16546
Common IDs: 14012
IDs only in listings: 12674
IDs only in reviews: 2534

Sample common IDs: [np.int64(35880960), np.int64(1081313620383858689), np.int64(1136166639130869762), np.int64(1239457201874862082), np.int64(1312646562476523531)]
Sample only in listings: [np.int64(986302774835249152), np.int64(594055665721704452), np.int64(1343715665444208646), np.int64(1022457932458819592), np.int64(1103777582724939786)]
Sample only in reviews: [np.int64(38731776), np.int64(23248896), np.int64(28606469), np.int64(27549705), np.int64(26083338)]

Filtered Listings DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 104843 entries, 2 to 256799
Data columns (total 28 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           104843 non-null  int64  
 1   host_id                      104843 non-null  int64  
 2   name  
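
The `transform('size')` filter counts rows per listing, which equals the number of months only when each listing appears at most once per month. The sketch below, using an invented panel, shows the row-count filter next to a distinct-month variant; the variant and all variable names are assumptions, offered for comparison.

```python
import pandas as pd

panel = pd.DataFrame(
    {
        "id": [1] * 5 + [2] * 3 + [3] * 6,
        # Listing 3 has six rows but only four distinct months.
        "month": [1, 2, 3, 4, 5] + [1, 2, 3] + [1, 2, 3, 4, 4, 4],
    }
)

# Row-count filter, as in the cell above: size of each id group,
# broadcast back to every row.
rows_per_id = panel.groupby("id")["id"].transform("size")
by_rows = panel[rows_per_id >= 5]

# Alternative: count distinct months per listing instead of rows.
months_per_id = panel.groupby("id")["month"].transform("nunique")
by_months = panel[months_per_id >= 5]

print(sorted(by_rows["id"].unique()))
print(sorted(by_months["id"].unique()))
```

If duplicate rows within a month are possible in the raw snapshots, the distinct-month count matches the heading's wording more closely.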

### 5. Add column with total reviews extracted from `common_reviews_df`, format `host_is_superhost`, `bedrooms`, and `beds` columns 

In [131]:
# Aggregate reviews to get total reviews per listing
reviews_count = common_reviews_df.groupby('listing_id').size().reset_index(name='total_reviews')

# Merge on the listing ID
final_df = common_listings_df.merge(reviews_count, left_on='id', right_on='listing_id', how='left')

# Convert total_reviews to int
final_df['total_reviews'] = final_df['total_reviews'].astype('int')

# Drop the redundant listing_id column
final_df = final_df.drop(columns=['listing_id'])

# Convert host_is_superhost to numeric 0/1
final_df['host_is_superhost'] = final_df['host_is_superhost'].astype(str).map({'t': 1, 'f': 0})

# Convert bedrooms and beds to int
final_df['bedrooms'] = final_df['bedrooms'].astype('int')
final_df['beds'] = final_df['beds'].astype('int')

# Print information about the final DataFrame
print(f"\nFinal listings dataset for {CITY}:")
display(final_df.info())

# Display 3 sample listings (all occurrences)
sample_ids = np.random.choice(final_df['id'].unique(), size=3, replace=False)
for listing_id in sample_ids:
    # All monthly rows for this listing
    listing_rows = final_df[final_df['id'] == listing_id]
    print(f"\nSample data for listing ID {listing_id}:")
    display(listing_rows)


Final listings dataset for toronto:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 104843 entries, 0 to 104842
Data columns (total 29 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           104843 non-null  int64  
 1   host_id                      104843 non-null  int64  
 2   name                         104843 non-null  object 
 3   description                  104843 non-null  object 
 4   host_is_superhost            104843 non-null  int64  
 5   neighbourhood_cleansed       104843 non-null  object 
 6   latitude                     104843 non-null  float64
 7   longitude                    104843 non-null  float64
 8   property_type                104843 non-null  object 
 9   room_type                    104843 non-null  object 
 10  accommodates                 104843 non-null  int64  
 11  bathrooms                    104843 non-null  float64
 12  bedrooms             

None


Sample data for listing ID 15127295:


Unnamed: 0,id,host_id,name,description,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,month,price_per_person,log_price,log_price_per_person,total_reviews
8725,15127295,50778946,Bright Beach Toronto Home,Light filled 4br executive home only steps to ...,1,The Beaches,43.66588,-79.30843,Entire home,Entire home/apt,8,2.5,4,5,"[""Shampoo"", ""Washer"", ""Air conditioning"", ""Wif...",28,5.0,5.0,5.0,5.0,5.0,5.0,5.0,380.0,11,47.5,5.942799,3.881564,4
16744,15127295,50778946,Bright Beach Toronto Home,Light filled 4br executive home only steps to ...,1,The Beaches,43.66414,-79.30968,Entire home,Entire home/apt,8,2.5,4,5,"[""Hangers"", ""Shampoo"", ""Heating"", ""Essentials""...",28,5.0,5.0,5.0,5.0,5.0,5.0,5.0,380.0,12,47.5,5.942799,3.881564,4
32537,15127295,50778946,Bright Beach Toronto Home,Light filled 4br executive home only steps to ...,1,The Beaches,43.66588,-79.30843,Entire home,Entire home/apt,8,2.5,4,5,"[""Free parking on premises"", ""Heating"", ""Hange...",28,5.0,5.0,5.0,5.0,5.0,5.0,5.0,380.0,1,47.5,5.942799,3.881564,4
33659,15127295,50778946,Bright Beach Toronto Home,Light filled 4br executive home only steps to ...,1,The Beaches,43.66588,-79.30843,Entire home,Entire home/apt,8,2.5,4,5,"[""Free parking on premises"", ""Heating"", ""Air c...",28,5.0,5.0,5.0,5.0,5.0,5.0,5.0,334.0,2,41.75,5.814131,3.755369,4
50899,15127295,50778946,Bright Beach Toronto Home,Light filled 4br executive home only steps to ...,0,The Beaches,43.66588,-79.30843,Entire home,Entire home/apt,8,2.5,4,5,"[""Washer"", ""Iron"", ""Heating"", ""Kitchen"", ""Wifi...",28,5.0,5.0,5.0,5.0,5.0,5.0,5.0,334.0,4,41.75,5.814131,3.755369,4
59813,15127295,50778946,Bright Beach Toronto Home,Light filled 4br executive home only steps to ...,0,The Beaches,43.66588,-79.30843,Entire home,Entire home/apt,8,2.5,4,5,"[""Hangers"", ""TV"", ""Dryer"", ""Shampoo"", ""Heating...",28,5.0,5.0,5.0,5.0,5.0,5.0,5.0,334.0,5,41.75,5.814131,3.755369,4
78112,15127295,50778946,Bright Beach Toronto Home,Light filled 4br executive home only steps to ...,0,The Beaches,43.66588,-79.30843,Entire home,Entire home/apt,8,2.5,4,5,"[""Indoor fireplace"", ""Hangers"", ""Kitchen"", ""Wi...",28,5.0,5.0,5.0,5.0,5.0,5.0,5.0,411.0,7,51.375,6.021023,3.958429,4
87308,15127295,50778946,Bright Beach Toronto Home,Light filled 4br executive home only steps to ...,0,The Beaches,43.66588,-79.30843,Entire home,Entire home/apt,8,2.5,4,5,"[""Wifi"", ""Iron"", ""Washer"", ""Essentials"", ""Sham...",28,5.0,5.0,5.0,5.0,5.0,5.0,5.0,335.0,8,41.875,5.817111,3.758289,4
96369,15127295,50778946,Bright Beach Toronto Home,Light filled 4br executive home only steps to ...,0,The Beaches,43.66588,-79.30843,Entire home,Entire home/apt,8,2.5,4,5,"[""Hair dryer"", ""Heating"", ""Free parking on pre...",28,5.0,5.0,5.0,5.0,5.0,5.0,5.0,335.0,9,41.875,5.817111,3.758289,4



Sample data for listing ID 933532471021940547:


Unnamed: 0,id,host_id,name,description,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,month,price_per_person,log_price,log_price_per_person,total_reviews
5588,933532471021940547,110704189,"Home sweet home, Toronto.",Welcome to our beautiful spacious oasis. Newly...,1,Weston-Pellam Park,43.669052,-79.451128,Entire home,Entire home/apt,6,2.0,2,6,"[""Exterior security cameras on property"", ""Tra...",28,4.5,5.0,4.0,5.0,5.0,4.0,4.5,369.0,10,61.5,5.913503,4.135167,2
21637,933532471021940547,110704189,"Home sweet home, Toronto.",Welcome to our beautiful spacious oasis. Newly...,1,Weston-Pellam Park,43.669052,-79.451128,Entire home,Entire home/apt,5,2.0,2,6,"[""Free parking on premises \u2013 1 space"", ""S...",28,4.5,5.0,4.0,5.0,5.0,4.0,4.5,269.0,12,53.8,5.598422,4.00369,2
32248,933532471021940547,110704189,"Home sweet home, Toronto.",Welcome to our beautiful spacious oasis. Newly...,1,Weston-Pellam Park,43.669052,-79.451128,Entire home,Entire home/apt,5,2.0,2,6,"[""Body soap"", ""Free washer \u2013 In unit"", ""H...",28,4.5,5.0,4.0,5.0,5.0,4.0,4.5,279.0,1,55.8,5.63479,4.039536,2
83005,933532471021940547,110704189,"Home sweet home, Toronto.",Welcome to our beautiful spacious oasis. Newly...,0,Weston-Pellam Park,43.669052,-79.451128,Entire home,Entire home/apt,4,2.0,2,6,"[""Private patio or balcony"", ""Extra pillows an...",28,4.5,5.0,4.0,5.0,5.0,4.0,4.5,252.0,7,63.0,5.533389,4.158883,2
92144,933532471021940547,110704189,"Home sweet home, Toronto.",Welcome to our beautiful spacious oasis. Newly...,0,Weston-Pellam Park,43.669052,-79.451128,Entire home,Entire home/apt,4,2.0,2,6,"[""Children\u2019s books and toys"", ""Dedicated ...",28,4.5,5.0,4.0,5.0,5.0,4.0,4.5,288.0,8,72.0,5.666427,4.290459,2



Sample data for listing ID 37604683:


Unnamed: 0,id,host_id,name,description,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,month,price_per_person,log_price,log_price_per_person,total_reviews
2079,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Window AC unit"", ""Exterior security cameras ...",28,4.27,4.6,4.53,4.67,4.73,4.6,4.47,103.0,10,51.5,4.644391,3.960813,19
10035,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Coffee maker"", ""Freezer"", ""Clothing storage""...",28,4.25,4.56,4.5,4.69,4.75,4.63,4.44,103.0,11,51.5,4.644391,3.960813,19
18091,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""43 inch HDTV with Netflix"", ""Essentials"", ""D...",28,4.25,4.56,4.5,4.69,4.75,4.63,4.44,96.0,12,48.0,4.574711,3.89182,19
31951,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Bathtub"", ""Window AC unit"", ""Stove"", ""Privat...",28,4.25,4.56,4.5,4.69,4.75,4.63,4.44,96.0,1,48.0,4.574711,3.89182,19
34968,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Carbon monoxide alarm"", ""Dining table"", ""Ref...",28,4.25,4.56,4.5,4.69,4.75,4.63,4.44,96.0,2,48.0,4.574711,3.89182,19
43538,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Keypad"", ""Central heating"", ""Stove"", ""Bathtu...",28,4.25,4.56,4.5,4.69,4.75,4.63,4.44,96.0,3,48.0,4.574711,3.89182,19
52199,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Paid parking lot on premises"", ""Kitchen"", ""P...",28,4.25,4.56,4.5,4.69,4.75,4.63,4.44,120.0,4,60.0,4.795791,4.110874,19
61130,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Self check-in"", ""Carbon monoxide alarm"", ""Re...",28,4.25,4.56,4.5,4.69,4.75,4.63,4.44,120.0,5,60.0,4.795791,4.110874,19
70297,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Wifi"", ""Freezer"", ""Outdoor dining area"", ""Wi...",28,4.29,4.59,4.53,4.71,4.76,4.65,4.47,96.0,6,48.0,4.574711,3.89182,19
79480,37604683,281667112,❤️ Cozy Studio Apartment - near Downtown Toronto,Welcome to our newly renovated bachelor apartm...,0,South Parkdale,43.63582,-79.43369,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Kitchen"", ""Paid street parking off premises""...",28,4.29,4.59,4.53,4.71,4.76,4.65,4.47,97.0,7,48.5,4.584967,3.901973,19
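
As the samples show, `total_reviews` is a per-listing total that repeats on every monthly row. A minimal sketch of that aggregate-then-merge pattern on toy data follows; the `validate="many_to_one"` argument is an optional safeguard added here as an assumption, and the frame names are hypothetical.

```python
import pandas as pd

listings = pd.DataFrame({"id": [10, 10, 20], "month": [1, 2, 1]})
reviews = pd.DataFrame(
    {
        "listing_id": [10, 10, 10, 20],
        "date": ["2024-01-03", "2024-02-11", "2024-02-20", "2024-01-15"],
    }
)

# One row per listing with its lifetime review count.
counts = (
    reviews.groupby("listing_id").size().rename("total_reviews").reset_index()
)

# Attach the count to every monthly row; validate raises MergeError
# if the right side unexpectedly has duplicate listing_id keys.
merged = listings.merge(
    counts,
    left_on="id",
    right_on="listing_id",
    how="left",
    validate="many_to_one",
).drop(columns="listing_id")

print(merged)
```

Since the count does not vary by month, it behaves as a static listing feature rather than a time-varying one.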


### 6. Clean up amenities

In [132]:
import ast
import re

# --- 1. Define a robust parsing function ---
def clean_and_format_amenities(amenities_str: str) -> str:
    """
    Safely parses a stringified list of amenities and returns
    a single, comma-separated string suitable for a sentence transformer.

    Args:
        amenities_str: The raw string from the 'amenities' column.

    Returns:
        A clean, comma-separated string of amenities, or an empty string
        if the input is invalid or empty.
    """
    if not isinstance(amenities_str, str) or amenities_str in ('', '[]'):
        return ""
    
    try:
        # ast.literal_eval parses Python literals (such as a list of strings)
        # without executing arbitrary code, unlike eval.
        amenities_list = ast.literal_eval(amenities_str)
        
        # Ensure it's actually a list before trying to join
        if isinstance(amenities_list, list):
            # Sort for a stable order; cast to str so a stray non-string item cannot break the join
            return ", ".join(sorted(str(a) for a in amenities_list))
        else:
            return ""
            
    except (ValueError, SyntaxError):
        # This catches malformed strings that ast cannot parse
        return ""

# --- 3. Apply the function and verify the result ---

print("\n--- Before Formatting ---")
print(final_df[['amenities']].head())
print(f"\nOriginal dtype: {final_df['amenities'].dtype}")


# Apply the cleaning function to the 'amenities' column
final_df['amenities'] = final_df['amenities'].apply(clean_and_format_amenities)

# Function to check if a string contains surrogates
def has_surrogates(text):
    if not isinstance(text, str):
        return False
    return bool(re.search(r'[\ud800-\udfff]', text))

# Drop rows whose amenities string contains surrogate characters
mask = final_df['amenities'].apply(has_surrogates)
final_df = final_df[~mask]  # Keep only rows where amenities does NOT have surrogates
print(f"Dropped {mask.sum()} rows with surrogate characters in 'amenities'.")


print("\n\n--- After Formatting ---")
print(final_df[['amenities']].head())
print(f"\nNew dtype: {final_df['amenities'].dtype}")

print("\n\nExample of a fully formatted amenities string:")
print(f"'{final_df['amenities'].iloc[0]}'")


--- Before Formatting ---
                                           amenities
0  ["Iron", "Building staff", "City skyline view"...
1  ["Window AC unit", "Stove", "Iron", "Coffee ma...
2  ["Iron", "Hot water", "Coffee maker: drip coff...
3  ["Iron", "Coffee maker", "Hot water", "Bathtub...
4  ["Iron", "Smoke alarm", "Wifi", "Kitchen", "Es...

Original dtype: object
Dropped 0 rows with surrogate characters in 'amenities'.


--- After Formatting ---
                                           amenities
0  Bed linens, Building staff, Carbon monoxide al...
1  32 inch HDTV with Netflix, standard cable, Bak...
2  27 inch TV, Air conditioning, Bathtub, Bed lin...
3  Air conditioning, Bathtub, Carbon monoxide ala...
4  Air conditioning, Carbon monoxide alarm, Essen...

New dtype: object


Example of a fully formatted amenities string:
'Bed linens, Building staff, Carbon monoxide alarm, Central air conditioning, Children’s dinnerware, City skyline view, Coffee maker, Cooking basics, Dedicated w
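
The raw `amenities` values shown above look like JSON arrays (double-quoted strings with `\uXXXX` escapes), so `json.loads` is a plausible alternative parser. The sketch below is an illustration under that assumption; `parse_amenities` is a hypothetical name and not the function used in the notebook.

```python
import json


def parse_amenities(raw):
    """Return a sorted, comma-separated amenities string, or "" on bad input."""
    if not isinstance(raw, str):
        return ""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        # Malformed or non-JSON text.
        return ""
    if not isinstance(items, list):
        return ""
    # Sorting gives the same text regardless of the order in the scrape.
    return ", ".join(sorted(str(item) for item in items))


example = '["Wifi", "Iron", "Free washer \\u2013 In unit"]'
print(parse_amenities(example))
```

Sorting matters here because the same listing reports its amenities in a different order each month, as the sample rows in step 5 show; a stable order keeps the resulting text identical when the underlying set is unchanged.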

### 8. Finalize & Save Modeling Dataset

In [133]:
import re

# Function to check if a string contains surrogates
def has_surrogates(text):
    if not isinstance(text, str):
        return False
    return bool(re.search(r'[\ud800-\udfff]', text))

# Scan all string columns for surrogates
print("Scanning for surrogate characters in string columns...")
offending_found = False
for col in final_df.select_dtypes(include=['object']).columns:
    mask = final_df[col].apply(has_surrogates)
    if mask.any():
        offending_found = True
        offending_rows = final_df[mask]
        print(f"\nOffending column: '{col}' ({len(offending_rows)} offending rows)")
        for idx in offending_rows.index[:5]:  # Limit to first 5 for readability
            val = final_df.at[idx, col]
            print(f"  Row {idx}: {repr(val[:200])}...")  # Truncate long strings

if not offending_found:
    print("No surrogate characters found in string columns.")

Scanning for surrogate characters in string columns...
No surrogate characters found in string columns.
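
One likely reason for scanning before the save is that a lone surrogate code point cannot be encoded as UTF-8, which any writer that stores strings as UTF-8 would trip over. The sketch below demonstrates this with the standard library only; `can_encode_utf8` and the sample strings are illustrative assumptions.

```python
import re

# Matches any code point in the UTF-16 surrogate range.
SURROGATE_RE = re.compile(r"[\ud800-\udfff]")


def can_encode_utf8(text):
    """Return True if text can be encoded as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


lone = "ok \ud83d broken"      # half of a surrogate pair: invalid on its own
clean = "ok \U0001F600 fine"   # a full emoji is a single valid code point

for label, text in [("lone", lone), ("clean", clean)]:
    print(label, bool(SURROGATE_RE.search(text)), can_encode_utf8(text))
```

The regex flags exactly the strings that fail to encode, so dropping or repairing flagged rows before writing avoids an encoding error at save time.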


In [134]:
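# Save to Parquet (create the output directory first if it does not exist)
os.makedirs(OUTPUT_DATA_DIR, exist_ok=True)
output_path = os.path.join(OUTPUT_DATA_DIR, OUTPUT_FILENAME)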
print(f"\nSaving to {output_path}...")
final_df.to_parquet(output_path, index=False)
print("Done.")


Saving to ../data/toronto/toronto_dataset_oct_17.parquet...
Done.


In [135]:
# Save a sample with all occurrences of 2 random listings
sample_ids = np.random.choice(final_df['id'].unique(), size=2, replace=False)
sample_df = final_df[final_df['id'].isin(sample_ids)]
sample_output_path = os.path.join(OUTPUT_DATA_DIR, f"{CITY}_sample_listings.csv")
sample_df.to_csv(sample_output_path, index=False)
print(f"Sample listings saved to {sample_output_path}.")

Sample listings saved to ../data/toronto/toronto_sample_listings.csv.
