# Airbnb Data Pre-processing & ETL

This notebook implements the end-to-end data pipeline to create the final modeling dataset from raw InsideAirbnb snapshots.

**Objective:** Load all monthly listings snapshots and the full reviews history, clean features, engineer the `estimated_occupancy_rate` sample weight, and produce a single, model-ready `listing-month` panel.

### 0. Setup & Data Loading

In [66]:
import pandas as pd
import numpy as np
import os
import glob

# --- Configuration ---
# Parent directory containing the 'listings-YY-MM.csv' files and '{CITY}-reviews-detailed...csv'
CITY = "toronto"
INPUT_DATA_DIR = os.path.expanduser(f"~/Downloads/insideairbnb/{CITY}") 
OUTPUT_DATA_DIR = os.path.expanduser(f"../data/{CITY}")
OUTPUT_FILENAME = f"{CITY}_dataset_oct_17.parquet"

# Configure pandas display
pd.options.display.max_columns = 100

# --- Load All Monthly Listings Snapshots ---
listings_files = sorted(glob.glob(os.path.join(INPUT_DATA_DIR, 'listings-*.csv')))
if not listings_files:
    raise FileNotFoundError(f"No 'listings-*.csv' files found in {INPUT_DATA_DIR}")

print(f"Found {len(listings_files)} monthly listings files. Loading and concatenating...")

dfs = []
for file in listings_files:
    # low_memory=False parses each file in one pass, so pandas infers a single dtype
    # per column instead of emitting mixed-type DtypeWarnings on the raw CSVs
    df = pd.read_csv(file, low_memory=False)
    dfs.append(df)

raw_listings_df = pd.concat(dfs, ignore_index=True)
print(f"Successfully loaded {len(raw_listings_df):,} total listing records.")

# --- Load Full Reviews History ---
reviews_path = os.path.join(INPUT_DATA_DIR, f'{CITY}-reviews-detailed-insideairbnb.csv')
print(f"Loading reviews from: {os.path.basename(reviews_path)}...")
try:
    raw_reviews_df = pd.read_csv(reviews_path)
    print(f"Successfully loaded {len(raw_reviews_df):,} reviews.")
except FileNotFoundError:
    raise FileNotFoundError(f"Could not find reviews file at: {reviews_path}")

# Display samples
print("\nListings Sample:")
display(raw_listings_df.head(2))
print("\nReviews Sample:")
display(raw_reviews_df.head(2))

# Display column info (DataFrame.info() prints directly and returns None,
# so it is not wrapped in print())
print("\nListings DataFrame Info:")
raw_listings_df.info()
print("\nReviews DataFrame Info:")
raw_reviews_df.info()

Found 12 monthly listings files. Loading and concatenating...
Successfully loaded 259,178 total listing records.
Loading reviews from: toronto-reviews-detailed-insideairbnb.csv...
Successfully loaded 617,247 reviews.

Listings Sample:


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,availability_eoy,number_of_reviews_ly,estimated_occupancy_l365d,estimated_revenue_l365d
0,1419,https://www.airbnb.com/rooms/1419,20241005023627,2024-10-05,previous scrape,Beautiful home in amazing area!,"This large, family home is located in one of T...",The apartment is located in the Ossington stri...,https://a0.muscache.com/pictures/76206750/d643...,1565,https://www.airbnb.com/users/show/1565,Alexandra,2008-08-08,"Vancouver, Canada","I live in Vancouver, Canada with my husband an...",,,,f,https://a0.muscache.com/im/pictures/user/7aeea...,https://a0.muscache.com/im/pictures/user/7aeea...,Commercial Drive,1.0,1.0,"['email', 'phone']",t,t,Neighborhood highlights,Little Portugal,,43.6459,-79.42423,Entire home,Entire home/apt,10,,3 baths,5.0,,"[""Smoke alarm"", ""Wifi"", ""Kitchen"", ""Essentials...",,28,730,28.0,28.0,730.0,730.0,28.0,730.0,,t,0,0,0,0,2024-10-05,6,0,0,2015-07-19,2017-08-07,5.0,5.0,5.0,5.0,5.0,5.0,5.0,,f,1,1,0,0,0.05,,,,
1,8077,https://www.airbnb.com/rooms/8077,20241005023627,2024-10-05,previous scrape,Downtown Harbourfront Private Room,Guest room in a luxury condo with access to al...,,https://a0.muscache.com/pictures/11780344/141c...,22795,https://www.airbnb.com/users/show/22795,Kathie & Larry,2009-06-22,"Toronto, Canada",My husband and I have been airbnb host for alm...,,,,f,https://a0.muscache.com/im/pictures/user/9a077...,https://a0.muscache.com/im/pictures/user/9a077...,Harbourfront,2.0,3.0,"['email', 'phone']",t,f,,Waterfront Communities-The Island,,43.6408,-79.37673,Private room in rental unit,Private room,2,,1.5 baths,,,"[""Free parking on premises"", ""Smoke alarm"", ""W...",,180,365,180.0,180.0,365.0,365.0,180.0,365.0,,,0,0,0,0,2024-10-05,169,0,0,2009-08-20,2013-08-27,4.84,4.81,4.89,4.87,4.9,4.92,4.83,,f,2,1,1,0,0.92,,,,



Reviews Sample:


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,762143,11893239,2014-04-18,727913,Asem,My wife and I stayed at Pat's condo downtown T...
1,762143,15051648,2014-07-01,14434615,Eva,"Our host was very nice, very professional, kno..."



Listings DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259178 entries, 0 to 259177
Data columns (total 79 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   id                                            259178 non-null  int64  
 1   listing_url                                   259178 non-null  object 
 2   scrape_id                                     259178 non-null  int64  
 3   last_scraped                                  259178 non-null  object 
 4   source                                        259178 non-null  object 
 5   name                                          259178 non-null  object 
 6   description                                   253815 non-null  object 
 7   neighborhood_overview                         119863 non-null  object 
 8   picture_url                                   259178 non-null  object 
 9   host_id               
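
Once the snapshots are concatenated, nothing records which file a row came from. The loading loop above could be wrapped as follows; this is only a sketch, and both the helper `load_snapshots` and the `source_file` column are hypothetical names that do not appear in the notebook.

```python
import glob
import os

import pandas as pd


def load_snapshots(input_dir: str, pattern: str = "listings-*.csv") -> pd.DataFrame:
    """Concatenate all CSV snapshots in input_dir, tagging each row with its file name.

    `source_file` is a hypothetical provenance column added for illustration.
    """
    files = sorted(glob.glob(os.path.join(input_dir, pattern)))
    if not files:
        raise FileNotFoundError(f"No '{pattern}' files found in {input_dir}")
    frames = [
        pd.read_csv(path, low_memory=False).assign(source_file=os.path.basename(path))
        for path in files
    ]
    return pd.concat(frames, ignore_index=True)
```

Keeping the file name makes it possible to count rows per snapshot later, for example with `df.groupby("source_file").size()`.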

### 1. Remove unnecessary columns

In [67]:
cols_to_keep = [
    'id',
    'host_id',
    'name',
    'description',
    'host_is_superhost',
    'neighbourhood_cleansed',
    'latitude',
    'longitude',
    'property_type',
    'room_type',
    'accommodates',
    'bathrooms',
    'bedrooms',
    'beds',
    'amenities',
    'minimum_nights',
    'review_scores_rating',  #float
    'review_scores_accuracy',  #float
    'review_scores_cleanliness',  #float
    'review_scores_checkin',  #float
    'review_scores_communication',  #float
    'review_scores_location',  #float
    'review_scores_value',  #float
    'last_scraped',
    'price'
    ]

listings_df = raw_listings_df[cols_to_keep].copy()
print(f"\nReduced listings DataFrame to {len(listings_df.columns)} columns.")
listings_df.info()  # info() prints directly and returns None


Reduced listings DataFrame to 25 columns.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 259178 entries, 0 to 259177
Data columns (total 25 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           259178 non-null  int64  
 1   host_id                      259178 non-null  int64  
 2   name                         259178 non-null  object 
 3   description                  253815 non-null  object 
 4   host_is_superhost            249001 non-null  object 
 5   neighbourhood_cleansed       259178 non-null  object 
 6   latitude                     259178 non-null  float64
 7   longitude                    259178 non-null  float64
 8   property_type                259178 non-null  object 
 9   room_type                    259178 non-null  object 
 10  accommodates                 259178 non-null  int64  
 11  bathrooms                    190841 non-null  float64
 12  bedrooms       
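
Monthly snapshots do not necessarily share an identical schema, so the column selection can fail if one of the expected columns is absent. A minimal sketch of a guard that reports exactly which columns are missing; `select_columns` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd


def select_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df restricted to `columns`, naming any that are missing."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Columns missing from input: {missing}")
    return df[columns].copy()


toy = pd.DataFrame({"id": [1], "price": ["$10.00"], "listing_url": ["..."]})
reduced = select_columns(toy, ["id", "price"])
```

Returning a copy, as the cell above does with `.copy()`, keeps later column assignments from touching the raw frame.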

### 2. Convert the scrape date to month (1-12)

In [68]:
# Parse last_scraped as datetime; unparseable values become NaT
scrape_dates = pd.to_datetime(listings_df['last_scraped'], errors='coerce')

# Keep the calendar month only (1-12, no year)
listings_df['month'] = scrape_dates.dt.month

# Drop the last_scraped column as it's no longer needed
listings_df = listings_df.drop(columns=['last_scraped'])
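
Dropping the year is only safe while the snapshots span at most twelve distinct calendar months; otherwise two different scrapes would collapse onto the same `month` value. A small check along these lines could be run before the conversion. It is a sketch, and `months_are_unambiguous` is a hypothetical helper name.

```python
import pandas as pd


def months_are_unambiguous(scrape_dates: pd.Series) -> bool:
    """Return True if no calendar month (1-12) is shared by two different year-months."""
    dates = pd.to_datetime(scrape_dates, errors="coerce").dropna()
    year_months = dates.dt.to_period("M").drop_duplicates()
    return not year_months.dt.month.duplicated().any()


same_month_twice = pd.Series(["2024-10-05", "2025-10-03"])
one_year = pd.Series(["2024-10-05", "2024-10-06", "2024-11-04"])
```

Here `months_are_unambiguous(one_year)` holds, while `same_month_twice` fails the check because October appears in two different years.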

### 3. Clean price column, drop outliers, add price-per-person and log1p of both

In [69]:
# Convert prices to float
listings_df['price'] = listings_df['price'].replace(r'[\$,]', '', regex=True).astype(float)

# Drop rows with a missing price (the column is already float at this point)
listings_df = listings_df.dropna(subset=['price'])

# Add price_per_person column
listings_df['price_per_person'] = listings_df['price'] / listings_df['accommodates']

# Drop the bottom 1% and top 2% of price_per_person to remove outliers.
# .copy() makes the filtered frame independent, so the column assignments
# below do not trigger a SettingWithCopyWarning.
lower_bound = listings_df['price_per_person'].quantile(0.01)
upper_bound = listings_df['price_per_person'].quantile(0.98)
listings_df = listings_df[(listings_df['price_per_person'] >= lower_bound) & (listings_df['price_per_person'] <= upper_bound)].copy()

# Add log1p transformed columns
listings_df['log_price'] = np.log1p(listings_df['price'])
listings_df['log_price_per_person'] = np.log1p(listings_df['price_per_person'])

# Print info and a sample (info() prints directly and returns None)
print("\nUpdated Listings DataFrame Info:")
listings_df.info()
print("\nListings DataFrame Sample with New Columns:")
display(listings_df.head(5))


Updated Listings DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 185198 entries, 2 to 259177
Data columns (total 28 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           185198 non-null  int64  
 1   host_id                      185198 non-null  int64  
 2   name                         185198 non-null  object 
 3   description                  182441 non-null  object 
 4   host_is_superhost            177113 non-null  object 
 5   neighbourhood_cleansed       185198 non-null  object 
 6   latitude                     185198 non-null  float64
 7   longitude                    185198 non-null  float64
 8   property_type                185198 non-null  object 
 9   room_type                    185198 non-null  object 
 10  accommodates                 185198 non-null  int64  
 11  bathrooms                    185083 non-null  float64
 12  bedrooms                     

Unnamed: 0,id,host_id,name,description,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,month,price_per_person,log_price,log_price_per_person
2,26654,113345,"World Class @ CN Tower, convention centre, The...","CN Tower, TIFF Bell Lightbox, Metro Convention...",t,Waterfront Communities-The Island,43.64608,-79.39032,Entire condo,Entire home/apt,4,1.0,1.0,2.0,"[""Iron"", ""Building staff"", ""City skyline view""...",28,4.79,4.79,4.79,4.64,4.76,4.86,4.67,155.0,10,38.75,5.049856,3.68261
6,40701,175687,Bright Beaches loft close to Queen and the lake,Highly walkable neighborhood. Close to the lak...,f,The Beaches,43.67239,-79.28858,Entire rental unit,Entire home/apt,2,1.0,1.0,1.0,"[""Window AC unit"", ""Stove"", ""Iron"", ""Coffee ma...",28,4.75,4.88,4.38,4.88,4.63,4.88,4.5,85.0,10,42.5,4.454347,3.772761
7,44452,195095,Yonge & Bloor Studio Skyline,,t,Rosedale-Moore Park,43.67193,-79.3859,Entire rental unit,Entire home/apt,2,1.0,1.0,1.0,"[""Iron"", ""Hot water"", ""Carbon monoxide alarm"",...",28,4.18,4.51,4.03,4.79,4.84,4.95,4.23,126.0,10,63.0,4.844187,4.158883
8,45399,195095,Fountain View Studio - Eaton center,"Open Space studio style, Big windows, calm & r...",t,Bay Street Corridor,43.66123,-79.38336,Entire condo,Entire home/apt,3,1.0,0.0,1.0,"[""Iron"", ""Hot water"", ""Coffee maker: drip coff...",28,4.17,4.49,3.97,4.63,4.69,4.92,4.21,146.0,10,48.666667,4.990433,3.905334
9,45893,195095,Yonge & Bloor Lakeview Master BR,,t,Rosedale-Moore Park,43.6718,-79.38488,Private room in rental unit,Private room,1,1.0,1.0,1.0,"[""Iron"", ""Coffee maker"", ""Carbon monoxide alar...",28,4.4,4.29,4.0,4.52,4.81,4.86,4.24,90.0,10,90.0,4.51086,4.51086
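
To make the price steps concrete, here is the same sequence on a five-row toy frame. The data is invented for illustration, and `pd.to_numeric(..., errors="coerce")` is swapped in for the notebook's `astype(float)` so that unparseable strings become NaN instead of raising.

```python
import numpy as np
import pandas as pd

# Invented toy data: prices as raw strings, one of them missing
toy = pd.DataFrame({
    "price": ["$100.00", "$1,250.00", None, "$80.00", "$60.00"],
    "accommodates": [2, 5, 2, 4, 1],
})

# Strip "$" and thousands separators, then parse to float
toy["price"] = pd.to_numeric(
    toy["price"].str.replace(r"[\$,]", "", regex=True), errors="coerce"
)
toy = toy.dropna(subset=["price"]).copy()
toy["price_per_person"] = toy["price"] / toy["accommodates"]

# Keep rows between the 1st and 98th percentile of price_per_person
lower = toy["price_per_person"].quantile(0.01)
upper = toy["price_per_person"].quantile(0.98)
trimmed = toy[toy["price_per_person"].between(lower, upper)].copy()

# log1p(x) = log(1 + x), which stays finite at x = 0
trimmed["log_price"] = np.log1p(trimmed["price"])
trimmed["log_price_per_person"] = np.log1p(trimmed["price_per_person"])
```

With only four priced rows, the interpolated quantile bounds fall strictly inside the observed range, so both the cheapest and the most expensive per-person price are dropped. On the full dataset the same bounds remove roughly 3% of rows.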


### 4. Keep only listings with at least one review, drop rows with NaNs, keep only listings that appear in at least 5 months

In [70]:
# Compare IDs between listings_df and raw_reviews_df
listings_ids = set(listings_df['id'].unique())
reviews_ids = set(raw_reviews_df['listing_id'].unique())

common_ids = listings_ids & reviews_ids
only_in_listings = listings_ids - reviews_ids
only_in_reviews = reviews_ids - listings_ids

print(f"Total unique IDs in listings: {len(listings_ids)}")
print(f"Total unique IDs in reviews: {len(reviews_ids)}")
print(f"Common IDs: {len(common_ids)}")
print(f"IDs only in listings: {len(only_in_listings)}")
print(f"IDs only in reviews: {len(only_in_reviews)}")

# Optionally, display some samples
print("\nSample common IDs:", list(common_ids)[:5])
print("Sample only in listings:", list(only_in_listings)[:5])
print("Sample only in reviews:", list(only_in_reviews)[:5])

# Keep only common IDs in listings and reviews
common_listings_df = listings_df[listings_df['id'].isin(common_ids)]
common_reviews_df = raw_reviews_df[raw_reviews_df['listing_id'].isin(common_ids)]

# Drop all rows that contain any NaN
common_listings_df = common_listings_df.dropna()

# Keep only listings with at least 5 remaining rows
# (5 monthly snapshots, if each listing appears once per snapshot)
common_listings_df = common_listings_df[common_listings_df.groupby('id')['id'].transform('size') >= 5]

# Display info after filtering (info() prints directly and returns None)
print("\nFiltered Listings DataFrame Info:")
common_listings_df.info()
print("\nFiltered Reviews DataFrame Info:")
common_reviews_df.info()

Total unique IDs in listings: 26686
Total unique IDs in reviews: 16391
Common IDs: 13838
IDs only in listings: 12848
IDs only in reviews: 2553

Sample common IDs: [np.int64(35880960), np.int64(1081313620383858689), np.int64(1136166639130869762), np.int64(1239457201874862082), np.int64(1312646562476523531)]
Sample only in listings: [np.int64(986302774835249152), np.int64(594055665721704452), np.int64(1343715665444208646), np.int64(1022457932458819592), np.int64(1103777582724939786)]
Sample only in reviews: [np.int64(38731776), np.int64(23248896), np.int64(28606469), np.int64(27549705), np.int64(26083338)]

Filtered Listings DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 108108 entries, 2 to 256799
Data columns (total 28 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           108108 non-null  int64  
 1   host_id                      108108 non-null  int64  
 2   name  
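
The filter above counts rows per listing, which equals the number of months only if each listing has one row per month. A variant that counts distinct months directly might look like the sketch below; `keep_listings_with_min_months` is a hypothetical helper, and the toy data is invented.

```python
import pandas as pd


def keep_listings_with_min_months(
    df: pd.DataFrame, min_months: int = 5, id_col: str = "id", month_col: str = "month"
) -> pd.DataFrame:
    """Keep rows of listings observed in at least `min_months` distinct months."""
    months_seen = df.groupby(id_col)[month_col].transform("nunique")
    return df[months_seen >= min_months].copy()


# Listing 1: five distinct months. Listing 2: five rows but only three distinct months.
toy = pd.DataFrame({
    "id": [1] * 5 + [2] * 5,
    "month": [10, 11, 12, 1, 2] + [10, 10, 11, 11, 12],
})
kept = keep_listings_with_min_months(toy)
```

A row-count filter would keep both toy listings, whereas the distinct-month version keeps only listing 1.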

### 5. Add column with total reviews extracted from `common_reviews_df`, format `host_is_superhost`, `bedrooms`, and `beds` columns 

In [71]:
# Aggregate reviews to get total reviews per listing
reviews_count = common_reviews_df.groupby('listing_id').size().reset_index(name='total_reviews')

# Merge on the listing ID
final_df = common_listings_df.merge(reviews_count, left_on='id', right_on='listing_id', how='left')

# Convert total_reviews to int
final_df['total_reviews'] = final_df['total_reviews'].astype('int')

# Drop the redundant listing_id column
final_df = final_df.drop(columns=['listing_id'])

# Convert host_is_superhost to numeric 0/1
final_df['host_is_superhost'] = final_df['host_is_superhost'].astype(str).map({'t': 1, 'f': 0})

# Convert bedrooms and beds to int
final_df['bedrooms'] = final_df['bedrooms'].astype('int')
final_df['beds'] = final_df['beds'].astype('int')

# Print information about the final DataFrame
print(f"\nFinal listings dataset for {CITY}:")
final_df.info()  # info() prints directly; wrapping it in display() would echo a stray None

# Display 3 sample listings (all occurrences)
sample_ids = np.random.choice(final_df['id'].unique(), size=3, replace=False)
for listing_id in sample_ids:
    listing_reviews = final_df[final_df['id'] == listing_id]
    print(f"\nSample data for listing ID {listing_id}:")
    display(listing_reviews)


Final listings dataset for toronto:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 108108 entries, 0 to 108107
Data columns (total 29 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   id                           108108 non-null  int64  
 1   host_id                      108108 non-null  int64  
 2   name                         108108 non-null  object 
 3   description                  108108 non-null  object 
 4   host_is_superhost            108108 non-null  int64  
 5   neighbourhood_cleansed       108108 non-null  object 
 6   latitude                     108108 non-null  float64
 7   longitude                    108108 non-null  float64
 8   property_type                108108 non-null  object 
 9   room_type                    108108 non-null  object 
 10  accommodates                 108108 non-null  int64  
 11  bathrooms                    108108 non-null  float64
 12  bedrooms             



Sample data for listing ID 761634524402467898:


Unnamed: 0,id,host_id,name,description,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,month,price_per_person,log_price,log_price_per_person,total_reviews
4358,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,1,North St.James Town,43.67231,-79.37594,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Iron"", ""Hot water"", ""Hot water kettle"", ""Bat...",28,5.0,5.0,4.67,5.0,5.0,4.67,5.0,170.0,10,85.0,5.141664,4.454347,4
12478,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,1,North St.James Town,43.67231,-79.37594,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Free dryer \u2013 In unit"", ""Freezer"", ""Air ...",28,5.0,5.0,4.67,5.0,5.0,4.67,5.0,170.0,11,85.0,5.141664,4.454347,4
20862,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,1,North St.James Town,43.67144,-79.37661,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Elevator"", ""Shower gel"", ""Essentials"", ""40 i...",28,5.0,5.0,4.67,5.0,5.0,4.67,5.0,170.0,12,85.0,5.141664,4.454347,4
33658,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,1,North St.James Town,43.67231,-79.37594,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Bathtub"", ""Body soap"", ""Free washer \u2013 I...",28,5.0,5.0,4.67,5.0,5.0,4.67,5.0,170.0,1,85.0,5.141664,4.454347,4
38241,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,1,North St.James Town,43.67231,-79.37594,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Shower gel"", ""Room-darkening shades"", ""Dinin...",28,5.0,5.0,4.67,5.0,5.0,4.67,5.0,126.0,2,63.0,4.844187,4.158883,4
47070,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,1,North St.James Town,43.67231,-79.37594,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Hair dryer"", ""Room-darkening shades"", ""Eleva...",28,5.0,5.0,4.75,5.0,5.0,4.75,5.0,114.0,3,57.0,4.744932,4.060443,4
56033,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,1,North St.James Town,43.67231,-79.37594,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Patio or balcony"", ""Kitchen"", ""Dishwasher"", ...",28,5.0,5.0,4.75,5.0,5.0,4.75,5.0,162.0,4,81.0,5.09375,4.406719,4
65372,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,1,North St.James Town,43.67231,-79.37594,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Clothing storage: closet and dresser"", ""Hair...",28,5.0,5.0,4.75,5.0,5.0,4.75,5.0,153.0,5,76.5,5.036953,4.350278,4
74880,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,1,North St.James Town,43.67231,-79.37594,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Wine glasses"", ""Paid parking on premises"", ""...",28,5.0,5.0,4.75,5.0,5.0,4.75,5.0,157.0,6,78.5,5.062595,4.375757,4
84460,761634524402467898,3610455,Spacious 1 bedroom condo downtown,Fully Furnished & Equipped spacious beautiful ...,0,North St.James Town,43.67231,-79.37594,Entire rental unit,Entire home/apt,2,1.0,1,1,"[""Extra pillows and blankets"", ""Kitchen"", ""Clo...",28,5.0,5.0,4.75,5.0,5.0,4.75,5.0,153.0,7,76.5,5.036953,4.350278,4



Sample data for listing ID 937217418323841680:


Unnamed: 0,id,host_id,name,description,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,month,price_per_person,log_price,log_price_per_person,total_reviews
13779,937217418323841680,389017451,Cheerful 3-Bedrooms and 2 Bathrooms with parking,Reconnect with loved ones in this great place....,1,Woburn,43.75819,-79.23418,Private room in home,Private room,6,2.0,3,3,"[""Free dryer \u2013 In unit"", ""Freezer"", ""Priv...",14,4.94,4.89,4.92,4.97,4.97,4.83,4.83,208.0,11,34.666667,5.342334,3.574217,42
30671,937217418323841680,389017451,Cheerful 3-Bedrooms and 2 Bathrooms with parking,Reconnect with loved ones in this great place....,1,Woburn,43.75819,-79.23418,Private room in home,Private room,6,2.0,3,3,"[""Baking sheet"", ""Bathtub"", ""Free driveway par...",14,4.95,4.9,4.92,4.97,4.97,4.85,4.85,335.0,1,55.833333,5.817111,4.040123,42
57382,937217418323841680,389017451,Cheerful 3-Bedrooms and 2 Bathrooms with parking,Reconnect with loved ones in this great place....,1,Woburn,43.75819,-79.23418,Private room in home,Private room,6,2.0,3,3,"[""Kitchen"", ""Stainless steel oven"", ""Dishes an...",14,4.95,4.9,4.92,4.97,4.97,4.85,4.85,539.0,4,89.833333,6.291569,4.509026,42
66736,937217418323841680,389017451,Cheerful 3-Bedrooms and 2 Bathrooms with parking,Reconnect with loved ones in this great place....,1,Woburn,43.75819,-79.23418,Private room in home,Private room,6,2.0,3,3,"[""Toaster"", ""Free driveway parking on premises...",1,4.95,4.9,4.92,4.97,4.97,4.85,4.85,191.0,5,31.833333,5.257495,3.491444,42
76249,937217418323841680,389017451,Cheerful 3-Bedrooms and 2 Bathrooms with parking,Reconnect with loved ones in this great place....,1,Woburn,43.75819,-79.23418,Private room in home,Private room,6,2.0,3,3,"[""55 inch HDTV"", ""LG electric stove"", ""Stainle...",1,4.95,4.9,4.92,4.97,4.97,4.85,4.85,187.0,6,31.166667,5.236442,3.470931,42
85781,937217418323841680,389017451,Cheerful 3-Bedrooms and 2 Bathrooms with parking,Reconnect with loved ones in this great place....,1,Woburn,43.75819,-79.23418,Private room in home,Private room,6,2.0,3,3,"[""Extra pillows and blankets"", ""Kitchen"", ""Cof...",1,4.95,4.9,4.93,4.98,4.98,4.83,4.83,200.0,7,33.333333,5.303305,3.536117,42



Sample data for listing ID 25843940:


Unnamed: 0,id,host_id,name,description,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,amenities,minimum_nights,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,month,price_per_person,log_price,log_price_per_person,total_reviews
1406,25843940,175529316,Feel Like Home sweet Home,"Room for two single, also available double bed...",0,Weston,43.69803,-79.51321,Private room in rental unit,Private room,2,1.0,1,2,"[""Exterior security cameras on property"", ""Cof...",28,4.25,3.88,3.88,4.75,4.75,4.0,3.75,100.0,10,50.0,4.615121,3.931826,8
17872,25843940,175529316,Feel Like Home sweet Home,"Room for two single, also available double bed...",0,Weston,43.69803,-79.51321,Private room in rental unit,Private room,2,1.0,1,2,"[""Essentials"", ""Dishes and silverware"", ""Host ...",28,4.25,3.88,3.88,4.75,4.75,4.0,3.75,100.0,12,50.0,4.615121,3.931826,8
28153,25843940,175529316,Feel Like Home sweet Home,"Room for two single, also available double bed...",0,Weston,43.69803,-79.51321,Private room in rental unit,Private room,2,1.0,1,2,"[""Breakfast"", ""Private entrance"", ""Dishes and ...",28,4.25,3.88,3.88,4.75,4.75,4.0,3.75,100.0,1,50.0,4.615121,3.931826,8
35339,25843940,175529316,Feel Like Home sweet Home,"Room for two single, also available double bed...",0,Weston,43.69803,-79.51321,Private room in rental unit,Private room,2,1.0,1,2,"[""Carbon monoxide alarm"", ""Exterior security c...",28,4.25,3.88,3.88,4.75,4.75,4.0,3.75,100.0,2,50.0,4.615121,3.931826,8
53135,25843940,175529316,Feel Like Home sweet Home,"Room for two single, also available double bed...",0,Weston,43.69803,-79.51321,Private room in rental unit,Private room,2,1.0,1,2,"[""Kitchen"", ""Dishes and silverware"", ""Paid par...",28,4.25,3.88,3.88,4.75,4.75,4.0,3.75,100.0,4,50.0,4.615121,3.931826,8
71905,25843940,175529316,Feel Like Home sweet Home,"Room for two single, also available double bed...",0,Weston,43.69803,-79.51321,Private room in rental unit,Private room,2,1.0,1,2,"[""Paid parking on premises"", ""Wifi"", ""Kitchen""...",28,4.25,3.88,3.88,4.75,4.75,4.0,3.75,100.0,6,50.0,4.615121,3.931826,8
81418,25843940,175529316,Feel Like Home sweet Home,"Room for two single, also available double bed...",0,Weston,43.69803,-79.51321,Private room in rental unit,Private room,2,1.0,1,2,"[""Kitchen"", ""Lock on bedroom door"", ""Free stre...",28,4.25,3.88,3.88,4.75,4.75,4.0,3.75,100.0,7,50.0,4.615121,3.931826,8
90993,25843940,175529316,Feel Like Home sweet Home,"Room for two single, also available double bed...",0,Weston,43.69803,-79.51321,Private room in rental unit,Private room,2,1.0,1,2,"[""Dedicated workspace"", ""Hot water"", ""Essentia...",28,4.25,3.88,3.88,4.75,4.75,4.0,3.75,100.0,8,50.0,4.615121,3.931826,8
100422,25843940,175529316,Feel Like Home sweet Home,"Room for two single, also available double bed...",0,Weston,43.69803,-79.51321,Private room in rental unit,Private room,2,1.0,1,2,"[""Free parking on premises"", ""Smoke alarm"", ""A...",28,4.25,3.88,3.88,4.75,4.75,4.0,3.75,100.0,9,50.0,4.615121,3.931826,8
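
The review-count merge attaches one `total_reviews` value to every monthly row of a listing. A toy version is sketched below with invented data. The `validate="many_to_one"` argument is an addition not used in the notebook: it makes pandas raise if the right-hand side ever contains duplicate listing IDs, which would otherwise silently multiply rows.

```python
import pandas as pd

# Invented toy data: listing 10 appears in two months, listing 20 in one
listings = pd.DataFrame({"id": [10, 10, 20], "month": [1, 2, 1]})
reviews = pd.DataFrame({"listing_id": [10, 10, 10, 20], "id": [1, 2, 3, 4]})

# One row per listing with its total number of reviews
reviews_count = (
    reviews.groupby("listing_id").size().reset_index(name="total_reviews")
)

merged = listings.merge(
    reviews_count,
    left_on="id",
    right_on="listing_id",
    how="left",
    validate="many_to_one",  # many listing-month rows map to one count row
).drop(columns=["listing_id"])
```

Note that `total_reviews` is the all-time count from the reviews file, so it is constant across a listing's months, as in the sample output above.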


### 6. Finalize & Save Modeling Dataset

In [73]:
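# Save to Parquet (create the output directory first if it does not exist)
os.makedirs(OUTPUT_DATA_DIR, exist_ok=True)
output_path = os.path.join(OUTPUT_DATA_DIR, OUTPUT_FILENAME)
print(f"\nSaving to {output_path}...")
final_df.to_parquet(output_path, index=False)
print("Done.")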


Saving to ../data/toronto/toronto_dataset_oct_17.parquet...
Done.


In [74]:
# Save a sample with all occurrences of 2 random listings
sample_ids = np.random.choice(final_df['id'].unique(), size=2, replace=False)
sample_df = final_df[final_df['id'].isin(sample_ids)]
sample_output_path = os.path.join(OUTPUT_DATA_DIR, f"{CITY}_sample_listings.csv")
sample_df.to_csv(sample_output_path, index=False)
print(f"Sample listings saved to {sample_output_path}.")

Sample listings saved to ../data/toronto/toronto_sample_listings.csv.
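
Since the goal is a `listing-month` panel, one useful sanity check before saving is that each `(id, month)` pair occurs at most once. A minimal sketch on invented data; `duplicated_panel_keys` is a hypothetical helper name.

```python
import pandas as pd


def duplicated_panel_keys(df: pd.DataFrame, keys=("id", "month")) -> pd.DataFrame:
    """Return every row whose key combination occurs more than once."""
    return df[df.duplicated(subset=list(keys), keep=False)]


# Invented toy panel: listing 2 has two rows for month 10
toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "month": [10, 11, 10, 10],
    "price": [100.0, 110.0, 90.0, 95.0],
})
dupes = duplicated_panel_keys(toy)
```

An empty result means the frame is a valid panel on those keys; here `dupes` contains the two conflicting rows for listing 2.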
