# Data Loading

This notebook loads Airbnb listings and reviews data for multiple cities
from the Inside Airbnb dataset. Raw data is not included in the repository
due to size and licensing constraints.


In [None]:
import os
import pandas as pd

## Data Location

Data is stored locally, with one subdirectory per city under a base path.  
The base path defaults to `data/raw`; set the `AIRBNB_DATA_PATH`
environment variable to point it somewhere else.
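
The loader further down reads `listings.csv.gz` and `reviews.csv.gz` from each city's subdirectory, so it can help to confirm that layout before starting a long load. The following is a minimal sketch of such a check; `find_missing_files` and `EXPECTED_FILES` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
import os

# File names the notebook's loader expects inside each city directory.
EXPECTED_FILES = ("listings.csv.gz", "reviews.csv.gz")


def find_missing_files(base_path, cities):
    """Return the paths of expected data files that do not exist.

    Hypothetical helper: checks <base_path>/<city>/<file> for every city
    and every name in EXPECTED_FILES.
    """
    missing = []
    for city in cities:
        for name in EXPECTED_FILES:
            path = os.path.join(base_path, city, name)
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

Once `BASE_PATH` and `cities` are defined, `find_missing_files(BASE_PATH, cities)` returns an empty list when every file is in place, and otherwise lists the paths that still need to be downloaded.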


In [None]:
BASE_PATH = os.getenv("AIRBNB_DATA_PATH", "data/raw")

cities = [
    "new-york-city",
    "los-angeles",
    "san-francisco",
    "chicago",
    "boston",
    "washington-dc",
    "dallas",
    "new-orleans",
    "seattle",
    "austin",
    "london",
    "paris",
    "amsterdam",
    "barcelona",
    "berlin",
    "sydney",
    "rome",
    "madrid",
    "munich",
    "toronto"
]

## Load Data for a Single City

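This function reads the gzip-compressed `listings.csv.gz` and `reviews.csv.gz`
files for a given city and adds a `city` column to each DataFrame, so rows
remain identifiable after the cities are combined.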


In [None]:
def load_city_data(city):
    """Return (listings, reviews) DataFrames for one city, each with a `city` column."""
    city_path = os.path.join(BASE_PATH, city)

    listings_path = os.path.join(city_path, "listings.csv.gz")
    reviews_path = os.path.join(city_path, "reviews.csv.gz")

    # Read compressed files directly
    df_listings = pd.read_csv(listings_path, compression="gzip", low_memory=False)
    df_reviews = pd.read_csv(reviews_path, compression="gzip", low_memory=False)

    # Add city name column
    df_listings["city"] = city
    df_reviews["city"] = city

    return df_listings, df_reviews
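
Because `load_city_data` reads the base path from a global, it is hard to try out without the real dataset. The sketch below is one possible variant that takes the base path as an argument and exercises it on tiny synthetic files. The names `load_city_frames` and `write_demo_city`, the `demo-city` directory, and the `id` / `listing_id` columns are all made up for this smoke test and are not taken from the real data.

```python
import os
import tempfile

import pandas as pd


def load_city_frames(base_path, city):
    """Hypothetical variant of load_city_data with an explicit base path."""
    city_path = os.path.join(base_path, city)
    frames = []
    for name in ("listings.csv.gz", "reviews.csv.gz"):
        df = pd.read_csv(
            os.path.join(city_path, name), compression="gzip", low_memory=False
        )
        df["city"] = city  # tag every row with its source city
        frames.append(df)
    return tuple(frames)


def write_demo_city(base_path, city):
    """Write two tiny synthetic gzip CSVs (made-up columns) for testing."""
    city_path = os.path.join(base_path, city)
    os.makedirs(city_path, exist_ok=True)
    pd.DataFrame({"id": [1, 2]}).to_csv(
        os.path.join(city_path, "listings.csv.gz"), index=False, compression="gzip"
    )
    pd.DataFrame({"listing_id": [1, 1, 2]}).to_csv(
        os.path.join(city_path, "reviews.csv.gz"), index=False, compression="gzip"
    )


# Smoke test against a temporary directory instead of the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    write_demo_city(tmp, "demo-city")
    demo_listings, demo_reviews = load_city_frames(tmp, "demo-city")

print(demo_listings.shape, demo_reviews.shape)
```

Passing the base path explicitly is a design choice for testability only; the notebook's own function works the same way when `AIRBNB_DATA_PATH` points at the real data.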

## Load and Combine All Cities

In [None]:
all_listings = []
all_reviews = []

for city in cities:
    print(f"Loading data for {city}...")
    df_l, df_r = load_city_data(city)
    all_listings.append(df_l)
    all_reviews.append(df_r)

listings = pd.concat(all_listings, ignore_index=True)
reviews = pd.concat(all_reviews, ignore_index=True)

print("Listings shape:", listings.shape)
print("Reviews shape:", reviews.shape)

In [None]:
# Save the combined DataFrames as CSV files in the current working directory

listings.to_csv("combined_listings.csv", index=False)
reviews.to_csv("combined_reviews.csv", index=False)
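
The combined files may be considerably larger than any single city's input, since the inputs are gzip-compressed and the outputs above are not. One option, sketched here as an assumption rather than as part of the original workflow, is to write the outputs compressed as well; `save_compressed` is a hypothetical helper name.

```python
import pandas as pd


def save_compressed(df, path):
    """Write df as a gzip-compressed CSV without the index column."""
    df.to_csv(path, index=False, compression="gzip")
```

For example, `save_compressed(listings, "combined_listings.csv.gz")` produces a file that `pd.read_csv` can read back directly, in the same way the per-city inputs are read.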
