# Airbnb London – Raw Data Exploration & Monthly Processing

This notebook explores the raw Airbnb London listings and reviews files and
prepares month-specific processed files (November and December 2024) for use
in an AWS-based data lake architecture.

## Dataset Access & Project Setup

In [1]:
from pathlib import Path

# Base directories
PROJECTS_DIR = Path.home() / "projects"
DATA_DIR = PROJECTS_DIR / "data"
RAW_DIR = DATA_DIR / "raw"

RAW_DIR

PosixPath('/home/daniel/projects/data/raw')

In [2]:
list(RAW_DIR.iterdir())

[PosixPath('/home/daniel/projects/data/raw/reviews.csv.gz'),
 PosixPath('/home/daniel/projects/data/raw/listings.csv.gz'),
 PosixPath('/home/daniel/projects/data/raw/calendar.csv.gz')]

In [3]:
listings_path = RAW_DIR / "listings.csv.gz"
reviews_path = RAW_DIR / "reviews.csv.gz"
calendar_path = RAW_DIR / "calendar.csv.gz"

listings_path, reviews_path, calendar_path

(PosixPath('/home/daniel/projects/data/raw/listings.csv.gz'),
 PosixPath('/home/daniel/projects/data/raw/reviews.csv.gz'),
 PosixPath('/home/daniel/projects/data/raw/calendar.csv.gz'))

## Raw Data Exploration

In [4]:
import pandas as pd

listings_sample = pd.read_csv(
    listings_path,
    compression="gzip",
    nrows=2000
)

listings_sample.head()

|   | id | listing_url | scrape_id | last_scraped | source | name | description | neighborhood_overview | picture_url | host_id | ... | review_scores_communication | review_scores_location | review_scores_value | license | instant_bookable | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13913 | https://www.airbnb.com/rooms/13913 | 20250914034649 | 2025-09-16 | city scrape | Holiday London DB Room Let-on going | My bright double bedroom with a large window h... | Finsbury Park is a friendly melting pot commun... | https://a0.muscache.com/pictures/miso/Hosting-... | 54730 | ... | 4.87 | 4.78 | 4.78 |  | f | 2 | 1 | 1 | 0 | 0.3 |
| 1 | 15400 | https://www.airbnb.com/rooms/15400 | 20250914034649 | 2025-09-16 | city scrape | Bright Chelsea Apartment. Chelsea! | Lots of windows and light. St Luke's Gardens ... | It is Chelsea. | https://a0.muscache.com/pictures/428392/462d26... | 60302 | ... | 4.84 | 4.93 | 4.74 |  | f | 1 | 1 | 0 | 0 | 0.51 |
| 2 | 17402 | https://www.airbnb.com/rooms/17402 | 20250914034649 | 2025-09-16 | city scrape | Very Central Modern 3-Bed/2 Bath By Oxford St W1 | You'll have a great time in this beautiful, cl... | Fitzrovia is a very desirable trendy, arty and... | https://a0.muscache.com/pictures/39d5309d-fba7... | 67564 | ... | 4.72 | 4.89 | 4.61 |  | f | 2 | 2 | 0 | 0 | 0.32 |
| 3 | 24328 | https://www.airbnb.com/rooms/24328 | 20250914034649 | 2025-09-18 | previous scrape | Battersea live/work artist house | Artist house by SW Battersea Park, bright high... | - Battersea is a quiet family area, easy acces... | https://a0.muscache.com/pictures/9194b40f-c627... | 41759 | ... | 4.93 | 4.6 | 4.65 |  | f | 1 | 1 | 0 | 0 | 0.53 |
| 4 | 36274 | https://www.airbnb.com/rooms/36274 | 20250914034649 | 2025-09-15 | city scrape | Bright 1 bedroom apt off brick lane in Shoreditch | *Update June '25- Pump Installed to improve wa... |  | https://a0.muscache.com/pictures/hosting/Hosti... | 133271 | ... | 4.46 | 4.85 | 4.54 |  | t | 2 | 2 | 0 | 0 | 0.09 |


In [5]:
listings_sample.shape

(2000, 79)

In [6]:
listings_sample.columns.tolist()

['id',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'source',
 'name',
 'description',
 'neighborhood_overview',
 'picture_url',
 'host_id',
 'host_url',
 'host_name',
 'host_since',
 'host_location',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'host_is_superhost',
 'host_thumbnail_url',
 'host_picture_url',
 'host_neighbourhood',
 'host_listings_count',
 'host_total_listings_count',
 'host_verifications',
 'host_has_profile_pic',
 'host_identity_verified',
 'neighbourhood',
 'neighbourhood_cleansed',
 'neighbourhood_group_cleansed',
 'latitude',
 'longitude',
 'property_type',
 'room_type',
 'accommodates',
 'bathrooms',
 'bathrooms_text',
 'bedrooms',
 'beds',
 'amenities',
 'price',
 'minimum_nights',
 'maximum_nights',
 'minimum_minimum_nights',
 'maximum_minimum_nights',
 'minimum_maximum_nights',
 'maximum_maximum_nights',
 'minimum_nights_avg_ntm',
 'maximum_nights_avg_ntm',
 'calendar_updated',
 'has_availability',
 'availability_30

## Reviews Dataset – Time-Based Filtering

In [7]:
reviews_sample = pd.read_csv(
    reviews_path,
    compression="gzip",
    nrows=2000,
    parse_dates=["date"]
)

reviews_sample.head()

|   | listing_id | id | date | reviewer_id | reviewer_name | comments |
|---|---|---|---|---|---|---|
| 0 | 13913 | 80770 | 2010-08-18 | 177109 | Michael | My girlfriend and I hadn't known Alina before ... |
| 1 | 13913 | 367568 | 2011-07-11 | 19835707 | Mathias | Alina was a really good host. The flat is clea... |
| 2 | 13913 | 529579 | 2011-09-13 | 1110304 | Kristin | Alina is an amazing host. She made me feel rig... |
| 3 | 13913 | 595481 | 2011-10-03 | 1216358 | Camilla | Alina's place is so nice, the room is big and ... |
| 4 | 13913 | 612947 | 2011-10-09 | 490840 | Jorik | Nice location in Islington area, good for shor... |


In [8]:
reviews_sample["date"].min(), reviews_sample["date"].max()

(Timestamp('2009-12-21 00:00:00'), Timestamp('2025-09-12 00:00:00'))

In [9]:
reviews_sample["date"].dt.to_period("M").value_counts().sort_index().tail(24)

date
2023-10    22
2023-11    20
2023-12    16
2024-01    12
2024-02    16
2024-03    21
2024-04    21
2024-05    25
2024-06    19
2024-07    19
2024-08    14
2024-09    19
2024-10    24
2024-11    25
2024-12    24
2025-01    15
2025-02    15
2025-03    18
2025-04    16
2025-05    20
2025-06    25
2025-07    24
2025-08    16
2025-09     4
Freq: M, Name: count, dtype: int64
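
The counts above come only from the first 2,000 rows of the reviews file (`nrows=2000`), so they describe a handful of listings rather than the full dataset. As a rough sketch, the same monthly distribution could be computed over the whole file by reading it in chunks. The helper name `monthly_review_counts` and the default chunk size are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def monthly_review_counts(path, chunksize=500_000):
    """Count reviews per calendar month across the whole gzip CSV, chunk by chunk."""
    parts = []
    for chunk in pd.read_csv(
        path,
        compression="gzip",
        usecols=["date"],      # only the date column is needed for counting
        parse_dates=["date"],
        chunksize=chunksize,
    ):
        if chunk.empty:
            continue
        # Per-chunk counts, indexed by monthly Period (e.g. 2024-11).
        parts.append(chunk["date"].dt.to_period("M").value_counts())

    if not parts:
        return pd.Series(dtype="int64")

    # A month can appear in several chunks, so sum the partial counts per month.
    return pd.concat(parts).groupby(level=0).sum().sort_index()
```

Calling `monthly_review_counts(reviews_path).tail(24)` would then give a full-file equivalent of the sample-based table.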

## Monthly Dataset Processing (Raw → Processed)

In [10]:
import shutil

PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

out_listings_nov = PROCESSED_DIR / "2024-11-listings.csv.gz"
out_listings_dec = PROCESSED_DIR / "2024-12-listings.csv.gz"

# The listings file is a single snapshot (scraped in September 2025 according
# to `last_scraped`), not time-partitioned, so the same file is copied
# unchanged under both month names.
shutil.copy2(listings_path, out_listings_nov)
shutil.copy2(listings_path, out_listings_dec)

out_listings_nov, out_listings_dec

(PosixPath('/home/daniel/projects/data/processed/2024-11-listings.csv.gz'),
 PosixPath('/home/daniel/projects/data/processed/2024-12-listings.csv.gz'))
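
Since both monthly listings files are meant to be byte-for-byte copies of the raw snapshot, a checksum comparison is one way to confirm that the copy step worked. The sketch below uses only the standard library; the helper name `sha256_of` and the block size are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read until fh.read() returns an empty bytes object (end of file).
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

In the notebook this could be used as `sha256_of(listings_path) == sha256_of(out_listings_nov)`, which should be `True` for an intact copy.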

In [11]:
out_reviews_nov = PROCESSED_DIR / "2024-11-reviews.csv.gz"

# If file exists from earlier runs, remove it for a clean write
if out_reviews_nov.exists():
    out_reviews_nov.unlink()

first_write = True

for chunk in pd.read_csv(
    reviews_path,
    compression="gzip",
    parse_dates=["date"],
    chunksize=500_000,
):
    nov = chunk[(chunk["date"] >= "2024-11-01") & (chunk["date"] < "2024-12-01")]
    if not nov.empty:
        nov.to_csv(
            out_reviews_nov,
            mode="a",
            index=False,
            header=first_write,
            compression="gzip",
        )
        first_write = False

out_reviews_nov

PosixPath('/home/daniel/projects/data/processed/2024-11-reviews.csv.gz')

In [12]:
out_reviews_dec = PROCESSED_DIR / "2024-12-reviews.csv.gz"

# If file exists from earlier runs, remove it for a clean write
if out_reviews_dec.exists():
    out_reviews_dec.unlink()

first_write = True

for chunk in pd.read_csv(
    reviews_path,
    compression="gzip",
    parse_dates=["date"],
    chunksize=500_000,
):
    dec = chunk[(chunk["date"] >= "2024-12-01") & (chunk["date"] < "2025-01-01")]
    if not dec.empty:
        dec.to_csv(
            out_reviews_dec,
            mode="a",
            index=False,
            header=first_write,
            compression="gzip",
        )
        first_write = False

out_reviews_dec

PosixPath('/home/daniel/projects/data/processed/2024-12-reviews.csv.gz')
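
Cells 11 and 12 each decompress and scan the full reviews file once. A possible alternative, sketched here under the same append-to-gzip approach, is to write all requested months in a single pass. The function name `split_reviews_by_month` and its parameters are hypothetical; the output file naming follows the `YYYY-MM-reviews.csv.gz` pattern used above.

```python
from pathlib import Path

import pandas as pd


def split_reviews_by_month(reviews_path, out_dir, months, chunksize=500_000):
    """Write one gzip CSV per requested month in a single pass over the raw file.

    `months` is an iterable of "YYYY-MM" strings, e.g. ["2024-11", "2024-12"].
    Returns a dict mapping each month to its output path.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = {m: out_dir / f"{m}-reviews.csv.gz" for m in months}

    # Remove leftovers from earlier runs so appends start from a clean file.
    for path in outputs.values():
        path.unlink(missing_ok=True)

    written = set()  # months that already have a header row on disk
    for chunk in pd.read_csv(
        reviews_path,
        compression="gzip",
        parse_dates=["date"],
        chunksize=chunksize,
    ):
        # Label every row with its month as a "YYYY-MM" string.
        period = chunk["date"].dt.to_period("M").astype(str)
        for month, path in outputs.items():
            rows = chunk[period == month]
            if rows.empty:
                continue
            rows.to_csv(
                path,
                mode="a",
                index=False,
                header=month not in written,  # header only on the first write
                compression="gzip",
            )
            written.add(month)
    return outputs
```

For the two months in this notebook the call would look like `split_reviews_by_month(reviews_path, PROCESSED_DIR, ["2024-11", "2024-12"])`. As in the original cells, a month with no matching rows produces no file.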

In [13]:
sorted(PROCESSED_DIR.iterdir())

[PosixPath('/home/daniel/projects/data/processed/2024-11-listings.csv.gz'),
 PosixPath('/home/daniel/projects/data/processed/2024-11-reviews.csv.gz'),
 PosixPath('/home/daniel/projects/data/processed/2024-12-listings.csv.gz'),
 PosixPath('/home/daniel/projects/data/processed/2024-12-reviews.csv.gz')]
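
Listing the directory confirms that the files exist, but not that their contents are right. A minimal sanity check, assuming each processed reviews file keeps the raw `date` column, could read a file back and verify that every row falls in the expected month. The helper name `check_month_file` is an illustrative assumption.

```python
import pandas as pd


def check_month_file(path, month):
    """Return the row count of a processed reviews file.

    Raises ValueError if any row's date falls outside `month` ("YYYY-MM").
    """
    df = pd.read_csv(path, compression="gzip")
    # Parsing here also fails loudly if a stray header row was appended mid-file.
    periods = pd.to_datetime(df["date"]).dt.to_period("M")
    outside = df[periods != pd.Period(month, freq="M")]
    if not outside.empty:
        raise ValueError(f"{len(outside)} rows in {path} fall outside {month}")
    return len(df)
```

For example, `check_month_file(out_reviews_nov, "2024-11")` would return the number of November reviews written, or raise if the filter let other months through.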

### Processed Dataset Summary

The following processed datasets were created from the raw Airbnb data:

- Listings files for November and December 2024: identical copies of the single raw snapshot, renamed per month
- Reviews filtered by month for November and December 2024

These datasets will be used in subsequent steps to build and query an AWS-based data lake.