# Feature Engineering

This notebook constructs city–holiday level features used for clustering
holiday travel experiences. Features combine:

- Sentiment polarity extracted from Airbnb review text
- Emotional signals derived from lexicon-based analysis
- Market characteristics from Airbnb listings


In [1]:
import pandas as pd
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nrclex import NRCLex
from datetime import datetime


## Load Combined Data

This notebook assumes that the combined listings and reviews datasets
were already generated during the data loading step.

The CSV files read below are local outputs of that step
and are not committed to the repository.



In [2]:
listings = pd.read_csv("combined_listings.csv", low_memory=False)
reviews = pd.read_csv("combined_reviews.csv", low_memory=False)

reviews["date"] = pd.to_datetime(reviews["date"], errors="coerce")

print("Listings:", listings.shape)
print("Reviews:", reviews.shape)


Listings: (477433, 80)
Reviews: (17779163, 7)


In [3]:
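# Fixed-date holiday windows as ((start_month, start_day), (end_month, end_day)),
# inclusive on both ends. A window may wrap past Dec 31 (see new_year).
HOLIDAYS = {
    "christmas": ((12, 20), (12, 27)),
    "new_year": ((12, 28), (1, 3)),
    "valentines": ((2, 12), (2, 17)),
    "halloween": ((10, 25), (11, 2))
}

# Western Easter Sunday as (month, day), keyed by year.
EASTER_DATES = {
    2015: (4, 5),
    2016: (3, 27),
    2017: (4, 16),
    2018: (4, 1),
    2019: (4, 21),
    2020: (4, 12),
    2021: (4, 4),
    2022: (4, 17),
    2023: (4, 9),
    2024: (3, 31),
    2025: (4, 20)
}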

In [4]:
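def in_holiday_window(date, start, end):
    """Return True if `date` falls inside the (month, day) window, inclusive."""
    if pd.isna(date):
        return False

    start_month, start_day = start
    end_month, end_day = end

    # Map every date onto leap year 2000 so windows compare across years
    # and Feb 29 stays valid.
    d = date.replace(year=2000)

    start_date = datetime(2000, start_month, start_day)
    end_date = datetime(2000, end_month, end_day)

    if start_month <= end_month:
        return start_date <= d <= end_date
    else:
        # The window wraps past Dec 31 (e.g. new_year), so match either side.
        return d >= start_date or d <= end_date


def is_easter_window(date):
    """Return True if `date` is within 3 days of Easter Sunday in its year."""
    if pd.isna(date) or date.year not in EASTER_DATES:
        return False

    em, ed = EASTER_DATES[date.year]
    easter_date = datetime(date.year, em, ed)

    return abs((date - easter_date).days) <= 3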
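
Calling these helpers row by row through `apply` may be slow on a reviews table of this size. As a hedged alternative, the sketch below expresses the fixed-window test as a vectorized mask. It is an illustration rather than the notebook's method: `window_mask` is a hypothetical name, and it compares a `month * 100 + day` key instead of building `datetime` objects.

```python
import pandas as pd


def window_mask(dates: pd.Series, start, end) -> pd.Series:
    """Boolean mask: True where the (month, day) of `dates` lies in [start, end]."""
    # Encode month and day as one sortable number, e.g. Dec 28 -> 1228.
    key = dates.dt.month * 100 + dates.dt.day
    lo = start[0] * 100 + start[1]
    hi = end[0] * 100 + end[1]
    if lo <= hi:
        return (key >= lo) & (key <= hi)
    # The window wraps past Dec 31, so a date matches on either side of it.
    return (key >= lo) | (key <= hi)


# Made-up dates for illustration; None becomes NaT and never matches.
dates = pd.Series(pd.to_datetime(["2019-12-30", "2020-01-02", "2020-01-04", None]))
new_year_mask = window_mask(dates, (12, 28), (1, 3))
```

Missing dates yield a NaN key, and every comparison against NaN is False, which mirrors the `pd.isna` guard in the row-wise version.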


In [5]:
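reviews["holiday"] = None

# Label reviews that fall inside each fixed-date holiday window.
for holiday, (start, end) in HOLIDAYS.items():
    mask = reviews["date"].apply(lambda x: in_holiday_window(x, start, end))
    reviews.loc[mask, "holiday"] = holiday

# Easter moves each year, so it is matched against EASTER_DATES.
reviews.loc[reviews["date"].apply(is_easter_window), "holiday"] = "easter"

# Keep only holiday reviews. The explicit .copy() avoids a
# SettingWithCopyWarning when score columns are added later.
holiday_reviews = reviews.dropna(subset=["holiday"]).copy()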

print("Holiday reviews:", holiday_reviews.shape)
holiday_reviews["holiday"].value_counts()


Holiday reviews: (1678542, 8)


holiday
halloween     436214
new_year      377653
easter        367825
christmas     257370
valentines    239480
Name: count, dtype: int64

## Sentiment Feature Extraction

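Sentiment polarity is computed with the VADER sentiment analyzer, and
emotion signals (joy, sadness, anger, fear, trust, anticipation) are
taken from NRCLex raw emotion counts.

Both are summarized at the city–holiday level: the mean and standard
deviation of the VADER compound score, the proportions of positive
(compound > 0.05), negative (compound < -0.05), and neutral reviews,
the mean of each emotion count, the average review length, and the
review count.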
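
To make the positive, neutral, and negative split concrete, here is a minimal standard-library sketch of the ±0.05 cutoff on the compound score that the aggregation cell applies. The function name `sentiment_label` and the scores are made up for illustration.

```python
from collections import Counter


def sentiment_label(compound: float, threshold: float = 0.05) -> str:
    """Bucket a compound score with strict inequalities at +/- threshold."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"


# Made-up compound scores, including one exactly on the boundary (0.05).
scores = [0.62, -0.40, 0.0, 0.05, 0.91]
counts = Counter(sentiment_label(s) for s in scores)
shares = {label: n / len(scores) for label, n in counts.items()}
```

Because the comparisons are strict, a score of exactly 0.05 or -0.05 counts as neutral, so the three shares always sum to 1.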


In [6]:
analyzer = SentimentIntensityAnalyzer()

def vader_scores(text):
    """Return VADER compound, pos, neu, neg scores; zeros for non-text."""
    if not isinstance(text, str):
        return pd.Series([0, 0, 0, 0])
    s = analyzer.polarity_scores(text)
    return pd.Series([s["compound"], s["pos"], s["neu"], s["neg"]])


def emotion_scores(text):
    """Return NRCLex raw counts for six emotions; zeros for non-text."""
    if not isinstance(text, str):
        return pd.Series([0]*6)
    e = NRCLex(text).raw_emotion_scores
    return pd.Series([
        e.get("joy", 0),
        e.get("sadness", 0),
        e.get("anger", 0),
        e.get("fear", 0),
        e.get("trust", 0),
        e.get("anticipation", 0)
    ])

# Score every holiday review; each helper returns one row of new columns.
holiday_reviews[["compound","pos","neu","neg"]] = holiday_reviews["comments"].apply(vader_scores)
holiday_reviews[["joy","sadness","anger","fear","trust","anticipation"]] = holiday_reviews["comments"].apply(emotion_scores)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  holiday_reviews[["compound","pos","neu","neg"]] = holiday_reviews["comments"].apply(vader_scores)
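
The warning above is raised because `holiday_reviews` was produced by filtering `reviews`, so pandas cannot tell whether the assignment targets a copy. One way to avoid it, sketched below under assumptions, is to take an explicit copy and build the score columns in a separate frame aligned on the index. `toy_scorer` is a hypothetical stand-in for `vader_scores`, and the data is made up.

```python
import pandas as pd


def toy_scorer(text):
    """Hypothetical stand-in for a real scorer: (word count, character count)."""
    if not isinstance(text, str):
        return (0, 0)
    return (len(text.split()), len(text))


reviews = pd.DataFrame({
    "holiday": ["easter", None, "christmas"],
    "comments": ["Great stay", "ignored", None],
})

# Explicit copy, so later column assignments target an independent frame.
holiday_reviews = reviews.dropna(subset=["holiday"]).copy()

# Build all score columns at once from plain tuples, keeping the original index.
scored = pd.DataFrame(
    [toy_scorer(t) for t in holiday_reviews["comments"]],
    columns=["n_words", "n_chars"],
    index=holiday_reviews.index,
)
holiday_reviews = holiday_reviews.join(scored)
```

Returning tuples and constructing one frame avoids creating a `pd.Series` per review, which may also reduce runtime on large inputs.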

In [7]:
sentiment_agg = holiday_reviews.groupby(["city","holiday"]).agg(
    mean_compound=("compound","mean"),
    std_compound=("compound","std"),
    pct_positive=("compound", lambda x: (x > 0.05).mean()),
    pct_negative=("compound", lambda x: (x < -0.05).mean()),
    pct_neutral=("compound", lambda x: ((x >= -0.05) & (x <= 0.05)).mean()),
    avg_review_length=("comments", lambda x: x.str.len().mean()),
    joy=("joy","mean"),
    sadness=("sadness","mean"),
    anger=("anger","mean"),
    fear=("fear","mean"),
    trust=("trust","mean"),
    anticipation=("anticipation","mean"),
    review_count=("compound","count")
).reset_index()

sentiment_agg.head()


| | city | holiday | mean_compound | std_compound | pct_positive | pct_negative | pct_neutral | avg_review_length | joy | sadness | anger | fear | trust | anticipation | review_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | amsterdam | christmas | 0.613105 | 0.458714 | 0.777466 | 0.063581 | 0.158953 | 224.391149 | 1.172666 | 0.293128 | 0.081346 | 0.147109 | 1.441951 | 0.935172 | 6417 |
| 1 | amsterdam | easter | 0.585709 | 0.513567 | 0.760092 | 0.087132 | 0.152777 | 265.079232 | 1.299866 | 0.376254 | 0.071333 | 0.225215 | 1.57074 | 1.050241 | 12659 |
| 2 | amsterdam | halloween | 0.601028 | 0.487607 | 0.770199 | 0.076295 | 0.153506 | 250.099725 | 1.235221 | 0.341616 | 0.078629 | 0.188527 | 1.499541 | 0.976486 | 11993 |
| 3 | amsterdam | new_year | 0.572969 | 0.498455 | 0.745284 | 0.081325 | 0.173391 | 250.414749 | 1.156332 | 0.305443 | 0.070313 | 0.20065 | 1.451033 | 0.942504 | 11079 |
| 4 | amsterdam | valentines | 0.600796 | 0.466091 | 0.767739 | 0.068551 | 0.16371 | 225.520896 | 1.158749 | 0.305171 | 0.063139 | 0.146572 | 1.425135 | 0.897625 | 6652 |


## Market Features

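Market features capture pricing behavior, availability, and listing
composition for each city. They are aggregated over all of a city's
listings rather than per holiday, so every holiday row for a city
carries the same market values. These features provide context
beyond sentiment and help distinguish markets that are emotionally
similar but structurally different.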


In [8]:
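# Strip currency symbols and thousands separators, keeping digits and the decimal point.
listings["price_clean"] = (
    listings["price"].astype(str)
    .str.replace(r"[^\d.]", "", regex=True)
)

# Missing or unparseable prices become NaN.
listings["price_clean"] = pd.to_numeric(listings["price_clean"], errors="coerce")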

market_agg = listings.groupby("city").agg(
    avg_price=("price_clean","mean"),
    price_std=("price_clean","std"),
    avg_min_nights=("minimum_nights","mean"),
    avg_availability=("availability_365","mean"),
    listing_count=("id","count"),
    pct_entire_home=("room_type", lambda x: (x == "Entire home/apt").mean())
).reset_index()

market_agg.head()


| | city | avg_price | price_std | avg_min_nights | avg_availability | listing_count | pct_entire_home |
|---|---|---|---|---|---|---|---|
| 0 | amsterdam | 336.785155 | 1985.661882 | 4.390267 | 93.999809 | 10480 | 0.816889 |
| 1 | austin | 386.470583 | 2620.198322 | 7.888984 | 174.44301 | 15187 | 0.811615 |
| 2 | barcelona | 187.312713 | 363.96717 | 15.953478 | 195.09119 | 19410 | 0.607367 |
| 3 | berlin | 201.240393 | 1656.989769 | 39.965532 | 146.82955 | 14274 | 0.676965 |
| 4 | boston | 769.980605 | 4848.036564 | 26.159991 | 204.099796 | 4419 | 0.676624 |
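
The price cleaning in the cell above relies on the regular expression `[^\d.]`, which deletes every character except digits and the decimal point. The standard-library sketch below shows its effect on a few made-up price strings. `clean_price` is a hypothetical helper, and it returns `None` where the notebook's `pd.to_numeric(..., errors="coerce")` would produce NaN.

```python
import re


def clean_price(raw):
    """Strip everything but digits and '.', then parse as float (None on failure)."""
    digits = re.sub(r"[^\d.]", "", str(raw))
    try:
        return float(digits)
    except ValueError:
        # Nothing numeric is left, e.g. for missing values or "N/A".
        return None


# Made-up inputs for illustration.
examples = ["$1,250.00", "85", None, "N/A"]
cleaned = [clean_price(p) for p in examples]
```

One caveat of this pattern: it assumes "." is the decimal separator, so a price formatted as `1.250,00` would be read as 1.25.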


In [9]:
city_holiday_features = sentiment_agg.merge(market_agg, on="city", how="left")

city_holiday_features.to_csv("city_holiday_features.csv", index=False)

city_holiday_features.shape


(100, 21)
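
Since each city should contribute exactly one row of market features, the merge can be guarded with pandas' `validate` and `indicator` options. The sketch below uses small made-up frames, not the notebook's data, to show the pattern.

```python
import pandas as pd

# Made-up stand-ins for sentiment_agg and market_agg.
sentiment = pd.DataFrame({
    "city": ["amsterdam", "amsterdam", "austin"],
    "holiday": ["easter", "christmas", "easter"],
    "mean_compound": [0.59, 0.61, 0.60],
})
market = pd.DataFrame({
    "city": ["amsterdam", "austin"],
    "avg_price": [300.0, 350.0],
})

merged = sentiment.merge(
    market,
    on="city",
    how="left",
    validate="many_to_one",  # raises MergeError if a city repeats in `market`
    indicator=True,          # adds a `_merge` column describing each row's source
)

# Rows whose city had no market features at all.
unmatched = merged[merged["_merge"] == "left_only"]
```

An empty `unmatched` frame indicates that every city–holiday row found its market features, and the row count of `merged` equals that of the sentiment table.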