# 02 — Feature Engineering

Transform raw listings and reviews into modelling-ready features:
1. **Sentiment Analysis** — VADER polarity scores on 1.3M+ reviews
2. **Amenity Extraction** — 30+ binary features from free-text amenities
3. **Host Features** — verification methods, response time encoding
4. **Location Features** — Haversine distance to city centre
5. **Date Features** — hosting duration, time to first review
6. **Categorical Encoding** — property type, room type, neighbourhood one-hot encoding
7. **Final Feature Set** — 140 explanatory variables

In [None]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np

from src.data_loader import load_listings, load_reviews
from src.data_cleaning import (
    clean_listings,
    clean_percentage_columns,
    filter_valid_listings,
)
from src.feature_engineering import (
    compute_vader_sentiment,
    aggregate_sentiment_by_listing,
    merge_sentiment,
    create_amenity_dummies,
    encode_host_verifications,
    encode_host_response_time,
    group_rare_property_types,
    one_hot_encode_categoricals,
    compute_date_features,
    compute_city_center_distance,
    log_transform_price,
    drop_text_columns,
)
from src.config import AMENITY_KEYWORDS, URL_AND_REDUNDANT_COLUMNS
from src.visualization import set_style

set_style()

## 1. Load and Clean Data

In [None]:
listings_raw = load_listings()  # Uses DEFAULT_CITY from config
reviews_raw = load_reviews()  # Uses DEFAULT_CITY from config

listings = clean_listings(listings_raw)
print(f'Cleaned listings: {listings.shape[0]:,} rows × {listings.shape[1]} columns')

## 2. Sentiment Analysis (VADER)

Apply VADER sentiment analysis to each review comment, then compute the
mean polarity score per listing.
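
The per-listing aggregation amounts to a group-by mean. Below is a minimal sketch in plain Python, assuming each scored review carries a `listing_id` and a VADER `compound` score in [-1, 1]. The helper name `mean_compound_by_listing` and the sample values are illustrative only and are not the implementation in `src.feature_engineering`.

```python
from collections import defaultdict
from statistics import mean


def mean_compound_by_listing(scored_reviews):
    """Average the compound score of all reviews belonging to each listing."""
    scores = defaultdict(list)
    for review in scored_reviews:
        scores[review["listing_id"]].append(review["compound"])
    return {listing_id: mean(values) for listing_id, values in scores.items()}


# Hypothetical, already-scored reviews (not taken from the dataset).
sample_reviews = [
    {"listing_id": 101, "compound": 0.5},
    {"listing_id": 101, "compound": 1.0},
    {"listing_id": 202, "compound": -0.25},
]
listing_sentiment = mean_compound_by_listing(sample_reviews)
print(listing_sentiment)
```

On a pandas DataFrame the same result would likely come from `reviews_scored.groupby('listing_id')['compound'].mean()`, assuming those column names are present.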

In [None]:
# This cell may take several minutes for 1.3M+ reviews
reviews_scored = compute_vader_sentiment(reviews_raw)
print(f'Reviews scored: {len(reviews_scored):,}')
reviews_scored[['comments', 'neg', 'neu', 'pos', 'compound']].head()

In [None]:
sentiment_agg = aggregate_sentiment_by_listing(reviews_scored)
listings = merge_sentiment(listings, sentiment_agg)
print(f'Listings with sentiment: {listings.shape}')

## 3. Amenity Extraction

Create binary dummy variables for 30+ amenity categories using keyword matching.
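
Keyword matching of this kind can be sketched as a case-insensitive substring test per category. The keyword map and the helper `amenity_flags` below are hypothetical stand-ins; the real categories live in `AMENITY_KEYWORDS`, and `create_amenity_dummies` may match differently.

```python
# Hypothetical keyword map used only for illustration.
EXAMPLE_KEYWORDS = {
    "wifi": ["wifi", "wireless internet"],
    "kitchen": ["kitchen"],
    "parking": ["free parking", "paid parking"],
}


def amenity_flags(amenities_text, keyword_map):
    """Return a 0/1 flag per category: 1 if any keyword occurs in the text."""
    text = amenities_text.lower()
    return {
        name: int(any(keyword in text for keyword in keywords))
        for name, keywords in keyword_map.items()
    }


flags = amenity_flags('["Wifi", "Kitchen", "Hair dryer"]', EXAMPLE_KEYWORDS)
print(flags)
```

Plain substring matching is simple but can over-match (for example, `kitchen` also matches `kitchenette`), so keyword lists need to be chosen with that in mind.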

In [None]:
print(f'Amenity categories: {len(AMENITY_KEYWORDS)}')
for name, keywords in list(AMENITY_KEYWORDS.items())[:5]:
    print(f'  {name}: {keywords[:3]}...')

In [None]:
listings = create_amenity_dummies(listings)
amenity_cols = list(AMENITY_KEYWORDS.keys())
print('Amenity feature coverage:')
print(listings[amenity_cols].sum().sort_values(ascending=False))

## 4. Host Feature Encoding

In [None]:
listings = encode_host_verifications(listings)
listings = encode_host_response_time(listings)
print('Host features added:', ['email_verified', 'phone_verified',
      'host_response_within_hour', 'host_response_few_hours',
      'host_response_within_day', 'host_response_few_days'])

## 5. Property Type Grouping & Categorical Encoding
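
As a rough illustration of the two steps in this section, the sketch below collapses infrequent categories into a single label and then one-hot encodes the result. The threshold `min_count`, the label `Other`, and both helper names are assumptions made for the example; `group_rare_property_types` and `one_hot_encode_categoricals` may use different rules.

```python
from collections import Counter


def group_rare(values, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


def one_hot(values, prefix):
    """Build one 0/1 indicator per distinct category for every row."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]


# Hypothetical property types, not taken from the dataset.
property_types = ["Apartment", "Apartment", "House", "House", "Yurt", "Treehouse"]
grouped = group_rare(property_types, min_count=2)
encoded = one_hot(grouped, prefix="property_type")
print(grouped)
print(encoded[4])
```

Grouping before encoding keeps the number of dummy columns small and avoids indicators that are non-zero for only a handful of listings.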

In [None]:
listings = group_rare_property_types(listings)
print('Property types after grouping:')
print(listings['property_type'].value_counts())

In [None]:
listings = one_hot_encode_categoricals(listings)
print(f'Columns after one-hot encoding: {listings.shape[1]}')

## 6. Date Features & Distance
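
The distance feature is described in the overview as a Haversine distance to the city centre. A minimal sketch of that formula follows, assuming coordinates in decimal degrees and a result in kilometres; the function name `haversine_km` is illustrative, and the units used by `compute_city_center_distance` are not confirmed here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator, as a sanity check.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
print(round(one_degree, 2))
```

The formula treats the Earth as a sphere, which is normally accurate enough for distances within a single city.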

In [None]:
# Log-transform price first, then derive the date and distance features
listings = log_transform_price(listings)
listings = compute_date_features(listings)
listings = compute_city_center_distance(listings)  # Uses DEFAULT_CITY from config

new_features = ['hosting_duration_days', 'joining_to_first_review_duration', 'distance_to_city_center']
print('New features:', ', '.join(new_features))
listings[new_features].describe()

## 7. Final Cleanup

In [None]:
listings = drop_text_columns(listings)
listings = clean_percentage_columns(listings)

print(f'\nFinal feature set: {listings.shape[1]} columns × {listings.shape[0]:,} rows')
print('\nColumn names:')
print(listings.columns.tolist())

In [None]:
listings.describe()