# Feature engineering
Feature engineering enriches the data. The most useful features are not always the ones the dataset provides, so we often create new variables or transform existing ones to help the model learn.

## Why is it important? 
- A model's performance often depends more on the features than on the algorithm.
- Good feature engineering = better predictive performance + easier interpretability.

## Types of Features to Create and How to Approach Them
- Creating new features (e.g., from date: month, season, weekend)
- Encoding categories (one-hot encoding, label encoding)
- Scaling/normalization (so that variables are of similar magnitude)
- Processing text variables (e.g., description → length, sentiment)
- Feature selection: removing irrelevant/redundant features (e.g., highly correlated variables)
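
As a rough illustration of the first three items in the list above, the sketch below derives date features, one-hot encodes a category, and min-max scales a numeric column. The DataFrame and its column names (`booking_date`, `room_type`, `nightly_price`) are made-up toy data, not columns from the notebook's dataset.

```python
import pandas as pd

# Toy data for illustration only; not taken from the Airbnb dataset.
toy = pd.DataFrame({
    "booking_date": pd.to_datetime(["2024-01-06", "2024-07-15", "2024-10-20"]),
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "nightly_price": [70.0, 284.0, 120.0],
})

# New features derived from a date column.
toy["month"] = toy["booking_date"].dt.month
toy["is_weekend"] = toy["booking_date"].dt.dayofweek >= 5  # Saturday=5, Sunday=6

# One-hot encoding of a categorical column.
toy = pd.get_dummies(toy, columns=["room_type"], prefix="room")

# Min-max scaling to the [0, 1] range.
price = toy["nightly_price"]
toy["nightly_price_scaled"] = (price - price.min()) / (price.max() - price.min())

print(toy)
```

Min-max scaling is only one option; standardization (subtracting the mean and dividing by the standard deviation) is a common alternative when outliers are present.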

### Setting Up Libraries and Environment for Feature Engineering

In [1]:
import os
import sys
import re
import nltk

import pandas as pd
import numpy as np
from dotenv import load_dotenv
from nltk.sentiment.vader import SentimentIntensityAnalyzer

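# VADER needs its lexicon; fetch it once if it is not already installed.
nltk.download('vader_lexicon', quiet=True)
sia = SentimentIntensityAnalyzer()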

pd.set_option('display.max_columns', None) 

project_root = "/Users/erikvida/PycharmProjects/airbnb-price-prediction"
if project_root not in sys.path:
    sys.path.append(project_root)

from src.db_connection import DatabaseConfig, DatabaseConnection


dotenv_path = os.path.join(project_root, ".env")
load_dotenv(dotenv_path)

True

### 1.0 Loading Data and Initial Overview for Processing

In [2]:
amsterdams_airbnbs_cleaned_data = pd.read_csv("../data/cleaned/amsterdam_airbnbs_clean_data.csv")
df = amsterdams_airbnbs_cleaned_data
df.head()

Unnamed: 0,id,name,description,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,neighbourhood,neighbourhood_cleansed,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,number_of_reviews_ly,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable
0,27886,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,100%,98%,t,1.0,1.0,t,"Amsterdam, North Holland, Netherlands",Centrum-West,Private room in houseboat,Private room,2,1.5,1.5 baths,1.0,1.0,"[""Coffee maker: Nespresso"", ""Shampoo"", ""Paid s...",$132.00,3,356,3,3,30,30,3.0,30.0,302,28,1,26,4.92,4.9,4.94,4.95,4.92,4.9,4.78,f
1,28871,Comfortable double room,Basic bedroom in the center of Amsterdam.,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",Centrum-West,Private room in rental unit,Private room,2,1.0,1 shared bath,1.0,1.0,"[""Carbon monoxide alarm"", ""Wifi"", ""Heating"", ""...",$78.00,2,730,1,2,730,730,2.0,730.0,710,93,9,96,4.88,4.9,4.87,4.94,4.94,4.94,4.84,f
2,29051,Comfortable single / double room,This room can also be rented as a single or a ...,100%,99%,t,2.0,2.0,t,"Amsterdam, North Holland, Netherlands",Centrum-Oost,Private room in condo,Private room,2,1.0,1 shared bath,1.0,1.0,"[""Carbon monoxide alarm"", ""Wifi"", ""Heating"", ""...",$70.00,2,730,1,2,730,730,2.0,730.0,822,86,7,88,4.81,4.88,4.83,4.93,4.92,4.87,4.79,f
3,47061,Charming apartment in old centre,"A beautiful, quiet apartment in the center of ...",100%,50%,f,1.0,2.0,t,"Amsterdam, Noord-Holland, Netherlands",De Baarsjes - Oud-West,Entire rental unit,Entire home/apt,3,1.5,1.5 baths,2.0,2.0,"[""Shampoo"", ""Paid street parking off premises""...",$120.00,2,20,2,2,20,20,2.0,20.0,203,5,1,6,4.77,4.78,4.61,4.76,4.9,4.85,4.63,f
4,49552,Multatuli Luxury Guest Suite in top location,Stylish & spacious 60m2 guest suite in Amsterd...,100%,92%,t,1.0,2.0,t,"Amsterdam, North Holland, Netherlands",Centrum-West,Entire guest suite,Entire home/apt,3,1.0,1 bath,2.0,2.0,"[""Marie Stella Maris body soap"", ""Coffee maker...",$284.00,3,1125,1,4,1125,1125,3.0,1125.0,599,56,8,58,4.93,4.93,4.93,4.96,4.97,4.98,4.78,f


### 1.1 Inspecting and Understanding the Loaded Data

In [3]:
df.info()

print(f"Number of rows: {df.shape[0]}, columns: {df.shape[1]}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6097 entries, 0 to 6096
Data columns (total 40 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           6097 non-null   int64  
 1   name                         6097 non-null   object 
 2   description                  6097 non-null   object 
 3   host_response_rate           6097 non-null   object 
 4   host_acceptance_rate         6097 non-null   object 
 5   host_is_superhost            6097 non-null   object 
 6   host_listings_count          6097 non-null   float64
 7   host_total_listings_count    6097 non-null   float64
 8   host_has_profile_pic         6097 non-null   object 
 9   neighbourhood                6097 non-null   object 
 10  neighbourhood_cleansed       6097 non-null   object 
 11  property_type                6097 non-null   object 
 12  room_type                    6097 non-null   object 
 13  accommodates      

### 2.0 Analyzing and Preparing Host-Level Features

#### 2.1 Converting Percentage Strings to Numeric Values for Modeling

In [4]:
percent_cols = ['host_response_rate', 'host_acceptance_rate']  

for col in percent_cols:
    df[col] = (
        df[col]
        .astype(str)                 
        .str.rstrip('%')            
        .replace('nan', np.nan)      
        .astype(float) / 100
    )
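
An alternative sketch of the same conversion uses `pd.to_numeric` with `errors="coerce"`, so any value that is not a plain percentage becomes NaN instead of raising an error. The helper name `percent_to_fraction` and the sample values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def percent_to_fraction(series: pd.Series) -> pd.Series:
    """Convert strings such as '98%' to floats such as 0.98.

    Values that cannot be parsed as numbers become NaN.
    """
    stripped = series.astype(str).str.strip().str.rstrip("%")
    return pd.to_numeric(stripped, errors="coerce") / 100

# Hypothetical sample, including a missing value and a non-numeric marker.
sample = pd.Series(["100%", "50%", None, "N/A"])
converted = percent_to_fraction(sample)
print(converted)
```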

#### 2.2 Converting Binary Features ('t'/'f') to 0/1 for Easier Processing

In [5]:
binary_cols = ['host_is_superhost', 'host_has_profile_pic']

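for col in binary_cols:
    # Map the 't'/'f' flags to 1/0. Both columns have no missing values
    # (see df.info() above), so the cast to int is safe here.
    df[col] = df[col].map({'t': 1, 'f': 0}).astype(int)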



#### 2.3 Creating Host Experience Feature: Ratio of Total to Active Listings

In [6]:
df['host_experience_ratio'] = (
    df['host_total_listings_count'] /
    df['host_listings_count'].replace(0, np.nan)   
)

df['host_experience_ratio'] = df['host_experience_ratio'].fillna(0)

#### 2.4 Save Processed Host Features to a Separate Table and CSV File

In [7]:
host_features_df = df[['host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
    'host_listings_count', 'host_total_listings_count', 'host_has_profile_pic', 'host_experience_ratio']]


config = DatabaseConfig()
db = DatabaseConnection(config)

TABLE_NAME = "host_features"

db.write_dataframe(host_features_df, TABLE_NAME, if_exists="replace")

host_features_path = "../data/processed/host_features.csv"
host_features_df.to_csv(host_features_path, index=False)
print(f"Cleaned data saved to CSV: {host_features_path}")

DataFrame successfully saved to table: host_features
Cleaned data saved to CSV: ../data/processed/host_features.csv


### 3.0 Location and Neighbourhood-Based Features

#### 3.1 Encoding and Ranking Neighbourhood Features for Location-Based Price Patterns

In [8]:
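# Distinct neighbourhood names, in order of first appearance in the data.
unique_neighbourhoods = df["neighbourhood_cleansed"].unique()

# Assign each neighbourhood an integer ID starting at 1.
# Note: these IDs are arbitrary labels, not a price-based ranking.
neighbourhood_dict = {name: i+1 for i, name in enumerate(unique_neighbourhoods)}

df['neighbourhood_rank'] = df['neighbourhood_cleansed'].map(neighbourhood_dict)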
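
If an actual price-based ranking is wanted, one possible approach is to rank neighbourhoods by their median price. The sketch below is an assumption about how that could look, not the notebook's method. It uses made-up rows, and the names `price_num` and `neighbourhood_price_rank` are hypothetical. Because the rank is derived from the target variable, it should be computed on the training split only to avoid leakage.

```python
import pandas as pd

# Toy rows for illustration; prices are strings, as in the raw `price` column.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Centrum-West", "Centrum-West", "Centrum-Oost", "Noord-West"],
    "price": ["$132.00", "$284.00", "$70.00", "$1,050.00"],
})

# Strip the currency symbol and thousands separators before converting to float.
toy["price_num"] = toy["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# Median price per neighbourhood, then rank so that 1 = most expensive.
median_price = toy.groupby("neighbourhood_cleansed")["price_num"].median()
price_rank = median_price.rank(ascending=False, method="dense").astype(int)

toy["neighbourhood_price_rank"] = toy["neighbourhood_cleansed"].map(price_rank)
print(toy)
```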

#### 3.2 Save Processed Location Features to a Separate Table and CSV File

In [9]:
location_features_df = df[['neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_rank']]


config = DatabaseConfig()
db = DatabaseConnection(config)

TABLE_NAME = "location_features"

db.write_dataframe(location_features_df, TABLE_NAME, if_exists="replace")

location_features_path = "../data/processed/location_features.csv"
location_features_df.to_csv(location_features_path, index=False)
print(f"Cleaned data saved to CSV: {location_features_path}")

DataFrame successfully saved to table: location_features
Cleaned data saved to CSV: ../data/processed/location_features.csv


### 4.0 Property Type Features

#### 4.1 Encoding and Ranking Property Type Features 

In [10]:
unique_property_types = sorted(df["property_type"].unique())
unique_room_types = sorted(df["room_type"].unique())

property_type_dict = {name: i+1 for i, name in enumerate(unique_property_types)}
room_type_dict = {name: i+1 for i, name in enumerate(unique_room_types)}

df['property_type_id'] = df['property_type'].map(property_type_dict)
df['room_type_id'] = df['room_type'].map(room_type_dict)
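
The same sorted integer encoding can also be obtained from pandas categoricals. The sketch below compares both approaches on a made-up `room_type` column; it is only an illustration of the equivalence, under the assumption that the column holds plain strings with no missing values.

```python
import pandas as pd

# Toy column for illustration only.
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"]
})

# Dictionary approach, as in the cell above: sorted unique values, IDs from 1.
mapping = {name: i + 1 for i, name in enumerate(sorted(toy["room_type"].unique()))}
ids_from_dict = toy["room_type"].map(mapping)

# Categorical approach: categories are sorted, codes start at 0, so add 1.
ids_from_codes = toy["room_type"].astype("category").cat.codes + 1

print(ids_from_dict.tolist(), ids_from_codes.tolist())
```

Either way, the integers are labels with no natural order, so linear models may read a spurious ordering into them; tree-based models are usually less sensitive to this.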

#### 4.2 Bedroom Bath Ratio

In [11]:
df['bedroom_bath_ratio'] = df['bedrooms'] / df['bathrooms']

#### 4.3 People per Bed

In [12]:
df['people_per_bed'] = df['accommodates'] / df['beds']
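
Ratio features like the two above produce `inf` when the denominator is zero, and `dropna` does not remove `inf`. The guard already used for `host_experience_ratio` in 2.3 could be applied here as well. The helper `safe_ratio` and the toy values below are hypothetical, shown only to make the difference visible.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Toy data with one zero-bed listing.
toy = pd.DataFrame({"accommodates": [2, 4, 3], "beds": [1.0, 0.0, 2.0]})

plain = toy["accommodates"] / toy["beds"]               # inf for the zero-bed row
guarded = safe_ratio(toy["accommodates"], toy["beds"])  # NaN for the zero-bed row

print(plain.tolist())
print(guarded.tolist())
```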

#### 4.4 Number of Total Rooms

In [13]:
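def parse_bathrooms(text):
    """Extract a numeric bathroom count from free text such as '1.5 baths'."""
    if pd.isna(text):
        return np.nan
    # Note: this check is case-sensitive, so a lowercase 'half-bath'
    # falls through to the regex below and yields NaN.
    if 'Half' in text:
        return 0.5
    else:
        # First integer or decimal number in the text, if any.
        match = re.search(r'\d+(\.\d+)?', text)
        return float(match.group()) if match else np.nan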

# Re-derive the numeric bathroom count from the free-text column.
df['bathrooms'] = df['bathrooms_text'].apply(parse_bathrooms)

# Total room count; NaN wherever the bathroom count could not be parsed.
df['rooms_total'] = df['bedrooms'] + df['bathrooms']
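
The parser matches `'Half'` with a capital H only. If the data also contains lowercase variants (the strings `"Shared half-bath"` and similar below are assumed examples, not confirmed from this dataset), a case-insensitive variant would catch them. The function name `parse_bathrooms_ci` is hypothetical.

```python
import re

import numpy as np
import pandas as pd

def parse_bathrooms_ci(text):
    """Case-insensitive variant of parse_bathrooms (illustrative sketch)."""
    if pd.isna(text):
        return np.nan
    if "half" in text.lower():
        return 0.5
    # First integer or decimal number in the text, if any.
    match = re.search(r"\d+(\.\d+)?", text)
    return float(match.group()) if match else np.nan

# Assumed sample strings for illustration.
samples = pd.Series(["1.5 baths", "1 shared bath", "Half-bath", "Shared half-bath", None])
parsed = samples.apply(parse_bathrooms_ci)
print(parsed.tolist())
```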

#### 4.5 Save Processed Property Features to a Separate Table and CSV File

In [14]:
property_features_df = df[[
    'property_type', 'room_type', 'property_type_id', 'room_type_id',
    'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds',
    'bedroom_bath_ratio', 'people_per_bed', 'rooms_total'
]]


config = DatabaseConfig()
db = DatabaseConnection(config)

TABLE_NAME = "property_features"

db.write_dataframe(property_features_df, TABLE_NAME, if_exists="replace")

property_features_df_path = "../data/processed/property_features.csv"

property_features_df.to_csv(property_features_df_path, index=False)
print(f"Cleaned data saved to CSV: {property_features_df_path}")

DataFrame successfully saved to table: property_features
Cleaned data saved to CSV: ../data/processed/property_features.csv


### 5.0 Sentiment Analysis

In [15]:
def get_sentiment_vader(text):
    """Return VADER's compound polarity score, a float in [-1, 1]."""
    return sia.polarity_scores(str(text))['compound']

df['description_sentiment'] = df['description'].apply(get_sentiment_vader)
df['amenities_sentiment'] = df['amenities'].apply(get_sentiment_vader)

def sentiment_label_vader(compound):
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    else:
        return "neutral"

df['description_sentiment_label'] = df['description_sentiment'].apply(sentiment_label_vader)
df['amenities_sentiment_label'] = df['amenities_sentiment'].apply(sentiment_label_vader)
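
Besides sentiment, simple length statistics are another inexpensive text feature, as mentioned in the overview (description → length). The sketch below is one possible way to compute them; the column names `description_length` and `description_word_count` are hypothetical and the rows are toy examples.

```python
import pandas as pd

# Toy descriptions for illustration, including a missing value.
toy = pd.DataFrame({
    "description": [
        "Basic bedroom in the center of Amsterdam.",
        "Stylish houseboat",
        None,
    ]
})

# Treat missing descriptions as empty strings so the lengths become 0.
text = toy["description"].fillna("")

toy["description_length"] = text.str.len()                # characters
toy["description_word_count"] = text.str.split().str.len()  # whitespace-separated words

print(toy)
```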

### 6.0 Check and Remove Rows with NULL Values

In [16]:
df.isna().sum()

id                             0
name                           0
description                    0
host_response_rate             0
host_acceptance_rate           0
host_is_superhost              0
host_listings_count            0
host_total_listings_count      0
host_has_profile_pic           0
neighbourhood                  0
neighbourhood_cleansed         0
property_type                  0
room_type                      0
accommodates                   0
bathrooms                      9
bathrooms_text                 0
bedrooms                       0
beds                           0
amenities                      0
price                          0
minimum_nights                 0
maximum_nights                 0
minimum_minimum_nights         0
maximum_minimum_nights         0
minimum_maximum_nights         0
maximum_maximum_nights         0
minimum_nights_avg_ntm         0
maximum_nights_avg_ntm         0
number_of_reviews              0
number_of_reviews_ltm          0
number_of_

In [17]:
# Drop rows where the derived room features could not be computed.
df = df.dropna(subset=['rooms_total','bedroom_bath_ratio'])

### 7.0 Reorder Columns and Save to a New Table and a New CSV File

In [18]:
featured_df = df[[
    # 1. Basic info
    'id', 'name', 'description', 'description_sentiment', 'description_sentiment_label',
    
    # 2. Host info
    'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
    'host_listings_count', 'host_total_listings_count', 'host_has_profile_pic', 'host_experience_ratio',
    
    # 3. Location
    'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_rank',
    
    # 4. Property features
    'property_type', 'room_type', 'property_type_id', 'room_type_id',
    'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds',
    'bedroom_bath_ratio', 'people_per_bed', 'rooms_total',
    
    # 5. Amenities
    'amenities', 'amenities_sentiment', 'amenities_sentiment_label',
    
    # 6. Price & availability
    'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
    'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm',
    
    # 7. Reviews
    'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'number_of_reviews_ly',
    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin',
    'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable',
    
    
]]

config = DatabaseConfig()
db = DatabaseConnection(config)

TABLE_NAME = "feature_eningineered_data"

db.write_dataframe(featured_df, TABLE_NAME, if_exists="replace")

featured_df_path = "../data/processed/amsterdam_airbnbs_feature_engineered_data.csv"

featured_df.to_csv(featured_df_path, index=False)
print(f"Cleaned data saved to CSV: {featured_df_path}")

DataFrame successfully saved to table: feature_eningineered_data
Cleaned data saved to CSV: ../data/processed/amsterdam_airbnbs_feature_engineered_data.csv
