# Notebook 04: Feature Engineering

## Purpose
This notebook prepares features for machine learning by:
- Encoding categorical variables
- Scaling numerical features
- Creating new derived features
- Selecting relevant features

## Learning Objectives
- Apply feature transformation techniques
- Understand when to use different encoding methods
- Create meaningful derived features
- Prepare data for modeling

---
## 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries imported successfully!")

In [None]:
# Load cleaned dataset
df = pd.read_csv('../data/cleaned_dataset.csv')

print(f"Dataset loaded: {df.shape[0]:,} rows, {df.shape[1]} columns")
df.head()

---
## 2. Feature Selection

Select features that will be useful for predicting price.

In [None]:
# Define features for modeling
# We'll exclude ID columns, names, the last_review date, and the target variable

# Target variable
target = 'price'

# Features to exclude
exclude_features = ['id', 'name', 'host_id', 'host_name', 'last_review', 'price']

# Select features
feature_columns = [col for col in df.columns if col not in exclude_features]

print("SELECTED FEATURES:")
print("="*80)
for i, col in enumerate(feature_columns, 1):
    print(f"{i:2d}. {col}")

print(f"\nTotal features: {len(feature_columns)}")
print(f"Target variable: {target}")

---
## 3. Create Derived Features

### Assumption:
Creating new features from existing ones can improve model performance by capturing additional patterns.
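
As a quick illustration of the three derived features built in the next cell, here is a minimal sketch on a made-up three-row frame (the values are invented for illustration, not taken from the dataset). It shows why the denominator uses `availability_365 + 1`: a listing with zero available days would otherwise cause a division by zero.

```python
import pandas as pd

# Toy listings; values are hypothetical and only chosen to hit the edge cases
toy = pd.DataFrame({
    'number_of_reviews': [0, 10, 50],
    'availability_365': [0, 99, 365],
})

# The +1 keeps the ratio finite when availability_365 is 0
toy['reviews_per_availability'] = toy['number_of_reviews'] / (toy['availability_365'] + 1)

# Binary indicators: available more than 180 days, and at least one review
toy['high_availability'] = (toy['availability_365'] > 180).astype(int)
toy['has_reviews'] = (toy['number_of_reviews'] > 0).astype(int)

print(toy)
```

The first row (no reviews, no availability) yields a ratio of 0 rather than `NaN` or `inf`.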

In [None]:
# Create a copy for feature engineering
df_fe = df.copy()

# Feature 1: Reviews per availability (popularity metric)
df_fe['reviews_per_availability'] = df_fe['number_of_reviews'] / (df_fe['availability_365'] + 1)

# Feature 2: Is the listing highly available? (binary)
df_fe['high_availability'] = (df_fe['availability_365'] > 180).astype(int)

# Feature 3: Has reviews (binary)
df_fe['has_reviews'] = (df_fe['number_of_reviews'] > 0).astype(int)

# Feature 4: Price category (based on quartiles)
# This won't be used as a feature but helps understand price segments
price_quartiles = df_fe['price'].quantile([0.25, 0.5, 0.75])

print("NEW DERIVED FEATURES CREATED:")
print("="*80)
print("1. reviews_per_availability - Reviews normalized by availability")
print("2. high_availability - Binary indicator for highly available listings")
print("3. has_reviews - Binary indicator for listings with reviews")

print("\nSample of new features:")
df_fe[['reviews_per_availability', 'high_availability', 'has_reviews']].head(10)

---
## 4. Encode Categorical Variables

### 4.1 One-Hot Encoding for Room Type

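**Explanation**: One-hot encoding creates a binary column for each category. This is suitable for nominal categorical variables where there's no inherent order. With `drop_first=True`, the first category is dropped and acts as the baseline, which avoids perfectly collinear columns.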
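
A minimal sketch of what `drop_first=True` does, using a toy column. The three room-type labels below are illustrative values and are assumed rather than read from the dataset.

```python
import pandas as pd

# Toy column; category labels are assumed for illustration
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Shared room', 'Private room']
})

# Categories are ordered alphabetically; drop_first=True removes the first one,
# which then serves as the implicit baseline (all remaining columns equal 0)
encoded = pd.get_dummies(toy['room_type'], prefix='room_type', drop_first=True, dtype=int)

print(encoded)
```

Three categories produce two columns; a row of all zeros means the dropped baseline category.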

In [None]:
# One-hot encode room_type
if 'room_type' in df_fe.columns:
    # dtype=int gives 0/1 columns, consistent with the other binary indicators
    room_type_encoded = pd.get_dummies(df_fe['room_type'], prefix='room_type', drop_first=True, dtype=int)
    df_fe = pd.concat([df_fe, room_type_encoded], axis=1)
    
    print("ONE-HOT ENCODED COLUMNS (room_type):")
    print("="*80)
    print(room_type_encoded.columns.tolist())
    print("\nSample:")
    print(room_type_encoded.head())

### 4.2 Label Encoding for Neighbourhood Group

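**Explanation**: Label encoding assigns a unique integer to each category and keeps the feature in a single column. `LabelEncoder` assigns the integers in sorted (alphabetical) order, so the codes imply an ordering that does not exist among neighbourhood groups. Tree-based models usually tolerate this; for linear or distance-based models, one-hot encoding is the safer choice.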
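
A small sketch of how the integer codes come out. The group names here are illustrative values, assumed for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative group names (assumed, not read from the dataset)
groups = ['Manhattan', 'Brooklyn', 'Queens', 'Brooklyn']

le = LabelEncoder()
codes = le.fit_transform(groups)

# classes_ is sorted, so each code is the category's alphabetical rank
mapping = {cls: int(code) for code, cls in enumerate(le.classes_)}

print(mapping)
print(codes.tolist())
```

The codes reflect alphabetical position only, so a model that treats the column as numeric would read one group as "larger" than another purely because of its spelling.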

In [None]:
# Label encode neighbourhood_group
if 'neighbourhood_group' in df_fe.columns:
    le = LabelEncoder()
    df_fe['neighbourhood_group_encoded'] = le.fit_transform(df_fe['neighbourhood_group'])
    
    # Create mapping for reference
    mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    
    print("LABEL ENCODING MAPPING (neighbourhood_group):")
    print("="*80)
    for category, code in mapping.items():
        print(f"{category}: {code}")
    
    print("\nSample:")
    print(df_fe[['neighbourhood_group', 'neighbourhood_group_encoded']].head(10))

### 4.3 Handle Neighbourhood (High Cardinality)

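**Explanation**: The `neighbourhood` column has many unique values, so one-hot encoding would add one column per neighbourhood. We use frequency encoding instead: each neighbourhood is replaced by its share of listings in the dataset. Grouping rare categories into a single bucket is an alternative that is not applied here.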
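
To make the idea concrete, here is a minimal sketch of frequency encoding on a toy column with placeholder neighbourhood names `A`, `B`, and `C` (hypothetical values).

```python
import pandas as pd

# Placeholder neighbourhood names, invented for illustration
toy = pd.DataFrame({'neighbourhood': ['A', 'A', 'A', 'B', 'C']})

# Share of rows per category (sums to 1 across categories)
freq = toy['neighbourhood'].value_counts(normalize=True)

# Replace each category with its share
toy['neighbourhood_frequency'] = toy['neighbourhood'].map(freq)

print(toy)
```

Note that `B` and `C` receive the same value because they are equally common, so frequency encoding cannot distinguish neighbourhoods that have the same number of listings.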

In [None]:
# Frequency encoding for neighbourhood
if 'neighbourhood' in df_fe.columns:
    neighbourhood_freq = df_fe['neighbourhood'].value_counts(normalize=True)
    df_fe['neighbourhood_frequency'] = df_fe['neighbourhood'].map(neighbourhood_freq)
    
    print(f"Neighbourhood unique values: {df_fe['neighbourhood'].nunique()}")
    print("\nTop 10 neighbourhoods by frequency:")
    print(neighbourhood_freq.head(10))

---
## 5. Feature Scaling

### 5.1 StandardScaler (Z-score normalization)

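**Explanation**: StandardScaler subtracts each feature's mean and divides by its standard deviation, so scaled features have mean=0 and std=1. It does not require normally distributed data, but the mean and standard deviation are sensitive to outliers and heavy skew.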
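
The transformation can be checked by hand. The sketch below uses a made-up five-value column with one large value and compares `StandardScaler` against the formula `(x - mean) / std`, where the standard deviation is the population one (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single feature with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score; np.std defaults to ddof=0, which matches StandardScaler
manual = (x - x.mean()) / x.std()

print(np.allclose(scaled, manual))
print(scaled.ravel())
```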

In [None]:
# Select numerical features for scaling
numerical_features = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews', 
                      'reviews_per_month', 'calculated_host_listings_count', 'availability_365',
                      'reviews_per_availability', 'neighbourhood_frequency']

# Filter to only include columns that exist
numerical_features = [col for col in numerical_features if col in df_fe.columns]

# Apply StandardScaler
scaler_standard = StandardScaler()
df_fe_scaled_standard = df_fe.copy()
df_fe_scaled_standard[numerical_features] = scaler_standard.fit_transform(df_fe[numerical_features])

print("STANDARD SCALER APPLIED")
print("="*80)
print("Features scaled:")
for col in numerical_features:
    print(f"  - {col}")

print("\nScaled features statistics:")
print(df_fe_scaled_standard[numerical_features].describe())

### 5.2 MinMaxScaler (0-1 normalization)

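**Explanation**: MinMaxScaler linearly rescales each feature to the [0, 1] range. Like StandardScaler, it preserves the shape of the distribution, but because the range is set by the minimum and maximum, a single extreme value can compress the remaining values into a narrow band.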
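
A minimal sketch of the same check for `MinMaxScaler`, again on a made-up column with one extreme value, using the formula `(x - min) / (max - min)`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature with one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = MinMaxScaler().fit_transform(x)

# Manual min-max rescaling to [0, 1]
manual = (x - x.min()) / (x.max() - x.min())

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

The four small values all land very close to 0 because the single large value defines the top of the range.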

In [None]:
# Apply MinMaxScaler
scaler_minmax = MinMaxScaler()
df_fe_scaled_minmax = df_fe.copy()
df_fe_scaled_minmax[numerical_features] = scaler_minmax.fit_transform(df_fe[numerical_features])

print("MINMAX SCALER APPLIED")
print("="*80)
print("\nScaled features statistics:")
print(df_fe_scaled_minmax[numerical_features].describe())

### 5.3 Compare Scaling Methods

In [None]:
# Visualize scaling comparison
sample_feature = numerical_features[0] if numerical_features else 'latitude'

if sample_feature in df_fe.columns:
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    # Original
    axes[0].hist(df_fe[sample_feature], bins=50, color='skyblue', edgecolor='black')
    axes[0].set_title(f'Original: {sample_feature}', fontsize=11, fontweight='bold')
    axes[0].set_xlabel(sample_feature)
    axes[0].set_ylabel('Frequency')
    
    # StandardScaler
    axes[1].hist(df_fe_scaled_standard[sample_feature], bins=50, color='lightgreen', edgecolor='black')
    axes[1].set_title(f'StandardScaler: {sample_feature}', fontsize=11, fontweight='bold')
    axes[1].set_xlabel(f'{sample_feature} (scaled)')
    axes[1].set_ylabel('Frequency')
    
    # MinMaxScaler
    axes[2].hist(df_fe_scaled_minmax[sample_feature], bins=50, color='coral', edgecolor='black')
    axes[2].set_title(f'MinMaxScaler: {sample_feature}', fontsize=11, fontweight='bold')
    axes[2].set_xlabel(f'{sample_feature} (scaled)')
    axes[2].set_ylabel('Frequency')
    
    plt.tight_layout()
    plt.show()

---
## 6. Final Feature Set Preparation

### Decision: We'll use StandardScaler for our final model

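**Rationale**: StandardScaler is less sensitive to extreme values than MinMaxScaler, whose output range is set entirely by the minimum and maximum, and it works well with most ML algorithms. It is not fully robust to outliers, since the mean and standard deviation are themselves affected by them.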

In [None]:
# Prepare final feature set
df_final = df_fe_scaled_standard.copy()

# Select final features for modeling
final_features = [
    # Scaled numerical features
    'latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
    'reviews_per_month', 'calculated_host_listings_count', 'availability_365',
    # Derived features
    'reviews_per_availability', 'high_availability', 'has_reviews',
    # Encoded categorical features
    'neighbourhood_group_encoded', 'neighbourhood_frequency'
]

# Add one-hot encoded room_type columns
room_type_cols = [col for col in df_final.columns if col.startswith('room_type_')]
final_features.extend(room_type_cols)

# Filter to only include columns that exist
final_features = [col for col in final_features if col in df_final.columns]

print("FINAL FEATURE SET:")
print("="*80)
for i, col in enumerate(final_features, 1):
    print(f"{i:2d}. {col}")

print(f"\nTotal features for modeling: {len(final_features)}")

---
## 7. Prepare X and y for Modeling

In [None]:
# Prepare features (X) and target (y)
X = df_final[final_features]
y = df_final['price']

print("MODELING DATA PREPARED:")
print("="*80)
print(f"Feature matrix (X) shape: {X.shape}")
print(f"Target vector (y) shape: {y.shape}")
print(f"\nFeature matrix sample:")
print(X.head())
print(f"\nTarget variable sample:")
print(y.head())

---
## 8. Train-Test Split

**Assumption**: We use an 80-20 split with a fixed `random_state` for reproducibility.
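
One caveat worth noting: in this notebook the scalers are fit on the full dataset before the split, so test-set statistics influence the scaled training features. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the frame `X_demo`, the series `y_demo`, and their values are hypothetical and generated only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, not the real dataset)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    'minimum_nights': rng.integers(1, 30, size=100).astype(float),
    'availability_365': rng.integers(0, 366, size=100).astype(float),
})
y_demo = pd.Series(rng.uniform(50, 500, size=100), name='price')

# 1. Split first
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# 2. Fit the scaler on the training rows only
scaler = StandardScaler().fit(X_tr)

# 3. Apply the same fitted transformation to both sets
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(scaler.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the training features have mean 0 and standard deviation 1 by construction, while the test features are only approximately standardized, which mirrors how unseen data would be treated.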

In [None]:
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("TRAIN-TEST SPLIT:")
print("="*80)
print(f"Training set size: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Testing set size: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"\nFeatures: {X_train.shape[1]}")
print(f"\nTraining target statistics:")
print(y_train.describe())
print(f"\nTesting target statistics:")
print(y_test.describe())

---
## 9. Save Engineered Data

In [None]:
# Save engineered features
X_train.to_csv('../data/X_train.csv', index=False)
X_test.to_csv('../data/X_test.csv', index=False)
y_train.to_csv('../data/y_train.csv', index=False, header=['price'])
y_test.to_csv('../data/y_test.csv', index=False, header=['price'])

print("Engineered data saved:")
print("  - ../data/X_train.csv")
print("  - ../data/X_test.csv")
print("  - ../data/y_train.csv")
print("  - ../data/y_test.csv")

---
## 10. Summary

### Feature Engineering Completed:

1. **Derived Features** ✅
   - Created `reviews_per_availability`
   - Created `high_availability` binary indicator
   - Created `has_reviews` binary indicator

2. **Categorical Encoding** ✅
   - One-hot encoded `room_type`
   - Label encoded `neighbourhood_group`
   - Frequency encoded `neighbourhood`

3. **Feature Scaling** ✅
   - Applied StandardScaler to numerical features
   - Compared with MinMaxScaler
   - Selected StandardScaler for final model

4. **Data Preparation** ✅
   - Created feature matrix (X) and target vector (y)
   - Split into training (80%) and testing (20%) sets
   - Saved engineered data for modeling

### Key Assumptions:

- StandardScaler is appropriate for our features
- 80-20 train-test split provides sufficient data for both training and evaluation
- One-hot encoding for room_type won't cause dimensionality issues (only 3 categories)
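- Frequency encoding captures neighbourhood importance
- Fitting the scaler and the frequency encoding on the full dataset before the split does not materially bias evaluation (strictly, both should be fit on the training set only)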

### Next Steps:

The engineered features are now ready for **Machine Learning Modeling** in the next notebook, where we will:
- Train multiple regression models
- Compare model performance
- Generate predictions

---
**Next Notebook**: [05_modeling.ipynb](05_modeling.ipynb)