# Lab 3: Predicting Hotel Room Prices
Machine Learning in Tourism

## Learning objectives:

1. Apply data preprocessing techniques to prepare hotel price data for predictive modeling
2. Implement and compare multiple regression algorithms (linear, polynomial, tree-based) for price prediction
3. Evaluate model performance using appropriate metrics (RMSE, MAE, R²)
4. Interpret feature importance to identify key factors affecting hotel pricing
5. Conduct price sensitivity analysis for business decision-making in the tourism industry
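
As a quick reference for objective 3, the sketch below computes MAE, RMSE, and R² by hand on four made-up prices. The values are illustrative only and are not taken from the lab dataset; the formulas are the standard ones that `mean_absolute_error`, `mean_squared_error`, and `r2_score` implement.

```python
import math

# Illustrative values only (not taken from the lab dataset)
y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 210.0, 240.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average absolute error, in the same unit as the target (€)
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the mean squared error; penalizes large errors more
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R²: share of the target's variance explained by the predictions
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.3f}")
```

MAE and RMSE coincide here because every error has the same magnitude; RMSE exceeds MAE as soon as the errors differ in size.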

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta

# For data preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# For modeling
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import xgboost as xgb

# For model evaluation
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Configure visualization
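plt.style.use('seaborn-v0_8-whitegrid')
# set_theme with an explicit style keeps the whitegrid look (plain sns.set() would switch to darkgrid)
sns.set_theme(style='whitegrid', font_scale=1.2)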
plt.rcParams['figure.figsize'] = (12, 6)

## PART 1: DATA LOADING AND EXPLORATION

In [None]:
# Function to generate synthetic hotel pricing data
def generate_hotel_data(n_samples=10000):
    np.random.seed(42)
    
    # Create date range covering 2 years
    start_date = datetime(2022, 1, 1)
    dates = [start_date + timedelta(days=i) for i in range(730)]
    
    # Randomly select dates for our samples
    sample_dates = np.random.choice(dates, size=n_samples)
    
    # Extract date features
    month = np.array([d.month for d in sample_dates])
    day_of_week = np.array([d.weekday() for d in sample_dates])
    is_weekend = (day_of_week >= 5).astype(int)
    
    # High season: June-August (6,7,8) and December (12)
    is_high_season = np.isin(month, [6, 7, 8, 12]).astype(int)
    
    # Create hotel features
    hotel_category = np.random.choice([1, 2, 3, 4, 5], size=n_samples)  # 1-5 stars
    distance_to_center = np.random.uniform(0, 10, size=n_samples)  # km
    has_pool = np.random.choice([0, 1], size=n_samples, p=[0.7, 0.3])
    has_spa = np.random.choice([0, 1], size=n_samples, p=[0.8, 0.2])
    has_gym = np.random.choice([0, 1], size=n_samples, p=[0.6, 0.4])
    room_capacity = np.random.choice([1, 2, 3, 4], size=n_samples, p=[0.2, 0.5, 0.2, 0.1])
    
    # Booking features
    days_in_advance = np.random.randint(1, 365, size=n_samples)
    length_of_stay = np.random.randint(1, 15, size=n_samples)
    nr_previous_bookings = np.random.randint(0, 10, size=n_samples)
    
    # Create base price influenced by various factors
    base_price = 50 + 30 * hotel_category - 5 * distance_to_center + 20 * has_pool + 25 * has_spa + 15 * has_gym
    
    # Add seasonal effects
    season_effect = 40 * is_high_season + 30 * is_weekend
    
    # Add booking effects: last-minute bookings are more expensive, longer stays get discounts
    booking_effect = -0.1 * days_in_advance + 10 * np.exp(-days_in_advance/10) - 5 * length_of_stay
    
    # Add loyalty effects
    loyalty_effect = -2 * nr_previous_bookings
    
    # Add a nonlinear (quadratic) room-capacity effect
    capacity_effect = 10 * (room_capacity - 1) + 5 * (room_capacity - 1)**2
    
    # Sum all effects
    price = base_price + season_effect + booking_effect + loyalty_effect + capacity_effect
    
    # Add noise
    price = price + np.random.normal(0, 15, n_samples)
    
    # Ensure price is positive and round to nearest integer
    price = np.maximum(price, 20)
    price = np.round(price)
    
    # Combine into dataframe
    df = pd.DataFrame({
        'date': sample_dates,
        'month': month,
        'day_of_week': day_of_week,
        'is_weekend': is_weekend,
        'is_high_season': is_high_season,
        'hotel_category': hotel_category,
        'distance_to_center': distance_to_center,
        'has_pool': has_pool,
        'has_spa': has_spa,
        'has_gym': has_gym,
        'room_capacity': room_capacity,
        'days_in_advance': days_in_advance,
        'length_of_stay': length_of_stay,
        'nr_previous_bookings': nr_previous_bookings,
        'price': price
    })
    
    # Add some missing values to make it more realistic
    for col in ['has_pool', 'has_spa', 'has_gym', 'nr_previous_bookings']:
        mask = np.random.choice([True, False], size=n_samples, p=[0.05, 0.95])
        # Cast to float first: integer columns cannot hold NaN
        df[col] = df[col].astype(float)
        df.loc[mask, col] = np.nan
        
    return df

# Generate synthetic data
hotel_df = generate_hotel_data(n_samples=10000)

# Display basic information
print("Dataset shape:", hotel_df.shape)
print("\nFirst 5 rows:")
display(hotel_df.head())

# Summary statistics
print("\nSummary statistics:")
display(hotel_df.describe().T)

# Check for missing values
print("\nMissing values per column:")
print(hotel_df.isnull().sum())
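
To make the booking term in `generate_hotel_data` more concrete, the sketch below re-implements just that term as a standalone function and evaluates it for a few lead times. The helper name `booking_effect` is introduced here for illustration; the formula itself is copied from the generator above.

```python
import math

def booking_effect(days_in_advance, length_of_stay):
    """Booking term from generate_hotel_data: price shift in € (hypothetical helper)."""
    linear_discount = -0.1 * days_in_advance                   # cheaper the earlier you book
    last_minute_premium = 10 * math.exp(-days_in_advance / 10)  # decays quickly with lead time
    stay_discount = -5 * length_of_stay                        # longer stays are cheaper
    return linear_discount + last_minute_premium + stay_discount

# Hold the stay length fixed and vary only the lead time
for days in (1, 7, 30, 180):
    print(f"{days:>3} days ahead: {booking_effect(days, length_of_stay=1):+.2f} €")
```

The exponential term matters only for the first few weeks; beyond that, the linear discount dominates, which is the pattern to look for in the "Price vs Days in Advance" plot.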

## PART 2: DATA VISUALIZATION AND EXPLORATION

In [None]:
# Histogram of prices
plt.figure(figsize=(10, 6))
sns.histplot(hotel_df['price'], bins=30, kde=True)
plt.title('Distribution of Hotel Room Prices')
plt.xlabel('Price (€)')
plt.ylabel('Frequency')
plt.show()

# Price by hotel category (boxplot)
plt.figure(figsize=(10, 6))
sns.boxplot(x='hotel_category', y='price', data=hotel_df)
plt.title('Price by Hotel Category (Stars)')
plt.xlabel('Hotel Category')
plt.ylabel('Price (€)')
plt.show()

# Scatterplot: distance to center vs price, colored by hotel category
plt.figure(figsize=(10, 6))
sns.scatterplot(x='distance_to_center', y='price', hue='hotel_category', 
                palette='viridis', alpha=0.6, data=hotel_df)
plt.title('Price vs Distance to Center by Hotel Category')
plt.xlabel('Distance to City Center (km)')
plt.ylabel('Price (€)')
plt.legend(title='Hotel Stars')
plt.show()

# Price by month (to see seasonality)
plt.figure(figsize=(12, 6))
monthly_avg_price = hotel_df.groupby('month')['price'].mean().reset_index()
sns.barplot(x='month', y='price', data=monthly_avg_price)
plt.title('Average Price by Month')
plt.xlabel('Month')
plt.ylabel('Average Price (€)')
plt.xticks(range(12), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
                      'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()

# Price by day of week
plt.figure(figsize=(10, 6))
day_avg_price = hotel_df.groupby('day_of_week')['price'].mean().reset_index()
sns.barplot(x='day_of_week', y='price', data=day_avg_price)
plt.title('Average Price by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Average Price (€)')
plt.xticks(range(7), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
plt.show()

# Correlation matrix
plt.figure(figsize=(12, 10))
# include='number' also catches int32 columns, which some platforms produce by default
numeric_cols = hotel_df.select_dtypes(include='number').columns
correlation_matrix = hotel_df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Numeric Features')
plt.tight_layout()
plt.show()

# Advanced analysis: Price vs Days in Advance
plt.figure(figsize=(12, 6))
sns.scatterplot(x='days_in_advance', y='price', data=hotel_df, alpha=0.3)
# Add a linear trend line (regplot fits an ordinary least-squares line by default)
sns.regplot(x='days_in_advance', y='price', data=hotel_df, scatter=False, 
           line_kws={"color": "red"})
plt.title('Price vs Days in Advance')
plt.xlabel('Days Booked in Advance')
plt.ylabel('Price (€)')
plt.show()

# Advanced analysis: Price vs Length of Stay
plt.figure(figsize=(12, 6))
stay_avg_price = hotel_df.groupby('length_of_stay')['price'].mean().reset_index()
sns.lineplot(x='length_of_stay', y='price', data=stay_avg_price, marker='o')
plt.title('Average Price vs Length of Stay')
plt.xlabel('Length of Stay (days)')
plt.ylabel('Average Price (€)')
plt.show()

## PART 3: DATA PREPROCESSING

In [None]:
# Define features and target variable
X = hotel_df.drop(['price', 'date'], axis=1)  # Drop price (target) and date
y = hotel_df['price']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

# Identify categorical and numerical features
categorical_features = ['month', 'day_of_week', 'hotel_category', 'room_capacity']
numerical_features = ['distance_to_center', 'days_in_advance', 'length_of_stay', 
                     'nr_previous_bookings']
binary_features = ['is_weekend', 'is_high_season', 'has_pool', 'has_spa', 'has_gym']

# Create preprocessing pipeline for numerical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Create preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))
])

# Create preprocessing pipeline for binary features
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))
])

# Combine all transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features),
        ('bin', binary_transformer, binary_features)
    ])

# Apply preprocessing to training data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("Processed training data shape:", X_train_processed.shape)

In [None]:
# Check the processed feature names
# Get feature names after preprocessing
# Note: The feature names for the categorical features are generated by OneHotEncoder
# and need to be extracted from the pipeline
feature_names = (
    numerical_features + 
    list(preprocessor.transformers_[1][1].named_steps['onehot'].get_feature_names_out(categorical_features)) +
    binary_features
)
print("Processed feature names:", feature_names)

## PART 4: BASELINE MODELS

In [None]:
# Function to evaluate model performance
def evaluate_model(model, X_train, X_test, y_train, y_test):
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Calculate metrics
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
    train_mae = mean_absolute_error(y_train, y_train_pred)
    test_mae = mean_absolute_error(y_test, y_test_pred)
    train_r2 = r2_score(y_train, y_train_pred)
    test_r2 = r2_score(y_test, y_test_pred)
    
    # Return all metrics
    return {
        'train_rmse': train_rmse,
        'test_rmse': test_rmse,
        'train_mae': train_mae,
        'test_mae': test_mae,
        'train_r2': train_r2,
        'test_r2': test_r2
    }

# Create dictionary to store all model results
model_results = {}

# 1. Linear Regression
lr_model = LinearRegression()
model_results['Linear Regression'] = evaluate_model(
    lr_model, X_train_processed, X_test_processed, y_train, y_test
)

# 2. Ridge Regression
ridge_model = Ridge(alpha=1.0)
model_results['Ridge Regression'] = evaluate_model(
    ridge_model, X_train_processed, X_test_processed, y_train, y_test
)

# 3. Lasso Regression
lasso_model = Lasso(alpha=0.1)
model_results['Lasso Regression'] = evaluate_model(
    lasso_model, X_train_processed, X_test_processed, y_train, y_test
)

# 4. ElasticNet
elasticnet_model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model_results['ElasticNet'] = evaluate_model(
    elasticnet_model, X_train_processed, X_test_processed, y_train, y_test
)

# Display results in a dataframe
results_df = pd.DataFrame(model_results).T
results_df = results_df[['train_rmse', 'test_rmse', 'train_mae', 'test_mae', 'train_r2', 'test_r2']]
results_df = results_df.sort_values('test_rmse')
print("\nLinear Model Performance:")
display(results_df)

## PART 5: POLYNOMIAL REGRESSION

In [None]:
# Create a polynomial features pipeline
# Note: interaction_only=True adds pairwise products only (no squared terms)
polynomial_preprocessor = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('poly', PolynomialFeatures(degree=2, include_bias=False, interaction_only=True))
])

# Apply preprocessing to training data
X_train_poly = polynomial_preprocessor.fit_transform(X_train)
X_test_poly = polynomial_preprocessor.transform(X_test)

print("Polynomial training data shape:", X_train_poly.shape)

# Ridge regression with polynomial features (to avoid overfitting)
poly_ridge_model = Ridge(alpha=10.0)
model_results['Polynomial Ridge'] = evaluate_model(
    poly_ridge_model, X_train_poly, X_test_poly, y_train, y_test
)

# Update results dataframe
results_df = pd.DataFrame(model_results).T
results_df = results_df[['train_rmse', 'test_rmse', 'train_mae', 'test_mae', 'train_r2', 'test_r2']]
results_df = results_df.sort_values('test_rmse')
print("\nLinear and Polynomial Model Performance:")
display(results_df)
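
To see what the `poly` step does, the sketch below applies the same `PolynomialFeatures` settings to a single made-up row and counts the resulting columns. The helper `n_interaction_features` is hypothetical, and the input width of 33 assumes that every category level appears in the training split (4 numerical + 24 one-hot + 5 binary columns).

```python
import math
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Same settings as the lab's 'poly' step
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)

# One illustrative row with three features a, b, c
row = np.array([[2.0, 3.0, 5.0]])
expanded = poly.fit_transform(row)
print(expanded)  # columns: a, b, c, a*b, a*c, b*c

def n_interaction_features(n_inputs):
    """Original columns plus one product per unordered pair (hypothetical helper)."""
    return n_inputs + math.comb(n_inputs, 2)

print(n_interaction_features(3))
print(n_interaction_features(33))  # assumed width of the lab's preprocessed data
```

The column count grows quadratically with the input width, which is why the lab pairs these features with a regularized model rather than plain linear regression.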

## PART 6: TREE-BASED MODELS

In [None]:
# 1. Decision Tree
dt_model = DecisionTreeRegressor(max_depth=10, random_state=42)
model_results['Decision Tree'] = evaluate_model(
    dt_model, X_train_processed, X_test_processed, y_train, y_test
)

# 2. Random Forest
rf_model = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42, n_jobs=-1)
model_results['Random Forest'] = evaluate_model(
    rf_model, X_train_processed, X_test_processed, y_train, y_test
)

# 3. Gradient Boosting
gb_model = GradientBoostingRegressor(n_estimators=100, max_depth=5, random_state=42)
model_results['Gradient Boosting'] = evaluate_model(
    gb_model, X_train_processed, X_test_processed, y_train, y_test
)

# 4. XGBoost
xgb_model = xgb.XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42)
model_results['XGBoost'] = evaluate_model(
    xgb_model, X_train_processed, X_test_processed, y_train, y_test
)

# Update results dataframe
results_df = pd.DataFrame(model_results).T
results_df = results_df[['train_rmse', 'test_rmse', 'train_mae', 'test_mae', 'train_r2', 'test_r2']]
results_df = results_df.sort_values('test_rmse')
print("\nAll Models Performance:")
display(results_df)

## PART 7: HYPERPARAMETER TUNING
We tune ElasticNet (on the polynomial features) and Random Forest as examples

In [None]:
# Define the parameter grid for ElasticNet
param_grid_elasticnet = {
    'elasticnet__alpha': [0.01, 0.1, 1.0, 10.0],  # Regularization strength
    'elasticnet__l1_ratio': [0.1, 0.5, 0.7, 0.9]  # Balance between L1 and L2 regularization
}

# Create an ElasticNet model
elasticnet = ElasticNet(max_iter=10000)

# Create a pipeline with polynomial preprocessing and ElasticNet
elasticnet_pipeline = Pipeline(steps=[
    ('preprocessor', polynomial_preprocessor),
    ('elasticnet', elasticnet)
])

# Create GridSearchCV for ElasticNet
grid_search_elasticnet = GridSearchCV(
    estimator=elasticnet_pipeline,
    param_grid=param_grid_elasticnet,
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

# Fit GridSearchCV
grid_search_elasticnet.fit(X_train, y_train)

# Best parameters and score
print("\nBest parameters for ElasticNet:", grid_search_elasticnet.best_params_)
print("Best RMSE for ElasticNet:", -grid_search_elasticnet.best_score_)

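# Retrieve the best pipeline (GridSearchCV refits it on the full training set)
best_elasticnet_model = grid_search_elasticnet.best_estimator_

# Evaluate the tuned ElasticNet step on the polynomial features
# (evaluate_model refits it on X_train_poly, which matches what the pipeline saw)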
model_results['ElasticNet (Tuned)'] = evaluate_model(
    best_elasticnet_model.named_steps['elasticnet'], 
    X_train_poly, 
    X_test_poly, 
    y_train, 
    y_test
)

# Update results dataframe
results_df = pd.DataFrame(model_results).T
results_df = results_df[['train_rmse', 'test_rmse', 'train_mae', 'test_mae', 'train_r2', 'test_r2']]
results_df = results_df.sort_values('test_rmse')
print("\nUpdated Model Performance:")
display(results_df)

In [None]:
# Random Forest hyperparameter tuning with GridSearchCV
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2']
}

# Create GridSearchCV for Random Forest
grid_search_rf = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid_rf,
    cv=3,
    scoring='neg_root_mean_squared_error',
    verbose=1,
    n_jobs=-1
)

# Fit GridSearchCV
grid_search_rf.fit(X_train_processed, y_train)

# Best parameters and score
print("\nBest parameters for Random Forest:", grid_search_rf.best_params_)
print("Best RMSE for Random Forest:", -grid_search_rf.best_score_)

# Create best model with optimal parameters
best_rf_model = grid_search_rf.best_estimator_
model_results['Random Forest (Tuned)'] = evaluate_model(
    best_rf_model, X_train_processed, X_test_processed, y_train, y_test
)

# Update results dataframe
results_df = pd.DataFrame(model_results).T
results_df = results_df[['train_rmse', 'test_rmse', 'train_mae', 'test_mae', 'train_r2', 'test_r2']]
results_df = results_df.sort_values('test_rmse')
print("\nFinal Model Performance:")
display(results_df)
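
Before running an exhaustive grid search, it helps to count how many model fits it implies. The sketch below does this for the Random Forest grid defined above, using only the standard library; the variable names `cv_folds`, `n_candidates`, and `n_fits` are introduced here for illustration.

```python
import math

# Same grid as in the Random Forest tuning cell
param_grid_rf = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2']
}
cv_folds = 3

# An exhaustive grid search tries every combination of the listed values
n_candidates = math.prod(len(values) for values in param_grid_rf.values())

# Each candidate is fitted once per cross-validation fold
n_fits = n_candidates * cv_folds

print("Parameter combinations:", n_candidates)
print("Cross-validation fits:", n_fits)
```

If that runtime is too long for a lab session, one option is scikit-learn's `RandomizedSearchCV`, which samples a fixed number of combinations (`n_iter`) from the same grid.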

## PART 8: FEATURE IMPORTANCE AND MODEL INTERPRETATION

In [None]:
# Feature importance for Random Forest model
importances = best_rf_model.feature_importances_
feature_importances = pd.DataFrame(importances, index=feature_names, columns=['importance'])
feature_importances = feature_importances.sort_values('importance', ascending=False)
# Plot feature importances
plt.figure(figsize=(12, 8))
sns.barplot(x=feature_importances['importance'], y=feature_importances.index)
plt.title('Feature Importances from Random Forest Model')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.show()
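
Because one-hot encoding spreads a categorical feature over several columns, its importance is split across those columns in the plot above. The sketch below shows one way to sum importances back to the original feature; the importance values are hypothetical placeholders, not results from the lab's model, and `original_feature` is a helper introduced for illustration.

```python
from collections import defaultdict

categorical_features = ['month', 'day_of_week', 'hotel_category', 'room_capacity']

# Hypothetical importances for illustration only (not model output)
toy_importances = {
    'hotel_category_2': 0.10,
    'hotel_category_3': 0.15,
    'hotel_category_4': 0.20,
    'distance_to_center': 0.25,
    'month_7': 0.05,
    'is_weekend': 0.25,
}

def original_feature(name):
    """Map a one-hot column such as 'hotel_category_3' back to 'hotel_category'."""
    for cat in categorical_features:
        if name.startswith(cat + '_'):
            return cat
    return name  # numerical and binary columns keep their own name

# Sum the importances of all columns that belong to the same original feature
grouped = defaultdict(float)
for name, value in toy_importances.items():
    grouped[original_feature(name)] += value

for feature, total in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {total:.2f}")
```

With the lab's objects, the same grouping could be applied to `dict(zip(feature_names, best_rf_model.feature_importances_))`.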

## PART 9: PREDICTIONS ON NEW DATA

In [None]:
# Create a function to make price predictions for new hotels
def predict_price(model, preprocessor, hotel_data):
    # Preprocess the data
    hotel_data_processed = preprocessor.transform(hotel_data)
    
    # Make prediction
    predicted_price = model.predict(hotel_data_processed)
    
    return predicted_price[0]

# Example: Create a new hotel data point
new_hotel = pd.DataFrame({
    'month': [7],  # July
    'day_of_week': [5],  # Saturday
    'is_weekend': [1],
    'is_high_season': [1],
    'hotel_category': [4],  # 4-star hotel
    'distance_to_center': [1.5],  # 1.5 km from center
    'has_pool': [1],
    'has_spa': [1],
    'has_gym': [1],
    'room_capacity': [2],  # Double room
    'days_in_advance': [30],  # Booked 30 days in advance
    'length_of_stay': [7],  # 7-day stay
    'nr_previous_bookings': [2]  # Customer with 2 previous bookings
})

# Predict price for the new hotel
predicted_price = predict_price(best_elasticnet_model.named_steps['elasticnet'], polynomial_preprocessor, new_hotel)
print(f"\nPredicted price for the new hotel: €{predicted_price:.2f}")

# Create a few more examples with different parameters
example_hotels = pd.DataFrame({
    'month': [7, 7, 1, 12],  # July, July, January, December
    'day_of_week': [5, 1, 3, 6],  # Saturday, Tuesday, Thursday, Sunday
    'is_weekend': [1, 0, 0, 1],
    'is_high_season': [1, 1, 0, 1],
    'hotel_category': [4, 3, 5, 5],  
    'distance_to_center': [1.5, 4.0, 0.5, 2.0],
    'has_pool': [1, 0, 1, 1],
    'has_spa': [1, 0, 1, 1],
    'has_gym': [1, 1, 1, 1],
    'room_capacity': [2, 2, 1, 3],
    'days_in_advance': [30, 10, 90, 180],
    'length_of_stay': [7, 3, 2, 10],
    'nr_previous_bookings': [2, 0, 5, 1]
})

# Preprocess the examples
examples_processed = preprocessor.transform(example_hotels)

# Make predictions
predicted_prices = best_rf_model.predict(examples_processed)

# Add predictions to the example dataframe
example_hotels['predicted_price'] = predicted_prices.round(2)

# Display the examples with predictions
print("\nPredicted prices for different hotels:")
display(example_hotels[['hotel_category', 'distance_to_center', 'is_weekend', 
                      'is_high_season', 'days_in_advance', 'length_of_stay', 
                      'predicted_price']])

## PART 10: SCENARIO ANALYSIS
### With RF model

In [None]:
# Create a function to analyze price sensitivity to a specific feature
def analyze_price_sensitivity(model, preprocessor, base_hotel, feature, range_values):
    # Create multiple versions of the hotel with different values of the feature
    hotels = pd.concat([base_hotel] * len(range_values), ignore_index=True)
    hotels[feature] = range_values
    
    # Preprocess the hotels
    hotels_processed = preprocessor.transform(hotels)
    
    # Make predictions
    predicted_prices = model.predict(hotels_processed)
    
    return predicted_prices

# Base hotel for sensitivity analysis
base_hotel = pd.DataFrame({
    'month': [7],  # July
    'day_of_week': [3],  # Thursday
    'is_weekend': [0],
    'is_high_season': [1],
    'hotel_category': [3],  # 3-star hotel
    'distance_to_center': [2.0],  # 2 km from center
    'has_pool': [1],
    'has_spa': [0],
    'has_gym': [1],
    'room_capacity': [2],  # Double room
    'days_in_advance': [30],  # Booked 30 days in advance
    'length_of_stay': [4],  # 4-day stay
    'nr_previous_bookings': [1]  # Customer with 1 previous booking
})

# Analyze sensitivity to days in advance
days_range = range(1, 180, 7)  # 1 to 176 days in 7-day increments
prices_by_days = analyze_price_sensitivity(
    best_rf_model, preprocessor, base_hotel, 'days_in_advance', days_range
)

# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(days_range, prices_by_days, marker='o')
plt.title('Price Sensitivity to Booking Days in Advance')
plt.xlabel('Days in Advance')
plt.ylabel('Predicted Price (€)')
plt.grid(True)
plt.show()

# Analyze sensitivity to hotel category
category_range = [1, 2, 3, 4, 5]
prices_by_category = analyze_price_sensitivity(
    best_rf_model, preprocessor, base_hotel, 'hotel_category', category_range
)

# Plot the results
plt.figure(figsize=(10, 6))
plt.bar(category_range, prices_by_category)
plt.title('Price Sensitivity to Hotel Category')
plt.xlabel('Hotel Category (Stars)')
plt.ylabel('Predicted Price (€)')
plt.xticks(category_range)
plt.grid(axis='y')
plt.show()

# Analyze sensitivity to length of stay
stay_range = range(1, 15)
prices_by_stay = analyze_price_sensitivity(
    best_rf_model, preprocessor, base_hotel, 'length_of_stay', stay_range
)

# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(stay_range, prices_by_stay, marker='o')
plt.title('Price Sensitivity to Length of Stay')
plt.xlabel('Length of Stay (days)')
plt.ylabel('Predicted Price (€)')
plt.grid(True)
plt.show()

### With polynomial ElasticNet model

In [None]:
# Reuses analyze_price_sensitivity and base_hotel defined in the previous cell
# (run the Random Forest scenario cell first)

# Analyze sensitivity to days in advance
days_range = range(1, 180, 7)  # 1 to 176 days in 7-day increments
prices_by_days = analyze_price_sensitivity(
    best_elasticnet_model.named_steps['elasticnet'], polynomial_preprocessor, base_hotel, 'days_in_advance', days_range
)

# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(days_range, prices_by_days, marker='o')
plt.title('Price Sensitivity to Booking Days in Advance')
plt.xlabel('Days in Advance')
plt.ylabel('Predicted Price (€)')
plt.grid(True)
plt.show()

# Analyze sensitivity to hotel category
category_range = [1, 2, 3, 4, 5]
prices_by_category = analyze_price_sensitivity(
    best_elasticnet_model.named_steps['elasticnet'], polynomial_preprocessor, base_hotel, 'hotel_category', category_range
)

# Plot the results
plt.figure(figsize=(10, 6))
plt.bar(category_range, prices_by_category)
plt.title('Price Sensitivity to Hotel Category')
plt.xlabel('Hotel Category (Stars)')
plt.ylabel('Predicted Price (€)')
plt.xticks(category_range)
plt.grid(axis='y')
plt.show()

# Analyze sensitivity to length of stay
stay_range = range(1, 15)
prices_by_stay = analyze_price_sensitivity(
    best_elasticnet_model.named_steps['elasticnet'], polynomial_preprocessor, base_hotel, 'length_of_stay', stay_range
)

# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(stay_range, prices_by_stay, marker='o')
plt.title('Price Sensitivity to Length of Stay')
plt.xlabel('Length of Stay (days)')
plt.ylabel('Predicted Price (€)')
plt.grid(True)
plt.show()

## PART 11: CONCLUSION AND KEY FINDINGS

In [None]:
# Order the results by test RMSE
results_df = results_df.sort_values('test_rmse')
# Display the two best-performing models
print("\nBest model performance:")
print("Best model:", results_df.iloc[0].name, 
      f"with RMSE: {results_df.iloc[0]['test_rmse']:.2f} and R²: {results_df.iloc[0]['test_r2']:.4f}")
print("Second-best model:", results_df.iloc[1].name,
      f"with RMSE: {results_df.iloc[1]['test_rmse']:.2f} and R²: {results_df.iloc[1]['test_r2']:.4f}")


----- KEY FINDINGS -----
1. Most important factors affecting hotel prices (based on feature importance):
   - Hotel category (star rating)
   - Seasonality (high season vs. low season)
   - Weekend vs. weekday
   - Distance to city center
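2. Pricing insights (effect sizes are those built into the synthetic data generator):
   - Booking further in advance lowers the price by about €0.10 per day; a last-minute premium of up to roughly €9 fades within the first few weeks
   - Longer stays get better rates: each extra day of stay lowers the price by about €5
   - High season adds about €40 and weekends add about €30
   - Each additional star adds about €30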

----- BUSINESS RECOMMENDATIONS -----
1. For hotels:
   - Adjust pricing strategies based on demand patterns and advance booking curve
   - Implement dynamic pricing for weekends and high seasons
   - Consider loyalty discounts for repeat customers
2. For customers:
   - Book well in advance for high season travel
   - Consider longer stays for better per-night rates
   - Weigh hotel category against location (distance to center)
3. For online travel agencies:
   - Highlight potential savings for advance bookings
   - Create personalized pricing recommendations
   - Implement price prediction tools for customers

This analysis demonstrates how regression techniques can be effectively applied
to predict hotel room prices and generate actionable insights for the tourism industry.