In [2]:
import pandas as pd

# Load the dataset
file_path = '/Users/arka_bagchi/Desktop/AirDNA/cleaned_joshua_tree_data.csv'
data = pd.read_csv(file_path)

# Display the first few rows of the dataframe
data.head()


Unnamed: 0,longitude,Has Parking,Cancellation Policy,Has Hot Tub,city,bedrooms,AirDNA Market,adr,Pets Allowed,Occupancy Rate LTM,...,Overall Rating,Minimum Stay,Cleaning Fee,Price Tier,Has Kitchen,Number of Reviews,Number of Photos,Number of Bookings LTM,latitude,zipcode
0,-116.52897,True,strict,False,Pioneertown,1.0,"JOSHUA TREE, CA",345.45,True,0.55,...,99.0,31.0,75.0,luxury,True,205.0,26.0,33.0,34.25491,92268
1,-116.35124,True,strict,True,Palm Desert,3.0,"COACHELLA VALLEY, CA",550.0,False,1.0,...,97.0,4.0,200.0,midscale,True,101.0,24.0,1.0,33.74618,92211
2,-116.328949,True,strict,True,Joshua Tree,3.0,"JOSHUA TREE, CA",261.49,True,0.454795,...,96.738404,2.0,85.0,economy,True,539.0,54.0,65.0,34.121407,92252
3,-116.43598,True,strict,True,Yucca Valley,3.0,"JOSHUA TREE, CA",417.13,False,0.75,...,99.0,2.0,200.0,upscale,True,333.0,47.0,87.0,34.15398,92284
4,-116.39154,True,strict,False,Landers,1.0,"JOSHUA TREE, CA",224.0,False,0.157068,...,97.0,1.0,35.0,upscale,True,21.0,19.0,10.0,34.28935,92285


In [3]:
# Identify the categorical variables
categorical_columns = data.select_dtypes(include=['object', 'bool']).columns

# Convert categorical variables into dummy/indicator variables
data_with_dummies = pd.get_dummies(data, columns=categorical_columns, drop_first=True)

# Display the new dataframe with dummy variables
data_with_dummies.head()


Unnamed: 0,longitude,bedrooms,adr,Occupancy Rate LTM,Max Guests,bathrooms,Overall Rating,Minimum Stay,Cleaning Fee,Number of Reviews,...,Created Date_2023-07-13,Created Date_2023-07-14,Created Date_2023-07-15,Created Date_2023-07-17,Created Date_2023-07-22,Price Tier_economy,Price Tier_luxury,Price Tier_midscale,Price Tier_upscale,Has Kitchen_True
0,-116.52897,1.0,345.45,0.55,7.0,1.0,99.0,31.0,75.0,205.0,...,False,False,False,False,False,False,True,False,False,True
1,-116.35124,3.0,550.0,1.0,6.0,2.0,97.0,4.0,200.0,101.0,...,False,False,False,False,False,False,False,True,False,True
2,-116.328949,3.0,261.49,0.454795,8.0,2.0,96.738404,2.0,85.0,539.0,...,False,False,False,False,False,True,False,False,False,True
3,-116.43598,3.0,417.13,0.75,6.0,2.5,99.0,2.0,200.0,333.0,...,False,False,False,False,False,False,False,False,True,True
4,-116.39154,1.0,224.0,0.157068,2.0,1.0,97.0,1.0,35.0,21.0,...,False,False,False,False,False,False,False,False,True,True


The categorical variables have been converted into dummy variables. As a result, you can see that the dataset now has a significantly increased number of columns. This is because each unique value in the original categorical columns has been converted into a new binary column (0 or 1).

Next, I will standardize the numeric features. Standardization involves rescaling the features so that they have a mean of 0 and a standard deviation of 1. This is especially important when different features have different ranges as it can have a significant impact on the performance of machine learning models that are sensitive to the scale of the data, like linear models or k-nearest neighbors.

In [4]:
from sklearn.preprocessing import StandardScaler

# Identify numeric columns (excluding the target variable "Occupancy Rate LTM")
numeric_columns = data_with_dummies.select_dtypes(include=['float64', 'int64']).columns.drop('Occupancy Rate LTM')

# Initialize the StandardScaler
scaler = StandardScaler()

# Standardize the numeric columns
data_with_dummies[numeric_columns] = scaler.fit_transform(data_with_dummies[numeric_columns])

# Display the scaled dataframe
data_with_dummies.head()


Unnamed: 0,longitude,bedrooms,adr,Occupancy Rate LTM,Max Guests,bathrooms,Overall Rating,Minimum Stay,Cleaning Fee,Number of Reviews,...,Created Date_2023-07-13,Created Date_2023-07-14,Created Date_2023-07-15,Created Date_2023-07-17,Created Date_2023-07-22,Price Tier_economy,Price Tier_luxury,Price Tier_midscale,Price Tier_upscale,Has Kitchen_True
0,-1.734874,-1.032259,0.022291,0.55,0.484728,-0.939955,0.371059,1.994525,-0.699211,0.936067,...,False,False,False,False,False,False,True,False,False,True
1,0.519302,0.523524,0.678816,1.0,0.13665,0.166122,-0.038107,-0.192429,0.395466,0.115009,...,False,False,False,False,False,False,False,True,False,True
2,0.802023,0.523524,-0.247188,0.454795,0.832806,0.166122,-0.091625,-0.354425,-0.611637,3.572926,...,False,False,False,False,False,True,False,False,False,True
3,-0.555468,0.523524,0.252356,0.75,0.13665,0.71916,0.371059,-0.354425,0.395466,1.9466,...,False,False,False,False,False,False,False,False,True,True
4,0.008172,-1.032259,-0.367516,0.157068,-1.255661,-0.939955,-0.038107,-0.435423,-1.049508,-0.516574,...,False,False,False,False,False,False,False,False,True,True


The numeric features have been standardized, except for the target variable "Occupancy Rate LTM," which we do not want to scale since it's the target for prediction.

The final preprocessing step is to split the dataset into training and testing sets. This allows us to train our machine learning model on one subset of the data and then test its performance on a separate subset that the model hasn't seen before.

We will use train_test_split from scikit-learn to split the dataset.

In [5]:
from sklearn.model_selection import train_test_split

# Define the features and the target variable
X = data_with_dummies.drop('Occupancy Rate LTM', axis=1)  # Features
y = data_with_dummies['Occupancy Rate LTM']                # Target variable

# Split the dataset into training set and test set with a test size of 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Display the size of the training and testing sets
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


((3156, 1699), (1052, 1699), (3156,), (1052,))

Training features (X_train): 3156 samples, 1699 features

Testing features (X_test): 1052 samples, 1699 features

Training target (y_train): 3156 samples

Testing target (y_test): 1052 samples

The preprocessing steps are now complete, and you have a prepared dataset ready for model training.

We are done with preprocessing (feature engineering not done on this dataset, albeit that would've been important)

Let's move onto modeling!

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Linear Regression model
linear_model = LinearRegression()

# Fit the model with the training dataset
linear_model.fit(X_train, y_train)

# Predict the target variable for the testing set
y_pred_linear = linear_model.predict(X_test)

# Evaluate the model performance
linear_mse = mean_squared_error(y_test, y_pred_linear)
linear_r2 = r2_score(y_test, y_pred_linear)

print(f"Linear Regression MSE: {linear_mse}")
print(f"Linear Regression R-squared: {linear_r2}")


Linear Regression MSE: 3.116163013178e+20
Linear Regression R-squared: -5.127382421446022e+21


In [7]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest Regressor
rf_model = RandomForestRegressor(random_state=42)

# Fit the model with the training dataset
rf_model.fit(X_train, y_train)

# Predict the target variable for the testing set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model performance
rf_mse = mean_squared_error(y_test, y_pred_rf)
rf_r2 = r2_score(y_test, y_pred_rf)

print(f"Random Forest MSE: {rf_mse}")
print(f"Random Forest R-squared: {rf_r2}")


Random Forest MSE: 0.04219514334594888
Random Forest R-squared: 0.30571463897272344


In [8]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize the Gradient Boosting Regressor
gb_model = GradientBoostingRegressor(random_state=42)

# Fit the model with the training dataset
gb_model.fit(X_train, y_train)

# Predict the target variable for the testing set
y_pred_gb = gb_model.predict(X_test)

# Evaluate the model performance
gb_mse = mean_squared_error(y_test, y_pred_gb)
gb_r2 = r2_score(y_test, y_pred_gb)

print(f"Gradient Boosting MSE: {gb_mse}")
print(f"Gradient Boosting R-squared: {rf_r2}")


Gradient Boosting MSE: 0.04405501916053564
Gradient Boosting R-squared: 0.30571463897272344


I'll now run the code for the Linear Regression model, Random Forest Regressor, and Gradient Boosting Regressor one by one and show their baseline performance metrics. 

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Linear Regression model
linear_model = LinearRegression()

# Fit the model with the training dataset
linear_model.fit(X_train, y_train)

# Predict the target variable for the testing set
y_pred_linear = linear_model.predict(X_test)

# Evaluate the model performance
linear_mse = mean_squared_error(y_test, y_pred_linear)
linear_r2 = r2_score(y_test, y_pred_linear)

linear_mse, linear_r2


(3.116163013178e+20, -5.127382421446022e+21)

In [10]:
from sklearn.ensemble import RandomForestRegressor

# Initialize the Random Forest Regressor with default parameters
rf_model = RandomForestRegressor(random_state=42)

# Fit the model with the training dataset
rf_model.fit(X_train, y_train)

# Predict the target variable for the testing set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the model performance
rf_mse = mean_squared_error(y_test, y_pred_rf)
rf_r2 = r2_score(y_test, y_pred_rf)

rf_mse, rf_r2


(0.04219514334594888, 0.30571463897272344)

In [11]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize the Gradient Boosting Regressor with default parameters
gb_model = GradientBoostingRegressor(random_state=42)

# Fit the model with the training dataset
gb_model.fit(X_train, y_train)

# Predict the target variable for the testing set
y_pred_gb = gb_model.predict(X_test)

# Evaluate the model performance
gb_mse = mean_squared_error(y_test, y_pred_gb)
gb_r2 = r2_score(y_test, y_pred_gb)

print(f"Gradient Boosting MSE: {gb_mse}")
print(f"Gradient Boosting R-squared: {gb_r2}")


Gradient Boosting MSE: 0.04405501916053564
Gradient Boosting R-squared: 0.27511195702875413


# Model Performance Comparison

After fitting three different models to the dataset, we evaluated their performance using Mean Squared Error (MSE) and R-squared metrics. Below is a summary of their performance:

## Linear Regression Model:
- **MSE**: \( 3.116163013178 \times 10^{20} \) - This value is extremely high, which indicates very poor model performance.
- **R-squared**: \( -5.127382421446022 \times 10^{21} \) - The negative value here suggests that the model is performing worse than a model that would simply predict the mean value of the target variable for all observations.

## Random Forest Regressor:
- **MSE**: 0.04219514334594888 - This is a relatively low value, indicating a better fit than the Linear Regression model.
- **R-squared**: 0.30571463897272344 - This positive value means the model explains approximately 30.57% of the variance in the target variable, which is decent but indicates there might be room for improvement.

## Gradient Boosting Regressor:
- **MSE**: 0.04405501916053564 - Similar to the Random Forest, this MSE is also low, but slightly higher than that of the Random Forest model.
- **R-squared**: 0.27511195702875413 - This is slightly lower than that of the Random Forest model, indicating that it explains about 27.51% of the variance in the target.

# Conclusion
The Linear Regression model is not suitable for this dataset, which might be due to violations of the linear regression assumptions by the data.

The Random Forest and Gradient Boosting models show much better performance, with Random Forest having a slight edge. Given these results, hyperparameter tuning for the Random Forest model could be a beneficial next step to improve its performance. Alternatively, if time permits, exploring hyperparameter tuning for the Gradient Boosting model could also be worthwhile.


In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define a grid of hyperparameter ranges
rf_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize a Random Forest Regressor
rf = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV
rf_grid_search = GridSearchCV(estimator=rf, param_grid=rf_params, cv=3, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error')

# Fit the GridSearchCV to the data
rf_grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters found: ", rf_grid_search.best_params_)
print("Best score found: ", rf_grid_search.best_score_)


Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best parameters found:  {'max_depth': None, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 200}
Best score found:  -0.04265617783956185


The best score is slightly better than the one from the untuned model (remember that GridSearchCV uses the negative mean squared error, so a higher value, closer to zero, is better). Here's how to interpret the results:

# Mathematical Insight into Random Forest Hyperparameter Tuning

Hyperparameter tuning for a Random Forest model via GridSearchCV is a systematic process that involves searching across a predefined range of hyperparameter values to find the combination that yields the best model performance. Here's how the mathematics and machine learning theory underpin the process:

A Random Forest is an ensemble of decision trees, each contributing a vote to the final prediction. The goal is to build a model that reduces overfitting (capturing noise as if it were a signal) and underfitting (failing to capture underlying patterns). The hyperparameters determine the structure and behavior of the trees within the forest:

- `n_estimators`: Number of trees in the forest. Increasing this number improves the robustness of the model by averaging more trees, but it also increases computational load.
- `max_depth`: Maximum depth of each tree. Unlimited depth (`None`) allows trees to grow until all leaves are pure or contain less than `min_samples_split` samples, capturing more complex patterns.
- `min_samples_split`: Minimum samples required to split a node. A higher number creates a simpler model by requiring more evidence to make a split.
- `min_samples_leaf`: Minimum samples that a leaf node must have. Increasing this number smooths the model, potentially reducing overfitting by avoiding leaves with very few samples.

The `GridSearchCV` process performs cross-validation to evaluate each model, which involves dividing the training data into `k` parts, training the model on `k-1` parts, and validating it on the remaining part. This cycle repeats `k` times with different parts held out for validation, and the results are averaged. This process ensures the model's performance is not dependent on a particular data split and provides a more accurate estimate of its performance on unseen data.

For our grid search, we defined a grid with 108 combinations of hyperparameters and performed 3-fold cross-validation, resulting in a total of 324 fits. The 'best' model is chosen based on the highest average cross-validation score across the folds, which for regression problems is typically the Mean Squared Error (MSE), but inverted (negated) since by convention, GridSearchCV treats higher values as better:

- `Best parameters found`: Indicates the hyperparameters that led to the best performing model on the validation sets.
- `Best score found`: Represents the negated MSE of the best model. Since MSE is a loss function (lower is better), its negation is used here to transform it into a score (higher is better).

In our case, the best parameters allowed the trees to grow to their natural depth (`max_depth`: None) while avoiding overfitting by requiring at least 4 samples at the leaf nodes (`min_samples_leaf`: 4) and allowing splits on nodes with at least 2 samples (`min_samples_split`: 2). With 200 trees in the forest (`n_estimators`: 200), the ensemble is robust while still being computationally feasible. The best score (negated MSE) of -0.04265617783956185 suggests that, on average, the squared difference between our model's predictions and the actual values is approximately 0.04265617783956185, when the sign is reversed.

The mathematics of this process aim to balance bias (error from erroneous assumptions in the learning algorithm) with variance (error from sensitivity to small fluctuations in the training set), achieving a model that is as accurate as possible on new, unseen data. The result of this grid search suggests we've found a reasonably good model within the searched parameter space for predicting occupancy rates LTM.



In [14]:
from sklearn.ensemble import GradientBoostingRegressor

# Define a grid of hyperparameter ranges
gb_params = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Initialize a Gradient Boosting Regressor
gb = GradientBoostingRegressor(random_state=42)

# Initialize GridSearchCV
gb_grid_search = GridSearchCV(estimator=gb, param_grid=gb_params, cv=3, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error')

# Fit the GridSearchCV to the data
gb_grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters found: ", gb_grid_search.best_params_)
print("Best score found: ", gb_grid_search.best_score_)


Fitting 3 folds for each of 108 candidates, totalling 324 fits
Best parameters found:  {'learning_rate': 0.1, 'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
Best score found:  -0.043226292073383764


Best Parameters:

learning_rate: 0.1 (the rate at which the model learns)
max_depth: 5 (the maximum depth of individual trees)
min_samples_leaf: 2 (the minimum number of samples required to be at a leaf node)
min_samples_split: 5 (the minimum number of samples required to split an internal node)
n_estimators: 100 (the number of boosting stages to be run)
Best Score:

−
0.043226292073383764
−0.043226292073383764: This is the mean squared error for the best Gradient Boosting model found by the grid search. It is slightly worse than the Random Forest best score of 
−
0.04265617783956185
−0.04265617783956185, but remember this is a cross-validated score on the training set, not the test set.


In [15]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Best parameters for Random Forest from grid search
rf_best_params = {
    'n_estimators': 200,
    'max_depth': None,
    'min_samples_split': 2,
    'min_samples_leaf': 4
}

# Initialize the Random Forest Regressor with the best parameters
tuned_rf_model = RandomForestRegressor(**rf_best_params, random_state=42)

# Fit the model with the training dataset
tuned_rf_model.fit(X_train, y_train)

# Predict the target variable for the testing set
y_pred_tuned_rf = tuned_rf_model.predict(X_test)

# Evaluate the tuned model performance
tuned_rf_mse = mean_squared_error(y_test, y_pred_tuned_rf)
tuned_rf_r2 = r2_score(y_test, y_pred_tuned_rf)

print(f"Tuned Random Forest MSE: {tuned_rf_mse}")
print(f"Tuned Random Forest R-squared: {tuned_rf_r2}")


Tuned Random Forest MSE: 0.04182190315674118
Tuned Random Forest R-squared: 0.31185599029814415


In [16]:
from sklearn.ensemble import GradientBoostingRegressor

# Best parameters for Gradient Boosting from grid search
gb_best_params = {
    'learning_rate': 0.1,
    'max_depth': 5,
    'min_samples_split': 5,
    'min_samples_leaf': 2,
    'n_estimators': 100
}

# Initialize the Gradient Boosting Regressor with the best parameters
tuned_gb_model = GradientBoostingRegressor(**gb_best_params, random_state=42)

# Fit the model with the training dataset
tuned_gb_model.fit(X_train, y_train)

# Predict the target variable for the testing set
y_pred_tuned_gb = tuned_gb_model.predict(X_test)

# Evaluate the tuned model performance
tuned_gb_mse = mean_squared_error(y_test, y_pred_tuned_gb)
tuned_gb_r2 = r2_score(y_test, y_pred_tuned_gb)

print(f"Tuned Gradient Boosting MSE: {tuned_gb_mse}")
print(f"Tuned Gradient Boosting R-squared: {tuned_gb_r2}")


Tuned Gradient Boosting MSE: 0.04229732980244848
Tuned Gradient Boosting R-squared: 0.3040332473428564
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  13.5s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=300; total time=  14.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=300; total time=  13.8s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time=   4.9s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  15.7s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=10, n_estimators=300; total time=  12.0s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=5, n_estimators=100; total time=   3.8s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=5, n_estimators=300; total time=  11.2s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=10, n_es

[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  13.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time=  10.0s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   4.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=300; total time=  13.3s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time=   5.0s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=  10.9s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=   8.4s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   4.0s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=2, n_estimators=300; total time=  11.1s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=5, n_es

[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=200; total time=  13.4s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time=  10.5s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=   8.2s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   4.3s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=300; total time=  15.9s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=5, n_estimators=300; total time=  15.2s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=2, n_estimators=200; total time=   7.6s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=5, n_estimators=200; total time=   7.6s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=10, n_estimators=100; total time=   3.6s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=10, n_es

[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   7.0s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=300; total time=  19.2s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=200; total time=   8.6s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=   9.9s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=  10.9s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=   4.6s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=10, n_estimators=300; total time=  12.0s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=2, n_estimators=300; total time=  11.8s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=10, n_estimators=200; total time=   6.9s
[CV] END max_depth=10, min_samples_leaf=1, min_samples_split=2, n_est

[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   7.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   4.8s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   4.9s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=300; total time=  14.7s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=100; total time=   4.2s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=  10.3s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=5, n_estimators=200; total time=  11.0s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=   8.6s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   3.9s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=2, n_esti

[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   7.3s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   5.1s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=5, n_estimators=200; total time=   9.6s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=100; total time=   4.5s
[CV] END max_depth=None, min_samples_leaf=1, min_samples_split=10, n_estimators=300; total time=  12.4s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=2, n_estimators=300; total time=  17.3s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=10, n_estimators=100; total time=   4.9s
[CV] END max_depth=None, min_samples_leaf=2, min_samples_split=10, n_estimators=300; total time=  12.1s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=5, n_estimators=100; total time=   3.7s
[CV] END max_depth=None, min_samples_leaf=4, min_samples_split=5, n_e

# Model Performance Evaluation

We have evaluated the performance of both untuned and tuned models using Mean Squared Error (MSE) and R-squared metrics. Here's how the tuned models compare to their untuned counterparts:

## Untuned Models Performance:

### Random Forest Regressor:
- **MSE**: 0.04219514334594888
- **R-squared**: 0.30571463897272344

### Gradient Boosting Regressor:
- **MSE**: 0.04405501916053564
- **R-squared**: 0.27511195702875413

## Tuned Models Performance:

### Tuned Random Forest Regressor:
- **MSE**: 0.04182190315674118
- **R-squared**: 0.31185599029814415

### Tuned Gradient Boosting Regressor:
- **MSE**: 0.04229732980244848
- **R-squared**: 0.3040332473428564

## Performance Comparison Summary:

After hyperparameter tuning, both models showed improvement over their untuned versions:

1. **Random Forest Regressor**:
   - The tuned Random Forest model exhibited a slight but noticeable improvement in performance with a reduction in MSE and an increase in R-squared value compared to the untuned model.

2. **Gradient Boosting Regressor**:
   - Similar to the Random Forest, the tuned Gradient Boosting model demonstrated improved performance with a lower MSE and a higher R-squared value than its untuned version.

Based on these results, the **tuned Random Forest model** stands out as the best performing model for this project. It has the lowest MSE and the highest R-squared value, indicating a better fit to the test data and stronger predictive power.

These metrics suggest that the hyperparameter tuning was successful in enhancing the model performance, and the tuned Random Forest model should be selected as the final model for predicting occupancy rates LTM for AirBnb rental property data.


# Evaluation of Model Performance in Business Context

In the role of a Machine Learning Engineer, our mandate extends beyond the construction of statistically robust models to also encompass their applicability and value in real-world business scenarios. This means critically evaluating standard performance metrics, such as Mean Squared Error (MSE) and R-squared, to determine if they genuinely signify models that can make accurate and actionable predictions. Let's consider the implications of these metrics for the business and dissect the mathematical underpinnings of the fine-tuning process when the metrics might not be adequate indicators of real-world performance.

## Business Implications of Model Metrics:

### Mean Squared Error (MSE):
- **MSE**: Commonly, a lower MSE is preferred as it indicates that the model's predictions are closer to the actual data. However, in the context of predicting occupancy rates LTM, an MSE of 0.0418 or 0.0423 might not be sufficiently low. This is particularly true if our dataset has a ceiling effect, where a significant portion of the occupancy rates are clustered at the higher end (e.g., 90%+ occupancy). In such a scenario, the model may not differentiate well among the high-occupancy properties, which are of particular interest for understanding and forecasting revenue potentials and making investment decisions.

### R-squared:
- **R-squared**: While an R-squared value of around 0.31 indicates that the model accounts for approximately 31% of the variability in the occupancy rates, this might not translate into reliable predictability in a business environment. If most occupancy rates are clustered at a high level, the R-squared value might be inflated without providing meaningful insights into the nuances of the data needed for precise business forecasting.

## Mathematical Considerations in Fine-Tuning:

The process of fine-tuning, or hyperparameter optimization, is theoretically sound for enhancing model performance. However, its effectiveness can be compromised in practice by dataset-specific challenges:

- **Grid Search**: We used grid search to comb through a predefined hyperparameter space. While this approach is thorough, it assumes that the model form is appropriate for the data. A ceiling effect in our data could mean that no combination of parameters would yield a significantly better model.
  
- **Cross-Validation**: Employing cross-validation reduces the risk of overfitting, but it doesn't necessarily address the underlying issue if the model is unable to capture the central characteristics of the dataset due to a skewed distribution of target values.

- **Optimization Objective**: Our objective function, minimizing the negative MSE, may not be the right target if it doesn't account for the distribution of the occupancy rates. In the presence of a ceiling effect, a different loss function or a transformation of the target variable might be more appropriate.

- **Final Model Selection**: The selection of the model based on the best hyperparameters assumes that the grid search has identified a model that can generalize well. However, with a skewed target distribution, the model might perform well statistically but still lack practical predictive power.

## Conclusion:

Given the observed ceiling effect where most occupancy rates cluster at the higher end, the MSE and R-squared values from the fine-tuned models do not indicate a level of predictive accuracy that would be deemed sufficient for real-world business applications. This suggests that the current models might not effectively capture the intricacies necessary for reliable occupancy rate LTM predictions. It highlights the need for either a different modeling approach that can handle such skewed distributions more adeptly or a reevaluation of the target variable's representation in the model. Therefore, while the fine-tuned Random Forest model may provide the best performance among the models tested, caution should be exercised before deploying it for strategic decision-making. Continuous monitoring and further investigation into alternative modeling strategies are advised to ensure the models' predictions remain relevant and valuable for the business.
