In [1]:
import pandas as pd
import numpy as np
from imblearn.over_sampling import RandomOverSampler

In [2]:
# Import dataset
hotel_df = pd.read_csv("./data/cleaned-hotel-reservations.csv")
hotel_df.head(3)

Unnamed: 0,no_of_adults,no_of_children,no_of_weekend_nights,no_of_week_nights,type_of_meal_plan,required_car_parking_space,room_type_reserved,lead_time,arrival_year,arrival_month,arrival_date,market_segment_type,repeated_guest,no_of_previous_cancellations,no_of_previous_bookings_not_canceled,avg_price_per_room,no_of_special_requests,booking_status
0,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.0,0,Not_Canceled
1,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.0,0,Canceled


## Examine the class label imbalance

Let's look at the dataset imbalance:

In [3]:
# Encode target column: 'Not_Canceled' as 0 and 'Canceled' as 1
hotel_df['booking_status'] = hotel_df['booking_status'].map({'Not_Canceled': 0, 'Canceled': 1})

In [4]:
# Count the occurrences of each class in the original data
not_canceled, canceled = np.bincount(hotel_df['booking_status'])

# Calculate the total number of samples
total = not_canceled + canceled

# Print the class distribution
print('Samples:\n    Total: {}\n    Canceled: {} ({:.2f}% of total)'.format(
    total, canceled, 100 * canceled / total))

Samples:
    Total: 36275
    Canceled: 11885 (32.76% of total)


Canceled bookings (the positive class) make up only about a third of the original dataset (32.76%), so the classes are imbalanced. A model trained on this data may be biased toward the majority class, 'Not_Canceled'. Hence, we explore techniques to balance the original dataset.
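To make the imbalance concrete, here is a small arithmetic sketch based only on the counts printed above. The variable names `imbalance_ratio` and `majority_baseline` are introduced here for illustration and are not part of the original notebook. The baseline is the accuracy a trivial classifier would reach by always predicting 'Not_Canceled', which is why accuracy alone can be misleading on this data.

```python
# Counts taken from the printed output above
total = 36275
canceled = 11885
not_canceled = total - canceled

# Majority-to-minority ratio
imbalance_ratio = not_canceled / canceled

# Accuracy of a trivial classifier that always predicts 'Not_Canceled'
majority_baseline = not_canceled / total

print('Not canceled: {}'.format(not_canceled))
print('Imbalance ratio: {:.2f} : 1'.format(imbalance_ratio))
print('Majority-class baseline accuracy: {:.2%}'.format(majority_baseline))
```

There are roughly two non-canceled bookings for every canceled one.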

## Random Oversampling

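We perform random oversampling on the minority class (canceled bookings): randomly selected minority rows are duplicated, with replacement, until both classes have the same number of samples.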

In [5]:
# Split X and y columns
X = hotel_df.drop('booking_status', axis=1)
y = hotel_df['booking_status']

In [6]:
# Create the RandomOverSampler
oversampler = RandomOverSampler(random_state=42)

# Oversample the data
X_oversampled, y_oversampled = oversampler.fit_resample(X, y)
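Conceptually, random oversampling draws minority rows with replacement until every class matches the size of the largest one. The following is a minimal pure-Python sketch of that idea, not imblearn's actual implementation. The function `random_oversample` and the toy `rows` and `labels` are hypothetical and used only for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Start from the original data; only copies are appended
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        # Sample with replacement; k is 0 for the majority class
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (hypothetical): four 'Not_Canceled' rows (0) and one 'Canceled' row (1)
rows = [[224, 65.0], [5, 106.68], [1, 60.0], [12, 80.0], [30, 95.5]]
labels = [0, 0, 1, 0, 0]

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Every appended row is an exact copy of an existing minority row, so no new information is created; the minority class simply carries more weight.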

In [7]:
# Count the occurrences of each class in the oversampled data
not_canceled_oversampled, canceled_oversampled = np.bincount(y_oversampled)

# Calculate the total number of samples
total_oversampled = not_canceled_oversampled + canceled_oversampled

# Print the class distribution
print('Samples after oversampling:\n Total: {}\n Canceled: {} ({:.2f}% of total)'.format(
    total_oversampled, canceled_oversampled, 100 * canceled_oversampled / total_oversampled))

Samples after oversampling:
 Total: 48780
 Canceled: 24390 (50.00% of total)


The dataset is now balanced: both classes contain 24390 samples.
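As a sanity check, the printed counts before and after oversampling can be reconciled with a few lines of arithmetic. This sketch uses only the numbers shown in the outputs above; `rows_added` and `duplicate_share` are illustrative names, not part of the original notebook.

```python
# Counts taken from the printed outputs above
total_before, canceled_before = 36275, 11885
total_after, canceled_after = 48780, 24390

not_canceled = total_before - canceled_before  # majority class size
rows_added = total_after - total_before        # rows created by oversampling

# Oversampling only adds minority rows, up to the majority class size
assert canceled_after == not_canceled
assert rows_added == not_canceled - canceled_before

# Fraction of canceled rows in the balanced data that are repeated draws
duplicate_share = rows_added / canceled_after
print('Rows added: {}'.format(rows_added))
print('Share of canceled rows that are repeats: {:.2%}'.format(duplicate_share))
```

Slightly more than half of the canceled rows in the balanced dataset are repeated draws from the original 11885.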

In [8]:
# Convert to a dataframe
oversampled_data = pd.concat([pd.DataFrame(X_oversampled), pd.DataFrame(y_oversampled)], axis=1)

# Decode target column
oversampled_data['booking_status'] = oversampled_data['booking_status'].map({0: 'Not_Canceled', 1: 'Canceled'})

In [9]:
# Save as CSV; index=False keeps the row index from being written as an extra unnamed column
oversampled_data.to_csv("./data/oversampled-hotel-reservations.csv", index=False)