In [1]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [2]:
# load data
df = pd.read_csv('../../data/H2.csv')

First, I want to create a split for the second solution: ***Booking Cancellation Prediction***

* The target variable is `IsCanceled`, so *stratified splitting* can be used to preserve the proportions of both classes (cancelled = 1, not cancelled = 0) in the train and test sets.

In [3]:
# Analyze the class distribution of the target variable (IsCanceled)
df['IsCanceled'].value_counts(normalize=True)

IsCanceled
0    0.58273
1    0.41727
Name: proportion, dtype: float64

- 0 (reservations that were not cancelled): 58.27%
- 1 (cancelled reservations): 41.73%

At roughly 58/42, the classes are only mildly imbalanced. Stratified splitting still guarantees that this distribution is preserved in both the train and test sets.

In [None]:
from sklearn.model_selection import train_test_split

# Stratified splitting into train and test sets (80% train, 20% test)
train_data_strat, test_data_strat = train_test_split(
    df, 
    test_size=0.2, 
    stratify=df['IsCanceled'], 
    random_state=42
)

# Check the resulting split sizes and class distribution in both sets
train_size = train_data_strat.shape[0]
test_size = test_data_strat.shape[0]
train_class_distribution = train_data_strat['IsCanceled'].value_counts(normalize=True)
test_class_distribution = test_data_strat['IsCanceled'].value_counts(normalize=True)

train_size, test_size, train_class_distribution, test_class_distribution


(63464,
 15866,
 IsCanceled
 0    0.582724
 1    0.417276
 Name: proportion, dtype: float64,
 IsCanceled
 0    0.582756
 1    0.417244
 Name: proportion, dtype: float64)

The class distribution is preserved across the train and test sets (about 58.27% / 41.73% in each), so the split is suitable for model training.
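To illustrate what stratification buys, here is a minimal sketch that compares a stratified and a plain random split on synthetic data. The `toy` frame, its `feature` column, and the `positive_rate_gap` helper are hypothetical stand-ins for illustration, not part of the H2 dataset or of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the booking data (hypothetical, not H2.csv):
# 10,000 rows with roughly 42% positives.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=10_000),
    "IsCanceled": (rng.random(10_000) < 0.42).astype(int),
})

def positive_rate_gap(train, test, target="IsCanceled"):
    """Absolute difference in the positive-class share between two sets."""
    return abs(train[target].mean() - test[target].mean())

# Same 80/20 split twice: once stratified on the target, once purely random.
strat_train, strat_test = train_test_split(
    toy, test_size=0.2, stratify=toy["IsCanceled"], random_state=42
)
rnd_train, rnd_test = train_test_split(toy, test_size=0.2, random_state=42)

gap_strat = positive_rate_gap(strat_train, strat_test)
gap_rnd = positive_rate_gap(rnd_train, rnd_test)
print(f"stratified gap: {gap_strat:.5f}, random gap: {gap_rnd:.5f}")
```

With stratification the gap is bounded by rounding alone, while a random split leaves it to chance. On a large, mildly imbalanced set the random gap is usually small as well, so the benefit grows as the data gets smaller or the imbalance gets stronger.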

For ***Customer Segmentation***, a plain random split is sufficient, since there is no target variable whose class balance needs to be preserved.

In [7]:
train_data_rnd, test_data_rnd = train_test_split(df, test_size=0.2, random_state=42)

train_data_rnd.shape, test_data_rnd.shape

((63464, 31), (15866, 31))

***Dynamic pricing*** uses historical data to make pricing decisions about future bookings. When separating the data, it is therefore important both to preserve temporal order and to check that the model generalizes to future events.

Since we aim to predict the future based on past data, we should divide the dataset according to date. For example, we can use one date range as a training set and the next range as a test set. This method is the most suitable strategy for a dynamic pricing scenario.

In [9]:
# Convert the 'ReservationStatusDate' column to a datetime object
df['ReservationStatusDate'] = pd.to_datetime(df['ReservationStatusDate'])

# Sort the data by the 'ReservationStatusDate' column
data = df.sort_values(by='ReservationStatusDate')

# Split the data into train and test sets based on a specific date
split_date = '2017-01-01'  # Example cutoff date
train_data_date = data[data['ReservationStatusDate'] < split_date]
test_data_date = data[data['ReservationStatusDate'] >= split_date]

train_data_date.shape, test_data_date.shape

((55215, 31), (24115, 31))

In [13]:
# Inspect the first and last dates on each side of the cutoff
(
    train_data_date['ReservationStatusDate'].head(),
    train_data_date['ReservationStatusDate'].tail(),
    test_data_date['ReservationStatusDate'].head(),
    test_data_date['ReservationStatusDate'].tail(),
)

(33657   2014-10-17
 33763   2014-10-17
 33764   2014-10-17
 33765   2014-10-17
 33766   2014-10-17
 Name: ReservationStatusDate, dtype: datetime64[ns],
 63687   2016-12-31
 63691   2016-12-31
 63692   2016-12-31
 63693   2016-12-31
 23176   2016-12-31
 Name: ReservationStatusDate, dtype: datetime64[ns],
 63761   2017-01-01
 63760   2017-01-01
 25961   2017-01-01
 23299   2017-01-01
 32982   2017-01-01
 Name: ReservationStatusDate, dtype: datetime64[ns],
 79325   2017-09-06
 79328   2017-09-07
 79326   2017-09-07
 79327   2017-09-07
 79329   2017-09-07
 Name: ReservationStatusDate, dtype: datetime64[ns])
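The fixed cutoff above puts about 70% of the rows in the training set (55,215 of 79,330). If a share closer to 80/20 is wanted, one option is to derive the cutoff from the data instead of hard-coding it. The sketch below is an assumption-laden illustration: `time_split`, the `toy` frame, and its `price` column are hypothetical and run on synthetic dates, not on H2.csv.

```python
import numpy as np
import pandas as pd

def time_split(frame, date_col, train_frac=0.8):
    """Split chronologically so that roughly `train_frac` of rows precede the cutoff."""
    ordered = frame.sort_values(date_col)
    # The date of the row sitting at the train_frac position becomes the cutoff.
    cutoff = ordered[date_col].iloc[int(len(ordered) * train_frac)]
    # Rows dated exactly on the cutoff all go to the test set,
    # so no single day is shared between train and test.
    train = ordered[ordered[date_col] < cutoff]
    test = ordered[ordered[date_col] >= cutoff]
    return train, test, cutoff

# Synthetic stand-in (hypothetical): 5,000 rows spread over 1,000 days.
rng = np.random.default_rng(0)
offsets = pd.to_timedelta(rng.integers(0, 1000, size=5_000), unit="D")
toy = pd.DataFrame({
    "ReservationStatusDate": pd.to_datetime("2014-10-17") + offsets,
    "price": rng.normal(100, 20, size=5_000),
})

train, test, cutoff = time_split(toy, "ReservationStatusDate")

# Leakage check: every training date must be strictly earlier than every test date.
assert train["ReservationStatusDate"].max() < test["ReservationStatusDate"].min()
print(cutoff.date(), len(train), len(test))
```

Because all rows on the cutoff day are assigned to the test set, the training share can come out slightly below `train_frac`. That trade-off keeps the boundary clean at day granularity.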