## Scenario

A common problem in the hotel business is customers cancelling their room reservations. Some guests who cancel notify the hotel in advance, but many do not, which leaves rooms vacant that would otherwise have been occupied. Cancellations affect hotels in several ways: lost revenue when the room cannot be resold; a reduced profit margin when the price is lowered at the last minute to resell the room; and additional costs for publicity to help sell these rooms. Being able to predict which bookings are likely to be cancelled would therefore reduce the impact of cancellations on a hotel. The manager of a hotel has given you the task of detecting and investigating the three most problematic areas and providing solutions for each, with recommendations based on the overall results you obtain. You are required to use the machine learning techniques (regression/classification) taught in the Big Data Analytics module.

In [8]:
#Required modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
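
The imports above suggest a workflow of splitting the data, selecting features with a random forest, and scoring a classifier by accuracy. The following is only a minimal sketch of how those pieces might fit together. It uses synthetic data from `make_classification` (an assumption, not the hotel dataset), and the hyperparameters are illustrative rather than the notebook's own settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the booking features and the is_canceled target (assumption)
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Keep only features whose importance is at least the mean importance (the default threshold)
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Fit the final classifier on the selected features and score it on held-out data
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_sel, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test_sel))
print(f"kept {X_train_sel.shape[1]} features, accuracy = {accuracy:.3f}")
```

Fitting the selector on the training split only keeps information from the test split out of the feature selection step.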


In [21]:
#load dataset
df = pd.read_csv('./dataset/hotel-reservations.csv')
df.head(50)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
5,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03
6,Resort Hotel,0,0,2015,July,27,1,0,2,2,...,No Deposit,,,0,Transient,107.0,0,0,Check-Out,2015-07-03
7,Resort Hotel,0,9,2015,July,27,1,0,2,2,...,No Deposit,303.0,,0,Transient,103.0,0,1,Check-Out,2015-07-03
8,Resort Hotel,1,85,2015,July,27,1,0,3,2,...,No Deposit,240.0,,0,Transient,82.0,0,1,Canceled,2015-05-06
9,Resort Hotel,1,75,2015,July,27,1,0,3,2,...,No Deposit,15.0,,0,Transient,105.5,0,0,Canceled,2015-04-22


In [10]:
#Separate numerical and categorical columns
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

#numerical_cols
categorical_cols

Index(['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment',
       'distribution_channel', 'reserved_room_type', 'assigned_room_type',
       'deposit_type', 'customer_type', 'reservation_status',
       'reservation_status_date'],
      dtype='object')

In [14]:
#impute missing values for numerical columns with the mean
numerical_imputer = SimpleImputer(strategy='mean')
df[numerical_cols] = numerical_imputer.fit_transform(df[numerical_cols])


In [None]:
#impute missing values for categorical columns with the most frequent value
categorical_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = categorical_imputer.fit_transform(df[categorical_cols])
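
The two imputation steps can be checked end to end on a small frame before running them on the full dataset. The sketch below uses a made-up toy DataFrame (its values are assumptions for illustration only) and fills numerical gaps with the column mean and categorical gaps with the most frequent value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the hotel data; values are invented for illustration
toy = pd.DataFrame({
    "lead_time": [10.0, np.nan, 30.0, 20.0],
    "country": pd.Series(["PRT", np.nan, "PRT", "GBR"], dtype="object"),
    "meal": pd.Series(["BB", "BB", np.nan, "HB"], dtype="object"),
})

num_cols = toy.select_dtypes(include=["float64", "int64"]).columns
cat_cols = toy.select_dtypes(include=["object"]).columns

# Mean for numerical columns, most frequent value for categorical columns
toy[num_cols] = SimpleImputer(strategy="mean").fit_transform(toy[num_cols])
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])

print(toy)
print("remaining missing values:", int(toy.isna().sum().sum()))
```

Note that selecting columns by dtype also picks up the target `is_canceled` and date-like text columns such as `reservation_status_date`, so it may be worth excluding those explicitly before imputing.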

In [16]:
#Task a. Identify the top six countries and months with a high rate of cancellation
#Group by 'country', 'arrival_date_month', and calculate cancellation rates
grouped_data = df.groupby(['country', 'arrival_date_month'])['is_canceled'].mean().reset_index()
#grouped_data

In [19]:
#sort by cancellation rate in descending order
top_countries = grouped_data.sort_values(by='is_canceled', ascending=False).head(6)
top_countries

Unnamed: 0,country,arrival_date_month,is_canceled
583,JEY,September,1.0
127,BIH,December,1.0
1131,VEN,April,1.0
620,KHM,April,1.0
619,KEN,March,1.0
616,KEN,December,1.0
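
Every row above has a cancellation rate of exactly 1.0, which may simply mean that those country and month groups contain very few bookings (the output does not show group sizes, so this is an assumption). One possible refinement, sketched here on made-up data, is to compute the booking count alongside the rate and rank only groups above a minimum size. The threshold `MIN_BOOKINGS` is a hypothetical value chosen for illustration.

```python
import pandas as pd

# Made-up bookings; in the notebook this would be df
bookings = pd.DataFrame({
    "country": ["AAA", "BBB", "BBB", "BBB", "BBB", "CCC", "CCC", "CCC", "CCC"],
    "arrival_date_month": ["May"] + ["June"] * 4 + ["July"] * 4,
    "is_canceled": [1, 1, 1, 1, 0, 0, 0, 1, 0],
})

MIN_BOOKINGS = 3  # hypothetical threshold; tune to the real data

# Cancellation rate and number of bookings per (country, month) group
stats = (
    bookings.groupby(["country", "arrival_date_month"])["is_canceled"]
    .agg(cancel_rate="mean", bookings="count")
    .reset_index()
)

# Drop small groups, then rank by cancellation rate
ranked = (
    stats[stats["bookings"] >= MIN_BOOKINGS]
    .sort_values("cancel_rate", ascending=False)
)
print(ranked.head(6))
```

Here the single-booking group `AAA` is excluded even though its rate is 1.0, so the ranking reflects groups with enough bookings to be informative.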
