# **Project Name**    -  **Hotel Booking Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

This project aims to perform an exploratory data analysis (EDA) on a hotel booking dataset to gain insights and understand patterns in hotel bookings. The dataset consists of booking information for a city hotel and a resort hotel, including details such as booking dates, length of stay, cancellation status, customer demographics, and other relevant factors. All personally identifying information has been removed from the data to ensure privacy.

The primary objective of this project is to explore the dataset using EDA techniques and extract meaningful insights that can assist hotel management in making informed decisions, optimizing booking strategies, and improving customer satisfaction.

# **GitHub Link -**

https://github.com/arshadmujawar2408/Hotel-Booking-Analysis-EDA

# **Problem Statement**


Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. Explore and analyse the data to discover important factors that govern the bookings.



#### **Define Your Business Objective?**

Understand booking patterns, customer preferences, and market trends to make informed decisions that maximize revenue, optimize pricing strategies, improve customer satisfaction, and enhance overall hotel performance.

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
dataset=pd.read_csv("https://raw.githubusercontent.com/arshadmujawar2408/Hotel-Booking-Analysis-EDA/main/Hotel%20Bookings.csv")

### Dataset First View

In [None]:
# Dataset First Look
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(dataset[dataset.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(dataset.isnull().sum())

In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(dataset.isnull(), cbar=False)

### What did you know about your dataset?

This data set contains booking information for a city hotel and a resort hotel and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things.

The dataset has 119390 rows and 32 columns. There are 129421 missing values and 31994 duplicate rows in the dataset.

The dataset has 32 variables (continuous and categorical), with one identified dependent variable (categorical): 'is_canceled'.
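The figures quoted above (rows, columns, missing values, duplicates) can be gathered in one place. The following is a minimal sketch on a made-up four-row table; `toy` and the helper `overview` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the bookings table (not the real data)
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "lead_time": [10, 10, 45, 3],
    "agent": [9.0, 9.0, np.nan, np.nan],
    "is_canceled": [0, 0, 1, 0],
})

def overview(df):
    """Return row count, column count, total missing cells and duplicate rows."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_cells": int(df.isnull().sum().sum()),   # NaN cells over all columns
        "duplicate_rows": int(df.duplicated().sum()),    # rows repeating an earlier row
    }

summary = overview(toy)
print(summary)
```

Calling `overview(dataset)` on the loaded bookings data should give the same kind of summary, assuming `dataset` is the DataFrame loaded earlier.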

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe(include='all')

### Variables Description

* **hotel:** Hotel (H1 = Resort Hotel or H2 = City Hotel)

* **is_canceled:** Value indicating if the booking was canceled (1) or not (0)

* **lead_time:** Number of days that elapsed between the entering date of the booking into the PMS and the arrival date

* **arrival_date_year:** Year of arrival date

* **arrival_date_month:** Month of arrival date

* **arrival_date_week_number:** Week number of year for arrival date

* **arrival_date_day_of_month:** Day of arrival date

* **stays_in_weekend_nights:** Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

* **stays_in_week_nights:** Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel

* **adults:** Number of adults

* **children:** Number of children

* **babies:** Number of babies

* **meal:** Kind of meal opted for

* **country:** Country code

* **market_segment:** Which segment the customer belongs to

* **distribution_channel:** How the customer accessed the stay (Corporate booking/Direct/TA/TO)

* **is_repeated_guest:** Whether the guest is a repeated guest (1) or is coming for the first time (0)

* **previous_cancellations:** Number of previous bookings canceled by the customer

* **previous_bookings_not_canceled:** Number of previous bookings not canceled by the customer

* **reserved_room_type:** Type of room reserved

* **assigned_room_type:** Type of room assigned

* **booking_changes:** Count of changes made to the booking

* **deposit_type:** Deposit type

* **agent:** Agent through which the booking was made

* **days_in_waiting_list:** Number of days the booking was on the waiting list

* **customer_type:** Type of customer

* **required_car_parking_spaces:** Number of car parking spaces required

* **total_of_special_requests:** Number of additional special requests

* **reservation_status:** Last status of the reservation

* **reservation_status_date:** Date on which the last status was set




### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in dataset.columns.tolist():
  print("No. of unique values in ",i,"is",dataset[i].nunique(),".")

## 3. ***Data Cleaning and Manipulation***

1. Removing duplicate rows if any.

In [None]:
# Finding duplicate values in dataset
dataset[dataset.duplicated()].shape

In [None]:
# Dropping duplicate values
dataset.drop_duplicates(inplace = True)

In [None]:
dataset.shape

2. Handling missing values.

In [None]:
# Lets drop columns with high missing values
# The column "agent" has 16,340 missing values/NaN values.
# The column "company" has 112,593 missing values/NaN values.
dataset=dataset.drop(['agent','company'],axis=1)

In [None]:
# Remove null values from dataset
dataset=dataset.dropna(axis=0)

In [None]:
# Count the remaining missing values (null values) in each column of the dataset.
dataset.isnull().sum()

3. Identify Continuous and Categorical Variables


In [None]:
# Identify Continuous and Categorical Variables
def var(hotel):
    unique_list = pd.DataFrame([[i,len(hotel[i].unique())] for i in hotel.columns])
    unique_list.columns = ['name','uniques']

    total_var = set(hotel.columns)
    cat_var = set(unique_list.name[(unique_list.uniques<=12)      |
                                   (unique_list.name=='country')  |
                                   (unique_list.name=='agent')
                                  ])
    con_var = total_var - cat_var

    return cat_var, con_var


cat_var, con_var = var(dataset)

print("Continuous Variables (",len(con_var),")\n",con_var,'\n\n'
      "Categorical Variables(",len(cat_var),")\n",cat_var)


4. Converting columns to appropriate naming conventions.

In [None]:
dataset.columns = ['Hotel', 'Canceled', 'LeadTime', 'ArrivingYear', 'ArrivingMonth', 'ArrivingWeek','ArrivingDate', 'WeekendStay',
              'WeekStay', 'Adults', 'Children', 'Babies', 'Meal','Country', 'Segment', 'DistChannel','RepeatGuest', 'PrevCancel',
              'PrevBook', 'BookRoomType','AssignRoomType', 'ChangeBooking', 'DepositType', 'WaitingDays',
              'CustomerType', 'ADR','ParkSpace', 'SpecialRequest','Reservation', 'ReservationDate']

### What all manipulations have you done and insights you found?

Duplicate Rows: The dataset contained 31,994 duplicate rows across its 32 columns. These were removed to preserve data integrity and avoid biased analysis.

Missing Values: The column "agent" has 16,340 missing values and the column "company" has 112,593 missing values. Because of the high number of missing values, both columns were dropped, and the remaining rows with null values were then removed.

Variable Types: After cleaning, the dataset consists of 12 continuous variables and 18 categorical variables. The two types carry different kinds of information and are treated differently during the analysis.

Column Name Renaming: The columns were renamed to short, descriptive names that reflect the information they hold. This makes the dataset easier to read and the results easier to interpret.
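The cleaning steps described above can be sketched as a single function. This is an illustrative version on made-up data, not the notebook's own code: the helper `clean` and its `max_missing_share` threshold are assumptions (the notebook drops "agent" and "company" by name rather than by a threshold).

```python
import numpy as np
import pandas as pd

def clean(df, max_missing_share=0.10):
    """Drop duplicate rows, then sparse columns, then rows with any remaining NaN."""
    out = df.drop_duplicates()
    # Share of missing values per column, between 0 and 1
    missing_share = out.isnull().mean()
    sparse_cols = missing_share[missing_share > max_missing_share].index.tolist()
    out = out.drop(columns=sparse_cols).dropna(axis=0)
    return out, sparse_cols

# Hypothetical toy table: one exact duplicate, one mostly-empty column
toy = pd.DataFrame({
    "hotel": ["City", "City", "Resort", "Resort", "City"],
    "country": ["PRT", "PRT", "GBR", None, "FRA"],
    "company": [np.nan, np.nan, np.nan, 40.0, np.nan],
    "adr": [80.0, 80.0, 120.0, 95.0, 60.0],
})

cleaned, dropped = clean(toy, max_missing_share=0.5)
print(dropped)
print(cleaned)
```

Doing the steps in this order matters: dropping sparse columns before `dropna` keeps rows that would otherwise be discarded only because of a mostly-empty column.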

## ***4. Data Preparation***

In [None]:
#Lets combine children and babies together as kids
dataset['Kids'] = dataset.Children + dataset.Babies

#Combine total numbers by adding kids and adults
dataset['total_members'] = dataset.Kids + dataset.Adults

In [None]:
#convert the datatypes to string
dataset['ArrivingYear'] = dataset['ArrivingYear'].astype('str')
dataset['ArrivingMonth'] = dataset['ArrivingMonth'].astype('str')
dataset['ArrivingDate'] = dataset['ArrivingDate'].astype('str')

dataset['Canceled'] = dataset['Canceled'].astype('str')
dataset['RepeatGuest'] = dataset['RepeatGuest'].astype('str')

In [None]:
# Build a day-month-year string (e.g. "1-July-2015") and convert it to datetime
dataset['Arrival Date'] = dataset['ArrivingDate'] + '-' + dataset['ArrivingMonth'] + '-' + dataset['ArrivingYear']
dataset['Arrival Date'] = pd.to_datetime(dataset['Arrival Date'], format='%d-%B-%Y', errors='coerce')

In [None]:
# Keep only the bookings that were not canceled; copy so that new columns can be added safely
confirmed_bookings = dataset[dataset.Canceled=='0'].copy()

In [None]:
# Replace the month name with the month number taken from 'Arrival Date',
# then count the confirmed bookings per month in calendar order
confirmed_bookings['ArrivingMonth'] = confirmed_bookings['Arrival Date'].dt.month
final=confirmed_bookings['ArrivingMonth'].value_counts().sort_index()
final

## ***5. Exploratory Data Analysis***

In [None]:
print('Total Bookings canceled')
print('-'*50)
print(dataset.Canceled.value_counts())
print('-'*50)
print('*'*75)
print('Cancelation percentage in both hotels ')
print('-'*50)
print(dataset.Canceled.value_counts(normalize=True))


23987 bookings were canceled, which is around 27% of all bookings.

In [None]:
dataset.Country.value_counts(normalize=True)

Around 31% of all bookings were made from Portugal, followed by Great Britain (12%) and France (10%).


In [None]:
dataset.ArrivingMonth.value_counts(normalize=True)

August is the busiest month with 12.91% of bookings, and January is the least busy month with 5.33% of bookings.

In [None]:
dataset.Segment.value_counts(normalize=True)

Around 59% of bookings are made via Online Travel Agents, almost 15% of bookings are made via Offline Travel Agents and less than 13% are Direct bookings without any other agents.

In [None]:
dataset.ArrivingYear.value_counts(normalize=True)

About 48% of bookings were for arrivals in 2016, 36% in 2017 and 15% in 2015. 2016 has the most bookings, so these shares do not show a steady year-on-year increase.

In [None]:
dataset.Meal.value_counts(normalize=True)

Among the meal packages, BB (Bed & Breakfast) is the most ordered at around 77.7%, followed by SC (no meal package), HB (Half Board), Undefined and FB (Full Board).

In [None]:
dataset.CustomerType.value_counts(normalize=True)

Transient customers are the largest group, at around 82% of bookings.

In [None]:
dataset.Reservation.value_counts(normalize=True)

We can see that 72% of bookings ended with the guest checking out and 26% of bookings were canceled.

## ***6. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Pie Chart [Percentage ratio between cancelled and confirmed bookings?][Univariate Analysis]

In [None]:
# Chart - 1 visualization code
cols = ['gold', 'lightcoral']
dataset['Canceled'].value_counts().plot.pie(autopct='%1.1f%%',shadow=True, colors=cols)

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. It makes a percentage comparison easy to read, because each share is shown as a colored slice of a circle. I used a pie chart here to compare the percentages of the two values of the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

Confirmed bookings clearly outnumber canceled bookings: 72.4% of bookings were confirmed and 27.6% were canceled.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The pie chart shows that 72.4% of bookings were confirmed and 27.6% were canceled. The chart does not show why guests cancel, but a cancellation rate above one in four means lost revenue and unsold rooms, which is a source of negative growth. Reducing cancellations is therefore a clear opportunity for positive business impact.

#### Chart - 2 - Boxplot [which month results in high revenue?][Multivariate Analysis]

In [None]:
# Chart - 2 visualization code
reindex = ['January', 'February','March','April','May','June','July','August','September','October','November','December']
dataset['ArrivingMonth'] = pd.Categorical(dataset['ArrivingMonth'],categories=reindex,ordered=True)
plt.figure(figsize = (12,15))
sns.boxplot(x = dataset['ArrivingMonth'],y = dataset['ADR'])
plt.show()

##### 1. Why did you pick the specific chart?

I used a box plot to visualize the distribution of the average daily rate (ADR) across the months of the year.

##### 2. What is/are the insight(s) found from the chart?

Average ADR rises from the beginning of the year, peaks in August, and then falls toward the end of the year. However, some bookings at the end of the year still achieve a high ADR.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The box plot shows ADR increasing from January to August, which indicates positive growth in the rate earned per room night. From August to November ADR dips, which indicates negative growth, and in December it increases again. ADR is a rate per night rather than total revenue, so these movements describe pricing rather than overall income.
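The month-by-month reading of the box plot can be checked numerically. A minimal sketch, assuming the renamed columns `ArrivingMonth` and `ADR`; the values in `toy` are made up for illustration.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

# Hypothetical toy data (not the real bookings)
toy = pd.DataFrame({
    "ArrivingMonth": ["August", "January", "August", "January", "May", "May"],
    "ADR": [150.0, 50.0, 170.0, 70.0, 100.0, 90.0],
})

# An ordered categorical makes groupby return months in calendar order
toy["ArrivingMonth"] = pd.Categorical(toy["ArrivingMonth"],
                                      categories=month_order, ordered=True)

# observed=True keeps only the months that actually occur in the data
monthly_median = toy.groupby("ArrivingMonth", observed=True)["ADR"].median()
peak_month = monthly_median.idxmax()

print(monthly_median)
print("Month with the highest median ADR:", peak_month)
```

The median is used here because it corresponds to the center line of each box in the plot and is less sensitive to outliers than the mean.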

#### Chart - 3 Countplot [To know the most preferred type of meal by customers][Univariate Analysis]

In [None]:
# Chart - 3 visualization code
plt.figure( figsize = (6, 7))
sns.countplot(x = dataset['Meal'])
plt.show()

##### 1. Why did you pick the specific chart?

A count plot displays the number of observations in each category of a categorical variable. It resembles a histogram over a categorical variable rather than a quantitative one.

##### 2. What is/are the insight(s) found from the chart?

Customers prefer BB (Bed & Breakfast). The hotel could add new dishes to the breakfast menu and see which are preferred, and keep drinks such as tea, coffee and juice on the menu because most guests consume these in the morning.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

As most customers prefer Bed & Breakfast (BB), the hotel management can introduce offers on the Full Board (FB) meal package, which could increase revenue as well.

#### Chart - 4 Barplot [Which hotel has higher lead time?][Bivariate Analysis]

In [None]:
# Chart - 4 visualization code
# Mean lead time of both hotels
plt.figure(figsize=(20,8))
sns.barplot(x = "Hotel", y = "LeadTime", data = dataset, hue = "Hotel")
plt.title('Lead time by hotel type')
plt.tight_layout()

##### 1. Why did you pick the specific chart?

I picked a bar plot to compare the average lead time of the resort hotel and the city hotel.

##### 2. What is/are the insight(s) found from the chart?

Resort hotel has higher lead time over the city hotel

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The chart compares the average lead time of the two hotel types. A longer lead time gives a hotel more notice for planning staffing and pricing, which is positive, but it also leaves more time in which a booking can be canceled. The chart alone does not explain why lead times differ, so any link to prices or offers would need further analysis.

#### Chart - 5 Lineplot [Looking into prices per month per hotel][Multivariate Analysis]

In [None]:
# Chart - 5 visualization code
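# Resizing plot
plt.figure(figsize=(12,5))

# Average daily rate per person; bookings with zero guests give NaN instead of inf
guests = dataset['Adults'] + dataset['Children']
dataset['adr_pp'] = dataset['ADR'] / guests.replace(0, np.nan)
# Keep only bookings that were not canceled; copy so a new column can be added safely
actual_guests = dataset.loc[dataset["Canceled"] == '0'].copy()
# Total price of a stay = ADR * number of nights
actual_guests['price'] = actual_guests['ADR'] * (actual_guests['WeekendStay'] + actual_guests['WeekStay'])
sns.lineplot(data = actual_guests, x = 'ArrivingMonth', y = 'price', hue = 'Hotel')
plt.show()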

##### 1. Why did you pick the specific chart?

A line plot, or line graph, displays data points connected by straight lines. It is commonly used to visualize the trend or change in data over a continuous period or interval.

##### 2. What is/are the insight(s) found from the chart?

Prices at the resort hotel are much higher, while prices at the city hotel do not fluctuate as much. Here, price is the total for the stay (ADR multiplied by the number of nights).
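The quantity plotted above is the total price of a stay, that is, ADR multiplied by the number of nights. As a small worked sketch on made-up numbers (the `toy` table and the `nights` column are hypothetical), the per-month, per-hotel averages behind such a line plot can be tabulated like this:

```python
import pandas as pd

# Hypothetical toy bookings (not the real data)
toy = pd.DataFrame({
    "Hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "ArrivingMonth": ["July", "August", "July", "August"],
    "ADR": [100.0, 110.0, 150.0, 200.0],
    "WeekendStay": [1, 0, 2, 2],
    "WeekStay": [2, 3, 3, 5],
})

# Nights stayed and total price of the stay
toy["nights"] = toy["WeekendStay"] + toy["WeekStay"]
toy["price"] = toy["ADR"] * toy["nights"]

# Mean stay price per month (rows) and hotel (columns)
mean_price = toy.pivot_table(index="ArrivingMonth", columns="Hotel",
                             values="price", aggfunc="mean")
print(mean_price)
```

Because `ArrivingMonth` holds plain strings in this sketch, the rows come out in alphabetical order; converting the column to an ordered categorical first would give calendar order.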

#### Chart - 6 Heatmap [Let’s plot the heatmap and see the correlation][Multivariate Analysis]

In [None]:
# Correlation Heatmap visualization code
# Correlation between the numeric columns only (string columns are skipped)
plt.figure(figsize=(12,8))
sns.heatmap(dataset.corr(numeric_only=True),annot=True,cmap='RdYlGn')

##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlation between all the numeric variables along with the correlation coefficients, I used a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

The heatmap is a graphical representation of the correlation matrix and provides a quick way to visualize the linear relationship between the variables in the data.
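To turn the heatmap into concrete findings, the strongest pairwise correlations can be listed directly from the correlation matrix. The sketch below runs on made-up data; the `toy` table and the `pairs` variable are hypothetical, and only the column names are borrowed from the renamed dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real bookings)
toy = pd.DataFrame({
    "LeadTime": [1, 2, 3, 4, 5],
    "ADR": [2, 4, 6, 8, 10],           # rises exactly in step with LeadTime
    "WaitingDays": [5, 3, 4, 1, 2],
    "Hotel": ["City", "City", "Resort", "Resort", "City"],  # non-numeric, skipped
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (corr.where(mask)
             .stack()
             .dropna()
             .sort_values(key=abs, ascending=False))  # strongest first, either sign
print(pairs)
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which is usually what matters when screening variables.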

#### Chart - 7 Choropleth map [Most visited country][Multivariate Analysis]

In [None]:
# Chart - 7 visualization code
import plotly.express as px

# Count confirmed (non-canceled) bookings per home country
country_visitors = dataset[dataset['Canceled'] == '0'].groupby(['Country']).size().reset_index(name = 'count')

# The 'Country' column holds the country codes used as map locations
px.choropleth(country_visitors,
                    locations = "Country",
                    color= "count" ,
                    hover_name= "Country", # column to add to hover information
                    color_continuous_scale="Viridis",
                    title="Home country of visitors")

##### 1. Why did you pick the specific chart?

A choropleth map is a type of statistical thematic map that uses pseudocolor, meaning color corresponding with an aggregate summary of a geographic characteristic within spatial enumeration units, such as population density or per-capita income.
Choropleth maps provide an easy way to visualize how a variable varies across a geographic area or to show the level of variability within a region. A heat map or isarithmic map is similar but uses regions drawn according to the pattern of the variable, rather than the a priori geographic areas of choropleth maps.
So, to check which countries the visitors come from, I used this chart.

##### 2. What is/are the insight(s) found from the chart?

Most visitors are from Western Europe, namely France, the UK and Portugal, with Portugal being the highest.

#### Chart - 8 Countplot [visualization of most preferred room type][Univariate Analysis]

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(14,7))
sns.countplot(x=dataset['AssignRoomType'],order=dataset['AssignRoomType'].value_counts().index)
plt.title("Most preferred Room type", fontsize = 20)
plt.xlabel('Type of the Room', fontsize = 15)
plt.ylabel('Room type count', fontsize = 15)

##### 1. Why did you pick the specific chart?

I chose a count plot to visualize the most preferred room type because it displays the number of observations in each category, and here we need to show each room type against its count.

##### 2. What is/are the insight(s) found from the chart?

Type A rooms are the most preferred, with a count of 46283, followed by type D rooms with a count of 22419.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels can offer the facilities provided in room type A in other room types as well.

#### Chart - 9 - Countplot[year wise booking using countplot chart][Bivariate Analysis]

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(14,7))
sns.countplot(x=dataset['ArrivingYear'],hue=dataset['Hotel'])
plt.title('Year wise Bookings', fontsize = 20)
plt.xlabel('Arrival_date_year', fontsize = 15)
plt.ylabel('Count of bookings', fontsize = 15)

##### 1. What is/are the insight(s) found from the chart?

2016 had highest bookings and 2015 had lowest bookings

#### Chart - 10 - Pie Chart [Type of hotel most preferred][Univariate Analysis]

In [None]:
Hotel_typ =dataset['Hotel'].value_counts()
Hotel_typ

In [None]:
# Pie chart of the share of bookings per hotel type
plt.subplot(2,2,1)
Hotel_typ.plot.pie(autopct='%1.0f%%',textprops={'weight': 'bold'},figsize =(12,12),explode =[0.05]*2)
plt.title('Hotel type',fontweight="bold", size=20)

##### 1. What is/are the insight(s) found from the chart?

City hotels are preferred over resort hotels.

## **7. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain briefly.

From the analysis, several key factors can be identified to achieve the business objectives of increasing hotel revenue, generating customer satisfaction, and enhancing facilities. These factors include:

* Most Preferred Hotel: By identifying the most preferred hotel, the client can focus on promoting and enhancing the services and amenities offered in that specific hotel to attract more guests.
* Percentage of Repeated Guests: Understanding the percentage of repeated guests helps the client prioritize strategies to improve customer loyalty and retention, as repeat guests often contribute significantly to revenue generation.
* Preferred Food Choices: Knowing the food preferences of guests allows the client to offer the most popular food options, ensuring customer satisfaction and enhancing the dining experience.
* Preferred Months: Identifying the preferred months for bookings helps the client allocate resources and staff accordingly, ensuring a seamless and enjoyable stay for guests during peak periods.
* ADR Analysis: Analyzing the Average Daily Rate (ADR) across different hotel types helps the client identify which hotels generate higher income, allowing them to optimize pricing strategies and maximize revenue.
* Busiest Hotel: By determining the busiest hotel, the client can focus on improving facilities and services in less busy hotels to attract more guests and increase occupancy rates.
* ADR and Total Number of People: Analyzing the relationship between ADR and the total number of people provides insights into revenue generation potential based on group bookings or larger party sizes, allowing the client to cater to the needs of different customer segments.


**Strategies to Counter High Cancellations at the Hotel**

* Set Non-refundable Rates, Collect deposits, and implement more rigid cancellation policies.
* Encourage Direct bookings by offering special discounts
* Monitor where the cancellations are coming from such as Market Segment, distribution channels, etc.
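The last strategy, monitoring where cancellations come from, can be sketched as a grouped cancellation rate. The example uses made-up rows; `Segment` and `Canceled` follow the renamed columns of this notebook (with `Canceled` stored as the strings '0' and '1'), while `is_canceled` and `by_segment` are hypothetical helper names.

```python
import pandas as pd

# Hypothetical toy bookings (not the real data)
toy = pd.DataFrame({
    "Segment": ["Online TA", "Online TA", "Online TA", "Online TA",
                "Direct", "Direct", "Groups", "Groups"],
    "Canceled": ["1", "0", "1", "0", "0", "0", "1", "1"],
})

# Convert the string flag to 0/1 so that its mean is the cancellation rate
toy["is_canceled"] = (toy["Canceled"] == "1").astype(int)

by_segment = (toy.groupby("Segment")["is_canceled"]
                 .agg(bookings="size", cancel_rate="mean")
                 .sort_values("cancel_rate", ascending=False))
print(by_segment)
```

Reporting the number of bookings next to the rate helps avoid over-reading a high rate that comes from a segment with very few bookings. The same pattern applies to `DistChannel` or `DepositType`.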


# **Conclusion**

* The majority of bookings are for the city hotel, so the largest share of the marketing budget should target that hotel.
* The high rate of cancellations may be linked to the prevalence of no-deposit bookings.
* We should also target the months from May to August, which are peak months because of the summer period.
* The majority of guests are from Western Europe, so a significant share of the budget should be spent on that region.
* Given that we have few repeated guests, we should target advertising at past guests to increase the number of returning guests.
* Bed and Breakfast (BB) is the most preferred meal package, indicating the potential to introduce offers and promotions for other meal packages like Full Board (FB) to increase revenue.
* The distribution channel TA/TO (Travel Agents/Tour Operators) accounts for 80% of bookings, highlighting the importance of collaborating with these channels for marketing and promotions.


