<a href="https://colab.research.google.com/github/VIK98110/Almabetter_capstone_project/blob/main/Project_on_Hotel_Bookings_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name** - Vikas Kumar

# **Project Summary -**

**Exploratory Data Analysis for Enhanced Competitiveness and Profitability in the Hotel Industry**

This Python EDA project analyzes a hotel booking dataset to extract insights that help the company make informed decisions, strengthen its competitiveness, and improve profitability. Through in-depth exploratory data analysis, the project aims to uncover meaningful patterns, trends, and customer preferences. The raw dataset contains 119390 rows and 32 columns. This summary gives an overview of the project's objectives, methodology, and expected outcomes.

Methodology:

1. Data Acquisition: The project begins by acquiring hotel booking datasets that include information such as booking dates, guest demographics, room types, booking channels, and customer reviews. These datasets can be obtained from public sources, industry databases, or collaborations with hotels.

2. Data Cleaning and Preparation: Once the datasets are obtained, thorough data cleaning and preprocessing tasks are performed. This includes handling missing values, removing duplicates, standardizing data formats, and resolving inconsistencies to ensure data quality and reliability.

3. Descriptive Statistics: Compute descriptive statistics to gain an overall understanding of the dataset. Analyze key metrics such as booking counts, average daily rates, booking lead time, and length of stay. Identify any outliers or anomalies that may impact the analysis.

4. Booking Patterns and Seasonality: Analyze booking patterns over time to identify seasonality trends, peak booking periods, and variations in demand. This information can guide pricing strategies, inventory management, and marketing campaigns.

5. Customer Segmentation: Segment customers based on demographics, booking behavior, or other relevant factors. Identify different customer segments and their preferences, enabling targeted marketing efforts and personalized experiences to enhance guest satisfaction and loyalty.
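
The seasonality step above can be sketched as follows. This is a minimal illustration on a made-up toy table, not the project data; only the column name `arrival_date_month` is taken from the dataset. Treating the month as an ordered categorical is one possible way to keep the counts in calendar order rather than alphabetical order.

```python
import pandas as pd

# Toy stand-in for the bookings table (values are invented for illustration).
toy = pd.DataFrame({
    "arrival_date_month": ["July", "August", "July", "January", "August", "July"],
})

month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# An ordered categorical makes groupby sort by calendar position.
toy["arrival_date_month"] = pd.Categorical(
    toy["arrival_date_month"], categories=month_order, ordered=True
)

# observed=True keeps only the months that actually occur in the data.
monthly = toy.groupby("arrival_date_month", observed=True).size()
print(monthly)
```

The same grouping could be extended with `hotel` as a second key to compare the seasonality of the two hotels.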

# **GitHub Link -**

https://github.com/VIK98110/Almabetter_capstone_project.git

# **Problem Statement**



Uncovering Insights and Patterns in Datasets




#### **Define Your Business Objective?**

The primary objective of this EDA project is to utilize Python programming and data analysis techniques to analyze hotel booking datasets. By uncovering insights into booking patterns, customer preferences, and market dynamics, the project aims to provide actionable recommendations for companies to improve competitiveness, optimize revenue, and enhance overall business performance.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import folium
import plotly.express as px
%matplotlib inline

### Dataset Loading

In [None]:
# Mounting drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
hotel_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Python Datasets/Hotel Bookings.csv')

### Dataset First View

In [None]:
# Dataset First Look
hotel_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
hotel_df.shape

### Dataset Information

In [None]:
# Dataset Info
hotel_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
hotel_df[hotel_df.duplicated()].shape

In [None]:
# Dropping duplicate values
hotel_df.drop_duplicates(inplace = True)


In [None]:
# Finding out rows and columns after deletion of duplicate values.
hotel_df.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
hotel_df.isnull().sum()

We can see that there are missing values in the agent, company, country, and children columns.
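
To judge how serious each gap is, it can help to look at the share of missing values as well as the raw count. The sketch below uses a small invented table; the column names match the dataset, but the numbers do not come from it.

```python
import numpy as np
import pandas as pd

# Toy table with gaps (invented values, real column names).
toy = pd.DataFrame({
    "agent": [9.0, np.nan, 240.0, np.nan],
    "company": [np.nan, np.nan, np.nan, 40.0],
    "children": [0.0, 1.0, np.nan, 0.0],
    "adults": [2, 2, 1, 2],
})

# Count and percentage of nulls per column.
missing = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": (toy.isnull().mean() * 100).round(2),
})

# Keep only the columns that have gaps, worst first.
missing = missing[missing["missing_count"] > 0].sort_values(
    "missing_pct", ascending=False
)
print(missing)
```

A column that is mostly empty, as `company` is in this toy table, is usually a candidate for a constant fill or for being dropped, while a column with only a few gaps can be imputed.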

In [None]:
# Handling the missing values
hotel_df[['company','agent']] = hotel_df[['company','agent']].fillna(0) # Replacing null values with 0.
hotel_df['children'] = hotel_df['children'].fillna(hotel_df['children'].mean()) # Replacing null values with the mean of the children column.

### What did you know about your dataset?

The dataset records hotel bookings by arrival year and month. It also describes the different categories of guests who stayed at the hotels and the channel through which they booked.

The raw dataset has 119390 rows and 32 columns, and it has missing values in the children, agent, company, and country columns.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotel_df.columns

In [None]:
# Dataset Describe
hotel_df.describe()

### Variables Description

The description gives the count, mean, standard deviation, minimum, 1st quartile, 2nd quartile (median), 3rd quartile, and maximum of each numerical column.
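
As a small illustration of how the `describe()` output can be used, the sketch below pulls individual statistics out of the summary table and derives the interquartile range, one common way to get a first idea of outliers. The toy values are invented; `adr` and `lead_time` are column names from the dataset.

```python
import pandas as pd

# Toy numeric columns (invented values).
toy = pd.DataFrame({
    "adr": [50.0, 100.0, 150.0, 200.0],
    "lead_time": [0, 10, 20, 30],
})

summary = toy.describe()

# Rows of the summary are the statistics, columns are the variables.
print(summary.loc[["min", "50%", "max"]])

# Interquartile range of adr: 3rd quartile minus 1st quartile.
adr_iqr = summary.loc["75%", "adr"] - summary.loc["25%", "adr"]
print("IQR of adr:", adr_iqr)
```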

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Finding unique value of hotel
hotel_df['hotel'].unique()

In [None]:
# Finding unique value of is_canceled
hotel_df['is_canceled'].unique()

In [None]:
# Finding unique value of arrival_date_year
hotel_df['arrival_date_year'].unique()

In [None]:
# Finding unique value of meal
hotel_df['meal'].unique()

In [None]:
# Finding unique value of market_segment
hotel_df['market_segment'].unique()

In [None]:
# Finding unique value of distribution_channel
hotel_df['distribution_channel'].unique()

In [None]:
# Finding unique value of children
hotel_df['children'].unique()
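
The per-column checks above could also be written as a single loop. The following is a minimal sketch on an invented toy table that uses a few of the dataset's column names.

```python
import pandas as pd

# Toy table (invented values, real column names).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "City Hotel"],
    "is_canceled": [0, 1, 0, 0],
    "meal": ["BB", "HB", "BB", "SC"],
})

# Number of distinct values per column.
unique_counts = toy.nunique()

# Print the count and up to ten example values for each column.
for column in toy.columns:
    examples = toy[column].unique()[:10].tolist()
    print(f"{column}: {unique_counts[column]} unique -> {examples}")
```

Limiting the printout to a few examples keeps the output readable for columns with many distinct values, such as `country`.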

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Creating a copy of dataframe

df1 = hotel_df.copy()

In [None]:
# Replacing the company and agent null values to 0
df1[['company','agent']] = df1[['company','agent']].fillna(0)

In [None]:
# Replacing the null values in children with the column mean
df1['children'] = df1['children'].fillna(df1['children'].mean())

In [None]:
# Replacing the null values in country with 'others'
df1['country'] = df1['country'].fillna('others')

In [None]:
# Checking if all null values are removed
df1.isnull().sum()

In [None]:
# Number of bookings with no guests at all (adults + babies + children == 0)
df1[(df1['adults'] + df1['babies'] + df1['children']) == 0].shape

In [None]:
# Dropping the bookings with no guests (adults + babies + children == 0)
df1.drop(df1[(df1['adults'] + df1['babies'] + df1['children']) == 0].index, inplace = True)

**Converting columns to appropriate datatypes.**

In [None]:
# Converting datatype of columns 'children', 'company' and 'agent' from float to int.
df1[['children','company','agent']]= df1[['children', 'company', 'agent']].astype('int')


In [None]:
# Changing the datatype of column 'reservation_status_date' to datetime.
df1['reservation_status_date'] = pd.to_datetime(df1['reservation_status_date'], format = '%Y-%m-%d')

**Adding important columns.**

In [None]:
# Adding total staying days in hotels
df1['total_stay'] = df1['stays_in_weekend_nights']+df1['stays_in_week_nights']

# Adding total people num as column, i.e. total people num = num of adults + children + babies
df1['total_people'] = df1['adults']+df1['children']+df1['babies']

In [None]:
# Using groupby by arrival_date_year and arrival date month
df1.groupby(['arrival_date_year','arrival_date_month']).count()

In the cells above we applied the following data wrangling steps to prepare the data for analysis.

We filled the null values in company and agent with 0.

We filled the null values in country with 'others'.

We filled the null values in children with the column mean.

We found the bookings where adults, babies, and children add up to 0 and dropped them.

We converted children, company, and agent to an appropriate data type (int), and reservation_status_date to datetime.

We added the total_stay and total_people columns to support a more comprehensive analysis.

We grouped the data by arrival year and month to count the bookings in each period.
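
The wrangling steps could also be bundled into one function with a basic check for missing columns, in line with the exception-handling guideline. The sketch below is one possible arrangement, not the notebook's actual code: `prepare_bookings` is a hypothetical helper name, and the toy rows are invented.

```python
import numpy as np
import pandas as pd


def prepare_bookings(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a cleaned copy of the bookings table."""
    required = {
        "company", "agent", "children", "country", "adults", "babies",
        "stays_in_weekend_nights", "stays_in_week_nights",
    }
    missing_cols = required - set(raw.columns)
    if missing_cols:
        raise KeyError(f"missing expected columns: {sorted(missing_cols)}")

    df = raw.drop_duplicates().copy()
    df[["company", "agent"]] = df[["company", "agent"]].fillna(0)
    df["children"] = df["children"].fillna(df["children"].mean())
    df["country"] = df["country"].fillna("others")

    # Drop bookings that have no guests at all.
    guests = df["adults"] + df["children"] + df["babies"]
    df = df[guests > 0].copy()

    df[["children", "company", "agent"]] = df[
        ["children", "company", "agent"]
    ].astype(int)
    df["total_stay"] = df["stays_in_weekend_nights"] + df["stays_in_week_nights"]
    df["total_people"] = df["adults"] + df["children"] + df["babies"]
    return df


# Toy input: row 1 duplicates row 0, and row 3 has no guests.
toy = pd.DataFrame({
    "company": [np.nan, np.nan, 40.0, np.nan],
    "agent": [9.0, 9.0, np.nan, np.nan],
    "children": [0.0, 0.0, np.nan, 0.0],
    "country": ["PRT", "PRT", np.nan, "GBR"],
    "adults": [2, 2, 1, 0],
    "babies": [0, 0, 0, 0],
    "stays_in_weekend_nights": [1, 1, 0, 2],
    "stays_in_week_nights": [2, 2, 3, 0],
})

clean = prepare_bookings(toy)
print(clean[["country", "agent", "total_stay", "total_people"]])
```

Returning a new frame instead of modifying the input in place makes it easier to rerun the whole notebook from top to bottom without side effects.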

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(8,5))
sns.countplot (x= 'arrival_date_year', data= df1, hue= 'hotel')
plt.title('Yearly Bookings')
plt.xlabel('Booking Year')
plt.ylabel('No. of Bookings')
plt.show()

##### 1. Why did you pick the specific chart?

To compare the number of bookings per year for each hotel.

##### 2. What is/are the insight(s) found from the chart?

We can clearly see that the City Hotel has more bookings than the Resort Hotel in each year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This is good for the City Hotel, but we need to work out how to improve bookings for the Resort Hotel.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
ax = sns.countplot(x = "market_segment", data = df1)
plt.xticks(rotation = 90)
plt.title("Segments wise booking")
plt.xlabel('Market Segment')
plt.ylabel('No. of Bookings')
plt.show()

##### 1. Why did you pick the specific chart?

We picked this chart to identify which market segment is the most popular for bookings.

##### 2. What is/are the insight(s) found from the chart?

The Online TA segment is the most popular choice for bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We should improve the Online TA booking experience to make hotel bookings more seamless.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
y = list(df1.country.value_counts().head(10))
x = list(df1.country.value_counts().head(10).index)

sns.barplot(y=y, x=x)
plt.title('Guest List by Their Countries')
plt.xlabel('Country')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

We picked this chart to check which countries most guests come from.

##### 2. What is/are the insight(s) found from the chart?

Most of the bookings come from Portugal (country code PRT).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We should look for ways to attract guests from the countries with fewer bookings.
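
To put a number on how dominant the top country is, the counts can be turned into shares. This is a minimal sketch on invented toy data; only the `country` column name comes from the dataset.

```python
import pandas as pd

# Toy country column (invented values).
toy = pd.DataFrame({"country": ["PRT", "PRT", "GBR", "FRA", "PRT"]})

# normalize=True returns fractions instead of counts.
country_share = (toy["country"].value_counts(normalize=True) * 100).round(1)
print(country_share.head(10))
```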

#### Chart - 4

In [None]:
# Chart - 4 visualization code
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Left panel: number of bookings per assigned room type
sns.countplot(ax = axes[0], x = df1['assigned_room_type'])
axes[0].set_title('Bookings per Room Type')
axes[0].set_xlabel('Room Types')
axes[0].set_ylabel('No. of Bookings')

# Right panel: distribution of ADR per assigned room type
sns.boxplot(ax = axes[1], x = df1['assigned_room_type'], y = df1['adr'])
axes[1].set_title('Rooms Booking Vs ADR')
axes[1].set_xlabel('Room Types')
axes[1].set_ylabel('ADR')
plt.show()

##### 1. Why did you pick the specific chart?

We picked this chart to see which room types are in highest demand and what ADR they earn, that is, whether the most demanded rooms also bring a higher ADR.



##### 2. What is/are the insight(s) found from the chart?

The most demanded room type is A, but room types H, G, and C have a higher ADR.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Hotels should increase the no. of room types A and H to maximise revenue.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure( figsize=(10, 8))

sns.countplot(x = df1['meal'])
plt.title('Most Preferred Meal')
plt.xlabel('Types of Meal')
plt.ylabel('Quantity')
plt.show()

##### 1. Why did you pick the specific chart?

We picked this chart to check which meal type customers prefer most.

##### 2. What is/are the insight(s) found from the chart?

Most preferred meal type is BB (Bed and breakfast).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We should try to change our menu or make the meals more appealing.



#### Chart - 6

In [None]:
# Chart - 6 visualization code
grouped_by_hotel = df1.groupby('hotel')
d1 = pd.DataFrame((grouped_by_hotel.size()/df1.shape[0])*100).reset_index().rename(columns = {0:'Booking %'})      #Calculating percentage
plt.figure(figsize = (8,5))
sns.barplot(x = d1['hotel'], y = d1['Booking %'] )
plt.title('Percentage By Hotel Booking')
plt.xlabel('Type of Hotel')
plt.ylabel('Percentage of booking')
plt.show()

##### 1. Why did you pick the specific chart?

To find the percentage of bookings for each hotel.

##### 2. What is/are the insight(s) found from the chart?

Around 60% bookings are for City hotel and 40% bookings are for Resort hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In terms of total bookings, the City Hotel has the higher share.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
d3 = grouped_by_hotel['adr'].mean().reset_index().rename(columns = {'adr':'avg_adr'})   # calculating average adr
plt.figure(figsize = (6,4))
sns.barplot(x = d3['hotel'], y = d3['avg_adr'] )
plt.title('City Hotel Revenue Vs Resort Hotel Revenue')
plt.xlabel('Type of Hotel')
plt.ylabel('Average ADR')
plt.show()

##### 1. Why did you pick the specific chart?

To check which hotel seems to make more revenue?

##### 2. What is/are the insight(s) found from the chart?

Avg adr of Resort hotel is slightly lower than that of City hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The City Hotel earns a slightly higher average rate per night, which suggests, but does not by itself prove, that it makes slightly more revenue.
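
ADR is a per-night rate, so it only hints at revenue. A rough proxy for revenue can be sketched by multiplying ADR by the length of stay for bookings that were not cancelled. The example below uses invented toy numbers, and the `est_revenue` column is an assumption introduced here for illustration, not part of the dataset.

```python
import pandas as pd

# Toy bookings (invented values, column names as used in the notebook).
toy = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [0, 1, 0, 0],
    "adr": [100.0, 120.0, 80.0, 90.0],
    "total_stay": [2, 3, 5, 1],
})

# Only bookings that actually took place generate revenue.
stayed = toy[toy["is_canceled"] == 0].copy()

# est_revenue is a hypothetical proxy: nightly rate times nights stayed.
stayed["est_revenue"] = stayed["adr"] * stayed["total_stay"]

revenue = stayed.groupby("hotel")["est_revenue"].sum()
print(revenue)
```

In this toy table the hotel with the lower ADR ends up with the higher estimated revenue because its guests stay longer, which is why the two measures are worth checking separately.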

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize = (10,5))
# hue = 'is_canceled' splits each segment into kept (0) and cancelled (1) bookings
sns.countplot(x='market_segment',data=df1 ,hue='is_canceled')
plt.title("Booking Cancelled or not by market segment")
plt.xlabel('Market Segment')
plt.ylabel('No. of Bookings')
plt.show()

##### 1. Why did you pick the specific chart?

To check which market segment has the most cancellations.

##### 2. What is/are the insight(s) found from the chart?

Most of the cancellations come from Online TA bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We should add an option to give a reason when cancelling a booking, so that we can act on the feedback.
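
A segment with many bookings will also tend to have many cancellations, so it can be useful to compare cancellation rates next to the raw counts. The sketch below shows one way to do this on invented toy data; the numbers are not results from the real dataset.

```python
import pandas as pd

# Toy bookings (invented values, real column names).
toy = pd.DataFrame({
    "market_segment": ["Online TA"] * 5 + ["Direct"] * 2 + ["Groups"] * 2,
    "is_canceled": [1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Bookings, cancellations, and cancellation rate per segment.
summary = toy.groupby("market_segment").agg(
    bookings=("is_canceled", "size"),
    cancellations=("is_canceled", "sum"),
    cancel_rate_pct=("is_canceled", "mean"),
)
summary["cancel_rate_pct"] = (summary["cancel_rate_pct"] * 100).round(1)
print(summary.sort_values("cancel_rate_pct", ascending=False))
```

In this toy example Online TA has the most cancellations in absolute terms, while Groups has the highest rate, so the two views can point to different segments.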

#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Selecting and counting number of cancelled bookings for each hotel.
cancelled_data = df1[df1['is_canceled'] == 1]
cancel_grp = cancelled_data.groupby('hotel')
D1 = pd.DataFrame(cancel_grp.size()).rename(columns = {0:'total_cancelled_bookings'})

# Counting total number of bookings for each type of hotel
grouped_by_hotel = df1.groupby('hotel')
total_booking = grouped_by_hotel.size()
D2 = pd.DataFrame(total_booking).rename(columns = {0: 'total_bookings'})
D3 = pd.concat([D1,D2], axis = 1)

# Calculating cancel percentage
D3['cancel_%'] = round((D3['total_cancelled_bookings']/D3['total_bookings'])*100,2)
D3

##### 1. Why did you pick the specific chart?

To check which hotel has the higher booking cancellation rate.

##### 2. What is/are the insight(s) found from the chart?

City Hotel has higher cancellation rate.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We should improve our service in the City Hotel to reduce cancellations.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
group_by_dc_hotel = df1.groupby(['distribution_channel', 'hotel'])
d5 = pd.DataFrame(round(group_by_dc_hotel['adr'].mean(),2)).reset_index().rename(columns = {'adr': 'avg_adr'})
plt.figure(figsize = (7,5))
sns.barplot(x = d5['distribution_channel'], y = d5['avg_adr'], hue = d5['hotel'])
plt.xlabel('Distribution Channel')
plt.ylabel('Avg ADR')
plt.ylim(40,140)
plt.show()

##### 1. Why did you pick the specific chart?

Which distribution channel brings better revenue generating deals for hotels?

##### 2. What is/are the insight(s) found from the chart?

The Direct, GDS, and TA/TO channels bring the more profitable customers.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We should improve our service in these channels so that we can earn more profit.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Selecting and counting repeated customers bookings
repeated_data = df1[df1['is_repeated_guest'] == 1]
repeat_grp = repeated_data.groupby('hotel')
D1 = pd.DataFrame(repeat_grp.size()).rename(columns = {0:'total_repeated_guests'})

# Counting total bookings
total_booking = grouped_by_hotel.size()
D2 = pd.DataFrame(total_booking).rename(columns = {0: 'total_bookings'})
D3 = pd.concat([D1,D2], axis = 1)

# Calculating repeat %
D3['repeat_%'] = round((D3['total_repeated_guests']/D3['total_bookings'])*100,2)

plt.figure(figsize = (10,5))
sns.barplot(x = D3.index, y = D3['repeat_%'])
plt.title('Repeat Customer Rate')
plt.xlabel('Type of Hotel')
plt.ylabel('Repeat rate (%)')
plt.show()

##### 1. Why did you pick the specific chart?

Which hotel has the higher chance that its customers will return for another stay?

##### 2. What is/are the insight(s) found from the chart?

Both hotels have a very small percentage of repeat customers, but the Resort Hotel has a slightly higher repeat % than the City Hotel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We should work on our services to improve the customer repeat rate.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
group_by_dc = df1.groupby('distribution_channel')
d1 = pd.DataFrame(round((group_by_dc.size()/df1.shape[0])*100,2)).reset_index().rename(columns = {0: 'Booking_%'})
plt.figure(figsize = (8,8))
data = d1['Booking_%']
labels = d1['distribution_channel']
plt.pie(x=data, autopct="%.2f%%", explode=[0.05]*len(data), labels=labels, pctdistance=0.5)
plt.title("Booking % by distribution channels", fontsize=14);

##### 1. Why did you pick the specific chart?

Which is the most common channel for booking hotels?

##### 2. What is/are the insight(s) found from the chart?

Most bookings are made through the TA/TO channel.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We need to make booking through this channel more seamless.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
group_by_dc = df1.groupby('distribution_channel')
d2 = pd.DataFrame(round(group_by_dc['lead_time'].median(),2)).reset_index().rename(columns = {'lead_time': 'median_lead_time'})
plt.figure(figsize = (7,5))
sns.barplot(x = d2['distribution_channel'], y = d2['median_lead_time'])
plt.title('Early Booking')
plt.xlabel('Types of Distribution Channel')
plt.ylabel('Median Lead Time')
plt.show()

##### 1. Why did you pick the specific chart?

Which channel is mostly used for early booking of hotels?

##### 2. What is/are the insight(s) found from the chart?

The TA/TO channel has the highest median lead time, so it is the channel most used for early bookings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We should work on the other channels too.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization
num_df1 = df1[['lead_time','previous_cancellations','previous_bookings_not_canceled','booking_changes','days_in_waiting_list','adr','required_car_parking_spaces','total_of_special_requests','total_stay','total_people']]

#correlation matrix
corrmat = num_df1.corr()
f, ax = plt.subplots(figsize=(12, 6))
sns.heatmap(corrmat,annot = True,fmt='.2f', annot_kws={'size': 10},  vmax=.8, square=True);

##### 1. Why did you pick the specific chart?

We picked this chart to find the correlation between the numerical columns.


Columns such as 'is_canceled', 'arrival_date_year', 'arrival_date_week_number', 'arrival_date_day_of_month', 'is_repeated_guest', 'company', and 'agent' hold categorical data stored as numbers, so we do not need to check them for correlation.


Also, since we have added the total_stay and total_people columns, we can leave out the adults, children, babies, stays_in_weekend_nights, and stays_in_week_nights columns.

##### 2. What is/are the insight(s) found from the chart?

1) Total stay length and lead time are slightly correlated. This may mean that people generally plan longer hotel stays somewhat further ahead of the actual arrival.

2) adr is slightly correlated with total_people, which makes sense because more people mean more revenue per booking and therefore a higher adr.

Let's see whether the length of stay affects the adr.

In [None]:
plt.figure(figsize = (12,6))
sns.scatterplot(y = 'adr', x = 'total_stay', data = df1)
plt.show()

We notice that there is an outlier in adr, so we will remove it to get a clearer scatter plot.

In [None]:
df1.drop(df1[df1['adr'] > 5000].index, inplace = True)

In [None]:
plt.figure(figsize = (12,6))
sns.scatterplot(y = 'adr', x = 'total_stay', data = df1)
plt.show()

From the scatter plot we can see that as total_stay increases, the adr decreases. This means that for longer stays, a better deal can be finalised for the customer.
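
A scatter plot with many overlapping points can be hard to read, so the trend can also be checked numerically by grouping stays into length buckets and comparing the median adr of each bucket. This is a minimal sketch on invented toy values; the bucket edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Toy bookings (invented values, column names as used in the notebook).
toy = pd.DataFrame({
    "total_stay": [1, 2, 3, 5, 8, 10, 15],
    "adr": [120.0, 110.0, 100.0, 95.0, 80.0, 70.0, 60.0],
})

# Hypothetical buckets: (0, 3], (3, 7], and more than 7 nights.
toy["stay_bucket"] = pd.cut(
    toy["total_stay"],
    bins=[0, 3, 7, float("inf")],
    labels=["1-3 nights", "4-7 nights", "8+ nights"],
)

# Median adr per bucket; observed=True skips empty buckets.
adr_by_length = toy.groupby("stay_bucket", observed=True)["adr"].median()
print(adr_by_length)
```

The median is used instead of the mean so that a few extreme adr values do not dominate a bucket.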

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df1[['adults', 'children', 'babies', 'total_people']])
plt.show()

##### 1. Why did you pick the specific chart?

To show how adults, children, and babies relate to each other and to the total number of people.

##### 2. What is/are the insight(s) found from the chart?

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

We need to improve our services because we do not have many repeat customers, and we need to look for more seamless ways to book a hotel.

# **Conclusion**

The data analysis suggests that the business is performing well overall, but some areas need attention to improve profitability. In particular, the share of repeat customers is low, and raising it would help the business grow and stay ahead of the competition in the market. This Python EDA project focuses on analyzing hotel booking datasets to uncover valuable insights that drive competitiveness and profitability in the hotel industry.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***