# **Project Name**    - Hotel Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

The hotel booking dataset is a real-world record of hotel bookings for both a city hotel and a resort hotel. The project was conducted individually, and after analysing the dataset, it was discovered that it contained 119,390 rows and 32 columns. The data included information about hotel bookings, cancellations, arrival month, arrival year, guest details, and length of stay, among other things, from 2015 to 2017. The main objective of the project was to identify valuable insights from the data to help increase the hotels' revenue.

The dataset had numerous null and duplicate values, which required data cleaning. To avoid affecting the original data, a copy was made, and column names were changed for clarity. All duplicate rows were removed, and null values were replaced with 0 (or with "NA" for the country column). Rows in which babies, children, and adults were all 0 were dropped, and specified columns were converted to appropriate data types. Additionally, two new columns were created: total nights, which adds weekend nights and week nights, and total guests, which adds adults, children, and babies.

The analysis of the cleaned data provided valuable insights into hotel booking trends. By diving deep into the data, the researcher discovered insights such as the total number of nights stayed, deposit types, most reserved and assigned room types, customer arrivals by age group, the maximum number of bookings made by agents, and the top 10 countries that booked the hotels. These insights can help businesses identify areas for improvement and enhance their services. Customers can also make more informed decisions based on their specific requirements and preferences.

One of the key findings was that the guest retention rate was low, with only 3.86% of guests returning. The city hotel was more popular than the resort hotel, and January was found to be the best time to book a hotel based on the lowest average daily rate. The analysis revealed that the highest number of bookings came from Portugal, followed by the UK, France, and other countries. This information can be used to target marketing efforts in these regions and increase the hotels' revenue.

July and August were the busiest months for hotel bookings, which is not surprising given that it is a peak travel season in many parts of the world. To increase bookings, businesses should focus on the main target group, bookings made for two adults. Offering promotions and incentives to customers who keep their bookings can also help increase the number of customers.

For data visualization, the researcher used the seaborn and matplotlib libraries and various types of graphs, such as bar charts, displot, line charts, pie charts, scatter plots, box plots, correlation heatmaps, and pair plots. These visualizations helped to simplify complex data and make it more understandable.

In conclusion, the analysis of the hotel booking dataset provided valuable insights into customer behaviour and booking trends. By understanding these insights, hotel businesses can make data-driven decisions to improve their services and customer experiences. Additionally, customers can make more informed decisions based on their specific requirements and preferences. Data visualization played a significant role in simplifying complex data, making it easier to understand and identify areas for improvement. Overall, the findings of this analysis can help businesses grow and enhance their services to meet customers' needs and preferences.


# **GitHub Link -**

https://github.com/chota-mota01/Capstone-EDA-Project---Hotel-Booking

# **Problem Statement**


The hotel industry generates a vast amount of data, including customer information, room types, prices, and booking patterns. Analyzing this data can be challenging and time-consuming, but it is crucial for hotel management to make informed decisions and improve their business operations. Analyzing the dataset can also help customers learn the best time of year to book a hotel room and the optimal length of stay to get the best daily rate. The following dataset can help to explore various questions.

The hotel booking dataset contains booking information for a city hotel and a resort hotel. It includes information such as when the booking was made, length of stay, number of adults, children and babies, the number of available parking spaces, country, customer types, market segment, distribution channels and many more factors. The data will help us understand the factors needed to optimize occupancy rates, increase revenue, and improve customer satisfaction.



#### **Define Your Business Objective?**

The main objective is to perform EDA on the given dataset and provide actionable insights to enhance the hotel business.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required. 
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits. 
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is followed for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Reading Data
path = '/content/drive/My Drive/Hotel Bookings.csv'
hotel_data = pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
# head() method returns first 5 rows of the dataset
hotel_data.head()

In [None]:
# Dataset Last Look
# tail() method returns last 5 rows of the dataset
hotel_data.tail()

In [None]:
# If number is specified, head() returns specified number of rows
hotel_data.head(7)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
hotel_data.shape

### Dataset Information

In [None]:
# Dataset Info
hotel_data.info()

In [None]:
# Columns present in dataset
list(hotel_data.columns)

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
hotel_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
hotel_data.isna().sum().sum()

In [None]:
# Use isnull().sum() to view the null count in each column
hotel_data.isnull().sum()

In [None]:
# Visualizing the missing values
# Check null values by plotting a heatmap (cbar=False hides the colour bar)
plt.figure(figsize=(12,6))
sns.heatmap(hotel_data.isnull(),cbar=False)

### What did you know about your dataset?

The dataset given contains booking information for a city hotel and a resort hotel. We need to analyze the important factors in the dataset that govern the bookings.

The dataset has 119390 rows and 32 columns. It contains 129425 missing/null values and 31994 duplicate rows. The null counts in the children, country, agent and company columns are 4, 488, 16340 and 112593 respectively.

Using the seaborn library, we visualized these missing/null values as a heatmap.
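
A per-column summary can make the null counts above easier to compare. The following is a minimal sketch on a small toy frame (the name `toy` and its values are illustrative assumptions, not rows from the hotel dataset); the same expressions could be applied to `hotel_data`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for hotel_data (illustrative values only)
toy = pd.DataFrame({
    "children": [0.0, 1.0, np.nan, 2.0],
    "country": ["PRT", None, "GBR", "FRA"],
    "agent": [9.0, np.nan, np.nan, 240.0],
    "adults": [2, 2, 1, 2],
})

# Count and percentage of nulls per column
null_summary = pd.DataFrame({
    "null_count": toy.isnull().sum(),
    "null_pct": (toy.isnull().mean() * 100).round(2),
})

# Keep only the columns that actually contain nulls, largest first
null_summary = (
    null_summary[null_summary["null_count"] > 0]
    .sort_values("null_count", ascending=False)
)
print(null_summary)
```

On the real data, a column such as company, where most values are missing, would appear at the top of this table.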

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
hotel_data.columns

In [None]:
# Dataset Describe
hotel_data.describe(include='all')

### Variables Description 

* **hotel**       **:**  Resort Hotel , City Hotel
* **is_canceled**       **:**  Cancelled Booking (1,0)
* **lead_time**       **:**  Number of days between booking and check-in date
* **arrival_year**       **:**  Year of arrival 
* **arrival_month**       **:**  Month of arrival 
* **arrival_week_no**       **:**  Week number for arrival
* **arrival_day** **:**  Day of arrival
* **weekend_nights**       **:**  Hotel booked for number of weekend nights 
* **week_nights**       **:**  Hotel booked for number of week nights 
* **adults**       **:**  Number of adults stayed at the hotel
* **children**       **:**  Number of children stayed at the hotel
* **babies**       **:**  Number of babies stayed at the hotel
* **meal**       **:**  Different kind of meal opted by the guest/customer
* **country**       **:**  Country code
* **market_segment**       **:**  Segment to which customer belongs
* **distribution_channel**       **:**  Stay accessed (corporate booking/Direct/TA)
* **is_repeated_guest**       **:**  Repeated customers (0,1) 
* **previous_cancellations**       **:**  Prior Cancellation check
* **previous_bookings**       **:**  Bookings done previously
* **reserved_room**       **:**  Type of reserved room
* **assigned_room**       **:** Type of assigned room
* **booking_changes**       **:** Changes made in booking
* **deposit_type**       **:** Type of deposit
* **agent**       **:** Booking done through agent
* **company**       **:** Booked through company
* **days_in_waiting**       **:** Days in waiting list
* **customer_type**       **:** Customer Type
* **adr**       **:** Average Daily Rate
* **req_car_parking**       **:** Requirement of car parking spaces
* **special_requests**       **:** Additional special requirements
* **reservation_status**       **:** Booking status
* **reservation_status_date**       **:** Date of reservation status
* **total_nights**       **:** Total stays in night
* **total_guests**       **:** Total number of guests arrived

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in hotel_data:
  print(hotel_data[column].unique())

In [None]:
# Count of Unique Values for each variable.
for col in hotel_data:
  print("Count of unique values in",col,"is",hotel_data[col].nunique(),".")

In [None]:
# Count of adults 
hotel_data['adults'].value_counts()

In [None]:
# Count of children
hotel_data['children'].value_counts()

In [None]:
# Count of babies
hotel_data['babies'].value_counts()

In [None]:
# Top 10 countries by booking count
hotel_data['country'].value_counts().head(10)

In [None]:
# Bottom 10 countries by booking count
hotel_data['country'].value_counts().tail(10)

In [None]:
# Count of Hotel
hotel_data['hotel'].value_counts()

In [None]:
# Count of repeated guest
hotel_data['is_repeated_guest'].value_counts()

In [None]:
# Count of Meal
hotel_data['meal'].value_counts()

In [None]:
# Count of Deposit type
hotel_data['deposit_type'].value_counts()

In [None]:
# Count of Customer type
hotel_data['customer_type'].value_counts()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Create copy of dataset
hotel_df=hotel_data.copy()
hotel_df.columns

In [None]:

# Renaming the specific columns 
hotel_df.rename(columns={'arrival_date_year':'arrival_year',
                           'arrival_date_month':'arrival_month',
                           'arrival_date_week_number': 'arrival_week_no',
                           'arrival_date_day_of_month':'arrival_day',
                           'stays_in_weekend_nights':'weekend_nights',
                           'stays_in_week_nights':'week_nights',
                           'reserved_room_type':'reserved_room',
                           'assigned_room_type':'assigned_room',
                           'days_in_waiting_list':'days_in_waiting',
                           'previous_bookings_not_canceled':'previous_bookings',
                           'required_car_parking_spaces':'req_car_parking',
                           'total_of_special_requests':'special_requests',
                          },inplace=True)

In [None]:
# Drop all duplicate rows
hotel_df.drop_duplicates(inplace=True)

In [None]:
# Drop when value is 0 in all 3 columns (adults,children,babies)
hotel_df = hotel_df.drop(hotel_df[(hotel_df.adults + hotel_df.children + hotel_df.babies)==0].index)

In [None]:
# Replace all null values in country column by NA
# (DataFrame-level fillna, consistent with the cells below, avoids chained in-place assignment)
hotel_df.fillna({'country':'NA'},inplace=True)

In [None]:
# Replace all null values in agent column by 0
hotel_df.fillna({'agent':0},inplace=True)

In [None]:
# Replace all null values in company column by 0
hotel_df.fillna({'company':0},inplace=True)

In [None]:
# Replace all null values in children column by 0
hotel_df.fillna({'children':0},inplace=True)

In [None]:
# Converting data type of children column
hotel_df['children']=hotel_df['children'].astype(int)

In [None]:
# Converting data type of agent column
hotel_df['agent']=hotel_df['agent'].astype(int)

In [None]:
# Converting data type of company column
hotel_df['company']=hotel_df['company'].astype(int)

In [None]:
# Converting data type of reservation status date to datetime data type
hotel_df['reservation_status_date']=hotel_df['reservation_status_date'].astype('datetime64[ns]')

In [None]:
# Create 'total_nights' column to combine data from weekend nights and week nights
hotel_df['total_nights'] = hotel_df['weekend_nights'] + hotel_df['week_nights']

In [None]:
# Create 'total_guests' column to get data of total guests arrived
hotel_df['total_guests'] = hotel_df['adults'] + hotel_df['children'] + hotel_df['babies']

In [None]:
# Checking null values and data type after the required changes made in dataset
hotel_df.info()

### What all manipulations have you done and insights you found?

While analyzing the dataset, we found many null values and duplicate values. Before manipulating the data, we created a copy of the hotel booking dataset so that the changes made to the copy would not affect the original dataset.

After creating the copy, we renamed some columns for ease of understanding and dropped the duplicate rows. If adults, children and babies are all 0, no guest is attached to the booking, so the record is invalid; we dropped such rows. We then replaced the null values in the country column with "NA" and the null values in the agent, company and children columns with 0.

We also observed some columns with float datatype that need to be in integer form, so we changed the datatype of children, agent and company from float to integer. The datatype of reservation_status_date was changed to datetime. The weekend nights and week nights were added together to obtain the total nights stayed, stored in the new column total_nights. To know the total number of guests, we created the new column total_guests, which is the sum of 3 columns - adults, children and babies.

These manipulations were performed to make the dataset easier to visualize. Columns that are not detailed enough, or not directly related to hotel bookings, were not used during visualization.
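
The cleaning steps described above can be checked with a few assertions. The following is a minimal sketch on a toy frame (the names `toy` and `cleaned` and all values are illustrative assumptions). It applies the steps in a slightly different order from the notebook, filling nulls before filtering out zero-guest rows, which gives the same result on this toy data.

```python
import numpy as np
import pandas as pd

# Toy frame using the renamed column names (illustrative values only)
toy = pd.DataFrame({
    "adults": [2, 2, 0, 1],
    "children": [0.0, 0.0, 0.0, np.nan],
    "babies": [0, 0, 0, 0],
    "weekend_nights": [1, 1, 0, 2],
    "week_nights": [3, 3, 1, 5],
    "country": ["PRT", "PRT", "GBR", None],
})

# Drop exact duplicate rows, then fill nulls column by column
cleaned = toy.drop_duplicates().copy()
cleaned = cleaned.fillna({"children": 0, "country": "NA"})
cleaned["children"] = cleaned["children"].astype(int)

# Derived columns
cleaned["total_guests"] = cleaned["adults"] + cleaned["children"] + cleaned["babies"]
cleaned["total_nights"] = cleaned["weekend_nights"] + cleaned["week_nights"]

# Remove bookings with no guests at all
cleaned = cleaned[cleaned["total_guests"] > 0]

# Invariants the cleaned data is expected to satisfy
assert cleaned.isnull().sum().sum() == 0
assert not cleaned.duplicated().any()
assert (cleaned["total_guests"] > 0).all()
print(cleaned)
```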

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Pie Chart on Repeated Guest

In [None]:
# Chart - 1 visualization code 
# Percentage of Repeated guest
hotel_df.is_repeated_guest.value_counts()
hotel_df['is_repeated_guest'].value_counts().plot(kind='pie',
                                                 figsize=(15,6),
                                                 autopct="%.2f%%",
                                                 startangle=90,
                                                 labels=['0(%)','1(%)'],
                                                 colors=['pink','brown'],
                                                 explode=[0,0])
plt.legend(title='Repeated Guest:') 
    

##### 1. Why did you pick the specific chart?

A pie chart compares the contribution of each part to the data. It is a circular statistical graphic which is divided into slices to illustrate numerical proportion.

To visualize the percentage of repeated and non-repeated guests, I chose a pie chart because it shows data as a percentage of a whole.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that there are very few repeated guests. The percentage of non-repeated guests is 96.14%, whereas the percentage of repeated guests is 3.86%.
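
The same percentages can be computed numerically, and the labels can be derived from the data rather than typed in a fixed order. This is a minimal sketch on a toy series (the name `repeated` and its values are illustrative assumptions); on the notebook's data the input would be `hotel_df['is_repeated_guest']`.

```python
import pandas as pd

# Toy flags: 1 = repeated guest, 0 = first-time guest (illustrative values only)
repeated = pd.Series([0, 0, 0, 0, 1, 0, 0, 0, 0, 0], name="is_repeated_guest")

# Share of each class as a percentage
share_pct = repeated.value_counts(normalize=True).mul(100).round(2)

# Build labels from the index so they always match the slice order
labels = [
    f"{'Repeated' if flag == 1 else 'Not repeated'} guest"
    for flag in share_pct.index
]
print(share_pct)
print(labels)
```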

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight shows that repeated guests are few and non-repeated guests make up the vast majority of bookings. The guest retention rate is very low.

#### Chart - 2 - Pie Chart on Hotel

In [None]:
# Chart - 2 visualization code
# Percentage of bookings in each hotel
hotel_df.hotel.value_counts()
hotel_df['hotel'].value_counts().plot(kind='pie',
                                        figsize=(15,6),
                                        autopct="%.2f%%",
                                        startangle=90,
                                        shadow=True,
                                        labels=['City Hotel(%)','Resort Hotel(%)'],
                                        colors=['yellow','red'],
                                        explode=[0,0])

plt.legend(title='Hotel:')


##### 1. Why did you pick the specific chart?

A pie chart compares the contribution of each part to the data. It is a circular statistical graphic which is divided into slices to illustrate numerical proportion.

To visualize the percentage of bookings at the city hotel and the resort hotel, I chose a pie chart because it shows data as a percentage of a whole.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that bookings for the City Hotel are considerably higher than those for the Resort Hotel. The City Hotel accounts for 61.07% of bookings and the Resort Hotel for 38.93%.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight tells us that, when booking, guests prefer the City Hotel over the Resort Hotel.

#### Chart - 3 - Displot on meal

In [None]:
# Chart - 3 visualization code
# Most preferred meal by the guests
# displot is figure-level, so the size is set with height/aspect instead of plt.figure
sns.displot(data=hotel_df,y='meal',height=5,aspect=2.4)
plt.title('Meal',fontsize=20,fontweight='bold',c='r')
plt.show()


##### 1. Why did you pick the specific chart?

A displot (distribution plot) represents the data in histogram form. It is used for univariate data.

To visualize the distribution of meal types, I chose a displot.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that 'BB' is the most preferred meal, followed by 'SC' and 'HB', whereas 'FB' is a less preferred meal.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight helps us understand that the guests' favourite (i.e. most preferred) meal is 'BB' and that 'FB' is a less preferred meal.

#### Chart - 4 - Countplot of Total Nights in each hotel

In [None]:
# Chart - 4 visualization code
# Total nights spent by guests at each hotel 
plt.figure(figsize=(12,5))
sns.countplot(data=hotel_df, x='total_nights',hue='hotel',palette='viridis')
plt.title('Total Nights in each Hotel',fontsize=20,fontweight='bold',c='black')
plt.show()

##### 1. Why did you pick the specific chart?

The countplot represents the counts of the observation present in the categorical variable. It uses the concept of a bar chart for the visual depiction.

The specific chart will help us to compare total nights with respect to each hotel to check how long guests preferred to stay in each hotel.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that guests booking either hotel tend to stay for up to 4 nights; most guests do not stay more than 4 nights in either hotel.
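
The "up to 4 nights" observation can be quantified as a share per hotel. This is a minimal sketch on toy data (the name `toy` and its values are illustrative assumptions, and the 4-night threshold is taken from the observation above); on the notebook's data the input would be `hotel_df`.

```python
import pandas as pd

# Toy bookings (illustrative values only)
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 5 + ["Resort Hotel"] * 5,
    "total_nights": [1, 2, 3, 4, 7, 2, 3, 7, 10, 14],
})

# Percentage of bookings with a stay of 4 nights or fewer, per hotel
short_stay_pct = (
    toy["total_nights"].le(4)   # True where the stay is at most 4 nights
    .groupby(toy["hotel"])
    .mean()                     # mean of booleans = share of True values
    .mul(100)
    .round(1)
)
print(short_stay_pct)
```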

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight shows that guests prefer to stay for only a few days in both hotels. The factors behind this common length of stay for most guests should be looked into.

#### Chart - 5 - Countplot on Deposit Type

In [None]:
# Chart - 5 visualization code
# Countplot on deposit type and hotel
sns.set_theme(style='white')
plt.figure(figsize=(12,5))
sns.countplot(data=hotel_df,x='deposit_type',hue='hotel',palette='coolwarm')
plt.title('Deposit Type wrt Hotel',fontsize=20,fontweight='bold',c='black')
plt.show()

##### 1. Why did you pick the specific chart?

The countplot represents the counts of the observation present in the categorical variable. It uses the concept of a bar chart for the visual depiction.

I selected the specific chart to know the deposit type used while booking city hotel and resort hotel.

##### 2. What is/are the insight(s) found from the chart?

The chart shows 3 types of deposit - No deposit, Refundable deposit and Non-refundable deposit. The most common deposit type is no deposit (booking without advance payment), followed by non-refundable deposit, whereas refundable deposit is rarely chosen by the guests.
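
A cross-tabulation gives the same comparison as numbers. This is a minimal sketch on toy data (the name `toy` and its values are illustrative assumptions); on the notebook's data the inputs would be `hotel_df['deposit_type']` and `hotel_df['hotel']`.

```python
import pandas as pd

# Toy bookings (illustrative values only)
toy = pd.DataFrame({
    "hotel": ["City Hotel"] * 4 + ["Resort Hotel"] * 4,
    "deposit_type": [
        "No Deposit", "No Deposit", "No Deposit", "Non Refund",
        "No Deposit", "No Deposit", "No Deposit", "Refundable",
    ],
})

# Percentage of each deposit type within each hotel (columns sum to 100)
deposit_pct = (
    pd.crosstab(toy["deposit_type"], toy["hotel"], normalize="columns")
    .mul(100)
    .round(1)
)
print(deposit_pct)
```

Normalizing by column makes the two hotels comparable even though one has more bookings than the other.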

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight tells us that no deposit (no advance payment) is the deposit type most often chosen by guests at each hotel.

#### Chart - 6 - Countplot on Customer Type

In [None]:
# Chart - 6 visualization code
# Countplot on customer type
plt.figure(figsize=(12,5))
sns.countplot(data=hotel_df, y='customer_type')
plt.title('Customer Type',fontsize=20,fontweight='bold',c='black')
plt.show()

##### 1. Why did you pick the specific chart?

The countplot represents the counts of the observation present in the categorical variable. It uses the concept of a bar chart for the visual depiction.

I used a countplot to see the different types of customers arriving at the hotel.

##### 2. What is/are the insight(s) found from the chart?

We found that there are 4 types of customers - transient, contract, transient-party and group. Most of the customers are of the transient type, followed by transient-party and contract. Very few customers are of the group type.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insights will help in managing the different customer types according to their needs/requirements.

#### Chart - 7 - Countplot on Assigned Room Type

In [None]:
# Chart - 7 visualization code
# Most assigned room by the guests
plt.figure(figsize=(12,5))
sns.countplot(data=hotel_df, x='assigned_room',hue='hotel',palette='dark')
plt.title('Assigned Room Type',fontsize=20,fontweight='bold',c='brown')
plt.show()


##### 1. Why did you pick the specific chart?

The countplot represents the counts of the observation present in the categorical variable. It uses the concept of a bar chart for the visual depiction.

The specific chart specifies the most assigned room type in each hotel.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the most assigned room type is A in both the city and resort hotels, followed by room types D and E.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight tells us that the most assigned room type is A in each hotel, followed by room types D and E. So, to increase business profit, we should extend services similar to those of room type A to the other room types. This will help to maximize revenue.

#### Chart - 8 - Countplot on Reserved Room Type

In [None]:
# Chart - 8 visualization code
# Most reserved room type by the guests
plt.figure(figsize=(12,5))
sns.countplot(data=hotel_df, x='reserved_room',hue='hotel',palette='deep')
plt.title('Reserved Room',fontsize=20,fontweight='bold',c='black')
plt.show()


##### 1. Why did you pick the specific chart?

The countplot represents the counts of the observation present in the categorical variable. It uses the concept of a bar chart for the visual depiction.

I picked the specific chart to know the most reserved room selected by the guests arriving at each hotel.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that the most reserved room type is A in both the city and resort hotels, followed by room types D and E.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight tells us that the most reserved room type is A in each hotel, followed by room types D and E. So, to increase business profit, we should extend services similar to those of room type A to the other room types. This will help to maximize revenue.

#### Chart - 9 - Scatter plot on Total Nights

In [None]:
# Chart - 9 visualization code
# Scatter plot to see if length of nights affect the adr
plt.figure(figsize=(18,6))
sns.scatterplot(data=hotel_df,x='total_nights',y='adr')
plt.title('Total Nights wrt Adr',fontweight='bold',c='black')
plt.show()

In [None]:
# Drop the extreme adr outlier (adr > 5000) and plot the scatter plot again
hotel_df.drop(hotel_df[hotel_df['adr']>5000].index,inplace=True)
plt.figure(figsize=(18,6))
sns.scatterplot(data=hotel_df,x='total_nights',y='adr')
plt.title('Total Nights wrt Adr',c='black')
plt.show()

##### 1. Why did you pick the specific chart?

Scatter plots are graphs that represent the relationship between two specified variables in the dataset. Each row of the dataset is plotted as a point whose x-y coordinates correspond to its values for the two variables.

The specific chart tells us the relationship between total nights and adr. 

##### 2. What is/are the insight(s) found from the chart?

The chart shows the relationship between the total nights spent by guests in the hotel and adr (average daily rate). From the plot, we understand that as the number of total nights increases, adr decreases. One booking had an extreme adr value (above 5000), so I removed this outlier and plotted the chart again for a clearer result.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight tells us that as the number of total nights increases, adr decreases. This means that customers can get a better deal for longer stays.
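
The trend read from the scatter plot can be backed by a number. This is a minimal sketch on toy data (the name `toy`, its values, and the 4-night cut-off are illustrative assumptions). It computes a Spearman rank correlation as the Pearson correlation of the ranks, and compares the median adr of shorter and longer stays.

```python
import pandas as pd

# Toy bookings (illustrative values only, chosen to be monotonically decreasing)
toy = pd.DataFrame({
    "total_nights": [1, 2, 3, 5, 7, 10, 14],
    "adr": [120.0, 110.0, 100.0, 95.0, 90.0, 80.0, 70.0],
})

# Spearman rank correlation: Pearson correlation computed on the ranks
rank_corr = toy["total_nights"].rank().corr(toy["adr"].rank())

# Median adr for shorter vs longer stays
stay_band = toy["total_nights"].gt(4).map(
    {False: "up to 4 nights", True: "more than 4 nights"}
)
median_adr = toy.groupby(stay_band)["adr"].median()

print(round(rank_corr, 3))
print(median_adr)
```

A negative rank correlation on the real data would support the reading that adr falls as the stay gets longer; a value near zero would suggest the pattern in the scatter plot is weaker than it looks.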

#### Chart - 10 - Lineplot ADR with respect to arrival month

In [None]:
# Chart - 10 visualization code
# Lineplot ADR with respect to arrival month
plt.figure(figsize=(15,6))
sns.lineplot(data=hotel_df,x='arrival_month',y='adr',color='green')
plt.title('ADR with respect to arrival month',fontweight='bold',c='black')
plt.show()

##### 1. Why did you pick the specific chart?

A line chart connects a series of data points with a continuous line. It shows how one variable changes with respect to another.

The specific chart shows the relationship between arrival of customer per month and adr(average daily rate).

##### 2. What is/are the insight(s) found from the chart?

The chart shows that adr varies with the arrival month and tends to be higher in the months with more customer arrivals. adr is highest in August and then gradually decreases.
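
Because `arrival_month` holds month names, a plot or table will not necessarily follow calendar order unless that order is supplied. This is a minimal sketch on toy data (the names `toy` and `month_order` and all values are illustrative assumptions) that converts the column to an ordered categorical before averaging adr per month.

```python
import pandas as pd

# Toy bookings (illustrative values only)
toy = pd.DataFrame({
    "arrival_month": ["July", "August", "January", "August", "January", "July"],
    "adr": [120.0, 150.0, 60.0, 170.0, 80.0, 140.0],
})

# Calendar order for the month names used in the dataset
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
toy["arrival_month"] = pd.Categorical(
    toy["arrival_month"], categories=month_order, ordered=True
)

# Mean adr per month, returned in calendar order (only months present in the data)
monthly_adr = toy.groupby("arrival_month", observed=True)["adr"].mean()
print(monthly_adr)
```

With the column converted this way, a line plot of adr against `arrival_month` would run from January to December.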

#### Chart - 11 - Countplot on Customers Arrival 

In [None]:
# Chart - 11 visualization code
# Countplot on arrival of children, adults, babies in each hotel
plt.figure(figsize=(18, 9))
plt.subplot(1,2,1)
sns.countplot(data = hotel_df, x = 'children',hue='hotel',palette='dark')
plt.title('Children Arrival in each Hotel',fontweight="bold", size=20)
plt.subplot(1,2,2)
sns.countplot(data=hotel_df,x='adults',hue='hotel',palette='bright')
plt.title('Adults Arrival in each Hotel',fontweight="bold",size=20)
plt.subplots_adjust(right=2)
plt.show()
plt.subplot(2,1,1)
sns.countplot(data=hotel_df,x='babies',hue='hotel',palette='colorblind')
plt.title('Babies Arrival in each Hotel',fontweight="bold",size=20)
plt.show()

##### 1. Why did you pick the specific chart?

The countplot represents the counts of the observation present in the categorical variable. It uses the concept of a bar chart for the visual depiction.

The chart tells us about arrival of customers of different age group i.e. children, adults, babies in each hotel.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that adults are the most common guests in both hotels and that most bookings are for 2 adults arriving together. For all age groups, a higher number of arrivals is seen in the city hotel.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight helps us identify the target age group, i.e. adults. To maximize revenue, more attractive offers should be provided to bookings of 2 adults arriving together.

#### Chart - 12 - Barplot on Agent 

In [None]:
# Chart - 12 visualization code
# Bar graph to know the agent with maximum number of bookings
plt.figure(figsize=(15,6))
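# Ten agents with the most bookings (agent 0 stands for bookings with no agent recorded)
agent_data= hotel_df['agent'].value_counts().head(10)
sns.barplot(y=agent_data,x=agent_data.index,orient='v',color='brown')
plt.xlabel('Agent')
plt.ylabel('Number of Bookings')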
plt.show()

##### 1. Why did you pick the specific chart?

Bar graphs are a pictorial representation of data in the form of vertical or horizontal rectangular bars, where the length of each bar is proportional to the value it represents. They are a fundamental visualization for comparing different sets of data and show the relationship between two axes.

The chart tells us which agents made the highest number of bookings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that agent 9 made the highest number of bookings.
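
Since missing agent values were replaced with 0 during data wrangling, agent 0 is a placeholder for bookings with no agent recorded rather than a real agent. This is a minimal sketch on a toy series (the name `agents` and its values are illustrative assumptions) that excludes the placeholder before ranking.

```python
import pandas as pd

# Toy agent IDs; 0 marks bookings where the agent was missing (illustrative values only)
agents = pd.Series([9, 9, 9, 0, 0, 0, 0, 240, 240, 14], name="agent")

# Rank real agents only, ignoring the placeholder value 0
top_agents = agents[agents != 0].value_counts().head(3)
print(top_agents)
```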

#### Chart - 13 - Barplot on top 10 Country

In [None]:
# Chart - 13 visualization code
# Barplot on top 10 country wrt number of bookings
hotel_df.country.value_counts().head(10).plot.bar(color=['indigo','violet','r','g','b','y','orange'])
plt.title('Top 10 Country',fontsize=20)
plt.xlabel('country',fontweight='bold',fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Bar graphs are a pictorial representation of data in the form of vertical or horizontal rectangular bars, where the length of each bar is proportional to the value it represents. They are a fundamental visualization for comparing different sets of data and show the relationship between two axes.

I chose this chart to see the top 10 countries by number of bookings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that PRT (Portugal) is the country with the most hotel bookings, at 20977, followed by GBR, FRA, ESP, DEU, IRL, ITA, BEL, NLD and USA.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

These insights identify the countries that generate the most bookings. We also need to investigate why the other countries generate fewer bookings.
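To see how much the top countries contribute compared with all the others, the counts can be collapsed into a top-N list plus an "Other" bucket. This is a minimal sketch on made-up data: `toy_df`, `top_n` and the counts are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the country codes mimic the style used in the dataset.
toy_df = pd.DataFrame(
    {"country": ["PRT"] * 5 + ["GBR"] * 3 + ["FRA"] * 2 + ["ESP", "DEU"]}
)

top_n = 3
counts = toy_df["country"].value_counts()

# Keep the top_n countries and lump every remaining country into "Other".
top_counts = counts.head(top_n)
other_total = counts.iloc[top_n:].sum()
summary = pd.concat([top_counts, pd.Series({"Other": other_total})])

# Convert the counts to a percentage of all bookings.
summary_pct = (summary / summary.sum() * 100).round(1)
print(summary_pct)
```

A large "Other" share would suggest that the long tail of countries deserves marketing attention as well.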

#### Chart - 14 - Bar Chart on Children arriving yearly

In [None]:
# Chart - 14 visualization code
# Bar chart on children arriving yearly at each hotel
# Note: by default sns.barplot plots the mean of 'children' per booking for each group, not the total
plt.figure(figsize=(12,5))
sns.barplot(data=hotel_df,x='arrival_year',y='children',hue='hotel',palette='magma')
plt.title('Children Yearly Arrival',fontweight='bold',fontsize=20)
plt.show()

##### 1. Why did you pick the specific chart?

Bar graphs represent data as vertical or horizontal rectangular bars whose length is proportional to the value they measure. They are a fundamental visualization for comparing categories, with one axis showing the categories and the other showing the measured value.

I chose a bar graph to show the yearly arrival of children at each hotel and how it changes over time.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that from 2015 to 2016 the arrival of children increased slightly in the resort hotel, whereas there was a noticeable increase in the city hotel. The arrival rate is highest in 2017, so the arrival of children has been increasing over time.
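Because seaborn's `barplot` shows the mean of the y variable by default, it can help to compare the average number of children per booking with the yearly total. The sketch below does this on a small made-up frame; `toy_df` and its values are hypothetical, and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical toy bookings; column names follow hotel_df, values are made up.
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel", "Resort Hotel"],
    "arrival_year": [2015, 2016, 2016, 2015, 2016, 2016],
    "children": [0, 1, 2, 1, 0, 2],
})

# Mean per booking (what the bar heights represent), total children, and booking count.
children_by_year = (
    toy_df.groupby(["arrival_year", "hotel"])["children"]
    .agg(mean_per_booking="mean", total="sum", bookings="count")
)
print(children_by_year)
```

If the two measures move in different directions, the growth in total arrivals is driven by more bookings rather than by more children per booking.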

#### Chart - 15 - Lineplot on Adults Monthly Arrival

In [None]:
# Chart - 15 visualization code
# Line plot of adult arrivals by month
plt.figure(figsize=(15,6))
# Use the explicit 'color' keyword; the short alias 'c' can clash with seaborn's own colour handling
sns.lineplot(data=hotel_df,x='arrival_month',y='adults',color='hotpink')
plt.title('Adults Monthly Arrival',fontweight='bold',fontsize=20,color='black')
plt.show()

##### 1. Why did you pick the specific chart?

A line chart connects a series of data points with a continuous line to show how one variable changes across the ordered values of another, such as time.

I picked this chart to compare the arrival of adults by month and to check in which months most adults arrived or made bookings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most adults visited the hotels in July and August, and the fewest visited in November. This gives a clear idea of when most guests visit.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

The insight shows that most bookings were made in July and August, which may be due to offers or discounts during these months or to the holiday season. The same factors could be used to improve bookings in the other months as well.
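The monthly pattern can be tabulated directly. The sketch below assumes that `arrival_month` holds English month names, as in the original dataset column; if the cleaned column is numeric, the categorical step is unnecessary. `toy_df` and its values are hypothetical.

```python
import calendar
import pandas as pd

# Hypothetical toy data; arrival_month is assumed to contain English month names.
toy_df = pd.DataFrame({
    "arrival_month": ["August", "January", "July", "August", "November", "July"],
    "adults": [2, 1, 2, 3, 1, 2],
})

# An ordered categorical makes months sort in calendar order instead of alphabetically.
month_order = list(calendar.month_name)[1:]  # "January" ... "December"
toy_df["arrival_month"] = pd.Categorical(
    toy_df["arrival_month"], categories=month_order, ordered=True
)

# Total adults arriving per month, restricted to months present in the data.
adults_per_month = toy_df.groupby("arrival_month", observed=True)["adults"].sum()
busiest_month = adults_per_month.idxmax()
print(adults_per_month)
print(busiest_month)
```

Sorting by calendar order also keeps a line plot of the same data from zig-zagging between unrelated months.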

#### Chart - 16 - Boxplot on Lead time wrt Yearly arrival in both the hotels

In [None]:
# Chart - 16 visualization code
# Boxplot on Lead time wrt Yearly arrival in both the hotels
plt.figure(figsize=(15, 9))
sns.boxplot(data= hotel_df,x='arrival_year', y='lead_time',hue='hotel',palette='rainbow_r',saturation =0.9)
plt.title("Lead time wrt yearly arrival in Both hotels",fontweight="bold", size=20)
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots summarize how the values of a variable are distributed. They show the median, the quartiles (the “box”, which holds the middle 50% of the data points), the whiskers, and any outliers beyond the whiskers.

##### 2. What is/are the insight(s) found from the chart?

The chart gives a high-level summary of lead time by arrival year for both hotels. It is clear from the chart that outliers are present in every arrival year.
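The points drawn as outliers in a boxplot follow the 1.5 × IQR rule, which is the default whisker setting in seaborn and matplotlib. A minimal sketch of the same rule on a made-up series (the values are hypothetical and not taken from `hotel_df`):

```python
import pandas as pd

# Hypothetical lead times in days; 400 is deliberately extreme.
lead_time = pd.Series([0, 5, 10, 20, 30, 40, 50, 60, 400], name="lead_time")

# Quartiles and interquartile range.
q1 = lead_time.quantile(0.25)
q3 = lead_time.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the box edges are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = lead_time[(lead_time < lower_fence) | (lead_time > upper_fence)]
print(q1, q3, iqr, upper_fence)
print(outliers.tolist())
```

Applying the same calculation per arrival year and hotel would give the number of outlying bookings behind each box.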

#### Chart - 17 - Boxplot on Total guests wrt Total Nights 

In [None]:
# Chart - 17 visualization code
# Boxplot on total guests with respect to total nights in both the hotels
plt.figure(figsize=(15, 8))
sns.boxplot(data= hotel_df,x='total_guests', y='total_nights',hue='hotel',palette='rocket',saturation =0.9)
plt.title("Total Guests wrt Total Nights",fontweight="bold", size=20)
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots summarize how the values of a variable are distributed. They show the median, the quartiles (the “box”, which holds the middle 50% of the data points), the whiskers, and any outliers beyond the whiskers.

##### 2. What is/are the insight(s) found from the chart?

The chart suggests a positive relationship between total guests and total nights. Outliers are present for some group sizes but not for others: for example, there are no outliers for groups of 5 guests, whereas outliers are clearly visible for groups of 1, 2 and 4 guests.

#### Chart - 18 - Barplot on Cancelled Booking wrt Lead Time

In [None]:
# Chart - 18 visualization code
# Barplot of lead time for cancelled vs. non-cancelled bookings (bar height is the mean lead time)
plt.figure(figsize=(15, 6))
sns.barplot(data= hotel_df,x='is_canceled', y='lead_time',hue='hotel',palette='rocket',saturation =0.9)
plt.title("Cancelled booking wrt Lead time",fontweight="bold", size=20)
plt.show()

##### 1. Why did you pick the specific chart?

Bar graphs represent data as vertical or horizontal rectangular bars whose length is proportional to the value they measure. They are a fundamental visualization for comparing categories, with one axis showing the categories and the other showing the measured value.

To compare lead time between cancelled and non-cancelled bookings, I used a bar chart, as it is easy to read and gives quick insights at a glance.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that cancellations increase with lead time: cancelled bookings have a higher average lead time than bookings where the guests arrived. Note that the bar heights are average lead times, not booking counts.

##### 3. Will the gained insights help creating a positive business impact? 
Are there any insights that lead to negative growth? Justify with specific reason.

To create a positive business impact, we need to understand the reasons behind cancellations and resolve the issues that leave customers unsatisfied.
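One way to start looking for those reasons is to check how the cancellation rate changes across lead-time ranges. The sketch below is illustrative only: `toy_df`, the bucket edges and the labels are assumptions, not values from the analysis.

```python
import pandas as pd

# Hypothetical toy bookings; is_canceled is 1 for cancelled and 0 otherwise.
toy_df = pd.DataFrame({
    "lead_time":   [2, 5, 20, 45, 80, 120, 200, 300],
    "is_canceled": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Average lead time for non-cancelled (0) and cancelled (1) bookings.
mean_lead = toy_df.groupby("is_canceled")["lead_time"].mean()

# Bucket lead time (in days) and compute the share of cancelled bookings per bucket.
bins = [0, 30, 90, 365]
labels = ["0-30", "31-90", "91-365"]
toy_df["lead_bucket"] = pd.cut(
    toy_df["lead_time"], bins=bins, labels=labels, include_lowest=True
)
cancel_rate = toy_df.groupby("lead_bucket", observed=True)["is_canceled"].mean()
print(mean_lead)
print(cancel_rate)
```

Because `is_canceled` is coded as 0/1, its mean within each bucket is the cancellation rate for that bucket.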

#### Chart - 19 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(20,8))
correlation_data = hotel_df[['hotel','is_canceled','lead_time','arrival_year',
                    'arrival_month','arrival_week_no',
                    'arrival_day','weekend_nights',
                    'week_nights','adults','children','babies','meal',
                    'country','market_segment','distribution_channel',
                    'is_repeated_guest','previous_cancellations',
                    'previous_bookings','reserved_room',
                    'assigned_room','booking_changes', 'deposit_type',
                    'agent','company','days_in_waiting','customer_type',
                    'adr','req_car_parking',
                    'special_requests','reservation_status',
                    'reservation_status_date','total_nights','total_guests']]
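# numeric_only=True (pandas >= 1.5) keeps only numeric columns; in pandas >= 2.0 text columns raise an error otherwise
corr = correlation_data.corr(numeric_only=True)
sns.heatmap(corr,cmap='RdBu',annot=True,vmin=-1,vmax=1)
plt.show()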


##### 1. Why did you pick the specific chart?

Correlation heatmaps visualize the strength of the relationships between numerical variables. They are used to see which variables are related to each other and how strong each relationship is.

I used the correlation heatmap to see the correlation coefficient for each pair of variables.

##### 2. What is/are the insight(s) found from the chart?

From the above correlation heatmap, we can see that adults is correlated with the total number of guests, and that stays in week nights and stays in weekend nights are correlated with total stays. This is expected, because the totals were created by adding these columns together.

There is no clear linear relationship between arrival year and arrival week number, or between the number of special requests and is canceled.
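Reading the strongest relationships off a large heatmap is error-prone, so the coefficients can also be listed as ranked pairs. The sketch below uses a small made-up frame (`toy_df` is hypothetical, with column names borrowed from the notebook) and keeps only numeric columns before computing Pearson correlations.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 'hotel' is text and must be excluded from the correlation.
toy_df = pd.DataFrame({
    "hotel": ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "weekend_nights": [0, 2, 1, 2, 0],
    "week_nights": [2, 5, 3, 4, 1],
    "adults": [2, 2, 1, 3, 2],
})
toy_df["total_nights"] = toy_df["weekend_nights"] + toy_df["week_nights"]

# Keep numeric columns only, then compute the Pearson correlation matrix.
corr = toy_df.select_dtypes(include="number").corr()

# Take the upper triangle (k=1 skips the diagonal) so each pair appears once.
rows, cols = np.triu_indices(len(corr), k=1)
pair_index = pd.MultiIndex.from_tuples(
    [(corr.index[r], corr.columns[c]) for r, c in zip(rows, cols)]
)
pairs = pd.Series(corr.to_numpy()[rows, cols], index=pair_index)

# Rank the pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=abs, ascending=False)
print(pairs)
```

Pairs that involve a derived column, such as a total and one of its components, will rank near the top by construction and say little about guest behaviour.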

#### Chart - 20 - Pair Plot 

In [None]:
# Pair Plot visualization code
sns.pairplot(hotel_data)
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot shows the pairwise relationships between the variables in a dataset, where the variables can be continuous or categorical.

The chart covers the entire dataset, with every variable plotted against every other. The plots are arranged in a matrix in which each row's variable is on the y-axis and each column's variable is on the x-axis.

##### 2. What is/are the insight(s) found from the chart?

The pair plot covers the entire dataframe: every column is plotted against every other column, producing one large figure for comparing the overall relationships. In it we can observe linear relationships between some columns. Clusters can be seen for lead time against arrival week number, arrival day of month, stays in week nights, stays in weekend nights, agent, company and total stays, and for agent and company against arrival week number and arrival day of month. Outliers can be seen in adr, arrival week number, arrival month, arrival day and lead time. The figure helps us take in a large amount of data at a glance.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ? 
Explain Briefly.

1. City hotels were more popular than resort hotels in the dataset used for analysis. About 61.07% of bookings were for the City Hotel and only 38.93% for the Resort Hotel, which suggests that most guests prefer the City Hotel.

2. The customer retention rate is low: the number of repeated guests at both hotels is far lower than the number of first-time bookings.

3. The most popular meal type was BB (Bed & Breakfast), which indicates the meal plan most customers prefer.

4. Room type A was the most reserved, so the service for the other room types should be improved to give customers a better experience.

5. The maximum number of bookings was made in July and August, which may be due to offers, discounts or the holiday season, so the business owner should examine discounts and other factors to find the reason.

6. Customers booking the hotels are mostly pairs of adults.

7. The top 10 countries by hotel bookings are Portugal, United Kingdom, France, Spain, Germany, Ireland, Italy, Belgium, Netherlands and USA. Portugal accounts for the most bookings.

8. The highest ADR is reached in August, whereas the lowest is in January.
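The booking share and retention figures in the points above can be reproduced with a couple of one-liners. The sketch below runs on made-up data, so its percentages are not the 61.07% and 38.93% reported above; `toy_df` is hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical toy bookings; is_repeated_guest is 1 for returning guests.
toy_df = pd.DataFrame({
    "hotel": ["City Hotel"] * 6 + ["Resort Hotel"] * 4,
    "is_repeated_guest": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
})

# Share of bookings per hotel, in percent.
hotel_share_pct = (toy_df["hotel"].value_counts(normalize=True) * 100).round(2)

# Overall retention rate: the mean of a 0/1 flag is the fraction of repeated guests.
repeat_rate_pct = round(toy_df["is_repeated_guest"].mean() * 100, 2)

# Retention rate per hotel, in percent.
repeat_by_hotel_pct = (
    toy_df.groupby("hotel")["is_repeated_guest"].mean() * 100
).round(2)
print(hotel_share_pct)
print(repeat_rate_pct)
print(repeat_by_hotel_pct)
```

Splitting the retention rate by hotel shows whether loyalty efforts should target one property more than the other.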

# **Conclusion**

The conclusions drawn from the hotel booking analysis are as follows:

1. The guest retention rate is very low: repeated guests are few and non-repeated guests make up the vast majority.

2. The City Hotel was preferred over the Resort Hotel, with 61.07% of bookings at the city hotel and 38.93% at the resort hotel.

3. The customers' favourite meal type is 'BB'.

4. Guests tend to stay up to 4 nights at both hotels; stays longer than 4 nights are less common at each hotel.

5. Most customers are of the transient type.

6. Room type A was the most reserved, so the number of type A rooms should be increased, and the other room types need attention as well.

7. Most arriving customers are pairs of adults, so other groups, such as guests with children, need attention as well.

8. As the total number of nights increases, ADR decreases.

9. Most bookings were made in July and August, which means these are the busiest or most profitable months for the hotel business.

10. Bookings came from many countries, but most were made from Portugal.

11. The maximum number of bookings was made by Agent 9.

12. ADR tends to increase with the number of customers arriving per month; ADR is also seen to increase at the end of the month.
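The relationship between length of stay and ADR stated in the conclusions can be checked with a grouped mean and a correlation coefficient. This is a rough sketch on made-up numbers: `toy_df` and its values are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical toy bookings; adr is the average daily rate.
toy_df = pd.DataFrame({
    "total_nights": [1, 1, 2, 2, 4, 4, 7, 7],
    "adr": [120.0, 110.0, 105.0, 95.0, 90.0, 80.0, 70.0, 60.0],
})

# Mean ADR for each length of stay.
adr_by_nights = toy_df.groupby("total_nights")["adr"].mean()

# Pearson correlation between length of stay and ADR; a negative value means
# ADR falls as stays get longer.
pearson_r = toy_df["total_nights"].corr(toy_df["adr"])
print(adr_by_nights)
print(round(pearson_r, 3))
```

On the real data, extreme ADR values may need to be removed first, since a few outliers can dominate both the group means and the correlation.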

