<a href="https://colab.research.google.com/github/bindukovvada/Data-Science/blob/main/Bindu_Hotel_Booking_Analysis_CapstoneProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Have you ever wondered when the best time of year to book a hotel room is? Or the optimal length of stay in order to get the best daily rate? What if you wanted to predict whether or not a hotel was likely to receive a disproportionately high number of special requests? This hotel booking dataset can help you explore those questions! </b>

## <b>This data set contains booking information for a city hotel and a resort hotel, and includes information such as when the booking was made, length of stay, the number of adults, children, and/or babies, and the number of available parking spaces, among other things. All personally identifying information has been removed from the data. </b>

## <b> Explore and analyze the data to discover important factors that govern the bookings. </b>

*We will answer the following questions in this project.*


In [45]:
'''
1. How many bookings were cancelled?
2. What is the booking ratio between the Resort Hotel and the City Hotel?
3. What percentage of bookings falls in each year?
4. Which is the busiest month for the hotels?
5. Which country do most guests come from?
6. How long do people stay at the hotel?
7. Which was the most booked accommodation type (Single, Couple, Family)?
'''

*First, we import the necessary packages.*

In [46]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


*Now load and display the dataset.*

In [47]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [48]:
file_path = '/content/drive/MyDrive/Copy of Hotel Bookings.csv'

In [49]:
df = pd.read_csv(file_path)

In [50]:
df.head()

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


*Copying the dataset so that the original `df` stays untouched*

In [51]:
hotel_df = df.copy()
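The difference between binding a second name to a DataFrame and making an independent copy can be sketched on a toy frame. The names `df_demo`, `alias` and `independent` are illustrative only and do not appear in the notebook.

```python
import pandas as pd

# Tiny stand-in frame; the notebook itself works on the hotel bookings CSV.
df_demo = pd.DataFrame({"adults": [2, 1], "babies": [0, 1]})

alias = df_demo               # plain assignment: a second name for the same object
independent = df_demo.copy()  # a separate DataFrame with its own data

# Changing the copy leaves the original untouched.
independent.loc[0, "adults"] = 99

print(alias is df_demo)            # True: both names refer to one object
print(df_demo.loc[0, "adults"])    # 2: the original value is unchanged
```

Working on a copy keeps the raw data available for comparison after cleaning.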


*Finding size of the dataset*

In [52]:
hotel_df.shape

(119390, 32)


*Viewing all the columns*

In [53]:
hotel_df.columns

Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
       'arrival_date_month', 'arrival_date_week_number',
       'arrival_date_day_of_month', 'stays_in_weekend_nights',
       'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
       'country', 'market_segment', 'distribution_channel',
       'is_repeated_guest', 'previous_cancellations',
       'previous_bookings_not_canceled', 'reserved_room_type',
       'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
       'company', 'days_in_waiting_list', 'customer_type', 'adr',
       'required_car_parking_spaces', 'total_of_special_requests',
       'reservation_status', 'reservation_status_date'],
      dtype='object')


*Finding Null values*

In [54]:
hotel_df.isna().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         

We have 4 features with missing values.
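Raw null counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up frame (`demo` and its values are assumptions, not the hotel data) that lists only the columns with missing values and their percentage.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, standing in for the bookings data.
demo = pd.DataFrame({
    "children": [0.0, np.nan, 1.0, 0.0],
    "country": ["PRT", None, "GBR", "PRT"],
    "adults": [2, 2, 1, 2],
})

missing = demo.isna().sum()
missing = missing[missing > 0]                    # keep only columns with gaps
percent = (missing / len(demo) * 100).round(2)    # share of rows affected

summary = pd.DataFrame({"missing": missing, "percent": percent})
print(summary)
```

The same three lines can be applied to `hotel_df` to see how large each gap is relative to the full dataset.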

*The agent and company columns hold an ID number for each agent or company, so we replace their missing values with 0.*

In [55]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
hotel_df['agent'] = hotel_df['agent'].fillna(0)
hotel_df['company'] = hotel_df['company'].fillna(0)


*The children column holds a count of children, so we replace its missing values with the rounded mean.*

In [56]:
# Round the mean so the fill value is a whole number of children
hotel_df['children'] = hotel_df['children'].fillna(round(df['children'].mean()))
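As a small worked example of filling a count column with its rounded mean, the sketch below uses an invented series (`children` here is toy data, not the notebook's column).

```python
import numpy as np
import pandas as pd

# Toy counts with one missing entry.
children = pd.Series([0.0, 2.0, np.nan, 0.0, 1.0], name="children")

# mean() skips NaN: (0 + 2 + 0 + 1) / 4 = 0.75, which rounds to 1.
fill_value = int(round(children.mean()))

filled = children.fillna(fill_value).astype("int64")
print(fill_value, filled.tolist())
```

Rounding before filling keeps the column a valid count, so a later conversion to integer does not silently change the imputed value.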

*The country column contains country codes. It is a categorical feature, so we replace its missing values with the mode, that is, the country code that appears most often.*

In [57]:
# mode() returns a Series, so take its first entry to get a scalar fill value
hotel_df['country'] = hotel_df['country'].fillna(df['country'].mode()[0])
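One detail of mode-based imputation is easy to miss: `Series.mode()` returns a Series rather than a scalar, and `fillna` aligns a Series argument on index labels. The sketch below shows the effect on invented country codes (`country` here is toy data).

```python
import pandas as pd

# Toy country codes with two missing entries.
country = pd.Series(["PRT", None, "GBR", "PRT", None], name="country")

mode_series = country.mode()   # a Series; here a single entry "PRT" at index 0
most_common = mode_series[0]   # the scalar value

# A Series fill value is matched by index label, so only label 0 could be filled.
filled_by_series = country.fillna(mode_series)
# A scalar fill value is applied to every missing entry.
filled_by_scalar = country.fillna(most_common)

print(filled_by_series.isna().sum())
print(filled_by_scalar.isna().sum())
```

Checking `isna().sum()` after the fill is a quick way to confirm that the imputation actually took effect.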

*Some rows have zero guests in total (adults, children and babies combined). We remove those rows.*

In [58]:
hotel_df[(hotel_df.adults+hotel_df.children+hotel_df.babies)==0].shape

(180, 32)

In [59]:
hotel_df = hotel_df.drop(hotel_df[(hotel_df.adults+ hotel_df.children+hotel_df.babies)==0].index)

In [60]:
hotel_df[(hotel_df.adults+hotel_df.children+hotel_df.babies)==0].shape

(0, 32)
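The same filter can also be written with a boolean mask on a row-wise sum. This is a sketch on an invented frame (`demo` is an assumption), not a replacement for the cells above.

```python
import pandas as pd

# Invented bookings; the row at index 1 has no guests at all.
demo = pd.DataFrame({
    "adults":   [2, 0, 1, 0],
    "children": [0, 0, 2, 0],
    "babies":   [0, 0, 0, 1],
})

# Row-wise guest total; unlike chained "+", sum(axis=1) skips missing values.
total_guests = demo[["adults", "children", "babies"]].sum(axis=1)

# Keep only bookings with at least one guest.
cleaned = demo[total_guests > 0]
print(cleaned.shape)
```

Keeping rows with a positive total avoids the intermediate lookup of the index labels to drop.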

*Converting the data type of the 'children', 'company' and 'agent' columns from float to integer*


In [63]:
hotel_df[['children', 'company', 'agent']] = hotel_df[['children', 'company', 'agent']].astype('int64')
hotel_df.dtypes

hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                            int64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             
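The order of the cleaning steps matters for this conversion: a float column can only be cast to `int64` once its missing values are filled. The sketch below illustrates this on an invented `agent` series, with 0 standing for "no agent" as in the text.

```python
import numpy as np
import pandas as pd

agent = pd.Series([304.0, np.nan, 240.0], name="agent")

# Casting while NaN is still present fails: int64 has no missing-value marker.
try:
    agent.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# Filling first makes the cast safe.
agent_int = agent.fillna(0).astype("int64")

# astype truncates toward zero instead of rounding.
truncated = pd.Series([0.1, 1.9]).astype("int64")

print(cast_failed, agent_int.tolist(), truncated.tolist())
```

Because the cast truncates, any fractional fill value should be rounded explicitly before converting.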

*Remove duplicate rows. With `keep=False`, every copy of a duplicated row is dropped, including its first occurrence.*

In [65]:
hotel_df.drop_duplicates(keep=False,inplace = True)
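To compare the available `keep` options of `drop_duplicates`, here is a minimal sketch on an invented frame (`demo` is toy data). It is meant to show how the row counts differ, not to change the notebook's choice.

```python
import pandas as pd

# Three identical bookings plus one unique booking.
demo = pd.DataFrame({
    "hotel":  ["City Hotel", "City Hotel", "Resort Hotel", "City Hotel"],
    "adults": [2, 2, 1, 2],
})

keep_first = demo.drop_duplicates(keep="first")  # one representative per group
keep_none = demo.drop_duplicates(keep=False)     # duplicated rows removed entirely

print(len(keep_first), len(keep_none))
```

Which option is appropriate depends on whether repeated rows are treated as data-entry duplicates or as genuinely separate bookings that happen to look identical.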

*Finding how many bookings each type of hotel has*


In [68]:
hotel_df['hotel'].value_counts()

City Hotel      47437
Resort Hotel    31632
Name: hotel, dtype: int64

*Which is more popular: the City Hotel or the Resort Hotel?*

In [69]:
hotel_df['hotel'].mode()

0    City Hotel
dtype: object
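The counts above work out to roughly 60% City Hotel and 40% Resort Hotel bookings. A sketch of how such shares could be computed directly is shown below on an invented series (`hotel` here is toy data with the same two labels).

```python
import pandas as pd

# Toy labels in a 3:2 ratio.
hotel = pd.Series(["City Hotel"] * 3 + ["Resort Hotel"] * 2, name="hotel")

counts = hotel.value_counts()
# normalize=True returns fractions; scale to percentages.
shares = hotel.value_counts(normalize=True).mul(100).round(1)
most_popular = counts.idxmax()

print(shares)
print(most_popular)
```

Using `value_counts(normalize=True)` gives the booking ratio and the most popular hotel type from the same call pattern as the raw counts.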