# Part 1: Data Preprocessing

In this section, we will perform basic data preprocessing tasks on the Hotel Booking Dataset. This includes:

- **Importing the dataset**: Loading the data into a pandas DataFrame.
- **Parsing the data**: Converting columns to appropriate types where needed (e.g., strings to integers or dates).
- **Organizing the data**: Checking for missing values and setting up appropriate data structures so that the data is clean and ready for further analysis.
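The parsing and missing-value checks listed above can be sketched roughly as follows. This is a minimal illustration on a few hypothetical sample rows that mirror the dataset's `year`, `month`, and `date` columns, not the notebook's actual code. The column name `full_date` is an assumption introduced for the example, and one `avg_room_price` value is deliberately left missing to show what the check reports.

```python
import pandas as pd

# Hypothetical sample rows mirroring some of the dataset's columns.
# One avg_room_price value is left missing on purpose for illustration.
sample = pd.DataFrame({
    "ID": ["INN00001", "INN00002", "INN00003"],
    "year": [2017, 2018, 2018],
    "month": [10, 11, 2],
    "date": [2, 6, 28],
    "avg_room_price": [65.0, 106.68, None],
})

# pd.to_datetime can assemble a datetime from columns named year/month/day,
# so the "date" column (day of month) is renamed to "day" first.
parts = sample[["year", "month", "date"]].rename(columns={"date": "day"})

# errors="coerce" turns impossible calendar dates into NaT instead of raising.
# "full_date" is a hypothetical column name, not part of the dataset.
sample["full_date"] = pd.to_datetime(parts, errors="coerce")

# Count missing values per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Using `errors="coerce"` is a design choice: any row whose parts do not form a valid date shows up as a missing value in the same `isna()` check, rather than stopping the run.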

In [2]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt  # plotting functions live in the pyplot submodule
import seaborn as sns

In [5]:
# Loading the dataset
df = pd.read_csv("Hotel.csv")
df.head()

,ID,n_adults,n_children,weekend_nights,week_nights,meal_plan,car_parking_space,room_type,lead_time,year,month,date,market_segment,repeated_guest,previous_cancellations,previous_bookings_not_canceled,avg_room_price,special_requests,status
0,INN00001,2,0,1,2,Meal Plan 1,0,Room_Type 1,224,2017,10,2,Offline,0,0,0,65.0,0,Not_Canceled
1,INN00002,2,0,2,3,Not Selected,0,Room_Type 1,5,2018,11,6,Online,0,0,0,106.68,1,Not_Canceled
2,INN00003,1,0,2,1,Meal Plan 1,0,Room_Type 1,1,2018,2,28,Online,0,0,0,60.0,0,Canceled
3,INN00004,2,0,0,2,Meal Plan 1,0,Room_Type 1,211,2018,5,20,Online,0,0,0,100.0,0,Canceled
4,INN00005,2,0,1,1,Not Selected,0,Room_Type 1,48,2018,4,11,Online,0,0,0,94.5,0,Canceled


In [6]:
rows, columns = df.shape
print(f"The dataset contains {rows} rows and {columns} columns.")

The dataset contains 36275 rows and 19 columns.


In [8]:
# Displaying all the column names
print("Column names:", df.columns.tolist())

Column names: ['ID', 'n_adults', 'n_children', 'weekend_nights', 'week_nights', 'meal_plan', 'car_parking_space', 'room_type', 'lead_time', 'year', 'month', 'date', 'market_segment', 'repeated_guest', 'previous_cancellations', 'previous_bookings_not_canceled', 'avg_room_price', 'special_requests', 'status']
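Several of the columns listed above (`meal_plan`, `room_type`, `market_segment`, `status`) appear to hold a small set of repeated labels. One possible way to set up appropriate data structures for them is to cast them to pandas' `category` dtype. The sketch below assumes this is desirable and uses a few hypothetical sample rows; the `is_canceled` name is an illustrative assumption, not something defined in the notebook.

```python
import pandas as pd

# Hypothetical sample rows using label values seen in the preview above.
sample = pd.DataFrame({
    "meal_plan": ["Meal Plan 1", "Not Selected", "Meal Plan 1"],
    "room_type": ["Room_Type 1", "Room_Type 1", "Room_Type 1"],
    "market_segment": ["Offline", "Online", "Online"],
    "status": ["Not_Canceled", "Not_Canceled", "Canceled"],
})

# Columns assumed to contain a limited set of repeated labels.
label_columns = ["meal_plan", "room_type", "market_segment", "status"]

# Cast each label column to the category dtype in one call.
sample = sample.astype({col: "category" for col in label_columns})

# Hypothetical helper column: 1 if the booking was canceled, else 0.
is_canceled = (sample["status"] == "Canceled").astype(int)

print(sample.dtypes)
print(is_canceled.tolist())
```

The `category` dtype usually reduces memory use for repeated strings and makes the set of allowed labels explicit, though whether it is worth applying depends on the later analysis.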


In [9]:
# Displaying the data types of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   ID                              36275 non-null  object 
 1   n_adults                        36275 non-null  int64  
 2   n_children                      36275 non-null  int64  
 3   weekend_nights                  36275 non-null  int64  
 4   week_nights                     36275 non-null  int64  
 5   meal_plan                       36275 non-null  object 
 6   car_parking_space               36275 non-null  int64  
 7   room_type                       36275 non-null  object 
 8   lead_time                       36275 non-null  int64  
 9   year                            36275 non-null  int64  
 10  month                           36275 non-null  int64  
 11  date                            36275 non-null  int64  
 12  market_segment                  