### **Question 4**

You are provided with a dataset (see the enclosed CSV file `hotel_bookings.csv`) containing **hotel booking demand information**.  
The dataset includes various attributes such as:

- **Booking and Arrival Dates**: Details of reservation and arrival dates.  
- **Lead Time**: Number of days between the booking date and the arrival date.  
- **Number of Nights**: Duration of the stay.  
- **Number of Adults and Children**: Information about the guests.  
- **Meal Type**: Selected meal plan.  
- **Country**: Guest's country of origin.  
- **Market Segment**: Market segment associated with the booking.  
- **Distribution Channel**: Channel used for the booking.  
- **Other Relevant Booking Details**.  

> **Note**: Some values in the dataset are missing and will require cleaning.

---

### **Your Task**

Perform an **exploratory data analysis (EDA)** using Python to:  

1. **Uncover Trends and Patterns**: Analyze the data to identify key insights.  
2. **Generate Insights**: Summarize your findings based on the analysis.  
3. **Document Your Methodology**: Clearly explain your approach.  
4. **Include Visualizations**: Use graphs and charts to support your conclusions.

In [31]:
# Import the libraries needed for the analysis
import pandas as pd

In [32]:
# Read the dataset from the CSV file
file_name = 'hotel_bookings.csv'
hotel_bookings_df = pd.read_csv(file_name)

# Display the first rows
hotel_bookings_df.head()

,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,deposit_type,agent,company,days_in_waiting_list,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,No Deposit,,,0,Transient,0.0,0,0,Check-Out,2015-07-01
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,No Deposit,,,0,Transient,75.0,0,0,Check-Out,2015-07-02
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,No Deposit,304.0,,0,Transient,75.0,0,0,Check-Out,2015-07-02
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,No Deposit,240.0,,0,Transient,98.0,0,1,Check-Out,2015-07-03


In [33]:
#Shape of the dataset
print("Number of rows:", hotel_bookings_df.shape[0]) # Number of rows
print("Number of columns:", hotel_bookings_df.shape[1]) # Number of columns

Number of rows: 119390
Number of columns: 32


In [34]:
# Check the data type of each column
hotel_bookings_df.dtypes

hotel                              object
is_canceled                         int64
lead_time                           int64
arrival_date_year                   int64
arrival_date_month                 object
arrival_date_week_number            int64
arrival_date_day_of_month           int64
stays_in_weekend_nights             int64
stays_in_week_nights                int64
adults                              int64
children                          float64
babies                              int64
meal                               object
country                            object
market_segment                     object
distribution_channel               object
is_repeated_guest                   int64
previous_cancellations              int64
previous_bookings_not_canceled      int64
reserved_room_type                 object
assigned_room_type                 object
booking_changes                     int64
deposit_type                       object
agent                             
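
The `dtypes` output shows the arrival date split across three columns (`arrival_date_year`, `arrival_date_month`, `arrival_date_day_of_month`), with the month stored as text. A minimal sketch of combining them into one datetime column follows; this can make time-based grouping easier later on. The helper `add_arrival_date` and the new column names `arrival_date` and `booking_date` are hypothetical, and the sketch assumes that months are always full English names, as in the `July` values shown by `head()`.

```python
import pandas as pd

# Full English month names, matching values such as "July" in arrival_date_month
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTH_TO_NUMBER = {name: number for number, name in enumerate(MONTHS, start=1)}

def add_arrival_date(df):
    """Return a copy of df with combined arrival and derived booking dates."""
    out = df.copy()
    # pd.to_datetime can assemble dates from columns named year, month, day
    parts = pd.DataFrame({
        "year": out["arrival_date_year"],
        "month": out["arrival_date_month"].map(MONTH_TO_NUMBER),
        "day": out["arrival_date_day_of_month"],
    })
    out["arrival_date"] = pd.to_datetime(parts)
    # lead_time is the number of days between booking and arrival
    out["booking_date"] = out["arrival_date"] - pd.to_timedelta(out["lead_time"], unit="D")
    return out

# Small illustrative sample, not the real dataset
sample = pd.DataFrame({
    "arrival_date_year": [2015, 2015],
    "arrival_date_month": ["July", "July"],
    "arrival_date_day_of_month": [1, 1],
    "lead_time": [7, 13],
})
print(add_arrival_date(sample)[["arrival_date", "booking_date"]])
```

Using an explicit month lookup rather than a `%B` format string keeps the conversion independent of the system locale.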

In [35]:
# Count the unique values in each column and check the column names
hotel_bookings_df.nunique()

hotel                                2
is_canceled                          2
lead_time                          479
arrival_date_year                    3
arrival_date_month                  12
arrival_date_week_number            53
arrival_date_day_of_month           31
stays_in_weekend_nights             17
stays_in_week_nights                35
adults                              14
children                             5
babies                               5
meal                                 5
country                            177
market_segment                       8
distribution_channel                 5
is_repeated_guest                    2
previous_cancellations              15
previous_bookings_not_canceled      73
reserved_room_type                  10
assigned_room_type                  12
booking_changes                     21
deposit_type                         3
agent                              333
company                            352
days_in_waiting_list     
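
Several text columns above have only a handful of distinct values (for example `hotel` with 2 and `deposit_type` with 3). One optional follow-up, sketched here under assumptions, is to store such columns as pandas `category` dtype, which usually reduces memory use and makes the set of allowed values explicit. The helper name `to_category_if_low_cardinality` and the `max_unique` threshold are hypothetical choices, not part of the original notebook.

```python
import pandas as pd

def to_category_if_low_cardinality(df, max_unique=20):
    """Return a copy of df where low-cardinality text columns use category dtype."""
    out = df.copy()
    for column in out.columns:
        is_text = pd.api.types.is_string_dtype(out[column])
        if is_text and out[column].nunique() <= max_unique:
            out[column] = out[column].astype("category")
    return out

# Small illustrative sample, not the real dataset
sample = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "deposit_type": ["No Deposit", "No Deposit", "No Deposit"],
    "lead_time": [342, 737, 7],
})
converted = to_category_if_low_cardinality(sample, max_unique=2)
print(converted.dtypes)
```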

In [36]:
# Show the unique values in each column to better understand the dataset
for column in hotel_bookings_df.columns:
    print(f"Unique values in column '{column}':")
    print(hotel_bookings_df[column].unique())
    print("\n" + "-"*50 + "\n")

Unique values in column 'hotel':
['Resort Hotel' 'City Hotel']

--------------------------------------------------

Unique values in column 'is_canceled':
[0 1]

--------------------------------------------------

Unique values in column 'lead_time':
[342 737   7  13  14   0   9  85  75  23  35  68  18  37  12  72 127  78
  48  60  77  99 118  95  96  69  45  40  15  36  43  70  16 107  47 113
  90  50  93  76   3   1  10   5  17  51  71  63  62 101   2  81 368 364
 324  79  21 109 102   4  98  92  26  73 115  86  52  29  30  33  32   8
 100  44  80  97  64  39  34  27  82  94 110 111  84  66 104  28 258 112
  65  67  55  88  54 292  83 105 280 394  24 103 366 249  22  91  11 108
 106  31  87  41 304 117  59  53  58 116  42 321  38  56  49 317   6  57
  19  25 315 123  46  89  61 312 299 130  74 298 119  20 286 136 129 124
 327 131 460 140 114 139 122 137 126 120 128 135 150 143 151 132 125 157
 147 138 156 164 346 159 160 161 333 381 149 154 297 163 314 155 323 340
 356 142 328 144 33
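
Printing every unique value becomes hard to read for columns such as `lead_time`, which has 479 distinct values. A possible alternative, sketched below, is to show value counts only for low-cardinality columns and summary statistics for the rest. The helper `column_summaries` and its `max_unique` threshold are hypothetical names introduced for illustration.

```python
import pandas as pd

def column_summaries(df, max_unique=15):
    """Map each column name to value counts (few values) or describe() (many values)."""
    summaries = {}
    for column in df.columns:
        series = df[column]
        if series.nunique(dropna=False) <= max_unique:
            # Keep NaN as its own category so missing values stay visible
            summaries[column] = series.value_counts(dropna=False)
        else:
            summaries[column] = series.describe()
    return summaries

# Small illustrative sample, not the real dataset
sample = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel"],
    "lead_time": [342, 737, 7],
})
summaries = column_summaries(sample, max_unique=2)
for name, summary in summaries.items():
    print(f"Summary of column '{name}':")
    print(summary)
    print("-" * 50)
```

For `lead_time`, the `describe()` output makes extreme values such as 737 days stand out without scanning the full array.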

In [37]:
# Check for duplicate rows
# Function to handle duplicates
def handle_duplicates(df):
    """
    Identify and handle duplicate rows in the dataframe by removing them.
    
    Args:
    df (pd.DataFrame): The input dataframe with potential duplicate rows.
    
    Returns:
    pd.DataFrame: The dataframe with duplicates handled and index reset.
    """
    # Count the number of duplicate rows
    number_of_duplicates = df.duplicated().sum()
    print(f"Number of duplicated rows before cleaning: {number_of_duplicates}")

    # Remove duplicates and reset index
    df_cleaned = df.drop_duplicates(keep='first').reset_index(drop=True)
    
    # Count duplicates again after cleaning
    duplicates_after = df_cleaned.duplicated().sum()
    print(f"Number of duplicated rows after cleaning: {duplicates_after}")

    return df_cleaned

# Step 1: Handle duplicates
hotel_bookings = handle_duplicates(hotel_bookings_df)

Number of duplicated rows before cleaning: 31994
Number of duplicated rows after cleaning: 0
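
Before dropping 31994 rows, it may be worth inspecting what the repeated rows look like. No unique booking identifier appears among the columns shown above, so identical rows could in principle be separate bookings with the same attributes rather than data-entry duplicates. The sketch below, using a hypothetical helper `duplicate_groups`, lists each repeated row once together with how often it occurs; it relies on `duplicated(keep=False)`, which flags every copy rather than only the later ones.

```python
import pandas as pd

def duplicate_groups(df):
    """Return each distinct repeated row with the number of times it occurs."""
    # keep=False marks all copies of a repeated row, including the first
    repeated = df[df.duplicated(keep=False)]
    return (
        repeated.groupby(list(df.columns), dropna=False)
        .size()
        .reset_index(name="occurrences")
        .sort_values("occurrences", ascending=False)
    )

# Small illustrative sample, not the real dataset
sample = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel", "Resort Hotel"],
    "lead_time": [7, 7, 7, 13],
})
groups = duplicate_groups(sample)
print(groups)
```

If the most frequent groups look like plausible block bookings, keeping them (or flagging them instead of dropping) might be the safer choice; that decision depends on domain knowledge not available in the notebook.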


In [None]:
# Check for null values
# Function to handle null values
def handle_null_values(df):
    """
    Handle null values in the dataframe by filling them with placeholder values.
    
    Args:
    df (pd.DataFrame): The input dataframe with potential null values.
    
    Returns:
    pd.DataFrame: The dataframe with null values handled.
    """
    # Count the number of null values in each column
    print("Number of null values in each column before handling:")
    print(df.isna().sum())

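    # Drop rows where all columns are NaN; copy so the caller's dataframe is not modified
    df = df.dropna(how='all').copy()

    # Fill NaN values in selected columns
    df['children'] = df['children'].fillna(0)  # Assume no children when the value is missing
    df['country'] = df['country'].fillna('Unknown')
    df['agent'] = df['agent'].fillna(0)  # Numeric 0 as a "no agent" placeholder keeps the column numeric
    df['company'] = df['company'].fillna(0)  # Numeric 0 as a "no company" placeholder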

    # Check if there are any remaining null values
    remaining_nulls = df.isnull().sum()
    print("\nNumber of null values in each column after handling:")
    print(remaining_nulls[remaining_nulls > 0])

    return df

# Step 2: Handle null values (rows that are entirely NaN are also dropped)
hotel_bookings = handle_null_values(hotel_bookings)

Number of null values in each column before handling:
hotel                                 0
is_canceled                           0
lead_time                             0
arrival_date_year                     0
arrival_date_month                    0
arrival_date_week_number              0
arrival_date_day_of_month             0
stays_in_weekend_nights               0
stays_in_week_nights                  0
adults                                0
children                              4
babies                                0
meal                                  0
country                             452
market_segment                        0
distribution_channel                  0
is_repeated_guest                     0
previous_cancellations                0
previous_bookings_not_canceled        0
reserved_room_type                    0
assigned_room_type                    0
booking_changes                       0
deposit_type                          0
agent                     
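
Raw null counts are easier to judge when shown next to the share of rows they represent. The following is a minimal sketch of such a report; `missing_value_report` is a hypothetical helper, and the sample values (including the placeholder country codes) are made up for illustration.

```python
import pandas as pd

def missing_value_report(df):
    """Return missing counts and percentages for columns that contain nulls."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one null, most affected first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Small illustrative sample, not the real dataset
sample = pd.DataFrame({
    "adults": [2, 2, 1, 1],
    "children": [0.0, None, 1.0, 2.0],
    "country": ["AAA", None, None, "BBB"],
})
report = missing_value_report(sample)
print(report)
```

Columns with a very small share of nulls (such as `children`, with 4 missing values) are reasonable candidates for simple filling, while columns that are mostly null may be better treated as "not applicable" flags than as data to impute.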