### Taking a First Look

#### Variables (36 in total):
##### NOTE: We could not find detailed information on the exact meaning of each column variable. Additionally, some inputs are provided as abbreviations or short forms (e.g., 'AUT' is a value in the 'country' variable). Therefore, we will infer the meanings of these variables based on their names and consider their possible interpretations within the context of hotel reservations.
1. hotel (type object)
    - "Resort Hotel", "City Hotel"
2. is_canceled (float64)
    - 0 
    - 1
3. lead_time (float64)
    - min: 0
    - max: 737
    - Lead time usually means the number of days between the date of booking and the date of arrival.
4. arrival_date_year (float64)
    - min: 2015
    - max: 2017
5. arrival_date_month (type)
    - January, February, ..., December
6. arrival_date_week_number (float64)
    - min: 1
    - max: 53
    - We believe that this variable most likely represents the week number in the year (e.g., 6 means the 6th week of the year).
7. arrival_date_day_of_month (float64)
    - min: 1
    - max: 31
8. stays_in_weekend_nights (float64)
    - min: 0 
    - max: 19
9. stays_in_week_nights (float64)
    - min: 0
    - max: 50
10. adults (float64)
    - min: 0
    - max: 55
11. children (float64)
    - min: 0
    - max: 10
12. babies (float64)
    - min: 0
    - max: 10
13. meal (type)
    - 'BB': We believe this stands for "Bed and Breakfast", indicating that breakfast is included but no other meals are provided.
    - 'FB': "Full Board" includes breakfast, lunch, and dinner.
    - 'HB': "Half Board" includes breakfast and one additional meal, typically dinner, but not lunch.
    - 'SC': "Self-Catering" means that guests are responsible for arranging their own meals.
    - 'Undefined': This likely indicates that the meal plan is either not specified or does not fit into any of the above categories. This could be due to missing data, an unusual meal plan, or the hotel not offering any predefined meal plans.
        - Since there are only 1,169 'Undefined' values out of 119,390 total values, we will be omitting these rows from our analysis.
14. country (type)
    - We believe that the values in this column are three-letter ISO country codes (e.g., AUT = Austria).
15. market_segment (type)
    - 'Direct': Reservations made directly with the hotel via phone, email, or fax.
    - 'Corporate': Reservations made by guests who receive discounted rates through their company, often as part of a contract.
    - 'Online TA': Reservations made through online third-party platforms, also known as online travel agencies (e.g., Booking.com, Expedia).
    - 'Offline TA/TO': Reservations made through offline travel agents or tour operators.
    - 'Complementary': Guests staying at the hotel free of charge.
    - 'Groups': Reservations made on behalf of a group, often for events, tours, or corporate gatherings.
    - 'Undefined': The meaning of this value is unclear. Since there are only two 'Undefined' market segment values in the entire dataset, we will be removing these rows for cleaner data.
    - 'Aviation': Possibly reservations made using aviation-related credits or for airline crew members.
16. distribution_channel (type)
    - 'Direct': Reservations made directly with the hotel, bypassing third-party intermediaries.
    - 'Corporate': Bookings made by companies on behalf of their employees or guests, usually under negotiated rates or corporate contracts.
    - 'TA/TO': Reservations made through traditional travel agents or tour operators.
    - 'Undefined': This value indicates that the distribution channel is not specified or could not be categorized into one of the known channels. It might represent missing or improperly categorized data. Given its ambiguity, we will be excluding these values.
    - 'GDS': Global Distribution Systems are computer reservation systems that provide access to hotel inventory for travel agencies and suppliers.
17. is_repeated_guest (float64)
    - 0
    - 1
18. previous_cancellations (float64)
    - min: 0
    - max: 26
19. previous_bookings_not_canceled (float64)
    - min: 0
    - max: 72
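
The reading of `arrival_date_week_number` as the week of the year can be tested rather than assumed. The following is a minimal sketch: it rebuilds the arrival date from the year, month, and day columns and compares the recorded week with the ISO 8601 week number. The helper name `iso_week_from_parts` and the two-row toy frame are illustrative assumptions, not part of the dataset or of our notebook, and the dataset may use a different week-numbering convention than ISO, in which case some mismatches are expected.

```python
import pandas as pd


def iso_week_from_parts(df):
    """Rebuild the arrival date from its parts and return its ISO week number."""
    # The numeric columns are stored as float64, so cast to int before building text.
    date_text = (
        df["arrival_date_year"].astype(int).astype(str)
        + "-" + df["arrival_date_month"]
        + "-" + df["arrival_date_day_of_month"].astype(int).astype(str)
    )
    # %B parses full English month names such as "July".
    dates = pd.to_datetime(date_text, format="%Y-%B-%d")
    return dates.dt.isocalendar().week.astype(int)


# Hypothetical toy rows standing in for the real data.
toy = pd.DataFrame({
    "arrival_date_year": [2015.0, 2016.0],
    "arrival_date_month": ["July", "February"],
    "arrival_date_day_of_month": [1.0, 10.0],
    "arrival_date_week_number": [27.0, 6.0],
})

computed = iso_week_from_parts(toy)
mismatches = int((computed != toy["arrival_date_week_number"].astype(int)).sum())
print(f"{mismatches} of {len(toy)} rows disagree with the ISO week")
```

Running the same function on the full DataFrame would show how often the recorded week agrees with the ISO week, which either supports or weakens our interpretation.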

We will be excluding the "reserved_room_type" and "assigned_room_type" columns from our analysis since we are not sure what each letter input represents.

20. reserved_room_type (type)
    - 'C', 'A', 'D', 'E', 'G', 'F', 'H', 'L', 'P', 'B'
21. assigned_room_type (type)
    - 'C', 'A', 'D', 'E', 'G', 'F', 'I', 'B', 'H', 'P', 'L', 'K'
22. booking_changes (float64)
    - min: 0
    - max: 21
23. deposit_type (type)
    - 'No Deposit', 'Refundable', 'Non Refund'

We will be excluding the "agent" and "company" columns from our analysis since we are not able to interpret their input values.

24. agent (float64)
    - The agent column contains values ranging from 1 to 535, which likely correspond to unique agents or agencies. However, without additional information to identify these agents, this data is not useful for our analysis.
25. company (float64)
    - The company column contains values ranging from 6 to 543, which likely correspond to unique companies. However, without additional information to identify these companies, this data is not useful for our analysis.
26. days_in_waiting_list (float64)
    - min: 0
    - max: 391
27. customer_type (type)
    - 'Transient': A guest who books a hotel room for a short stay. These are individual bookings not associated with a group or party, and they are the most common type of reservation, often made for business trips or short vacations.
    - 'Contract': Guests with reservations made through a contractual agreement with the hotel. This type of booking typically includes rooms reserved for airline crew or corporate clients under special agreements.
    - 'Transient-Party': Similar to transient bookings but involves multiple individuals booked under a single reservation, such as friends or family traveling together.
    - 'Group': Reservations involving multiple rooms booked together, typically for events, conferences, or tours.
28. adr (float64)
    - min: -6.38
    - max: 5400
    - ADR stands for Average Daily Rate, representing the average revenue earned per occupied room per day. In this dataset, we observe instances of zero and negative ADR values. A negative ADR would suggest that the hotel is paying guests to stay, which is not a typical business practice. While zero ADR could occur in cases of complimentary stays, such scenarios are rare. It is also possible that these zero and negative ADR values are data entry errors or indicate missing information. To ensure a more accurate analysis of the hotel's performance with paying guests, we will exclude rows with zero or negative ADR from our project.
29. required_car_parking_spaces (float64)
    - min: 0
    - max: 8
30. total_of_special_requests (float64)
    - min: 0
    - max: 5
    - Special requests are extra requests made by customers (e.g., high floor, ocean view).
31. reservation_status (type)
    - 'Check-Out', 'Canceled', or 'No-Show'
32. reservation_status_date (type)
    - Entries are dates in the format XXXX-XX-XX.
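
Before dropping rows with zero or negative ADR, it helps to know how many rows fall into each group. The sketch below counts them; the function name `adr_sign_counts` and the sample values are hypothetical and only illustrate the check.

```python
import pandas as pd


def adr_sign_counts(adr):
    """Count negative, zero, and positive ADR values (missing values are not counted)."""
    return {
        "negative": int((adr < 0).sum()),
        "zero": int((adr == 0).sum()),
        "positive": int((adr > 0).sum()),
    }


# Hypothetical sample; on the real data this would be df["adr"].
sample_adr = pd.Series([-6.38, 0.0, 75.0, 120.5, 0.0])
counts = adr_sign_counts(sample_adr)
print(counts)
```

If the zero and negative groups turn out to be a small share of the rows, excluding them should have little effect on the remaining analysis.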

We will not be using the variables below in our project as we are not interested in the personal details regarding each customer.

33. name
34. email
35. phone-number
36. credit_card

In [2]:
import pandas as pd

df = pd.read_csv('./data/hotel_booking.csv')

In [59]:
column_name = 'total_of_special_requests'

if pd.api.types.is_numeric_dtype(df[column_name]):
    five_number_summary = df[column_name].describe()[['min', '25%', '50%', '75%', 'max']]
    print(f"5-Number Summary for '{column_name}':\n{five_number_summary}\n")
else:
    column_dtype = df[column_name].dtype
    unique_values = df[column_name].unique()
    print(f"'{column_name}' is of type {column_dtype}. Unique values:\n{unique_values}\n")
    value_counts = df[column_name].value_counts()
    print(f"Count of unique values in '{column_name}':\n{value_counts}\n")


5-Number Summary for 'total_of_special_requests':
min    0.0
25%    0.0
50%    0.0
75%    1.0
max    5.0
Name: total_of_special_requests, dtype: float64
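
The cell above inspects one column at a time. A possible way to produce the same summary for every column in one pass is to wrap the logic in a function and loop over the columns. This is a sketch only: `summarize_column` is a name introduced here, and the toy frame stands in for the real `df`.

```python
import pandas as pd


def summarize_column(df, column_name):
    """Return a five-number summary for numeric columns, value counts otherwise."""
    col = df[column_name]
    if pd.api.types.is_numeric_dtype(col):
        summary = col.describe()[["min", "25%", "50%", "75%", "max"]]
        return f"5-Number Summary for '{column_name}':\n{summary}\n"
    counts = col.value_counts()
    return f"'{column_name}' is of type {col.dtype}. Value counts:\n{counts}\n"


# Hypothetical toy frame; replace with the DataFrame loaded from the CSV.
toy = pd.DataFrame({
    "total_of_special_requests": [0.0, 0.0, 1.0, 5.0],
    "meal": ["BB", "BB", "HB", "SC"],
})

reports = {name: summarize_column(toy, name) for name in toy.columns}
for text in reports.values():
    print(text)
```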



Wrangling TO-DO:
- Add a new column named "total_nights_stayed" whose values are the sum of "stays_in_weekend_nights" and "stays_in_week_nights"
- Remove rows with 'Undefined' values in the 'meal' column
- Remove rows with 'Undefined' values in the 'market_segment' and 'distribution_channel' columns
- Replace the ISO country codes in the 'country' column with country names
- Remove the "reserved_room_type" and "assigned_room_type" columns
- Remove the "agent" and "company" columns
- Remove rows with zero or negative ADR values
- Convert the values in the "reservation_status_date" column to a numeric representation of dates
- Remove the 'name', 'email', 'phone-number', and 'credit_card' columns
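
The to-do list above could be sketched as a single function. This is one possible implementation under stated assumptions, not the final wrangling code. `ISO_TO_NAME` is a hypothetical, deliberately partial mapping (a full table or a lookup library would be needed in practice). The date column is parsed to pandas datetimes as a first step toward a numeric representation. Rows with 'Undefined' in `market_segment` or `distribution_channel` are also dropped, following the variable notes earlier in this section.

```python
import pandas as pd

# Hypothetical, partial mapping for illustration; unmapped codes are left as-is.
ISO_TO_NAME = {"AUT": "Austria", "PRT": "Portugal", "GBR": "United Kingdom"}

DROP_COLUMNS = [
    "reserved_room_type", "assigned_room_type", "agent", "company",
    "name", "email", "phone-number", "credit_card",
]


def wrangle(df):
    """Apply the wrangling steps to a copy of df and return the cleaned frame."""
    out = df.copy()
    out["total_nights_stayed"] = (
        out["stays_in_weekend_nights"] + out["stays_in_week_nights"]
    )
    # Drop rows whose category is 'Undefined' in any of these columns, if present.
    for col in ["meal", "market_segment", "distribution_channel"]:
        if col in out.columns:
            out = out[out[col] != "Undefined"]
    # Keep only strictly positive ADR; this also drops rows where ADR is missing.
    out = out[out["adr"] > 0].copy()
    out["country"] = out["country"].map(ISO_TO_NAME).fillna(out["country"])
    out["reservation_status_date"] = pd.to_datetime(out["reservation_status_date"])
    # errors="ignore" skips any listed column that is absent from the frame.
    return out.drop(columns=DROP_COLUMNS, errors="ignore").reset_index(drop=True)


# Hypothetical toy rows covering each rule.
toy = pd.DataFrame({
    "stays_in_weekend_nights": [1.0, 2.0, 0.0, 0.0],
    "stays_in_week_nights": [2.0, 1.0, 1.0, 3.0],
    "meal": ["BB", "Undefined", "HB", "SC"],
    "adr": [75.0, 80.0, 0.0, 120.0],
    "country": ["AUT", "PRT", "GBR", "XYZ"],
    "reservation_status_date": ["2015-07-03", "2015-07-04", "2015-07-05", "2015-07-06"],
    "agent": [9.0, None, 240.0, 9.0],
    "name": ["A", "B", "C", "D"],
})

clean = wrangle(toy)
print(clean)
```

On the toy frame, the second row is removed for its 'Undefined' meal and the third for its zero ADR, so two rows remain.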