# Data Exploration - Hotel Bookings
Group #6: Allyson Vasquez, Alex Miller, Vena Khamvanthong, Mandev Doshi

This notebook explores the hotel bookings dataset so that we understand it well enough to create meaningful visualizations.

In [29]:
import pandas as pd
import numpy as np
import altair as alt
import streamlit as st
from pandas_profiling import ProfileReport

Let's do some exploratory data analysis on our hotel_booking.csv file. This will help us identify patterns, relationships between attributes, and any cleaning that needs to be done.

In [30]:
df = pd.read_csv('hotel_booking.csv')

#Looking at the first 10 rows of our dataset
df.head(10)

Unnamed: 0,hotel,is_canceled,lead_time,arrival_date_year,arrival_date_month,arrival_date_week_number,arrival_date_day_of_month,stays_in_weekend_nights,stays_in_week_nights,adults,...,customer_type,adr,required_car_parking_spaces,total_of_special_requests,reservation_status,reservation_status_date,name,email,phone-number,credit_card
0,Resort Hotel,0,342,2015,July,27,1,0,0,2,...,Transient,0.0,0,0,Check-Out,2015-07-01,Ernest Barnes,Ernest.Barnes31@outlook.com,669-792-1661,************4322
1,Resort Hotel,0,737,2015,July,27,1,0,0,2,...,Transient,0.0,0,0,Check-Out,2015-07-01,Andrea Baker,Andrea_Baker94@aol.com,858-637-6955,************9157
2,Resort Hotel,0,7,2015,July,27,1,0,1,1,...,Transient,75.0,0,0,Check-Out,2015-07-02,Rebecca Parker,Rebecca_Parker@comcast.net,652-885-2745,************3734
3,Resort Hotel,0,13,2015,July,27,1,0,1,1,...,Transient,75.0,0,0,Check-Out,2015-07-02,Laura Murray,Laura_M@gmail.com,364-656-8427,************5677
4,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,Transient,98.0,0,1,Check-Out,2015-07-03,Linda Hines,LHines@verizon.com,713-226-5883,************5498
5,Resort Hotel,0,14,2015,July,27,1,0,2,2,...,Transient,98.0,0,1,Check-Out,2015-07-03,Jasmine Fletcher,JFletcher43@xfinity.com,190-271-6743,************9263
6,Resort Hotel,0,0,2015,July,27,1,0,2,2,...,Transient,107.0,0,0,Check-Out,2015-07-03,Dylan Rangel,Rangel.Dylan@comcast.net,420-332-5209,************6994
7,Resort Hotel,0,9,2015,July,27,1,0,2,2,...,Transient,103.0,0,1,Check-Out,2015-07-03,William Velez,Velez_William@mail.com,286-669-4333,************8729
8,Resort Hotel,1,85,2015,July,27,1,0,3,2,...,Transient,82.0,0,1,Canceled,2015-05-06,Steven Murphy,Steven.Murphy54@aol.com,341-726-5787,************3639
9,Resort Hotel,1,75,2015,July,27,1,0,3,2,...,Transient,105.5,0,0,Canceled,2015-04-22,Michael Moore,MichaelMoore81@outlook.com,316-648-6176,************9190


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 36 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal            

We can observe that:
- There are 36 columns/attributes.
- There are 119,390 rows/entries.
- Our attributes have dtype object (text), int64, or float64.
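
The same facts can also be read programmatically instead of from the `df.info()` printout. This is a minimal sketch on a small stand-in DataFrame (`sample` is hypothetical and only mimics three of the real columns); with the real data you would use `df` in its place.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; swap in `df` from the cell above.
sample = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "Resort Hotel"],
    "lead_time": [342, 7, 13],
    "children": [0.0, float("nan"), 1.0],
})

# shape is a (rows, columns) tuple
n_rows, n_cols = sample.shape

# Number of columns stored under each dtype
dtype_counts = sample.dtypes.astype(str).value_counts()

print(f"{n_rows} rows x {n_cols} columns")
print(dtype_counts)
```

Counting columns per dtype is a quick way to see how many attributes are numeric before deciding how to chart them.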

Let's see if there is any missing data below.

In [32]:
#Checking for missing data
df.isnull().sum()

hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company         

Our dataset does contain missing information, specifically in:
- children (4 rows)
- country (488 rows)
- agent (16,340 rows)
- company
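
Raw counts are easier to judge as a share of all rows: in our data, agent is missing in 16,340 of 119,390 rows (about 13.7%), while country is missing in 488 (about 0.4%). The sketch below shows one way to compute that share per column. It runs on a small stand-in DataFrame (`sample` is hypothetical); with the real data you would use `df` in its place.

```python
import pandas as pd

# Hypothetical stand-in; with the real data use `df` instead.
sample = pd.DataFrame({
    "country": ["PRT", None, "GBR", "PRT"],
    "agent": [9.0, float("nan"), float("nan"), float("nan")],
    "adults": [2, 2, 1, 2],
})

missing = sample.isnull().sum()              # missing values per column
missing_pct = sample.isnull().mean() * 100   # share of rows missing, in percent

summary = (
    pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
    .query("missing > 0")                    # keep only columns with gaps
    .sort_values("missing_pct", ascending=False)
)
print(summary)
```

A column that is missing in most rows is usually a candidate for dropping, while a column with only a handful of gaps can be filled or have those rows removed.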

In [17]:
#Looking at the unique values for each column/attribute
#This output also gives us insight into which columns are quantitative and which are categorical

'''#NOTE: Missing data is nan in the dataset. needs to be cleaned/addressed with
for col in df.columns:
    print('{} : {}'.format(col,df[col].unique()))'''


"#NOTE: Missing data is nan in the dataset. needs to be cleaned/addressed with\nfor col in df.columns:\n    print('{} : {}'.format(col,df[col].unique()))"

Missing values appear as 'nan' in the dataset. We will clean these once our data exploration is done.
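
Printing every unique value gets long for a dataset this size. A more compact alternative, sketched below on a hypothetical stand-in DataFrame (`sample`), counts the distinct values per column and splits columns into numeric and non-numeric groups. This is one possible approach, not the one used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical stand-in; with the real data use `df` instead.
sample = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "Resort Hotel"],
    "country": ["PRT", None, "GBR"],
    "lead_time": [342, 7, 13],
})

# dropna=False counts NaN as its own value, so columns with gaps stand out
unique_counts = sample.nunique(dropna=False)

# Numeric columns are candidates for quantitative encodings,
# the remaining columns for categorical ones
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(unique_counts)
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```

Note that a numeric dtype does not always mean a quantitative attribute: a 0/1 flag such as is_canceled is stored as int64 but behaves like a category.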

Let's also create a Profile Report below to see if we can make any other observations.

In [18]:
profile = ProfileReport(df, title="Hotel Bookings Profile Report", minimal=True)
profile.to_file("hotel_booking_report.html")


## Data Cleaning

In [19]:
# adr = average daily rate (the sum of all lodging transactions / number of nights stayed)
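
To make the definition concrete, here is a small worked example. `average_daily_rate` is a hypothetical helper written for illustration only; the adr column in our dataset is already computed, and the handling of zero-night stays below is an assumption, not something the dataset documents.

```python
def average_daily_rate(lodging_transactions, nights_stayed):
    """Sum of lodging transactions divided by nights stayed (hypothetical helper)."""
    if nights_stayed <= 0:
        # Assumption: a booking with no nights has no daily rate, so return 0.0
        return 0.0
    return sum(lodging_transactions) / nights_stayed


# Three nights billed at 100, 110 and 90: (100 + 110 + 90) / 3
rate = average_daily_rate([100.0, 110.0, 90.0], 3)
print(rate)
```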

In [33]:
# Dropping unnecessary columns
df = df.drop(columns=['agent', 'company', 'name', 'email', 'phone-number', 'credit_card',
                      'arrival_date_month', 'reservation_status_date', 'reservation_status'])

In [36]:
df.isnull().sum()

hotel                               0
is_canceled                         0
lead_time                           0
arrival_date_year                   0
arrival_date_week_number            0
arrival_date_day_of_month           0
stays_in_weekend_nights             0
stays_in_week_nights                0
adults                              0
children                            4
babies                              0
meal                                0
country                           488
market_segment                      0
distribution_channel                0
is_repeated_guest                   0
previous_cancellations              0
previous_bookings_not_canceled      0
reserved_room_type                  0
assigned_room_type                  0
booking_changes                     0
deposit_type                        0
days_in_waiting_list                0
customer_type                       0
adr                                 0
required_car_parking_spaces         0
total_of_spe

In [37]:
# Remove the rows where children is NaN.
# .copy() returns an independent DataFrame, so the column assignment in the
# next cell does not raise SettingWithCopyWarning.
df = df.dropna(axis=0, subset=['children']).copy()

In [39]:
# Label missing countries explicitly instead of leaving them as NaN
df['country'] = df['country'].fillna('Unknown')
df.isnull().sum()


hotel                             0
is_canceled                       0
lead_time                         0
arrival_date_year                 0
arrival_date_week_number          0
arrival_date_day_of_month         0
stays_in_weekend_nights           0
stays_in_week_nights              0
adults                            0
children                          0
babies                            0
meal                              0
country                           0
market_segment                    0
distribution_channel              0
is_repeated_guest                 0
previous_cancellations            0
previous_bookings_not_canceled    0
reserved_room_type                0
assigned_room_type                0
booking_changes                   0
deposit_type                      0
days_in_waiting_list              0
customer_type                     0
adr                               0
required_car_parking_spaces       0
total_of_special_requests         0
dtype: int64
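
The two cleaning steps above (dropping rows with missing children, then labelling missing countries) can be bundled into one function so they are easy to rerun and check. This is a minimal sketch: `clean_bookings` and `sample` are hypothetical names, and the toy data only mimics the two affected columns.

```python
import pandas as pd


def clean_bookings(frame):
    """Drop rows with missing children and label missing countries (hypothetical helper)."""
    # .copy() makes the result independent of `frame`, so the assignment below is safe
    cleaned = frame.dropna(subset=["children"]).copy()
    cleaned["country"] = cleaned["country"].fillna("Unknown")
    return cleaned


# Hypothetical stand-in; with the real data pass `df` instead.
sample = pd.DataFrame({
    "children": [0.0, float("nan"), 2.0],
    "country": ["PRT", "GBR", None],
})

cleaned = clean_bookings(sample)

# Sanity check: no missing values should remain in the cleaned columns
assert cleaned.isnull().sum().sum() == 0
print(cleaned)
```

Keeping the final assertion in the notebook means a later change to the cleaning steps fails loudly instead of silently reintroducing missing values.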