<h1>Chapter 2 | Case Study A2 | <b>Finding a Good Deal among Hotels: Data Preparation</b></h1>
<p>In this notebook, I'll be dealing mostly with <b>duplicates</b> and <b>missing values</b> in the <code>hotels-vienna</code> dataset. The first goal is to check whether the number of distinct hotel IDs, which replace the hotel names for confidentiality reasons, matches the number of observations. Since the dataset is meant to be in tidy format, there should be exactly one observation per hotel. Then, we will check what share of the values is missing and which variables have the highest proportion of missing values.</p>
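<p>As a rough sketch of the two checks described above, the snippet below uses a small made-up table rather than the real data. The column values and the names <code>toy</code>, <code>n_obs</code>, <code>n_ids</code>, and <code>missing_share</code> are assumptions for illustration only.</p>

```python
import pandas as pd

# Hypothetical toy data, NOT the real hotels-vienna file
toy = pd.DataFrame(
    {
        "hotel_id": [1, 2, 2, 3],
        "price": [81.0, 85.0, 85.0, None],
        "rating": [4.4, None, None, 3.9],
    }
)

# Check 1: does the number of distinct IDs match the number of observations?
n_obs = len(toy)
n_ids = toy["hotel_id"].nunique()
has_duplicate_ids = n_ids != n_obs

# Check 2: share of missing values per variable, highest first
missing_share = toy.isna().mean().sort_values(ascending=False)

print(n_obs, n_ids, has_duplicate_ids)
print(missing_share)
```

<p>If the two counts differ, at least one ID appears more than once, which is the signal to look for duplicated rows.</p>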


In [6]:
import os
import sys
import warnings

import pandas as pd

warnings.filterwarnings("ignore")

In [24]:
# Current script folder
current_path = os.getcwd()
dirname = current_path.split("da_case_studies")[0]

#  Get location folders
data_in = f"{dirname}da_data_repo/hotels-vienna"
data_out = f"{dirname}da_case_studies/ch02-hotels_data_prep/"
output = f"{dirname}da_case_studies/ch02-hotels_data_prep/output/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)
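
<p>The string split above works as long as <code>da_case_studies</code> appears somewhere in the working directory. A slightly more defensive variant could walk up the folder tree with <code>pathlib</code> and fail loudly when the folder is not found. This is only a sketch: <code>find_base_dir</code> is a hypothetical helper that is not part of the original notebook, and the example path is made up.</p>

```python
from pathlib import Path


def find_base_dir(start, marker="da_case_studies"):
    """Return the folder that contains the first ancestor named `marker`.

    Hypothetical helper for illustration only.
    """
    start = Path(start)
    # Look at the starting folder itself, then at each of its parents
    for candidate in (start, *start.parents):
        if candidate.name == marker:
            return candidate.parent
    raise FileNotFoundError(f"No folder named {marker!r} above {start}")


# Made-up example path; in the notebook one would pass os.getcwd() instead
base = find_base_dir("/home/user/da_case_studies/ch02-hotels_data_prep")
data_in_sketch = base / "da_data_repo" / "hotels-vienna"
print(data_in_sketch.as_posix())
```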

In [25]:
# Import the prewritten helper functions
from py_helper_functions import *

<p>In this exercise, we'll be using both the clean and the raw files separately. Let's get hold of both of them.</p>

In [18]:
data_in_clean = f"{data_in}/clean/"
data_in_raw = f"{data_in}/raw/"

<p>We can now read the data and start exploring it. Let's start with the clean data.</p>

In [20]:
data = pd.read_csv(f"{data_in_clean}hotels-vienna.csv")
data = data[
    [
        "hotel_id",
        "accommodation_type",
        "distance",
        "stars",
        "rating",
        "rating_count",
        "price",
    ]
]

<p>Let's first take a look at the accommodation types to differentiate between hotels, apartments, hostels, and so on.</p>

In [21]:
data['accommodation_type'].value_counts()

Hotel                  264
Apartment              124
Pension                 16
Guest House              8
Hostel                   6
Bed and breakfast        4
Apart-hotel              4
Vacation home Condo      2
Name: accommodation_type, dtype: int64
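
<p>Counts are easier to compare as shares. The sketch below rebuilds the counts printed above by hand, so it runs without the data file, and converts them to proportions. The names <code>counts</code> and <code>shares</code> are introduced here for illustration.</p>

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series(
    {
        "Hotel": 264,
        "Apartment": 124,
        "Pension": 16,
        "Guest House": 8,
        "Hostel": 6,
        "Bed and breakfast": 4,
        "Apart-hotel": 4,
        "Vacation home Condo": 2,
    },
    name="accommodation_type",
)

# Share of each accommodation type among all observations
shares = (counts / counts.sum()).round(3)

print(counts.sum())
print(shares)

# On the real DataFrame, the equivalent call would be:
# data["accommodation_type"].value_counts(normalize=True)
```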

<p>Now, by calling the <code>head()</code> method, we can reproduce Table 1.1 from the previous chapter, where this dataset was first presented.</p>

<p><b>Table 1.1</b> List of observations</p>
<hr>

In [26]:
data.head()

   hotel_id accommodation_type  distance  stars  rating  rating_count  price
0     21894          Apartment       2.7    4.0     4.4          36.0     81
1     21897              Hotel       1.7    4.0     3.9         189.0     81
2     21901              Hotel       1.4    4.0     3.7          53.0     85
3     21902              Hotel       1.7    3.0     4.0          55.0     83
4     21903              Hotel       1.2    4.0     3.9          33.0     82


<p>Let's inspect the <code>dtypes</code> attribute to reproduce Table 2.2 and see the type of each variable in our dataset.</p>
<p><b>Table 2.2</b> List of the variables in the <code>hotels-vienna</code> dataset</p>
<hr>

In [27]:
data.dtypes

hotel_id                int64
accommodation_type     object
distance              float64
stars                 float64
rating                float64
rating_count          float64
price                   int64
dtype: object
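
<p>A count such as <code>rating_count</code> showing up as <code>float64</code> is often a hint that the column contains missing values, because pandas stores a default integer column with gaps as floats. Whether that is the reason in the real file is an assumption here. The sketch below shows the effect on a made-up three-row CSV, along with the nullable <code>Int64</code> dtype as one possible alternative.</p>

```python
import io

import pandas as pd

# Hypothetical CSV with one blank rating_count, NOT the real data
csv_text = "hotel_id,rating_count\n1,36\n2,\n3,53\n"
toy = pd.read_csv(io.StringIO(csv_text))

# The blank cell becomes NaN, so the whole column is read as float64
print(toy.dtypes)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
toy_nullable = toy.astype({"rating_count": "Int64"})
print(toy_nullable.dtypes)
```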

<p>The book goes further and defines the specific type of each variable, but this will suffice for now. Let's look at one example observation: the row at position 1, which is the second row because positions start at 0.</p>

In [28]:
data.iloc[1]

hotel_id              21897
accommodation_type    Hotel
distance                1.7
stars                   4.0
rating                  3.9
rating_count          189.0
price                    81
Name: 1, dtype: object
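
<p>Since <code>iloc</code> selects by zero-based position, it is easy to be off by one. The sketch below rebuilds the first three rows shown in the <code>head()</code> output above by hand and contrasts position 0 with position 1. The name <code>toy</code> is introduced here for illustration.</p>

```python
import pandas as pd

# First three rows copied from the head() output above
toy = pd.DataFrame(
    {
        "hotel_id": [21894, 21897, 21901],
        "accommodation_type": ["Apartment", "Hotel", "Hotel"],
        "price": [81, 81, 85],
    }
)

first = toy.iloc[0]   # first observation (position 0)
second = toy.iloc[1]  # second observation (position 1)

print(first["hotel_id"], second["hotel_id"])
```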