<h1>Chapter 2 | Case Study A2 | <b>Finding a Good Deal among Hotels: Data Preparation</b></h1>
<p>In this notebook, I'll deal mostly with <b>duplicates</b> and <b>missing values</b> in the <code>hotels-vienna</code> dataset. The first goal is to check whether the number of distinct hotel IDs, which replace the hotel names for confidentiality reasons, matches the number of observations. Ideally, since the dataset is in tidy format, there should be exactly one observation per hotel. Then, we will check what share of the values is missing and which variables have the highest proportion of missing values.</p>
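<p>As a rough sketch of the two checks described above, the cell below uses a small made-up table rather than the real file. The column names mirror the dataset, but the values and the variable names (<code>toy</code>, <code>n_dupes</code>, <code>missing_share</code>) are assumptions for illustration only.</p>

```python
import pandas as pd

# Toy data standing in for hotels-vienna; the values are illustrative only
toy = pd.DataFrame(
    {
        "hotel_id": [1, 2, 2, 3],
        "price": [81, 85, 85, 83],
        "rating": [4.4, None, None, 4.0],
    }
)

# Duplicates: compare the number of rows with the number of distinct IDs
n_obs = len(toy)
n_ids = toy["hotel_id"].nunique()
n_dupes = toy.duplicated(subset="hotel_id").sum()

# Missing values: share of missing entries per variable, highest first
missing_share = toy.isna().mean().sort_values(ascending=False)

print(n_obs, n_ids, n_dupes)
print(missing_share)
```

<p>If <code>n_obs</code> and <code>n_ids</code> differ, at least one hotel appears more than once, and <code>duplicated()</code> marks the repeated rows.</p>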
<h2><b>PART A</b> | Read the data</h2>


In [1]:
import os
import sys
import warnings

import pandas as pd

warnings.filterwarnings("ignore")

In [2]:
# Current script folder
current_path = os.getcwd()
dirname = current_path.split("da_case_studies")[0]

# Get location folders
data_in = f"{dirname}da_data_repo/hotels-vienna"
data_out = f"{dirname}da_case_studies/ch02-hotels_data_prep/"
output = f"{dirname}da_case_studies/ch02-hotels_data_prep/output/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)

In [3]:
# Import the prewritten helper functions
from py_helper_functions import *

<p>In this exercise, we'll be using both the clean and the raw files separately. Let's get hold of both of them.</p>

In [4]:
data_in_clean = f"{data_in}/clean/"
data_in_raw = f"{data_in}/raw/"

<p>We can now read the data and start exploring it. Let's start with the clean data.</p>

In [5]:
data = pd.read_csv(f"{data_in_clean}hotels-vienna.csv")
data = data[
    [
        "hotel_id",
        "accommodation_type",
        "distance",
        "stars",
        "rating",
        "rating_count",
        "price",
    ]
]

<p>Let's first take a look at the accommodation types to differentiate between hotels, apartments, hostels, and so on.</p>

In [6]:
data['accommodation_type'].value_counts()

Hotel                  264
Apartment              124
Pension                 16
Guest House              8
Hostel                   6
Bed and breakfast        4
Apart-hotel              4
Vacation home Condo      2
Name: accommodation_type, dtype: int64

<p>Now, by calling the <code>head()</code> method, we can reproduce Table 1.1 from the previous chapter, where this dataset was first presented.</p>

<p><b>Table 1.1</b> List of observations</p>
<hr>

In [7]:
data.head()

Unnamed: 0,hotel_id,accommodation_type,distance,stars,rating,rating_count,price
0,21894,Apartment,2.7,4.0,4.4,36.0,81
1,21897,Hotel,1.7,4.0,3.9,189.0,81
2,21901,Hotel,1.4,4.0,3.7,53.0,85
3,21902,Hotel,1.7,3.0,4.0,55.0,83
4,21903,Hotel,1.2,4.0,3.9,33.0,82


<p>Let's inspect the <code>dtypes</code> attribute to reproduce Table 2.2 and observe the types of the variables in our dataset.</p>
<p><b>Table 2.2</b> List of the variables in the <code>hotels-vienna</code> dataset</p>
<hr>

In [8]:
data.dtypes

hotel_id                int64
accommodation_type     object
distance              float64
stars                 float64
rating                float64
rating_count          float64
price                   int64
dtype: object

<p>Of course, the book goes further and defines the specific type of each variable, but this will suffice for now. Let's look at a single observation as an example; note that <code>iloc[1]</code> selects the second row, since positions start at 0.</p>

In [9]:
data.iloc[1]

hotel_id              21897
accommodation_type    Hotel
distance                1.7
stars                   4.0
rating                  3.9
rating_count          189.0
price                    81
Name: 1, dtype: object

<p>Now, to reproduce Table 2.3, we can pass a lambda function as a filter to pandas' <code>.loc</code>. Here, our goal is to keep only the observations labelled "Hotel" in the <code>accommodation_type</code> variable.</p>

In [10]:
data = data.loc[lambda x: x["accommodation_type"] == "Hotel"]

In [11]:
data.shape[0]

264
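<p>The lambda passed to <code>.loc</code> is one of several equivalent ways to filter rows. The sketch below compares it with a plain boolean mask and with <code>query()</code> on a small made-up table; the frame <code>toy</code> and its values are assumptions, not the real data.</p>

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame(
    {
        "hotel_id": [21894, 21897, 21901],
        "accommodation_type": ["Apartment", "Hotel", "Hotel"],
        "price": [81, 81, 85],
    }
)

# 1. Callable inside .loc (the form used in this notebook)
by_lambda = toy.loc[lambda x: x["accommodation_type"] == "Hotel"]

# 2. Plain boolean mask
by_mask = toy[toy["accommodation_type"] == "Hotel"]

# 3. Query string
by_query = toy.query("accommodation_type == 'Hotel'")

print(by_lambda.equals(by_mask), by_mask.equals(by_query))
```

<p>The callable form is mostly useful in method chains, where the intermediate DataFrame has no name to refer to.</p>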

<p><b>Table 2.3</b> A simple tidy data table</p>
<hr>

In [12]:
data[["hotel_id", "price", "distance"]].head(3)

Unnamed: 0,hotel_id,price,distance
1,21897,81,1.7
2,21901,85,1.4
3,21902,83,1.7


<h2><b>PART B</b> | Repeat part of the cleaning code</h2>
<p>Now, following the authors' notes for this chapter, we'll replicate the code to understand their stance on data preparation.</p> 
<hr>
<h3>2.1 Import and prepare the data</h3>

In [13]:
data = pd.read_csv(f"{data_in_raw}hotelbookingdata-vienna.csv")

In [14]:
data.head()

Unnamed: 0,addresscountryname,city_actual,rating_reviewcount,center1distance,center1label,center2distance,center2label,neighbourhood,price,price_night,...,accommodationtype,guestreviewsrating,scarce_room,hotel_id,offer,offer_cat,year,month,weekend,holiday
0,Austria,Vienna,36.0,2.7 miles,City centre,4.4 miles,Donauturm,17. Hernals,81,price for 1 night,...,_ACCOM_TYPE@Apartment,4.4 /5,1,21894,1,15-50% offer,2017,11,0,0
1,Austria,Vienna,189.0,1.7 miles,City centre,3.8 miles,Donauturm,17. Hernals,81,price for 1 night,...,_ACCOM_TYPE@Hotel,3.9 /5,0,21897,1,1-15% offer,2017,11,0,0
2,Austria,Vienna,53.0,1.4 miles,City centre,2.5 miles,Donauturm,Alsergrund,85,price for 1 night,...,_ACCOM_TYPE@Hotel,3.7 /5,0,21901,1,15-50% offer,2017,11,0,0
3,Austria,Vienna,55.0,1.7 miles,City centre,2.5 miles,Donauturm,Alsergrund,83,price for 1 night,...,_ACCOM_TYPE@Hotel,4 /5,0,21902,1,15-50% offer,2017,11,0,0
4,Austria,Vienna,33.0,1.2 miles,City centre,2.8 miles,Donauturm,Alsergrund,82,price for 1 night,...,_ACCOM_TYPE@Hotel,3.9 /5,1,21903,1,15-50% offer,2017,11,0,0


<p>Here, we can notice the first problem: the variables <code>center1distance</code> and <code>center2distance</code>, which hold each hotel's distance to two reference points (the city centre and Donauturm, according to <code>center1label</code> and <code>center2label</code>), are stored as <b>strings</b> such as "1.7 miles", with one decimal. For our purposes, this is not what we need. We can convert them to numeric values so that they can be used in calculations.</p>

In [17]:
data["distance"] = data["center1distance"].str.split(" ").apply(lambda x: float(x[0]))
data["distance_alter"] = data["center2distance"].str.split(" ").apply(lambda x: float(x[0]))
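<p>The split-and-apply approach above assumes every value is a non-missing string; the lambda would likely fail if it met a missing value. As a hedged alternative, the sketch below pulls the number out with a regular expression via <code>str.extract</code>, which leaves missing values as <code>NaN</code>. The frame <code>raw</code> and its missing entry are made up for illustration.</p>

```python
import pandas as pd

# Toy column in the same "<number> miles" format; the None is hypothetical
raw = pd.DataFrame({"center1distance": ["2.7 miles", "1.7 miles", None]})

# Capture the leading number (with optional decimals) and convert to float
raw["distance"] = (
    raw["center1distance"]
    .str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

print(raw)
```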

<p>Now, we have an issue with <code>accommodationtype</code>: all observations are stored with the prefix <code>_ACCOM_TYPE@</code>. We can strip this prefix and save the result as <code>accommodation_type</code>, which makes the data cleaner and easier to analyse.</p>

In [21]:
data["accommodation_type"] = (
    data["accommodationtype"].str.split("@").apply(lambda x: x[1]).str.strip()
)
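<p>Since the prefix is a fixed string, a literal replacement is a possible alternative to splitting on <code>@</code>. This is only a sketch on made-up rows (the frame <code>raw</code> is an assumption); it should give the same result as the cell above whenever every value starts with that prefix.</p>

```python
import pandas as pd

# Toy values in the raw format shown above
raw = pd.DataFrame(
    {"accommodationtype": ["_ACCOM_TYPE@Apartment", "_ACCOM_TYPE@Hotel"]}
)

# Remove the literal prefix (regex=False treats the pattern as plain text)
raw["accommodation_type"] = (
    raw["accommodationtype"]
    .str.replace("_ACCOM_TYPE@", "", regex=False)
    .str.strip()
)

print(raw["accommodation_type"].tolist())
```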

<p>Great! We can also clean the variable <code>price_night</code>. We want to extract the number of nights for each observation, which is the third token in strings such as "price for 1 night".</p>

In [23]:
data["nnight"] = data["price_night"].str.split(" ").apply(lambda x: int(x[2]))
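<p>Picking the third token works as long as every string follows the "price for N night" pattern. A slightly more position-independent sketch extracts the first run of digits instead. The frame <code>raw</code> and the second value below are assumptions for illustration.</p>

```python
import pandas as pd

# Toy values; "price for 4 nights" is a hypothetical second pattern
raw = pd.DataFrame({"price_night": ["price for 1 night", "price for 4 nights"]})

# Capture the first run of digits and convert it to an integer
raw["nnight"] = raw["price_night"].str.extract(r"(\d+)", expand=False).astype(int)

print(raw)
```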