<h1>Chapter 2 | Case Study A2 | <b>Finding a Good Deal among Hotels: Data Preparation</b></h1>
<p>In this notebook, I'll mostly deal with <b>duplicates</b> and <b>missing values</b> in the <code>hotels-vienna</code> dataset. The first goal is to check whether the number of distinct ID values, which replace the hotel names for confidentiality reasons, matches the number of observations. Since the dataset is meant to be in tidy format, there should be exactly one observation per hotel. Then, we will check how many values are missing and which variables have the highest share of missing values.</p>
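<p>As a minimal sketch of the ID check described above, we can compare the number of rows with the number of distinct IDs. The toy DataFrame and its values are made up for illustration; only the column name <code>hotel_id</code> comes from the dataset.</p>

```python
import pandas as pd

# Toy data: the hotel IDs and prices below are invented for illustration.
toy = pd.DataFrame(
    {
        "hotel_id": [101, 102, 102, 103],
        "price": [80, 95, 95, 120],
    }
)

n_obs = len(toy)                   # number of rows (observations)
n_ids = toy["hotel_id"].nunique()  # number of distinct hotel IDs

# In a tidy table with one row per hotel, the two numbers should match.
ids_are_unique = toy["hotel_id"].is_unique
print(n_obs, n_ids, ids_are_unique)
```

<p>Here the two numbers differ, which signals that at least one ID appears more than once.</p>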
<h2><b>PART A</b> | Read the data</h2>


In [20]:
import os
import sys
import warnings

import pandas as pd

warnings.filterwarnings("ignore")

In [21]:
pd.set_option("display.max_rows", 500)

In [22]:
# Current script folder
current_path = os.getcwd()
dirname = current_path.split("da_case_studies")[0]

#  Get location folders
data_in = f"{dirname}da_data_repo/hotels-vienna"
data_out = f"{dirname}da_case_studies/ch02-hotels_data_prep/"
output = f"{dirname}da_case_studies/ch02-hotels_data_prep/output/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)

In [23]:
# Import the prewritten helper functions
from py_helper_functions import *

<p>In this exercise, we'll be using both the clean and the raw files separately. Let's get hold of both of them.</p>

In [24]:
data_in_clean = f"{data_in}/clean/"
data_in_raw = f"{data_in}/raw/"

<p>We can now read the data and start exploring it. Let's start with the clean data.</p>

In [25]:
data = pd.read_csv(f"{data_in_clean}hotels-vienna.csv")
data = data[
    [
        "hotel_id",
        "accommodation_type",
        "distance",
        "stars",
        "rating",
        "rating_count",
        "price",
    ]
]

<p>Let's first take a look at the accommodation types to differentiate between hotels, apartments, hostels, and so on.</p>

In [26]:
data['accommodation_type'].value_counts()

Hotel                  264
Apartment              124
Pension                 16
Guest House              8
Hostel                   6
Bed and breakfast        4
Apart-hotel              4
Vacation home Condo      2
Name: accommodation_type, dtype: int64

<p>Ok, now, by calling the <code>head()</code> function, we can reproduce Table 1.1 from the previous chapter, when this dataset was originally presented.</p>

<p><b>Table 1.1</b> List of observations</p>
<hr>

In [27]:
data.head()

,hotel_id,accommodation_type,distance,stars,rating,rating_count,price
0,21894,Apartment,2.7,4.0,4.4,36.0,81
1,21897,Hotel,1.7,4.0,3.9,189.0,81
2,21901,Hotel,1.4,4.0,3.7,53.0,85
3,21902,Hotel,1.7,3.0,4.0,55.0,83
4,21903,Hotel,1.2,4.0,3.9,33.0,82


<p>Let's inspect the <code>dtypes</code> attribute to reproduce Table 2.2 and observe the types of variables in our dataset.</p>
<p><b>Table 2.2</b> List of the variables in the <code>hotels-vienna</code> dataset</p>
<hr>

In [28]:
data.dtypes

hotel_id                int64
accommodation_type     object
distance              float64
stars                 float64
rating                float64
rating_count          float64
price                   int64
dtype: object

<p>The book goes further and defines the specific type of each variable, but this will suffice for now. Let's look at a single observation as an example. Note that <code>iloc[1]</code> selects the second row, because positions start at 0.</p>

In [29]:
data.iloc[1]

hotel_id              21897
accommodation_type    Hotel
distance                1.7
stars                   4.0
rating                  3.9
rating_count          189.0
price                    81
Name: 1, dtype: object

<p>Ok, now, to reproduce Table 2.3, we can pass a lambda function to pandas' <code>.loc</code> as a filter. Here, our goal is to keep only the observations classified as "Hotel" in the <code>accommodation_type</code> variable.</p>

In [30]:
data = data.loc[lambda x: x["accommodation_type"] == "Hotel"]

In [31]:
data.shape[0]

264
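
<p>For comparison, here is a small sketch showing that the callable passed to <code>.loc</code> is equivalent to an explicit Boolean mask. The toy DataFrame is an assumption made for illustration.</p>

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame(
    {
        "hotel_id": [1, 2, 3, 4],
        "accommodation_type": ["Hotel", "Apartment", "Hotel", "Hostel"],
    }
)

# Callable passed to .loc: it receives the DataFrame and returns a Boolean mask.
via_lambda = toy.loc[lambda x: x["accommodation_type"] == "Hotel"]

# The same filter written as an explicit Boolean mask.
mask = toy["accommodation_type"] == "Hotel"
via_mask = toy.loc[mask]

print(via_lambda.equals(via_mask))
```

<p>The lambda form is handy in method chains, where the intermediate DataFrame has no name yet.</p>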

<p><b>Table 2.3</b> A simple tidy data table</p>
<hr>

In [32]:
data[["hotel_id", "price", "distance"]].head(3)

,hotel_id,price,distance
1,21897,81,1.7
2,21901,85,1.4
3,21902,83,1.7


<h2><b>PART B</b> | Repeat part of the cleaning code</h2>
<p>Now, following the authors' notes for this chapter, we'll replicate the code to understand their stance on data preparation.</p> 
<hr>
<h3>2.1 Import and prepare the data</h3>

In [33]:
data = pd.read_csv(f"{data_in_raw}hotelbookingdata-vienna.csv")

In [34]:
data.head()

,addresscountryname,city_actual,rating_reviewcount,center1distance,center1label,center2distance,center2label,neighbourhood,price,price_night,...,accommodationtype,guestreviewsrating,scarce_room,hotel_id,offer,offer_cat,year,month,weekend,holiday
0,Austria,Vienna,36.0,2.7 miles,City centre,4.4 miles,Donauturm,17. Hernals,81,price for 1 night,...,_ACCOM_TYPE@Apartment,4.4 /5,1,21894,1,15-50% offer,2017,11,0,0
1,Austria,Vienna,189.0,1.7 miles,City centre,3.8 miles,Donauturm,17. Hernals,81,price for 1 night,...,_ACCOM_TYPE@Hotel,3.9 /5,0,21897,1,1-15% offer,2017,11,0,0
2,Austria,Vienna,53.0,1.4 miles,City centre,2.5 miles,Donauturm,Alsergrund,85,price for 1 night,...,_ACCOM_TYPE@Hotel,3.7 /5,0,21901,1,15-50% offer,2017,11,0,0
3,Austria,Vienna,55.0,1.7 miles,City centre,2.5 miles,Donauturm,Alsergrund,83,price for 1 night,...,_ACCOM_TYPE@Hotel,4 /5,0,21902,1,15-50% offer,2017,11,0,0
4,Austria,Vienna,33.0,1.2 miles,City centre,2.8 miles,Donauturm,Alsergrund,82,price for 1 night,...,_ACCOM_TYPE@Hotel,3.9 /5,1,21903,1,15-50% offer,2017,11,0,0


<p>Here, we can notice the first problem: the variables <code>center1distance</code> and <code>center2distance</code>, which represent the distance from each hotel to two reference points (the city centre and Donauturm in the rows above), are stored as <b>strings</b> such as "2.7 miles". For our purposes, this is not what we need. We can convert them to numeric values so that they can be used in calculations.</p>

In [35]:
data["distance"] = data["center1distance"].str.split(" ").apply(lambda x: float(x[0]))
data["distance_alter"] = data["center2distance"].str.split(" ").apply(lambda x: float(x[0]))
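
<p>The same conversion can be sketched with <code>str.extract</code> and <code>pd.to_numeric</code>. This is an alternative, not the authors' code: it assumes the number always comes first in the string, and it leaves missing entries as <code>NaN</code> instead of raising an error. The toy values are invented.</p>

```python
import pandas as pd

# Toy strings in the same "<number> miles" format; the None entry is hypothetical.
raw = pd.Series(["2.7 miles", "1.7 miles", "0 miles", None], name="center1distance")

# Capture the leading number (digits with an optional decimal part)
# and convert it to float; rows without a match become NaN.
distance = pd.to_numeric(
    raw.str.extract(r"^(\d+(?:\.\d+)?)", expand=False), errors="coerce"
)
print(distance)
```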

<p>Ok, now, we have an issue with <code>accommodation_type</code>: all observations were stored with the prefix <code>_ACCOM_TYPE@</code>. We can remove this prefix, which makes the data cleaner and easier to analyse.</p>

In [36]:
data["accommodation_type"] = (
    data["accommodationtype"].str.split("@").apply(lambda x: x[1]).str.strip()
)
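
<p>A possible alternative, sketched on made-up values, is to delete the literal prefix with <code>str.replace</code> instead of splitting on <code>@</code>. It assumes the prefix is spelled identically in every row.</p>

```python
import pandas as pd

# Toy values; the trailing space in the last entry is added on purpose.
raw = pd.Series(
    ["_ACCOM_TYPE@Hotel", "_ACCOM_TYPE@Apartment", "_ACCOM_TYPE@Guest House "]
)

# Replace the literal prefix with nothing, then trim surrounding whitespace.
cleaned = raw.str.replace("_ACCOM_TYPE@", "", regex=False).str.strip()
print(cleaned.tolist())
```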

<p>Great! We can also clean the variable <code>price_night</code>. We are interested in getting hold of the number of nights for each observation.</p>

In [37]:
data["nnight"] = data["price_night"].str.split(" ").apply(lambda x: int(x[2]))

<p>Our next challenge concerns the variable <code>guestreviewsrating</code>. The rating was stored as text, as a value between 0 and 5 followed by its maximum, for instance "4.4 /5". We just need the numerical value of the rating, as the current string format does not allow for further statistical analysis.</p>

In [38]:
# Split guestreviewsrating on the whitespace and get the value before it
data["rating"] = (
    data["guestreviewsrating"]
    .str.split(" ")
    .apply(lambda x: float(x[0]) if isinstance(x, list) else None)
)
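
<p>As a hedged alternative to the lambda above, the first piece of each split can be taken with the <code>.str[0]</code> accessor and converted with <code>pd.to_numeric</code>, which keeps missing ratings as <code>NaN</code>. The toy strings are invented and only mimic the format seen in the raw file.</p>

```python
import pandas as pd

# Toy ratings in the "<value> /5" format, with one missing entry.
raw = pd.Series(["4.4 /5", "4 /5", None, "3.9 /5"], name="guestreviewsrating")

# Take the text before the first space and convert it to a number.
rating = pd.to_numeric(raw.str.split(" ").str[0], errors="coerce")
print(rating)
```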

<p>Now, the authors developed a function to tabulate this variable. It returns data on the ratings in a tabulated form. Let's check it out.</p>

In [39]:
def tabulate(series, drop_missing=False):
    table = (
        pd.concat(
            [
                series.value_counts(dropna=drop_missing)
                .sort_index()
                .round(2)
                .rename("Freq."),
                series.value_counts(normalize=True, dropna=drop_missing)
                .sort_index()
                .rename("Perc."),
            ],
            axis=1,
        )
        .assign(Cum=lambda x: x["Perc."].cumsum())
        .round(3)
    )
    return table

In [40]:
tabulate(data["rating"])

,Freq.,Perc.,Cum
1.0,3,0.007,0.007
2.0,4,0.009,0.016
2.2,4,0.009,0.026
2.5,1,0.002,0.028
2.7,2,0.005,0.033
3.0,12,0.028,0.06
3.2,14,0.033,0.093
3.4,6,0.014,0.107
3.5,30,0.07,0.177
3.7,43,0.1,0.277


<p>Very nice! We can get the frequency distribution of all ratings, from 1 to 5. Plus, we get to see the number and percentage of missing values as well. This table allows us to get some interesting insights:</p>
<ul>
    <li>There are 35 missing values, which represent about 8% of the dataset.</li>
    <li>Around 60% of the hotels scored 4.1 points or below.</li>
    <li>People tend to be nice to hotels, but not overly so: very low and very high ratings are rare, while ratings between 3.5 and 4.5 points are the most frequent.</li>
</ul>
<p>We can now apply the function to <code>rating_reviewcount</code> and check the distribution of the number of reviews.</p>

In [41]:
tabulate(data["rating_reviewcount"])

,Freq.,Perc.,Cum
1.0,15,0.035,0.035
2.0,9,0.021,0.056
3.0,9,0.021,0.077
4.0,1,0.002,0.079
5.0,5,0.012,0.091
6.0,8,0.019,0.109
7.0,2,0.005,0.114
8.0,2,0.005,0.119
9.0,3,0.007,0.126
10.0,2,0.005,0.13


<p>Now, if we generated a histogram from this frequency table, we'd get a skewed distribution with a long right tail. Most hotels have relatively few reviews, while a handful have a great many.</p>

In [42]:
data["rating_count"] = data["rating_reviewcount"].apply(float)
data["rating_count"].describe().to_frame().T

,count,mean,std,min,25%,50%,75%,max
rating_count,395.0,155.293671,191.296684,1.0,26.5,84.0,203.0,1541.0
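
<p>In the summary above, the mean (about 155) is almost twice the median (84), which is a quick numerical sign of a right-skewed distribution. The sketch below illustrates the same pattern on made-up review counts, using <code>Series.skew()</code> as an extra check.</p>

```python
import pandas as pd

# Made-up review counts: many small values and one very large one.
toy_counts = pd.Series([1, 2, 3, 5, 8, 13, 400], name="rating_count")

mean_count = toy_counts.mean()
median_count = toy_counts.median()
skewness = toy_counts.skew()  # positive values indicate a longer right tail

print(mean_count, median_count, skewness)
```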


<p>We can now rename variables, assigning more meaningful names.</p>

In [44]:
data = data.rename(
    columns={
        "rating2_ta": "ratingta",
        "rating2_ta_reviewcount": "ratingta_count",
        "addresscountryname": "country",
        "s_city": "city",
        "starrating": "stars"
    }
)

In [45]:
# Take a look at key variables
tabulate(data["stars"])

,Freq.,Perc.,Cum
1.0,1,0.002,0.002
2.0,47,0.109,0.112
2.5,5,0.012,0.123
3.0,141,0.328,0.451
3.5,57,0.133,0.584
4.0,144,0.335,0.919
4.5,8,0.019,0.937
5.0,27,0.063,1.0


<p>With the parameter <code>drop_missing</code> set to <code>True</code>, <code>NaN</code> observations are excluded from the table, so the percentages are computed over the non-missing values only.</p>

In [46]:
tabulate(data["rating"], drop_missing=True)

,Freq.,Perc.,Cum
1.0,3,0.008,0.008
2.0,4,0.01,0.018
2.2,4,0.01,0.028
2.5,1,0.003,0.03
2.7,2,0.005,0.035
3.0,12,0.03,0.066
3.2,14,0.035,0.101
3.4,6,0.015,0.116
3.5,30,0.076,0.192
3.7,43,0.109,0.301
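
<p>The percentages shift between the two tables (for example, 0.007 versus 0.008 for a rating of 1.0) because the denominator changes. A small sketch with invented values shows the effect of the <code>dropna</code> argument of <code>value_counts</code>, which <code>tabulate</code> relies on.</p>

```python
import numpy as np
import pandas as pd

# Toy ratings with one missing value, invented for illustration.
toy = pd.Series([4.0, 4.5, 4.5, np.nan], name="rating")

# Shares computed over all four observations, NaN included as its own row.
with_missing = toy.value_counts(normalize=True, dropna=False).sort_index()

# Shares computed over the three non-missing observations only.
without_missing = toy.value_counts(normalize=True, dropna=True).sort_index()

print(with_missing)
print(without_missing)
```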


<p>The next step would be to remove unwanted columns.</p>

In [47]:
data = data.drop(
    columns=[
        "center2distance",
        "center1distance",
        "price_night",
        "guestreviewsrating",
        "rating_reviewcount",
    ]
)

<p>Now, let's look for <b>duplicates</b>. In the following case, we'll be looking for perfect duplicates, that is, rows that are identical in every variable.</p>

<p><b>Table 2.10</b> A simple tidy data table</p>
<hr>

In [48]:
# List every row whose hotel_id appears more than once
data = data.sort_values(by=["hotel_id"])
data[data["hotel_id"].duplicated(keep=False)][
    [
        "hotel_id",
        "accommodation_type",
        "price",
        "distance",
        "stars",
        "rating",
        "rating_count",
    ]
]

,hotel_id,accommodation_type,price,distance,stars,rating,rating_count
128,22050,Hotel,242,0.0,4.0,4.8,404.0
129,22050,Hotel,242,0.0,4.0,4.8,404.0
242,22185,Hotel,84,0.8,3.0,2.2,3.0
241,22185,Hotel,84,0.8,3.0,2.2,3.0


<p>There are two duplicated hotels, and in each pair all variables hold the same values: these are perfect duplicates. We need to drop the repeated observations.</p>

In [49]:
data = data.drop_duplicates()
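
<p>It may help to separate two notions here: rows that merely share an ID and rows that are identical in every column. The sketch below, on made-up data, shows that <code>drop_duplicates()</code> without arguments removes only the second kind.</p>

```python
import pandas as pd

# Toy data: hotel 1 is a perfect duplicate, hotel 2 repeats with a different price.
toy = pd.DataFrame(
    {
        "hotel_id": [1, 1, 2, 2, 3],
        "price": [100, 100, 80, 85, 60],
    }
)

# Rows that share an ID with another row.
same_id = toy[toy["hotel_id"].duplicated(keep=False)]

# Rows that are identical in every column (perfect duplicates).
perfect = toy[toy.duplicated(keep=False)]

# Keeps the first copy of each perfect duplicate; hotel 2 still appears twice.
deduplicated = toy.drop_duplicates()

print(len(same_id), len(perfect), len(deduplicated))
```

<p>If IDs still repeat after this step, the remaining rows differ in at least one variable and need a closer look before dropping anything.</p>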

<p>Great! We can now deal with <b>missing values in text</b>.</p>

In [51]:
data.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
price,428.0,131.366822,91.580545,27.0,83.0,109.5,146.0,1012.0
stars,428.0,3.434579,0.772278,1.0,3.0,3.5,4.0,5.0
ratingta,325.0,3.990769,0.482638,2.0,3.5,4.0,4.5,5.0
ratingta_count,325.0,556.516923,586.874582,2.0,129.0,335.0,811.0,3171.0
scarce_room,428.0,0.598131,0.49085,0.0,0.0,1.0,1.0,1.0
hotel_id,428.0,22153.502336,146.858477,21894.0,22027.75,22155.5,22279.25,22409.0
offer,428.0,0.679907,0.467058,0.0,0.0,1.0,1.0,1.0
year,428.0,2017.0,0.0,2017.0,2017.0,2017.0,2017.0,2017.0
month,428.0,11.0,0.0,11.0,11.0,11.0,11.0,11.0
weekend,428.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [53]:
print(data["rating"].isnull().sum())
data["misrating"] = data["rating"].isnull()

35


<p>First, we call <code>isnull()</code> on the <code>"rating"</code> variable, which returns a Boolean value for each observation: <code>True</code> where the rating is missing. Then, we apply <code>.sum()</code> to count those missing values. The next step is to store the Boolean flags in another variable, <code>"misrating"</code>, which states for each observation whether its rating is missing.</p>

In [54]:
tabulate(data["misrating"])

,Freq.,Perc.,Cum
False,393,0.918,0.918
True,35,0.082,1.0
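
<p>The same idea extends to every variable at once: the column mean of <code>isnull()</code> gives the share of missing values per variable. The sketch below uses a made-up DataFrame, so the numbers are purely illustrative.</p>

```python
import numpy as np
import pandas as pd

# Toy data with missing values, invented for illustration.
toy = pd.DataFrame(
    {
        "price": [81, 85, 83, 82],
        "rating": [4.4, np.nan, 3.7, np.nan],
        "stars": [4.0, 4.0, np.nan, 3.0],
    }
)

# isnull() marks each cell as True/False; the column mean of those
# Booleans is the share of missing values in each variable.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```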


<p>As we can see, <code>"misrating"</code> is true for 35 observations, that is, 8.2% of the dataset. We can now apply the <code>pd.crosstab()</code> function to identify where these missing values are located. This is an important step: if the missing values are unevenly distributed across the dataset, our conclusions and analyses can be distorted. We use <code>"accommodation_type"</code> as the index, <code>"misrating"</code> as the columns, and set <code>margins=True</code> to get row and column totals.</p>
<p>Let's check the result.</p> 

In [55]:
pd.crosstab(data["accommodation_type"], data["misrating"], margins=True)

misrating,False,True,All
accommodation_type,,,
Apart-hotel,4,0,4
Apartment,92,32,124
Bed and breakfast,4,0,4
Guest House,7,1,8
Hostel,6,0,6
Hotel,263,1,264
Pension,16,0,16
Vacation home Condo,1,1,2
All,393,35,428


<p>As we can see, out of the 35 <code>NaN</code> values, <b>32</b> belong to <b>apartments</b>. <b>Vacation home condos</b> and <b>guest houses</b> also show a fairly high proportion of missing values (1 of 2 and 1 of 8, respectively), while only 1 of the 264 hotels is unrated.</p>
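
<p>Counts are harder to compare across groups of very different sizes (32 of 124 apartments is about 26%, whereas 1 of 264 hotels is about 0.4%). As a sketch, <code>pd.crosstab()</code> can return row shares directly through its <code>normalize="index"</code> argument. The toy data below is invented.</p>

```python
import pandas as pd

# Toy data: 4 hotels (1 unrated) and 2 apartments (1 unrated).
toy = pd.DataFrame(
    {
        "accommodation_type": [
            "Hotel", "Hotel", "Hotel", "Hotel", "Apartment", "Apartment",
        ],
        "misrating": [False, False, False, True, True, False],
    }
)

# normalize="index" divides each row by its total, so every row sums to 1.
share = pd.crosstab(toy["accommodation_type"], toy["misrating"], normalize="index")
print(share)
```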
<p>Now, we can use this same function with the variable <code>"price"</code> as values instead of the count of observations, aggregating it by its <code>mean</code>. Let's see the result.</p>

In [56]:
pd.crosstab(
    index=data["accommodation_type"],
    columns=data["misrating"],
    values=data["price"],
    aggfunc="mean",
    margins=True
).round(2)

misrating,False,True,All
accommodation_type,,,
Apart-hotel,121.25,,121.25
Apartment,136.45,179.06,147.44
Bed and breakfast,118.25,,118.25
Guest House,71.0,103.0,75.0
Hostel,53.67,,53.67
Hotel,130.02,106.0,129.93
Pension,96.06,,96.06
Vacation home Condo,107.0,116.0,111.5
All,127.66,173.0,131.37


<p>Great! We can now make a few observations:</p>
<ul>
    <li>Regarding <b>apartments</b>, the average price was higher for observations without a rating (179.06), around €32 above the overall average for this type of accommodation (147.44).</li>
    <li>Guest houses and vacation home condos followed the same pattern, whereas the single unrated hotel was cheaper than the average rated hotel (106 versus 130.02). Of course, with only one unrated observation in each of these groups, the average is driven by that single observation and may reflect an outlier.</li>
</ul>
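<p>Because some of these averages rest on a single observation, it can help to show the group size next to each mean. A minimal sketch, on made-up prices, uses <code>groupby</code> with two aggregations.</p>

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame(
    {
        "accommodation_type": [
            "Apartment", "Apartment", "Apartment", "Guest House", "Guest House",
        ],
        "misrating": [False, True, True, False, True],
        "price": [120, 150, 210, 70, 103],
    }
)

# Mean price and number of observations for every type/misrating combination.
summary = (
    toy.groupby(["accommodation_type", "misrating"])["price"]
    .agg(["mean", "count"])
    .reset_index()
)
print(summary)
```

<p>A mean with a count of 1 is just that observation's price, so it should be read with caution.</p>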
<p>We can now look at the only hotel with a missing rating.</p>

In [57]:
data.loc[
    data["misrating"] & (data["accommodation_type"] == "Hotel"),
    [
        "hotel_id",
        "accommodation_type",
        "price",
        "distance",
        "stars",
        "rating",
        "rating_count",
    ],
]

,hotel_id,accommodation_type,price,distance,stars,rating,rating_count
14,21916,Hotel,106,0.7,2.5,,


<p>...And it's done! We identified the unrated hotel, and on that note, we end this notebook! Thank you!</p>
<hr>