<h1>Chapter 2 | Data Exercise #1 | <code>hotels-europe</code>: Data Preparation</h1>

<p>1. Use the <code>hotels-europe</code> data for another city (not Vienna).</p>
<p>Assignments</p>
<ul>
    <li>Load and clean the data.</li>
    <li>Document the cleaning.</li>
    <li>Describe the clean dataset.</li>
    <li>Look at the raw data and make it into tidy data.</li>
</ul>

<h2><b>1.</b> Load the data</h2>

In [188]:
import os
import sys
import warnings
import pandas as pd

warnings.filterwarnings("ignore")

In [189]:
# Set pandas options to display 500 rows as max
pd.set_option("display.max_rows", 500)

In [190]:
# Current script folder
current_path = os.getcwd()
dirname = current_path.split("da_data_exercises")[0]

# Get location folders
data_in = f"{dirname}da_data_repo/hotels-europe/"
data_out = f"{dirname}da_data_exercises/ch02-preparing_data_for_analysis/data/"
output = f"{dirname}da_data_exercises/ch02-preparing_data_for_analysis/data/output/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)


In [191]:
# Import the prewritten helper functions
from py_helper_functions import *

<p>Ok. We can emulate the example on <code>hotels-vienna</code> to see if we can get the same result with any particular city in Europe. First, let's get hold of the raw dataset, which contains all hotel ratings in Europe. We will use <b>London</b> as our target city for this exercise.</p>

In [192]:
# Set raw data working directory
data_in_raw = f"{data_in}raw/"

In [193]:
data = pd.read_csv(f"{data_in_raw}hotelbookingdata.csv")
data.head()

Unnamed: 0,addresscountryname,city_actual,rating_reviewcount,center1distance,center1label,center2distance,center2label,neighbourhood,price,price_night,...,accommodationtype,guestreviewsrating,scarce_room,hotel_id,offer,offer_cat,year,month,weekend,holiday
0,Netherlands,Amsterdam,1030.0,3.1 miles,City centre,3.6 miles,Montelbaanstoren,Amsterdam,172,price for 1 night,...,_ACCOM_TYPE@Hotel,4.3 /5,0,1.0,0,0% no offer,2017,11,1,0
1,Netherlands,Amsterdam,1030.0,3.1 miles,City centre,3.6 miles,Montelbaanstoren,Amsterdam,122,price for 1 night,...,_ACCOM_TYPE@Hotel,4.3 /5,0,1.0,1,15-50% offer,2018,1,1,0
2,Netherlands,Amsterdam,1030.0,3.1 miles,City centre,3.6 miles,Montelbaanstoren,Amsterdam,122,price for 1 night,...,_ACCOM_TYPE@Hotel,4.3 /5,0,1.0,1,15-50% offer,2017,12,0,1
3,Netherlands,Amsterdam,1030.0,3.1 miles,City centre,3.6 miles,Montelbaanstoren,Amsterdam,552,price for 4 nights,...,_ACCOM_TYPE@Hotel,4.3 /5,0,1.0,1,1-15% offer,2017,12,0,1
4,Netherlands,Amsterdam,1030.0,3.1 miles,City centre,3.6 miles,Montelbaanstoren,Amsterdam,122,price for 1 night,...,_ACCOM_TYPE@Hotel,4.3 /5,0,1.0,1,15-50% offer,2018,2,1,0


<p>Now, our first step is to check the cities available in the dataset. We can get a list of them using <code>.unique()</code>:</p>

In [194]:
data["city_actual"].unique()

array(['Amsterdam', 'Badhoevedorp', 'Diemen', 'Halfweg', 'Hoofddorp',
       'Lijnden', 'Duivendrecht', 'Schiphol', 'Schiphol-Rijk',
       'Zwanenburg', 'Acharnes', 'Athens', 'Agia Paraskevi', 'Alimos',
       'Glyfada', 'Vari-Voula-Vouliagmeni', 'Elliniko-Argyroupoli',
       'Palaio Faliro', 'Saronikos', 'Lavreotiki', 'Chaidari',
       'Chalandri', 'Dionysos', 'Elefsina', 'Kallithea', 'Kifisia',
       'Lykovrysi-Pefki', 'Mandra-Eidyllia', 'Marathon',
       'Markopoulo Mesogaias', 'Marousi', 'Megara', 'Metamorfosi',
       'Moschato-Tavros', 'Nea Smirni', 'Oropos', 'Pallini', 'Piraeus',
       'Rafina-Pikermi', 'Salamis', 'Spata-Artemida', 'Barcelona',
       'Castelldefels', 'Cornella de Llobregat', 'El Prat de Llobregat',
       'Esplugues de Llobregat', 'Gava', 'barcelona',
       'L\\u0027Hospitalet de Llobregat', 'Molins de Rei',
       'Sant Adria de Besos', 'Sant Joan Despi', 'Sant Just Desvern',
       'Belgrade', 'Berlin', 'Schoenefeld', 'Birmingham', 'Walsall',
       'B

<p>Ok. Just to make sure, let's use <code>.unique()</code> together with <code>in</code> to check whether any observation has "London" in the variable <code>"city_actual"</code>:</p>

In [195]:
"London" in data["city_actual"].unique()

True
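<p>The list of cities above contains both <code>'Barcelona'</code> and <code>'barcelona'</code>, so an exact match can miss rows whose city name differs only in case or surrounding whitespace. Whether London has such variants in the raw file is not shown here. A minimal sketch of a case-insensitive check, using a small made-up frame (the toy values are assumptions, not rows from the dataset):</p>

```python
import pandas as pd

# Toy data standing in for the raw file (hypothetical values)
toy = pd.DataFrame({"city_actual": ["London", "london ", "Barcelona", "barcelona"]})

# Normalise whitespace and case before comparing
city_norm = toy["city_actual"].str.strip().str.casefold()
london_mask = city_norm == "london"

exact_matches = int((toy["city_actual"] == "London").sum())
normalised_matches = int(london_mask.sum())
print(exact_matches, normalised_matches)
```

<p>If the two counts differ on the real data, the normalised mask is the safer filter.</p>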

<p>Great. Now, we can filter our dataset and keep only observations for accommodations in London:</p>

In [196]:
df = data.loc[data["city_actual"] == "London"].reset_index(drop=True)
df.head()

Unnamed: 0,addresscountryname,city_actual,rating_reviewcount,center1distance,center1label,center2distance,center2label,neighbourhood,price,price_night,...,accommodationtype,guestreviewsrating,scarce_room,hotel_id,offer,offer_cat,year,month,weekend,holiday
0,United Kingdom,London,161.0,6.5 miles,City centre,9.3 miles,Wimbledon Park Underground Station,Blackheath,515,price for 4 nights,...,_ACCOM_TYPE@Hotel,4 /5,0,7164.0,0,0% no offer,2017,12,0,1
1,United Kingdom,London,161.0,6.5 miles,City centre,9.3 miles,Wimbledon Park Underground Station,Blackheath,147,price for 1 night,...,_ACCOM_TYPE@Hotel,4 /5,0,7164.0,0,0% no offer,2018,6,1,0
2,United Kingdom,London,161.0,6.5 miles,City centre,9.3 miles,Wimbledon Park Underground Station,Blackheath,124,price for 1 night,...,_ACCOM_TYPE@Hotel,4 /5,0,7164.0,0,0% no offer,2018,3,1,0
3,United Kingdom,London,161.0,6.5 miles,City centre,9.3 miles,Wimbledon Park Underground Station,Blackheath,135,price for 1 night,...,_ACCOM_TYPE@Hotel,4 /5,0,7164.0,0,0% no offer,2018,4,1,0
4,United Kingdom,London,161.0,6.5 miles,City centre,9.3 miles,Wimbledon Park Underground Station,Blackheath,128,price for 1 night,...,_ACCOM_TYPE@Hotel,4 /5,0,7164.0,0,0% no offer,2018,2,1,0


<h2><b>2.</b> Clean the data</h2>
<h3><b>2.1.</b> Check the integrity of variables</h3>

<p>Let's start by tackling the variables <code>center1distance</code> and <code>center2distance</code>. Although they represent each hotel's distance to the center, both were stored as strings (objects) with one decimal, such as "6.5 miles". We can convert them to numeric values so that they can be used properly in our analysis.</p>

In [197]:
df["distance"] = df["center1distance"].str.split(" ").apply(lambda x: float(x[0]))
df["distance_alter"] = df["center2distance"].str.split(" ").apply(lambda x: float(x[0]))

<p>In the code above, we applied <code>.str.split()</code> to split each value on whitespace, as values were stored as "n miles". Then we applied a <code>lambda</code> that converts the first element of each split result to a float.</p>
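<p>The split-and-convert approach raises an error if a value is missing or does not start with a number. As a hedged alternative, the sketch below extracts the leading number with a regular expression and coerces anything unparseable to <code>NaN</code>. The sample values are made up for illustration:</p>

```python
import pandas as pd

# Toy values in the "n miles" format, plus a missing and a malformed entry (hypothetical)
raw = pd.Series(["6.5 miles", "0.3 miles", None, "unknown"])

# Capture the leading number; rows without a match become NaN
number = raw.str.extract(r"^(\d+(?:\.\d+)?)", expand=False)

# Convert to float, turning anything unparseable into NaN
distance = pd.to_numeric(number, errors="coerce")
print(distance)
```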

In [198]:
df[["distance", "distance_alter"]]

Unnamed: 0,distance,distance_alter
0,6.5,9.3
1,6.5,9.3
2,6.5,9.3
3,6.5,9.3
4,6.5,9.3
...,...,...
8769,4.2,4.9
8770,4.2,4.9
8771,4.2,4.9
8772,4.2,4.9


In [199]:
df[["distance", "distance_alter"]].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8774 entries, 0 to 8773
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   distance        8774 non-null   float64
 1   distance_alter  8774 non-null   float64
dtypes: float64(2)
memory usage: 137.2 KB


<p>Great! Now, let's check <code>"accommodationtype"</code>. All observations were stored with the prefix <code>_ACCOM_TYPE@</code>. Let's apply the same procedure, except that we split on <code>@</code> and keep the part after it rather than the part before.</p>

In [200]:
df["accommodation_type"] = (
    df["accommodationtype"].str.split("@").apply(lambda x: x[1]).str.strip()
)

In [201]:
df["accommodation_type"]

0           Hotel
1           Hotel
2           Hotel
3           Hotel
4           Hotel
          ...    
8769        Hotel
8770        Hotel
8771        Hotel
8772        Hotel
8773    Apartment
Name: accommodation_type, Length: 8774, dtype: object
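<p>The same result can be obtained without <code>apply</code> by indexing the split lists through the <code>.str</code> accessor. A small sketch on made-up values in the same format:</p>

```python
import pandas as pd

# Toy values with the "_ACCOM_TYPE@" prefix (hypothetical sample)
raw_type = pd.Series(
    ["_ACCOM_TYPE@Hotel", "_ACCOM_TYPE@Apartment", "_ACCOM_TYPE@Guest House"]
)

# Vectorised equivalent of split("@") followed by taking element 1
accommodation_type = raw_type.str.split("@").str[1].str.strip()
print(accommodation_type.tolist())
```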

<p>We can now clean the variable <code>"price_night"</code>. We want to get hold of the number of nights that the price value refers to.</p>

In [202]:
df["nnight"] = df["price_night"].str.split(" ").apply(lambda x: int(x[2]))

<p>Here, we split the string on whitespace, which resulted in four substrings (e.g. "price for 1 night" becomes "price", "for", "1" and "night"). As we are interested in the numeric value, the lambda function takes <code>x[2]</code>, the third substring, and converts it to an integer.</p>
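<p>Relying on the position of the number works as long as every value has the same wording. A sketch that instead pulls out the first run of digits, shown on the two formats visible in the data preview:</p>

```python
import pandas as pd

# The two wordings seen in the preview of "price_night"
price_night = pd.Series(["price for 1 night", "price for 4 nights"])

# Extract the first run of digits regardless of where it sits in the string
nnight = price_night.str.extract(r"(\d+)", expand=False).astype(int)
print(nnight.tolist())
```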
<p>Now, we can clean the variable <code>"guestreviewsrating"</code>. The rating was stored as a value out of 5 followed by the maximum, like "4.3 /5". We just want the rating itself, so let's split this variable like the others and convert the first value to a float.</p>

In [203]:
df["rating"] = (
    df["guestreviewsrating"]
    .str.split(" ")
    .apply(lambda x: float(x[0]) if type(x) == list else None)
)

<p>I'll now use the authors' <code>tabulate</code> function, which returns a table with the frequency of observations for each rating (<code>Freq.</code>), the share of occurrences of each rating as a proportion (<code>Perc.</code>), and the cumulative sum of those shares (<code>Cum</code>).</p>

In [204]:
def tabulate(series, drop_missing=False):
    """Tabulate a pandas Series and return statistical observations."""
    table = (
        pd.concat(
            [
                series.value_counts(dropna=drop_missing)
                .sort_index()
                .round(2)
                .rename("Freq."),
                series.value_counts(normalize=True, dropna=drop_missing)
                .sort_index()
                .rename("Perc."),
            ],
            axis=1
        )
        .assign(Cum=lambda x: x["Perc."].cumsum())
        .round(3)
    )
    return table

In [205]:
tabulate(df["rating"])

Unnamed: 0,Freq.,Perc.,Cum
1.0,20,0.002,0.002
1.2,23,0.003,0.005
1.3,14,0.002,0.006
1.5,10,0.001,0.008
1.6,24,0.003,0.01
1.7,12,0.001,0.012
1.8,10,0.001,0.013
2.0,54,0.006,0.019
2.2,165,0.019,0.038
2.4,82,0.009,0.047


<p>Now that is a <i>great</i> function! Let's take note of a few observations:</p>
<ul>
    <li>The distribution of <code>"rating"</code> seems a little skewed, with a longer left tail, maybe? We will figure this out when doing an EDA.</li>
    <li>Most ratings are concentrated between 3 and 4.</li>
    <li>A rating of 4 is by far the most frequently observed value.</li>
    <li>There are 298 <code>NaN</code> observations, which correspond to 3.4% of the dataset.</li>
</ul>
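<p>The missing-value figures in the last bullet can be checked directly. The sketch below uses a stand-in series with the same number of observed and missing entries as the London data (the value 4.0 is only a placeholder):</p>

```python
import numpy as np
import pandas as pd

# Stand-in for the "rating" column: 8476 observed values and 298 missing ones
rating = pd.Series([4.0] * 8476 + [np.nan] * 298)

n_missing = int(rating.isna().sum())   # count of NaN entries
share_missing = rating.isna().mean()   # proportion of NaN entries
print(n_missing, round(share_missing * 100, 1))
```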
<p>Let's now apply this same function to check <code>"rating_reviewcount"</code> and see the distribution of reviews:</p>

In [206]:
tabulate(df["rating_reviewcount"])

Unnamed: 0,Freq.,Perc.,Cum
1.0,97,0.011,0.011
2.0,75,0.009,0.020
3.0,50,0.006,0.025
4.0,55,0.006,0.032
5.0,8,0.001,0.032
...,...,...,...
2730.0,9,0.001,0.963
2895.0,9,0.001,0.964
3464.0,8,0.001,0.965
4300.0,8,0.001,0.966


<p>Now, most hotels seem to have a low number of reviews, and the distribution of this variable probably has a long right tail. We would need to bin the values to comment on it more precisely.</p>

In [207]:
df["rating_count"] = df["rating_reviewcount"].apply(float)
df["rating_count"].describe().to_frame().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rating_count,8476.0,314.101227,408.48011,1.0,76.0,186.0,385.0,4300.0
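<p>The binning mentioned above could be sketched with <code>pd.cut</code>. The review counts and the bin edges below are arbitrary choices made for illustration, not values derived from the dataset:</p>

```python
import numpy as np
import pandas as pd

# Toy review counts (hypothetical values)
counts = pd.Series([1, 3, 45, 186, 385, 900, 4300])

# Right-closed bins: (0, 10], (10, 100], (100, 500], (500, 1000], (1000, inf]
edges = [0, 10, 100, 500, 1000, np.inf]
labels = ["1-10", "11-100", "101-500", "501-1000", "1001+"]
binned = pd.cut(counts, bins=edges, labels=labels)

# Frequency per bin, in bin order
bin_freq = binned.value_counts().sort_index()
print(bin_freq)
```

<p>On the real column, the same call would take <code>df["rating_count"]</code> in place of the toy series.</p>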


<p>Now, let's rename the variables and assign more meaningful names:</p>

In [208]:
df = df.rename(
    columns={
        "rating2_ta": "ratingta",
        "rating2_ta_reviewcount": "ratingta_count",
        "addresscountryname": "country",
        "s_city": "city",
        "starrating": "stars",
    }
)
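<p>One caveat with <code>rename</code>: keys that do not match any column are ignored silently, so a typo in the mapping goes unnoticed. A small sketch of how <code>errors="raise"</code> surfaces such a typo, using a made-up frame and a deliberately misspelled key:</p>

```python
import pandas as pd

# Toy frame with two of the raw column names
toy = pd.DataFrame({"starrating": [3.0], "s_city": ["London"]})

# Default behaviour: the misspelled key is ignored and nothing is renamed
silent = toy.rename(columns={"starating": "stars"})  # typo on purpose

# errors="raise" turns the same typo into a KeyError
try:
    toy.rename(columns={"starating": "stars"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(silent.columns), typo_caught)
```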

In [209]:
tabulate(df["stars"])

Unnamed: 0,Freq.,Perc.,Cum
0.0,18,0.002,0.002
1.0,8,0.001,0.003
1.5,14,0.002,0.005
2.0,746,0.085,0.09
2.5,693,0.079,0.169
3.0,2527,0.288,0.457
3.5,692,0.079,0.535
4.0,2745,0.313,0.848
4.5,179,0.02,0.869
5.0,1152,0.131,1.0


<p>As noticed before, the distribution of stars is concentrated in the 3 to 4 interval, with some gaps between values. For instance, while there were 2745 observations with 4 stars (31%), there were only 179 with 4.5 stars (2%) and 1152 with 5 stars (13%).</p>
<p>We can now exclude missing values by setting <code>drop_missing</code> to <code>True</code> in our <code>tabulate()</code> function and observe the distribution of values in <code>"rating"</code>.</p>

In [210]:
tabulate(df["rating"], drop_missing=True)

Unnamed: 0,Freq.,Perc.,Cum
1.0,20,0.002,0.002
1.2,23,0.003,0.005
1.3,14,0.002,0.007
1.5,10,0.001,0.008
1.6,24,0.003,0.011
1.7,12,0.001,0.012
1.8,10,0.001,0.013
2.0,54,0.006,0.02
2.2,165,0.019,0.039
2.4,82,0.01,0.049


<p>We can remove unwanted columns:</p>

In [211]:
df = df.drop(
    columns=[
        "center2distance",
        "center1distance",
        "price_night",
        "guestreviewsrating",
        "rating_reviewcount",
    ]
)

In [212]:
df.head()

Unnamed: 0,country,city_actual,center1label,center2label,neighbourhood,price,city,stars,ratingta,ratingta_count,...,year,month,weekend,holiday,distance,distance_alter,accommodation_type,nnight,rating,rating_count
0,United Kingdom,London,City centre,Wimbledon Park Underground Station,Blackheath,515,London,3.0,4.0,1669.0,...,2017,12,0,1,6.5,9.3,Hotel,4,4.0,161.0
1,United Kingdom,London,City centre,Wimbledon Park Underground Station,Blackheath,147,London,3.0,4.0,1669.0,...,2018,6,1,0,6.5,9.3,Hotel,1,4.0,161.0
2,United Kingdom,London,City centre,Wimbledon Park Underground Station,Blackheath,124,London,3.0,4.0,1669.0,...,2018,3,1,0,6.5,9.3,Hotel,1,4.0,161.0
3,United Kingdom,London,City centre,Wimbledon Park Underground Station,Blackheath,135,London,3.0,4.0,1669.0,...,2018,4,1,0,6.5,9.3,Hotel,1,4.0,161.0
4,United Kingdom,London,City centre,Wimbledon Park Underground Station,Blackheath,128,London,3.0,4.0,1669.0,...,2018,2,1,0,6.5,9.3,Hotel,1,4.0,161.0


In [213]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8774 entries, 0 to 8773
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             8774 non-null   object 
 1   city_actual         8774 non-null   object 
 2   center1label        8774 non-null   object 
 3   center2label        8774 non-null   object 
 4   neighbourhood       8774 non-null   object 
 5   price               8774 non-null   int64  
 6   city                8774 non-null   object 
 7   stars               8774 non-null   float64
 8   ratingta            8212 non-null   float64
 9   ratingta_count      8212 non-null   float64
 10  accommodationtype   8774 non-null   object 
 11  scarce_room         8774 non-null   int64  
 12  hotel_id            8774 non-null   float64
 13  offer               8774 non-null   int64  
 14  offer_cat           8774 non-null   object 
 15  year                8774 non-null   int64  
 16  month 

<p>Ok, let's check for missing values before dealing with possible duplicates.</p>
<h3><b>2.2.</b> Deal with Missing Values</h3>
<p>We can tell that a few variables must be considered:</p>
<ul>
    <li><code>"ratingta"</code></li>
    <li><code>"ratingta_count"</code></li>
    <li><code>"rating"</code></li>
    <li><code>"rating_count"</code></li>
</ul>

In [214]:
df.isna().sum()

country                 0
city_actual             0
center1label            0
center2label            0
neighbourhood           0
price                   0
city                    0
stars                   0
ratingta              562
ratingta_count        562
accommodationtype       0
scarce_room             0
hotel_id                0
offer                   0
offer_cat               0
year                    0
month                   0
weekend                 0
holiday                 0
distance                0
distance_alter          0
accommodation_type      0
nnight                  0
rating                298
rating_count          298
dtype: int64

In [215]:
df["ratingta"].isnull().sum()

562

In [216]:
df["misratingta"] = df["ratingta"].isnull()
df["misratingta_count"] = df["ratingta_count"].isnull()
df["misrating"] = df["rating"].isnull()
df["misrating_count"] = df["rating_count"].isnull()

In [217]:
tabulate(df["misratingta"])

Unnamed: 0,Freq.,Perc.,Cum
False,8212,0.936,0.936
True,562,0.064,1.0


In [218]:
tabulate(df["misrating"])

Unnamed: 0,Freq.,Perc.,Cum
False,8476,0.966,0.966
True,298,0.034,1.0


In [219]:
pd.crosstab(df["accommodation_type"], df["misratingta"], margins=True)

misratingta,False,True,All
accommodation_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Apart-hotel,179,12,191
Apartment,698,440,1138
Bed and breakfast,304,0,304
Guest House,451,51,502
Hostel,328,0,328
Hotel,6049,54,6103
Inn,202,0,202
Vacation home Condo,1,5,6
All,8212,562,8774


In [220]:
pd.crosstab(df["accommodation_type"], df["misrating"], margins=True)

misrating,False,True,All
accommodation_type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Apart-hotel,191,0,191
Apartment,949,189,1138
Bed and breakfast,304,0,304
Guest House,471,31,502
Hostel,328,0,328
Hotel,6031,72,6103
Inn,202,0,202
Vacation home Condo,0,6,6
All,8476,298,8774
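<p>The crosstabs above give counts. To compare accommodation types of very different sizes, the share of missing values per type can be easier to read. Because the flag is boolean, its group mean is that share. A sketch on a made-up frame:</p>

```python
import pandas as pd

# Toy data: accommodation type and a boolean missing-rating flag (hypothetical)
toy = pd.DataFrame(
    {
        "accommodation_type": ["Hotel", "Hotel", "Hotel", "Hotel", "Apartment", "Apartment"],
        "misrating": [False, False, False, True, True, False],
    }
)

# Mean of a boolean flag = proportion of True values within each group
missing_share = toy.groupby("accommodation_type")["misrating"].mean()
print(missing_share)
```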


<p><b>Problem</b>: we have too many missing values for <code>"ratingta"</code> among apartments (440 of the 562 missing values). Even in <code>"rating"</code>, apartments account for most of the missing values (189 of 298). We should drop all rows with missing values and disregard this accommodation type.</p>

In [221]:
df = df.dropna()
df = df.loc[df["accommodation_type"] != "Apartment"]
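<p>Note that <code>dropna()</code> without arguments removes a row if <i>any</i> column is missing. Since the exercise asks to document the cleaning, the sketch below restricts the check to the rating columns and records how many rows each step removes. The toy frame and its <code>notes</code> column are hypothetical:</p>

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "accommodation_type": ["Hotel", "Apartment", "Hotel", "Hostel"],
        "rating": [4.0, np.nan, 3.5, 4.2],
        "ratingta": [4.0, 3.5, np.nan, 4.5],
        "notes": [None, None, None, "x"],  # hypothetical unrelated column
    }
)

rating_cols = ["rating", "ratingta"]
n_before = len(toy)

# Drop rows only when one of the rating columns is missing
cleaned = toy.dropna(subset=rating_cols)
# Then exclude apartments, as in the cell above
cleaned = cleaned.loc[cleaned["accommodation_type"] != "Apartment"]
n_after = len(cleaned)

print(f"Dropped {n_before - n_after} of {n_before} rows")
```

<p>With a plain <code>dropna()</code>, the unrelated <code>notes</code> column would cause three of the four toy rows to be removed.</p>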

<p>Great! Now, let's move on to duplicates.</p>
<h3><b>2.3.</b> Deal with duplicates</h3>

In [224]:
df = df.sort_values(by=["hotel_id", "price"])
df[df.duplicated(keep=False)][
    [
        "hotel_id",
        "accommodation_type",
        "price",
        "distance",
        "stars",
        "rating",
        "rating_count",
    ]
]

Unnamed: 0,hotel_id,accommodation_type,price,distance,stars,rating,rating_count
212,7194.0,Hotel,94,1.3,2.0,2.4,68.0
213,7194.0,Hotel,94,1.3,2.0,2.4,68.0
494,7238.0,Hotel,203,4.5,4.0,4.6,130.0
496,7238.0,Hotel,203,4.5,4.0,4.6,130.0
497,7238.0,Hotel,203,4.5,4.0,4.6,130.0
500,7238.0,Hotel,203,4.5,4.0,4.6,130.0
502,7238.0,Hotel,203,4.5,4.0,4.6,130.0
503,7238.0,Hotel,203,4.5,4.0,4.6,130.0
504,7238.0,Hotel,203,4.5,4.0,4.6,130.0
499,7238.0,Hotel,203,4.5,4.0,4.6,130.0


In [186]:
df = df.drop_duplicates()
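<p>The inspection step uses <code>keep=False</code>, which flags every row of a duplicated group, while <code>drop_duplicates()</code> keeps the first row of each group and removes only the repeats. A sketch of the difference on a toy frame (the price of 210 is a made-up value added for contrast):</p>

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "hotel_id": [7194, 7194, 7238, 7238, 7238],
        "price": [94, 94, 203, 203, 210],
    }
)

# keep=False marks all members of a duplicated group (useful for inspection)
n_flagged = int(toy.duplicated(keep=False).sum())

# The default keep="first" marks only the repeats, which drop_duplicates removes
n_repeats = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print(n_flagged, n_repeats, len(deduped))
```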

In [187]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7408 entries, 0 to 8767
Data columns (total 29 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   country             7408 non-null   object 
 1   city_actual         7408 non-null   object 
 2   center1label        7408 non-null   object 
 3   center2label        7408 non-null   object 
 4   neighbourhood       7408 non-null   object 
 5   price               7408 non-null   int64  
 6   city                7408 non-null   object 
 7   stars               7408 non-null   float64
 8   ratingta            7408 non-null   float64
 9   ratingta_count      7408 non-null   float64
 10  accommodationtype   7408 non-null   object 
 11  scarce_room         7408 non-null   int64  
 12  hotel_id            7408 non-null   float64
 13  offer               7408 non-null   int64  
 14  offer_cat           7408 non-null   object 
 15  year                7408 non-null   int64  
 16  month 

In [225]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
price,7469.0,282.866381,330.414462,27.0,131.0,195.0,303.0,9137.0
stars,7469.0,3.526643,0.896916,1.0,3.0,3.5,4.0,5.0
ratingta,7469.0,3.726938,0.79684,1.0,3.0,4.0,4.5,5.0
ratingta_count,7469.0,1153.378096,1373.773711,1.0,256.0,670.0,1546.0,17139.0
scarce_room,7469.0,0.361494,0.480465,0.0,0.0,0.0,1.0,1.0
hotel_id,7469.0,7858.493908,398.574693,7164.0,7532.0,7843.0,8185.0,8562.0
offer,7469.0,0.761012,0.426494,0.0,1.0,1.0,1.0,1.0
year,7469.0,2017.582943,0.493106,2017.0,2017.0,2018.0,2018.0,2018.0
month,7469.0,6.797028,4.211919,1.0,3.0,6.0,11.0,12.0
weekend,7469.0,0.669568,0.4704,0.0,0.0,1.0,1.0,1.0


<p>Ok, now we have a clean dataset. Let's export it and that's it!</p>
<h2><b>3.</b> Export the clean dataset</h2>


In [226]:
df.to_csv(f"{output}hotels_london_clean.csv", index=False)
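<p><code>to_csv</code> does not create missing folders, so the output directory has to exist before the export. The sketch below creates it if needed and reads the file back as a quick sanity check. A temporary directory and a made-up file name stand in for the real output path:</p>

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the clean dataset
toy = pd.DataFrame({"hotel_id": [7164, 7194], "price": [124, 94]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "output")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(out_dir, "hotels_toy_clean.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)  # read back to verify the export

round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```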

<h2><b>4.</b> Conclusions and final remarks</h2>
<hr>