<h1>Chapter 4 | Data Exercise #1 | <code>hotels-vienna</code> | Comparison and correlation</h1>
<h2>Introduction:</h2>
<p>In this notebook, you will find my notes and code for Chapter 4's <b>exercise 1</b> of the book <a href="https://gabors-data-analysis.com/">Data Analysis for Business, Economics, and Policy</a>, by Gábor Békés and Gábor Kézdi. The question was:</p>
<p>1. Are central hotels better? To answer this, using the <code>hotels-vienna</code> dataset (as discussed in Chapter 3, Section 3.A1), create <b>two categories</b> by the <b>distance from center</b>: close and far (by picking a cutoff of your choice).</p>
<p>Assignments:</p>
<ul>
    <li>Show summary statistics.</li>
    <li>Compare <b>stars</b> and <b>ratings</b> and <b>prices</b> for close and far hotels.</li>
    <li>Create stacked bar charts, box plots, and violin plots.</li>
    <li>Summarize your findings.</li>
</ul>
<h2><b>1.</b> Load the data</h2>

In [2]:
import os
import sys
import warnings
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from plotnine import *
from mizani.formatters import percent_format

warnings.filterwarnings("ignore")
%matplotlib inline

In [3]:
# Increase number of returned rows in pandas
pd.set_option("display.max_rows", 500)

In [4]:
# Current script folder
current_path = os.getcwd()
dirname = current_path.split("da_data_exercises")[0]

# Get location folders
data_in = f"{dirname}da_data_repo/hotels-vienna/clean/"
data_out = f"{dirname}da_data_exercises/ch04-comparison_correlation/01-hotels_vienna/data/clean/"
output = f"{dirname}da_data_exercises/ch04-comparison_correlation/01-hotels_vienna/"
func = f"{dirname}da_case_studies/ch00-tech_prep/"
sys.path.append(func)

In [5]:
from py_helper_functions import *

In [6]:
df = pd.read_csv(f"{data_in}hotels-vienna.csv")

In [7]:
df.head()

Unnamed: 0,country,city_actual,rating_count,center1label,center2label,neighbourhood,price,city,stars,ratingta,...,offer_cat,year,month,weekend,holiday,distance,distance_alter,accommodation_type,nnights,rating
0,Austria,Vienna,36.0,City centre,Donauturm,17. Hernals,81,Vienna,4.0,4.5,...,15-50% offer,2017,11,0,0,2.7,4.4,Apartment,1,4.4
1,Austria,Vienna,189.0,City centre,Donauturm,17. Hernals,81,Vienna,4.0,3.5,...,1-15% offer,2017,11,0,0,1.7,3.8,Hotel,1,3.9
2,Austria,Vienna,53.0,City centre,Donauturm,Alsergrund,85,Vienna,4.0,3.5,...,15-50% offer,2017,11,0,0,1.4,2.5,Hotel,1,3.7
3,Austria,Vienna,55.0,City centre,Donauturm,Alsergrund,83,Vienna,3.0,4.0,...,15-50% offer,2017,11,0,0,1.7,2.5,Hotel,1,4.0
4,Austria,Vienna,33.0,City centre,Donauturm,Alsergrund,82,Vienna,4.0,3.5,...,15-50% offer,2017,11,0,0,1.2,2.8,Hotel,1,3.9


<p>Let's filter our data and restrict the accommodation type to hotels only.</p>

In [8]:
# Take an explicit copy so that adding columns later does not modify a view of df
vienna_cut = df.loc[lambda x: x["accommodation_type"] == "Hotel"].copy()

In [9]:
vienna_cut["accommodation_type"].value_counts()

Hotel    264
Name: accommodation_type, dtype: int64

<h2><b>2.</b> Binning quantitative <code>distance</code></h2>
<p>In Chapter 3, hotels more than 8 miles from the center were considered too far out and marked as such. Since most hotels lie within 2 miles of the center, we can use that value as the cutoff: hotels below 2 miles are binned as <b>close</b> (coded <code>1</code>) and hotels at 2 miles or more as <b>far</b> (coded <code>2</code>). Let's do it.</p>

In [10]:
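# 1 = close (under 2 miles); np.nan as placeholder (pd.np was removed from pandas)
vienna_cut["distance2bins"] = np.where(vienna_cut["distance"] < 2, 1, np.nan)
# 2 = far (2 miles or more); rows with a missing distance stay NaN
vienna_cut["distance2bins"] = np.where(vienna_cut["distance"] >= 2, 2, vienna_cut["distance2bins"])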
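
<p>As a side note, the same two-way split can probably be expressed with <code>pd.cut</code>, which returns labelled categories instead of numeric codes. The sketch below is only an illustration on made-up distances (the <code>toy</code> frame and the <code>distance_cat</code> column are hypothetical names, not part of the exercise); it assumes the same 2-mile cutoff.</p>

```python
import pandas as pd

# Hypothetical toy distances (miles), not taken from the dataset
toy = pd.DataFrame({"distance": [0.3, 1.9, 2.0, 3.5, 7.8]})

# right=False makes the intervals left-closed: [0, 2) is close, [2, inf) is far
toy["distance_cat"] = pd.cut(
    toy["distance"],
    bins=[0, 2, float("inf")],
    labels=["close", "far"],
    right=False,
)

print(toy)
```

<p>With <code>right=False</code>, a hotel at exactly 2 miles lands in the far group, which matches the <code>&gt;= 2</code> condition used in the cell above.</p>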


In [11]:
vienna_cut.head()

Unnamed: 0,country,city_actual,rating_count,center1label,center2label,neighbourhood,price,city,stars,ratingta,...,year,month,weekend,holiday,distance,distance_alter,accommodation_type,nnights,rating,distance2bins
1,Austria,Vienna,189.0,City centre,Donauturm,17. Hernals,81,Vienna,4.0,3.5,...,2017,11,0,0,1.7,3.8,Hotel,1,3.9,1.0
2,Austria,Vienna,53.0,City centre,Donauturm,Alsergrund,85,Vienna,4.0,3.5,...,2017,11,0,0,1.4,2.5,Hotel,1,3.7,1.0
3,Austria,Vienna,55.0,City centre,Donauturm,Alsergrund,83,Vienna,3.0,4.0,...,2017,11,0,0,1.7,2.5,Hotel,1,4.0,1.0
4,Austria,Vienna,33.0,City centre,Donauturm,Alsergrund,82,Vienna,4.0,3.5,...,2017,11,0,0,1.2,2.8,Hotel,1,3.9,1.0
6,Austria,Vienna,57.0,City centre,Donauturm,Alsergrund,103,Vienna,4.0,3.5,...,2017,11,0,0,0.9,2.4,Hotel,1,3.9,1.0


In [12]:
vienna_cut["accommodation_type"].value_counts()

Hotel    264
Name: accommodation_type, dtype: int64

<h2><b>3.</b> Create summary statistics</h2>
<p>Let's create one table for each variable and compare the results.</p>

In [13]:
# Summary - stars
vienna_cut.filter(["stars","distance2bins"]).groupby("distance2bins").agg(
    ["min", "max", "mean", "median", "std", "size"]
)

Unnamed: 0_level_0,stars,stars,stars,stars,stars,stars
Unnamed: 0_level_1,min,max,mean,median,std,size
distance2bins,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1.0,1.0,5.0,3.669192,4.0,0.777482,198
2.0,2.0,4.5,3.431818,3.0,0.600553,66


<p>In terms of stars, hotels close to the city center score higher on average: 3.67 vs. 3.43 for hotels classified as far, with a median of 4 vs. 3. But that is not the entire story. While hotels at or beyond the 2-mile threshold have a lower average, their star ratings are <b>less dispersed</b>. Close hotels span the full range, from 1 to 5 stars, whereas far hotels range only from 2 to 4.5 stars. Accordingly, the standard deviation is higher for close hotels (0.78 vs. 0.60), which reflects more variability in what you may find near the center. Keep in mind that the groups differ in size: 198 close hotels vs. 66 far ones.</p>
<p>Let's take a look at the ratings for each bin.</p>

In [14]:
# Summary - rating
vienna_cut.filter(["rating","distance2bins"]).groupby("distance2bins").agg(
    ["min", "max", "mean", "median", "std", "size"]
)

Unnamed: 0_level_0,rating,rating,rating,rating,rating,rating
Unnamed: 0_level_1,min,max,mean,median,std,size
distance2bins,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1.0,1.0,4.9,4.109645,4.1,0.441771,198
2.0,2.0,4.5,3.859091,4.0,0.471977,66


<p>Observations:</p>
<ul>
<li>The mean rating is higher for closer hotels: 4.11 vs. 3.86.</li>
<li>The range of ratings is wider for hotels closer to the city center (1.0 to 4.9 vs. 2.0 to 4.5), which means you may find very badly rated hotels there as well as very highly rated ones. If one values user reviews, hotels farther away concentrate their ratings in a narrower range, so fewer extreme values are to be expected at either end.</li>
<li>Overall, the standard deviation is similar in both groups (0.44 vs. 0.47). It is slightly higher for the more distant hotels, so despite their narrower range, their ratings are marginally more spread out around the mean.</li>
</ul>
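
<p>The range depends only on the two most extreme hotels, so it can disagree with other measures of spread, as it does with the standard deviation here. One possible cross-check is the interquartile range (IQR). The sketch below uses made-up ratings (the <code>toy</code> frame and the <code>iqr</code> helper are hypothetical, not part of the exercise) to show how a group can have the wider range and still the narrower IQR.</p>

```python
import pandas as pd

# Hypothetical toy ratings, not taken from the dataset
toy = pd.DataFrame(
    {
        "distance2bins": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "rating": [1.0, 4.0, 4.1, 4.2, 4.9, 3.5, 3.8, 4.0, 4.1, 4.5],
    }
)


def iqr(s):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return s.quantile(0.75) - s.quantile(0.25)


# Named aggregation: one column per spread measure, one row per bin
spread = toy.groupby("distance2bins")["rating"].agg(
    value_range=lambda s: s.max() - s.min(),
    iqr=iqr,
)
print(spread)
```

<p>In this toy example, bin 1 has the wider range because of a single low rating, while its middle half is more tightly packed than that of bin 2. Whether the same holds for the Vienna hotels would need to be checked on the actual data.</p>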
<p>Finally, let's take a look at the summary statistics for price.</p>

In [15]:
# Summary - price
vienna_cut.filter(["price","distance2bins"]).groupby("distance2bins").agg(
    ["min", "max", "mean", "median", "std", "size"]
)

Unnamed: 0_level_0,price,price,price,price,price,price
Unnamed: 0_level_1,min,max,mean,median,std,size
distance2bins,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1.0,33,1012,142.419192,111.0,115.539349,198
2.0,50,208,92.469697,83.0,32.245319,66


<p>Now here we can make some interesting observations!</p>
<ul>
<li>There is a large difference between the two categories (let's just call them <code>1</code> and <code>2</code>, which represent, respectively, hotels below the 2-mile distance from the city center and hotels at or beyond it).</li>
<li>The mean price for 1 is <b>142</b> dollars, while 2's is <b>92</b> dollars. This is a <b>gap of about 50 dollars</b>, and suggests that hotels closer to the city center command a higher price.</li>
<li>Still, one can get some cheap deals close to the center. The minimum price for 1 is <b>33</b> dollars, while for 2 the same statistic is <b>50</b>.</li>
<li>The maximum price is extremely high for 1 (1012 dollars), and much lower for 2 (208 dollars).</li>
<li>The occurrence of such extreme values in 1 may reflect high variability in the quality of hotels closer to the city center. There may be really cheap hotels that cater to students, for instance, as well as expensive ones that attract tourists with a higher budget. Hotels that are farther away, meanwhile, may seek to offer a more predictable, stable quality of service, as extremely cheap or expensive hotels would probably not attract many tourists there (tourists on a tight budget would factor in transportation costs, for instance).</li>
<li>The medians of the two bins are closer to each other than the means: 111 vs. 83 dollars (a gap of 28), compared with 142 vs. 92 (a gap of 50). This means that the extremely high values found in 1 are pulling its mean upward. Within each bin, the mean exceeds the median by about 31 dollars in 1 but only by about 9 dollars in 2, which suggests that prices in 2 are less skewed than in 1.</li>
<li>The standard deviation shows how much more spread out prices are in 1: 115 dollars, against only 32 in 2. What does that tell us? Well, if you are looking for a predictable deal, searching among hotels not that close to the city center may return results with much less variability. With a mean of 92 dollars and a standard deviation of 32, many far hotels can be expected to cost roughly between 60 and 125 dollars.</li>
</ul>
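
<p>The arithmetic behind the last two observations can be sketched as follows. The numbers are copied from the price summary table above; the <code>stats</code> and <code>results</code> names are only illustrative, and "mean plus or minus one standard deviation" is used as a rough rule of thumb, not as a formal interval.</p>

```python
# Values copied from the price summary table above
stats = {
    1: {"mean": 142.419192, "median": 111.0, "std": 115.539349},
    2: {"mean": 92.469697, "median": 83.0, "std": 32.245319},
}

results = {}
for bin_id, s in stats.items():
    results[bin_id] = {
        # A positive gap (mean above median) hints at a right-skewed distribution
        "mean_minus_median": s["mean"] - s["median"],
        # One standard deviation below and above the mean
        "low": s["mean"] - s["std"],
        "high": s["mean"] + s["std"],
    }
    print(bin_id, {k: round(v, 1) for k, v in results[bin_id].items()})
```

<p>For bin 2 this gives a band of roughly 60 to 125 dollars. For bin 1 the lower bound (about 27 dollars) falls below the observed minimum of 33 dollars, a reminder that this rule of thumb is a rough guide at best when the distribution is skewed.</p>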
<h2><b>4.</b> Plotting charts</h2>
<h3>4.1 Stacked bar charts</h3>

In [18]:
vienna_cut.head()

Unnamed: 0,country,city_actual,rating_count,center1label,center2label,neighbourhood,price,city,stars,ratingta,...,year,month,weekend,holiday,distance,distance_alter,accommodation_type,nnights,rating,distance2bins
1,Austria,Vienna,189.0,City centre,Donauturm,17. Hernals,81,Vienna,4.0,3.5,...,2017,11,0,0,1.7,3.8,Hotel,1,3.9,1.0
2,Austria,Vienna,53.0,City centre,Donauturm,Alsergrund,85,Vienna,4.0,3.5,...,2017,11,0,0,1.4,2.5,Hotel,1,3.7,1.0
3,Austria,Vienna,55.0,City centre,Donauturm,Alsergrund,83,Vienna,3.0,4.0,...,2017,11,0,0,1.7,2.5,Hotel,1,4.0,1.0
4,Austria,Vienna,33.0,City centre,Donauturm,Alsergrund,82,Vienna,4.0,3.5,...,2017,11,0,0,1.2,2.8,Hotel,1,3.9,1.0
6,Austria,Vienna,57.0,City centre,Donauturm,Alsergrund,103,Vienna,4.0,3.5,...,2017,11,0,0,0.9,2.4,Hotel,1,3.9,1.0


In [18]:
# Count hotels for each (distance bin, stars) combination
df1 = (
    vienna_cut.groupby(["distance2bins", "stars"])
    .size()
    .reset_index(name="Count")
)

# Total number of hotels in each distance bin
group_counts = df1.groupby("distance2bins")["Count"].sum()

# Share of each star category within its own distance bin
df1["Percent"] = (df1["Count"] / df1["distance2bins"].map(group_counts)).round(5)

# Order star categories from highest to lowest for the stacked bars;
# take them from the hotels-only counts rather than from the unfiltered df
df1 = df1.assign(
    stars=pd.Categorical(
        df1["stars"], categories=sorted(df1["stars"].unique(), reverse=True)
    )
)
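
<p>As a quick cross-check of the within-bin shares, <code>pd.crosstab</code> with <code>normalize="index"</code> should produce the same kind of percentages in wide form. This is a minimal sketch on made-up data (the <code>toy</code> frame is hypothetical, not part of the exercise).</p>

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.DataFrame(
    {
        "distance2bins": [1, 1, 1, 1, 2, 2],
        "stars": [3.0, 4.0, 4.0, 5.0, 3.0, 3.0],
    }
)

# normalize="index" divides each row by its total, so each bin's shares sum to 1
shares = pd.crosstab(toy["distance2bins"], toy["stars"], normalize="index")
print(shares)
```

<p>Each row is one distance bin and each column one star category. The long format built in the cell above is still the more convenient input for a stacked bar chart, so the wide table is mainly useful for checking that the shares in each bin add up to 1.</p>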

In [19]:
df1

Unnamed: 0,distance2bins,stars,Count,Percent
0,1.0,1.0,1,0.00505
1,1.0,2.0,11,0.05556
2,1.0,2.5,2,0.0101
3,1.0,3.0,57,0.28788
4,1.0,3.5,10,0.05051
5,1.0,4.0,91,0.4596
6,1.0,4.5,3,0.01515
7,1.0,5.0,23,0.11616
8,2.0,2.0,3,0.04545
9,2.0,3.0,31,0.4697
