# Hotel reviews and geographical, demographic, economic data
### Data Engineering Capstone Project

#### Project Summary
This is a data engineering (ETL) project that combines hotel-related data from disparate sources and adds supplementary data to enable several kinds of analysis. The data sets used (explained in detail in the next section) include hotel review data scraped from Booking.com (with some sentiment analysis features included), data from Google Maps, airport data, and tourism, economic, financial and political data from UN agencies and a few other sources. A number of analyses can be done: based on reviewers' nationalities (and some political/economic indicators for their countries), on the hotel's country, on the number of reviews for a hotel, the number of ratings, and so on. For this project, the tools used are Apache Spark (PySpark), Pandas, Amazon S3, Amazon Athena, Airflow and Amazon Redshift.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [23]:
! pip install jsonlines

Collecting jsonlines
  Downloading https://files.pythonhosted.org/packages/4f/9a/ab96291470e305504aa4b7a2e0ec132e930da89eb3ca7a82fbe03167c131/jsonlines-1.2.0-py2.py3-none-any.whl
Installing collected packages: jsonlines
Successfully installed jsonlines-1.2.0


In [24]:
# Do all imports and installs here
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import jsonlines

In [3]:
spark = SparkSession.builder\
                     .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:2.7.0")\
                     .getOrCreate()

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

### The main goals of my project are: 
1. To get a data set of hotel reviews in Europe and supplement it with data from Google Maps (Google Local). This creates a much richer data set, as it allows us to use things like the hotel's opening hours.
2. To add information about the nearest airport to each hotel.
3. To enable more kinds of analysis than the sentiment analysis the original reviews data set was meant for. I use data from the UN and other agencies about the countries (both the hotels' countries and the reviewers' countries) to see whether there are patterns in reviews based on the reviewer's nationality and its relation to the country where the hotel is located. For this, I use tourism, economic, political and demographic data.
4. To build a dimensional model with a central 'reviews' factless fact table and dimensions related to hotels, airports, addresses and time. An additional fact table containing various measures related to countries will also be present.

### Overview of end solution and tools used

I use the following tools:
1. Apache Spark: to clean and join the data sets. This has to be done in Spark (or another similar library which allows Python UDFs) because our data sets cannot be joined directly on any field. We need a fuzzy match for the location data, which is much easier to achieve in Python. Spark makes it easy to prepare all the data sets so that they're ready to be staged in S3 and then Redshift.
2. Pandas: the smaller data sets are wrangled, cleaned and combined using Pandas. These are the data sets from the UN, IMF, UNDP, UNWTO, Freedom House and Our World in Data.
3. Redshift: our data warehouse is stored in Amazon Redshift.
4. Airflow: Airflow is used to orchestrate the data flow after the initial data sets have been reduced and staged in S3.
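
The fuzzy match mentioned under Apache Spark could look roughly like the sketch below. This is only an illustration, not the matching rule actually used in the project: the function names, the coordinate tolerance and the name-similarity threshold are all assumptions, and the coordinates in the example are made up.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_same_place(name_a, lat_a, lng_a, name_b, lat_b, lng_b,
                  max_delta_deg=0.005, min_name_ratio=0.6):
    """Hypothetical rule: coordinates agree within a small tolerance
    AND the two names are reasonably similar."""
    close = (abs(lat_a - lat_b) <= max_delta_deg
             and abs(lng_a - lng_b) <= max_delta_deg)
    return close and name_similarity(name_a, name_b) >= min_name_ratio


# Made-up coordinates standing in for the same hotel in two data sets
same = is_same_place("Hotel Arena", 52.3605, 4.9159,
                     "Hotel Arena Amsterdam", 52.3608, 4.9155)
different = is_same_place("Hotel Arena", 52.3605, 4.9159,
                          "Portofino", 43.22776, 44.762726)
print(same, different)
```

In Spark, a function like this could be wrapped with `pyspark.sql.functions.udf` and applied to candidate pairs produced by a coarser join (for example on rounded coordinates), so that the expensive comparison only runs on nearby places.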

### Data used

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

The data sets used are described in some detail below. I have included a table with the number of rows in each data set and some additional details at the end of this section. 

1. **[515k Hotel Reviews in Europe](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe "Kaggle 515k reviews")**: 
This data set by Jiashen Liu from 2017 contains European hotel review data scraped from Booking.com. The data set was originally created mainly for sentiment analysis, and contains fields such as positive and negative words in the reviews. Other interesting fields in terms of this project are GPS (which I will use to join to data sets 2 and 3 below), address and reviewers' nationality (which I will use to join some additional data sets: data sets 4-17).

The complete list of fields (taken mostly from the Kaggle page, but with some additional explanation by me) is given below:
* Hotel_Address: Address of hotel. This is not separated into country, city etc, and is sometimes random
* Review_Date: Date when reviewer posted the corresponding review.
* Average_Score: Average Score of the hotel, calculated based on the latest comment in the last year.
* Hotel_Name: Name of Hotel
* Reviewer_Nationality: Nationality of Reviewer
* Negative_Review: Negative part of the review. If the reviewer does not give a negative review, then the value should be 'No Negative'.
* Review_Total_Negative_Word_Counts: Total number of words in the negative review.
* Positive_Review: Positive part of the review. If the reviewer does not give a positive review, then the value should be 'No Positive'.
* Review_Total_Positive_Word_Counts: Total number of words in the positive review.
* Reviewer_Score: Score the reviewer has given to the hotel, based on his/her experience.
* Total_Number_of_Reviews_Reviewer_Has_Given: Number of reviews the reviewer has given in the past.
* Total_Number_of_Reviews: Total number of valid reviews the hotel has.
* Tags: Tags the reviewer gave the hotel.
* days_since_review: Duration between the review date and the scrape date. This field is not useful for my project.
* Additional_Number_of_Scoring: The data set's author has added the number of ratings for the hotels here, i.e., the number of users who have rated a hotel but not actually written a review.
* lat: Latitude of the hotel
* lng: Longitude of the hotel
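
A few of these fields need light normalisation before they can be joined or aggregated. The sketch below is a minimal illustration on a plain Python dict rather than the project's actual cleaning code. It assumes that Review_Date is in month/day/year order (which should be checked against the data) and uses a made-up positive review text.

```python
from datetime import datetime


def clean_review(row: dict) -> dict:
    """Return a copy of a review row with a few fields normalised."""
    cleaned = dict(row)
    # Nationalities appear padded with spaces, e.g. ' Russia '
    cleaned["Reviewer_Nationality"] = row["Reviewer_Nationality"].strip()
    # Assumption: dates such as '8/3/2017' are month/day/year
    cleaned["Review_Date"] = datetime.strptime(
        row["Review_Date"], "%m/%d/%Y").date()
    # Turn the documented placeholders into real missing values
    placeholders = {"Negative_Review": "No Negative",
                    "Positive_Review": "No Positive"}
    for field, placeholder in placeholders.items():
        text = row[field].strip()
        cleaned[field] = None if text == placeholder else text
    return cleaned


sample = {
    "Reviewer_Nationality": " Russia ",
    "Review_Date": "8/3/2017",
    "Negative_Review": "No Negative",
    "Positive_Review": " Friendly staff ",  # illustrative text only
}
result = clean_review(sample)
print(result)
```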

2. **[Google Local Data Set](https://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local "Google Local data set")**: 
This is a data set from Julian McAuley and others which contains data from Google Maps. The authors released the data set as part of 2 papers: ['Translation-based recommendation'](http://cseweb.ucsd.edu/~jmcauley/pdfs/recsys17.pdf) and ['Translation-based Factorization Machines for Sequential Recommendation'](http://cseweb.ucsd.edu/~jmcauley/pdfs/recsys18a.pdf). While the first data set I chose was meant for sentiment analysis, this data set is meant for recommender systems.

The Google Local data set is actually made up of 3 different data sets: one about businesses on Google Maps, one about Google Maps users (local guides), and one containing Google Maps reviews data. For this project, I use only the Google Maps businesses data, so that is the only data set explained below.
The Google Local Businesses data contains data which has been added by the businesses themselves or the Google Local Guide community on Google Maps. 

The data is nominally in JSON format (it is actually invalid JSON; more about that in the next section). A sample record is given below:

{"name": "Portofino", "price": null, "address": ["\u0443\u043b. \u0422\u0443\u0442\u0430\u0435\u0432\u0430, 1", "Nazran, Ingushetia, Russia", "366720"], "hours": [["Monday", [["9:30 am--9:00 pm"]]], ["Tuesday", [["9:30 am--9:00 pm"]]], ["Wednesday", [["9:30 am--9:00 pm"]], 1], ["Thursday", [["9:30 am--9:00 pm"]]], ["Friday", [["9:30 am--9:00 pm"]]], ["Saturday", [["9:30 am--9:00 pm"]]], ["Sunday", [["9:30 am--9:00 pm"]]]], "phone": "8 (963) 173-38-38", "closed": false, "gPlusPlaceId": "109810290098030327104", "gps": [43.22776, 44.762726]}

The fields of the data set are:
* name: the name of the hotel
* price: an indicator of how expensive the hotel is. Possible values are '$', '$$', '$$$'. The authors of this data set are American, so this is in dollars rather than euros or any other currency. Null if not present
* address: a JSON array with the country, state, city and pincode. However, this data (as in the first data set) is very unstructured. The city might come after the country at times, the pincode may come at the end or in the middle, and, very annoyingly, the authors have not included a country for American addresses. This project only uses European addresses.
* hours: a nested JSON array with timings for each day in turn. Null if not present
* phone: phone number, if present
* closed: True or False. I don't use this field in my project.
* gPlusPlaceId: An id for the place on Google Maps/Google Plus
* gps: A JSON array containing the latitude and longitude. Note that the same place has different gps coordinates in different data sets. The 515k reviews and Google Local data set will have different coordinates for the same place, and we therefore need a fuzzy join of some kind.
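
To make the nested structure concrete, here is a small sketch that flattens one record into columns. It uses a shortened copy of the sample record above (address and most days omitted) and assumes the record parses as valid JSON, which the full data set does not always do. The function name and output keys are hypothetical.

```python
import json

# Shortened version of the sample record shown above
raw = ('{"name": "Portofino", "price": null, '
       '"hours": [["Monday", [["9:30 am--9:00 pm"]]], '
       '["Wednesday", [["9:30 am--9:00 pm"]], 1]], '
       '"gPlusPlaceId": "109810290098030327104", '
       '"gps": [43.22776, 44.762726]}')


def flatten_business(record: dict) -> dict:
    """Flatten the nested 'hours' and 'gps' arrays of one business record."""
    gps = record.get("gps") or [None, None]
    hours = {}
    for entry in record.get("hours") or []:
        # Some entries carry a trailing element (e.g. the 1 after Wednesday);
        # only the day name and the list of time slots are used here.
        day, slots = entry[0], entry[1]
        hours[day] = "; ".join(slot[0] for slot in slots)
    return {
        "name": record.get("name"),
        "price": record.get("price"),
        "place_id": record.get("gPlusPlaceId"),
        "lat": gps[0],
        "lng": gps[1],
        "hours": hours,
    }


flat = flatten_business(json.loads(raw))
print(flat)
```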

3. **[Airport codes Data Set](https://datahub.io/core/airport-codes#data "Airport codes data set")**:

This data set contains information about airports. Most of the fields are self-explanatory.

* ident: identifier
* type: type of airport
* name: airport name
* elevation_ft: elevation of airport in feet
* continent: continent where airport is situated
* iso_country: ISO country code where the airport is situated.
* iso_region_code: region where airport is located.
* municipality
* gps_code
* iata_code
* local_code
* coordinates: latitude, longitude pair.
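
Finding the nearest airport to a hotel comes down to a great-circle distance calculation. The sketch below uses the haversine formula on a plain list of dicts. It assumes the coordinates field is a comma-separated string in the latitude, longitude order described above (worth verifying against the file), and the airport rows are made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def parse_coordinates(text: str):
    """Split a 'lat, lng' string into two floats (order is an assumption)."""
    lat, lng = (float(part) for part in text.split(","))
    return lat, lng


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometres."""
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_airport(hotel_lat, hotel_lng, airports):
    """Return the airport row closest to the given hotel coordinates."""
    return min(
        airports,
        key=lambda ap: haversine_km(hotel_lat, hotel_lng,
                                    *parse_coordinates(ap["coordinates"])),
    )


# Made-up airport rows, not taken from the data set
airports = [
    {"ident": "AAA", "name": "Airport A", "coordinates": "52.31, 4.76"},
    {"ident": "BBB", "name": "Airport B", "coordinates": "48.11, 16.57"},
]
closest = nearest_airport(52.36, 4.92, airports)
print(closest["ident"])
```

With tens of thousands of airports and far fewer distinct hotels, a brute-force scan like this is usually acceptable; for larger inputs, filtering airports by country or type first would cut the work considerably.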

Details about the 3 main data sets explained above are given in the table below.


| Data set        | Num_rows           |   Data Source | Comment | Description | Year |
| ------------- |:-------------:| -----:| ----------:| ----------------------:| ------:|
|[515k Hotel Reviews in Europe](https://www.kaggle.com/jiashenliu/515k-hotel-reviews-data-in-europe "Kaggle 515k reviews")|515,738|Kaggle| Data exploration done in Explore_reviews.ipynb | Hotel review data scraped from Booking.com|2017 |
|[Google Local Businesses Data Set](https://cseweb.ucsd.edu/~jmcauley/datasets.html#google_local "Google Local data set")|3,114,353 (464,906 for Spain, France, UK, Italy, Austria, Netherlands)|J. McAuley and others| Data exploration done in Explore_google_places.ipynb | Data about businesses in Google Maps |2018 |
|[Airport codes Data Set](https://datahub.io/core/airport-codes#data "Airport codes data set")|55,075|Datahub.io| Data exploration done in Explore_airport_codes.ipynb | Simple airport codes data |2018 |

4. **Additional data sets (data sets 4 through 17)**:

Details about these data sets are given in the table below. Most of them have a simple structure: only one or two columns, which can be identified from the name of the data set. I have removed any additional columns which aren't required. They are pretty small too. I have combined these disparate data sets in a separate Jupyter notebook: combine_country_data.ipynb. A short description of the data sets is contained in 'Description of supplementary data sets.xlsx'. This file contains more information related to the data source and number of rows.


| Data set        | Num_rows           |   Data Source | Comment | Description | Year |
| ------------- |:-------------:| -----:| ----------:| ----------------------:| ------:|
| [Tourist-Visitors Arrival and Expenditure](http://data.un.org/)     | 2246 (whittled down to 220) | UNWTO | Found under 'Tourism and transport' after following the link | Data related to different countries' spending on tourism and the no. of inbound visitors/tourists |2018 |
| [Exchange rates](http://data.un.org/)     | 3408 (whittled down to 234) | IMF | Found under 'Finance' after following the link | Data related to exchange rates at the end of 2018 | 2018 |
| [GNI Per Capita](http://hdr.undp.org/en/data)     | 191 | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the Gross National Income per capita in dollars (2011 PPP) | 2018 |
| [GDP Per Capita](http://hdr.undp.org/en/data)     | 192 (whittled down to 220) | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the Gross Domestic Product per capita in dollars (2011 PPP) | 2018 |
| [Internet Users As Percentage of Population](http://hdr.undp.org/en/data)     | 195 | UNDP | Found under dimension='Mobility and Communication' after following the link | Gives the percentage of the total population who are internet users | 2018 |
| [Mobile Phone Subscriptions](http://hdr.undp.org/en/data)     | 195 | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the mobile phone subscriptions per 100 people (>100: people have >1 mobile connection on average) | 2018 |
| [Net Migration Rate](http://hdr.undp.org/en/data)     | 191 | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the net migration rate (per 1000 people) | 2020 |
| [Population](http://hdr.undp.org/en/data)     | 195 | UNDP | Found under dimension='Demography' after following the link | Gives the total population (in millions) | 2018 |
| [Urban Population Percentage](http://hdr.undp.org/en/data)     | 195 | UNDP | Found under dimension='Human Development Index' after following the link | Gives the urban population as a percentage of the total population | 2018 |
| [Human Development Index (HDI)](http://hdr.undp.org/en/data)     | 195 | UNDP | Found under dimension='Income/composition of resources' after following the link | Gives the Human Development Index and the corresponding rank in 2018 | 2018 |
| [2020_Country_and_Territory_Ratings_and_Statuses_FIW2020](https://freedomhouse.org/report/freedom-world)     | 205 (whittled down to 195) | Freedom House | I have included only the latest data, not all the data from 1973-2020| Gives 2 indicators of freedom: Political Rights and Civil Liberties, both of which are scored on a 1-7 scale. A column called Status has values corresponding to 'Free', 'Partly Free', 'Not Free'. | 2020 |
| [2020_List_of_Electoral_Democracies_FIW_2020](https://freedomhouse.org/report/freedom-world)     | 195 | Freedom House | I have included only the latest data| Gives a list of countries and whether or not they are democracies: Yes or No | 2020 |
| [human-rights-score-vs-political-regime-type](https://freedomhouse.org/report/freedom-world)     | 35333 (whittled down to 196) | Our World in Data| -| Gives a list of countries along with their  political regime type (score) and human rights protection score. The political regime score ranges from -10 (autocracy) to +10 (full democracy). The Human Rights Scores (the higher the better) were first developed by Schnakenberg and Farris (2014) and subsequently updated by Farris (2019). |2015 |
| [Country List ISO](https://datahub.io/core/country-list#resource-data)     | 249 | datahub.io| -| Contains a list of countries along with their 2-letter ISO code. |- |
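
Combining these small country-level data sets follows one pattern: reduce each source to one row per country, then join everything onto the ISO country list. The Pandas sketch below illustrates the idea only. The column names and numeric values are made up, and the real logic lives in combine_country_data.ipynb.

```python
import pandas as pd

# Illustrative long-format source with several years per country
population = pd.DataFrame({
    "country": ["Austria", "Spain", "Austria"],
    "year": [2017, 2018, 2018],
    "population_millions": [1.0, 2.0, 3.0],  # placeholder values
})

# Keep only the most recent row per country
latest = (population.sort_values("year")
                    .drop_duplicates("country", keep="last"))

# Country list with ISO codes, used as the backbone of the join
iso = pd.DataFrame({"country": ["Austria", "Spain"],
                    "code": ["AT", "ES"]})

# A left join keeps every country, even if an indicator is missing
combined = iso.merge(latest, on="country", how="left")
print(combined)
```

In practice, country names are spelled differently across sources, so a name-to-code mapping step is usually needed before a join like this works reliably.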

In [1]:
# Read in the data here
df_hotel_reviews = spark.read.csv('Hotel_Reviews.csv', header=True, inferSchema=True)
df_hotel_reviews.printSchema()



In [8]:
df_hotel_reviews.take(4)

[Row(Hotel_Address=' s Gravesandestraat 55 Oost 1092 AA Amsterdam Netherlands', Additional_Number_of_Scoring=194, Review_Date='8/3/2017', Average_Score=7.7, Hotel_Name='Hotel Arena', Reviewer_Nationality=' Russia ', Negative_Review=' I am so angry that i made this post available via all possible sites i use when planing my trips so no one will make the mistake of booking this place I made my booking via booking com We stayed for 6 nights in this hotel from 11 to 17 July Upon arrival we were placed in a small room on the 2nd floor of the hotel It turned out that this was not the room we booked I had specially reserved the 2 level duplex room so that we would have a big windows and high ceilings The room itself was ok if you don t mind the broken window that can not be closed hello rain and a mini fridge that contained some sort of a bio weapon at least i guessed so by the smell of it I intimately asked to change the room and after explaining 2 times that i booked a duplex btw it costs t


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Performing cleaning tasks here





### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
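
As a rough illustration of the kinds of checks listed above, the sketch below implements a count check, a uniqueness check and a not-null check over plain lists of dicts. The function names and the hotel_id key are hypothetical. In the actual pipeline, equivalent checks would more likely run as SQL against the warehouse.

```python
from collections import Counter


def check_row_counts(source_count: int, loaded_count: int) -> None:
    """Source/count check: the load should not silently drop rows."""
    if loaded_count != source_count:
        raise ValueError(
            f"Expected {source_count} rows, loaded {loaded_count}")


def check_unique(rows, key) -> None:
    """Integrity check: the given key must not repeat."""
    counts = Counter(row[key] for row in rows)
    duplicates = [value for value, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(f"Duplicate values for {key}: {duplicates}")


def check_not_null(rows, key) -> None:
    """Integrity check: the given key must always be present."""
    missing = sum(1 for row in rows if row.get(key) is None)
    if missing:
        raise ValueError(f"{missing} rows have no value for {key}")


# Hypothetical rows standing in for a loaded dimension table
hotels = [{"hotel_id": 1, "name": "Hotel Arena"},
          {"hotel_id": 2, "name": "Portofino"}]
check_row_counts(source_count=2, loaded_count=len(hotels))
check_unique(hotels, "hotel_id")
check_not_null(hotels, "name")
print("all checks passed")
```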
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.