The food and drink recommendations provided by Google Maps and Yelp are based mostly on specific locations, ratings, and reviews. It is difficult to see whether there are any new restaurants to try or to holistically assess all of the restaurants in an area. I want to create an ELT pipeline that will enable me to study Yelp's California data, find places that I have not yet tried, and see if I can build a robust restaurant recommendation system competitive with Yelp's and Google's.
The following technologies are used to build this project:
- Yelp Fusion API
- Prefect for workflow orchestration
- Google Compute Engine VM Instance
- Docker
- Terraform as Infrastructure-as-Code (IaC) tool to set up Cloud environment
- Google Cloud Storage (GCS) as Data Lake
- Google BigQuery for Data Warehouse
- dbt for transformation and data modeling
- Google Looker Studio for visualizations
Yelp API Response Data Structure:
Column | Data Type | Description |
---|---|---|
id | String | Unique ID for every restaurant regardless of chain |
alias | String | Unique, human-readable alias built from the restaurant's name, location, and a number |
name | String | Name of restaurant |
image_url | String | URL for first image that shows up when searched on Yelp |
is_closed | Boolean | True for closed; False for open |
url | String | URL of the restaurant's Yelp page |
review_count | Integer | Total number of reviews |
categories | List of Dictionary | List of categories. Example: [{'alias': 'mexican', 'title': 'Mexican'}] |
rating | Float | Rating out of 5, with 5 being the best |
coordinates | Dictionary | Coordinates of the restaurant; latitude and longitude are keys |
transactions | List | Example: delivery, pickup, None |
price | String | '$', '$$', '$$$', '$$$$' depending on price range in the area |
location | Dictionary | Address in dictionary form |
phone | String | Phone number of the restaurant |
display_phone | String | Phone number formatted for display |
distance | Float | Distance (in meters) from the location used to search |
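For context, here is a minimal sketch of how records with this structure can be pulled from the Yelp Fusion business search endpoint. The endpoint and field names follow Yelp's documented API; the search term and location shown are placeholder values, not necessarily what this repo's flows use.

```python
import os
import requests

# Yelp Fusion business search endpoint; responses contain the fields listed above
YELP_SEARCH_URL = "https://api.yelp.com/v3/businesses/search"

def search_businesses(term: str, location: str, offset: int = 0, limit: int = 50) -> list:
    """Fetch one page (up to 50 records) of businesses matching a term and location."""
    headers = {"Authorization": f"Bearer {os.environ['YELP_API_KEY']}"}
    params = {"term": term, "location": location, "limit": limit, "offset": offset}
    response = requests.get(YELP_SEARCH_URL, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["businesses"]

# Example usage with placeholder arguments:
# businesses = search_businesses("Restaurants", "Los Angeles, CA")
```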
Final DataFrame Data Structure:
Column | Data Type |
---|---|
alias | String (object) |
name | String (object) |
url | String (object) |
review_count | int64 |
categories | List (object) |
ethnic_category | List (object) |
rating | float64 |
price | String (object) |
latitude | float64 |
longitude | float64 |
city | String (object) |
address | String (object) |
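Getting from the raw API records to this final layout is essentially a flattening step. Below is a minimal, hypothetical sketch of that transformation (the actual Prefect flow may differ, and the derivation of `ethnic_category` is not documented here, so it is shown as a placeholder copy of the category titles).

```python
import pandas as pd

def flatten_businesses(businesses: list) -> pd.DataFrame:
    """Flatten raw Yelp business records into the final DataFrame columns listed above."""
    rows = []
    for b in businesses:
        categories = [c["title"] for c in b.get("categories", [])]
        rows.append({
            "alias": b.get("alias"),
            "name": b.get("name"),
            "url": b.get("url"),
            "review_count": b.get("review_count"),
            "categories": categories,
            "ethnic_category": categories,  # placeholder: the real pipeline derives this from categories
            "rating": b.get("rating"),
            "price": b.get("price"),
            "latitude": (b.get("coordinates") or {}).get("latitude"),
            "longitude": (b.get("coordinates") or {}).get("longitude"),
            "city": (b.get("location") or {}).get("city"),
            "address": (b.get("location") or {}).get("address1"),
        })
    return pd.DataFrame(rows)
```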
Check out the interactive dashboard here
Future improvements:
- Use Google Compute Engine and Docker to host this data pipeline in the cloud (Complete)
- Implement dbt testing for data integrity in terms of locations
- Implement dbt incremental models for adding new data or updating old data
- Optimize Dockerfile or use Docker Compose to streamline the dockerization process
- Incorporate Yelp reviews to build an NLP model, replace `null` values in the price column with the model's output, and create a column with restaurant sentiment (a rough sketch of this idea follows below)
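A purely illustrative sketch of that last idea (no such model exists in this repo yet; the training data and model choice below are placeholder assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: review text paired with the price bucket of the reviewed restaurant
train_reviews = ["great tacos, cheap and fast", "elegant tasting menu, pricey but worth it"]
train_prices = ["$", "$$$$"]

# Simple TF-IDF + logistic regression baseline for imputing missing price values from reviews
price_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
price_model.fit(train_reviews, train_prices)

# The predicted bucket could then replace nulls in the price column;
# a second classifier trained on sentiment labels could populate the sentiment column
print(price_model.predict(["affordable burritos, huge portions"]))
```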
Prerequisites: Ensure you have Google Cloud Platform, dbt Cloud, and Prefect Cloud accounts. To run the project, use the following steps:
- Set up your Google Cloud environment & generate a Yelp API key
- Create a Google Cloud Platform project
- On GCP, create a service account with GCE, GCS, and BigQuery admin privileges. For a walkthrough, click here
- Configure Identity and Access Management (IAM) for the service account, giving it the following privileges:
  - Viewer
  - Storage Admin
  - Storage Object Admin
  - BigQuery Admin
- Generate a Yelp API key by creating an account with Yelp and generating a free user API key (limited to 5,000 calls per day)
- Create a VM (Virtual Machine) with machine type `e2-standard-4` in the region that is closest to you. Choose Operating system: Ubuntu with Version: Ubuntu 20.04 LTS and Size: 40 GB. For this project, region `us-west2` (Los Angeles) was chosen.
- Set up the VM. For a walkthrough, click here
- After the SSH key is generated and the setup is performed as shown in the video, you can easily log into the VM with a config file in the `.ssh` directory. You will be prompted for your passphrase after this is run.
# EXECUTE IN THE DIRECTORY WHERE THE SSH KEY IS
touch config
code config  # open VS Code to edit
In VS Code, edit the config file like below:
Host YOUR_HOST_NAME_HERE
    HostName YOUR_VM_EXTERNAL_IP_ADDRESS_HERE
    User YOUR_VM_USERNAME_HERE
    IdentityFile ~/YOUR_GCP_SSH_KEY_FILE_PATH_HERE
- Log into your GCP VM by executing `ssh YOUR_HOST_NAME_HERE` in the directory one level above `.ssh/` and typing the passphrase for that SSH key.
- After logging into the VM, install Anaconda using the following steps:
- Download Anaconda using `wget https://repo.anaconda.com/archive/Anaconda3-2023.07-2-Linux-x86_64.sh` or the latest version from this link.
- Run `bash Anaconda3-2023.07-2-Linux-x86_64.sh` to execute the downloaded file. Accept all the terms and conditions of the installation.
- Then run the code below in your VM CLI.
rm Anaconda3-2023.07-2-Linux-x86_64.sh
echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
# Reload default profile
source ~/.bashrc
conda update conda
- Install Docker, create a docker group user, & install Terraform by using the following steps:
sudo apt-get update  # if sudo is not installed, run apt-get -y install sudo and then sudo apt-get update
sudo apt-get install nano
sudo apt-get install docker.io
sudo groupadd docker
sudo gpasswd -a $USER docker
sudo service docker restart
- Log out with `logout` and log back in with `ssh YOUR_HOST_NAME_HERE`, assuming the `config` file is set up. This is so that your group membership is re-evaluated. If not, you can run `ssh -i ~/YOUR_GCP_SSH_KEY_FILE_PATH_HERE YOUR_VM_USERNAME_HERE@YOUR_VM_EXTERNAL_IP_ADDRESS_HERE`.
- Fork this repo, and clone it to your local environment.
git clone https://github.com/djeong95/Yelp_review_datapipeline.git
- Inside the VM, under the `/home/YOUR_USERNAME/Yelp_review_datapipeline/` directory, create both the `.env` and `SERVICE_ACCT_KEY.json` files.
touch .env
nano .env
touch SERVICE_ACCT_KEY.json
nano SERVICE_ACCT_KEY.json
- Inside the `.env` file, paste the information below.
YELP_API_KEY=YOUR_YELP_API_KEY_HERE
GCP_SERVICE_ACCOUNT_FILE=YOUR_SERVICE_ACCOUNT_KEY_JSONFILEPATH_HERE
GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
- Inside the `SERVICE_ACCT_KEY.json` file, paste the information below from the service account key that you generated.
{
  "type": "service_account",
  "project_id": "YOUR_PROJECT_ID_HERE",
  "private_key_id": "YOUR_PRIVATE_KEY_ID_HERE",
  "private_key": "YOUR_PRIVATE_KEY_HERE",
  "client_email": "YOUR_CLIENT_EMAIL_HERE",
  "client_id": "YOUR_CLIENT_ID_HERE",
  "auth_uri": "YOUR_INFORMATION_HERE",
  "token_uri": "YOUR_INFORMATION_HERE",
  "auth_provider_x509_cert_url": "YOUR_INFORMATION_HERE",
  "client_x509_cert_url": "YOUR_INFORMATION_HERE",
  "universe_domain": "YOUR_DOMAIN_HERE"
}
- In your VM, you are now ready to build a Docker container from an image, which is defined using a Dockerfile.
# build docker image
docker build -t yelp-pipeline-production:latest .
# After the Docker image is built, run a container mounting the .env and SERVICE_ACCT_KEY.json files as volumes
docker run -it -v /home/YOUR_USERNAME/Yelp_review_datapipeline/.env:/opt/prefect/.env -v /home/YOUR_USERNAME/Yelp_review_datapipeline/SERVICE_ACCT_KEY.json:/opt/prefect/SERVICE_ACCT_KEY.json yelp-pipeline-production
- Install required tools and download & install Terraform.
cd /usr/local/bin/
TERRAFORM_VERSION="1.5.5"
apt-get update
apt-get install -y curl unzip
curl -LO "https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip"
unzip terraform_${TERRAFORM_VERSION}_linux_amd64.zip
rm terraform_${TERRAFORM_VERSION}_linux_amd64.zip
- Run Terraform to create infrastructure.
cd /opt/terraform_rev2/
- If needed, use `nano variables.tf` to change the `variables.tf` file with your corresponding variables (`apt-get install nano` might be needed). I would recommend leaving the names of the datasets and bucket as they are; otherwise, you will need to change them in the Prefect flows and dbt.
- To initialize, plan, and apply the infrastructure, adjust and run the following Terraform commands.
terraform init
terraform plan -var="project=YOUR_PROJECT_ID_HERE"
terraform apply -var="project=YOUR_PROJECT_ID_HERE"
- Type 'yes' when prompted.
- Set up your orchestration
- If you do not have a Prefect workspace, sign up for Prefect Cloud and create a workspace here.
- Generate a `PREFECT_API_KEY` and `PREFECT_API_URL` for logging in via the Docker container.
- Create the Prefect blocks via the cloud UI, or adjust the variables in `/prefect/prefect_create_blocks.py` and run:
cd /opt/prefect/
mkdir data/
PREFECT_API_KEY="[API-KEY]"
PREFECT_API_URL="https://api.prefect.cloud/api/accounts/[ACCOUNT-ID]/workspaces/[WORKSPACE-ID]"
prefect cloud login
python prefect/prefect_create_blocks.py
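Conceptually, the block-creation script registers a GCP credentials block and a GCS bucket block. The outline below is a hedged sketch using the `prefect-gcp` collection; the block names and bucket name are placeholders and may differ from the actual `prefect_create_blocks.py`.

```python
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Register a credentials block from the mounted service account key (path from the docker run step)
gcp_credentials = GcpCredentials(service_account_file="/opt/prefect/SERVICE_ACCT_KEY.json")
gcp_credentials.save("yelp-gcp-creds", overwrite=True)

# Register a GCS bucket block that the flows can load by name
GcsBucket(
    bucket="YOUR_GCS_BUCKET_NAME_HERE",
    gcp_credentials=gcp_credentials,
).save("yelp-gcs-bucket", overwrite=True)
```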
- Run `prefect deployment build` commands for deploying scheduled runs.
cd /opt/
prefect deployment build ./prefect/yelp_api_to_gcs.py:etl_api_to_gcs -n "Parameterized ETL from API to GCS"
prefect deployment build ./prefect/yelp_gcs_to_bq.py:etl_gcs_to_bq -n "Parameterized ETL from GCS datalake to BigQuery"
- Edit parameters as you see fit in the `etl_api_to_gcs-deployment.yaml` and `etl_gcs_to_bq-deployment.yaml` files generated above, using `nano etl_api_to_gcs-deployment.yaml`. Example:
parameters: {"terms": ['Restaurants'], "start_slice": 0, "end_slice": 459}
- To execute the flow, run the following commands in two different terminals:
prefect deployment apply etl_api_to_gcs-deployment.yaml
prefect deployment apply etl_gcs_to_bq-deployment.yaml
prefect agent start --work-queue "default" # run this to schedule your run
After running the flow, you will find the data in BigQuery under `yelp_data_raw.{term}_data_raw`. The flow will take around 120 minutes to complete, though this will vary depending on the term you run. Note that a free Yelp API account is limited to 5,000 calls per day.
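To sanity-check that the raw table landed, one option is to query it from Python with the BigQuery client. This is a hedged example; the table name below assumes the `Restaurants` term lower-cased, so adjust it to whatever term you ran.

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up the project and credentials from the environment

# Row count for one raw table; names follow the yelp_data_raw.{term}_data_raw pattern
query = "SELECT COUNT(*) AS n_rows FROM `yelp_data_raw.restaurants_data_raw`"
for row in client.query(query).result():
    print(f"rows loaded: {row.n_rows}")
```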
- Data transformation and modeling using dbt
- Follow the steps highlighted here to create an account and a dbt Cloud project. Skip the "Create a BigQuery service account" step, as you already have a service account key.
- Below is the lineage graph that describes the data transformation performed by dbt.
- Execute the following command:
dbt build --var 'is_test_run: false'
You will get 6 tables in the `yelp_data_dbt` dataset:
- dim_coordinates
- dim_counties
- fact_yelp_all
- stg_cafes
- stg_desserts
- stg_restaurants
- You can now query the data and connect it to Looker Studio to visualize it. When connecting to the data source, use the `fact_yelp_all` table to build the data source in Looker; don't use the partitioned table, as you won't get any data in your report.
- Go to Looker Studio → create → BigQuery → choose your project, dataset & transformed table.