The food and drink recommendations provided by Google Maps and Yelp are based mostly on specific locations, ratings, and reviews. It is difficult to see whether there are any new restaurants to try or to holistically assess all of the restaurants in an area. I want to create an ELT pipeline that will enable me to study Yelp's California data, find places that I have not yet tried, and see if I can build a robust restaurant recommendation system competitive with Yelp's and Google's.
The following technologies are used to build this project:
- Yelp Fusion API
- Prefect for workflow orchestration
- Google Compute Engine VM Instance
- Docker
- Terraform as Infrastructure-as-Code (IaC) tool to set up Cloud environment
- Google Cloud Storage (GCS) as Data Lake
- Google BigQuery for Data Warehouse
- dbt for transformation and data modeling
- Google Looker Studio for visualizations
Yelp API Response Data Structure:
Column | Data Type | Description |
---|---|---|
id | String | Unique ID for every restaurant regardless of chain |
alias | String | Unique, human-readable alias built from the restaurant's name, location, and a number |
name | String | Name of restaurant |
image_url | String | URL for first image that shows up when searched on Yelp |
is_closed | Boolean | True for closed; False for open |
url | String | URL of the restaurant's Yelp page |
review_count | Integer | Total number of reviews |
categories | List of Dictionary | List of categories. Example: [{'alias': 'mexican', 'title': 'Mexican'}] |
rating | Float | Rating out of 5, with 5 being the best |
coordinates | Dictionary | Coordinates of the restaurant; latitude and longitude are keys |
transactions | List | Example: delivery, pickup, None |
price | String | '$', '$$', '$$$', '$$$$' depending on price range in the area |
location | Dictionary | Address in dictionary form |
phone | String | Phone number of the restaurant |
display_phone | String | Phone number formatted for display |
distance | Float | Distance (in meters) from the location used to search |
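For context, here is a minimal sketch of how records with this structure can be pulled from the Yelp Fusion business search endpoint. The endpoint and field names follow Yelp's documented API; the search term and location shown are placeholder values, not necessarily what this repo's flows use.

```python
import os
import requests

# Yelp Fusion business search endpoint; responses contain the fields listed above
YELP_SEARCH_URL = "https://api.yelp.com/v3/businesses/search"

def search_businesses(term: str, location: str, offset: int = 0, limit: int = 50) -> list:
    """Fetch one page (up to 50 records) of businesses matching a term and location."""
    headers = {"Authorization": f"Bearer {os.environ['YELP_API_KEY']}"}
    params = {"term": term, "location": location, "limit": limit, "offset": offset}
    response = requests.get(YELP_SEARCH_URL, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["businesses"]

# Example usage with placeholder arguments:
# businesses = search_businesses("Restaurants", "Los Angeles, CA")
```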
Final DataFrame Data Structure:
Column | Data Type |
---|---|
alias | String (object) |
name | String (object) |
url | String (object) |
review_count | int64 |
categories | List (object) |
ethnic_category | List (object) |
rating | float64 |
price | String (object) |
latitude | float64 |
longitude | float64 |
city | String (object) |
address | String (object) |
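Getting from the raw API records to this final layout is essentially a flattening step. Below is a minimal, hypothetical sketch of that transformation (the actual Prefect flow may differ, and the derivation of `ethnic_category` is not documented here, so it is shown as a placeholder copy of the category titles).

```python
import pandas as pd

def flatten_businesses(businesses: list) -> pd.DataFrame:
    """Flatten raw Yelp business records into the final DataFrame columns listed above."""
    rows = []
    for b in businesses:
        categories = [c["title"] for c in b.get("categories", [])]
        rows.append({
            "alias": b.get("alias"),
            "name": b.get("name"),
            "url": b.get("url"),
            "review_count": b.get("review_count"),
            "categories": categories,
            "ethnic_category": categories,  # placeholder: the real pipeline derives this from categories
            "rating": b.get("rating"),
            "price": b.get("price"),
            "latitude": (b.get("coordinates") or {}).get("latitude"),
            "longitude": (b.get("coordinates") or {}).get("longitude"),
            "city": (b.get("location") or {}).get("city"),
            "address": (b.get("location") or {}).get("address1"),
        })
    return pd.DataFrame(rows)
```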
Check out the interactive dashboard here
Future improvements:
- Use Google Compute Engine and Docker to host this data pipeline in the cloud (Complete)
- Implement dbt testing for data integrity in terms of locations
- Implement dbt incremental models for adding new data or updating old data
- Optimize Dockerfile or use Docker Compose to streamline the dockerization process
- Incorporate Yelp reviews to build an NLP model, replace `null` values in the price column with the model's output, and create a column with restaurant sentiment (a rough sketch of this idea follows below)
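A purely illustrative sketch of that last idea (no such model exists in this repo yet; the training data and model choice below are placeholder assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: review text paired with the price bucket of the reviewed restaurant
train_reviews = ["great tacos, cheap and fast", "elegant tasting menu, pricey but worth it"]
train_prices = ["$", "$$$$"]

# Simple TF-IDF + logistic regression baseline for imputing missing price values from reviews
price_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
price_model.fit(train_reviews, train_prices)

# The predicted bucket could then replace nulls in the price column;
# a second classifier trained on sentiment labels could populate the sentiment column
print(price_model.predict(["affordable burritos, huge portions"]))
```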
Prerequisites: Ensure you have Google Cloud Platform, dbt Cloud, and Prefect Cloud accounts. To run the project, use the following steps:
- Set up your Google Cloud environment & generate a Yelp API key
- Create a Google Cloud Platform project
- On GCP, create a service account with GCE, GCS, and BigQuery admin privileges. For a walkthrough, click here
- Configure Identity and Access Management (IAM) for the service account, giving it the following privileges:
  - Viewer
  - Storage Admin
  - Storage Object Admin
  - BigQuery Admin
- Generate a Yelp API key by creating an account with Yelp and generating a free user API key (limited to 5,000 calls per day)
- Create a VM (Virtual Machine) with machine type `e2-standard-4` in the region that is closest to you. Choose Operating system: Ubuntu with Version: Ubuntu 20.04 LTS and Size: 40 GB. For this project, region `us-west2` (Los Angeles) was chosen.
- Set up the VM. For a walkthrough, click here
- After the SSH key is generated and the setup is performed as shown in the video, you can easily log into the VM with a config file in the `.ssh` directory. You will be prompted for your passphrase after this is run.
# EXECUTE IN THE DIRECTORY WHERE THE SSH KEY IS
touch config
code config  # open VS Code to edit
In VS Code, edit the config file like below:
Host YOUR_HOST_NAME_HERE
    HostName YOUR_VM_EXTERNAL_IP_ADDRESS_HERE
    User YOUR_VM_USERNAME_HERE
    IdentityFile ~/YOUR_GCP_SSH_KEY_FILE_PATH_HERE
- Log into your GCP VM by executing `ssh YOUR_HOST_NAME_HERE` in the directory one level above `.ssh/` and typing the passphrase for that SSH key.
- After logging into the VM, install Anaconda using the following steps:
- Download Anaconda using `wget https://repo.anaconda.com/archive/Anaconda3-2023.07-2-Linux-x86_64.sh` or the latest version from this link.
- Run `bash Anaconda3-2023.07-2-Linux-x86_64.sh` to execute the downloaded file. Accept all the terms and conditions of the installation.
- Then run the code below in your VM CLI.
rm Anaconda3-2023.07-2-Linux-x86_64.sh
echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
# Reload default profile
source ~/.bashrc
conda update conda
- Install Docker, create a docker group user, & install Terraform by using the following steps:
sudo apt-get update  # if sudo is not installed, run apt-get -y install sudo and then sudo apt-get update
sudo apt-get install nano
sudo apt-get install docker.io
sudo groupadd docker
sudo gpasswd -a $USER docker
sudo service docker restart
- Log out with `logout` and log back in with `ssh YOUR_HOST_NAME_HERE`, assuming the `config` file is set up. This is so that your group membership is re-evaluated. If not, you can run `ssh -i ~/YOUR_GCP_SSH_KEY_FILE_PATH_HERE YOUR_VM_USERNAME_HERE@YOUR_VM_EXTERNAL_IP_ADDRESS_HERE`.
- Fork this repo, and clone it to your local environment.
git clone https://github.com/djeong95/Yelp_review_datapipeline.git
- Inside the VM, under the `/home/YOUR_USERNAME/Yelp_review_datapipeline/` directory, create both the `.env` and `SERVICE_ACCT_KEY.json` files.
touch .env
nano .env
touch SERVICE_ACCT_KEY.json
nano SERVICE_ACCT_KEY.json
- Inside the `.env` file, paste the information below.
YELP_API_KEY=YOUR_YELP_API_KEY_HERE
GCP_SERVICE_ACCOUNT_FILE=YOUR_SERVICE_ACCOUNT_KEY_JSONFILEPATH_HERE
GOOGLE_CLOUD_PROJECT=YOUR_PROJECT_NAME_HERE
- Inside the `SERVICE_ACCT_KEY.json` file, paste the information below from the service account key that you generated.
{
  "type": "service_account",
  "project_id": "YOUR_PROJECT_ID_HERE",
  "private_key_id": "YOUR_PRIVATE_KEY_ID_HERE",
  "private_key": "YOUR_PRIVATE_KEY_HERE",
  "client_email": "YOUR_CLIENT_EMAIL_HERE",
  "client_id": "YOUR_CLIENT_ID_HERE",
  "auth_uri": "YOUR_INFORMATION_HERE",
  "token_uri": "YOUR_INFORMATION_HERE",
  "auth_provider_x509_cert_url": "YOUR_INFORMATION_HERE",
  "client_x509_cert_url": "YOUR_INFORMATION_HERE",
  "universe_domain": "YOUR_DOMAIN_HERE"
}
- In your VM, you are now ready to build a Docker container from an image, which is defined using a Dockerfile.
# build docker image
docker build -t yelp-pipeline-production:latest .
# After the Docker image is built, run a container mounting the .env and SERVICE_ACCT_KEY.json files as volumes
docker run -it -v /home/YOUR_USERNAME/Yelp_review_datapipeline/.env:/opt/prefect/.env -v /home/YOUR_USERNAME/Yelp_review_datapipeline/SERVICE_ACCT_KEY.json:/opt/prefect/SERVICE_ACCT_KEY.json yelp-pipeline-production
- Install required tools and download & install Terraform.
cd /usr/local/bin/
TERRAFORM_VERSION="1.5.5"
apt-get update
apt-get install -y curl unzip
curl -LO "https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip"
unzip terraform_${TERRAFORM_VERSION}_linux_amd64.zip
rm terraform_${TERRAFORM_VERSION}_linux_amd64.zip
- Run Terraform to create infrastructure.
cd /opt/terraform_rev2/
- If needed, use `nano variables.tf` to change the `variables.tf` file with your corresponding variables (`apt-get install nano` might be needed). I would recommend leaving the names of the datasets and bucket as they are; otherwise, you will need to change them in the Prefect flows and dbt.
- To initialize, plan, and apply the infrastructure, adjust and run the following Terraform commands.
terraform init
terraform plan -var="project=YOUR_PROJECT_ID_HERE"
terraform apply -var="project=YOUR_PROJECT_ID_HERE"
- Type 'yes' when prompted.
- Set up your orchestration
- If you do not have a Prefect workspace, sign up for Prefect Cloud and create a workspace here.
- Generate a `PREFECT_API_KEY` and `PREFECT_API_URL` for logging in via the Docker container.
- Create the Prefect blocks via the cloud UI, or adjust the variables in `/prefect/prefect_create_blocks.py` and run:
cd /opt/prefect/
mkdir data/
PREFECT_API_KEY="[API-KEY]"
PREFECT_API_URL="https://api.prefect.cloud/api/accounts/[ACCOUNT-ID]/workspaces/[WORKSPACE-ID]"
prefect cloud login
python prefect/prefect_create_blocks.py
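Conceptually, the block-creation script registers a GCP credentials block and a GCS bucket block. The outline below is a hedged sketch using the `prefect-gcp` collection; the block names and bucket name are placeholders and may differ from the actual `prefect_create_blocks.py`.

```python
from prefect_gcp import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Register a credentials block from the mounted service account key (path from the docker run step)
gcp_credentials = GcpCredentials(service_account_file="/opt/prefect/SERVICE_ACCT_KEY.json")
gcp_credentials.save("yelp-gcp-creds", overwrite=True)

# Register a GCS bucket block that the flows can load by name
GcsBucket(
    bucket="YOUR_GCS_BUCKET_NAME_HERE",
    gcp_credentials=gcp_credentials,
).save("yelp-gcs-bucket", overwrite=True)
```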
- Run `prefect deployment build` commands for deploying scheduled runs.
cd /opt/
prefect deployment build ./prefect/yelp_api_to_gcs.py:etl_api_to_gcs -n "Parameterized ETL from API to GCS"
prefect deployment build ./prefect/yelp_gcs_to_bq.py:etl_gcs_to_bq -n "Parameterized ETL from GCS datalake to BigQuery"
- Edit parameters as you see fit in the `etl_api_to_gcs-deployment.yaml` and `etl_gcs_to_bq-deployment.yaml` files generated above, using `nano etl_api_to_gcs-deployment.yaml`. Example:
parameters: {"terms": ['Restaurants'], "start_slice": 0, "end_slice": 459}
- To execute the flow, run the following commands in two different terminals:
prefect deployment apply etl_api_to_gcs-deployment.yaml
prefect deployment apply etl_gcs_to_bq-deployment.yaml
prefect agent start --work-queue "default" # run this to schedule your run
After running the flow, you will find the data in BigQuery under `yelp_data_raw.{term}_data_raw`. The flow will take around 120 minutes to complete, though this will vary depending on the term you run. Note that a free Yelp API account is limited to 5,000 calls per day.
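To sanity-check that the raw table landed, one option is to query it from Python with the BigQuery client. This is a hedged example; the table name below assumes the `Restaurants` term lower-cased, so adjust it to whatever term you ran.

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up the project and credentials from the environment

# Row count for one raw table; names follow the yelp_data_raw.{term}_data_raw pattern
query = "SELECT COUNT(*) AS n_rows FROM `yelp_data_raw.restaurants_data_raw`"
for row in client.query(query).result():
    print(f"rows loaded: {row.n_rows}")
```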
- Data transformation and modeling using dbt
- Follow the steps highlighted here to create an account and a dbt Cloud project. Skip the "Create a BigQuery service account" step, as you already have a service account key.
- Below is the lineage graph that describes the data transformation performed by dbt.
- Execute the following command:
dbt build --var 'is_test_run: false'
You will get 6 tables in the `yelp_data_dbt` dataset:
- dim_coordinates
- dim_counties
- fact_yelp_all
- stg_cafes
- stg_desserts
- stg_restaurants
- You can now query the data and connect it to Looker Studio to visualize it. When connecting to the data source, use the `fact_yelp_all` table to build the data source in Looker; don't use the partitioned table, as you won't get any data in your report.
- Go to Looker Studio → create → BigQuery → choose your project, dataset & transformed table.