Dataflow Setup Guide
This guide covers installing and configuring Apache Beam and Dataflow; it is relevant to the work in Milestones 7 and 8.
1. Enable the Dataflow API
- GCP Console -> Navigation Menu -> APIs & Services -> Enable APIs and Services -> enter Dataflow in the search bar -> click Enable (a CLI alternative is sketched below)
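If you prefer the command line, the same API can be enabled from Cloud Shell with gcloud (a minimal sketch, assuming your project is already the active gcloud project):

```bash
# Enable the Dataflow API for the active project
gcloud services enable dataflow.googleapis.com
```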
2. Create a new service account (IAM & admin -> Service accounts -> Create Service Account)
- service account name: sa-dataflow
- roles: Dataflow Admin, BigQuery Data Editor, BigQuery Job User, Storage Object Creator, Storage Object Viewer
- create key, key type: JSON
- rename the key file to `sa-dataflow.json`
- download `sa-dataflow.json` to your local machine
- click Done (an equivalent gcloud sketch follows this list)
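The service account can also be created from Cloud Shell. This is a hedged sketch, not the console flow above: replace `<project id>` with your GCP project ID, and repeat the policy binding once per role listed above (role IDs: roles/dataflow.admin, roles/bigquery.dataEditor, roles/bigquery.jobUser, roles/storage.objectCreator, roles/storage.objectViewer):

```bash
# Create the service account
gcloud iam service-accounts create sa-dataflow --display-name "sa-dataflow"

# Grant a required role (repeat for each of the five roles; Dataflow Admin shown)
gcloud projects add-iam-policy-binding <project id> \
  --member "serviceAccount:sa-dataflow@<project id>.iam.gserviceaccount.com" \
  --role "roles/dataflow.admin"

# Create and download a JSON key for the account
gcloud iam service-accounts keys create sa-dataflow.json \
  --iam-account sa-dataflow@<project id>.iam.gserviceaccount.com
```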
3. Create a Cloud Storage bucket (Storage -> Browser -> Create Bucket)
- bucket name: `<group name>-<some unique suffix>`
- storage class: Regional
- location: us-central1
- click Create
- create 3 folders inside your bucket: staging, temp, output (a gsutil alternative is sketched after the note below)
Note:
- bucket names must be globally unique on GCP, which is why you need to add a unique suffix to your group name.
- the staging and temp folders are used by the Dataflow runner for staged files and temporary files; the output folder stores the output from the WordCount example.
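The bucket can also be created from Cloud Shell with gsutil (a sketch; the bucket name is a placeholder, and since GCS folders are just object-name prefixes, the 3 folders can still be created from the console as above):

```bash
# Create a regional bucket in us-central1
gsutil mb -c regional -l us-central1 gs://<group name>-<some unique suffix>
```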
4. Open Cloud Shell from the top-right menu (the button labeled Activate Cloud Shell)
- upload `sa-dataflow.json` to your home directory
- in your home directory, create a new file named `.bash_profile` with these 4 lines:

```bash
export PS1="$ "
export PROJECT_ID="<project id>"
export BUCKET="gs://<bucket name>"
export GOOGLE_APPLICATION_CREDENTIALS="<home directory>/sa-dataflow.json"
```

- save and close the file
- run `source .bash_profile` on the shell (a quick sanity check is sketched after the note below)
Note:
- `<project id>` = your GCP project ID
- `<bucket name>` = your bucket name
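To confirm the shell environment is set, a quick sanity check (a sketch; it only echoes the variables and verifies the key file is in place):

```bash
# Verify the environment variables and the credentials file
echo "$PROJECT_ID"
echo "$BUCKET"
ls -l "$GOOGLE_APPLICATION_CREDENTIALS"
```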
5. Install the Apache Beam and Dataflow libraries
- run: `pip install --user google-cloud-dataflow` (a version check is sketched below)
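To verify the installation, you can import Beam and print its version (a sketch):

```bash
# Confirm apache_beam is importable and show its version
python -c "import apache_beam; print(apache_beam.__version__)"
```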
6. Test your Apache Beam setup by running the WordCount example in local mode using the Direct Runner:

```bash
python -m apache_beam.examples.wordcount --output wordcount.out
```

- if you see any errors in stdout, stop and debug.
- open `wordcount.out-00000-of-00001` and examine the output
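A quick way to spot-check the results from the shell (a sketch; the shard name assumes the default single output shard produced above):

```bash
# Show the first few word counts from the output shard
head wordcount.out-00000-of-00001
```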
7. Test your Dataflow setup by running the WordCount example in distributed mode using the Dataflow Runner:

```bash
python -m apache_beam.examples.wordcount \
  --project $PROJECT_ID \
  --runner DataflowRunner \
  --staging_location $BUCKET/staging \
  --temp_location $BUCKET/temp \
  --output $BUCKET/output
```

- open your Dataflow console, find the running job, and examine the job details (or monitor it from the shell, as sketched below).
- open your GCS console, go to your bucket, open the 3 folders, and view the contents of the files.
- if the WordCount job completed successfully, your Dataflow setup is complete.
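The job can also be monitored from Cloud Shell (a sketch; the region flag assumes the job runs in us-central1):

```bash
# List recent Dataflow jobs and their states
gcloud dataflow jobs list --region=us-central1
```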
VERY IMPORTANT, PLEASE READ: Cloud Shell uses an ephemeral VM, so you'll need to run `pip install --user google-cloud-dataflow` each time you open a new Cloud Shell instance. However, your home directory on Cloud Shell is on a persistent disk that is mounted for you on each new Cloud Shell instance, so whatever you write to your home directory is safe.