Dataflow Setup Guide

This guide covers installing and configuring Apache Beam and Cloud Dataflow, and is relevant to the work in Milestones 7 and 8.

1. enable the Dataflow API

  • GCP Console -> Navigation Menu -> APIs & Services -> Add APIs and Services -> enter Dataflow in the search bar -> click Enable
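
If you prefer the command line, the Dataflow API can also be enabled from Cloud Shell with gcloud; a minimal sketch, assuming your GCP project is already the active gcloud project:

    gcloud services enable dataflow.googleapis.com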

2. create a new service account (IAM & admin -> Service accounts -> Create Service Account)

  • service account name: sa-dataflow
  • roles: Dataflow Admin, BigQuery Data Editor, BigQuery Job User, Storage Object Creator, Storage Object Viewer
  • create key, key type: JSON
  • rename key file to sa-dataflow.json
  • download sa-dataflow.json to your local machine
  • click Done
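
The same service account can also be created from Cloud Shell with gcloud; a sketch, where <project id> is your GCP project ID (if you run this inside Cloud Shell, the key file is written directly to your home directory as sa-dataflow.json, so there is nothing to rename, download, or upload later):

    gcloud iam service-accounts create sa-dataflow --display-name "sa-dataflow"
    for role in roles/dataflow.admin roles/bigquery.dataEditor roles/bigquery.jobUser \
                roles/storage.objectCreator roles/storage.objectViewer; do
      gcloud projects add-iam-policy-binding <project id> \
        --member "serviceAccount:sa-dataflow@<project id>.iam.gserviceaccount.com" \
        --role "$role"
    done
    gcloud iam service-accounts keys create sa-dataflow.json \
      --iam-account sa-dataflow@<project id>.iam.gserviceaccount.com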

3. create a Cloud Storage bucket (Storage -> Browser -> Create Bucket)

  • bucket name: <group name>-<some unique suffix>
  • storage class: Regional
  • location: us-central1
  • click Create
  • create 3 folders inside your bucket: staging, temp, output

Note:

  • bucket names must be globally unique across all of Cloud Storage, which is why you need to add a unique suffix to your group name.
  • the 3 folders (staging, temp, output) are used by the WordCount example for staging files, temporary files, and program output, respectively.
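
The bucket can also be created from Cloud Shell with gsutil; a sketch, using the same bucket-name placeholder as above:

    gsutil mb -c regional -l us-central1 gs://<group name>-<some unique suffix>

Cloud Storage folders are really just object-name prefixes, so if you create the bucket this way you can either add the 3 folders from the console afterwards or let the WordCount job create the staging, temp, and output prefixes when it writes.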

4. open Cloud Shell from the top-right menu (the button is called Activate Google Cloud Shell)

  • upload sa-dataflow.json to your home directory
  • in your home directory, create a new file named .bash_profile with these 4 lines:
    export PS1="$ "
    export PROJECT_ID="<project id>"
    export BUCKET="gs://<bucket name>"
    export GOOGLE_APPLICATION_CREDENTIALS="<home directory>/sa-dataflow.json"
  • save and close the file
  • run: source .bash_profile in the shell

Note:

  • <project id> = your GCP project ID
  • <bucket name> = your bucket name
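
A quick sanity check after sourcing the file:

    echo $PROJECT_ID $BUCKET
    ls -l $GOOGLE_APPLICATION_CREDENTIALS

Both variables should echo back, and ls should show the key file you uploaded.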

5. install the Apache Beam and Dataflow libraries

  • run: pip install --user google-cloud-dataflow
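
If the install succeeded, the Beam SDK should be importable from Python; a quick check (the version printed will vary):

    python -c "import apache_beam as beam; print(beam.__version__)"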

6. test your Apache Beam setup by running the Wordcount example in local mode using the Direct Runner

  • python -m apache_beam.examples.wordcount --output wordcount.out
  • if you see any errors in stdout, stop and debug.
  • open wordcount.out-00000-of-00001 and examine the output
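
Each line of the output file is a word followed by its count; a quick way to peek at it from the shell:

    head -20 wordcount.out-00000-of-00001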

7. test your Dataflow setup by running the Wordcount example in distributed mode using the Dataflow Runner

  • run:
    python -m apache_beam.examples.wordcount \
      --project $PROJECT_ID \
      --runner DataflowRunner \
      --staging_location $BUCKET/staging \
      --temp_location $BUCKET/temp \
      --output $BUCKET/output

  • open your Dataflow console, find the running job, and examine the job details.
  • open your GCS console, go to your bucket, open the 3 folders and view the contents of the files.
  • if the wordcount job completed successfully, your Dataflow setup is complete.
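
The same checks can also be done from Cloud Shell; a sketch, assuming the job has been submitted:

    gcloud dataflow jobs list --status=active
    gsutil ls -r $BUCKET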

VERY IMPORTANT, PLEASE READ: Cloud Shell uses an ephemeral VM, so you'll need to run pip install --user google-cloud-dataflow each time you open a new Cloud Shell instance. However, your home directory on Cloud Shell is backed by a persistent disk that gets mounted for you on each new Cloud Shell instance, so whatever you write to your home directory (such as sa-dataflow.json and .bash_profile) is safe across sessions.
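
Per the note above, the start-of-session routine in a new Cloud Shell instance is just the reinstall (plus re-sourcing .bash_profile if the step 4 variables are not picked up automatically):

    pip install --user google-cloud-dataflow
    source .bash_profile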