Dataflow Setup Guide

This guide covers installing and configuring Apache Beam and Cloud Dataflow, and is relevant to the work in Milestones 7 and 8.

1. enable the Dataflow API

  • GCP Console -> Navigation Menu -> APIs & Services -> Add APIs and Services -> enter Dataflow in the search bar -> click Enable
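
If you prefer the command line, the Dataflow API can also be enabled from Cloud Shell with gcloud; a minimal sketch, assuming your GCP project is already the active gcloud project:

    gcloud services enable dataflow.googleapis.com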

2. create a new service account (IAM & admin -> Service accounts -> Create Service Account)

  • service account name: sa-dataflow
  • roles: Dataflow Admin, BigQuery Data Editor, BigQuery Job User, Storage Object Creator, Storage Object Viewer
  • create key, key type: JSON
  • rename key file to sa-dataflow.json
  • download sa-dataflow.json to your local machine
  • click Done
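
The same service account can also be created from Cloud Shell with gcloud; a sketch, where <project id> is your GCP project ID (if you run this inside Cloud Shell, the key file is written directly to your home directory as sa-dataflow.json, so there is nothing to rename, download, or upload later):

    gcloud iam service-accounts create sa-dataflow --display-name "sa-dataflow"
    for role in roles/dataflow.admin roles/bigquery.dataEditor roles/bigquery.jobUser \
                roles/storage.objectCreator roles/storage.objectViewer; do
      gcloud projects add-iam-policy-binding <project id> \
        --member "serviceAccount:sa-dataflow@<project id>.iam.gserviceaccount.com" \
        --role "$role"
    done
    gcloud iam service-accounts keys create sa-dataflow.json \
      --iam-account sa-dataflow@<project id>.iam.gserviceaccount.com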

3. create a Cloud Storage bucket (Storage -> Browser -> Create Bucket)

  • bucket name: <group name>-<some unique suffix>
  • storage class: Regional
  • location: us-central1
  • click Create
  • create 3 folders inside your bucket: staging, temp, output

Note:

  • bucket names must be globally unique across all of Cloud Storage, which is why you need to add a unique suffix to your group name.
  • the 3 folders (staging, temp, output) are used by the WordCount example for staging files, temporary files, and program output, respectively.
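
The bucket can also be created from Cloud Shell with gsutil; a sketch, using the same bucket-name placeholder as above:

    gsutil mb -c regional -l us-central1 gs://<group name>-<some unique suffix>

Cloud Storage folders are really just object-name prefixes, so if you create the bucket this way you can either add the 3 folders from the console afterwards or let the WordCount job create the staging, temp, and output prefixes when it writes.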

4. open Cloud Shell from the top-right menu (the button is called Activate Google Cloud Shell)

  • upload sa-dataflow.json to your home directory
  • in your home directory, create a new file named .bash_profile with these 4 lines:
    export PS1="$ "
    export PROJECT_ID="<project id>"
    export BUCKET="gs://<bucket name>"
    export GOOGLE_APPLICATION_CREDENTIALS="<home directory>/sa-dataflow.json"
  • save and close the file
  • run: source .bash_profile in the shell

Note:

  • <project id> = your GCP project ID
  • <bucket name> = your bucket name
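
A quick sanity check after sourcing the file:

    echo $PROJECT_ID $BUCKET
    ls -l $GOOGLE_APPLICATION_CREDENTIALS

Both variables should echo back, and ls should show the key file you uploaded.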

5. install the Apache Beam and Dataflow libraries

  • run: pip install --user google-cloud-dataflow
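
If the install succeeded, the Beam SDK should be importable from Python; a quick check (the version printed will vary):

    python -c "import apache_beam as beam; print(beam.__version__)"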

6. test your Apache Beam setup by running the Wordcount example in local mode using the Direct Runner

  • python -m apache_beam.examples.wordcount --output wordcount.out
  • if you see any errors in stdout, stop and debug.
  • open wordcount.out-00000-of-00001 and examine the output
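
Each line of the output file is a word followed by its count; a quick way to peek at it from the shell:

    head -20 wordcount.out-00000-of-00001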

7. test your Dataflow setup by running the Wordcount example in distributed mode using the Dataflow Runner

  • run:
    python -m apache_beam.examples.wordcount \
      --project $PROJECT_ID \
      --runner DataflowRunner \
      --staging_location $BUCKET/staging \
      --temp_location $BUCKET/temp \
      --output $BUCKET/output

  • open your Dataflow console, find the running job, and examine the job details.
  • open your GCS console, go to your bucket, open the 3 folders and view the contents of the files.
  • if the wordcount job completed successfully, your Dataflow setup is complete.
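
The same checks can also be done from Cloud Shell; a sketch, assuming the job has been submitted:

    gcloud dataflow jobs list --status=active
    gsutil ls -r $BUCKET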

VERY IMPORTANT, PLEASE READ: Cloud Shell uses an ephemeral VM, so you'll need to run pip install --user google-cloud-dataflow each time you open a new Cloud Shell instance. However, your home directory on Cloud Shell is backed by a persistent disk that gets mounted for you on each new Cloud Shell instance, so whatever you write to your home directory (such as sa-dataflow.json and .bash_profile) is safe across sessions.
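
Per the note above, the start-of-session routine in a new Cloud Shell instance is just the reinstall (plus re-sourcing .bash_profile if the step 4 variables are not picked up automatically):

    pip install --user google-cloud-dataflow
    source .bash_profile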