## Google Cloud Composer and Dataflow POC 

##### **Using Python**



#### Ref: Gcloud Command - https://cloud.google.com/sdk/gcloud/reference
#### Ref: Gsutil Command: https://cloud.google.com/storage/docs/gsutil/commands/ls
#### Ref: BQ Command: https://cloud.google.com/bigquery/docs/datasets#bq
#### Ref: Apache Beam[GCP]: https://cloud.google.com/dataflow/docs/guides/beam-creating-a-pipeline
#### Ref: Apache Beam[GCP]: https://beam.apache.org/documentation/programming-guide/

#### Youtube Ref:
#### Beam and Dataflow with Python: https://www.youtube.com/watch?v=I1JUtoDHFcg

## POC 
### Ref: https://cloud.google.com/dataflow/docs/quickstarts/quickstart-java-maven
### Ref: https://www.datobra.com/posts/pubsub_to_bigquery_dataflow_pipeline/

### Prerequisites for Python
#### 1. python 3
#### 2. pip
#### 3. virtualenv

### First, open Cloud Shell and run the commands below one by one

In [None]:
# Run each export as its own command (no trailing backslashes),
# so that $PROJECT_ID is already set when GCS_BUCKET_01 is expanded.
export PROJECT_ID=indranil-24012994-01
export BILLING_ACCOUNT_ID=01F748-D68B6C-7BFEF3
export SERVICE_ACCOUNT_ID=sa-composer-dataflow
export REGION=us-central1
export GCS_BUCKET_01="gcs-$PROJECT_ID"
export PUBSUB_TOPIC=pubsub-topic-poc-01
export PUBSUB_SUBSCRIPTION_01=pubsub-subscription-poc-01
export BIGQUERY_DATASET=poc_dataflow
export BIGQUERY_TABLE_01=detailed_view
export BIGQUERY_TABLE_02=search
export BIGQUERY_TABLE_03=add_to_favorite
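
Several later steps need fully qualified names built from these variables, for example the Pub/Sub topic path or the BigQuery table spec. The following is a minimal sketch of how those names could be derived in Python, assuming the variables above are exported; the helper name `resource_names` is hypothetical and not part of any Google library.

```python
import os


def resource_names(env=None):
    """Build fully qualified resource names from the POC environment variables."""
    env = os.environ if env is None else env
    project = env["PROJECT_ID"]
    dataset = env["BIGQUERY_DATASET"]
    return {
        # Pub/Sub resources are addressed as projects/<project>/topics/<topic>
        "topic": "projects/{}/topics/{}".format(project, env["PUBSUB_TOPIC"]),
        "subscription": "projects/{}/subscriptions/{}".format(
            project, env["PUBSUB_SUBSCRIPTION_01"]),
        # BigQuery table spec in the project:dataset.table form used by bq
        "table_01": "{}:{}.{}".format(project, dataset, env["BIGQUERY_TABLE_01"]),
        # Bucket URL following the gcs-<project> naming used above
        "bucket": "gs://gcs-{}".format(project),
    }


if __name__ == "__main__":
    # Example values copied from the exports above
    sample = {
        "PROJECT_ID": "indranil-24012994-01",
        "PUBSUB_TOPIC": "pubsub-topic-poc-01",
        "PUBSUB_SUBSCRIPTION_01": "pubsub-subscription-poc-01",
        "BIGQUERY_DATASET": "poc_dataflow",
        "BIGQUERY_TABLE_01": "detailed_view",
    }
    for key, value in resource_names(sample).items():
        print(key, "=", value)
```

Passing a plain dictionary instead of reading `os.environ` keeps the helper easy to check without touching the shell session.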

### Create a new project for this POC and set it as the active project

In [None]:
# Create the Project

# gcloud projects create <project_id> --name <project_name>
gcloud projects create $PROJECT_ID --name 'Composer Dataflow POC'

# List all projects
gcloud projects list

# Set to the newly created Project
gcloud config set project $PROJECT_ID

### Enable billing for that Project

In [None]:
gcloud alpha billing accounts list

gcloud alpha billing projects link $PROJECT_ID --billing-account $BILLING_ACCOUNT_ID


### Enable the APIs

In [None]:
# List available APIs (without --available, only already enabled APIs are listed)
gcloud services list --available | grep dataflow

# Enable the API
gcloud services enable dataflow.googleapis.com
gcloud services enable pubsub.googleapis.com

### Create Service Account and assign the specific role

In [None]:
# create the service account
gcloud iam service-accounts create $SERVICE_ACCOUNT_ID \
--description='service account for POC' \
--display-name='sa-composer-dataflow'

# to list the service account
gcloud iam service-accounts list

# Search or list roles
gcloud iam roles list | grep dataflow

# assign the role to that specific service account
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/editor"

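# Generate a key for this service account (KEY_PATH is a placeholder for the local key file path)
gcloud iam service-accounts keys create KEY_PATH \
--iam-account="$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com"

# Point the client libraries at the key file
# bash (Cloud Shell):
export GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"
# PowerShell:
# $env:GOOGLE_APPLICATION_CREDENTIALS="KEY_PATH"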

### Required roles for the above service account

In [None]:
roles/dataflow.developer
roles/pubsub.editor
roles/storage.objectCreator
roles/storage.objectViewer

--=============================
Cloud Dataflow Service Agent
Dataflow Admin (for jobs creation)
Dataflow Worker
BigQuery Admin
Pub/Sub Subscriber
Storage Object Admin
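
The binding command shown earlier grants `roles/editor`, while this cell lists narrower roles. As a rough sketch, the per-role `gcloud projects add-iam-policy-binding` commands could be generated as follows; the helper name `binding_commands` is hypothetical, and only the four role IDs listed at the top of this cell are used.

```python
import shlex

# Role IDs copied from the list above
REQUIRED_ROLES = [
    "roles/dataflow.developer",
    "roles/pubsub.editor",
    "roles/storage.objectCreator",
    "roles/storage.objectViewer",
]


def binding_commands(project_id, service_account_id, roles=REQUIRED_ROLES):
    """Return one add-iam-policy-binding command string per role."""
    member = "serviceAccount:{}@{}.iam.gserviceaccount.com".format(
        service_account_id, project_id)
    commands = []
    for role in roles:
        parts = [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            "--member=" + member,
            "--role=" + role,
        ]
        # shlex.quote guards against shell metacharacters in any argument
        commands.append(" ".join(shlex.quote(part) for part in parts))
    return commands


if __name__ == "__main__":
    for command in binding_commands("indranil-24012994-01", "sa-composer-dataflow"):
        print(command)
```

The script only prints the commands, so they can be reviewed before being pasted into Cloud Shell.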

### Create Pub/Sub Topic and Subscription

In [None]:
# Create Pub/Sub Topic
gcloud pubsub topics create $PUBSUB_TOPIC --project $PROJECT_ID

# Create Subscription to that Topic
gcloud pubsub subscriptions create $PUBSUB_SUBSCRIPTION_01 --topic $PUBSUB_TOPIC --project $PROJECT_ID
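
The pipeline later in this notebook parses each Pub/Sub message as JSON and reads its `clientid` field. Below is a minimal sketch of building such a test payload; the helper name `make_test_message` and the `event` field are assumptions for illustration, and only `clientid` comes from the pipeline code.

```python
import json


def make_test_message(client_id, **fields):
    """Return a UTF-8 encoded JSON payload containing a clientid key."""
    payload = dict(fields, clientid=client_id)
    # sort_keys gives a stable field order, which makes payloads easy to compare
    return json.dumps(payload, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    message = make_test_message("client-001", event="search")
    print(message.decode("utf-8"))
```

The printed JSON string can then be published by hand, for example with `gcloud pubsub topics publish $PUBSUB_TOPIC --message '<json>'`.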

### Create Google Storage Bucket

In [None]:
gsutil mb -c standard -b off -l us-central1 gs://$GCS_BUCKET_01

### Create Bigquery dataset and tables

In [None]:
# Dataset Creation
bq mk --location us --description 'for demo poc' --dataset $PROJECT_ID:$BIGQUERY_DATASET

# Table Creation
bq mk --location us-central1 --table $PROJECT_ID:$BIGQUERY_DATASET.$BIGQUERY_TABLE_01
bq mk --location us-central1 --table $PROJECT_ID:$BIGQUERY_DATASET.$BIGQUERY_TABLE_02
bq mk --location us-central1 --table $PROJECT_ID:$BIGQUERY_DATASET.$BIGQUERY_TABLE_03

### Create and activate the Python virtual environment

In [None]:
python3 -m virtualenv env
source env/bin/activate

### Install the Apache Beam inside Python Virtual Environment

In [None]:
pip install wheel
pip install 'apache-beam[gcp]'

### Run the pipeline locally

In [None]:
python -m apache_beam.examples.wordcount --output outputs

more outputs*

python hello_beam.py --project $PROJECT_ID --topic $PUBSUB_TOPIC --output beam.out --runner DirectRunner

### Run the pipeline on the Dataflow service

In [None]:
python -m apache_beam.examples.wordcount \
--region $REGION \
--input gs://dataflow-samples/shakespeare/kinglear.txt \
--output gs://$GCS_BUCKET_01/results/outputs \
--runner DataflowRunner \
--project $PROJECT_ID \
--temp_location gs://$GCS_BUCKET_01/tmp/
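
The same flags can be assembled programmatically, which helps when the job is launched from a script rather than typed by hand. This is only a sketch: the function name `dataflow_wordcount_args` is hypothetical, and the flag names and paths are copied from the command above.

```python
def dataflow_wordcount_args(project_id, region, bucket):
    """Return the wordcount flags from the command above as an argv list."""
    return [
        "--region", region,
        "--input", "gs://dataflow-samples/shakespeare/kinglear.txt",
        "--output", "gs://{}/results/outputs".format(bucket),
        "--runner", "DataflowRunner",
        "--project", project_id,
        "--temp_location", "gs://{}/tmp/".format(bucket),
    ]


if __name__ == "__main__":
    args = dataflow_wordcount_args(
        "indranil-24012994-01", "us-central1", "gcs-indranil-24012994-01")
    print(" ".join(["python", "-m", "apache_beam.examples.wordcount"] + args))
```

The list form can be handed to `subprocess.run` together with `[sys.executable, "-m", "apache_beam.examples.wordcount"]`, which avoids shell quoting problems.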

### Dataflow Job Output

In [None]:
# List the output files
gsutil ls -lh "gs://$GCS_BUCKET_01/results/outputs*"  

# View the results in the output files:
gsutil cat "gs://$GCS_BUCKET_01/results/outputs*" 
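
The wordcount example writes one `word: count` pair per line, spread across several output shards. The sketch below merges such lines into a single tally, assuming the lines have already been fetched (for example by piping the `gsutil cat` output into the script); the function name `parse_wordcount` is hypothetical.

```python
import sys
from collections import Counter


def parse_wordcount(lines):
    """Merge 'word: count' lines from one or more shards into a Counter."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last ': ' so the count is always the final field
        word, _, count = line.rpartition(": ")
        counts[word] += int(count)
    return counts


if __name__ == "__main__":
    # Reads lines from standard input and prints the ten most frequent words
    for word, count in parse_wordcount(sys.stdin).most_common(10):
        print(word, count)
```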

### Clean up

In [None]:
#deactivate the virtual environment
deactivate

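# Delete the bucket and its contents (rm -r on a bucket URL also removes the bucket itself)
gsutil rm -r gs://$GCS_BUCKET_01
# gsutil rb is only needed if the bucket was emptied some other way
# gsutil rb gs://$GCS_BUCKET_01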
    
# Delete the Pub/Sub Topic
gcloud pubsub topics delete $PUBSUB_TOPIC --project $PROJECT_ID

# Disable the service account
gcloud iam service-accounts disable "$SERVICE_ACCOUNT_ID@$PROJECT_ID.iam.gserviceaccount.com"

# Delete the Project
gcloud projects delete $PROJECT_ID

# To undo the project deletion (possible only for a limited time after deleting)
# gcloud projects undelete indranil-24012994-01

### Dataflow Custom Template in Python (hello_beam.py)

In [None]:
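import argparse
import json
import logging

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions


# Define the DoFn used by ParDo
# Return the output as a list: Beam treats each item in it as one output element
class AddKeyToDict(beam.DoFn):
    def process(self, element):
        logging.info("LOG: {}".format(element))
        return [(element["clientid"], element)]


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--project", required=True)
    parser.add_argument("--topic", required=True)
    parser.add_argument("--output", required=True)
    args, pipeline_args = parser.parse_known_args(argv)

    # Assumption: --topic is the short topic name, so build the full resource path
    topic_path = "projects/{}/topics/{}".format(args.project, args.topic)
    # Reading from Pub/Sub requires a streaming pipeline
    options = PipelineOptions(pipeline_args, project=args.project, streaming=True)

    with beam.Pipeline(options=options) as p:
        # Fetch the data from Pub/Sub and parse each JSON message into a dict.
        # ReadFromPubSub replaces the deprecated ReadStringsFromPubSub and yields bytes.
        records = (p | "Read From Pubsub" >>
                   beam.io.ReadFromPubSub(topic=topic_path, id_label="MESSAGE_ID")
                   | "Parse Json to Dict" >>
                   beam.Map(lambda e: json.loads(e.decode("utf-8"))))

        # Call ParDo: window the records, key them by clientid, then group
        (records | beam.WindowInto(window.SlidingWindows(300, 60, offset=0))
                 | beam.ParDo(AddKeyToDict())
                 | beam.GroupByKey())

        # Load the data into BigQuery
        (records | "Write to Bigquery" >>
                   beam.io.WriteToBigQuery(
                       args.output,
                       schema="",  # placeholder from the original; supply the table schema
                       create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))


if __name__ == "__main__":
    logging.getLogger().setLevel(logging.INFO)
    run()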
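
The pipeline uses `SlidingWindows(300, 60, offset=0)`, so each record falls into five overlapping 300-second windows that start 60 seconds apart, and records are then grouped by `clientid`. The plain-Python sketch below mimics both steps without Beam to make the behaviour easier to reason about. It is an illustration under stated assumptions, not Beam's implementation; the function names and the `event` field are hypothetical.

```python
import json
from collections import defaultdict


def sliding_windows(timestamp, size=300, period=60, offset=0):
    """Return the [start, end) windows, in seconds, that contain the timestamp."""
    # Latest window start at or before the timestamp, aligned to the period
    start = timestamp - ((timestamp - offset) % period)
    # Walk back one period at a time while the window still covers the timestamp
    return [(s, s + size) for s in range(start, timestamp - size, -period)]


def group_by_client(raw_messages):
    """Parse JSON messages and group the resulting dicts by their clientid."""
    grouped = defaultdict(list)
    for raw in raw_messages:
        element = json.loads(raw)
        grouped[element["clientid"]].append(element)
    return dict(grouped)


if __name__ == "__main__":
    print(sliding_windows(130))
    messages = [
        '{"clientid": "a", "event": "search"}',
        '{"clientid": "b", "event": "search"}',
        '{"clientid": "a", "event": "add_to_favorite"}',
    ]
    for client, elements in group_by_client(messages).items():
        print(client, len(elements))
```

With a 300-second size and a 60-second period, every timestamp belongs to 300 / 60 = 5 windows, which is why a grouped result can contain the same record more than once across windows.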

### Configuration to check if anything goes wrong

#### 1. Turn Private Google Access "On" under VPC -> Subnet
#### 2. An empty folder cannot be created in GCS with the gsutil command
#### 3. There is a quota related to Dataflow job failures: each failed job is retried up to 1000 times, and the quota is 1500 jobs/table/day. (**Please confirm these numbers.)