Dataproc Customisable HA cluster debian-9 with zookeeper,kafka ,BigQuery and other tools/jobs with Terraform
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
credentials
.gitignore
CODE_OF_CONDUCT.md
LICENSE
README.MD
_config.yml
firewalls.tf
main.tf
networks.tf
provider.tf
schema.json
variables.tf

README.MD

Table of Contents (dataproc cluster debian-9 with zookeeper, kafka ,BigQuery and other tools with Terraform)

  1. Pre-reqs
  2. Creation and destroying
  3. Cluster details
  4. Cloud Dataproc version
  5. URLs and extra components via dataproc-initialization-actions
  6. Terraform graph
  7. Automatic provisioning
  8. Testing Kafka
  9. Reporting bugs
  10. Patches and pull requests
  11. License
  12. Code of conduct

Pre-reqs

  1. Download and Install Terraform

  2. Download and install google cloud sdk

    • One may install gcloud sdk silently for all users as root with access to GCLOUD_HOME for only speficic user:

      export $USERNAME="<<you_user_name>>"

      export SHARE_DATA=/data

      su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USER_NAME

      echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh

      echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh

  3. Clone this repository and cd into dataproc folder. git clone https://github.com/dwaiba/dataproc-terraform && cd dataproc-terraform

  4. Please create Service Credential of type JSON via https://console.cloud.google.com/apis/credentials, download and save as google.json in credentials folder.

Creation and destroying

terraform init

terraform plan -out "run.plan

terraform apply "run.plan"

terraform destroy

Please note - firewall.tf has all ports open and please switch them off before creation if you so want.

Cluster details

Name Role Staging Bucket
poccluster-m
poccluster-w-*
Default 3 Masters and Auto HA with Zookeeper
Number of workers are prompted
dataproc-poc-staging-bucket

Cloud Dataproc version

Version Includes Base OS Released On Last Updated (sub-minor version) Notes
1.3-deb9 Apache Spark 2.3.1
Apache Hadoop 2.9.0
Apache Pig 0.17.0
Apache Hive 2.3.2
Apache Tez 0.9.0*
Cloud Storage connector 1.9.9-hadoop2
Scala 2.11.8
Python 2.7
Debian 9 2018/06/29 2018/11/16
(1.3.16-deb9)
All releases on and after November 2, 2018 will be based on Debian 9.

URLs and extra components via dataproc-initialization-actions

  • YARN ResourceManager: http://<<Master_External_IP>>:8088/cluster

  • HDFS NameNode: http://<<Master_External_IP>>:9870

  • Hadoop Job History Server: http://<<Master_External_IP>>:19888/jobhistory

  • Node Managers: http://<<Individual_Node_External_IP>>:8042

  • Ganglia: http://<<Master_External_IP>>:80/ganglia

  • Livy: http://<<Master_External_IP>>:8998

  • docker latest is installed on all nodes.

Terraform Graph

Please generate dot format (Graphviz) terraform configuration graphs for visual representation of the repo.

terraform graph | dot -Tsvg > graph.svg

Also, one can use Blast Radius on live initialized terraform project to view graph. Please shoot in dockerized format:

docker ps -a|grep blast-radius|awk '{print $1}'|xargs docker kill && rm -rf dataproc-terraform && git clone https://github.com/dwaiba/dataproc-terraform && cd dataproc-terraform/ && terraform init && docker run --cap-add=SYS_ADMIN -dit --rm -p 5005:5000 -v $(pwd):/workdir:ro 28mm/blast-radius && cd ../../

A live example is here for this project.

Automatic Provisioning

https://github.com/dwaiba/dataproc-terraform/

Pre-req:

  1. gcloud should be installed. Silent install is - export $USERNAME="<<you_user_name>>" && export SHARE_DATA=/data && su -c "export SHARE_DATA=/data && export CLOUDSDK_INSTALL_DIR=$SHARE_DATA export CLOUDSDK_CORE_DISABLE_PROMPTS=1 && curl https://sdk.cloud.google.com | bash" $USER_NAME && echo "source $SHARE_DATA/google-cloud-sdk/path.bash.inc" >> /etc/profile.d/gcloud.sh && echo "source $SHARE_DATA/google-cloud-sdk/completion.bash.inc" >> /etc/profile.d/gcloud.sh &&

  2. Please create Service Credential of type JSON via https://console.cloud.google.com/apis/credentials, download and save as google.json in credentials folder of the gke-terraform.

Plan:

terraform init && terraform plan -var cluster_location=us-east1 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto ha with zookeeper 3 masters>> -out "run.plan"

Apply:

terraform apply "run.plan"

Destroy:

terraform destroy -var cluster_location=us-east1 -var project=<<your-google-cloud-project-name>> -var worker_num_instances=<<number of workers for the default auto ha with zookeeper 3 masters>>

Testing Kafka

Once the cluster has been created Kafka should be running on all worker nodes in the cluster, and Kafka libraries should be installed on the master node(s). You can test your Kafka setup by creating a simple topic and publishing to it with Kafka's command-line tools, after SSHing into one of your master nodes:

gcloud compute ssh gcloud compute ssh poccluster-m --zone europe-west2-b

Create a test topic, just talking to the local master's zookeeper server.

/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --create --replication-factor 1 --partitions 1 --topic test

/usr/lib/kafka/bin/kafka-topics.sh --zookeeper localhost:2181 --list

Use worker 0 as broker to publish 10 messages over 10 seconds asynchronously.

export CLUSTER_NAME=$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)

for i in {0..10}; do echo "message${i}"; sleep 1; done

nohup /usr/lib/kafka/bin/kafka-console-producer.sh --broker-list ${CLUSTER_NAME}-w-0:9092 --topic test &

User worker 1 as broker to consume those 10 messages as they come.This can also be run in any other master or worker node of the cluster.

/usr/lib/kafka/bin/kafka-console-consumer.sh --bootstrap-server ${CLUSTER_NAME}-w-1:9092 --topic test --from-beginning

Reporting bugs

Please report bugs by opening an issue in the GitHub Issue Tracker. Bugs have auto template defined. Please view it here

Patches and pull requests

Patches can be submitted as GitHub pull requests. If using GitHub please make sure your branch applies to the current master as a 'fast forward' merge (i.e. without creating a merge commit). Use the git rebase command to update your branch to the current master if necessary.

License

Code of Conduct