Cloudera Framework

Provides an example organisation-wide Cloudera (i.e. Hadoop ecosystem) project framework, defining corporate standards for runtime components, datasets, libraries, testing, deployment and project structure, to facilitate operation within a continuous deployment pipeline. The framework includes client/runtime/third-party bills-of-materials, utility and driver libraries, and a unit test harness with examples, providing full coverage of the following CDH components:

  • MR2
  • Kudu
  • HDFS
  • Flume
  • Kafka
  • Impala
  • ZooKeeper
  • Spark & Spark2
  • Hive/MR & Hive/Spark

The framework can target managed services provisioned by Cloudera Altus, automated cluster deployments via Cloudera Director, or manually managed clusters via Cloudera Manager.

Examples are included that codify the standards, providing end-to-end data streaming, ingest, modeling and testing pipelines, with synthetic datasets to exercise the codebase.

Finally, a set of archetypes is included to provide bare-bones starter client modules.

Requirements

To compile, build and package from source, this project requires:

  • Java 8
  • Maven 3
  • Scala 2.11
  • Python 2.7
  • Anaconda 4
  • Cloudera Altus CLI 2.2
  • Cloudera Director Client 2.6
  • Python Cloudera Manager API 5
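
Before sourcing bootstrap.sh (below), it can be useful to confirm which of these tools are already on the PATH. This is a minimal sketch; the Altus CLI version flag and the cm-api package name are assumptions and may differ in a given installation:

# Quick manual check of the toolchain listed above
java -version        # expect 1.8.x
mvn -version         # expect 3.x
scala -version       # expect 2.11.x
python --version     # expect 2.7.x
conda --version      # expect 4.x (Anaconda)
altus --version      # assumed flag for the Cloudera Altus CLI
pip show cm-api      # assumed package name for the Python Cloudera Manager API bindings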

The bootstrap.sh script tests for, configures and installs (where possible) the required toolchain; it should be sourced as follows:

. ./bootstrap.sh environment

To run the unit and integration tests, binaries and metadata are provided for all CDH components on the following platforms:

  • CentOS/RHEL 6.x
  • CentOS/RHEL 7.x
  • Ubuntu LTS 14.04.x
  • macOS 10.13.x (Impala unit tests are no-ops)

Note that, in addition to Maven dependencies, Cloudera parcels are used to manage platform-dependent binaries by way of the cloudera-parcel-plugin. Impala parcels are not available for non-Linux platforms.

Limitations

As noted above, this code does not work out of the box on Windows hosts; only Linux and macOS are supported. If developing on Windows, it is recommended to run a Linux container and develop from within it.
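
As a minimal sketch of that workflow, assuming Docker is available on the Windows host, a CentOS 7 container (one of the supported platforms above) can be started with the project checkout mounted inside it:

# From a Windows command prompt, mount the current checkout into a CentOS 7 container
docker run -it --rm -v "%cd%":/workspace -w /workspace centos:7 bash
# then install the toolchain and build inside the container, e.g.
# . ./bootstrap.sh environment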

Install

This project can be compiled, packaged and installed to a local repository, skipping tests, as follows:

mvn install -PPKG

To only compile the project:

mvn install -PCMP

To run the tests against both Scala 2.10 (the default) and Scala 2.11 (localhost must be resolvable for the tests to run):

mvn test
mvn test -pl cloudera-framework-testing -PSCALA_2.11
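
If the tests fail to start, localhost resolution can be checked first; this is a minimal sketch for Linux (getent is not available everywhere, and the hosts-file location varies by platform):

# Verify that localhost resolves; append a loopback entry if it does not
getent hosts localhost || echo "127.0.0.1 localhost" | sudo tee -a /etc/hosts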

The bootstrap script provides convenience mechanisms to build and release the project as follows:

./bootstrap.sh build release

Usage

The cloudera-framework includes a set of examples:

  • Example 1 (Java, HSQL, Flume, MR, Hive/MR, Impala, HDFS)
  • Example 2 (Java, HSQL, Kafka, Hive/Spark, Spark, Impala, S3)
  • Example 3 (Scala, CDSW, Spark2, MLlib, PMML, HDFS)
  • Example 4 (Envelope, Kafka, Spark2 Streaming, Kudu, HDFS)
  • Example 5 (Python, NLTK, PySpark, Spark2, HDFS)

In addition, archetypes are available in various profiles, allowing a new cloudera-framework client module to be bootstrapped.

For example, a project could be created from the Spark2 profile baseline, including a very simple example targeting a Cloudera Altus runtime, as below:

mvn org.apache.maven.plugins:maven-archetype-plugin:2.4:generate -B \
  -DarchetypeRepository=http://52.63.86.162/artifactory/cloudera-framework-releases \
  -DarchetypeGroupId=com.cloudera.framework.archetype \
  -DarchetypeArtifactId=cloudera-framework-archetype-spark2 \
  -DarchetypeVersion=2.0.5-cdh5.15.1 \
  -DgroupId=com.myorg.mytest \
  -DartifactId=mytest \
  -Dpackage=com.myorg.mytest \
  -DaltusEnv=my_altus_environment \
  -DaltusCluster=my_cluster \
  -DaltusS3Bucket=my_s3_bucket

Note that in order to run against the Cloudera Altus Amazon AWS runtime as above, both the "AWS_ACCESS_KEY" and "AWS_SECRET_KEY" environment variables must be set, and each Maven archetype parameter with the "altus" prefix must be given an appropriate value. The "altusS3Bucket" parameter should specify a valid S3 bucket to which the user has read/write access, located in the "altusEnv" region, with data stored under the "/data/workload/input" key using a schema like that of the pristine test dataset.
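
As a minimal sketch, assuming the AWS CLI is installed and the placeholder names above are substituted with real values, the environment and input data could be prepared as follows before running the archetype generate command:

# Export the AWS credentials required by the Altus runtime (placeholder values)
export AWS_ACCESS_KEY=AKIA...
export AWS_SECRET_KEY=...
# Confirm the input dataset is present in the bucket within the altusEnv region
aws s3 ls s3://my_s3_bucket/data/workload/input/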