# Big Data Lab - Log Management Solution

## Overview

Big Data Lab is an open-source project originally created at SoftServe Inc. to serve as an accelerator for big data projects and as an easy-to-deploy environment for education, prototyping, PoCs, performance testing, and other purposes. At the core of Big Data Lab is a log and metrics processing system that lets several types of users work with both raw and aggregated log and performance-metrics data in their daily tasks.

### Marketecture

The diagram below shows the main sources of data (log and performance-metrics data collected from web servers), as well as the three types of analytics produced from this data and the three types of users who consume it.

*(Marketecture diagram)*

## Architecture

### Architecture Drivers

Below is the set of main architecture drivers that influence the architecture and implementation of the project. We used Carnegie Mellon Software Engineering Institute's ADD (Attribute-Driven Design) methodology to build the solution architecture.

*(Architecture drivers diagram)*

### Lambda Architecture

#### Overview

As part of the design process, one of the first tasks is to create an overall logical structure for the system. To achieve this, it is generally not necessary to re-invent the wheel, but rather to use one particular type of design concept: a reference architecture. A reference architecture is a blueprint for structuring an application. For Big Data Analytics systems we distinguish five reference architectures:

  • Traditional Relational
  • Extended Relational
  • Non-Relational
  • Lambda Architecture (Hybrid)
  • Data Refinery (Hybrid)

#### Design Rationale

Of the reference architectures listed above, the Lambda Architecture promises the largest number of benefits, such as simultaneous access to real-time and historical data. Its parallel layers provide "complexity isolation": design decisions and development for each layer can be made independently, which corresponds to the "swim lanes" principle and increases fault tolerance and scalability (both for the system itself and for parallelizing development work).

The diagram below represents the proposed logical structure of the target system, based on the Lambda Architecture:

*(Lambda Architecture logical view)*

#### Data Flow

*(Data flow diagram)*

## Roadmap

### Components:

  • LogGenerator
  • Flume
  • ElasticSearch 1.7
  • ElasticSearch Automatic Schema Creation
  • Cloudera Director (Hadoop Cluster Deployment)
  • Impala Schema Creation
  • Kibana
  • Kibana: Automatic Pre-defined Dashboards Import

### Client OS Support:

  • Linux
  • MacOS

### Guest OS Support:

  • CentOS
  • RedHat

### Platform Support:

  • Amazon Web Services
  • VMware vSphere / ESX
  • OpenStack

### Configuration Size:

  • Small
  • Medium
  • Large

### Data Sources:

  • Log Files (HTTP, Error)
  • Performance metrics
  • IoT

### Service Discovery:

  • Built-in DNS with hostname self-registering
  • Consul

## Tested OS

Below is the list of client operating systems from which the solution can be deployed:

  • CentOS 6/7 (64 bit)
  • Ubuntu 14 (64 bit)
  • OSX

## Product Versions

  • Vagrant 1.8.1
  • Virtualbox 5.0.8
  • Terraform 0.6.14
  • Ruby 2.2.4 (Bundler 1.10.5)
  • Puppet 4.2.1
  • Flume 1.5.2
  • ElasticSearch 1.7.1 (Lucene Core 4.10.4)
  • Kibana T.B.D.
  • Cloudera Director 2.0.0

## Deployment Guide

Use the steps below to create a Big Data Lab cluster:

  1. Run:

    sudo ./core-install.sh
    ./puppet-install.sh
    ./terraform-install.sh
    source $HOME/.bashrc
    
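    Optionally, verify that the installed tools roughly match the versions listed under Product Versions above (these are standard version commands, assuming the install scripts completed without errors):

    terraform version
    puppet --version
    ruby --version
    bundle --version
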
  2. Choose a profile that you want to use as a template (small, medium or large).

    Small:

    1 Log Generator + Flume Agent + Flume Collector
    1 ElasticSearch + Kibana
    1 Cloudera Director
    
    Cloudera CDH Cluster:
    1 Cloudera Manager
    1 Name Node
    1+ Data Node(s)
    

    Medium:

    1+ Log Generator + Flume Agent
    1 Flume Collector
    1 ElasticSearch + Kibana
    1 Cloudera Director
    
    Cloudera CDH Cluster:
    1 Cloudera Manager
    1 Name Node
    3+ Data Nodes
    

    Large:

    2+ Log Generator + Flume Agent
    1 Flume Collector
    2+ ElasticSearch
    1 Kibana
    1 Cloudera Director
    
    Cloudera CDH Cluster:
    1 Cloudera Manager
    1 Name Node
    5+ Data Nodes
    
  3. According to the chosen profile, copy puppet/hiera/hieradata/common.yaml-[small|medium|large] to puppet/hiera/hieradata/common.yaml. Review and update its content, if needed.

    Values that may be changed:

    profiles::cloudera_director_client::root_volume_size_GB: AWS volume size to be allocated for each node. Root partition on each node will be resized accordingly.
    profiles::cloudera_director_client::data_node_quantity: Number of cluster data nodes to be deployed on AWS
    profiles::cloudera_director_client::data_node_quantity_min_allowed: Minimum number of data nodes that must be successfully deployed on AWS; otherwise the process fails
    profiles::cloudera_director_client::data_node_instance_type: AWS instance type for data node
    profiles::cloudera_director_client::cloudera_manager_instance_type:  AWS instance type for Cloudera Manager
    profiles::cloudera_director_client::master_node_instance_type: AWS instance type for master node
    profiles::cloudera_director_client::cluster_deployment_timeout_sec: Cluster deployment timeout in seconds. It should be changed depending on the cluster size.
    profiles::cloudera_director_client::hdfs_replication_factor: HDFS replication factor
    
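    For example, to start from the small profile and review the Cloudera Director settings before editing them (paths and profile names as listed above):

    cp puppet/hiera/hieradata/common.yaml-small puppet/hiera/hieradata/common.yaml
    grep 'cloudera_director_client' puppet/hiera/hieradata/common.yaml
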
  4. According to the chosen profile, copy main.tf-[small|medium|large] to main.tf. In the medium and large profiles you may change the number of "node_log_generator" nodes. In the large profile you may also change the number of "node_elasticsearch" nodes:

    count = 1
    
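    For example, assuming the medium profile, copy the template and locate the node count you want to adjust:

    cp main.tf-medium main.tf
    grep -A 2 'node_log_generator' main.tf
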
  5. Prepare .pem file to be used for SSH connections.
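
    If you don't already have a key, a minimal sketch for generating one locally looks like this (the file name bigdatalab.pem is just an example):

    ssh-keygen -t rsa -b 2048 -f bigdatalab.pem -N ""
    chmod 600 bigdatalab.pem
    ssh-keygen -y -f bigdatalab.pem   # prints the public key needed for terraform.tfvars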

  6. Copy terraform.tfvars.stub to terraform.tfvars. Update its content with actual values. You may also override other values from variables.tf, if needed.

    access_key: AWS Access Key Id
    secret_key: AWS Secret Access Key
    vpc_subnet_id: Existing AWS Subnet Id to be used
    key_file: Path to your .pem file
    public_key: Content of the public key for your .pem file
    tag_owner: "Owner" tag for instances (your name)
    tag_app: "App" tag for instances ("bigdatalab" by default)
    tag_env: "Env" tag for instances ("development" by default)
    security_group: Name for a new AWS Security Group ("bigdatalab-group" by default)
    key_name: Name for a new AWS Key Pair ("bigdatalab-key" by default)
    cluster_name: Name for the Cloudera cluster ("bigdatalab-cluster" by default)
    deploy_cloudera_cluster: Whether to deploy Cloudera cluster or not ("true" by default)
    

    Make sure that security_group and key_name with the specified names don't exist yet.
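
    If the AWS CLI is installed and configured, one way to check is from the command line; an error or empty result here means the default names (bigdatalab-key, bigdatalab-group) are still free:

    aws ec2 describe-key-pairs --key-names bigdatalab-key
    aws ec2 describe-security-groups --filters Name=group-name,Values=bigdatalab-group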

  7. To run unit tests, go to the puppet folder and run:

    bundle exec rake prep
    bundle exec rake rspec:classes
    bundle exec rake clean
    

    Please note that unit tests are automatically run during each terraform apply.

  8. To run acceptance tests, go to the puppet folder and run:

    bundle exec rake prep
    bundle exec rake rspec:acceptance
    bundle exec rake clean
    

    Please note that you need a physical machine to be able to run acceptance tests.

  9. To create the infrastructure, run:

    terraform apply
    
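    If you prefer to review the planned changes first, terraform plan prints them without creating anything, and terraform show can be used afterwards to inspect the state that was created:

    terraform plan
    terraform show
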
  10. To connect to any instance:

ssh -i <your .pem file> <SSH user>@<IP address>

You can find SSH users for all AMIs in variables.tf.
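
For example (hypothetical values; the SSH user depends on the AMI, as noted above):

ssh -i bigdatalab.pem centos@<instance public IP>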

  11. To destroy the infrastructure, you need to perform two steps:

First, connect to the Cloudera Director Client instance and run:

sudo cloudera-director terminate-remote ~/cloudera-director-cluster.conf \
  --lp.remote.username=<Cloudera Director username, "admin" by default> \
  --lp.remote.password=<Cloudera Director password, "admin" by default> \
  --lp.remote.terminate.assumeYes=true

When done, run locally:

terraform destroy

  12. For a quick test you may do the following:

Open ElasticSearch in a browser:

http://<public_ip_node_elasticsearch>:9200/_cat/indices?v

Make sure there is an index whose name starts with apache-logs- and that its docs.count value is greater than 0 and increases after each refresh.
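
The same check can be done from a terminal with curl (the apache-logs- prefix is taken from above; the full index name may carry a suffix):

curl 'http://<public_ip_node_elasticsearch>:9200/_cat/indices?v'
curl 'http://<public_ip_node_elasticsearch>:9200/apache-logs-*/_count?pretty'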

Open Cloudera Director in a browser:

http://<public_ip_node_cloudera_director>:7189

Check "Yes, I accept the End User License Terms and Conditions" and click "Continue". Enter Username ("admin" by default) and Password ("admin" by default)

Make sure that all services are "green".

Click "bigdatalab-cluster Deployment" link in the table. On the next screen click "View Properties" and copy Public IP value.

Open Cloudera Manager in a browser:

http://<Public IP from the last step>:7180

Please note that the Cloudera Manager UI may render incorrectly in Firefox. If there is no data on some screen, just refresh that page.

Make sure that everything is green. Please note that if you use the small profile with fewer than 3 data nodes, you will probably see some warnings related to the HDFS and Zookeeper services.

Click "HDFS-1" in the table. On the next screen click "NameNode Web UI". On the next screen click Utilities -> Browse the file system.

Navigate to the /flume/logs directory and make sure that it contains some actual data.
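
Alternatively, SSH to one of the cluster nodes and check the same directory from the command line (assuming the HDFS client is on the node's PATH, which is typical for CDH hosts):

hdfs dfs -ls /flume/logs
hdfs dfs -cat /flume/logs/<some file> | head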

## Cloudera CDH Cluster Installed Services

  1. Cloudera Manager

  2. Name Node:

  • HDFS
  • YARN
  • Zookeeper
  • Hive
  • Hue
  • Oozie
  • Impala

  3. Data Node(s):

  • HDFS
  • YARN
  • Impala
