FALCON-1106 Documentation for extensions
Author: Sowmya Ramesh <sramesh@hortonworks.com>

Reviewers: Balu Vellanki <balu@apache.org>, Ying Zheng <yzheng@hortonworks.com>

Closes #120 from sowmyaramesh/FALCON-1106
Sowmya Ramesh authored and sowmyaramesh committed May 3, 2016
1 parent fc34d42 commit 85345ad7e7421fbd25829381f27eb5b165d2f8d0
Showing 19 changed files with 442 additions and 262 deletions.
@@ -14,16 +14,17 @@
# See the License for the specific language governing permissions and
# limitations under the License.

HDFS Directory Replication Extension
HDFS Mirroring Extension

Overview
This extension implements replicating arbitrary directories on HDFS from one
Hadoop cluster to another Hadoop cluster.
This piggybacks on the replication solution in Falcon, which uses the DistCp tool.
Falcon supports the HDFS mirroring extension to replicate data from a source cluster to a destination cluster.
This extension implements replicating arbitrary directories on HDFS and piggybacks on the replication solution in Falcon, which uses the DistCp tool.
It also allows users to replicate data from on-premises clusters to the cloud, either Azure WASB or S3.


Use Case
* Copy directories between HDFS clusters with outdated partitions
* Archive directories from HDFS to the cloud, e.g. S3 or Azure WASB

Limitations
As the data volume and number of files grow, this can become inefficient.
* As the data volume and number of files grow, this can become inefficient.
@@ -14,45 +14,18 @@
# See the License for the specific language governing permissions and
# limitations under the License.

Hive Metastore Disaster Recovery Recipe
Hive Mirroring Extension

Overview
This extension implements replicating Hive metadata and data from one
Hadoop cluster to another Hadoop cluster.
This piggybacks on the replication solution in Falcon, which uses the DistCp tool.
Falcon provides a feature to replicate Hive metadata and data events from a source cluster to a destination cluster.
This is supported for both secure and unsecured clusters through Falcon extensions. Falcon uses the event-based replication capability provided by Hive to implement the Hive mirroring feature.
Falcon acts as the admin/user-facing tool, giving fine-grained control over what and how to replicate as defined by its users, while leaving delta, data and metadata management to Hive itself.
The Hive mirroring extension piggybacks on the DistCp tool for replication.

Use Case
*
*
* Replicate data/metadata of Hive DB & table from source to target cluster

Limitations
*
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Hive Metastore Disaster Recovery Extension
* Currently Hive doesn't support replication events for create database, roles, views, offline tables, direct HDFS writes without registering with the metastore, and database/table name mapping.
Hence the Hive mirroring extension cannot be used to replicate the above-mentioned events between warehouses.

Overview
This extension implements replicating Hive metadata and data from one
Hadoop cluster to another Hadoop cluster.
This piggybacks on the replication solution in Falcon, which uses the DistCp tool.

Use Case
*
*

Limitations
*
@@ -14,20 +14,19 @@
# See the License for the specific language governing permissions and
# limitations under the License.

Hive Disaster Recovery
Hive Mirroring
=======================

Overview
---------

Falcon provides a feature to replicate Hive metadata and data events from one Hadoop cluster
to another cluster. This is supported for secure and unsecured clusters through Falcon recipes.
Falcon provides a feature to replicate Hive metadata and data events from a source cluster to a destination cluster. This is supported for both secure and unsecured clusters through Falcon extensions.


Prerequisites
-------------

Following are the prerequisites to use Hive DR
Following are the prerequisites to use Hive mirroring

* Hive 1.2.0+
* Oozie 4.2.0+
@@ -69,12 +68,9 @@ a. Perform initial bootstrap of Table and Database from one Hadoop cluster to an
b. Setup cluster definition
$FALCON_HOME/bin/falcon entity -submit -type cluster -file /cluster/definition.xml
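
For reference, a minimal cluster definition might look like the sketch below; every endpoint, version and path is an illustrative placeholder, and the registry interface is only required when Hive metadata is being replicated.

<verbatim>
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: endpoints, versions and paths are placeholders -->
<cluster name="primaryCluster" description="Primary cluster" colo="primary-colo"
         xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly" endpoint="hftp://nn.example.com:50070" version="2.7.0"/>
        <interface type="write" endpoint="hdfs://nn.example.com:8020" version="2.7.0"/>
        <interface type="execute" endpoint="rm.example.com:8050" version="2.7.0"/>
        <interface type="workflow" endpoint="http://oozie.example.com:11000/oozie/" version="4.2.0"/>
        <!-- registry (Hive metastore) interface, needed for Hive mirroring -->
        <interface type="registry" endpoint="thrift://metastore.example.com:9083" version="1.2.0"/>
        <interface type="messaging" endpoint="tcp://mq.example.com:61616?daemon=true" version="5.4.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/apps/falcon/primaryCluster/working"/>
    </locations>
</cluster>
</verbatim>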

c. Submit Hive DR recipe
$FALCON_HOME/bin/falcon recipe -name hive-disaster-recovery -operation HIVE_DISASTER_RECOVERY
c. Submit Hive mirroring extension
$FALCON_HOME/bin/falcon extension -submitAndSchedule -extensionName hive-mirroring -file /process/definition.xml
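
The definition file supplies the <name, value> parameters for the job; a sketch of a subset is shown below, with illustrative names and values that should be verified against the extension's properties JSON under its META directory on HDFS.

<verbatim>
# Illustrative subset of Hive mirroring parameters; names and values are placeholders
jobName=hive-mirror-salesDb
jobClusterName=primaryCluster
jobValidityStart=2016-05-03T00:00Z
jobValidityEnd=2018-05-03T00:00Z
jobFrequency=hours(1)
sourceCluster=primaryCluster
sourceMetastoreUri=thrift://metastore.example.com:9083
sourceDatabases=salesDb
sourceTables=*
targetCluster=backupCluster
targetMetastoreUri=thrift://backup-metastore.example.com:9083
distcpMaxMaps=1
distcpMapBandwidth=100
</verbatim>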

Please refer to the Falcon CLI and REST API twiki in the Falcon documentation for more details on using the CLI and REST APIs for extension job and instance management.

Recipe templates for Hive DR are available in addons/recipe/hive-disaster-recovery; copy them to the
recipe path specified in client.properties.

*Note:* If Kerberos security is enabled on the cluster, use the secure templates for Hive DR from
addons/recipe/hive-disaster-recovery.
@@ -922,10 +922,10 @@ The workflow is re-tried after 10 mins, 20 mins and 30 mins. With exponential ba

To enable retries for feed instances, the user will have to set the following properties in runtime.properties:
<verbatim>
falcon.recipe.retry.policy=periodic
falcon.recipe.retry.delay=minutes(30)
falcon.recipe.retry.attempts=3
falcon.recipe.retry.onTimeout=false
falcon.retry.policy=periodic
falcon.retry.delay=minutes(30)
falcon.retry.attempts=3
falcon.retry.onTimeout=false
</verbatim>
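
Retry can also be configured per entity; a sketch of the retry element on a process, with illustrative values, is:

<verbatim>
<!-- Illustrative: retry up to 3 times, 10 minutes apart; exp-backoff widens the gaps instead -->
<retry policy="periodic" delay="minutes(10)" attempts="3"/>
</verbatim>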
---+++ Late data
Late data handling defines how late data should be handled. Each feed is defined with a late cut-off value, which specifies the time till which late data is valid. For example, a late cut-off of hours(6) means that data for the nth hour can be delayed by up to 6 hours. The late data specification in the process defines how this late data is handled.
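
As a sketch, the corresponding entity fragments might look as follows; the input name and workflow path are illustrative.

<verbatim>
<!-- In the feed definition: data for an instance may arrive up to 6 hours late -->
<late-arrival cut-off="hours(6)"/>

<!-- In the process definition: how the late data is handled for each input -->
<late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="impressions" workflow-path="hdfs://nn.example.com:8020/apps/falcon/late-workflow"/>
</late-process>
</verbatim>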
@@ -0,0 +1,55 @@
---+ Falcon Extensions

---++ Overview

A Falcon extension is a static process template with a parameterized workflow that realizes a specific use case and enables non-programmers to capture and re-use very complex business logic. Extensions are defined in server space. The objective of an extension is to solve a standard data management function that can be invoked as a tool via the standard Falcon features (REST API, CLI and UI access).

For example:

* Replicating directories from one HDFS cluster to another (not timed partitions)
* Replicating Hive metadata (database, table, views, etc.)
* Replicating between HDFS and Hive - either way
* Data masking, etc.

---++ Proposal

Falcon provides a Process abstraction that encapsulates the configuration for a user workflow with scheduling controls. All extensions can be modeled as a Process and its dependent feeds within Falcon, which executes the user
workflow periodically. The process and its associated workflow are parameterized. The user provides properties as <name, value> pairs that are substituted by Falcon before scheduling. Falcon translates these extensions
into a process entity by replacing the parameters in the workflow definition.
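
For instance, an HDFS mirroring job might be parameterized with <name, value> pairs along the lines below; the property names are illustrative, and the authoritative list is the extension's properties JSON under its META directory.

<verbatim>
# Illustrative parameters for an hdfs-mirroring job
jobName=hdfs-mirror-logs
jobClusterName=primaryCluster
jobValidityStart=2016-05-03T00:00Z
jobValidityEnd=2018-05-03T00:00Z
jobFrequency=hours(1)
sourceDir=/user/falcon/logs
sourceCluster=primaryCluster
# targetDir may also point at cloud storage, e.g. an s3:// or wasb:// URL
targetDir=/user/falcon/logs-mirror
targetCluster=backupCluster
distcpMaxMaps=1
distcpMapBandwidth=100
</verbatim>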

---++ Falcon extension artifacts to manage extensions

Extension artifacts are published in addons/extensions. Artifacts are expected to be installed on HDFS at the "extension.store.uri" path defined in the startup properties. Each extension is expected to have the below artifacts (see the layout sketch after this list):
* a JSON file under the META directory listing all the required and optional parameters/arguments for scheduling an extension job
* a process entity template, to be scheduled, under the resources directory
* a parameterized workflow under the resources directory
* required libs under the libs directory
* a README describing the functionality achieved by the extension
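
For example, the HDFS mirroring extension would be laid out roughly as sketched below; the META path is the one this documentation uses, while the file names under resources and libs are illustrative.

<verbatim>
<extension.store.uri>/hdfs-mirroring/
    META/hdfs-mirroring-properties.json    <- required/optional job parameters
    README                                 <- functionality achieved by the extension
    libs/                                  <- jars needed by the workflow
    resources/
        hdfs-mirroring-template.xml        <- parameterized process entity template
        hdfs-mirroring-workflow.xml        <- parameterized workflow
</verbatim>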

REST API and CLI support has been added for extension artifact management on HDFS. Please refer to [[falconcli/FalconCLI][Falcon CLI]] and [[restapi/ResourceList][REST API]] for more details.

---++ CLI and REST API support
REST API and CLI support has been added to manage extension jobs and instances.

Please refer to [[falconcli/FalconCLI][Falcon CLI]] and [[restapi/ResourceList][REST API]] for more details on using the CLI and REST APIs for extension job and instance management.
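
A few illustrative invocations are shown below; please confirm the option names against the CLI twiki.

<verbatim>
# List the extensions registered on the server
$FALCON_HOME/bin/falcon extension -enumerate

# Show the parameters an extension accepts
$FALCON_HOME/bin/falcon extension -definition -extensionName hdfs-mirroring

# List jobs submitted against an extension, and the instances of one job
$FALCON_HOME/bin/falcon extension -list -extensionName hdfs-mirroring
$FALCON_HOME/bin/falcon extension -instances -jobName hdfs-mirror-logs
</verbatim>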

---++ Metrics
The HDFS mirroring and Hive mirroring extensions capture replication metrics such as TIMETAKEN, BYTESCOPIED and COPY (number of files copied) for an instance and populate them in the GraphDB.

---++ Sample extensions

Sample extensions are published in addons/extensions.

---++ Types of extensions
* [[HDFSMirroring][HDFS mirroring extension]]
* [[HiveMirroring][Hive mirroring extension]]

---++ Packaging and installation

Extension artifacts in addons/extensions are packaged in the Falcon war under the extensions directory. For manual installation, the user is expected to copy the extension artifacts from the extensions directory in the Falcon war to HDFS at the "extension.store.uri" path defined in the startup properties and then restart Falcon.
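
A sketch of a manual installation, assuming the war is exploded under $FALCON_HOME/server/webapp/falcon and the store is at /apps/falcon/extensions:

<verbatim>
# Copy the packaged extension artifacts to the extension store on HDFS
hadoop fs -mkdir -p /apps/falcon/extensions
hadoop fs -put $FALCON_HOME/server/webapp/falcon/extensions/* /apps/falcon/extensions/

# the startup properties must carry the matching store location, e.g.
#   *.extension.store.uri=hdfs://nn.example.com:8020/apps/falcon/extensions
# then restart Falcon
$FALCON_HOME/bin/falcon-stop
$FALCON_HOME/bin/falcon-start
</verbatim>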

---++ Migration
The recipes framework and HDFS mirroring capability were added in the Apache Falcon 0.6.0 release as client-side logic. With the 0.10 release this has moved to the server side and been renamed server-side extensions. Client-side recipes had only CLI support and required certain pre-steps to get working; this is no longer necessary in the 0.10 release, as new CLI and REST API support has been provided.

If a user is migrating to release 0.10 or above, the old recipe setup and CLIs won't work. For manual installation, the user is expected to copy the extension artifacts to HDFS; please refer to the "Packaging and installation" section above for more details.
Please refer to [[falconcli/FalconCLI][Falcon CLI]] and [[restapi/ResourceList][REST API]] for more details on using the CLI and REST APIs for extension job and instance management.
@@ -13,7 +13,7 @@
* <a href="#Falcon_EL_Expressions">Falcon EL Expressions</a>
* <a href="#Lineage">Lineage</a>
* <a href="#Security">Security</a>
* <a href="#Recipes">Recipes</a>
* <a href="#Extensions">Extensions</a>
* <a href="#Monitoring">Monitoring</a>
* <a href="#Email_Notification">Email Notification</a>
* <a href="#Backwards_Compatibility">Backwards Compatibility Instructions</a>
@@ -738,9 +738,9 @@ lifecycle policies such as replication and retention.

Security is detailed in [[Security][Security]].

---++ Recipes
---++ Extensions

Recipes are detailed in [[Recipes][Recipes]].
Extensions are detailed in [[Extensions][Extensions]].

---++ Monitoring

This file was deleted.

@@ -0,0 +1,27 @@
---+ HDFS Mirroring Extension
---++ Overview
Falcon supports the HDFS mirroring extension to replicate data from a source cluster to a destination cluster. This extension implements replicating arbitrary directories on HDFS and piggybacks on the replication solution in Falcon, which uses the DistCp tool. It also allows users to replicate data from on-premises clusters to the cloud, either Azure WASB or S3.

---++ Use Case
* Copy directories between HDFS clusters with outdated partitions
* Archive directories from HDFS to the cloud, e.g. S3 or Azure WASB

---++ Limitations
As the data volume and number of files grow, this can become inefficient.

---++ Usage
---+++ Setup source and destination clusters
<verbatim>
$FALCON_HOME/bin/falcon entity -submit -type cluster -file /cluster/definition.xml
</verbatim>

---+++ HDFS mirroring extension properties
Extension artifacts are expected to be installed on HDFS at the "extension.store.uri" path defined in the startup properties. The hdfs-mirroring-properties.json file, located at "<extension.store.uri>/hdfs-mirroring/META/hdfs-mirroring-properties.json", lists all the required and optional parameters/arguments for scheduling an HDFS mirroring job.
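
To inspect the available parameters directly, the file can simply be read off HDFS:

<verbatim>
hadoop fs -cat <extension.store.uri>/hdfs-mirroring/META/hdfs-mirroring-properties.json
</verbatim>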

---+++ Submit and schedule HDFS mirroring extension

<verbatim>
$FALCON_HOME/bin/falcon extension -submitAndSchedule -extensionName hdfs-mirroring -file /process/definition.xml
</verbatim>

Please refer to [[falconcli/FalconCLI][Falcon CLI]] and [[restapi/ResourceList][REST API]] for more details on using the CLI and REST APIs.

This file was deleted.
