From 45b7ed1af299ef0b8caf4b3f8c702afe1b2edb26 Mon Sep 17 00:00:00 2001
From: vjagadish1989
Date: Mon, 22 May 2017 16:51:38 -0700
Subject: [PATCH 1/3] Improve documentation for Resource Localization

---
 .../yarn/yarn-resource-localization.md        | 59 ++++++-------------
 1 file changed, 19 insertions(+), 40 deletions(-)

diff --git a/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md b/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md
index a55670bd09..4558d1e26d 100644
--- a/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md
+++ b/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md
@@ -18,58 +18,49 @@ title: YARN Resource Localization
    See the License for the specific language governing permissions and
    limitations under the License.
 -->
-
-When Samza jobs run on YARN clusters, sometimes there are needs to preload some files or data (called as resources in this doc) before job starts, such as preparing the job package, fetching job certificate, or etc., Samza supports a general configuration way to localize difference resources.
+When running Samza jobs on YARN clusters, you may need to download some resources before startup (For example, downloading the job binaries, fetching certificate files etc.) This step is called as Resource Localization.

 ### Resource Localization Process

-For the Samza jobs running on YARN, the resource localization leverages the YARN node manager localization service. Here is a good [deep dive](https://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/) from Horton Works on how localization works in YARN.
-
-Depending on where and how the resource comes from, fetching the resource is associated with a scheme in the path, such as `http`, `https`, `hdfs`, `ftp`, `file`, etc., which maps to a certain FileSystem for handling the localization.
+For Samza jobs running on YARN, resource localization leverages the YARN node manager's localization service. Here is a [deep dive](https://hortonworks.com/blog/resource-localization-in-yarn-deep-dive/) on how localization works in YARN.

-If there is an implementation of [FileSystem](https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html) on YARN supporting a scheme, then that scheme can be used for resource localization.
+Depending on where and how the resource comes from, fetching the resource is associated with a scheme in the path (such as `http`, `https`, `hdfs`, `ftp`, `file`, etc). The scheme maps to a corresponding `FileSystem` implementation for handling the localization.

-There are some predefined file systems in Hadoop or Samza, which are provided if you run Samza jobs on YARN:
+There are some predefined `FileSystem` implementations in Hadoop or Samza, which are provided if you run Samza jobs on YARN:

-* `org.apache.samza.util.hadoop.HttpFileSystem`: used for fetching resources based on http, or https without client side authentication requirement.
+* `org.apache.samza.util.hadoop.HttpFileSystem`: used for fetching resources based on http, or https without client side authentication.
 * `org.apache.hadoop.hdfs.DistributedFileSystem`: used for fetching resource from DFS system on Hadoop.
 * `org.apache.hadoop.fs.LocalFileSystem`: used for copying resources from local file system to the job directory.
 * `org.apache.hadoop.fs.ftp.FTPFileSystem`: used for fetching resources based on ftp.
-* ...

-If you would like to have your own file system, you need to implement a class which extends from `org.apache.hadoop.fs.FileSystem`.
+If you would like to have your own file system, you should implement a class which extends from `org.apache.hadoop.fs.FileSystem`.

-### Job Configuration
-With the configuration properly defined, the resources a job requiring from external or internal locations may be prepared automatically before it runs.
-
-For each resource with the name `<resourceName>` in the Samza job, the following set of job configurations are used when running on a YARN cluster. The first one which definiing resource path is required, but the others are optional and they have default values.
+### Resource Configuration
+You can specify a resource to be localized by the following configuration.

+#### Required Configuration
 1. `yarn.resources.<resourceName>.path`
-    * Required
-    * The path for fetching the resource for localization, e.g. http://hostname.com/packages/mySamzaJob
+    * The path for fetching the resource for localization, e.g. http://hostname.com/packages/myResource
+
+#### Optional Configuration
 2. `yarn.resources.<resourceName>.local.name`
-    * Optional
     * The local name used for the localized resource.
-    * If not set, the default one will be `<resourceName>` from the config key.
+    * If it is not set, the default will be the `<resourceName>` specified in `yarn.resources.<resourceName>.path`
 3. `yarn.resources.<resourceName>.local.type`
-    * Optional
-    * Localized resource type with valid values from: `ARCHIVE`, `FILE`, `PATTERN`.
+    * The type of the resource with valid values from: `ARCHIVE`, `FILE`, `PATTERN`.
         * ARCHIVE: the localized resource will be an archived directory;
         * FILE: the localized resource will be a file;
         * PATTERN: the localized resource will be the entries extracted from the archive with the pattern.
-    * If not set, the default value is `FILE`.
+    * If it is not set, the default value is `FILE`.
 4. `yarn.resources.<resourceName>.local.visibility`
-    * Optional
-    * Localized resource visibility for the resource, and it can be a value from `PUBLIC`, `PRIVATE`, `APPLICATION`
+    * Visibility for the resource with valid values from `PUBLIC`, `PRIVATE`, `APPLICATION`
         * PUBLIC: visible to everyone
         * PRIVATE: visible to just the account which runs the job
         * APPLICATION: visible only to the specific application job which has the resource configuration
-    * If not set, the default value is `APPLICATION`
-
-It is up to you how to name the resource, but `<resourceName>` should be the same in the above configurations to apply to the same resource.
+    * If it is not set, the default value is `APPLICATION`

 ### YARN Configuration
-Make sure the scheme used in the yarn.resources.<resourceName>.path is configured in YARN core-site.xml with a FileSystem implementation. For example, for scheme `http`, you should have the following property in YARN core-site.xml:
+Make sure the scheme used in the `yarn.resources.<resourceName>.path` is configured with a corresponding FileSystem implementation YARN core-site.xml.

 {% highlight xml %}
@@ -81,19 +72,7 @@ Make sure the scheme used in the yarn.resources.<resourceName>.path is con
 {% endhighlight %}

-You can override a behavior for a scheme by linking it to another file system. For example, you have a special need for localizing a resource for your job through http request, you may implement your own Http File System by extending [FileSystem](https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html), and have the following configuration:
-
-{% highlight xml %}
-
-
-
-    fs.http.impl
-    com.myCompany.MyHttpFileSystem
-
-
-{% endhighlight %}
-
-If you are using other scheme which is not defined in Hadoop or Samza, for example, `yarn.resources.mySampleResource.path=myScheme://host.com/test`, in your job configuration, you may implement your own [FileSystem](https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html) such as com.myCompany.MySchemeFileSystem and link it with your own scheme in yarn core-site.xml configuration.
+If you are using your own scheme (for example, `yarn.resources.myResource.path=myScheme://host.com/test`), you can implement your own [FileSystem](https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html) and link it with your scheme.
 {% highlight xml %}

From d1fdbf76ea85d2c437969d33dd39c2e8901f22f2 Mon Sep 17 00:00:00 2001
From: vjagadish1989
Date: Mon, 22 May 2017 16:57:06 -0700
Subject: [PATCH 2/3] Improve documentation for Resource Localization

---
 .../documentation/versioned/yarn/yarn-resource-localization.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md b/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md
index 4558d1e26d..74978b6a0c 100644
--- a/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md
+++ b/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md
@@ -72,7 +72,7 @@ Make sure the scheme used in the `yarn.resources..path` is configu
 {% endhighlight %}

-If you are using your own scheme (for example, `yarn.resources.myResource.path=myScheme://host.com/test`), you can implement your own [FileSystem](https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html) and link it with your scheme.
+If you are using your own scheme (for example, `yarn.resources.myResource.path=myScheme://host.com/test`), you can link your [FileSystem](https://hadoop.apache.org/docs/stable/api/index.html?org/apache/hadoop/fs/FileSystem.html) implementation with it as follows.
 {% highlight xml %}

From ba61f70c0535e7541ff4f3199dc1f667ba09dcdc Mon Sep 17 00:00:00 2001
From: vjagadish1989
Date: Mon, 22 May 2017 17:41:51 -0700
Subject: [PATCH 3/3] Improve documentation for Resource Localization

---
 .../versioned/yarn/yarn-resource-localization.md          | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md b/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md
index 74978b6a0c..3d1c87afbe 100644
--- a/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md
+++ b/docs/learn/documentation/versioned/yarn/yarn-resource-localization.md
@@ -26,14 +26,14 @@ For Samza jobs running on YARN, resource localization leverages the YARN node ma

 Depending on where and how the resource comes from, fetching the resource is associated with a scheme in the path (such as `http`, `https`, `hdfs`, `ftp`, `file`, etc). The scheme maps to a corresponding `FileSystem` implementation for handling the localization.

-There are some predefined `FileSystem` implementations in Hadoop or Samza, which are provided if you run Samza jobs on YARN:
+There are some predefined `FileSystem` implementations in Hadoop and Samza, which are provided if you run Samza jobs on YARN:

-* `org.apache.samza.util.hadoop.HttpFileSystem`: used for fetching resources based on http, or https without client side authentication.
+* `org.apache.samza.util.hadoop.HttpFileSystem`: used for fetching resources based on http or https without client side authentication.
 * `org.apache.hadoop.hdfs.DistributedFileSystem`: used for fetching resource from DFS system on Hadoop.
 * `org.apache.hadoop.fs.LocalFileSystem`: used for copying resources from local file system to the job directory.
 * `org.apache.hadoop.fs.ftp.FTPFileSystem`: used for fetching resources based on ftp.

-If you would like to have your own file system, you should implement a class which extends from `org.apache.hadoop.fs.FileSystem`.
+You can create your own file system implementation by creating a class which extends from `org.apache.hadoop.fs.FileSystem`.

 ### Resource Configuration
 You can specify a resource to be localized by the following configuration.
@@ -60,7 +60,7 @@ You can specify a resource to be localized by the following configuration.
     * If it is not set, the default value is `APPLICATION`

 ### YARN Configuration
-Make sure the scheme used in the `yarn.resources.<resourceName>.path` is configured with a corresponding FileSystem implementation YARN core-site.xml.
+Make sure the scheme used in the `yarn.resources.<resourceName>.path` is configured with a corresponding FileSystem implementation in YARN core-site.xml.

 {% highlight xml %}