From f1e90289a62a27a2e1c5a240a2db7d74cf8103a5 Mon Sep 17 00:00:00 2001
From: Anton Gilgur
Date: Thu, 3 Aug 2023 17:24:00 -0400
Subject: [PATCH] docs: full copy-edit of webHDFS page

Stylistic changes:

- use more direct language, per [k8s style guide](https://kubernetes.io/docs/contribute/style/style-guide/#use-simple-and-direct-language)
    - "Using webHDFS protocol via HTTP artifacts" -> "webHDFS via HTTP artifacts"
    - "In order to use [...] we will make use of" -> "You can use"
    - "HTTP URL" -> "URL"
        - also consistently use the term "query parameter" instead of "HTTP URL parameter", which is ambiguous (path params, query params, etc)
    - "need to append" -> "append"
    - "This results in the following URL" -> "The result is"
    - "Now, when run, the workflow will" -> "The workflow will"
    - "There are some additional fields that can be set" -> "Additional fields can be set"
    - "In order to declare" -> "To declare"
    - "need to change [...] to" -> "instead use"
    - "where we want the [...] to be stored" -> "your desired location"
    - "want to store the artifact under" -> "artifact will be stored at"
    - "also supply the optional overwrite [...] to allow [...]" -> "can overwrite [...] with"
    - "may want to provide some authentication option" -> "may want to use authentication"
    - "a usage of [...] can be realized by supplying" -> "can be used via"
    - several other more complex / multi-sentence simplifications
- use in-line links, per [k8s style guide](https://kubernetes.io/docs/contribute/style/style-guide/#links)
    - consistently use "the full webHDFS example" when referring to the main example (instead of differing descriptions for the same example)
- remove "all you need", "little change", "only", etc per [k8s style guide](https://kubernetes.io/docs/contribute/style/style-guide/#avoid-words-that-assume-a-specific-level-of-understanding)
- address the reader as "you", per [k8s style guide](https://kubernetes.io/docs/contribute/style/style-guide/#address-the-reader-as-you)
- replace Latin phrases, per [k8s style guide](https://kubernetes.io/docs/contribute/style/style-guide/#avoid-latin-phrases)
- move "Additional fields" sentence to after the in-line example, as the example does not use those and it is a side note
- comment out ellipses in code blocks so that they are more copyable / valid syntax
- use an infobox instead of "**Limitation**:" to be consistent with the rest of the docs
- minor grammatical fixes
    - extra commas, extra articles, etc

Semantic changes:

- add links to HTTP artifacts and the `overwrite` parameter
- remove note about `overwrite` depending on the provider; the (newly linked) docs specify that it defaults to `false`
- clarify the difference between Hadoop native auth and provider-dependent auth
    - specify that tokens are used via the `delegation` query param, per the linked docs

Signed-off-by: Anton Gilgur
---
 docs/use-cases/webhdfs.md | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/docs/use-cases/webhdfs.md b/docs/use-cases/webhdfs.md
index 8acbfc80fc4a..21e8880a3bb3 100644
--- a/docs/use-cases/webhdfs.md
+++ b/docs/use-cases/webhdfs.md
@@ -1,14 +1,16 @@
-# Using webHDFS protocol via HTTP artifacts
+# webHDFS via HTTP artifacts
 
-webHDFS is a protocol allowing to access Hadoop or similar a data storage via a unified REST API (<https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html>).
+[webHDFS](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html) is a protocol that allows you to access Hadoop or similar data storage via a unified REST API.
 
 ## Input Artifacts
 
-In order to use the webHDFS protocol we will make use of HTTP artifacts, where the URL will be set to the webHDFS endpoint including the file path and all its query parameters. Suppose, our webHDFS endpoint is available under `https://mywebhdfsprovider.com/webhdfs/v1/` and we have a file `my-art.txt` located in a `data` folder, which we want to use as an input artifact. To construct the HTTP URL we need to append the file path to the base webHDFS endpoint and set the [OPEN operation](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Open_and_Read_a_File) in the HTTP URL parameter. This results in the following URL: `https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN`. This is all you need for webHDFS input artifacts to work! Now, when run, the workflow will download the specified webHDFS artifact into the given `path`. There are some additional fields that can be set for HTTP artifacts (e.g. HTTP headers), which you can find in the [full webHDFS example](https://github.com/argoproj/argo-workflows/blob/master/examples/webhdfs-input-output-artifacts.yaml).
+You can use [HTTP artifacts](../walk-through/hardwired-artifacts.md) to connect to webHDFS, where the URL will be the webHDFS endpoint including the file path and any query parameters.
+Suppose your webHDFS endpoint is available under `https://mywebhdfsprovider.com/webhdfs/v1/` and you have a file `my-art.txt` located in a `data` folder, which you want to use as an input artifact. To construct the URL, you append the file path to the base webHDFS endpoint and set the [OPEN operation](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Open_and_Read_a_File) via query parameter. The result is: `https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN`.
+See the below Workflow which will download the specified webHDFS artifact into the specified `path`:
 
 ```yaml
 spec:
-  [...]
+  # ...
   inputs:
     artifacts:
     - name: my-art
@@ -17,13 +19,16 @@ spec:
         url: "https://mywebhdfsprovider.com/webhdfs/v1/file.txt?op=OPEN"
 ```
 
+Additional fields can be set for HTTP artifacts (for example, headers). See usage in the [full webHDFS example](https://github.com/argoproj/argo-workflows/blob/master/examples/webhdfs-input-output-artifacts.yaml).
+
 ## Output Artifacts
 
-In order to declare a webHDFS output artifact, little change is necessary: We only need to change the webHDFS operation to the [CREATE operation](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Create_and_Write_to_a_File) and set the file path to where we want the output artifact to be stored. In this example we want to store the artifact under `outputs/newfile.txt`. We also supply the optional overwrite parameter `overwrite=true` to allow overwriting existing files in the webHDFS provider's data storage. If the `overwrite` flag is unset, the default behavior is used, which depends on the particular webHDFS provider. Below shows the example output artifact:
+To declare a webHDFS output artifact, instead use the [CREATE operation](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Create_and_Write_to_a_File) and set the file path to your desired location.
+In the below example, the artifact will be stored at `outputs/newfile.txt`. You can [overwrite](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Overwrite) existing files with `overwrite=true`.
 
 ```yaml
 spec:
-  [...]
+  # ...
   outputs:
     artifacts:
     - name: my-art
@@ -34,12 +39,14 @@
 
 ## Authentication
 
-Above example showed a minimal use case without any authentication. However, in a real-world scenario, you may want to provide some authentication option. Currently, Argo Workflows' HTTP artifacts support the following authentication mechanisms:
+The above examples show minimal use cases without authentication. However, in a real-world scenario, you may want to use authentication.
+The authentication mechanisms are limited to those supported by HTTP artifacts:
 
 - HTTP Basic Auth
 - OAuth2
 - Client Certificates
 
-Hence, the authentication mechanism that can be used for webHDFS artifacts are limited to those supported by HTTP artifacts. Examples for the latter two authentication mechanisms can be found in the [webHDFS example file](https://github.com/argoproj/argo-workflows/blob/master/examples/webhdfs-input-output-artifacts.yaml).
+Examples for the latter two mechanisms can be found in the [full webHDFS example](https://github.com/argoproj/argo-workflows/blob/master/examples/webhdfs-input-output-artifacts.yaml).
 
-**Limitation**: Apache Hadoop itself only supports authentication via Kerberos SPNEGO and Hadoop delegation token (see <https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Authentication>). While the former one is currently not supported for HTTP artifacts a usage of delegation tokens can be realized by supplying the authentication token in the HTTP URL of the respective input or output artifact.
+!!! Warning "Provider dependent"
+    While your webHDFS provider may support the above mechanisms, Hadoop _itself_ only supports [authentication](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Authentication) via Kerberos SPNEGO and Hadoop delegation token. HTTP artifacts do not currently support SPNEGO, but delegation tokens can be used via the `delegation` query parameter.
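
Note: the warning above describes the delegation token mechanism only in prose. As a minimal sketch of what that looks like in practice (reusing the example endpoint and artifact from this page; `<token>` is a placeholder value, not part of the patch), the token is appended to the artifact URL as an extra query parameter:

```yaml
spec:
  # ...
  inputs:
    artifacts:
    - name: my-art
      path: my-art.txt
      http:
        # `op=OPEN` selects the read operation; the Hadoop delegation token is
        # passed alongside it via the `delegation` query parameter.
        url: "https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN&delegation=<token>"
```

The same query parameter can be appended to an output artifact URL next to `op=CREATE`.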