docs: full copy-edit of webHDFS page #11516

Merged · 2 commits · Aug 14, 2023 · Changes from 1 commit
25 changes: 16 additions & 9 deletions docs/use-cases/webhdfs.md
@@ -1,14 +1,16 @@
# Using webHDFS protocol via HTTP artifacts
# webHDFS via HTTP artifacts

webHDFS is a protocol allowing to access Hadoop or similar a data storage via a unified REST API (<https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html>).
[webHDFS](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html) is a protocol that allows access to Hadoop or similar data storage via a unified REST API.

## Input Artifacts

In order to use the webHDFS protocol we will make use of HTTP artifacts, where the URL will be set to the webHDFS endpoint including the file path and all its query parameters. Suppose, our webHDFS endpoint is available under `https://mywebhdfsprovider.com/webhdfs/v1/` and we have a file `my-art.txt` located in a `data` folder, which we want to use as an input artifact. To construct the HTTP URL we need to append the file path to the base webHDFS endpoint and set the [OPEN operation](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Open_and_Read_a_File) in the HTTP URL parameter. This results in the following URL: `https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN`. This is all you need for webHDFS input artifacts to work! Now, when run, the workflow will download the specified webHDFS artifact into the given `path`. There are some additional fields that can be set for HTTP artifacts (e.g. HTTP headers), which you can find in the [full webHDFS example](https://github.com/argoproj/argo-workflows/blob/master/examples/webhdfs-input-output-artifacts.yaml).
You can use [HTTP artifacts](../walk-through/hardwired-artifacts.md) to connect to webHDFS, where the URL will be the webHDFS endpoint including the file path and any query parameters.
Suppose your webHDFS endpoint is available under `https://mywebhdfsprovider.com/webhdfs/v1/` and you have a file `my-art.txt` located in a `data` folder, which you want to use as an input artifact. To construct the URL, you append the file path to the base webHDFS endpoint and set the [OPEN operation](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Open_and_Read_a_File) as a query parameter. The result is: `https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN`.
See the Workflow below, which downloads the specified webHDFS artifact into the given `path`:

```yaml
spec:
  [...]
  # ...
  inputs:
    artifacts:
      - name: my-art
@@ -17,13 +19,16 @@ spec:
          url: "https://mywebhdfsprovider.com/webhdfs/v1/file.txt?op=OPEN"
```

Additional fields can be set for HTTP artifacts (for example, headers). See usage in the [full webHDFS example](https://github.com/argoproj/argo-workflows/blob/master/examples/webhdfs-input-output-artifacts.yaml).
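
For instance, a custom header could be attached to the input artifact as in this minimal sketch (the header name and value are illustrative placeholders, not part of the original example):

```yaml
http:
  url: "https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN"
  # Additional headers are sent verbatim with the HTTP request to the webHDFS endpoint.
  # The header name and value below are placeholders.
  headers:
    - name: CustomHeader
      value: CustomValue
```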

## Output Artifacts

In order to declare a webHDFS output artifact, little change is necessary: We only need to change the webHDFS operation to the [CREATE operation](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Create_and_Write_to_a_File) and set the file path to where we want the output artifact to be stored. In this example we want to store the artifact under `outputs/newfile.txt`. We also supply the optional overwrite parameter `overwrite=true` to allow overwriting existing files in the webHDFS provider's data storage. If the `overwrite` flag is unset, the default behavior is used, which depends on the particular webHDFS provider. Below shows the example output artifact:
To declare a webHDFS output artifact, instead use the [CREATE operation](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Create_and_Write_to_a_File) and set the file path to your desired location.
In the example below, the artifact will be stored at `outputs/newfile.txt`. You can [overwrite](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Overwrite) existing files with `overwrite=true`.

```yaml
spec:
  [...]
  # ...
  outputs:
    artifacts:
      - name: my-art
@@ -34,12 +39,14 @@ spec:
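
For illustration, a complete version of such an output artifact might look like the following sketch, assuming the same hypothetical endpoint, a placeholder container path `/my-artifact`, and an output location of `outputs/newfile.txt`:

```yaml
spec:
  # ...
  outputs:
    artifacts:
      - name: my-art
        path: /my-artifact
        http:
          # CREATE writes the file; overwrite=true replaces it if it already exists.
          url: "https://mywebhdfsprovider.com/webhdfs/v1/outputs/newfile.txt?op=CREATE&overwrite=true"
```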

## Authentication

Above example showed a minimal use case without any authentication. However, in a real-world scenario, you may want to provide some authentication option. Currently, Argo Workflows' HTTP artifacts support the following authentication mechanisms:
The above examples show minimal use cases without authentication. However, in a real-world scenario, you may want to use authentication.
The available authentication mechanisms are limited to those supported by HTTP artifacts:

- HTTP Basic Auth
- OAuth2
- Client Certificates

Hence, the authentication mechanism that can be used for webHDFS artifacts are limited to those supported by HTTP artifacts. Examples for the latter two authentication mechanisms can be found in the [webHDFS example file](https://github.com/argoproj/argo-workflows/blob/master/examples/webhdfs-input-output-artifacts.yaml).
Examples for the latter two mechanisms can be found in the [full webHDFS example](https://github.com/argoproj/argo-workflows/blob/master/examples/webhdfs-input-output-artifacts.yaml).
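
For completeness, a minimal sketch of HTTP Basic Auth on a webHDFS artifact could look like this (the Secret name and keys are assumptions for illustration):

```yaml
http:
  url: "https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN"
  auth:
    basicAuth:
      # Credentials come from a Kubernetes Secret; the name and keys below are placeholders.
      usernameSecret:
        name: my-webhdfs-credentials
        key: username
      passwordSecret:
        name: my-webhdfs-credentials
        key: password
```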

**Limitation**: Apache Hadoop itself only supports authentication via Kerberos SPNEGO and Hadoop delegation token (see <https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Authentication>). While the former one is currently not supported for HTTP artifacts a usage of delegation tokens can be realized by supplying the authentication token in the HTTP URL of the respective input or output artifact.
!!! Warning "Provider dependent"
    While your webHDFS provider may support the above mechanisms, Hadoop _itself_ only supports [authentication](https://hadoop.apache.org/docs/r3.3.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Authentication) via Kerberos SPNEGO and Hadoop delegation token. HTTP artifacts do not currently support SPNEGO, but delegation tokens can be used via the `delegation` query parameter.
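
For example, a delegation token (shown as a placeholder) can be appended directly to the artifact URL:

```yaml
http:
  # <token> stands in for a Hadoop delegation token issued by the cluster.
  url: "https://mywebhdfsprovider.com/webhdfs/v1/data/my-art.txt?op=OPEN&delegation=<token>"
```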