Merged
3 changes: 2 additions & 1 deletion website/src/_includes/section-menu/documentation.html
Original file line number Diff line number Diff line change
@@ -260,7 +260,8 @@

<ul class="section-nav-list">
<li><a href="{{ site.baseurl }}/documentation/runtime/model/">Execution model</a></li>
<li><a href="{{ site.baseurl }}/documentation/runtime/environments/">Runtime environments</a></li>
<li><a href="{{ site.baseurl }}/documentation/runtime/environments/">Container environments</a></li>
<li><a href="{{ site.baseurl }}/documentation/runtime/sdk-harness-config/">SDK Harness Configuration</a></li>
</ul>
</li>

4 changes: 2 additions & 2 deletions website/src/documentation/runtime/environments.md
@@ -1,6 +1,6 @@
---
layout: section
title: "Runtime environments"
title: "Container environments"
section_menu: section-menu/documentation.html
permalink: /documentation/runtime/environments/
---
@@ -18,7 +18,7 @@ See the License for the specific language governing permissions and
limitations under the License.
-->

# Runtime environments
# Container environments

The Beam SDK runtime environment is isolated from other runtime systems because it is [containerized](https://s.apache.org/beam-fn-api-container-contract) with [Docker](https://www.docker.com/). As a result, any execution engine can run the Beam SDK.

57 changes: 57 additions & 0 deletions website/src/documentation/runtime/sdk-harness-config.md
@@ -0,0 +1,57 @@
---
layout: section
title: "SDK Harness Configuration"
section_menu: section-menu/documentation.html
permalink: /documentation/runtime/sdk-harness-config/
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# SDK Harness Configuration

Beam allows configuration of the [SDK harness]({{ site.baseurl }}/roadmap/portability/) to
accommodate varying cluster setups.
(The options below are described for the Python SDK, but most of them apply to the Java and Go
SDKs as well.)

- `environment_type` determines where user code is executed.
  `environment_config` configures that environment; its format depends on the value of `environment_type`.
  - `DOCKER` (default): User code is executed within a container started on each worker node.
    This requires Docker to be installed on worker nodes.
- `environment_config`: URL for the Docker container image. Official Docker images
are available [here](https://hub.docker.com/u/apachebeam) and are used by default.
Alternatively, you can build your own image by following the instructions
[here]({{ site.baseurl }}/documentation/runtime/environments/).
- `PROCESS`: User code is executed by processes that are automatically started by the runner on
each worker node.
    - `environment_config`: JSON of the form `{"os": "<OS>", "arch": "<ARCHITECTURE>",
      "command": "<process to execute>", "env": {"<ENV_VAR>": "<ENV_VAL>"}}`. All
      fields in the JSON are optional except `command`.
- For `command`, it is recommended to use the bootloader executable, which can be built from
source with `./gradlew :sdks:python:container:build` and copied from
`sdks/python/container/build/target/launcher/linux_amd64/boot` to worker machines.
Note that the Python bootloader assumes Python and the `apache_beam` module are installed
on each worker machine.
- `EXTERNAL`: User code will be dispatched to an external service. For example, one can start
an external service for Python workers by running
`docker run -p=50000:50000 apachebeam/python3.6_sdk --worker_pool`.
- `environment_config`: Address for the external service, e.g. `localhost:50000`.
- To access a Dockerized worker pool service from a Mac or Windows client, set the
`BEAM_WORKER_POOL_IN_DOCKER_VM` environment variable on the client:
`export BEAM_WORKER_POOL_IN_DOCKER_VM=1`.
- `LOOPBACK`: User code is executed within the same process that submitted the pipeline. This
option is useful for local testing. However, it is not suitable for a production environment,
as it performs work on the machine the job originated from.
- `environment_config` is not used for the `LOOPBACK` environment.
- `sdk_worker_parallelism` sets the number of SDK workers that will run on each worker node.
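
Taken together, the options above can be assembled into a list of pipeline arguments. The sketch below (Python, stdlib only) builds an `environment_config` for the `PROCESS` environment type; the flag names come from this page, while the bootloader path, OS, architecture, and environment variable are illustrative placeholders:

```python
import json

# Illustrative environment_config for environment_type=PROCESS.
# Only "command" is required; the values here are placeholders.
process_config = {
    "os": "linux",
    "arch": "amd64",
    "command": "/opt/apache/beam/boot",  # hypothetical bootloader location
    "env": {"EXAMPLE_VAR": "example_value"},  # optional environment variables
}

pipeline_args = [
    "--environment_type=PROCESS",
    "--environment_config=" + json.dumps(process_config),
    "--sdk_worker_parallelism=2",
]

# These arguments would then be passed to the pipeline on submission,
# e.g. via PipelineOptions(pipeline_args).
print(pipeline_args[1])
```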
22 changes: 3 additions & 19 deletions website/src/roadmap/portability.md
@@ -41,7 +41,8 @@ vice versa. The framework introduces a new runner, the _Universal
Local Runner (ULR)_, as a practical reference implementation that
complements the direct runners. Finally, it enables cross-language
pipelines (sharing I/O or transformations across SDKs) and
user-customized execution environments ("custom containers").
user-customized [execution environments]({{ site.baseurl }}/documentation/runtime/environments/)
("custom containers").

The portability API consists of a set of smaller contracts that
isolate SDKs and runners for job submission, management and
@@ -167,21 +168,4 @@ Python streaming mode is not yet supported on Spark.

## SDK Harness Configuration {#sdk-harness-config}

The Beam Python SDK allows configuration of the SDK harness to accommodate varying cluster setups.

- `environment_type` determines where user code will be executed.
- `LOOPBACK`: User code is executed within the same process that submitted the pipeline. This
option is useful for local testing. However, it is not suitable for a production environment,
as it requires a connection between the original Python process and the worker nodes, and
performs work on the machine the job originated from, not the worker nodes.
- `PROCESS`: User code is executed by processes that are automatically started by the runner on
each worker node.
- `DOCKER` (default): User code is executed within a container started on each worker node.
This requires docker to be installed on worker nodes. For more information, see
[here]({{ site.baseurl }}/documentation/runtime/environments/).
- `environment_config` configures the environment depending on the value of `environment_type`.
- When `environment_type=DOCKER`: URL for the Docker container image.
- When `environment_type=PROCESS`: JSON of the form `{"os": "<OS>", "arch": "<ARCHITECTURE>",
"command": "<process to execute>", "env":{"<Environment variables 1>": "<ENV_VAL>"} }`. All
fields in the JSON are optional except `command`.
- `sdk_worker_parallelism` sets the number of SDK workers that will run on each worker node.
See [here]({{ site.baseurl }}/documentation/runtime/sdk-harness-config/) for more information on SDK harness deployment options.