Commit

polish docs, move to docs dir

pferrel committed Jun 1, 2019
1 parent ef97447 commit aa88533
Showing 42 changed files with 420 additions and 260 deletions.
38 changes: 0 additions & 38 deletions commands.md

This file was deleted.

File renamed without changes.
57 changes: 57 additions & 0 deletions docs/commands.md
@@ -0,0 +1,57 @@
# Commands

Harness includes an Admin command-line interface. It runs over the Harness REST interface and can be run remotely.

## Conventions

Internal to Harness are ***Engine Instances*** that implement some algorithm and contain datasets and configuration parameters. All input data is validated by the engine and must be readable by the algorithm. The simplest form of the workflow is:

1. start server
2. add engine
3. input data to the engine
4. train (for Lambda; Kappa Engines train automatically with each new input)
5. query

See the [Workflow](workflow.md) section for more detail.
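
As a concrete sketch of these steps (the engine-id, file names, event body, and query body are hypothetical, and the REST examples assume Harness's default port):

```bash
# 1. start the server (runs as a daemon)
harness-start

# 2. add an Engine Instance defined by a JSON config file
harnctl add my-engine.json

# 3. send input data to the engine's REST endpoint
curl -X POST -H "Content-Type: application/json" \
  -d '{"event":"buy","entityType":"user","entityId":"u1","targetEntityType":"item","targetEntityId":"i42"}' \
  http://localhost:9090/engines/my_engine/events

# 4. train (Lambda engines only; Kappa engines train on every input)
harnctl train my_engine

# 5. query for results (the query body depends on the Engine type)
curl -X POST -H "Content-Type: application/json" \
  -d '{"user": "u1"}' \
  http://localhost:9090/engines/my_engine/queries
```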

Harness uses resource-ids to identify all objects in the system. Each Engine Instance must have a JSON file, which contains all parameters for Harness engine management, including the Engine Instance's resource-id as well as algorithm parameters that are specific to the Engine type. All Harness global configuration is stored in `harness-env`; see [Harness Config](harness_config.md) for details.

- The file `<some-engine.json>` can be named anything and put anywhere.
- The working copy of all engine parameters and input data is actually kept in a shared database. Add or update an Engine Instance to change its configuration; changing the file will not update the Engine Instance. See the `add` and `update` commands.
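
As a sketch, a minimal `some-engine.json` might look like the following. `engineFactory` is the parameter referenced by the `add` command below; the factory class, `engineId` value, and algorithm parameters are hypothetical placeholders for a real Engine type:

```bash
# write a minimal, hypothetical engine config; real Engines define
# their own algorithm parameters
cat > my-engine.json <<'EOF'
{
  "engineId": "my_engine",
  "engineFactory": "com.actionml.engines.some.SomeEngine",
  "algorithm": {
    "someParam": "some-value"
  }
}
EOF
```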

# Harness Start and Stop

Scripts that start and stop Harness are included with the project in `sbin/`. These are used inside container startup and can be used directly in an OS-level installation.

- **`harness-start [-f]`** starts the Harness server based on configuration in `harness-env`. The `-f` argument forces a restart if Harness is already running. All other commands require the service to be running; it is always started as a daemon/background process. All previously configured engines are started in the state they were in when Harness was last run.

- **`harness-stop`** gracefully stops harness and all engines. If the pid-file has become out of sync, look for the `HarnessServer` process with `jps -lm` or `ps aux | grep HarnessServer` and execute `kill <pid>` to stop it.
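
A short sketch of a start/stop cycle, including the fallback for a stale pid-file:

```bash
harness-start        # start as a daemon, using configuration in harness-env
harness-start -f     # force a restart if Harness is already running

harness-stop         # gracefully stop Harness and all engines

# if harness-stop fails because the pid-file is out of sync, find the
# HarnessServer process and kill it manually
jps -lm | grep HarnessServer
kill <pid>
```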

# Harness Administration

- **`harnctl status [engines [<engine-id>], users [<user-id>]]`** prints status information about the requested objects. Asking for user status requires the Harness Auth-server, which is optional.
- **`harnctl add <some-engine.json>`** creates and starts an Engine Instance of the type defined by the `engineFactory` parameter.
- **`harnctl update <some-engine.json>`** updates an existing Engine Instance with values defined in `some-engine.json`. The Engine knows what is safe to update and may warn if some value is not updatable, but this will be rare.
- **`harnctl delete <some-engine-id>`** stops the Engine Instance and deletes the accumulated dataset and model. No artifacts of the Engine Instance will remain except the `some-engine.json` file and any mirrored events.
- **`harnctl import <some-engine-id> [<some-directory> | <some-file>]`** is typically used to replay previously mirrored events or load bootstrap datasets created from application logs. It is equivalent to sending all imported events to the REST API.
- **`harnctl export <some-engine-id> [<some-directory> | <some-file>]`** exports Events. If the directory is supplied with the protocol `file:`, the export goes to the Harness server host's file system. This is for use with vertically scaled Harness. For more general storage use HDFS (the Hadoop File System), flagged by the protocol `hdfs`, for example: `hdfs://some-hdfs-server:9000/users/<some-user>/<some-directory>`. [**to be implemented in 0.5.0**]
- **`harnctl train <some-engine-id>`** creates or updates a model for Lambda-style engines like the UR. This is required for Lambda Engines before queries will return values.
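
Putting these commands together, a typical admin session might look like the following sketch (the config file and engine-id are hypothetical):

```bash
harnctl add my-engine.json         # create and start the Engine Instance
harnctl status engines my_engine   # verify that it is running
harnctl import my_engine /path/to/bootstrap-events/
harnctl train my_engine            # Lambda engines only
harnctl update my-engine.json      # push changed parameters
harnctl delete my_engine           # stop it and delete dataset and model
```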

# Harness Auth-server Administration

There are several extended commands that manage Users and Roles. These are only needed when using the Harness Auth-server to create secure multi-tenancy. Open multi-tenancy is the default and requires no Auth-server.

- **`harnctl user-add [client <engine-id> | admin]`** Returns a new user-id and their secret. Grants the role's permissions. Client Users have access to one or more `engine-id`s, `admin` Users have access to all `engine-id`s as well as admin only commands and REST endpoints.
- **`harnctl user-delete <user-id>`** removes all access for the `user-id`.
- **`harnctl grant <user-id> [client <engine-id> | admin]`** adds permissions to an existing user.
- **`harnctl revoke <user-id> [client <engine-id> | admin]`** removes permissions from an existing user.
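
For example, a sketch of creating a client User for one Engine Instance, then changing and finally removing that user's access (`<user-id>` stands for the id returned by `user-add`; the engine-ids are hypothetical):

```bash
harnctl user-add client my_engine       # returns a new user-id and secret
harnctl grant <user-id> client other_engine
harnctl revoke <user-id> client other_engine
harnctl user-delete <user-id>           # removes all access for the user
```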

# Bootstrapping With Import

Import can be used to restore backed-up data, but also to bootstrap a new Engine Instance with previously logged or collected batches of data. Imagine a recommender that takes in people's purchase history. This history might exist in server logs, and converting these logs to files of JSON events is an easy and reproducible way to "bootstrap" your recommender with past data before you start to send live events. This, in effect, trains your recommender retroactively, improving the quality of recommendations at its first startup.
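
A sketch of such a bootstrap: convert log records into files with one JSON event per line, then import them before sending live events. The event fields and file name here are hypothetical; each Engine defines the events it accepts:

```bash
# purchases.json: one JSON event per line, converted from server logs
cat > purchases.json <<'EOF'
{"event":"buy","entityType":"user","entityId":"u1","targetEntityType":"item","targetEntityId":"i42","eventTime":"2019-01-01T12:00:00.000Z"}
{"event":"buy","entityType":"user","entityId":"u2","targetEntityType":"item","targetEntityId":"i7","eventTime":"2019-01-02T08:30:00.000Z"}
EOF

harnctl import my_engine purchases.json   # equivalent to POSTing each event
harnctl train my_engine                   # the model now reflects the history
```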

# Backup with Export

[**to be implemented in 0.5.0**] Lambda-style Engines, which store all Events, usually support `harnctl export ...`. This command creates files with a single JSON Event per line, in the same format as the [Mirror](mirroring.md) function. To back up an Engine Instance, export and store the files somewhere safe. These files can be re-imported to re-create the input DB and, after training, the model.
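
Once export is implemented, a backup/restore round trip might look like this sketch (the engine-id and HDFS path are hypothetical, following the export example above):

```bash
# back up: export all stored Events to HDFS
harnctl export my_engine hdfs://some-hdfs-server:9000/users/backup/my_engine/

# restore: re-import into a (possibly new) Engine Instance, then retrain
harnctl import my_engine hdfs://some-hdfs-server:9000/users/backup/my_engine/
harnctl train my_engine
```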

Engines that follow the Kappa style do not save input but rather update the model with every new input Event, so use [Mirroring](mirroring.md) to log each new Event. In a sense this is an automatic backup that can also be used to re-instantiate a Kappa-style model.
File renamed without changes.
@@ -27,7 +27,7 @@ An Engine is the API contract. This contract is embodied in the `core/engine` pa
- **Engine**: The Engine class is the "controller" in the MVC use of the term. It takes all input, parses and validates it, then understands where to send it and how to report errors. It also fields queries, processing them in much the same way. It creates Datasets and Algorithms and understands when to trigger functionality for them. The basic API is defined, but any additional workflow or objects needed may be defined ad hoc where Harness does not provide sufficient support in some way. For instance, a compute engine or storage method can be defined ad hoc if Harness does not provide sufficient default support.
- **Algorithm**: The Algorithm understands where data is and implements the Machine Learning part of the system. It converts Datasets into Models and implements the Queries. The Engine controls initialization of the Algorithm with parameters and triggers training and queries. For Kappa-style learning the Engine triggers training on every input, so the Algorithm may spin up Akka-based Actors to perform continuous training. For Lambda, training may be periodic, performed less often, and may use a compute engine like Spark or TensorFlow. There is nothing in Harness that enforces the methodology or compute platform used, but several are given default support to make their use easier.
- **Dataset**: A Dataset may be comprised of event streams and mutable data structures in any combination. The state of these is always stored outside of Harness in a separate scalable sharable store. The Engine usually implements a `dal` or data access layer, which may have any number of backing stores in the basic DAL/DAO/DAOImplementation pattern. The full pattern is implemented in the `core/dal` for one type of object of common use (Users) and for one store (MongoDB). But the idea is easily borrowed and modified. For instance the Contextual Bandit and Navigation Hinting Engines extend this pattern for their own collections. If the Engine uses a store supported by Harness then it can inject the implementation so that Harness and all of its Engines can target different stores using config to define which is used for any installation of Harness. This uses ScalDI for lightweight dependency injection.
- **DAL**: The Data Access Layer is a very lightweight way of accessing some external persistence. The knowledge of how to use the store is contained in anything that uses it. The DAL contains an abstract API in the DAO, which is implemented in a DAOImpl. There may be many DAOImpls for many stores; the default is MongoDB but others are possible.
- **Administrator**: Each of the above classes is extended by a specific Engine, and the Administrator handles CRUD operations on the Engine instances, which in turn manage their constituent Algorithms and Datasets. The Administrator manages the state of the system through persistence in a scalable DB (default is MongoDB) as well as attaching the Engine to the input and query REST endpoints corresponding to the resource-id assigned to the Engine instance.
- **Parameters**: are JSON descriptions of Engine instance initialization information. They may contain information for the Administrator, Engine, Algorithm, and Dataset. The first key piece of information is the Engine ID, which is used as the unique resource-id in the REST API. The admin CLI may take the JSON from a file when creating a new engine or re-initializing one, but internally the Parameters are stored persistently and only changed when an Engine instance is updated or re-created. After an Engine instance is added to Harness, the params are read from the metadata in the shared DB.

File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
15 changes: 15 additions & 0 deletions docs/mirroring.md
@@ -0,0 +1,15 @@
# Input Mirroring

Harness will mirror (log) all raw events with no validation when configured to do so for a specific Engine Instance. For online learning, as in Kappa-style Engines, this is the only way to back up data since Events are not stored. For Lambda-style learning, as in the Universal Recommender, you may choose to mirror or to export data periodically.

Mirroring is useful if you want to be able to back up and restore all data, or if you are experimenting with changes in engine parameters and wish to recreate the models using past mirrored data.

To accomplish this, you must set up mirroring for the Engine Instance. Once the Engine Instance is launched with a mirrored configuration, all events sent to `POST /engines/<engine-id>/events` will be mirrored to a location set in `some-engine.json`. **Note**: Events will be mirrored until the config setting is changed, so mirror files can grow without limit, like unrotated server logs.

To enable mirroring, add the following to the `some-engine.json` for the Engine Instance you want to mirror:

"mirrorType": "localfs" | "hdfs", // optional, turn on a type of mirroring
"mirrorContainer": "path/to/mirror", // optional, where to mirror input

Mirroring is similar to logging. Each new Event is logged to a file before any validation. The format is JSON, one event per line. This can be used to back up an Engine Instance or to move data to a new instance.
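
For illustration, a mirror file might contain lines like the following; the field names follow the common Harness event form, but the values and file path are hypothetical:

```bash
# each line of a mirror file is one raw Event, exactly as it was POSTed
head -2 path/to/mirror/my_engine/events.json
# {"event":"buy","entityType":"user","entityId":"u1","targetEntityType":"item","targetEntityId":"i42","eventTime":"2019-01-01T12:00:00.000Z"}
# {"event":"page-view","entityType":"user","entityId":"u2","targetEntityType":"item","targetEntityId":"i7","eventTime":"2019-01-02T08:30:00.000Z"}
```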

File renamed without changes.
File renamed without changes.
