diff --git a/commands.md b/commands.md deleted file mode 100644 index 80fd5180..00000000 --- a/commands.md +++ /dev/null @@ -1,38 +0,0 @@ -# Commands - -Harness includes an Admin command line interface. It runs using the Harness REST interface and can be run remotely from the Harness server. - -Internal to Harness are ***Engines Instances*** that implement some algorithm and contain datasets and configuration parameters. All input data is validated by the engine, and must be readable by the algorithm. The simple form of workflow is: - - 1. start server - 2. add engine - 3. input data to the engine - 4. train (for Lambda, Kappa will auto train with each new input) - 5. query - -See the workflow section for more detail. - -## The Command Line Interface - -Harness uses resource-ids to identify all objects in the system. The Engine Instance must have a JSON file, which contains all parameters for Harness engine management including the resource-id used as well as algorithm parameters that are specific to the Engine type and instance. All Harness global configuration is stored in `harness-env` see [Harness Config](harness_config.md) for details. - - - The file `` can be named anything and put anywhere. - - The working copy of all engine parameters and input data is actually in a shared database. Add or update an Engine Instance to change its configuraiton. Changing the file will not update the Engine Instance. See the `add` and `update` commands. - -# Harness Start and Stop: - - - **`harness start [-f]`** starts the harness server based on configuration in `harness-env`. The `-f` argument forces a restart if Harness is already running. All other commands require the service to be running, it is always started as a daemon/background process. All previously configured engines are started in the state they were in when harness was last run. - - - **`harness stop`** gracefully stops harness and all engines. If the pid-file has become out of sync, look for the `HarnessServer` process with `jps -lm` or `ps aux | grep HarnessServer` and execute `kill ` to stop it. - -# Engine Management - - - **`harness add `** creates and starts an Engine Instance of the type defined by the `engineFactory` parameter. - - **`harness update `** updates an existing Engine Instance with values defined in `some-engine.json`. The Engine knows what is safe to update and may warn if some value is not updatable but this will be rare. - - **`harness delete `** The Engine Instance will be stopped and the accumulated dataset and model will be deleted. No artifacts of the Engine Instance will remain except the `some-engine.json` file and any mirrored events. - - **`harness import [ | ]`** This is typically used to replay previously mirrored events or load bootstrap datasets created from application logs. It is equivalent to sending all imported events to the REST API. - - **`harness train `** For Lambda style engines like the UR this will create or update a model. This is required for Lambda Engines before queries will return values. - -# Bootstrapping With Import - -Import can be used to restore backed up data but also for bootstrapping a new Engine instance with previously logged or collected batches of data. Imagine a recommender that takes in people's purchase history. This might exist in server logs and converting these to files of JSON events is an easy and reproducible way to "bootstrap" your recommender with previous data before you start to send live events. 
This, in effect, trains your recommender retro-actively, improving the quality of recommendations at its first startup. diff --git a/cb_algorithm.md b/docs/cb_algorithm.md similarity index 100% rename from cb_algorithm.md rename to docs/cb_algorithm.md diff --git a/docs/commands.md b/docs/commands.md new file mode 100644 index 00000000..1ee78d74 --- /dev/null +++ b/docs/commands.md @@ -0,0 +1,57 @@ +# Commands + +Harness includes an Admin command line interface. It runs using the Harness REST interface and can be run remotely. + +## Conventions + +Internal to Harness are ***Engine Instances*** that implement some algorithm and contain datasets and configuration parameters. All input data is validated by the engine, and must be readable by the algorithm. The simple form of the workflow is: + + 1. start server + 2. add engine + 3. input data to the engine + 4. train (for Lambda; Kappa will auto train with each new input) + 5. query + +See the [Workflow](workflow.md) section for more detail. + +Harness uses resource-ids to identify all objects in the system. The Engine Instance must have a JSON file, which contains all parameters for Harness engine management including its Engine Instance resource-id as well as algorithm parameters that are specific to the Engine type. All Harness global configuration is stored in `harness-env`; see [Harness Config](harness_config.md) for details. + + - The file `` can be named anything and put anywhere. + - The working copy of all engine parameters and input data is actually in a shared database. Add or update an Engine Instance to change its configuration. Changing the file will not update the Engine Instance. See the `add` and `update` commands. + +# Harness Start and Stop + +Scripts that start and stop Harness are included with the project in the `sbin/` directory. These are used inside container startup and can be used directly in an OS level installation. + + - **`harness-start [-f]`** starts the harness server based on configuration in `harness-env`. The `-f` argument forces a restart if Harness is already running. All other commands require the service to be running; it is always started as a daemon/background process. All previously configured engines are started in the state they were in when harness was last run. + + - **`harness-stop`** gracefully stops harness and all engines. If the pid-file has become out of sync, look for the `HarnessServer` process with `jps -lm` or `ps aux | grep HarnessServer` and execute `kill ` to stop it. + +# Harness Administration + + - **`harnctl status [engines [], users []]`** prints status information about the objects requested. Asking for user status requires the Harness Auth-server, which is optional. + - **`harnctl add `** creates and starts an Engine Instance of the type defined by the `engineFactory` parameter. + - **`harnctl update `** updates an existing Engine Instance with values defined in `some-engine.json`. The Engine knows what is safe to update and may warn if some value is not updatable, but this will be rare. + - **`harnctl delete `** stops the Engine Instance and deletes its accumulated dataset and model. No artifacts of the Engine Instance will remain except the `some-engine.json` file and any mirrored events. + - **`harnctl import [ | ]`** replays previously mirrored events or loads bootstrap datasets created from application logs. It is equivalent to sending all imported events to the REST API. + - **`harnctl export [ | ]`** If the directory is supplied with the protocol "file:" the export will go to the Harness server host's file system. This is for use with vertically scaled Harness. For more general storage use HDFS (the Hadoop File System), flagged by the protocol `hdfs`, for example: `hdfs://some-hdfs-server:9000/users//`. [**to be implemented in 0.5.0**] + - **`harnctl train `** For Lambda style engines like the UR this will create or update a model. This is required for Lambda Engines before queries will return values.
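Putting these together, a typical first run of a new Engine Instance follows the workflow listed above. The sketch below is illustrative only: the engine-id `ecom_ur` and the file paths are hypothetical, and the argument forms are abbreviated in the command list above, so check the CLI's own help for the precise syntax.

```
# start the Harness server as a daemon, forcing a restart if it is already up
harness-start -f

# create and start an Engine Instance from its JSON config (engineFactory + params)
harnctl add /path/to/ecom_ur.json

# optionally bootstrap the new instance from previously mirrored or logged events
harnctl import ecom_ur /path/to/bootstrap-events/

# for Lambda style Engines such as the UR, build the model, then check status
harnctl train ecom_ur
harnctl status engines ecom_ur
```

After training, input and queries go to the Engine Instance's REST endpoints as described in the [REST Specification](rest_spec.md).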
+# Harness Auth-server Administration + +There are several extended commands that manage Users and Roles. These are only needed when using the Harness Auth-server to create secure multi-tenancy. Open multi-tenancy is the default and requires no Auth-server. + + - **`harnctl user-add [client | admin]`** Returns a new user-id and their secret. Grants the role's permissions. Client Users have access to one or more `engine-id`s; `admin` Users have access to all `engine-id`s as well as admin only commands and REST endpoints. + - **`harnctl user-delete `** removes all access for the `user-id`. + - **`harnctl grant [client | admin]`** adds permissions to an existing user. + - **`harnctl revoke [client | admin]`** removes permissions from an existing user. + +# Bootstrapping With Import + +Import can be used to restore backed-up data but also for bootstrapping a new Engine Instance with previously logged or collected batches of data. Imagine a recommender that takes in people's purchase history. This might exist in server logs, and converting these to files of JSON events is an easy and reproducible way to "bootstrap" your recommender with previous data before you start to send live events. This, in effect, trains your recommender retro-actively, improving the quality of recommendations at its first startup. + +# Backup with Export + +[**to be implemented in 0.5.0**] Lambda style Engines, which store all Events, usually support `harnctl export ...`. This command will create files with a single JSON Event per line in the same format as the [Mirror](mirroring.md) function. To back up an Engine Instance, use the export and store the files somewhere safe. These files can be re-imported for re-calculation of the input DB and, after training, the model. + +Engines that follow the Kappa style do not save input but rather update the model with every new input Event. So use [Mirroring](mirroring.md) to log each new Event. In a sense this is an automatic backup that can also be used to re-instantiate a Kappa style model. diff --git a/debugging_with_intellij.md b/docs/debugging_with_intellij.md similarity index 100% rename from debugging_with_intellij.md rename to docs/debugging_with_intellij.md diff --git a/design-philosophy-of-harness.md b/docs/design_philosophy_of_harness.md similarity index 99% rename from design-philosophy-of-harness.md rename to docs/design_philosophy_of_harness.md index 2b8171c5..2cdf5682 100644 --- a/design-philosophy-of-harness.md +++ b/docs/design_philosophy_of_harness.md @@ -27,7 +27,7 @@ An Engine is the API contract. This contract is embodied in the `core/engine` pa - **Engine**: The Engine class is the "controller" in the MVC use of the term. It takes all input, parses and validates it, then understands where to send it and how to report errors. It also fields queries processing them in much the same way. It creates Datasets and Algorithms and understand when to trigger functionality for them.
The basic API is defined but any additional workflow or object needed may be defined ad-hoc, where Harness does not provide sufficient support in some way. For instance a compute engine or storage method can be defined ad-hoc if Harness does not provide default sufficient default support. - **Algorithm**: The Algorithm understands where data is and implements the Machine Learning part of the system. It converts Datasets into Models and implements the Queries. The Engine controls initialization of the Algorithm with parameters and triggers training and queries. For kappa style learning the Engine triggers training on every input so the Algorithm may spin up Akka based Actors to perform continuous training. For Lambda the training may be periodic and performed less often and may use a compute engine like Spark or TensorFlow to perform this periodic training. There is nothing in the Harness that enforces the methodology or compute platform used but several are given default support to make their use easier. - **Dataset**: A Dataset may be comprised of event streams and mutable data structures in any combination. The state of these is always stored outside of Harness in a separate scalable sharable store. The Engine usually implements a `dal` or data access layer, which may have any number of backing stores in the basic DAL/DAO/DAOImplementation pattern. The full pattern is implemented in the `core/dal` for one type of object of common use (Users) and for one store (MongoDB). But the idea is easily borrowed and modified. For instance the Contextual Bandit and Navigation Hinting Engines extend this pattern for their own collections. If the Engine uses a store supported by Harness then it can inject the implementation so that Harness and all of its Engines can target different stores using config to define which is used for any installation of Harness. This uses ScalDI for lightweight dependency injection. - - **DAL**: The Data Access Layer is a very lightweight way of accessing some external persistence. The knowledge of how to use the store is contained in anything that uses is. The DAL contains an abstract API in the DAO, and implemented in a DAOImpl. There may be many DAOImpls for many stores, the default is MongoDB but others are under consideration. + - **DAL**: The Data Access Layer is a very lightweight way of accessing some external persistence. The knowledge of how to use the store is contained in anything that uses it. The DAL contains an abstract API in the DAO, which is implemented in a DAOImpl. There may be many DAOImpls for many stores; the default is MongoDB but others are possible. - **Administrator**: Each of the above Classes is extended by a specific Engine and the Administrator handles CRUD operations on the Engine instances, which in turn manage there constituent Algorithms and Datasets. The Administrator manages the state of the system through persistence in a scalable DB (default is MongoDB) as well as attaching the Engine to the input and query REST endpoints corresponding to the assigned resource-id for the Engine instance. - **Parameters**: are JSON descriptions of Engine instance initialization information. They may contain information for the Administrator, Engine, Algorithm, and Dataset. The first key part of information is the Engine ID, which is used as the unique resource-id in the REST API.
The admin CLI may take the JSON from some json file when creating a new engine or re-initializing one but internally the Parameters are stored persistently and only changed when an Engine instance is updated or re-created. After and Engine instance is added to Harness the params are read from the metadata in the shared DB. diff --git a/dev_process.md b/docs/dev_process.md similarity index 100% rename from dev_process.md rename to docs/dev_process.md diff --git a/engine_configuration.md b/docs/engine_configuration.md similarity index 100% rename from engine_configuration.md rename to docs/engine_configuration.md diff --git a/harness_config.md b/docs/harness_config.md similarity index 100% rename from harness_config.md rename to docs/harness_config.md diff --git a/harness-design-point.md b/docs/harness_design_point.md similarity index 100% rename from harness-design-point.md rename to docs/harness_design_point.md diff --git a/harness_json.md b/docs/harness_json.md similarity index 100% rename from harness_json.md rename to docs/harness_json.md diff --git a/harness-operations.md b/docs/harness_operations.md similarity index 100% rename from harness-operations.md rename to docs/harness_operations.md diff --git a/howto_input_and_query.md b/docs/howto_input_and_query.md similarity index 100% rename from howto_input_and_query.md rename to docs/howto_input_and_query.md diff --git a/images/default-bash-trigger.png b/docs/images/default-bash-trigger.png similarity index 100% rename from images/default-bash-trigger.png rename to docs/images/default-bash-trigger.png diff --git a/images/git-flow.png b/docs/images/git-flow.png similarity index 100% rename from images/git-flow.png rename to docs/images/git-flow.png diff --git a/images/half-life-equation.png b/docs/images/half-life-equation.png similarity index 100% rename from images/half-life-equation.png rename to docs/images/half-life-equation.png diff --git a/images/harness-bash-triggers.png b/docs/images/harness-bash-triggers.png similarity index 100% rename from images/harness-bash-triggers.png rename to docs/images/harness-bash-triggers.png diff --git a/images/harness-start.png b/docs/images/harness-start.png similarity index 100% rename from images/harness-start.png rename to docs/images/harness-start.png diff --git a/images/import-java-sdk-project.png b/docs/images/import-java-sdk-project.png similarity index 100% rename from images/import-java-sdk-project.png rename to docs/images/import-java-sdk-project.png diff --git a/images/send_events.png b/docs/images/send_events.png similarity index 100% rename from images/send_events.png rename to docs/images/send_events.png diff --git a/install.md b/docs/install.md similarity index 100% rename from install.md rename to docs/install.md diff --git a/kappa-learning.md b/docs/kappa_learning.md similarity index 100% rename from kappa-learning.md rename to docs/kappa_learning.md diff --git a/docs/mirroring.md b/docs/mirroring.md new file mode 100644 index 00000000..5237579b --- /dev/null +++ b/docs/mirroring.md @@ -0,0 +1,15 @@ +# Input Mirroring + +Harness will mirror (log) all raw events with no validation when configured to do so for a specific Engine Instance. For online learning like Kappa style Engines this is the only way to back up data since Events are not stored. For Lambda style learning, like the Universal Recommender, you may choose to mirror or export data periodically.
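For illustration only, a mirrored file is plain newline-delimited JSON, one raw Event per line exactly as it was received. The field names below are just a sketch in the style of UR indicator events; the actual Event format is defined by each Engine (see its input docs, for example [UR Input](ur_input.md)):

```
{"event":"buy","entityType":"user","entityId":"u-1","targetEntityType":"item","targetEntityId":"i-42","eventTime":"2019-10-05T21:02:49.228Z"}
{"event":"view","entityType":"user","entityId":"u-1","targetEntityType":"item","targetEntityId":"i-7","eventTime":"2019-10-05T21:03:11.115Z"}
```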
+ +Mirroring is useful if you want to be able to backup/restore all data or are experimenting with changes in engine parameters and wish to recreate the models using past mirrored data. + +To accomplish this, you must set up mirroring for the Engine Instance. Once the Engine Instance is launched with a mirrored configuration, all events sent to `POST /engines//events` will be mirrored to a location set in `some-engine.json`. **Note**: Events will be mirrored until the config setting is changed and so can grow without limit, like unrotated server logs. + +To enable mirroring, add the following to the `some-engine.json` for the Engine Instance you want to mirror: + + "mirrorType": "localfs" | "hdfs", // optional, turn on a type of mirroring + "mirrorContainer": "path/to/mirror", // optional, where to mirror input + +Mirroring is similar to logging. Each new Event is logged to a file before any validation. The format is JSON, one event per line. This can be used to back up an Engine Instance or to move data to a new instance. + diff --git a/deploying_a_wip.md b/docs/misc/deploying_a_wip.md similarity index 100% rename from deploying_a_wip.md rename to docs/misc/deploying_a_wip.md diff --git a/personalized-nav-hinting.md b/docs/personalized_nav_hinting.md similarity index 100% rename from personalized-nav-hinting.md rename to docs/personalized_nav_hinting.md diff --git a/rest_spec.md b/docs/rest_spec.md similarity index 54% rename from rest_spec.md rename to docs/rest_spec.md index 5ef488f8..36837b1b 100644 --- a/rest_spec.md +++ b/docs/rest_spec.md @@ -1,12 +1,12 @@ # The Harness REST Specification -REST stands for [REpresentational State Transfer](https://en.wikipedia.org/wiki/Representational_state_transfer) and is a method for identifying resources and operations for be preformed on them by combining URIs with HTTP/HTTPS verbs. For instance an HTTP POST corresponds to the C in CRUD (**C**reate, **U**pdate, **R**ead, **D**elete). So by combining the HTTP verb with a resource identifying URI most desired operations can be constructed. +REST stands for [REpresentational State Transfer](https://en.wikipedia.org/wiki/Representational_state_transfer) and is a method for identifying resources and operations to be performed on them by combining URIs with HTTP/HTTPS verbs. For instance an HTTP POST corresponds to the C and U in CRUD (**C**reate, **R**ead, **U**pdate, **D**elete). So by combining the HTTP verb with a resource identifying URI most desired operations can be constructed. # Harness REST -From the outside Harness looks like a single server that fields all REST APIs, but behind this are serval more heavy-weight services (like databases or compute engines). In cases where Harness needs to define a service we use a ***microservices*** architecture, meaning the service is itself called via HTTP and REST APIs and encapculates some clear function, like the Harness Authentication Server. All of these Services and Microservices are invisible to the outside and only used by Harness as a byproduct of performing some Harness REST API. +From the outside Harness looks like a single server that fields all REST APIs, but behind this are several more heavy-weight services (like databases or compute engines). In cases where Harness needs to define a service we use a ***microservices*** architecture, meaning the service is itself invoked via HTTP and REST APIs and encapsulates some clear function, like the Harness Auth-server.
All of these Services and Microservices are invisible to the outside and only used by Harness to accomplish some task. -The Harness CLI are implemented in Python as calls to the REST API so this separation of client to Harness Server is absolute. See [Commands](commands.md) for more about the CLI. +The Harness CLI (`harnessctl`) is implemented in Python as calls to the REST API so this separation of client from Harness Server is absolute. See [Commands](commands.md) for more about the CLI. # Harness HTTP Response Codes @@ -30,24 +30,56 @@ The Harness CLI are implemented in Python as calls to the REST API so this separ # Harness ML/AI REST -Most of the Harness API deals with Engines, which have sub-resources: Events and Queries. There are also some REST -APIs that modify the REST with params to create long lived "Jobs" for importing or training. +Most of the Harness API deals with Engines, which have sub-resources: Events, Queries, Jobs & others. There are also some REST +APIs that are administrative in nature. + +# JSON + +Harness uses JSON for all POST and response bodies. The format of these bodies is under the control of the specific Engines with some information layered on by Harness itself. See Harness Responses for admin type APIs and the specific Engine docs for Responses made by Engine Instances (like query responses, or status). + +Harness enforces certain conventions; for instance, all JSON types are allowed but are validated for their specific use case. Dates are supported as ISO8601 Strings as defined by Java's [`ISO_DATE_FORMAT`](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_DATE_TIME) for input strings. This includes most common formats with time zones, and what is called the "Zulu" format. + +Some JSON follows extended conventions to allow values to be retrieved from the server `env` variables. Specifically Engine Instance configurations may need values from `env`. See [Harness Json](harness_json.md) for the specification. ## Requests -REST defines the nature of the request and for those that have a body, it is in JSON and must be formatted to fit the spec in each Engine for "Events", "Queries", and the Engine Config. +REST defines the nature of the request and for those that have a body, it is in JSON and must be formatted to fit the spec in each Engine type or Harness, depending on the target. See the Engine docs for Request formats. For administrative REST see Harness docs for Request formats. + +## Responses + +These are optional where the Response code says all that is required, but they may also include information, parsable as JSON, that gives further detail. For instance a `DELETE /engines/engine-id` may return an error HTTP code AND a JSON comment with a human readable error description. This is also an example of an administrative REST call. + +# Harness REST Types + +The Resource-id or type in the REST API follows this model, where Harness owns all resources and each level in the ownership defines what collection maintains resource management. + +![](https://docs.google.com/drawings/d/e/2PACX-1vToTQAtggzYIupQMN6emdlKyqmtXSv1DSM-ZMl2hiAxzxLNAXy3vXCSDrnGoWYZD_YXr2DOc6GIQ6Tg/pub?w=915&h=1007) + +Whether the resource definition is sent programmatically or comes from a file (as with an engine's config and params) the actual resource is persisted in a shared store like a DB. Harness itself is stateless. + +## Engines + +Each engine-id is used to reference an Engine Instance defined by configuration JSON.
This is defined by each Engine type but must always contain an id and a factory class to call in creating an Engine Instance. There are other generic params defined in [Harness Config](harness_config.md) but many are Engine specific Algorithm Parameters, which are defined by the Engine (the [Universal Recommender](ur_configuration.md) for instance). + +## Events + +Events encapsulate all input to Engines and are defined by the Engine (the [Universal Recommender](ur_input.md) for example). -## Responces ## Queries -These are optional where the Response code says all that is required but may also include information, parsable as JSON, that give further information. For instance a `DELETE /engines/engine-id` may return an error HTTP code AND a JSON comment with a human readable error description. +Queries encapsulate all queries to Engines and are defined by the Engine (the [Universal Recommender](ur_queries.md) for example). -## User Roles ## Jobs -Users +A Job is created to perform some task that may be long lived. For instance, to `train` an Engine Instance may take hours. In this case the request creates the job and returns the job-id but this does not mean it is finished. Further status queries for the Engine Instance will report Job status(es). -For user with a "client" role, POSTing to `/engines//events` and `/engines//queries` will be the primary endpoints of interest. These endpoints are all that is needed to send data and make queries. "client" users are given permission to read and modify one or more engine-ids but the remaining API will block access due to lack of permission. +For example `POST /engines//jobs` will cause the Universal Recommender to queue up a training Job to be executed on Spark. This POST will have a response with a job-id. This can be used to monitor progress by successive `GET /engines/`, which return job-ids and their status. If it is necessary to abort or cancel a job, just execute `DELETE /engines/`. These are more easily performed using the `harnessctl` CLI. -The rest of the REST APIs is more properly thought of being used by the "admin" user role. "admin"s have access to all of the REST API. +## Users + +Users can be created (with the help of the Auth-server) for roles of "client" or "admin". Client Users have CRU access to one or more Engine Instances (only an admin can delete). An admin User has access to all resources and all CRUD. Users are only needed when using the Auth-server's Authentication and Authorization to control access to resources. + +# REST API | HTTP Verb | URL | Request Body | Response Code | Response Body | Function | | --- | --- | :--- | :--- | :--- | :--- | @@ -59,73 +91,32 @@ The rest of the REST APIs is more properly thought of being used by the "admin" | GET | `/engines/` | none | See Item responses | JSON status information about the Engine and sub-resources | Reports Engine status | | POST | `/engines//events` | none | See Collection responses | JSON event formulated as defined in the Engine docs | Creates an event but may not report its ID since the Event may not be persisted, only used in the algorithm. | | POST | `/engines//queries` | none | See Collection responses | JSON query formulated as defined in the Engine docs | Creates a query and returns the result(s) as defined in the Engine docs | -| POST | `/engines//events` | none | See Collection responses | JSON event formulated as defined in the Engine docs | Creates an event but may not report its ID since the Event may not be persisted, only used in the algorithm.
| -| POST | `/engines//queries` | none | See Collection responses | JSON query formulated as defined in the Engine docs | Creates a query and returns the result(s) as defined in the Engine docs | | POST | `/engines//imports?import_path=` | none | 202 "Accepted" or Collection error responses | none |The parameter tells Harness where to import from, see the `harness import` command for the file format | | POST | `/engines//configs` | none | See Collection responses | new config to replace the one the Engine is using | Updates different params per Engine type, see Engine docs for details | - -# Harness *Lambda* Admin APIs (Harness-0.3.0) - -Lambda style batch or background learners require not only setup but batch training. So some additional commands are needed and planned for a future release of Harness: - -| HTTP Verb | URL | Request Body | Response Code | Response Body | Function | -| --- | --- | :--- | :--- | :--- | :--- | | POST | `/engines//jobs` | JSON params for batch training if defined by the Engine | See Item responses | 202 or collection error responses | Used to start a batch training operation for an engine. Supplies any needed identifiers for input and training defined by the Engine | -# JSON - -Harness uses JSON for all POST and response bodies. The format of these bodies are under the control of the Specific Engines with some information layered on by Harness itself. - -Harness enforces certain conventions too. For instance all JSON types are allowed but are validated for their specific use case. Dates are supported as Strings of a common subset of ISO8601. Harness is a JVM (Java Virtual Machine) process and so supports the [Java ISO_DATE_FORMAT](https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html#ISO_DATE_TIME) for input strings. This includes most common formats with and without time zones, and what is called the "Zulu" format. - -## Responses - -Several of the APIs return information beyond the Response Code. - - - GET `/engines/`. This corresponds to the CLI `harness engines status ` The response is defined by the Engine to contain configuration and other status information. Performing `harness engines status ` The CLI will display the JSON format for the Engine type. See the Engine docs for specific response bodies. - - ``` - { - "comment": "some general human readable informational message", - ... - "jobId": "id of job", - "jobStatus": "queued" - } - ``` - - - - **Example**: POST `/engines//jobs` and POST `/engines//imports?`. These corresponds to the CLI `harness train ` and `harness import ` With a Response Code of 202 expect a report of the form: - - ``` - { - "comment": "some general human readable informational message", - "jobId": "id of job", - "jobStatus": "executing" | "queued" - } - ``` - # Harness User And Permission APIs -These APIs allow the admin user to create new users granting access to certain resource types. +These APIs allow the admin user to create new users granting access to certain resource types. Read no further if you do not use the Auth-server. -These APIs act as a thin proxy for communication with the Auth-Server. They are provided as endpoints on the main Harness rest-server for simplicity but are actually implemented in the Auth-Server. Consider these as the public APIs for the Auth-Server. They manage "Users" and "Permissions". The Private part of the Auth-Server deals only with authorization requests and is in the next section. +These APIs act as a thin proxy for communication with the Auth-Server. 
They are provided as endpoints on the main Harness rest-server for simplicity but are actually implemented in the Auth-Server microservice. They manage Users by roles and resource-ids. | HTTP Verb | URL | Request Body | Response Code | Response Body | Function | | --- | --- | :--- | :--- | :--- | :--- | -| POST | `/users` | `{"roleSetId": "client\|admin", "resourceId": "*\|some id"}` | See Collection responses | `{"userId": "user_id", “bearerToken”: "token"}` | Create a new user and assign a bearer token and user-id, setup internal management of the user-id that does not require saving the bearer token and attached the named `roleSet` for the `resource-id` to the new user | +| POST | `/users` | `{"roleSetId": "client"\|"admin", "resourceId": "some id"}` | See Collection responses | `{"userId": "user_id", "secret": "token"}` | Create a new user and generate a secret and user-id, set up internal management of the user-id that does not require saving the secret | | GET | `/users?offset=0&limit=5` | none | see Collection responses | `[{"userId": "user-id", "roleSetId": "client \| admin", "engines": ["engine-id-1", "engine-id-2", ...]}, ...]` | List all users, roles, and resources they have access to | | DELETE | `/users/user-id` | none | see Item responses | `{"userId": "user-id"}` | Delete the User and return their user-id with success. | -| GET | `/users/user-id` | none | see Item responses | `{"userId": "user-id", "roleSetId": "client \| admin", "engines": ["engine-id-1", "engine-id-2", ...]}` | List the user's Engines by ID along with the role set they have and potentially other info about the user. | -| POST | `/users/user-id/permissions` | `{"userId": "user-id", "roleSetId": "client\|admin","resourceId": "*\|"}` | See Collection responses | | Grant named `roleSet` for the `resource-id` to the user with `user-id` | -| DELETE | `/users/user-id/permissions/permission-id` | `{"roleSetId": "client\|admin", "resourceId": "*\|"}` | See Item responses | `{"userId": "user_id", "roleSetId": "client\|admin", "resourceId": "*\|" }` | Removes a specific permission from a user | +| GET | `/users/user-id` | none | see Item responses | `{"userId": "user-id", "roleSetId": "client" \| "admin", "engines": ["engine-id-1", "engine-id-2", ...]}` | List the user's Engines by ID along with the role set they have and potentially other info about the user. | +| POST | `/users/user-id/permissions` | `{"userId": "user-id", "roleSetId": "client\|admin","resourceId": "\|"}` | See Collection responses | | Grant named `roleSet` for the `resource-id` to the user with `user-id` | +| DELETE | `/users/user-id/permissions/permission-id` | `{"roleSetId": "client\|admin", "resourceId": ""}` | See Item responses | `{"userId": "user_id", "roleSetId": "client\|admin", "resourceId": "" }` | Removes a specific permission from a user | # Auth-Server API (Private) -This API is private and used only by Harness to manages Users and Permissions. It is expected that these resources will be accessed through the Harness API, which will in turn use this API. +This API is private and used only by Harness to manage Users and Permissions. It is expected that these resources will be accessed through the Harness API, which will in turn use this API. + +The Auth-Server is a microservice that Harness uses to manage `User` and `Permission` resources. Any holder of a "secret" is a `User` and the `User` may have many permissions, which are the routes and resources they are authorized to access.
-The Auth-Server is secured with connection level security no TLS or Auth itself is required and no SDK is provided since only the Harness Rest-Server needs to access it directly. +The Auth-Server is secured with connection-level security; no TLS or Auth itself is used. It is expected that the Auth-server runs in tandem with Harness. | HTTP Verb | URL | Request Body | Response Code | Response Body | Function | | --- | --- | :--- | :--- | :--- | :--- | diff --git a/security.md b/docs/security.md similarity index 100% rename from security.md rename to docs/security.md diff --git a/the_contextual_bandit.md b/docs/the_contextual_bandit.md similarity index 100% rename from the_contextual_bandit.md rename to docs/the_contextual_bandit.md diff --git a/docs/ur_comlimentary_items.md b/docs/ur_comlimentary_items.md new file mode 100644 index 00000000..026eed3b --- /dev/null +++ b/docs/ur_comlimentary_items.md @@ -0,0 +1,36 @@ +# Recommendations for Complementary Items + +The Universal Recommender typically takes input that allows personalized recommendations to be made and also supports item-based and item-set based recs. This usage can be seen in the fact that ALL indicators/events have a user-id that joins the data. It is used to model user behavior. Item and item-set similarity is in terms of the user behavior. + +But what if you want to find missing items in a set? This is not a question of user preference so much as what things typically go together. In Machine Learning this is often called ***Complementary Items*** since the items complement each other. + +Let's look at a "Complete" group of items that complement each other: + +## Complete Complementary Item Group + +![](images/complete-item-set.png) + +Imagine that the data shows these are most commonly purchased together. Now take one away: + +## Shopping Cart in Progress + +![](images/incomplete-item-set.png) + +If these were inside a user's shopping cart the question is "what is missing?" The best answer comes from Complementary Items analysis. + +Fortunately the data for this is often easy to obtain if we have personalized recommendation data. All we need to do is identify the groups, which is often done with shopping-cart-ids. The Universal Recommender can identify items that form groups and the items that are most likely missing from those groups. + +The data for personalized recommendations that is input into the UR is basically (user-id, "conversion-indicator", item-id). This input comes into Harness and to the UR encoded in indicator events. If we substitute a group-id, like a shopping-cart-id, in place of a user-id we have all we need to create a ***Complementary Items*** model. Then as the user fills up a shopping cart we can answer the question: "what is missing?". In the incomplete group above the answer should be a "drill bit set". + +To accomplish this we need a new dataset and model. We create a new Engine Instance for the UR and input one indicator for "add-to-cart" which encodes the data (group-id, "add-to-cart", item-id). This is sent as input to the UR for Complementary Items (not the one for Personalized Recommendations). Once we train the Engine Instance we make an item-set query with the item-ids currently in the user's shopping cart. + +The result will be a list of items most likely to be purchased along with these current contents. In the example above the "drill bit set" will be returned.
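As an illustrative sketch (the ids are hypothetical, and [UR Input](ur_input.md) and [UR Queries](ur_queries.md) remain the authoritative description of the Event and Query formats), the (group-id, "add-to-cart", item-id) encoding puts the cart-id where the user-id would normally go, and the query sends the in-progress cart as an item-set:

```
// indicator events sent to the Engine Instance's events endpoint;
// "cart-123" is a shopping-cart-id standing in for a user-id
{"event": "add-to-cart", "entityType": "user", "entityId": "cart-123", "targetEntityType": "item", "targetEntityId": "cordless-drill"}
{"event": "add-to-cart", "entityType": "user", "entityId": "cart-123", "targetEntityType": "item", "targetEntityId": "drill-bit-set"}

// item-set query with the current contents of an in-progress cart
{"itemSet": ["cordless-drill"]}
```

Given enough carts like "cart-123" in the training data, the response to a query like the one above should rank "drill-bit-set" highly as the likely missing item.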
***Complementary Items*** has uses beyond ECom shopping carts but should not be confused with ***Similar Items***. The item-set based queries of the UR's personalized model return items that have the same user behavior attached to them. Put another way, the items are similar in terms of user behavior. When item-set based queries are made on the UR's Personalized Recommendations model, you will get very different results. + +![](images/similar-item-set.png) + +Both of these results may have a place in your application, but be aware that how an item-set query works depends on the model it is executed against. + + - **Complementary Items**: use a model built from input based on grouped items with group-ids rather than user-ids + - **Similar Items**: use the same model built for personalized recommendations with user-ids in the input data \ No newline at end of file diff --git a/docs/ur_configuration.md b/docs/ur_configuration.md new file mode 100644 index 00000000..2df8ca1b --- /dev/null +++ b/docs/ur_configuration.md @@ -0,0 +1,237 @@ +# Configuration + +Engines in Harness follow a pattern that defines defaults for many parameters then allows you to override them in the Engine's JSON config. If further refinement makes sense it is done in the Query. + +For instance, the default number of results returned is 20; this can be overridden in the UR config JSON, which can later be overridden in any query. + +Business Rules can also be specified in the Engine's config or in the query. The use case here might be to only include items where `"available": "true"`, and this should be used in every query unless the Query overrides it or adds new rules. + +## Configuration Sections + +The UR Configuration is written in [Harness JSON](harness_json.md) (JSON extended to allow substitution of values with data from environment variables) and divided into sections for: + + - **Engine key-value pairs** The settings outside of a named section that are required or may be used in any engine. + - **`dataset`** params that apply to input data encoded as events + - **`algorithm`** params that control the behavior of the UR algorithm, known as Correlated Cross-Occurrence (CCO). The Algorithm section also can hold default Query parameters to be used with all Queries unless overridden in a specific Query. + - **`sparkConf`** params are passed into the Spark Job. These are needed because Spark jobs often require settings to be passed in to Spark Workers via a data structure called `sparkConf`. For instance the Elasticsearch library that writes a Spark RDD to ES needs several settings that it gets from the `sparkConf`. This section is the most likely place to put extended JSON that reads from `env`. + +## Simplest UR Configuration + +Imagine an ECom version of the UR that only watches for "buys" and product detail "views". To be sure there are many other ways to use a recommender but this is a good, simple example. + +We will make heavy use of default settings that have been chosen in the Universal Recommender code and only set required config and parameters.
+ +``` +{ + "engineId": "ecom_ur", + "engineFactory": "com.actionml.engines.ur.UREngine", + "sparkConf": { + "spark.serializer": "org.apache.spark.serializer.KryoSerializer", + "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator", + "spark.kryo.referenceTracking": "false", + "spark.kryoserializer.buffer": "300m", + "spark.executor.memory": "20g", + "spark.driver.memory": "10g", + "spark.es.index.auto.create": "true", + "spark.es.nodes": "elasticsearch-host", + "spark.es.nodes.wan.only": "true" + }, + "algorithm":{ + "indicators": [ + { + "name": "buy" + },{ + "name": "view" + } + ] + } +} +``` + +Here we are telling Harness how to create a UR instance and telling the UR Instance what types of input to expect. **NOTE**: the first indicator is the ***primary*** one; when it comes in as an input Event it has the item-ids that will be recommended. The secondary indicator will also come in as an input Event and will make the UR more predictive since it gives more information about user preferences. Secondary indicators do not have to come in with the same item-ids as the primary so maybe it is easier to send a page-id than a product-id (sent with the "buy" Events). The secondary indicator will be just as helpful. + +Depending on the size of your data this config might work just fine for an ECom application, and if the dataset size grows too large we just increase the memory given to Spark. + +*It is highly recommended that you start with this type of config before tuning the numerous values that may (or may not) yield better results.* + +## Complete UR Engine Configuration Specification + +How to read config settings: + + - "\" replace with your value + - "this" \| "that" use "this" OR "that" + - if no annotation is present the value must be set exactly as shown + - keys should always be used exactly as quoted + - most settings can be omitted if default values are sufficient; see Default UR Settings. + + +``` +{ + "engineId": "", + "engineFactory": "com.actionml.engines.ur.UREngine", + "modelContainer": "", + "mirrorType": "localfs" | "hdfs", + "mirrorLocation": "", + "dataset": { + "ttl": "<365 days>", + }, + "sparkConf": { + "spark.serializer": "org.apache.spark.serializer.KryoSerializer", + "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator", + "spark.kryo.referenceTracking": "false", + "spark.kryoserializer.buffer": "300m", + "spark.executor.memory": "<4g>", + "spark.es.index.auto.create": "true", + "spark.es.nodes": ",", + "spark.es.nodes.wan.only": "true" + }, + "algorithm":{ + "indicators": [ + { + "name": "", + "maxCorrelatorsPerItem": ", + "minLLR": , + "maxIndicatorsPerQuery": + }, + ... + ], + "blacklistEvents": ["", "", "", ""], + "maxEventsPerEventType": , + "maxCorrelatorsPerEventType": "", + "maxQueryEvents": , + "num": , + "seed": , + "recsModel": "all" | "collabFiltering" | "backfill", + "expireDateName": "", + "availableDateName": "", + "dateName": "", + "userbias": <-maxFloat..maxFloat>, + "itembias": <-maxFloat..maxFloat>, + "returnSelf": true | false, + "rules": [ + { + "name": "", + "values": ["value1", ...], + "bias": -maxFloat..maxFloat + }, + ...
+ ], + "numESWriteConnections": 100 + } +} +``` + +## Default UR Settings + + - REQUIRED: the value must be set + - NONE: the value defaults to no setting, which tells the UR to not use the setting + - RANDOM: chosen randomly + +``` +{ + "engineId": REQUIRED, + "engineFactory": "com.actionml.engines.ur.UREngine", + "modelContainer": NONE, + "mirrorType": NONE, + "mirrorLocation": NONE, + "dataset": { + "ttl": "365 days" + }, + "sparkConf": { + "spark.serializer": "org.apache.spark.serializer.KryoSerializer", + "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator", + "spark.kryo.referenceTracking": "false", + "spark.kryoserializer.buffer": "300m", + "spark.executor.memory": REQUIRED, + "spark.driver.memory": REQUIRED, + "spark.es.index.auto.create": "true", + "spark.es.nodes": "localhost", + "spark.es.nodes.wan.only": "true" + }, + "algorithm":{ + "indicators": [ + { + "name": ONE OR MORE REQUIRED, + "maxCorrelatorsPerItem": 50, + "minLLR": NONE, + "maxIndicatorsPerQuery": 100 + }, + ... + ], + "blacklistEvents": ["primary-indicator-name"], + "maxEventsPerEventType": 500, + "maxCorrelatorsPerEventType": 50, + "maxQueryEvents": 100, + "num": 20, + "seed": RANDOM, + "recsModel": "all", + "expireDateName": NONE, + "availableDateName": NONE, + "dateName": NONE, + "userbias": NONE, + "itembias": NONE, + "returnSelf": false, + "rules": [ NONE ], + "numESWriteConnections": NONE + } +} +``` + +## Dataset Parameters + +The `"dataset"` section controls how long to keep input data. + + - **ttl**: this takes a String value that describes the length of time before an indicator Event is dropped from the DB. This only affects indicators; `$set` type events (non-indicator reserved Events) change mutable objects in the DB and so do not accumulate. The `ttl` stands for "time-to-live". Optional, default "365 days". + + +## Algorithm Parameters + +The `"algorithm"` section controls most of the tuning and config of the UR. Possible values are: + + * **indicators**: required. An array of string identifiers describing Events recorded for users; things like "buy", "watch", "add-to-cart", "search-terms", even "location" or "device" can be considered indicators of user preference. + + The first indicator is considered the ***primary*** indicator, because it **must** exist in the data and is considered the strongest indication of user preference for items; the others enrich the UR's understanding of user preferences. Secondary indicators/Events may or may not have item-ids that correspond to the items to be recommended (ids that come with the primary indicator), so they are allowed to be things like category-ids, search-terms, device-ids, location-ids... For example: a "category-pref" indicator would have a category-id as the target entity id but a "buy" would have a product-id as the target entity id (see UR Events). Both work fine as long as all indicator events are tied to users. + * **name**: required. Name for the indicator Event. + * **maxCorrelatorsPerItem**: optional, default: 50. Number of correlated items per recommended item. This is set to give best results for the indicator type and is often set to less than 50 (the default value) if the number of different ids for this event type is small.
For example if the indicator is "gender" and we only count 2 possible genders there will only be 2 possible ids, "M" and "F", so the UR will perform better if `maxCorrelatorsPerItem` is set to 1, which would find THE gender that best correlates with a primary event (for instance a "buy"). Without this setting the default of 50 would apply, meaning to take the top 50 gender ids that correlate with the primary indicator/conversion item. With enough data you will get all genders to correlate, meaning none could differentiate recommendations, in turn meaning the indicator is providing no value. Taking 1 correlator would force the UR to choose which is more highly correlated instead of taking up to 50 of the highest. + + A better approach is to use `minLLR` to create a correlation threshold but this is more difficult to tune. + * **maxIndicatorsPerQuery**: optional (use with great care), default: 100. Amount of the most recent user history to use in recommendation model queries. Making this smaller than the default may capture more recent user preferences but may lose longer lived preferences. + * **minLLR**: optional, default: NONE. This is not used by default and is here when an LLR score is desired as the minimum threshold. Since LLR scores will be higher for better correlation this can be set to ensure the highest quality correlators are the only ones used. This will increase precision of recommendations but may decrease recall, meaning you will get better recommendations but fewer of them. Increasing this may affect results negatively so always A/B test any tweaking of this value. There is no default; we keep `maxCorrelatorsPerItem` of the highest scores by default—no matter the score. A rule of thumb would say to use something like 5 for a typical high quality ecom dataset. +* **maxQueryEvents**: optional (use with great care), default: 100. An integer specifying the number of most recent user history events used to make recommendations for an individual. More implies some will be less recent actions. Theoretically using the right number will capture the user's current interests. This global value is overridden if specified by the indicator. +* **num**: optional, default: 20. An integer telling the engine the maximum number of recommendations to return per query, but fewer may be returned if the query produces fewer results or post-recommendation filters like blacklists remove some. +* **blacklistIndicators**: optional, default: the primary indicator. An array of strings corresponding to indicator names. If a user has history of any of these indicators and if the indicator has an item-id from the same items as the primary indicator then the item will not be recommended. This is used when trying to avoid recommending items that the user has seen or already converted on. In ECom this might mean: "do not recommend items the user 'buys' or 'views'". The default is to not recommend conversion items. If you want to recommend items the user has interacted with before, things they have bought for example, then set this value to an empty array: `[]`. This will signal that no history should cause an item to be blacklisted from recommendations. +* **rules**: optional, default: NONE. An array of Business Rules as defined for Queries (see [UR Queries](ur_queries.md)). These act as defaults for every query and can be added to in any query. This is useful when you want to check something like `"instock": "true"` for every query but may add other rules at query time.
+* **userBias**: optional (experimental), default: NONE. Amount to favor user history in creating recommendations that also have an item or item-set in the query. 1 is neutral, fractional is de-boosting, greater than 1 is boosting. +* **itemBias**: optional (experimental), default: NONE. Amount to favor item information in creating recommendations that have a user or an item-set in the query. 1 is neutral, fractional is de-boosting, greater than 1 is boosting. +* **itemSetBias**: optional (experimental), default: NONE. Amount to favor item-set information in creating recommendations that have a user or item in the query. 1 is neutral, fractional is de-boosting, greater than 1 is boosting. + + **Note**: ***biases*** are often not the best way to mix recommendations based on user history and item or item-set similarity. There is no way, when using a mix of examples in queries, to control how many recommendations are based on each example (user, item, or item-set). Therefore it is suggested that several queries are made and results mixed as desired by the application. However there are special cases where the use of multiple examples might be beneficial. +* **expireDateName** optional, default: NONE. The name of the item property field that contains the date an item expires or is unavailable to recommend. +* **availableDateName** optional, default: NONE. The name of the item property field that contains the date the item is available to recommend. +* **dateName** optional, default: NONE. The name of the item property field that contains a date or timestamp to be used in a `dateRange` query clause. +* **returnSelf**: optional, default: false. Boolean flagging the fact that the item example in the query is a valid result. The default is to never return the example item or one of an item-set in a query result, which is by far the typical case. Where items may be periodically recommended, as with consumables (food?), it is usually better to mix these into recommendations based on an application algorithm rather than use the recommender to return them. For instance food items that are popular for a specific user might be added to recommendations or put in some special placement outside of recommender results. +* **recsModel** optional, default: "all", which means collaborative filtering with popular items or another ranking method returned IF no other recommendations can be made. If only "backfill" is specified then only some backfill or ranking type like "popular" will be returned. If only "collabFiltering" then no backfill will be included when there are not enough recommended items. +* **rankings** optional, the default is to use only `"type": "popular"` counting all primary events. This parameter, when specified, is a list of ranking methods used to rank items as fill-in when not enough recommendations can be returned using the CCO algorithm. Popular items usually get the best results and so are the default. It is sometimes useful to be able to return any item, even if it does not have events (popular would not return these), so we allow random ranking as a method to return items. There may also be a user defined way to rank items so this is also supported. + + This parameter is a list of ranking methods that work in the order specified. For instance if popular is first and it cannot return enough items the next method in the list will be used—perhaps random. Random is always able to return all items defined so it should be last in the list.
+ + When the `"type"` is **"popular", "trending", or "hot"** this set of parameters defines the calculation of the popularity model that ranks all items by their events in one of three different ways corresponding to: event counts (popular), change in event counts over time (trending), and change in trending over time (hot). + + When the `"type"` is **"random"** all items are ranked randomly regardless of any usage events. This is useful if some items have no events but you want to present them to users given no other method to recommend. + + When the `"type"` is **"userDefined"** the property defined in `"name"` is expected to rank any items that you wish to use as backfill. This may be useful, for instance, if you wish to show promoted items when no other method to recommend is possible. + + In all cases the property value defined by `"name"` must be given a unique float value. For `"popular"`, `"trending"`, `"hot"`, and `"random"` the value is calculated by the UR. For `"userDefined"` the value is set using a `$set` event like any other property. See "Property Change Events" [here](ur_input.md). + + - **name** gives the field a name in the model; defaults to those mentioned above in the JSON. + - **type** `"popular"`, `"trending"`, or `"hot"` can be defined and use event counts per item; one of these can be used with `"userDefined"` and/or `"random"`. `"popular"`, `"trending"`, and `"hot"` use event counts that are just counts, change in event counts, or change in "trending" values. + + **Note**: when using "hot" the algorithm divides the events into three periods and since events tend to be cyclical by day, 3 days will produce results mostly free of daily effects. Making this time period smaller may cause odd effects. Popular is not split and trending splits the events in two. So choose the duration accordingly. + + These each add a rank value to items in the model that is used if collaborative filtering recommendations cannot be made. Since they rank all items they also obey filters, boosts, and business rules as any CF recommendation would. For example setting rankings allows CF to be preferred, then "popular", then "random", falling back in the order they are defined. + - **indicatorNames** this is allowed only with one of the popularity types and is an array of indicator/Event names to use in calculating the popularity model; this defaults to the primary/conversion Event—the first in the `algorithm.indicators` list. + - **duration** this is allowed only with one of the popularity types and is a duration like "3 days" (which is the default), which defines the time from now back to the last event to count in the popularity calculation. +* **numESWriteConnections**: optional, default = number of threads in the entire Spark Cluster, which may overload Elasticsearch when writing the trained model to it. + + If you see task failures, even if retries cause no Job failure, this setting will help remove the errors by throttling the write operation to ES. The other option is to add to / scale out your ES cluster, because throttling will slow the Spark cluster down by reducing the number of tasks used to write to ES even as it removes the errors. The rule of thumb for this setting is (numberOfNodesHostingPrimaries * bulkRequestQueueLength) * 0.75. In general this is (numberOfESCores * 50) * 0.75, where 50 comes from the Elasticsearch bulk queue default. +* **seed** optional, default: random. Set this if you want repeatable downsampling for some offline tests.
This can be ignored and shouldn't be set in production. diff --git a/ur_input.md b/docs/ur_input.md similarity index 100% rename from ur_input.md rename to docs/ur_input.md diff --git a/ur_queries.md b/docs/ur_queries.md similarity index 100% rename from ur_queries.md rename to docs/ur_queries.md diff --git a/ur_simple_usage.md b/docs/ur_simple_usage.md similarity index 100% rename from ur_simple_usage.md rename to docs/ur_simple_usage.md diff --git a/users_and_roles.md b/docs/users_and_roles.md similarity index 100% rename from users_and_roles.md rename to docs/users_and_roles.md diff --git a/versions.md b/docs/versions.md similarity index 91% rename from versions.md rename to docs/versions.md index e2037bff..d9dfa036 100644 --- a/versions.md +++ b/docs/versions.md @@ -2,9 +2,18 @@ Harness is a complete end-to-end Machine Learning and Artificial Intelligence server in early maturity. Meaning all minimum viable product features are included and tested in production deployments. It includes several Engines including, most notably, The Universal Recommender. It is built to allow custom Engines employing flexible learning styles. +## 0.5.0-SNAPSHOT (work in progress) + + - UR-0.9.0 + - Elasticsearch 6.x and 7 support + - Export implemented for the UR + - Faster Import using Spark and HDFS + - Extended JSON for Engine Instance Config files, which pulls values from env + - ... + ## 0.4.0 (current stable) -- Add the Universal Recommender ported from PredictionIO +- Add the Universal Recommender (UR-0.8.0) ported from PredictionIO's UR-0.7.3 - Minor Universal Recommender feature enhancements - Business Rules now called `rules` in queries - Event aliases supported to group event types and rearrange via config, requiring no data changes diff --git a/workflow.md b/docs/workflow.md similarity index 100% rename from workflow.md rename to docs/workflow.md diff --git a/harness-3.0-requirements.md b/harness-3.0-requirements.md deleted file mode 100644 index de0eda50..00000000 --- a/harness-3.0-requirements.md +++ /dev/null @@ -1,108 +0,0 @@ -# Harness 0.3.0 Requirements - -Harness 0.3.0 will run big-data Lambda style algorithms based, at first, on Spark. This will enable the use of MLlib and Mahout based algorithms in Engines. These include the Universal Recommender's CCO and MLlib's LDA in separate Engines. - -Basic requirements: - -- HDFS 2.8 or latest -- Spark 2.3 or latest stable -- MongoDB 4.x: read a collection into a Dataframe. Question: does the Spark lib for MongoDB support Spark 2.3? -- Elasticsearch 6.x: write a Spark distributed dataset (maybe Dataframe) to an ES index.Question: does the Spark lib for Elasticsearch support Spark 2.3? -- Scala 2.11 -- Mahout 0.14.0-SNAPSHOT: this runs on Spark 2.3 (or so they say—compiles but untested except for unit tests) -- Either Yarn or Kubernetes for job management. The hard requirement is being able to run more than one job at once on a Spark cluster, something like Yarn's cluster mode, and being able to run the driver on a Spark worker. **Note:** rumor has it that k8 support is not as solid as Yarn so may not be a good choice, a small amount of research is required to answer this. - -# Spark Support - -Each Engine that uses Spark will contain a section of JSON, which defines name: value pairs like the ones spar-submit supports and like this: - -```json -"sparkConf": { - "master": "yarn", // or possible to do any master, "local[8]" etc. 
- "deploy-mode": "cluster", - "driver-memory": "4g", - "executor-memory": "2g", - "executor-cores": 1, - "spark.serializer": "org.apache.spark.serializer.KryoSerializer", - "spark.kryo.registrator": "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator", - "spark.kryo.referenceTracking": "false", - "spark.kryoserializer.buffer": "300m", - "spark.es.index.auto.create": "true", - "spark.es.nodes': "es-node-1,es-node-2" - // any arbitrary spark name: value pairs added to the - // spark context without validation since some are params - // for libs, like es.nodes -} -``` - -The `sparkConf` section in the Engine config will validate only part of the config that is specific to the Engine's needs. Some `name: value` pairs are simply put in the context as params for libs. Above some are used by Mahout and Elasticsearch but others are known to the Spark code like `master` and `deploy-mode` - -Note that there are special requirements to support Yarn's Cluster mode deployment and these will need to be enumerated here and supported. - -# UREngine JSON - -The config JSON file for the UREngine will have the above `sparkConf` but also an `algorithm` section like this: - -```json -"algorithm": { - "comment": "simplest setup where all values are default, popularity based backfill, must add eventsNames", - "name": "ur", - "params": { - "appName": "handmade", - "indexName": "urindex", - "typeName": "items", - "comment": "must have data for the first event or the model will not build, other events are optional", - "indicators": [ - { - "name": "purchase" - },{ - "name": "view", - "maxCorrelatorsPerItem": 50 - },{ - "name": "category-pref", - "maxCorrelatorsPerItem": 50, - "minLLR": 5.0, - } - ], - "availableDateName": "available", - "expireDateName": "expires", - "dateName": "date", - "num": 4 - } -} -``` - -This section will be parsed by the algorithm and does not need parsing until the algo implementation is under way. - - -# MongoDB Spark Support - -MongoDB is already supported in the generic `Store` and `DAO` with a custom implementation for Mongo. Since not all Engines will require the reading (or writing) of distributed Spark datasets it might be good to separate this into a trait or some king of extension of the base classes. This support should be something the Engine can pick. We ideally want to allow future versions of Harness to inject the `DAO` and `Store` implementation so whatever mechanism used should probably allow injection. This might be something like a `DAO` abstract interface and a `SparkDAO` interface with implementation of the same structure. In any case injection is not a hard requirement so we can pick a solution without injection if it make things significantly easier. - -# MongoDB Query Result Limit and Ordering - -The UR requires a way to ask for `Events` from a collection of a certain size and ordered by datetime, which is a part of every Event. The ordering and limit is not part of DAO now, which returns all results. - -The UR can work if the ordering is defaulted per collection and does not need to be specified in the query, if that makes the task easier. Mongo allows the ordering and limit to be specified in the query so this is also fine. Mixing a filter with order and limit is ideal so that the UR can ask for members of a collection with `"name": "buy"` and return `"limit": 100`, `"ordered_by": {"date": "descending"}`. - -To make the order performant it may be required to create an index on the key that is used to order—at least with Mongo. 
Other DBs may not support this but this is not our primary concern now. - -This can be encoded in any way that fits the DAO/Store pattern using an appropriate API. This is a general feature of DB Stores so is not part of the Spark extension and so may be best put in the generic `Store`/`DAO` interface. - -# HDFS Support - -With big-data and Spark come the need for a scalable distributed file system in HDFS. HDFS 2.8 seems to be compatible with tools that need it. - -The mirroring support should be upgraded to store in HDFS. There is a way to specify which store to use for mirroring in the global part of the Engine's JSON so we now need to use and support this otherwise mirroring big data will overwhelm a single machine's file system - -HDFS is also used by Spark for all storage and so the config at very least needs to be known to Spark. - -# New Harness CLI - -New commands are needed to manage batch jobs that create models in Lambda offline learners. They are: - - - **`harness train ** This requests an existing Engine train up a model from existing data. - - **`harness kill-job ** This assumes one job per Engine and the Engine will try to kill it, reporting status afterwards. - - **harness status engine ** This command already exists but will include a section of the response JSON that gives job status if any. - -Using `harness train` and periodic `harness status` a script can tell when training has completed. This in turn can be used for custom workflows. For example, if one Engine instance takes the output of another as input. diff --git a/integration-test.md b/integration-test.md deleted file mode 100644 index 58cc1753..00000000 --- a/integration-test.md +++ /dev/null @@ -1,24 +0,0 @@ -# Running the Harness Integration Test - -## PVR and Non-personlized Nav-Hinting - -The test on a completely clean source install. Get to the point where you can run the Python commands, especially `harness start`. Note that building Harness requires you build the Auth server on the same machine with `sbt publishlocal` to populate the `~/.ivy2` cache with jars used to build Harness. - -If harness is running with no TLS and no Authentication required there is a minimal "smoke test". Follow these steps: - - - Pull the harness-java-sdk repo into a new location with `git clone ...` - - cd into the local repo - - make sure you are on the master branch - - `mvn clean install` This will put needed Harness Java SDK binaries in the local `~/.m2` cache for building the test code. They are not needed to run or build Harness. - - With Harness running you can now start the integration test - - `cd harness-java-sdk` - - `./integration-test.sh` - -## URNavHinting - -This will be merged with the single integration test but is currently run from a different place and uses the Python SDK instead of the Java SDK. - -After running the PVR integration test with Harness running: - - - `cd harness/rest-server/examples` - - `./urnh-integration-test.sh` \ No newline at end of file diff --git a/mirroring.md b/mirroring.md deleted file mode 100644 index 558ee2e8..00000000 --- a/mirroring.md +++ /dev/null @@ -1,15 +0,0 @@ -# Input Mirroring - -Harness will mirror (log) all raw events with no validation, when configured to do so for a specific Engine instance. This is useful if you wanted to be able to backup/restore all data or are experimenting with changes in engine parameters and wish to recreate the models using past mirrored data. - -To accomplish this, you must set up mirroring for the Harness Server. 
Once the Engine is launched with a mirrored configuration all events sent to `POST /engines//events` will be mirrored to a location set in `some-engine.json`. **Note** Events will be mirrored until the config setting is changed and so can grow without limit, like unrotated server logs. - -To enable mirroring add the following to the `some-engine.json` for the engine you want to mirror events: - - "mirrorType": "localfs", // optional, turn on a type of mirroring - "mirrorContainer": "path/to/mirror", // optional, where to mirror input - -set these in the global engine params, not in algorithm params as in the "Base Parameters" section above. - -Mirroring is similar to logging. Each new Event is logged to a file before any validation. The format is JSON one event per line. This can be used to backup an Engine Instance or to move data to a new instance. - diff --git a/rest-server/build.sbt b/rest-server/build.sbt index e00abda1..41ff7cea 100644 --- a/rest-server/build.sbt +++ b/rest-server/build.sbt @@ -2,7 +2,7 @@ import sbt.Keys.resolvers name := "harness" -version := "0.4.0-RC1" +version := "0.5.0-SNAPSHOT" scalaVersion := "2.11.12" diff --git a/rest-server/engines/src/main/scala/com/actionml/engines/ur/URAlgorithm.scala b/rest-server/engines/src/main/scala/com/actionml/engines/ur/URAlgorithm.scala index 4b3e8453..16788ab4 100644 --- a/rest-server/engines/src/main/scala/com/actionml/engines/ur/URAlgorithm.scala +++ b/rest-server/engines/src/main/scala/com/actionml/engines/ur/URAlgorithm.scala @@ -145,7 +145,7 @@ class URAlgorithm private ( modelEventNames = params.indicators.map(_.name) - blacklistEvents = params.blacklistEvents.getOrElse(Seq(modelEventNames.head)) // empty Seq[String] means no blacklist + blacklistEvents = params.blacklistIndicators.getOrElse(Seq(modelEventNames.head)) // empty Seq[String] means no blacklist returnSelf = params.returnSelf.getOrElse(DefaultURAlgoParams.ReturnSelf) fields = params.rules.getOrElse(Seq.empty[Rule]) @@ -162,7 +162,7 @@ class URAlgorithm private ( rankingsParams = params.rankings.getOrElse(Seq(RankingParams( name = Some(DefaultURAlgoParams.BackfillFieldName), `type` = Some(DefaultURAlgoParams.BackfillType), - eventNames = Some(modelEventNames.take(1)), + indicatorNames = Some(modelEventNames.take(1)), offsetDate = None, endDate = None, duration = Some(DefaultURAlgoParams.BackfillDuration)))).groupBy(_.`type`).map(_._2.head).toSeq @@ -502,7 +502,7 @@ class URAlgorithm private ( import DaoQuery.syntax._ - val queryEventNamesFilter = query.eventNames.getOrElse(modelEventNames) // indicatorParams in query take precedence + val queryEventNamesFilter = query.indicatorNames.getOrElse(modelEventNames) // indicatorParams in query take precedence // these are used in the MAP@k test to limit the indicators used for the query to measure the indicator's predictive // strength. 
DO NOT document, only for tests @@ -655,7 +655,7 @@ class URAlgorithm private ( val rankingFieldName = rankingParams.name.getOrElse(PopModel.nameByType(rankingType)) val durationAsString = rankingParams.duration.getOrElse(DefaultURAlgoParams.BackfillDuration) val duration = Duration(durationAsString).toSeconds.toInt - val backfillEvents = rankingParams.eventNames.getOrElse(modelEventNames.take(1)) + val backfillEvents = rankingParams.indicatorNames.getOrElse(modelEventNames.take(1)) val offsetDate = rankingParams.offsetDate val rankRdd = popModel.calc(rankingType, eventsRdd, backfillEvents, duration, offsetDate) rankingFieldName -> rankRdd @@ -734,7 +734,7 @@ object URAlgorithm extends JsonSupport { case class RankingParams( name: Option[String] = None, `type`: Option[String] = None, // See [[com.actionml.BackfillType]] - eventNames: Option[Seq[String]] = None, // None means use the algo indicatorParams findMany, otherwise a findMany of events + indicatorNames: Option[Seq[String]] = None, // None means use the algo indicatorParams findMany, otherwise a findMany of events offsetDate: Option[String] = None, // used only for tests, specifies the offset date to start the duration so the most // recent date for events going back by from the more recent offsetDate - duration endDate: Option[String] = None, @@ -743,7 +743,7 @@ object URAlgorithm extends JsonSupport { s""" |_id: $name, |type: ${`type`}, - |indicatorParams: $eventNames, + |indicatorParams: $indicatorNames, |offsetDate: $offsetDate, |endDate: $endDate, |duration: $duration @@ -770,7 +770,7 @@ object URAlgorithm extends JsonSupport { typeName: Option[String], // can optionally be used to specify the elasticsearch type name recsModel: Option[String] = None, // "all", "collabFiltering", "backfill" // indicatorParams: Option[Seq[String]], // names used to ID all user indicatorRDDs - blacklistEvents: Option[Seq[String]] = None, // None means use the primary event, empty array means no filter + blacklistIndicators: Option[Seq[String]] = None, // None means use the primary event, empty array means no filter // number of events in user-based recs query maxQueryEvents: Option[Int] = None, maxEventsPerEventType: Option[Int] = None, diff --git a/rest-server/engines/src/main/scala/com/actionml/engines/ur/UREngine.scala b/rest-server/engines/src/main/scala/com/actionml/engines/ur/UREngine.scala index 6d44af3a..8c8012b1 100644 --- a/rest-server/engines/src/main/scala/com/actionml/engines/ur/UREngine.scala +++ b/rest-server/engines/src/main/scala/com/actionml/engines/ur/UREngine.scala @@ -195,7 +195,7 @@ object UREngine extends JsonSupport { // to what is in the algorithm params or false num: Option[Int] = None, // default: whatever is in algorithm params, which itself has a default--probably 20 from: Option[Int] = None, // paginate from this position return "num" - eventNames: Option[Seq[String]], // names used to ID all user indicatorRDDs + indicatorNames: Option[Seq[String]], // names used to ID all user indicatorRDDs withRanks: Option[Boolean] = None) // Add to ItemScore rank rules values, default false extends Query diff --git a/rest-server/engines/src/main/scala/com/actionml/engines/urnavhinting/URNavHintingAlgorithm.scala b/rest-server/engines/src/main/scala/com/actionml/engines/urnavhinting/URNavHintingAlgorithm.scala index e3bff447..914d6afd 100644 --- a/rest-server/engines/src/main/scala/com/actionml/engines/urnavhinting/URNavHintingAlgorithm.scala +++ 
b/rest-server/engines/src/main/scala/com/actionml/engines/urnavhinting/URNavHintingAlgorithm.scala @@ -135,9 +135,9 @@ class URNavHintingAlgorithm private ( minLLR = indicatorParams.minLLR) }.toMap } else { - logger.error("Must have either \"eventNames\" or \"indicators\" in algorithm parameters, which are: " + + logger.error("Must have either \"indicatorNames\" or \"indicators\" in algorithm parameters, which are: " + s"$params") - err = Invalid(MissingParams(jsonComment("Must have either eventNames or indicators in algorithm parameters, which are: " + + err = Invalid(MissingParams(jsonComment("Must have either indicatorNames or indicators in algorithm parameters, which are: " + s"$params"))) } @@ -536,7 +536,7 @@ object URNavHintingAlgorithm extends JsonSupport { s""" |_id: $name, |type: ${`type`}, - |eventNames: $eventNames, + |indicatorNames: $eventNames, |offsetDate: $offsetDate, |endDate: $endDate, |duration: $duration