Job tracking #31

Open
nevali opened this issue Feb 28, 2017 · 2 comments

@nevali
Member

nevali commented Feb 28, 2017

Optionally support a libsql connection URI which will be used to track jobs as they are processed by twine-writerd or twine-cli.

A job consists of:

  • A UUID to identify it
  • Optionally, a parent UUID
  • A URI to identify it (which may simply be a urn:uuid: representation of the job UUID, if nothing else is suitable, otherwise it'll be the canonical source or target URI, depending upon the processing pipeline; workflow components may update it accordingly during processing)
  • Timestamps for added and updated
  • A status: WAITING, ACTIVE, ABORTED (by the user), COMPLETE, FAILED, ERRORS (partial failure)
  • A status annotation (free-text) which may be set to indicate the failure reason
  • If active, the cluster/instance details of the node processing the job (preserved for diagnosis once set)
  • Processing item x of y progress indicators (particularly for bulk ingests from filesystem sources)
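
As a rough illustration of the fields listed above, one job row could be mirrored in memory as follows. This is only a sketch: the type and field names (`job_record`, `job_record_init()`, the buffer sizes) are assumptions made for the example and are not part of libtwine.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical status values, named after the list in the issue. */
typedef enum {
    JOB_WAITING, JOB_ACTIVE, JOB_ABORTED, JOB_COMPLETE, JOB_FAILED, JOB_ERRORS
} job_status;

/* Hypothetical in-memory mirror of one job row; all sizes are assumptions. */
typedef struct {
    char uuid[37];        /* canonical 36-character form plus NUL */
    char parent[37];      /* empty string when the job has no parent */
    char uri[256];        /* identifying URI */
    time_t added;         /* timestamp: row created */
    time_t updated;       /* timestamp: row last changed */
    job_status status;
    char annotation[128]; /* free-text note, e.g. a failure reason */
    char node[64];        /* cluster/instance details, kept once set */
    int current;          /* progress: item x ... */
    int total;            /* ... of y */
} job_record;

/* Initialise a record in state WAITING. When no URI is supplied, fall back
 * to a urn:uuid: representation of the job UUID.
 * Returns 0 on success, -1 on invalid arguments. */
int job_record_init(job_record *job, const char *uuid, const char *uri, time_t now)
{
    if (!job || !uuid || strlen(uuid) != 36) {
        return -1;
    }
    memset(job, 0, sizeof(*job));
    memcpy(job->uuid, uuid, 37);
    if (uri) {
        if (strlen(uri) >= sizeof(job->uri)) {
            return -1;
        }
        strcpy(job->uri, uri);
    } else {
        snprintf(job->uri, sizeof(job->uri), "urn:uuid:%s", uuid);
    }
    job->added = now;
    job->updated = now;
    job->status = JOB_WAITING;
    return 0;
}
```

A real implementation would presumably persist these values through libsql rather than holding them in a struct; the struct only makes the shape of a row concrete.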

Where possible, UUIDs should be taken from the source, if it incorporates one into its identification; otherwise they should be generated on the fly.

A job stack should be maintained internally to libtwine in order to track parent/child relationships, rather than requiring it to be made explicit.

As an example, an ingest of N-Quads from a file, processed with spindle-correlate, might yield the following:

  • A job is created in state WAITING with a newly-generated UUID and a file:/// URI
  • The N-Quads are parsed and the number of graphs determined; the job is updated to state ACTIVE, with progress set to 0 of number-of-graphs
  • For each graph that is correlated by Spindle, progress is updated, and a new child job is created in state WAITING, using the Spindle-generated UUID and URI
  • Once processing of the N-Quads is complete, the job status is updated to COMPLETE
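
The sequence of state changes in this example can be simulated with a small in-memory table. The sketch below is illustrative only: integer indices stand in for UUIDs, and `job_table`, `table_add()` and `simulate_ingest()` are hypothetical names, not libtwine functions.

```c
/* Only the states used in the walkthrough are modelled here. */
typedef enum { ST_WAITING, ST_ACTIVE, ST_COMPLETE } sim_status;

typedef struct {
    int id;             /* stand-in for the job UUID */
    int parent;         /* index of the parent job, or -1 for none */
    sim_status status;
    int current;        /* progress: item x ... */
    int total;          /* ... of y */
} sim_job;

#define MAX_JOBS 32

typedef struct {
    sim_job jobs[MAX_JOBS];
    int count;
} job_table;

/* Add a job in state WAITING; returns its index, or -1 if the table is full. */
int table_add(job_table *t, int parent)
{
    sim_job *j;

    if (t->count >= MAX_JOBS) {
        return -1;
    }
    j = &t->jobs[t->count];
    j->id = t->count;
    j->parent = parent;
    j->status = ST_WAITING;
    j->current = 0;
    j->total = 0;
    return t->count++;
}

/* Walk through the ingest of a file containing ngraphs graphs.
 * Returns the index of the top-level job, or -1 on failure. */
int simulate_ingest(job_table *t, int ngraphs)
{
    int i;
    int top = table_add(t, -1);            /* job created, state WAITING */

    if (top < 0) {
        return -1;
    }
    t->jobs[top].status = ST_ACTIVE;       /* parsed: ACTIVE, 0 of ngraphs */
    t->jobs[top].total = ngraphs;
    for (i = 0; i < ngraphs; i++) {
        if (table_add(t, top) < 0) {       /* child job per correlated graph */
            return -1;
        }
        t->jobs[top].current = i + 1;      /* progress updated */
    }
    t->jobs[top].status = ST_COMPLETE;     /* ingest finished */
    return top;
}
```

After `simulate_ingest(&t, 3)` on an empty table, the table holds one COMPLETE parent at 3 of 3 and three WAITING children pointing at it, which is the state that the later proxy-generation step would pick up.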

As spindle-generate later processes its queue of items, it performs the following:

  • A job is created in state WAITING using the Spindle-generated UUID and URI; if it already exists, its parentage is preserved (thus, if the job originated from an ingest as described above, the proxy-generation step maintains the parent-child relationship, allowing for ready visualisation)
  • As the proxy is generated, its status is updated accordingly

With this arrangement, a small number of relatively simple SQL queries can provide progress tracking and volumetrics across a processing cluster.

Open question: how would Twine know when to preserve versus replace the parent of a job?

Perhaps it could be as simple as user action (i.e., twine-cli) taking precedence over an on-going process — thus, a queue-driven twine-writerd will only set the parent of a job if it's newly-created, whereas twine-cli will always override it. Both would create an overarching job for their processing runs, whether that's from a file or a queue.
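
A minimal sketch of that precedence rule, assuming a two-valued mode (preserve for a queue-driven daemon, force for a user-initiated run). The names `parent_mode` and `resolve_parent()` are hypothetical and only illustrate the decision, not how libtwine would store it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical modes: PRESERVE for a queue-driven process, FORCE for
 * explicit user action. */
typedef enum { PARENT_PRESERVE, PARENT_FORCE } parent_mode;

/* Decide which parent UUID to record for a job.
 *   row_exists      - whether a row for this job's UUID is already present
 *   existing_parent - parent currently stored on the row (NULL if none)
 *   candidate       - parent implied by the current processing run
 * In FORCE mode the candidate always wins. In PRESERVE mode the candidate
 * is used only when the row is newly created; an existing row keeps
 * whatever parent it already has, including none. */
const char *resolve_parent(bool row_exists, const char *existing_parent,
                           const char *candidate, parent_mode mode)
{
    if (mode == PARENT_FORCE || !row_exists) {
        return candidate;
    }
    return existing_parent;
}
```

Under this rule a job first created by an ingest keeps the ingest as its parent when a daemon later processes it from a queue, while re-running the same item by hand re-parents it under the new run.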

Tracked as RESDATA-1279

@nevali
Member Author

nevali commented Feb 28, 2017

Sketched interface to be implemented as part of libtwine to support this functionality:

typedef /*opaque*/ struct twine_job_struct TWINEJOB;
typedef enum
{
  TJS_WAITING,
  TJS_ACTIVE,
  TJS_ABORTED,
  TJS_COMPLETED,
  TJS_FAILED,
  TJS_ERRORS
} TWINEJOBSTATUS;

typedef enum
{
  TJP_PRESERVE,
  TJP_FORCE
} TWINEJOBPARENTAGE;

/* This is a relatively low-level libtwine API: its side-effects are limited to
 * twine_job_create() creating or updating rows, depending upon the parentage
 * mode of the current parent job and whether a row for that UUID exists or not.
 */
TWINEJOB *twine_job_create(const uuid_t uuid, const char *restrict uri, CLUSTER *restrict /*optional*/ cluster);
int twine_job_close(TWINEJOB *job);
const char *twine_job_uristr(TWINEJOB *job);
int twine_job_set_uristr(TWINEJOB *restrict job, const char *restrict uri);
/* NB: possibly require URI and librdf_uri variants of the above */
int twine_job_set_parentage(TWINEJOB *job, TWINEJOBPARENTAGE mode);
int twine_job_update(TWINEJOB *restrict job, TWINEJOBSTATUS status, const char *restrict /*optional*/ annotation);
int twine_job_set_progress(TWINEJOB *job, int /*optional*/ current, int /*optional*/ total);
/* NB: twine_job_set_progress() uses -1 as a sentinel to indicate NULL integer values;
 * a -1 argument causes the corresponding stored value to be left unchanged:
 * twine_job_set_progress(job, -1, -1); is therefore a no-op
 */
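
The sentinel behaviour described in the final comment could look something like the following. This is a sketch against a stand-in struct, since TWINEJOB is opaque; `sketch_progress` and `sketch_set_progress()` are hypothetical names, and rejecting values below -1 is an assumption rather than something the interface specifies.

```c
#include <stddef.h>

/* Stand-in for the progress fields of a job. */
typedef struct {
    int current;   /* item x ... */
    int total;     /* ... of y */
} sketch_progress;

/* Update progress, treating -1 as "leave this value unchanged".
 * Returns 0 on success, -1 on invalid arguments (assumed convention). */
int sketch_set_progress(sketch_progress *p, int current, int total)
{
    if (p == NULL || current < -1 || total < -1) {
        return -1;
    }
    if (current != -1) {
        p->current = current;
    }
    if (total != -1) {
        p->total = total;
    }
    return 0;
}
```

With this shape a caller can advance the counter on its own, as in `sketch_set_progress(&p, 3, -1)`, without needing to know or re-send the total.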

@nevali nevali added the triaged label Mar 21, 2017
@nevali
Member Author

nevali commented Mar 21, 2017

Arguably the core state-tracking mechanism of this should be moved to bbcarchdev/libcluster itself, with Twine simply employing it.
