This repository has been archived by the owner on Jul 20, 2021. It is now read-only.

Missing state model for DataSet #602

Open
manuelbernhardt opened this issue Jul 12, 2012 · 3 comments
@manuelbernhardt
Contributor

There's a DataSet state model inherited from the days when that state was entirely managed by the Sip-Creator; it has since evolved a little to accommodate changes in the processing model.

The current states are:

  • INCOMPLETE
  • PARSING
  • UPLOADED
  • QUEUED
  • PROCESSING
  • CANCELLED
  • ENABLED
  • DISABLED
  • ERROR
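
For reference, the current flat states could be modeled as a simple enumeration. This is only a sketch (the name `DataSetState` and the string values are assumptions, not the hub's actual code):

```python
from enum import Enum

class DataSetState(Enum):
    """The current flat DataSet states, as listed above (hypothetical sketch)."""
    INCOMPLETE = "incomplete"
    PARSING = "parsing"
    UPLOADED = "uploaded"
    QUEUED = "queued"
    PROCESSING = "processing"
    CANCELLED = "cancelled"
    ENABLED = "enabled"
    DISABLED = "disabled"
    ERROR = "error"
```

A single flat enum like this is exactly what makes the gaps below hard to express: every new concern multiplies the number of states.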

The event model now aims to reflect every state change and other events related to a DataSet. The existing events are:

  • Created
  • Updated
  • Removed
  • StateChanged
  • Error (propagates the error message)
  • Locked
  • Unlocked
  • SourceRecordCountChanged (transient event, only used to give progress feedback in the UI)
  • ProcessedRecordCountChanged (transient event, only used to give progress feedback in the UI)
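
A minimal sketch of this event model, assuming each event carries the DataSet's spec and that payloads are attached per event type (all class and field names here are hypothetical, not the hub's actual API):

```python
from dataclasses import dataclass

@dataclass
class DataSetEvent:
    """Base event; `spec` identifies the DataSet concerned (assumed field name)."""
    spec: str

@dataclass
class StateChanged(DataSetEvent):
    new_state: str

@dataclass
class Error(DataSetEvent):
    message: str  # the Error event propagates the error message

@dataclass
class SourceRecordCountChanged(DataSetEvent):
    count: int  # transient: only used for UI progress feedback, not persisted
```

Modeling events as distinct types with payloads makes it cheap to add the missing events discussed below without touching the state enumeration.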

There are, however, a number of events or state changes (depending on the semantics of the state model) that are not captured:

  • when a new mapping is uploaded to the hub from the Sip-Creator
  • when a new set of invalid record indexes is uploaded to the hub from the Sip-Creator
  • when Sip-Creator hints are uploaded
  • when a set is being downloaded by the Sip-Creator (this could be inferred from the state changing to Locked, but it's not exactly the same)

Some of the above events, although discrete, could perhaps be viewed as one, since e.g. mappings, invalid records (and hints?) are usually connected on an abstract level. These events have an impact on other, not yet existing, states: for example, a new mapping means that the set is "outdated" in some way, since the new mapping probably influences what the processed data and index look like.

We should think of a better state model for the DataSet, with various states related to the different parts of the life-cycle:

  • the creation life-cycle (created, meta-data updated, deleted)
  • the publishing life-cycle (queued, processing, enabled, disabled)
  • the provisioning life-cycle (mapping uploaded, invalid record indexes uploaded, statistics uploaded, hints uploaded, storing source, source uploaded)
  • the usage life-cycle (locked, unlocked, downloading source?)

It might also make sense to consider splitting the publishing life-cycle so that it is possible to index without re-creating the cache. For this, however, the state model needs to reflect whether the cache is outdated relative to the mapping.
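
If the sip-creator meta-information were versioned (as suggested below), the "is the cache outdated?" question reduces to a version comparison. A sketch under that assumption (function names are made up for illustration):

```python
def cache_outdated(mapping_version: int, cache_version: int) -> bool:
    """The cache is stale whenever it was built from an older mapping version."""
    return cache_version < mapping_version

def can_index_without_recache(mapping_version: int, cache_version: int) -> bool:
    """Indexing may skip cache re-creation only when the cache is current."""
    return not cache_outdated(mapping_version, cache_version)
```

The point is that the staleness check becomes trivial once versions exist; without versioning, the state model would need an explicit "outdated" flag that is easy to forget to maintain.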

I think a lot of the above would become clearer if we could somehow bundle the sip-creator "meta-information" (mapping, invalid records, hints, statistics) and keep versions of it. We dropped the versioning of source data for the time being because it doesn't effectively add any value and isn't technically viable at the moment: many of the "versions" created contained identical data and were only versioned because of problems in identifying the same records.

@ghost ghost assigned manuelbernhardt Jul 12, 2012
@geralddejong
Contributor

I think that the state machine is the core concept orchestrating the internal and external aspects of the dataset workflow, so it is of paramount importance that we devise and refine this state machine in the context of a unit test which covers all possible situations.

I'm thinking more in terms of a state composed of a number of bits rather than an enumeration of all individual plausible states. Many of the state transitions could pay attention to only one or two of these bits, which avoids the quadratic explosion of state transitions. In other words, many of these bits cast shadows on the others.

I come up with 9 bits:

  • CREATED/DELETED
  • ENABLED/DISABLED
  • UPLOADING/UPLOADED
  • PARSING/PARSED
  • NEEDS_INDEXING/QUEUED
  • INDEXING/INDEXED
  • ERROR(what)/NO_ERROR
  • DOWNLOADING/NOT_DOWNLOADING
  • LOCKED(who)/NOT_LOCKED
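
The bit-composition idea can be sketched with a flag set where each transition touches only the bits it cares about. This is a hypothetical illustration (bit names follow the pairs above; set means the right-hand name, unset the left-hand one). Note that ERROR(what) and LOCKED(who) carry payloads, so a pure bitfield would need side data for those:

```python
from enum import IntFlag

class DataSetFlags(IntFlag):
    """One bit per orthogonal aspect of a DataSet's state (sketch)."""
    DELETED = 1 << 0       # unset = CREATED
    ENABLED = 1 << 1       # unset = DISABLED
    UPLOADED = 1 << 2      # unset = UPLOADING
    PARSED = 1 << 3        # unset = PARSING
    QUEUED = 1 << 4        # unset = NEEDS_INDEXING
    INDEXED = 1 << 5       # unset = INDEXING
    ERROR = 1 << 6         # the error message would live alongside, not in the bit
    DOWNLOADING = 1 << 7   # unset = NOT_DOWNLOADING
    LOCKED = 1 << 8        # the lock holder would live alongside, not in the bit

def finish_indexing(state: DataSetFlags) -> DataSetFlags:
    """A transition that only touches the indexing bits, ignoring all others."""
    return (state & ~DataSetFlags.QUEUED) | DataSetFlags.INDEXED
```

Because `finish_indexing` reads and writes only two bits, it composes with any combination of the other seven, which is exactly how this representation avoids the quadratic explosion of explicit state-to-state transitions.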

@geralddejong
Contributor

It should be possible to upload data before any proper mapping has been performed so that one person can upload the data and hand it over to somebody else for building a mapping.

@manuelbernhardt
Contributor Author

Actually, this is already possible. If you then try processing the set with missing mappings, they will simply be ignored during the run. Of course, it'd be better if the interface didn't let you process at all in that case.

