This repository has been archived by the owner on Jul 20, 2021. It is now read-only.

Missing state model for DataSet #602

Open
manuelbernhardt opened this issue Jul 12, 2012 · 3 comments
@manuelbernhardt
Contributor

There's a DataSet state model inherited from the days when that state was entirely managed by the Sip-Creator; it has since evolved a little to accommodate changes in the processing model.

The current states are:

  • INCOMPLETE
  • PARSING
  • UPLOADED
  • QUEUED
  • PROCESSING
  • CANCELLED
  • ENABLED
  • DISABLED
  • ERROR
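
For reference, the current flat states could be modeled as a simple enumeration. This is only a sketch (the name `DataSetState` and the string values are assumptions, not the hub's actual code):

```python
from enum import Enum

class DataSetState(Enum):
    """The current flat DataSet states, as listed above (hypothetical sketch)."""
    INCOMPLETE = "incomplete"
    PARSING = "parsing"
    UPLOADED = "uploaded"
    QUEUED = "queued"
    PROCESSING = "processing"
    CANCELLED = "cancelled"
    ENABLED = "enabled"
    DISABLED = "disabled"
    ERROR = "error"
```

A single flat enum like this is exactly what makes the gaps below hard to express: every new concern multiplies the number of states.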

The event model now aims to reflect every state change and other events related to a DataSet. The existing events are:

  • Created
  • Updated
  • Removed
  • StateChanged
  • Error (propagates the error message)
  • Locked
  • Unlocked
  • SourceRecordCountChanged (transient event, only used to give progress feedback in the UI)
  • ProcessedRecordCountChanged (transient event, only used to give progress feedback in the UI)
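
A minimal sketch of this event model, assuming each event carries the DataSet's spec and that payloads are attached per event type (all class and field names here are hypothetical, not the hub's actual API):

```python
from dataclasses import dataclass

@dataclass
class DataSetEvent:
    """Base event; `spec` identifies the DataSet concerned (assumed field name)."""
    spec: str

@dataclass
class StateChanged(DataSetEvent):
    new_state: str

@dataclass
class Error(DataSetEvent):
    message: str  # the Error event propagates the error message

@dataclass
class SourceRecordCountChanged(DataSetEvent):
    count: int  # transient: only used for UI progress feedback, not persisted
```

Modeling events as distinct types with payloads makes it cheap to add the missing events discussed below without touching the state enumeration.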

There are, however, a number of events or state changes (depending on the semantics of the state model) that are not captured:

  • when a new mapping is uploaded to the hub from the Sip-Creator
  • when a new set of invalid record indexes is uploaded to the hub from the Sip-Creator
  • when Sip-Creator hints are uploaded
  • when a set is being downloaded by the Sip-Creator (this could be inferred from the state changing to Locked, but it's not exactly the same)

Some of the above events, although discrete, could perhaps be viewed as one, since e.g. mappings, invalid records (and hints?) are usually connected on an abstract level. These events have an impact on other, not yet existing, states: for example, a new mapping means that the set is "outdated" in some way, since the new mapping probably influences what the processed data and index look like.

We should think of a better state model for the DataSet, with various states related to the different parts of the life-cycle:

  • the creation life-cycle (created, meta-data updated, deleted)
  • the publishing life-cycle (queued, processing, enabled, disabled)
  • the provisioning life-cycle (mapping uploaded, invalid record indexes uploaded, statistics uploaded, hints uploaded, storing source, source uploaded)
  • the usage life-cycle (locked, unlocked, downloading source?)

It might also make sense to consider splitting the publishing life-cycle so that it is possible to index without re-creating the cache. For this, however, the state model needs to reflect whether the cache is outdated relative to the mapping.
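
If the sip-creator meta-information were versioned (as suggested below), the "is the cache outdated?" question reduces to a version comparison. A sketch under that assumption (function names are made up for illustration):

```python
def cache_outdated(mapping_version: int, cache_version: int) -> bool:
    """The cache is stale whenever it was built from an older mapping version."""
    return cache_version < mapping_version

def can_index_without_recache(mapping_version: int, cache_version: int) -> bool:
    """Indexing may skip cache re-creation only when the cache is current."""
    return not cache_outdated(mapping_version, cache_version)
```

The point is that the staleness check becomes trivial once versions exist; without versioning, the state model would need an explicit "outdated" flag that is easy to forget to maintain.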

I think a lot of the above would become clearer if we could somehow bundle the sip-creator "meta-information" (mapping, invalid records, hints, statistics) and keep versions of it. We dropped the versioning of source data for the time being because it doesn't effectively add any value and isn't technically viable at the moment: many of the "versions" created contained identical data and were only versioned because of problems in identifying the same records.

@ghost ghost assigned manuelbernhardt Jul 12, 2012
@geralddejong
Contributor

I think that the state machine is the core concept orchestrating the internal and external aspects of the dataset workflow, so it is of paramount importance that we devise and refine this state machine in the context of a unit test which covers all possible situations.

I'm thinking more in terms of a state composed of a number of bits rather than an enumeration of all individual plausible states. Many of the state transitions could pay attention to only one or two of these bits, which avoids the quadratic explosion of state transitions. In other words, many of these bits cast shadows on the others.

I come up with 9 bits:

  • CREATED/DELETED
  • ENABLED/DISABLED
  • UPLOADING/UPLOADED
  • PARSING/PARSED
  • NEEDS_INDEXING/QUEUED
  • INDEXING/INDEXED
  • ERROR(what)/NO_ERROR
  • DOWNLOADING/NOT_DOWNLOADING
  • LOCKED(who)/NOT_LOCKED
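
The bit-composition idea can be sketched with a flag set where each transition touches only the bits it cares about. This is a hypothetical illustration (bit names follow the pairs above; set means the right-hand name, unset the left-hand one). Note that ERROR(what) and LOCKED(who) carry payloads, so a pure bitfield would need side data for those:

```python
from enum import IntFlag

class DataSetFlags(IntFlag):
    """One bit per orthogonal aspect of a DataSet's state (sketch)."""
    DELETED = 1 << 0       # unset = CREATED
    ENABLED = 1 << 1       # unset = DISABLED
    UPLOADED = 1 << 2      # unset = UPLOADING
    PARSED = 1 << 3        # unset = PARSING
    QUEUED = 1 << 4        # unset = NEEDS_INDEXING
    INDEXED = 1 << 5       # unset = INDEXING
    ERROR = 1 << 6         # the error message would live alongside, not in the bit
    DOWNLOADING = 1 << 7   # unset = NOT_DOWNLOADING
    LOCKED = 1 << 8        # the lock holder would live alongside, not in the bit

def finish_indexing(state: DataSetFlags) -> DataSetFlags:
    """A transition that only touches the indexing bits, ignoring all others."""
    return (state & ~DataSetFlags.QUEUED) | DataSetFlags.INDEXED
```

Because `finish_indexing` reads and writes only two bits, it composes with any combination of the other seven, which is exactly how this representation avoids the quadratic explosion of explicit state-to-state transitions.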

@geralddejong
Contributor

It should be possible to upload data before any proper mapping has been performed so that one person can upload the data and hand it over to somebody else for building a mapping.

@manuelbernhardt
Contributor Author

Actually, this is already possible. If you then try processing the set with missing mappings, they will simply be ignored during the run. Of course, it'd be better if the interface didn't let you process at all in that case.

