Skip to content
This repository has been archived by the owner on Mar 24, 2021. It is now read-only.

Feature/transform improvements #437

Merged
merged 2 commits into from Jul 3, 2015
Merged

Conversation

tombooth
Copy link
Contributor

@tombooth tombooth commented Jul 1, 2015

A couple of improvements to transforms based on some work done with @zilnhoj as the Narwhal yesterday. Details about the individual changes are in the commit messages

'max_age_expected': input_dataset['max_age_expected'],
}

if 'capped_size' in input_dataset and input_dataset['capped_size']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good! Comment this up and I shall merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely not possible that input_dataset might also not have a key/value pair for bearer_token/realtime/published/max_age_expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In relation to the second comment, the code here https://github.com/alphagov/stagecraft/blob/master/stagecraft/apps/datasets/models/data_set.py#L214 shows that they will exist in a dict

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But couldn't any of max_age_expected be None (and so null in the json) and cause the same problem with the schema as capped_size? It looks like it could here:

https://github.com/alphagov/stagecraft/blob/master/stagecraft/apps/datasets/models/data_set.py#L168

As it's the only one of these that can be null apart from capped_size. A user would have to override the default though which I'm not sure it possible - if django detects it's None it will just put it back to the default?

I'm going to merge but putting this here as a bit of history.

A lot of the properties that are received when querying for the input
data set are either not wanted or will break the schema validation when
POSTd as they are not meant to be part of the body.

We would not want a derivative data set to share the same auto_ids as it
will have different structure and the transforms should be generating
their own specific `_id` fields.

We only really want the following fields to be the same:
  - token: as someone might expect to be able to fiddle with derivative
    data sets
  - realtime: if the incoming data is realtime, the transformed data
    will be. This also relates to `capped_size` which should be
    configured to tell mongo how much of the realtime data to keep
  - max_age_expected: the transformed data should be as up to date as
    the input data set, so if there are rules of how old it should be
    then that should mirror to the derivative
  - published: if the input is published we would want the derivative to
    be too

`capped_size` is a little different than the others as Stagecraft will
reject a data set with this set to None, as if the key appears in the
POSTd JSON it needs to be a string. We are only interested if it is set
to a value so it gets ignored otherwise.
The collectors allow you to merge static data into every record sent
through to Backdrop, which is very useful for flagging up different
sources of data in a single data set.

We gain similar value from the application of tags in transforms as we
want a single data set tracking the completion rate of two different
sources (old and new forms). We would like to be able to compare them
and leave them side by side in a single data set.

This has been written so that it is applied post the transformation of
the data, regardless of what type of transform was executed. This push
for commonality in utility options is something that we have learnt is
important from the drift in config and impl in the collectors
@tombooth tombooth force-pushed the feature/transform-improvements branch from dec848c to 401dd56 Compare July 2, 2015 16:24
jcbashdown added a commit that referenced this pull request Jul 3, 2015
@jcbashdown jcbashdown merged commit a0153d7 into master Jul 3, 2015
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants