Conversation
'max_age_expected': input_dataset['max_age_expected'], | ||
} | ||
|
||
if 'capped_size' in input_dataset and input_dataset['capped_size']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All looks good! Comment this up and I shall merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's definitely not possible that input_dataset might also not have a key/value pair for bearer_token/realtime/published/max_age_expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In relation to the second comment, the code here https://github.com/alphagov/stagecraft/blob/master/stagecraft/apps/datasets/models/data_set.py#L214 shows that they will exist in a dict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But couldn't any of max_age_expected be None (and so null in the json) and cause the same problem with the schema as capped_size? It looks like it could here:
https://github.com/alphagov/stagecraft/blob/master/stagecraft/apps/datasets/models/data_set.py#L168
As it's the only one of these that can be null apart from capped_size. A user would have to override the default though which I'm not sure it possible - if django detects it's None it will just put it back to the default?
I'm going to merge but putting this here as a bit of history.
A lot of the properties that are received when querying for the input data set are either not wanted or will break the schema validation when POSTd as they are not meant to be part of the body. We would not want a derivative data set to share the same auto_ids as it will have different structure and the transforms should be generating their own specific `_id` fields. We only really want the following fields to be the same: - token: as someone might expect to be able to fiddle with derivative data sets - realtime: if the incoming data is realtime, the transformed data will be. This also relates to `capped_size` which should be configured to tell mongo how much of the realtime data to keep - max_age_expected: the transformed data should be as up to date as the input data set, so if there are rules of how old it should be then that should mirror to the derivative - published: if the input is published we would want the derivative to be too `capped_size` is a little different than the others as Stagecraft will reject a data set with this set to None, as if the key appears in the POSTd JSON it needs to be a string. We are only interested if it is set to a value so it gets ignored otherwise.
The collectors allow you to merge static data into every record sent through to Backdrop, which is very useful for flagging up different sources of data in a single data set. We gain similar value from the application of tags in transforms as we want a single data set tracking the completion rate of two different sources (old and new forms). We would like to be able to compare them and leave them side by side in a single data set. This has been written so that it is applied post the transformation of the data, regardless of what type of transform was executed. This push for commonality in utility options is something that we have learnt is important from the drift in config and impl in the collectors
dec848c
to
401dd56
Compare
Feature/transform improvements
A couple of improvements to transforms based on some work done with @zilnhoj as the Narwhal yesterday. Details about the individual changes are in the commit messages