Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Improve consistency of timestamp usage #690

Closed
wants to merge 6 commits into from

Conversation

david4096
Copy link
Member

@david4096 david4096 commented Aug 17, 2016

Timestamps are available across the API and this PR replaces the mixture of timestamp implementations by settling on ISO 8601. Created and updated fields were added to the dataset, feature, and feature set messages. Field names in assay metadata were renamed to be consistent with the remainder of the data model. Closes #682 and closes #685 and closes #690

Found mention of timestamp in intro.rst and replaced with newer definition. Thanks @mbaudis

@mbaudis
Copy link
Member

mbaudis commented Aug 17, 2016

+1

Great & fast response - thanks @kozbo & @david4096!

@ejacox
Copy link
Contributor

ejacox commented Jan 12, 2017

Are timestamps necessary for Features? How would they differ from the FeatureSet timestamps?

Similarly, do individual variants need timestamps?

What is the difference between the timestamps in Datasets and those in FeatureSets?

@david4096
Copy link
Member Author

No, but they are always optional.

The BRCA Exchange has recently put in some effort to exploring how to satisfy the FDA revision requirements for data used for medical purposes. The idea is that the date accessed becomes important when the annotations are updated so often. I think that this timestamp model fits the most basic use case of a client being able to determine if a variant has changed from last access.

This does not completely satisfy the needs of reliable replication, but it is a step in the right direction.

The reference server's implementation would not have variants modified without modifying a variant set, but the BRCA Exchange's implementation might. The same could be true of GENCODE feature sets, wherein I will be able to see if any new featuresets have been added newer than the ones I have already downloaded.

An implementor doesn't need to return timestamps, but it needs to be a possible field. I would assume that if a nested document didn't have a timestamp that it inherited from its ancestor. Perhaps stating this clearly in the documentation would be helpful. I can't construe a situation where the reference server ought to return created timestamps for feature messages, but I can imagine other implementations that might.

@mbaudis
Copy link
Member

mbaudis commented Jan 12, 2017

An implementor doesn't need to return timestamps ... needs to be a possible field

Exactly. We cannot force complex data scenarios, but have to enable their implementation.

@ejacox
Copy link
Contributor

ejacox commented Jan 13, 2017

OK. Thanks. I would be careful with the its always optional argument. Unnecessary or poorly defined fields could become a pain to maintain and also affect usability. Adding fields can be easier than deleting them.

@ejacox
Copy link
Contributor

ejacox commented Jan 13, 2017

The comments for timestamps need to be rewritten to define what the timestamp is for. The current comments are too vague. Does creation time refer to when the message was added to the server? Why would anyone care when the data was loaded? If not, is there existing data with timestamps or are we being wishful? Why do all the comments say record? Should this be message?

The current server uses very few, if any, of the timestamps. Some are just set to the current time. The input methods seem to be an afterthought. Are timestamps required in any cases? Has there been any discussion about enforcing those? If not, why is this a field and not just metadata?

These fields might optional in the message. In the server, they will take up disk space. Do we want to add them to the server if no one is using them?

@ejacox
Copy link
Contributor

ejacox commented Jan 13, 2017

Record seems to apply to biosample or individual. I don't think it applies to other things, like datasets. It should be more semantically clear.

@ejacox
Copy link
Contributor

ejacox commented Jan 25, 2017

We are proposing to put created and updated into a metadata message. That way, any changes are just in one place, rather than in every message. Additionally, the description needs to be better. I propose:
created = "The time the underlying data was produced (i.e. when the experiment or analysis was run)."
updated = "The time the underlying data changed. This time is independent of when the data was loaded."

These fields describe the data. Any dates associated with determining what data was available on a particular server at a particular time would be separate.

@mbaudis
Copy link
Member

mbaudis commented Jan 25, 2017

@ejacox The specification "underlying data" is a good change, to separate from server timestamps. Apart from that, any record should have these as optional; however, the API may be designed as to evaluate records for their overall latest change (but that wouldn't make sense for creation ... anyway confuses things, and I don't really see the problem?).

@mbaudis
Copy link
Member

mbaudis commented Jan 25, 2017

Modification: Created, Updated don't have to be part of Variant (but Variantset, Callset). The comments were not specific enough to make this clear.

@andrewjesaitis
Copy link
Member

+1 to @ejacox that create/update should refer to the data.
+1 to @mbaudis that they should be optional

My only remaining question is if there really a practical differentiation between created and updated at the data level. I mean if the analysis/alignment/etc is rerun if that really an update or is it just all new data? But, (to argue with myself) in the case of phenotype data, or sample attribute collection, I could see the need to differentiate between creation and update (eg we mislabeled all the samples and needed to correct the error). So I'd say to be consistent across the api keep both everywhere and make them optional.

I think it's important to keep in mind that the schema is only defining what should come back in the message and not how the data is stored.

Timestamps are available across the API and this PR replaces the mixture of timestamp implementations by settling on ISO 8601. Created and updated fields were added to the dataset, feature, and feature set messages. Field names in assay metadata were renamed to be consistent with the remainder of the data model. Closes ga4gh#682 and closes ga4gh#685

Found mention of timestamp in intro.rst and replaced with newer definition
rename VariantSet.metadata to VariantSet.variant_set_metadata
@david4096
Copy link
Member Author

I've placed the created and updated fields in their own message in common.proto and moved that throughout the API.

The VariantSet.metadata message was renamed to VariantSet.variant_set_metadata because of the name collision.

@ejacox
Copy link
Contributor

ejacox commented Feb 2, 2017

+1

@david4096
Copy link
Member Author

This PR is on hold while we review how best to approach timestamps in the API. #839

@kozbo kozbo removed this from the 2017-01 v0.6.0 stable release milestone Mar 8, 2017
@reece reece closed this Oct 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent formats of time stamps inconsistent names of time stamps
6 participants