Public data set export #173
Conversation
…ts.py. (#91) These will be used in the script that exports public data.
…SV strings. (#91) By refraining from holding all variants in memory, we improve the speed of all variant CSV generation. This change is particularly necessary for the full database dump because otherwise, for certain large score sets, the SQLAlchemy Variant model objects fill up available memory and cause the application to be killed. Notice that, in variant CSV exports, we now order variants numerically by their URN fragment. The code for this is PostgreSQL-specific and slows the query; we could add a "variant number" column for sorting purposes.
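The streaming idea described in this commit message can be sketched with stdlib pieces alone. This is a hypothetical illustration, not MaveDB's code: the column names and row source stand in for the SQLAlchemy query, and only one row's worth of CSV text is ever buffered.

```python
import csv
import io
from typing import Iterable, Iterator


def stream_csv(column_names: list[str], rows: Iterable[dict]) -> Iterator[str]:
    """Yield CSV text chunk by chunk instead of building one big string.

    Only one row is held in memory at a time, so large result sets
    (e.g. millions of variants) never fill available memory.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    yield buffer.getvalue()
    for row in rows:
        buffer.seek(0)
        buffer.truncate(0)
        writer.writerow(row)
        yield buffer.getvalue()


# Hypothetical variants; a real exporter would iterate a server-side cursor
# (e.g. a SQLAlchemy result with yield_per) rather than a generator literal.
variants = ({"urn": f"urn:var#{n}", "score": n / 10} for n in range(3))
chunks = list(stream_csv(["urn", "score"], variants))
```

A caller can feed these chunks straight into a streaming HTTP response or a file, which is what makes the approach suitable for the full database dump.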
Force-pushed f1946c8 to 652595f
```python
logging.basicConfig()

# Un-comment this line to log all database queries:
logging.getLogger("__main__").setLevel(logging.INFO)
```
Is the `# Un-comment this line` comment referring to the line it immediately precedes or to the next one? And in which logging config is this code intended to be submitted?
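For reference, SQLAlchemy's documented way to log all queries is to raise the level of the `sqlalchemy.engine` logger rather than `__main__`, which may be the source of the confusion here. A minimal sketch:

```python
import logging

logging.basicConfig()

# Per SQLAlchemy's logging documentation, INFO on the "sqlalchemy.engine"
# logger emits every SQL statement the engine executes; the "__main__"
# logger only affects this script's own log messages.
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)
```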
```python
    .order_by(ExperimentSet.urn)
)

# TODO To support very large data sets, we may want to use custom code for JSON-encoding an iterator.
```
All TODOs should reference a GitHub issue.
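One shape the custom code mentioned in that TODO could take (a sketch under assumed requirements, not MaveDB's implementation): emit the JSON array element by element so the iterator is never materialized as a list.

```python
import json
from typing import Iterable, Iterator


def iter_json_array(items: Iterable) -> Iterator[str]:
    """Yield pieces of a JSON array without building the whole list in memory."""
    yield "["
    first = True
    for item in items:
        if not first:
            yield ","
        first = False
        # Only one element's JSON text is buffered at a time.
        yield json.dumps(item)
    yield "]"


# Hypothetical records; a real export would pass the ORM query's iterator.
encoded = "".join(iter_json_array({"urn": f"urn:{n}"} for n in range(2)))
```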
```python
zip_file_name = "mavedb-dump.zip"

logger.info(f"Exporting public data set metadata to {zip_file_name}/main.json")
```
It was not immediately clear to me that this contained experiment set, experiment, and score set metadata. We may want to add a comment to that effect or modify the logging statement to be more explicit.
Do lines 17-18 cover this?
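As an illustration of the step under discussion, writing a `main.json` entry into the archive with the stdlib `zipfile` module could look like the sketch below; the file names mirror the diff, but the metadata payload is a made-up placeholder.

```python
import io
import json
import zipfile

zip_file_name = "mavedb-dump.zip"

# Placeholder payload; the real export serializes experiment set,
# experiment, and score set metadata here.
metadata = {"experimentSets": []}

# Written to memory here for demonstration; the script writes to disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("main.json", json.dumps(metadata))

# Re-open the archive to confirm the entry round-trips.
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
```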
```python
target_genes: Sequence[TargetGene]
private: bool
processing_state: Optional[ProcessingState]
processing_errors: Optional[dict]
```
I'm not sure these should be included; see the comment above.
I didn't like this much either, but my considerations were:
- Consistency with the API seemed like a useful goal to avoid confusing users.
- And there might be some future use case for a complete export in the same format.
I'll defer to your judgment, though.
```python
.where(ExperimentSet.published_date.is_not(None))
.options(
    lazyload(ExperimentSet.experiments.and_(Experiment.published_date.is_not(None))).options(
        lazyload(Experiment.score_sets.and_(ScoreSet.published_date.is_not(None)))
```
I see you made processing state part of the export. I think we should also not export unprocessed score sets; it might violate user expectations regarding the completeness of score sets in the export (they won't have corresponding scores files, for example).
My assumption is that unprocessed score sets aren't published. Is that valid? Maybe it's possible that a published score set gets updated with new data, but I don't think we plan to allow this except by administrative intervention.
I considered excluding processing state from the JSON metadata, but figured I'd leave it in for consistency with the API.
…se in the archive.
This issue is not new, but mypy now catches the possible runtime error. The issue may already have been handled in some other code branch. Fields should never be missing since variant data should be consistent with a score set's list of columns. But it's a good idea to handle this case anyway.
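A sketch of the defensive handling described above, with invented field and column names: a missing field falls back to an empty cell instead of raising `KeyError`.

```python
from typing import Optional


def row_for_variant(data: dict, columns: list[str]) -> list[Optional[object]]:
    """Build one CSV row, tolerating fields absent from the variant data.

    Variant data should always match the score set's column list, but
    dict.get() keeps a stray inconsistency from crashing the export,
    which is the case mypy flags here.
    """
    return [data.get(column) for column in columns]


# Hypothetical columns/data; "extra_score" is missing from this variant.
row = row_for_variant({"score": 1.5}, ["score", "extra_score"])
```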
Changes have been incorporated to implement our review discussion, including the decision to include a CC0 license and exclude published data not licensed under CC0.
I've incorporated Alan's last comments: