Public data set export #173

jstone-dev · 2024-04-15T05:23:06Z

No description provided.

…ts.py. (#91) These will be used in the script that exports public data.

…SV strings. (#91) By refraining from holding all variants in memory, we improve the speed of all variant CSV generation. This change is particularly necessary for the full database dump because otherwise, for certain large score sets, the SQLAlchemy Variant model objects fill up available memory and cause the application to be killed. Notice that, in variant CSV exports, we now order variants numerically by their URN fragment. The code for this is PostgreSQL-specific and slows the query; we could add a "variant number" column for sorting purposes.
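A minimal sketch of the two ideas in this commit message, not the code in this PR: CSV lines are produced one at a time from an iterator instead of from a list held in memory, and variants are ordered by the integer after `#` in their URN. The URN layout in the example and the names `variant_number` and `iter_csv_chunks` are assumptions for illustration; the PR itself does the ordering in PostgreSQL.

```python
import csv
import io
from typing import Iterable, Iterator, Mapping, Optional, Sequence


def variant_number(urn: str) -> int:
    """Return the integer fragment after the final '#' of a variant URN (assumed format)."""
    return int(urn.rsplit("#", 1)[1])


def iter_csv_chunks(
    rows: Iterable[Mapping[str, Optional[object]]],
    columns: Sequence[str],
    na_rep: str = "NA",
) -> Iterator[str]:
    """Yield the header line, then one CSV line per row, without buffering all rows."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    yield buffer.getvalue()
    for row in rows:
        # Reuse one small buffer so memory use does not grow with the row count.
        buffer.seek(0)
        buffer.truncate(0)
        # Missing or null fields are written as na_rep instead of raising.
        writer.writerow([na_rep if row.get(c) is None else row[c] for c in columns])
        yield buffer.getvalue()


# Hypothetical URNs: plain string sorting would put "#10" before "#2".
urns = ["urn:example:1#10", "urn:example:1#2", "urn:example:1#1"]
ordered = sorted(urns, key=variant_number)
rows = ({"urn": u, "score": None if u.endswith("#2") else 0.5} for u in ordered)
csv_text = "".join(iter_csv_chunks(rows, ["urn", "score"]))
```

In a real export each yielded chunk would be written straight to the response or the output file rather than joined into one string.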

ashsny · 2024-04-22T21:22:18Z

src/mavedb/lib/script_environment.py

+    logging.basicConfig()
+
+    # Un-comment this line to log all database queries:
+    logging.getLogger("__main__").setLevel(logging.INFO)


Is the `# Un-comment this line` comment referring to the line it immediately precedes, or to some other line? In which logging configuration is this code intended to be submitted?
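One possible reading of the snippet, sketched under the assumption that the comment was meant to describe an optional SQLAlchemy query log rather than the `__main__` logger. `sqlalchemy.engine` is the logger name SQLAlchemy uses for statement logging; the function name and the `log_queries` flag are hypothetical.

```python
import logging


def configure_script_logging(log_queries: bool = False) -> None:
    """Set up logging for a command-line script (hypothetical helper)."""
    logging.basicConfig()
    # Always show the script's own progress messages.
    logging.getLogger("__main__").setLevel(logging.INFO)
    if log_queries:
        # Opt in to logging every SQL statement that SQLAlchemy emits.
        logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)


configure_script_logging(log_queries=True)
```

Keeping the query log behind a flag avoids the ambiguity of a comment that says "un-comment" above a line that is already active.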

ashsny · 2024-04-22T21:24:15Z

src/mavedb/scripts/export_public_data.py

+    .order_by(ExperimentSet.urn)
+)
+
+# TODO To support very large data sets, we may want to use custom code for JSON-encoding an iterator.


All TODOs should reference a GitHub issue.
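For context on the TODO itself, a minimal sketch of JSON-encoding an iterator without materializing it, using only the standard library. `iter_json_array` is a hypothetical name, and this is not the approach the script takes in this PR.

```python
import json
from typing import Any, Iterable, Iterator


def iter_json_array(items: Iterable[Any]) -> Iterator[str]:
    """Yield a JSON array piece by piece, encoding one item at a time."""
    yield "["
    first = True
    for item in items:
        if not first:
            yield ","
        first = False
        yield json.dumps(item)
    yield "]"


# The generator is consumed lazily, so only one record is encoded at a time.
records = ({"urn": f"urn:example:{i}"} for i in range(3))
encoded = "".join(iter_json_array(records))
```

For a very large export, each yielded piece would be written to the output file as it is produced instead of being joined in memory.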

ashsny · 2024-04-22T21:34:29Z

src/mavedb/scripts/export_public_data.py

+
+zip_file_name = "mavedb-dump.zip"
+
+logger.info(f"Exporting public data set metadata to {zip_file_name}/main.json")


It was not immediately clear to me that this contained experiment set, experiment, and score set metadata. You may want to add a comment to that effect or modify the logging statement to be more explicit.

Do lines 17-18 cover this?

ashsny · 2024-04-22T21:35:35Z

src/mavedb/view_models/score_set.py

+    target_genes: Sequence[TargetGene]
+    private: bool
+    processing_state: Optional[ProcessingState]
+    processing_errors: Optional[dict]


Not sure these should be included; see above comment.

I didn't like this much either, but my considerations were:

Consistency with the API seemed like a useful goal to avoid confusing users.

And there might be some future use case for a complete export in the same format.

I'll defer to your judgment, though.

ashsny · 2024-04-23T02:47:28Z

src/mavedb/scripts/export_public_data.py

+    .where(ExperimentSet.published_date.is_not(None))
+    .options(
+        lazyload(ExperimentSet.experiments.and_(Experiment.published_date.is_not(None))).options(
+            lazyload(Experiment.score_sets.and_(ScoreSet.published_date.is_not(None)))


I see you made processing state part of the export. I think we should also not export unprocessed score sets: it might violate user expectations regarding the completeness of score sets in the export (they won't have corresponding scores files, for example).

My assumption is that unprocessed score sets aren't published. Is that valid? Maybe it's possible that a published score set gets updated with new data, but I don't think we plan to allow this except by administrative intervention.

I considered excluding processing state from the JSON metadata, but figured I'd leave it in for consistency with the API.
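The stricter filter proposed in this thread can be sketched in plain Python as follows. `ScoreSetRecord` is a hypothetical stand-in for the real model, and the `"success"` state value is an assumption; in the script itself the condition would belong in the SQLAlchemy query.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass
class ScoreSetRecord:
    """Hypothetical stand-in for a score set row; not the MaveDB model."""

    urn: str
    published_date: Optional[date]
    processing_state: Optional[str]


def exportable(score_sets: Iterable[ScoreSetRecord]) -> List[ScoreSetRecord]:
    """Keep score sets that are published and whose processing succeeded."""
    return [
        s
        for s in score_sets
        if s.published_date is not None and s.processing_state == "success"
    ]


candidates = [
    ScoreSetRecord("urn:example:a", date(2024, 1, 1), "success"),
    ScoreSetRecord("urn:example:b", date(2024, 1, 1), "processing"),
    ScoreSetRecord("urn:example:c", None, "success"),
]
kept = [s.urn for s in exportable(candidates)]
```

Checking both conditions explicitly means the export does not rely on the assumption that every published score set has finished processing.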

…se in the archive.

This issue is not new, but mypy now catches the possible runtime error. The issue may already have been handled in some other code branch. Fields should never be missing since variant data should be consistent with a score set's list of columns. But it's a good idea to handle this case anyway.

jstone-dev · 2024-04-29T20:47:45Z

Changes have been incorporated to implement our review discussion, including the decision to include a CC0 license and exclude published data not licensed under CC0.

…CSV directory. (#91)

jstone-dev · 2024-05-05T18:57:21Z

I've incorporated Alan's last comments:

- The ZIP archive's variants directory has been renamed csv.
- Count CSV files are only generated for score sets with count columns.
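A minimal sketch of the archive layout these comments describe, built in memory with the standard library. The `main.json` entry and the `csv` directory come from this PR; the `LICENSE.txt` name, the per-file naming scheme, and the dictionary keys are assumptions for illustration.

```python
import io
import json
import zipfile
from typing import List


def build_dump(metadata: dict, score_sets: List[dict], license_text: str) -> bytes:
    """Write metadata, a license file, and per-score-set CSVs into an in-memory ZIP."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("main.json", json.dumps(metadata))
        archive.writestr("LICENSE.txt", license_text)
        for score_set in score_sets:
            name = score_set["name"]
            archive.writestr(f"csv/{name}.scores.csv", score_set["scores_csv"])
            # Skip the counts file when the score set has no count columns.
            if score_set.get("count_columns"):
                archive.writestr(f"csv/{name}.counts.csv", score_set["counts_csv"])
    return out.getvalue()


dump_bytes = build_dump(
    {"title": "example"},
    [
        {"name": "a", "scores_csv": "score\n0.5\n", "count_columns": ["c0"], "counts_csv": "c0\n3\n"},
        {"name": "b", "scores_csv": "score\n0.1\n", "count_columns": []},
    ],
    "CC0 license text goes here",
)
with zipfile.ZipFile(io.BytesIO(dump_bytes)) as archive:
    names = archive.namelist()
```

Here score set `b` has no count columns, so the archive contains only its scores file.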

jstone-dev added 6 commits April 18, 2024 12:04

View models for public data export (#91)

fc4e69c

Environment setup for scripts other than the FastAPI server script (#91)

40a8d9b

Move variant CSV generators from the score set router to lib/score_se…

a5ea8a5

…ts.py. (#91) These will be used in the script that exports public data.

Script for dumping all published data sets (#91)

0ff517b

Data dump format changes (#91)

652595f

jstone-dev force-pushed the jstone-uw/public-data-export branch from f1946c8 to 652595f Compare April 18, 2024 19:38

ashsny reviewed Apr 22, 2024

ashsny reviewed Apr 23, 2024

jstone-dev added 4 commits April 29, 2024 12:06

Added GitHub issue link.

93a5bf2

Filter out score sets not licensed under CC0, and include a CC0 licen…

149cb79

…se in the archive.

Add a timestamp to the data dump.

7345853

jstone-dev changed the base branch from main to release-2024.1.0 April 29, 2024 20:48

jstone-dev marked this pull request as ready for review April 29, 2024 20:48

jstone-dev added 2 commits April 30, 2024 00:33

Convert query result iterables to lists before generating export files.

7a7c0a6

Omit count CSV files when there are no count columns, and rename the …

2233988

…CSV directory. (#91)

Clear mypy error. (#91)

e54309f

ashsny merged commit 9de7f22 into release-2024.1.0 May 7, 2024

jstone-dev deleted the jstone-uw/public-data-export branch May 15, 2024 20:13

3 participants