Skip to content

Conversation

@jstone-dev
Copy link
Collaborator

No description provided.

…ts.py. (#91)

These will be used in the script that exports public data.
…SV strings. (#91)

By refraining from holding all variants in memory, we improve the speed of all variant CSV generation. This change is particularly necessary for the full database dump because otherwise, for certain large score sets, the SQLAlchemy Variant model objects fill up available memory and cause the application to be killed.

Notice that, in variant CSV exports, we now order variants numerically by their URN fragment. The code for this is PostgreSQL-specific and slows the query; we could add a "variant number" column for sorting purposes.
@jstone-dev jstone-dev force-pushed the jstone-uw/public-data-export branch from f1946c8 to 652595f Compare April 18, 2024 19:38
logging.basicConfig()

# Un-comment this line to log all database queries:
logging.getLogger("__main__").setLevel(logging.INFO)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the # Un-comment this line comment referring to the line it immediately precedes or the next one? In which logging config is this code intended to be submitted?

.order_by(ExperimentSet.urn)
)

# TODO To support very large data sets, we may want to use custom code for JSON-encoding an iterator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all TODOs should reference a github issue


zip_file_name = "mavedb-dump.zip"

logger.info(f"Exporting public data set metadata to {zip_file_name}/main.json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was not immediately clear to me that this contained experiment set, experiment, and score set metadata. May want to add a comment to that effect for modify the logging statement to be more explicit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do lines 17-18 cover this?

target_genes: Sequence[TargetGene]
private: bool
processing_state: Optional[ProcessingState]
processing_errors: Optional[dict]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure these should be included, see above comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't like this much either, but my considerations were:

  • Consistency with the API seemed like a useful goal to avoid confusing users.
  • And there might be some future use case for a complete export in the same format.

I'll defer to your judgment, though.

.where(ExperimentSet.published_date.is_not(None))
.options(
lazyload(ExperimentSet.experiments.and_(Experiment.published_date.is_not(None))).options(
lazyload(Experiment.score_sets.and_(ScoreSet.published_date.is_not(None)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you made processing state part of the export - I think we should also not export unprocessed score sets. it might violate user expectations re: the completeness of score sets in the export (they won't have corresponding scores files, for example).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption is that unprocessed score sets aren't published. Is that valid? Maybe it's possible that a published score set gets updated with new data, but I don't think we plan to allow this except by administrative intervention.

I considered excluding processing state from the JSON metadata, but figured I'd leave it in for consistency with the API.

This issue is not new, but mypy now catches the possible runtime error. The issue may already have been handled in some other code branch.

Fields should never be missing since variant data should be consistent with a score set's list of columns. But it's a good idea to handle this case anyway.
@jstone-dev
Copy link
Collaborator Author

Changes have been incorporated to implement our review discussion, including the decision to include a CC0 license and exclude published data not licensed under CC0.

@jstone-dev jstone-dev changed the base branch from main to release-2024.1.0 April 29, 2024 20:48
@jstone-dev jstone-dev marked this pull request as ready for review April 29, 2024 20:48
@jstone-dev
Copy link
Collaborator Author

I've incorporated Alan's last comments:

  • The ZIP archive's variants directory has been renamed csv.
  • Count CSV files are only generated for score sets with count columns.

@ashsny ashsny merged commit 9de7f22 into release-2024.1.0 May 7, 2024
@jstone-dev jstone-dev deleted the jstone-uw/public-data-export branch May 15, 2024 20:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants