[python] Geneformer updates for July 2024 LTS #961

Merged
mlin merged 52 commits into main from mlin/geneformer-healthomics on Jul 5, 2024

mlin (Contributor) commented Jan 31, 2024

Sorry for the heavyweight PR -- numerous changes to our Geneformer API and workflows accumulated for the new LTS:

  • Run tokenization/fine-tuning/forward-pass WDLs on AWS HealthOmics instead of Batch
  • Update the upstream Geneformer version
    • Add new special_token flag
    • Use a gene ID consolidation mapping, with modifications to the sparse math to implement it
    • Add WDL inputs for slight variations on embeddings we want to try
  • Update ontologies for cell subclasses
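
To illustrate the gene ID consolidation item above: the description does not show the implementation, so the following is only a minimal sketch of one way to fold several source gene columns into a single consolidated column of a sparse cells-by-genes count matrix. The function name `consolidate_csr_columns`, the `col_map` argument, and the plain-list CSR layout are all assumptions made for illustration, not names from this PR. A real implementation would more likely use a sparse matrix library.

```python
from collections import defaultdict


def consolidate_csr_columns(data, indices, indptr, col_map, n_new_cols):
    """Remap CSR column indices through col_map, summing collisions per row.

    Hypothetical helper (not from the PR). data/indices/indptr are the three
    CSR arrays of a cells x genes count matrix, given as plain lists.
    col_map[j] is the consolidated column for original column j, or None to
    drop that gene. Returns new (data, indices, indptr) with column indices
    sorted within each row.
    """
    new_data, new_indices, new_indptr = [], [], [0]
    for row in range(len(indptr) - 1):
        sums = defaultdict(float)
        # Walk the stored entries of this row and accumulate by target column.
        for k in range(indptr[row], indptr[row + 1]):
            target = col_map[indices[k]]
            if target is None:
                continue  # gene has no consolidated ID; drop its counts
            if not 0 <= target < n_new_cols:
                raise ValueError(f"column {target} out of range")
            sums[target] += data[k]
        for col in sorted(sums):
            new_indices.append(col)
            new_data.append(sums[col])
        new_indptr.append(len(new_indices))
    return new_data, new_indices, new_indptr


# Two cells, four source genes; source columns 0 and 1 consolidate to column 0.
data, indices, indptr = [3, 2, 1, 5], [0, 1, 3, 2], [0, 3, 4]
col_map = [0, 0, 1, 2]
result = consolidate_csr_columns(data, indices, indptr, col_map, n_new_cols=3)
```

In the example, the first cell's counts for source columns 0 and 1 (3 and 2) are summed into consolidated column 0, while every other entry is only renumbered.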

Noting loose ends for potential future cleanup:

  • Use published version of new Geneformer model, once available
  • Replace legacy cell subclass mapper with cellxgene-ontology-guide

codecov bot commented Jan 31, 2024

Codecov Report

Attention: Patch coverage is 72.54902% with 14 lines in your changes missing coverage. Please review.

Project coverage is 91.12%. Comparing base (f775282) to head (e12a102).
Report is 2 commits behind head on main.

Files                                                   Patch %   Lines
...xperimental/ml/huggingface/geneformer_tokenizer.py   72.34%    13 Missing ⚠️
...sts/experimental/ml/huggingface/test_geneformer.py   66.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #961      +/-   ##
==========================================
- Coverage   91.19%   91.12%   -0.07%     
==========================================
  Files          77       79       +2     
  Lines        5971     6173     +202     
==========================================
+ Hits         5445     5625     +180     
- Misses        526      548      +22     
Flag Coverage Δ
unittests 91.12% <72.54%> (-0.07%) ⬇️

mlin changed the title from "[python] run Geneformer WDLs on HealthOmics managed service instead of AWS Batch" to "[python] Geneformer updates for July 2024 LTS" on Jul 3, 2024
mlin marked this pull request as ready for review on July 3, 2024 08:06
mlin merged commit 1b24d78 into main on Jul 5, 2024
17 checks passed
mlin deleted the mlin/geneformer-healthomics branch on July 5, 2024 22:24