@bistline
Contributor

BACKGROUND & CHANGES

This update adds a new raw_counts extraction phase for AnnData files, in which a list of cell names is inserted into MongoDB as the "raw" cells for the file. Usually we extract flat files and then use existing ingest classes to process the data. However, since raw counts data only requires the cell names, extracting a full MTX bundle would be wasted work, so this update reads the cell names directly from the AnnData file. This assumes that there is data in the adata.raw slot, and that the cells represented there match those in adata.obs_names. Future work may be required if users are not using that slot: we may wish to let them specify which slot holds the raw data, or even which index holds the cell names. For now, the slots are hard-coded. This is part of work to enable downstream portal actions, such as automated differential expression calculation for AnnData files.
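For readers unfamiliar with the slots mentioned above, here is a minimal sketch of the idea, not the code in this PR. The function name `extract_raw_cell_names` is hypothetical, and the check that `adata.raw.obs_names` matches `adata.obs_names` is an assumption about how the stated precondition could be enforced. A small stand-in object is used so the example runs without an .h5ad file.

```python
from types import SimpleNamespace


def extract_raw_cell_names(adata):
    """Return the cell names to record as the "raw" cells for a file.

    `adata` is any AnnData-like object exposing `.raw` and `.obs_names`.
    Hypothetical helper for illustration; not the pipeline's actual API.
    """
    if adata.raw is None:
        # The PR assumes raw counts live in adata.raw; fail loudly otherwise.
        raise ValueError("No data in adata.raw; cannot extract raw_counts cells")

    cell_names = [str(name) for name in adata.obs_names]
    raw_names = [str(name) for name in adata.raw.obs_names]

    # Assumed sanity check: cells in the raw slot match adata.obs_names.
    if raw_names != cell_names:
        raise ValueError("Cells in adata.raw do not match adata.obs_names")

    return cell_names


# Stand-in for a loaded AnnData object (names below are made up).
fake_adata = SimpleNamespace(
    obs_names=["AAACATACAACCAC-1", "AAACATTGAGCTAC-1"],
    raw=SimpleNamespace(obs_names=["AAACATACAACCAC-1", "AAACATTGAGCTAC-1"]),
)
print(extract_raw_cell_names(fake_adata))
```

With a real file, the object could be loaded with `anndata.read_h5ad(path, backed="r")`, which avoids reading the full matrix into memory when only the names are needed.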

MANUAL TESTING

  1. Initialize your dev environment as normal
  2. Run the example command for AnnData raw counts extraction:
python ingest_pipeline.py  --study-id 5d276a50421aa9117c982845 --study-file-id 5dd5ae25421aa910a723a337 ingest_anndata --ingest-anndata --anndata-file ../tests/data/anndata/trimmed_compliant_pbmc3K.h5ad  --extract "['raw_counts']"
  3. Ensure no error was thrown:
% echo $?
0
  4. In log.txt, look for the following message:
2024-08-21T11:42:14-0400 ingest_files INFO:Extracted 1 DataArray for 5dd5ae25421aa910a723a337:h5ad_frag.matrix.raw.mtx.gz Cells
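The log check in the last step can also be scripted. The sketch below is an illustration only: `find_extraction_message` is a hypothetical helper, and the regular expression assumes the log line always follows the format of the sample message above.

```python
import re


def find_extraction_message(log_text, study_file_id):
    """Search log text for the raw_counts extraction message.

    Returns (number_of_arrays, matrix_name) for the first match, or None.
    The pattern is inferred from the single sample line in this PR.
    """
    pattern = re.compile(
        r"INFO:Extracted (\d+) DataArray for "
        + re.escape(study_file_id)
        + r":(\S+) Cells"
    )
    match = pattern.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


sample = (
    "2024-08-21T11:42:14-0400 ingest_files INFO:Extracted 1 DataArray for "
    "5dd5ae25421aa910a723a337:h5ad_frag.matrix.raw.mtx.gz Cells"
)
print(find_extraction_message(sample, "5dd5ae25421aa910a723a337"))
# prints (1, 'h5ad_frag.matrix.raw.mtx.gz')
```

In practice, `log_text` would come from reading log.txt after the command in step 2 finishes.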

@bistline bistline requested review from eweitz and jlchang August 21, 2024 15:57
@codecov
codecov bot commented Aug 21, 2024

Codecov Report

Attention: Patch coverage is 78.26087% with 5 lines in your changes missing coverage. Please review.

Project coverage is 75.33%. Comparing base (17d9b8f) to head (081e4dc).
Report is 7 commits behind head on development.

| Files | Patch % | Lines |
|---|---|---|
| ingest/anndata_.py | 78.94% | 4 Missing ⚠️ |
| ingest/ingest_pipeline.py | 50.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@               Coverage Diff               @@
##           development     #359      +/-   ##
===============================================
+ Coverage        75.29%   75.33%   +0.03%     
===============================================
  Files               29       29              
  Lines             4279     4297      +18     
===============================================
+ Hits              3222     3237      +15     
- Misses            1057     1060       +3     

| Files | Coverage Δ |
|---|---|
| ingest/expression_files/expression_files.py | 88.53% <100.00%> (+0.14%) ⬆️ |
| ingest/ingest_pipeline.py | 56.43% <50.00%> (-0.04%) ⬇️ |
| ingest/anndata_.py | 87.12% <78.94%> (-0.17%) ⬇️ |

Contributor

@jlchang jlchang left a comment

Code looks good and manual tests behave as described.

Member

@eweitz eweitz left a comment

Code looks good! Nice follow-on for broadinstitute/single_cell_portal_core#2113.

I suggest some trivial readability refinements, no blockers.

Co-authored-by: Eric Weitz <eweitz@broadinstitute.org>
4 participants