
[python] dataloader optimization (picking up 1169) #1224

Merged
merged 21 commits into from
Jul 8, 2024

Conversation

ivirshup
Collaborator

@ivirshup ivirshup commented Jul 3, 2024

Building on top of #1169

This PR removes the new `method` argument (which was undocumented and had no tests yet) and instead uses `return_sparse_X` to control conversion. Previously, `return_sparse_X` controlled whether densification happened in a wrapping iterator; now a dense array is created directly, without making a CSR intermediate.
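Conceptually (the function and variable names below are illustrative, not the PR's actual code), the change replaces a COO → CSR → dense conversion with a direct scatter into a preallocated dense array:

```python
import numpy as np

def densify_chunk(data, rows, cols, shape):
    """Scatter COO-encoded values straight into a dense array.

    Old path (conceptually): build sparse.coo_matrix((data, (rows, cols)),
    shape=shape).tocsr() and then call .toarray(). The direct scatter below
    skips the CSR intermediate allocation entirely.
    """
    out = np.zeros(shape, dtype=data.dtype)
    out[rows, cols] = data
    return out
```

This avoids materializing both the CSR index arrays and a second dense copy for each batch.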

@ivirshup ivirshup changed the title Ivirshup/optimize dataloader [python] dataloader optimization (picking up 1169) Jul 4, 2024

codecov bot commented Jul 4, 2024

Codecov Report

Attention: Patch coverage is 84.84848% with 10 lines in your changes missing coverage. Please review.

Project coverage is 91.15%. Comparing base (f775282) to head (629af82).
Report is 2 commits behind head on main.

Files Patch % Lines
...us/src/cellxgene_census/experimental/ml/pytorch.py 84.84% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1224      +/-   ##
==========================================
- Coverage   91.19%   91.15%   -0.04%     
==========================================
  Files          77       79       +2     
  Lines        5971     6183     +212     
==========================================
+ Hits         5445     5636     +191     
- Misses        526      547      +21     
Flag Coverage Δ
unittests 91.15% <84.84%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -36,6 +37,10 @@
The Tensors are rank 1 if ``batch_size`` is 1, otherwise the Tensors are rank 2."""


# "Chunk" of X data, returned by each `Method` above
ChunkX = Union[npt.NDArray[Any], sparse.csr_matrix]
Contributor

nit (optional): why npt.NDArray[Any] and not npt.NDArray[np.number] (or something even more specific?)

Contributor

Also, is the API signature really saying that it could return any sparse type? If it is actually constrained to a more specific set of sub-types (e.g, csr_matrix), it would be useful to further specify this type alias (even if just for the static type checking...)

Collaborator Author

Used np.number, but mypy also wants explicit generics, so I went with np.number[Any].

Not totally sure I'm understanding the second part, since the alias here is already csr_matrix?

Collaborator Author

Hmm. This is opening up a big can of typing worms. There wouldn't be as much of an issue once we drop support for python 3.8.

I think I will go back to Any until we move to SPEC-0
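For reference, the narrower alias discussed above would look roughly like this (a sketch of the option considered, not necessarily the merged code; `np.number[Any]` is the explicit generic parameter mypy asks for):

```python
from typing import Any, Union

import numpy as np
import numpy.typing as npt
from scipy import sparse

# Narrower alternative to npt.NDArray[Any]: constrain the array dtype to
# numeric types. mypy requires the explicit generic, hence np.number[Any].
ChunkX = Union[npt.NDArray[np.number[Any]], sparse.csr_matrix]
```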

X_batch, _ = next(scipy_iter)

if not self.return_sparse_X:
batch_iter = _tables_to_np(blockwise_iter.tables(), shape=(obs_batch.shape[0], len(self.var_joinids)))
Contributor

IMHO, this motivates a dense option on the underlying blockwise iterator (i.e, in addition to the Table and scipy options). I don't think we should hold up this PR, but ideally just file a feature request GH issue in the TileDB-SOMA repo.

@ryan-williams - thoughts?

Contributor

Makes sense to me; we'd basically just move _tables_to_np into TileDB-SOMA?
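For context, a conversion of this kind might take roughly the following shape (hypothetical code, not the repo's `_tables_to_np` implementation): the SOMA COO coordinates are global joinids, so each chunk's coordinates are first remapped to local row/column offsets before being scattered into the dense batch:

```python
import numpy as np

def coo_chunk_to_dense(dim0, dim1, data, row_joinids, col_joinids):
    """Scatter one COO chunk into a dense batch.

    row_joinids / col_joinids are assumed sorted, so np.searchsorted maps
    each global joinid coordinate to its local offset in the output array.
    """
    out = np.zeros((len(row_joinids), len(col_joinids)), dtype=np.float32)
    local_rows = np.searchsorted(row_joinids, dim0)
    local_cols = np.searchsorted(col_joinids, dim1)
    out[local_rows, local_cols] = data
    return out
```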

Contributor

@bkmartinjr bkmartinjr left a comment

Overall looks good - no major concerns. A few minor points of clarity and refinement noted inline.

In at least one case, there is a redundant data copy where zero-copy would suffice. I recommend tracing through the code to ensure there are no others (perhaps after this PR lands?), as reducing memory pressure is important in this use case.

@ivirshup ivirshup merged commit 473ba97 into main Jul 8, 2024
14 checks passed
@ivirshup ivirshup deleted the ivirshup/optimize-dataloader branch July 8, 2024 23:15