
[python] dataloader optimization (picking up 1169) #1224

Merged
merged 21 commits into from
Jul 8, 2024

Conversation

ivirshup
Collaborator

@ivirshup ivirshup commented Jul 3, 2024

Building on top of #1169

This PR removes the new `method` argument (which was undocumented and had no tests yet) and instead uses `return_sparse_X` to control conversion. Previously, `return_sparse_X` controlled whether densification happened in a wrapping iterator; now a dense array is created directly, without making a CSR intermediate.
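Conceptually (the function and variable names below are illustrative, not the PR's actual code), the change replaces a COO → CSR → dense conversion with a direct scatter into a preallocated dense array:

```python
import numpy as np

def densify_chunk(data, rows, cols, shape):
    """Scatter COO-encoded values straight into a dense array.

    Old path (conceptually): build sparse.coo_matrix((data, (rows, cols)),
    shape=shape).tocsr() and then call .toarray(). The direct scatter below
    skips the CSR intermediate allocation entirely.
    """
    out = np.zeros(shape, dtype=data.dtype)
    out[rows, cols] = data
    return out
```

This avoids materializing both the CSR index arrays and a second dense copy for each batch.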

@ivirshup ivirshup changed the title Ivirshup/optimize dataloader [python] dataloader optimization (picking up 1169) Jul 4, 2024

codecov bot commented Jul 4, 2024

Codecov Report

Attention: Patch coverage is 84.84848% with 10 lines in your changes missing coverage. Please review.

Project coverage is 91.15%. Comparing base (f775282) to head (629af82).
Report is 2 commits behind head on main.

Files Patch % Lines
...us/src/cellxgene_census/experimental/ml/pytorch.py 84.84% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1224      +/-   ##
==========================================
- Coverage   91.19%   91.15%   -0.04%     
==========================================
  Files          77       79       +2     
  Lines        5971     6183     +212     
==========================================
+ Hits         5445     5636     +191     
- Misses        526      547      +21     
Flag Coverage Δ
unittests 91.15% <84.84%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -36,6 +37,10 @@
The Tensors are rank 1 if ``batch_size`` is 1, otherwise the Tensors are rank 2."""


# "Chunk" of X data, returned by each `Method` above
ChunkX = Union[npt.NDArray[Any], sparse.csr_matrix]
Contributor

nit (optional): why npt.NDArray[Any] and not npt.NDArray[np.number] (or something even more specific?)

Contributor

Also, is the API signature really saying that it could return any sparse type? If it is actually constrained to a more specific set of sub-types (e.g, csr_matrix), it would be useful to further specify this type alias (even if just for the static type checking...)

Collaborator Author

Used np.number, but mypy also wants explicit generics, so I went with np.number[Any].

Not totally sure I'm understanding the second part, since the alias here is already csr_matrix?

Collaborator Author

Hmm. This is opening up a big can of typing worms. There wouldn't be as much of an issue once we drop support for python 3.8.

I think I will go back to Any until we move to SPEC-0
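For reference, the narrower alias discussed above would look roughly like this (a sketch of the option considered, not necessarily the merged code; `np.number[Any]` is the explicit generic parameter mypy asks for):

```python
from typing import Any, Union

import numpy as np
import numpy.typing as npt
from scipy import sparse

# Narrower alternative to npt.NDArray[Any]: constrain the array dtype to
# numeric types. mypy requires the explicit generic, hence np.number[Any].
ChunkX = Union[npt.NDArray[np.number[Any]], sparse.csr_matrix]
```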

X_batch, _ = next(scipy_iter)

if not self.return_sparse_X:
batch_iter = _tables_to_np(blockwise_iter.tables(), shape=(obs_batch.shape[0], len(self.var_joinids)))
Contributor

IMHO, this motivates a dense option on the underlying blockwise iterator (i.e, in addition to the Table and scipy options). I don't think we should hold up this PR, but ideally just file a feature request GH issue in the TileDB-SOMA repo.

@ryan-williams - thoughts?

Contributor

Makes sense to me; we'd basically just move _tables_to_np into TileDB-SOMA?
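For context, a conversion of this kind might take roughly the following shape (hypothetical code, not the repo's `_tables_to_np` implementation): the SOMA COO coordinates are global joinids, so each chunk's coordinates are first remapped to local row/column offsets before being scattered into the dense batch:

```python
import numpy as np

def coo_chunk_to_dense(dim0, dim1, data, row_joinids, col_joinids):
    """Scatter one COO chunk into a dense batch.

    row_joinids / col_joinids are assumed sorted, so np.searchsorted maps
    each global joinid coordinate to its local offset in the output array.
    """
    out = np.zeros((len(row_joinids), len(col_joinids)), dtype=np.float32)
    local_rows = np.searchsorted(row_joinids, dim0)
    local_cols = np.searchsorted(col_joinids, dim1)
    out[local_rows, local_cols] = data
    return out
```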

Contributor

@bkmartinjr bkmartinjr left a comment

Overall looks good - no major concerns. A few minor points of clarity and refinement noted inline.

In at least one case, there is a redundant data copy where zero-copy would suffice. I recommend tracing through the code to ensure there are no others (perhaps after this PR lands?), as reducing memory pressure is important in this use case.

@ivirshup ivirshup merged commit 473ba97 into main Jul 8, 2024
14 checks passed
@ivirshup ivirshup deleted the ivirshup/optimize-dataloader branch July 8, 2024 23:15