[python] dataloader optimization (picking up 1169) #1224
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1224 +/- ##
==========================================
- Coverage 91.19% 91.15% -0.04%
==========================================
Files 77 79 +2
Lines 5971 6183 +212
==========================================
+ Hits 5445 5636 +191
- Misses 526 547 +21
Flags with carried forward coverage won't be shown.
@@ -36,6 +37,10 @@
    The Tensors are rank 1 if ``batch_size`` is 1, otherwise the Tensors are rank 2."""

# "Chunk" of X data, returned by each `Method` above
ChunkX = Union[npt.NDArray[Any], sparse.csr_matrix]
nit (optional): why npt.NDArray[Any] and not npt.NDArray[np.number] (or something even more specific)?
Also, is the API signature really saying that it could return any sparse type? If it is actually constrained to a more specific set of sub-types (e.g., csr_matrix), it would be useful to further specify this type alias (even if just for the static type checking...)
Used np.number, but mypy also wants explicit generics, so I went with np.number[Any].
Not totally sure I'm understanding the second part, since the alias here is already csr_matrix?
Hmm. This is opening up a big can of typing worms. There wouldn't be as much of an issue once we drop support for Python 3.8. I think I will go back to Any until we move to SPEC-0.
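The trade-off discussed above can be made concrete. A minimal sketch (the `densify` helper is hypothetical, not from the PR) showing both the broad alias and the narrowed alternative:

```python
from typing import Any, Union

import numpy as np
import numpy.typing as npt
from scipy import sparse

# Broad alias, as in the diff above: any element dtype is allowed.
ChunkX = Union[npt.NDArray[Any], sparse.csr_matrix]

# Narrower alternative discussed in the thread; np.number is itself
# generic, so mypy requires an explicit parameter: np.number[Any].
# (Runtime subscription of np.number needs a reasonably recent NumPy.)
NumericChunkX = Union[npt.NDArray[np.number[Any]], sparse.csr_matrix]


def densify(chunk: ChunkX) -> npt.NDArray[Any]:
    """Hypothetical helper: return a dense ndarray for either variant."""
    # csr_matrix.toarray() materializes the sparse data as a dense array.
    return chunk.toarray() if sparse.issparse(chunk) else chunk
```

The narrower alias buys stricter static checking at call sites, at the cost of fighting mypy on older Python/NumPy combinations, which matches the decision above to stay with `Any` for now.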
api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py (outdated, resolved)
X_batch, _ = next(scipy_iter)

if not self.return_sparse_X:
    batch_iter = _tables_to_np(blockwise_iter.tables(), shape=(obs_batch.shape[0], len(self.var_joinids)))
IMHO, this motivates a dense option on the underlying blockwise iterator (i.e., in addition to the Table and scipy options). I don't think we should hold up this PR, but ideally just file a feature request GH issue in the TileDB-SOMA repo.
@ryan-williams - thoughts?
Makes sense to me; we'd basically just move _tables_to_np into TileDB-SOMA?
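For context, the kind of conversion a dense blockwise option would absorb can be sketched as follows. This is a simplified reconstruction, not the actual `_tables_to_np`: plain NumPy arrays stand in for the Arrow tables that the real blockwise iterator yields, and coordinates are assumed to be unique within a block.

```python
from typing import Iterable, Tuple

import numpy as np
import numpy.typing as npt


def coo_blocks_to_np(
    blocks: Iterable[Tuple[np.ndarray, np.ndarray, np.ndarray]],
    shape: Tuple[int, int],
) -> npt.NDArray[np.float32]:
    """Scatter COO-style (rows, cols, values) blocks into one dense array.

    Skips any CSR intermediate: each block's nonzeros are written
    directly into the preallocated result via fancy indexing.
    """
    out = np.zeros(shape, dtype=np.float32)
    for rows, cols, data in blocks:
        # Assumes no duplicate (row, col) pairs; duplicates would
        # overwrite rather than accumulate.
        out[rows, cols] = data
    return out
```

Pushing this down into the iterator (as suggested above) would let callers request dense blocks directly, instead of every downstream consumer reimplementing the scatter.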
api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py (outdated, resolved)
Overall looks good - no major concerns. A few minor points of clarity and refinement noted inline.
In at least one case, there is a redundant data copy where zero-copy would suffice. Recommend tracing through the code to ensure no others (perhaps after this PR lands?), as reducing memory pressure is important in this use case.
This reverts commit 357582e.
Building on top of #1169.
This PR removes the new argument method (which was undocumented and had no tests atm) and instead uses return_sparse_X to control conversion. Previously return_sparse_X controlled whether densification happened in a wrapping iterator; now a dense array is created without having to make a CSR intermediate.
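The memory win described above can be illustrated with a small standalone example (illustration only, not the PR's code): the old path materialized the data twice, once in a CSR matrix and once in the densified copy, while the new path writes nonzeros straight into a single preallocated dense array.

```python
import numpy as np
from scipy import sparse

rows = np.array([0, 0, 1])
cols = np.array([1, 3, 2])
data = np.array([5.0, 7.0, 9.0])
shape = (2, 4)

# Old path: build a CSR intermediate, then densify (two materializations).
via_csr = sparse.csr_matrix((data, (rows, cols)), shape=shape).toarray()

# New path: scatter directly into one dense allocation.
direct = np.zeros(shape)
direct[rows, cols] = data

assert np.array_equal(via_csr, direct)
```

Both produce identical results; the direct path simply avoids holding the CSR buffers (indptr, indices, data) alongside the dense output.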