cells.aggregate_compartment() queries the database in chunks, reducing memory #156

Merged (12 commits into cytomining:master from sjfleming:sf_chunk_aggregation, Jul 23, 2021)

Conversation

@sjfleming (Contributor) commented Jul 14, 2021

Description

These changes were made directly in response to #152, and may impact #142.

The intent is to limit memory usage during pycytominer.cyto_utils.cells.aggregate_compartment() calls, and other functions which call this method, including pycytominer.cyto_utils.cells.aggregate_profiles(). One explicit stipulation was to touch pycytominer.aggregate.aggregate() as little as possible. (I did have to make one tiny change.)

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@sjfleming changed the title from "Sf chunk aggregation" to "cells.aggregate_compartment() queries the database in chunks, reducing memory" on Jul 14, 2021
@sjfleming (Contributor Author)

I performed some tests using a 37 GB SQLite database file, and I realized that, when it comes to memory usage, more is not better. I had expected that using more of the available memory would lead to faster aggregation, but in fact the opposite happens. I suppose the whole task is I/O bound, and the actual aggregation in pandas is very fast, so putting more in memory at once does not help.

Test cases were carried out by specifying n_strata, a parameter I have added to SingleCells.aggregate_profiles(). n_strata controls how many strata (of the unique specified aggregation strata) are pulled from the database into memory at once. Tests were performed on a 2-CPU, 8 GB memory machine with a solid-state drive, running Ubuntu 16.04.

Results of tests --- runtime in minutes:seconds

n_strata = 1   ---- 18:21
n_strata = 2   ---- 19:43
n_strata = 4   ---- 25:31
n_strata = max ---- 26:45

Based on these results, I am going to remove the auto-scaling of n_strata (which set it to the maximum that fits in memory), and I will also remove the related dependency I introduced (psutil). I will set the default n_strata to 1.

@sjfleming (Contributor Author)

I have checked the final outputs of aggregation on a 37GB test database:

  • run with the previous code
  • run with this branch

and they seem to be identical.

@sjfleming sjfleming marked this pull request as ready for review July 16, 2021 15:12
@gwaybio gwaybio self-requested a review July 18, 2021 11:40
@gwaybio (Member) left a comment

Really an important contribution @sjfleming - thank you! I have made several suggestions and added many points of discussion that we'll need to talk through before merging.

@niranjchandrasekaran is also planning on taking a look at the whole PR, and I've tagged him at two specific places where I think an extra careful eye is needed.

My main comment is that I'd really like to understand more about _sqlite_strata_conditions() and test it slightly more thoroughly. I'd also like to modify some variable names and docstrings, and I have a couple of one-off suggestions/comments.

I think we are close to merging!

@@ -62,7 +62,7 @@ def aggregate(
     # Subset the data to specified samples
     if isinstance(subset_data_df, pd.DataFrame):
         population_df = subset_data_df.merge(
-            population_df, how="left", on=subset_data_df.columns.tolist()
+            population_df, how="inner", on=subset_data_df.columns.tolist()
Member:

I don't believe this alters anything, but I am curious why you made this change?

@sjfleming (Contributor Author), Jul 19, 2021:

Great question! Without this change, my chunked aggregation strategy fails for the following reason: for each chunk, I call pycytominer.aggregate(), and I create one aggregated data frame per chunk, which I then concatenate at the end into the final output. If the aggregation per chunk occurs with a "left" merge, then all the rows of subset_data_df that are not part of this chunk will get NaN entries. For example:

# using "left" merge for a chunk containing only well A01
well            image                value
A01             1                      1.543
A01             2                      2.442
A02             1                      NaN

# using "inner" merge for a chunk containing only well A01
well            image                value
A01             1                      1.543
A01             2                      2.442

and then after the final concatenation step

# using "left" merge
well            image                value
A01             1                      1.543
A01             2                      2.442
A02             1                      NaN
A02             1                      0.234

# using "inner" merge
well            image                value
A01             1                      1.543
A01             2                      2.442
A02             1                      0.234
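
For illustration, a minimal pandas sketch of the same behavior (the column names and values here are simplified stand-ins, not pycytominer's actual schema):

import pandas as pd

# Simplified stand-in data: subset_data_df lists every requested stratum,
# but the current chunk only contains rows for well A01.
subset_data_df = pd.DataFrame({"well": ["A01", "A02"]})
chunk_df = pd.DataFrame(
    {"well": ["A01", "A01"], "image": [1, 2], "value": [1.543, 2.442]}
)

# "left" merge keeps the A02 row from subset_data_df and fills it with NaN
left_merged = subset_data_df.merge(chunk_df, how="left", on=["well"])

# "inner" merge keeps only the strata actually present in this chunk
inner_merged = subset_data_df.merge(chunk_df, how="inner", on=["well"])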

Member:

gotcha! This makes total sense. Thanks for the explanation.

@@ -401,6 +401,7 @@ def aggregate_compartment(
         compartment,
         compute_subsample=False,
         compute_counts=False,
+        n_strata=1,
Member:

can you name this something more descriptive? How about n_aggregation_memory_strata or something similar? n_strata does not make it obvious what the argument actually does.

Contributor Author:

Yes, I'm happy to re-think this name. I also wasn't very happy with it.

    Whether or not to compute the number of objects in each compartment
    and the number of fields of view per well.
n_strata : int, default 1
    Number of unique strata to pull from the database into working memory
Member:

is strata the right word? Maybe chunk? I don't know.

Contributor Author:

What it really represents is the number of unique aggregation strata pulled into one chunk. For example, if we are aggregating by "well", then n_strata=1 means that one well will be pulled from the SQLite database into memory at a time.
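
For example, a hedged usage sketch (the connection string and strata here are illustrative placeholders, and the parameter is renamed n_aggregation_memory_strata later in this review):

from pycytominer.cyto_utils.cells import SingleCells

# Illustrative placeholders; a real call points at an existing single-cell SQLite file.
sc = SingleCells(
    "sqlite:///single_cells.sqlite",
    strata=["Metadata_Plate", "Metadata_Well"],
)

# n_strata=1: pull the rows for one (plate, well) stratum from SQLite at a time,
# aggregate that chunk, then concatenate the per-chunk results at the end.
aggregated = sc.aggregate_profiles(n_strata=1)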

Member:

this is perfect. Can you add your example to the docstring? I think this makes the description super clear. I also love the new variable name!

Contributor Author:

Added the example

    and the number of fields of view per well.
n_strata : int, default 1
    Number of unique strata to pull from the database into working memory
    at once. A larger number does not lead to faster compute times
Member:

very helpful hint! Do you know of any specific guidance one might use? if so, can you add?

Also, should it be "a larger number does not necessarily lead to faster compute times"?

Contributor Author:

According to my (limited) experiments, a larger number actually leads to slower compute times (see my comment in the PR above). It seems the best thing to do is probably always to have n_strata=1.

pycytominer/cyto_utils/cells.py (resolved review thread)

Parameters
----------
df : pandas.core.frame.DataFrame
Member:

Is this a compartment data frame? Is this merged with image_df? Can you be more specific in the docstring about what columns to expect in this dataframe?

@sjfleming (Contributor Author), Jul 19, 2021:

This is a very strange dataframe that is neither of the above. It is a new dataframe constructed with the sole purpose of figuring out how to divide the compartment table into chunks.

The rows represent unique self.strata in the data. The columns are self.merge_cols. The values tell which merge_cols values are contained in each stratum, i.e., how to chunk the data.
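
A made-up illustration of the shape being described (values invented; one row per unique stratum, e.g. per well, with the merge_cols values found in that stratum):

import pandas as pd

# Invented example: well A01 spans images 1 and 2, well A02 contains image 3.
df_unique_mergecols = pd.DataFrame(
    {
        "TableNumber": [[1], [1]],
        "ImageNumber": [[1, 2], [3]],
    }
)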


# Translate the first strata combination into a SQLite condition string
first_stratum_condition = _sqlite_strata_conditions(
    df=df_unique_mergecols.head(1),
Member:

it took me a while to figure out that df_unique_mergecols is defined out of scope of this function. Should you include it as a function argument in _sqlite_strata_conditions()?

@sjfleming (Contributor Author), Jul 19, 2021:

df_unique_mergecols is in scope of _compartment_df_generator(), but out of scope of _sqlite_strata_conditions(), and not used in _sqlite_strata_conditions().

Maybe I made a strange choice by nesting _sqlite_strata_conditions() inside _compartment_df_generator()? The only reason I did that was because I was pretty sure no other function would ever call it.

return grouped_conditions

# Translate the first strata combination into a SQLite condition string
first_stratum_condition = _sqlite_strata_conditions(
Member:

where are you using first_stratum_condition?

Contributor Author:

I am not. Good catch! This is an old thing I didn't entirely delete.

.reset_index(drop=True)
)

def _sqlite_strata_conditions(df, dtypes, n=1):
Member:

Based on my read, this function is the most critical contribution in this PR (cc @niranjchandrasekaran). It's probably because I'm not an expert in generator functions and how they're used, but I am having some trouble following the logic.

I make a couple comments in this function to try to help my understanding, but if you can see where my knowledge gap is, please let me know. Right now, I think this function will be difficult to maintain unless we provide a couple more tests.

@sjfleming (Contributor Author), Jul 19, 2021:

I agree that there should be some dedicated tests to demonstrate the purpose of this function.

output_file=compress_file,
compute_subsample=True,
compression_options={"method": "gzip"},
n_strata=1, # this will force multiple queries from each compartment
Member:

can you run this test multiple times, looping through different n_strata values:

E.g.

for n_strata in [1, 2, 3]:
    ...
    aggregate_profiles(n_strata=n_strata)

and confirm that the expected result doesn't change

Member:

can you also add a separate test where you set n_strata to an invalid argument?

Contributor Author:

Yes this is a good idea in terms of testing multiple values for n_strata.

We can test invalid n_strata values, though the only truly invalid values should be < 1, and there is an explicit assertion for that. But we can still include a test.

@sjfleming (Contributor Author)

I have made an attempt to address all your comments and make the relevant changes, @gwaygenomics. I added a couple of tests to address a few comments as well.

@gwaybio (Member) left a comment

really great to see this happening. I made a couple comments.

@niranjchandrasekaran - are you able to take a look? I think it's worth having your eyes on the full PR, but I did note particular focus areas.

)
dtype_dict = dict(
    zip(
        [s[7:-1] for s in compartment_dtypes.columns],  # strip typeof( )
Member:

is it possible to make a different query, without typeof(), to get the actual column names in the proper order? I'd be in favor of two queries to improve readability/maintainability rather than hardcoding something that is difficult to read (even if it is robust).
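
For example, something along these lines might work (table and column names invented; note that PRAGMA table_info reports declared column types, while typeof() reports the storage class of stored values, so the two are not strictly interchangeable):

import sqlite3

# Toy in-memory database standing in for a real single-cell SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (TableNumber INTEGER, ImageNumber INTEGER, Area REAL)")

# PRAGMA table_info returns (cid, name, declared_type, notnull, default_value, pk) per column.
rows = conn.execute("PRAGMA table_info(cells)").fetchall()
dtype_dict = {name: declared_type.lower() for _, name, declared_type, *_ in rows}
# {'TableNumber': 'integer', 'ImageNumber': 'integer', 'Area': 'real'}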

@@ -846,3 +775,67 @@ def aggregate_profiles(
)
else:
return aggregated


def _sqlite_strata_conditions(df, dtypes, n=1):
Member:

should this be a separately defined function? When one imports SingleCells will it automatically know this function definition? I thought that you had it as a private method of the class before? Any reason you decided to make it separate?

Contributor Author:

When one imports SingleCells, yes, a SingleCells object will automatically be able to see that function. I did have it as a private method of the class before, and I can move it back if we want. The only reason I moved it out is that it doesn't actually rely on anything from SingleCells, and it was just a tiny bit easier to import it for testing as a separate function (though it could still be imported for testing either way).

----------
df : pandas.core.frame.DataFrame
    A dataframe where columns are merge_cols and rows represent
    unique aggregation strata of the compartment table
Member:

Suggested change:
-    unique aggregation strata of the compartment table
+    the specific image IDs grouped by unique aggregation strata (e.g. Metadata_well) of the image table

Is this more accurate?

Contributor Author:

Yes, that's right. Only I think it turns out to be whatever is in merge_cols, so "ImageNumber" and "TableNumber".

dtypes : Dict[str, str]
    Dictionary to look up SQLite datatype based on column name
n : int
    Number of rows of the input df to combine in each output
Member:

this is really great documentation - very appreciated

Returns
-------
grouped_conditions : List[str]
    A list of strings, each string being a valid SQLite conditional
Member:

can you add a couple words describing how this is meant to be used?

"being a valid SQLite conditional to facilitate chunked aggregation?" 🤷

@sjfleming (Contributor Author), Jul 19, 2021:

Haha yeah, I can add a bit. Really what I mean is that it will return a list of strings (called conditions), where each of the strings can be used in an SQL statement like this:

f"select {cols} from {compartment} where {condition}"

Here condition is one element of the list of strings grouped_conditions, and the above SQL statement will grab a single chunk for aggregation. So by iterating over the conditions in grouped_conditions and making SQL queries, we get the whole table back in chunks:

for condition in grouped_conditions:
    do_sql_query(f"select {cols} from {compartment} where {condition}")
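
A self-contained toy version of that pattern (the table, columns, and condition strings below are invented; the real conditions come from _sqlite_strata_conditions()):

import sqlite3
import pandas as pd

# Toy in-memory compartment table with four images split across two "wells".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (TableNumber INTEGER, ImageNumber INTEGER, Area REAL)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(1, 1, 10.0), (1, 2, 12.5), (1, 3, 9.8), (1, 4, 11.1)],
)

cols = "TableNumber, ImageNumber, Area"
compartment = "cells"
grouped_conditions = [
    "(TableNumber = 1 AND ImageNumber IN (1, 2))",  # chunk 1, e.g. well A01
    "(TableNumber = 1 AND ImageNumber IN (3, 4))",  # chunk 2, e.g. well A02
]

chunks = []
for condition in grouped_conditions:
    query = f"select {cols} from {compartment} where {condition}"
    chunks.append(pd.read_sql(query, conn))  # pycytominer aggregates each chunk here

full_table = pd.concat(chunks, ignore_index=True)  # the chunks together cover the whole table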

@@ -6,7 +6,7 @@
 from sqlalchemy import create_engine

 from pycytominer import aggregate, normalize
-from pycytominer.cyto_utils.cells import SingleCells
+from pycytominer.cyto_utils.cells import SingleCells, _sqlite_strata_conditions
Member:

ah, I see, you're needing to import this as a new function. I typically think of a _single_leading_underscore as marking a private method, and it's defined as an "internal use" convention in PEP 8. I'm wondering if I'm missing any critical insight that led you to separate out this method (this is kind of a duplicate of a previous comment).

Contributor Author:

No real insight, no. Yeah I just did it that way so I could easily import it to test it here. And the leading underscore should denote to a user that it's private and never to be used directly for anything.

]
out1 = _sqlite_strata_conditions(
    df=df,
    dtypes={"TableNumber": "integer", "ImageNumber": "integer"},
Member:

will the dtype ever be a string?

Member:

oh, "text" - I see below, gotcha!

Contributor Author:

Yeah, it seems like the SQLite data types are null, real, text, integer, and blob. But I think the TableNumber and ImageNumber values are always of integer type in practice, though I am not 100% sure.

@gwaybio (Member) commented Jul 21, 2021

a quick note @sjfleming - can you also add your name to the software citation section in this PR? Thanks!

@sjfleming (Contributor Author)

I just rebased on top of the current master branch to get the latest changes.

@sjfleming (Contributor Author)

Thanks @gwaygenomics, I took a guess about where to put myself in that author list; please do correct me if it's the wrong place :)

README.md (outdated suggestion, resolved)
Co-authored-by: Greg Way <gregory.way@gmail.com>
@niranjchandrasekaran (Member) left a comment

LGTM! I don't have any additional comments.


# Obtain all valid strata combinations, and their merge_cols values
df_unique_mergecols = (
    self.image_df[self.strata + self.merge_cols]


I think self.strata and self.merge_cols will always be in self.image_df and they will be unique. I'm trying to imagine when the two lists will not be unique. We could aggregate at the image level, but for that we would use Metadata_Site in self.strata and not ImageNumber. Other than that, I can't think of an overlap between the two.

pycytominer/cyto_utils/cells.py (resolved review thread)
@gwaybio (Member) left a comment

Looks great @sjfleming - I think we can merge this in.

One last comment (sorry, I should have mentioned this sooner!): can you make sure to run black on the files you've touched?

@sjfleming (Contributor Author) commented Jul 23, 2021

@gwaygenomics Great! Yes, I have already run black and everything is currently formatted that way.

@gwaybio (Member) commented Jul 23, 2021

Running final tests now!

@codecov-commenter

Codecov Report

Merging #156 (eb8ce17) into master (3edd0c7) will increase coverage by 0.04%.
The diff coverage is 98.43%.


@@            Coverage Diff             @@
##           master     #156      +/-   ##
==========================================
+ Coverage   98.04%   98.09%   +0.04%     
==========================================
  Files          49       49              
  Lines        2199     2253      +54     
==========================================
+ Hits         2156     2210      +54     
  Misses         43       43              
Flag         Coverage Δ
unittests    98.09% <98.43%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                     Coverage Δ
pycytominer/aggregate.py                           96.77% <ø> (ø)
pycytominer/cyto_utils/cells.py                    97.02% <97.22%> (+0.45%) ⬆️
pycytominer/tests/test_cyto_utils/test_cells.py    100.00% <100.00%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3edd0c7...eb8ce17.

@gwaybio gwaybio merged commit 372f108 into cytomining:master Jul 23, 2021
@gwaybio (Member) commented Jul 23, 2021

thanks again Stephen!

@sjfleming sjfleming deleted the sf_chunk_aggregation branch July 23, 2021 17:31