cells.aggregate_compartment() queries the database in chunks, reducing memory #156

Merged (12 commits into cytomining:master from sjfleming:sf_chunk_aggregation, Jul 23, 2021)

Conversation

@sjfleming (Contributor) commented Jul 14, 2021

Description

These changes were made directly in response to #152, and may impact #142.

The intent is to limit memory usage during pycytominer.cyto_utils.cells.aggregate_compartment() calls, and other functions which call this method, including pycytominer.cyto_utils.cells.aggregate_profiles(). One explicit stipulation was to touch pycytominer.aggregate.aggregate() as little as possible. (I did have to make one tiny change.)

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@sjfleming changed the title from "Sf chunk aggregation" to "cells.aggregate_compartment() queries the database in chunks, reducing memory" on Jul 14, 2021
@sjfleming (Contributor Author)

I performed some tests using a 37 GB SQLite database file, and I realized that, when it comes to memory usage, more is not better. I had expected that using more of the available memory would lead to faster aggregation, but in fact the opposite happens. I suppose the whole task is I/O bound, and the actual aggregation in pandas is very fast, so putting more in memory at once does not help.

Test cases were carried out by specifying n_strata, a parameter I have added to SingleCells.aggregate_profiles(). n_strata controls how many strata (of the unique specified aggregation strata) are pulled from the database into memory at once. Tests were performed on a 2-CPU, 8 GB memory machine with a solid-state drive, running Ubuntu 16.04.

Results of tests --- runtime in minutes:seconds

n_strata = 1   ---- 18:21
n_strata = 2   ---- 19:43
n_strata = 4   ---- 25:31
n_strata = max ---- 26:45

Based on these results, I am going to remove the auto-scaling of n_strata (which set it to the maximum that fits in memory), and I will also remove the related dependency I introduced (psutil). I will set the default n_strata to 1.

@sjfleming (Contributor Author)

I have checked the final outputs of aggregation on a 37GB test database:

  • run with the previous code
  • run with this branch

and they seem to be identical.

@sjfleming sjfleming marked this pull request as ready for review July 16, 2021 15:12
@gwaybio gwaybio self-requested a review July 18, 2021 11:40
@gwaybio (Member) left a comment

Really an important contribution @sjfleming - thank you! I have made several suggestions and added many points of discussion that we'll need to talk through before merging.

@niranjchandrasekaran is also planning on taking a look at the whole PR, and I've tagged him at two specific places where I think an extra careful eye is needed.

My main comment is that I'd really like to understand more about _sqlite_strata_conditions() and test it slightly more thoroughly. I'd also like to modify some variable names and docstrings, and I have a couple of one-off suggestions/comments.

I think we are close to merging!

@@ -62,7 +62,7 @@ def aggregate(
     # Subset the data to specified samples
     if isinstance(subset_data_df, pd.DataFrame):
         population_df = subset_data_df.merge(
-            population_df, how="left", on=subset_data_df.columns.tolist()
+            population_df, how="inner", on=subset_data_df.columns.tolist()
Member:

I don't believe this alters anything, but I am curious why you made this change?

@sjfleming (Contributor Author), Jul 19, 2021:

Great question! Without this change, my chunked aggregation strategy fails for the following reason: for each chunk, I call pycytominer.aggregate(), and I create one aggregated data frame per chunk, which I then concatenate at the end into the final output. If the aggregation per chunk occurs with a "left" merge, then all the rows of subset_data_df that are not part of this chunk will get NaN entries. For example:

# using "left" merge for a chunk containing only well A01
well            image                value
A01             1                      1.543
A01             2                      2.442
A02             1                      NaN

# using "inner" merge for a chunk containing only well A01
well            image                value
A01             1                      1.543
A01             2                      2.442

and then after the final concatenation step

# using "left" merge
well            image                value
A01             1                      1.543
A01             2                      2.442
A02             1                      NaN
A02             1                      0.234

# using "inner" merge
well            image                value
A01             1                      1.543
A01             2                      2.442
A02             1                      0.234
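
For illustration, a minimal pandas sketch of the same behavior (the column names and values here are simplified stand-ins, not pycytominer's actual schema):

import pandas as pd

# Simplified stand-in data: subset_data_df lists every requested stratum,
# but the current chunk only contains rows for well A01.
subset_data_df = pd.DataFrame({"well": ["A01", "A02"]})
chunk_df = pd.DataFrame(
    {"well": ["A01", "A01"], "image": [1, 2], "value": [1.543, 2.442]}
)

# "left" merge keeps the A02 row from subset_data_df and fills it with NaN
left_merged = subset_data_df.merge(chunk_df, how="left", on=["well"])

# "inner" merge keeps only the strata actually present in this chunk
inner_merged = subset_data_df.merge(chunk_df, how="inner", on=["well"])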

Member:

gotcha! This makes total sense. Thanks for the explanation.

@@ -401,6 +401,7 @@ def aggregate_compartment(
         compartment,
         compute_subsample=False,
         compute_counts=False,
+        n_strata=1,
Member:

can you name this something more descriptive? How about n_aggregation_memory_strata or something similar? n_strata does not make it obvious what the argument actually does.

Contributor Author:

Yes, I'm happy to re-think this name. I also wasn't very happy with it.

    Whether or not to compute the number of objects in each compartment
    and the number of fields of view per well.
n_strata : int, default 1
    Number of unique strata to pull from the database into working memory
Member:

is strata the right word? Maybe chunk? I don't know.

Contributor Author:

What it really represents is the number of unique aggregation strata pulled into one chunk. For example, if we are aggregating by "well", then n_strata=1 means that one well will be pulled from the SQLite database into memory at a time.
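
For example, a hedged usage sketch (the connection string and strata here are illustrative placeholders, and the parameter is renamed n_aggregation_memory_strata later in this review):

from pycytominer.cyto_utils.cells import SingleCells

# Illustrative placeholders; a real call points at an existing single-cell SQLite file.
sc = SingleCells(
    "sqlite:///single_cells.sqlite",
    strata=["Metadata_Plate", "Metadata_Well"],
)

# n_strata=1: pull the rows for one (plate, well) stratum from SQLite at a time,
# aggregate that chunk, then concatenate the per-chunk results at the end.
aggregated = sc.aggregate_profiles(n_strata=1)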

Member:

this is perfect. Can you add your example to the docstring? I think this makes the description super clear. I also love the new variable name!

Contributor Author:

Added the example

    and the number of fields of view per well.
n_strata : int, default 1
    Number of unique strata to pull from the database into working memory
    at once. A larger number does not lead to faster compute times
Member:

very helpful hint! Do you know of any specific guidance one might use? if so, can you add?

Also, should it be "a larger number does not necessarily lead to faster compute times"?

Contributor Author:

According to my (limited) experiments, a larger number actually leads to slower compute times (see my comment in the PR above). It seems the best thing to do is probably always to have n_strata=1.

pycytominer/cyto_utils/cells.py (resolved review thread)

Parameters
----------
df : pandas.core.frame.DataFrame
Member:

Is this a compartment data frame? Is this merged with image_df? Can you be more specific in the docstring about what columns to expect in this dataframe?

@sjfleming (Contributor Author), Jul 19, 2021:

This is a very strange dataframe that is neither of the above. It is a new dataframe constructed with the sole purpose of figuring out how to divide the compartment table into chunks.

The rows represent unique self.strata in the data. The columns are self.merge_cols. The values tell which merge_cols values are contained in each stratum, i.e., how to chunk the data.
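
A made-up illustration of the shape being described (values invented; one row per unique stratum, e.g. per well, with the merge_cols values found in that stratum):

import pandas as pd

# Invented example: well A01 spans images 1 and 2, well A02 contains image 3.
df_unique_mergecols = pd.DataFrame(
    {
        "TableNumber": [[1], [1]],
        "ImageNumber": [[1, 2], [3]],
    }
)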


# Translate the first strata combination into a SQLite condition string
first_stratum_condition = _sqlite_strata_conditions(
    df=df_unique_mergecols.head(1),
Member:

it took me a while to figure out that df_unique_mergecols is defined out of scope of this function. Should you include it as a function argument in _sqlite_strata_conditions()?

@sjfleming (Contributor Author), Jul 19, 2021:

df_unique_mergecols is in scope of _compartment_df_generator(), but out of scope of _sqlite_strata_conditions(), and not used in _sqlite_strata_conditions().

Maybe I made a strange choice by nesting _sqlite_strata_conditions() inside _compartment_df_generator()? The only reason I did that was because I was pretty sure no other function would ever call it.

return grouped_conditions

# Translate the first strata combination into a SQLite condition string
first_stratum_condition = _sqlite_strata_conditions(
Member:

where are you using first_stratum_condition?

Contributor Author:

I am not. Good catch! This is an old thing I didn't entirely delete.

.reset_index(drop=True)
)

def _sqlite_strata_conditions(df, dtypes, n=1):
Member:

Based on my read, this function is the most critical contribution in this PR (cc @niranjchandrasekaran). It's probably because I'm not an expert in generator functions and how they're used, but I am having some trouble following the logic.

I make a couple comments in this function to try to help my understanding, but if you can see where my knowledge gap is, please let me know. Right now, I think this function will be difficult to maintain unless we provide a couple more tests.

@sjfleming (Contributor Author), Jul 19, 2021:

I agree that there should be some dedicated tests to demonstrate the purpose of this function.

output_file=compress_file,
compute_subsample=True,
compression_options={"method": "gzip"},
n_strata=1, # this will force multiple queries from each compartment
Member:

can you run this test multiple times, looping through different n_strata values:

E.g.

for n_strata in [1, 2, 3]:
    ...
    aggregate_profiles(n_strata=n_strata)

and confirm that the expected result doesn't change

Member:

can you also add a separate test where you set n_strata to an invalid argument?

Contributor Author:

Yes this is a good idea in terms of testing multiple values for n_strata.

We can test invalid n_strata values, though the only truly invalid values should be < 1, and there is an explicit assertion for that. But we can still include a test.

@sjfleming (Contributor Author)

I have made an attempt to address all your comments and make the relevant changes, @gwaygenomics. I added a couple of tests to address a few comments as well.

@gwaybio (Member) left a comment

really great to see this happening. I made a couple comments.

@niranjchandrasekaran - are you able to take a look? I think it's worth having your eyes on the full PR, but I did note particular focus areas.

)
dtype_dict = dict(
    zip(
        [s[7:-1] for s in compartment_dtypes.columns],  # strip typeof( )
Member:

is it possible to make a different query, without typeof(), to get the actual column names in the proper order? I'd be in favor of two queries to improve readability/maintainability rather than hardcoding something that is difficult to read (even if it is robust).
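
For example, something along these lines might work (table and column names invented; note that PRAGMA table_info reports declared column types, while typeof() reports the storage class of stored values, so the two are not strictly interchangeable):

import sqlite3

# Toy in-memory database standing in for a real single-cell SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (TableNumber INTEGER, ImageNumber INTEGER, Area REAL)")

# PRAGMA table_info returns (cid, name, declared_type, notnull, default_value, pk) per column.
rows = conn.execute("PRAGMA table_info(cells)").fetchall()
dtype_dict = {name: declared_type.lower() for _, name, declared_type, *_ in rows}
# {'TableNumber': 'integer', 'ImageNumber': 'integer', 'Area': 'real'}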

@@ -846,3 +775,67 @@ def aggregate_profiles(
)
else:
return aggregated


def _sqlite_strata_conditions(df, dtypes, n=1):
Member:

should this be a separately defined function? When one imports SingleCells will it automatically know this function definition? I thought that you had it as a private method of the class before? Any reason you decided to make it separate?

Contributor Author:

When one imports SingleCells, yes, a SingleCells object will automatically be able to see that function. I did have it as a private method of the class before, and I can move it back if we want. The only reason I moved it out is that it doesn't actually rely on anything from SingleCells, and it was just a tiny bit easier to import it for testing as a separate function (though it could still be imported for testing either way).

----------
df : pandas.core.frame.DataFrame
    A dataframe where columns are merge_cols and rows represent
    unique aggregation strata of the compartment table
Member:

Suggested change:
-    unique aggregation strata of the compartment table
+    the specific image IDs grouped by unique aggregation strata (e.g. Metadata_well) of the image table

Is this more accurate?

Contributor Author:

Yes, that's right. Only I think it turns out to be whatever is in merge_cols, so "ImageNumber" and "TableNumber".

dtypes : Dict[str, str]
    Dictionary to look up SQLite datatype based on column name
n : int
    Number of rows of the input df to combine in each output
Member:

this is really great documentation - very appreciated

Returns
-------
grouped_conditions : List[str]
    A list of strings, each string being a valid SQLite conditional
Member:

can you add a couple words describing how this is meant to be used?

"being a valid SQLite conditional to facilitate chunked aggregation?" 🤷

@sjfleming (Contributor Author), Jul 19, 2021:

Haha yeah, I can add a bit. Really what I mean is that it will return a list of strings (called conditions), where each of the strings can be used in an SQL statement like this:

f"select {cols} from {compartment} where {condition}"

Here condition is one element of the list of strings grouped_conditions, and the above SQL statement will grab a single chunk for aggregation. So by iterating over the conditions in grouped_conditions and making SQL queries, we get the whole table back in chunks:

for condition in grouped_conditions:
    do_sql_query(f"select {cols} from {compartment} where {condition}")
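
A self-contained toy version of that pattern (the table, columns, and condition strings below are invented; the real conditions come from _sqlite_strata_conditions()):

import sqlite3
import pandas as pd

# Toy in-memory compartment table with four images split across two "wells".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cells (TableNumber INTEGER, ImageNumber INTEGER, Area REAL)")
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?)",
    [(1, 1, 10.0), (1, 2, 12.5), (1, 3, 9.8), (1, 4, 11.1)],
)

cols = "TableNumber, ImageNumber, Area"
compartment = "cells"
grouped_conditions = [
    "(TableNumber = 1 AND ImageNumber IN (1, 2))",  # chunk 1, e.g. well A01
    "(TableNumber = 1 AND ImageNumber IN (3, 4))",  # chunk 2, e.g. well A02
]

chunks = []
for condition in grouped_conditions:
    query = f"select {cols} from {compartment} where {condition}"
    chunks.append(pd.read_sql(query, conn))  # pycytominer aggregates each chunk here

full_table = pd.concat(chunks, ignore_index=True)  # the chunks together cover the whole table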

@@ -6,7 +6,7 @@
 from sqlalchemy import create_engine

 from pycytominer import aggregate, normalize
-from pycytominer.cyto_utils.cells import SingleCells
+from pycytominer.cyto_utils.cells import SingleCells, _sqlite_strata_conditions
Member:

ah, I see, you're needing to import this as a new function. I typically think of a _single_leading_underscore as marking a private method, and it's defined as an "internal use" convention in PEP 8. I'm wondering if I'm missing any critical insight that led you to separate out this method (this is kind of a duplicate of a previous comment).

Contributor Author:

No real insight, no. Yeah I just did it that way so I could easily import it to test it here. And the leading underscore should denote to a user that it's private and never to be used directly for anything.

]
out1 = _sqlite_strata_conditions(
    df=df,
    dtypes={"TableNumber": "integer", "ImageNumber": "integer"},
Member:

will the dtype ever be a string?

Member:

oh, "text" - I see below, gotcha!

Contributor Author:

Yeah, it seems like the SQLite data types are null, real, text, integer, and blob. But I think the TableNumber and ImageNumber values are always of integer type in practice, though I am not 100% sure.

@gwaybio (Member) commented Jul 21, 2021

a quick note @sjfleming - can you also add your name to the software citation section in this PR? Thanks!

@sjfleming (Contributor Author)

I just rebased on top of the current master branch to get the latest changes.

@sjfleming (Contributor Author)

Thanks @gwaygenomics, I took a guess about where to put myself in that author list; please do correct me if it's the wrong place :)

README.md (outdated suggestion, resolved)
Co-authored-by: Greg Way <gregory.way@gmail.com>
@niranjchandrasekaran (Member) left a comment

LGTM! I don't have any additional comments.


# Obtain all valid strata combinations, and their merge_cols values
df_unique_mergecols = (
    self.image_df[self.strata + self.merge_cols]


I think self.strata and self.merge_cols will always be in self.image_df and they will be unique. I'm trying to imagine when the two lists will not be unique. We could aggregate at the image level, but for that we would use Metadata_Site in self.strata and not ImageNumber. Other than that, I can't think of an overlap between the two.

pycytominer/cyto_utils/cells.py (resolved review thread)
@gwaybio (Member) left a comment

Looks great @sjfleming - I think we can merge this in.

One last comment (sorry, I should have mentioned this sooner!): can you make sure to run black on the files you've touched?

@sjfleming (Contributor Author) commented Jul 23, 2021

@gwaygenomics Great! Yes, I have already run black and everything is currently formatted that way.

@gwaybio (Member) commented Jul 23, 2021

Running final tests now!

@codecov-commenter

Codecov Report

Merging #156 (eb8ce17) into master (3edd0c7) will increase coverage by 0.04%.
The diff coverage is 98.43%.


@@            Coverage Diff             @@
##           master     #156      +/-   ##
==========================================
+ Coverage   98.04%   98.09%   +0.04%     
==========================================
  Files          49       49              
  Lines        2199     2253      +54     
==========================================
+ Hits         2156     2210      +54     
  Misses         43       43              
Flag         Coverage Δ
unittests    98.09% <98.43%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                     Coverage Δ
pycytominer/aggregate.py                           96.77% <ø> (ø)
pycytominer/cyto_utils/cells.py                    97.02% <97.22%> (+0.45%) ⬆️
pycytominer/tests/test_cyto_utils/test_cells.py    100.00% <100.00%> (ø)

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3edd0c7...eb8ce17.

@gwaybio gwaybio merged commit 372f108 into cytomining:master Jul 23, 2021
@gwaybio (Member) commented Jul 23, 2021

thanks again Stephen!

@sjfleming sjfleming deleted the sf_chunk_aggregation branch July 23, 2021 17:31