
Code refactor to limit memory usage? #152

Closed
sjfleming opened this issue Jul 1, 2021 · 8 comments
Comments

@sjfleming
Contributor

Hi all,

I've been working with Carmen @carmendv from the Precision Cardiology Lab on using pycytominer to aggregate and normalize data which has been saved in an SQLite database format by running

cytominer-database ingest /data/directory sqlite:///backend.sqlite -c ingest_config.ini

Our strategy once the database has been created is then to run the following code snippet using pycytominer

from pycytominer.cyto_utils.cells import SingleCells

sc = SingleCells('sqlite:///backend.sqlite',
                 aggregation_operation='median')
sc.aggregate_profiles().to_csv(output_filename)

However, as we've discussed with @niranjchandrasekaran and @gwaygenomics, the first step of this operation seems to be loading the entire SQLite database into memory. It seems this happens either here, or here (if load_image_data=False upon object instantiation).

These database files can be quite large, many tens or even hundreds of GB. At some point it becomes untenable, and prohibitively expensive, to have a machine with enough memory to handle this. This seems related to issue #142

As I understand it, the current strategy is:

  1. create the SQLite database
  2. load the entire database into memory as a pandas dataframe
  3. use pandas to manipulate the dataframe and perform aggregation

An alternative strategy I am proposing would be:

  1. create the SQLite database
  2. perform slightly more complex queries on the database on disk in order to perform the aggregation directly via SQL

This would have the advantage of truly making use of the database, and could dramatically reduce the memory footprint.

I was thinking of something like this (https://www.sqlitetutorial.net/sqlite-group-by/) though I am no expert at SQL.
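To make the idea concrete, here is a minimal sketch of pushing a GROUP BY aggregation into SQLite so that only the aggregated result ever reaches Python memory. The table and column names are illustrative, not pycytominer's actual schema, and AVG stands in for the median because SQLite has no built-in MEDIAN:

```python
import sqlite3

# Illustrative schema: one compartment table with per-cell features,
# keyed by an image identifier (names are made up for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cells (ImageNumber INTEGER, AreaShape_Area REAL)")
conn.executemany(
    "INSERT INTO Cells VALUES (?, ?)",
    [(1, 100.0), (1, 300.0), (2, 50.0), (2, 150.0)],
)

# The database performs the aggregation on disk; Python only ever sees
# the per-image summary rows. SQLite lacks MEDIAN, so AVG stands in.
rows = conn.execute(
    "SELECT ImageNumber, AVG(AreaShape_Area) "
    "FROM Cells GROUP BY ImageNumber"
).fetchall()
print(dict(rows))  # {1: 200.0, 2: 100.0}
```

With an index on the grouping column, SQLite can do this without ever materializing the full table in application memory.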

Is there interest in this? Does this seem like a good direction for pycytominer?

I do think it would probably require a decent amount of refactoring, since the aggregate() function in aggregate.py takes a pandas dataframe as input, and a lot of the code currently involves fetching and formatting that dataframe. My current thinking is that aggregate() would take the database connection as an input argument, rather than an in-memory dataframe.
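As a hypothetical sketch of that signature change (the function name, table and column defaults, and the AVG stand-in for median are all assumptions, not pycytominer's actual API), an aggregate function could accept a live connection and return only the aggregated frame:

```python
import sqlite3
import pandas as pd

def aggregate_from_connection(connection, table="Cells",
                              strata="ImageNumber", feature="Area"):
    # Hypothetical refactored signature: the group-by runs inside the
    # database, and only the small aggregated result becomes a DataFrame.
    query = (f"SELECT {strata}, AVG({feature}) AS {feature} "
             f"FROM {table} GROUP BY {strata}")
    return pd.read_sql(query, connection)

# Tiny demonstration database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cells (ImageNumber INTEGER, Area REAL)")
conn.executemany("INSERT INTO Cells VALUES (?, ?)",
                 [(1, 2.0), (1, 4.0), (2, 6.0)])
df = aggregate_from_connection(conn)
```

The caller's memory footprint is then one row per image rather than one row per cell.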

@gwaybio
Member

gwaybio commented Jul 2, 2021

This is wonderful - I am fully supportive of this direction and would welcome your contribution. A couple of thoughts:

  • I'd recommend adding the functionality to aggregate directly from SQL to the SingleCells class. Let's try to keep pycytominer.aggregate() functional for pandas dataframes (we use aggregate elsewhere too). If that's not possible, then I'd recommend writing two separate functions:
    • (1) sniff file input to pycytominer.aggregate()
    • (2) perform SQL aggregation if sniffed = "SQL"
  • We're slowly moving away from SQL in favor of parquet. This is happening via a transition from cytominer-database to cytominer-transport.
    • Practically, this means any file-type-dependent aggregation functionality should be kept as separate as possible from the core pycytominer functions. We would still find SQL-based aggregation very helpful, but it is a short-to-mid-term solution. Really, this point is more evidence for your contribution containing the two functions detailed above.
  • Minor point, but we're not loading the full database in SingleCells().load_image(). This is just loading the image.csv file. We are loading the full database when we run SingleCells().load_compartment() and merge single cells.

How does this sound? To me, I imagine this would be a big improvement. Although a SQL novice, I do have a sense that this would improve speed and reduce required resources quite a bit. What do you think @niranjchandrasekaran ? Any additional guidance for Stephen?

@niranjchandrasekaran
Member

I'd recommend adding the functionality to aggregate directly from SQL to the SingleCells class. Let's try to keep pycytominer.aggregate() functional for pandas dataframes (we use aggregate elsewhere too).

I second this.

Minor point, but we're not loading the full database in SingleCells().load_image(). This is just loading the image.csv file. We are loading the full database when we run SingleCells().load_compartment() and merge single cells.

I feel like this distinction is important while @sjfleming is thinking about refactoring the code. Currently, some queries are made at the image-table level while others are made at the whole-database level, and it will be important to get that right.

@sjfleming
Contributor Author

Thanks for your comments! I will try to work on this in my fork of the repo. I'll plan to leave pycytominer.aggregate() alone, and change only the way that aggregation is being done in the context of the SingleCells class.

And yes, I was not completely sure where the different parts of the database actually get loaded, so this is a helpful explanation.

I won't be super speedy, but with luck I might have an update in a few weeks :)

@gwaybio
Member

gwaybio commented Jul 8, 2021

Sounds great @sjfleming - I look forward to seeing your contribution, potentially in a couple of weeks

@sjfleming
Contributor Author

sjfleming commented Jul 14, 2021

Okay, so digging into the code a bit more, I see that moving aggregation into SQL queries would require a very substantial refactor. Since I wanted to touch as little code as possible, I introduced what I hope is a compromise... one that makes the minimal changes needed to reduce memory usage substantially.

I am still in the process of testing to make sure that these changes do in fact limit memory usage; I'm just going to try it out on a big dataset and monitor memory. I am not sure how I would include a pytest test case for this... it was not obvious to me.
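One way to sketch the chunked compromise (illustrative names; mean rather than median, since a mean can be accumulated incrementally across chunks while a median cannot) is to stream the table through pandas in fixed-size chunks, so peak memory is bounded by the chunk size rather than the table size:

```python
import sqlite3
import pandas as pd

# Tiny demonstration database standing in for a large backend.sqlite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Cells (ImageNumber INTEGER, Area REAL)")
conn.executemany("INSERT INTO Cells VALUES (?, ?)",
                 [(i % 2, float(i)) for i in range(10)])

# Accumulate (sum, count) per group chunk-by-chunk; a true median would
# need a different strategy, e.g. pushing the query into SQL per group.
sums, counts = {}, {}
for chunk in pd.read_sql("SELECT * FROM Cells", conn, chunksize=3):
    grouped = chunk.groupby("ImageNumber")["Area"]
    for key, value in grouped.sum().items():
        sums[key] = sums.get(key, 0.0) + value
    for key, value in grouped.count().items():
        counts[key] = counts.get(key, 0) + value

means = {key: sums[key] / counts[key] for key in sums}
print(means)  # {0: 4.0, 1: 5.0}
```

With chunksize=3, at most three rows are resident at a time, regardless of how many rows the table holds.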

@gwaybio
Member

gwaybio commented Jul 15, 2021

Awesome, thanks @sjfleming - let me know when I should dig deeper into the code. Looks like you were able to add one test, which is great.

@sjfleming
Contributor Author

Alright @gwaygenomics , I have made a few more changes, and now I think the code is ready for review. Thanks!

@sjfleming
Contributor Author

Closed by #156
