
Pycytominer fails when running aggregate from large SQLite tables #142

Closed
bethac07 opened this issue May 6, 2021 · 7 comments
Labels
bug Something isn't working

Comments


bethac07 commented May 6, 2021

Reading in one table from an ~25-30 GB SQLite file (so that single table should have been 8-10 GB), pycytominer ran completely through all 64 GB of memory on my machine, leading to the error below.

Happily, adding a chunksize parameter to load_compartment in cyto_utils/cells.py as below seems to avoid the worst of the issue (I still saw spikes into the 30s of GB in htop; I'm not sure whether they happened during pandas.concat or during the aggregation itself). Reporting this as an issue rather than just making a PR because
a) I'm not sure whether you think there is anything worth investigating; I'd say probably not, but it's your call, and
b) I'm not sure whether you want pycytominer to ALWAYS chunk its reads (and if so, whether to use this particular chunk size). I can also imagine exposing chunksize as a user-passed option, or having pycytominer assess the database size in some way and then decide whether and how to chunk.

Working in the profiling conda env (Python 3.7.1) on an Ubuntu 14 machine; pip freeze is at the bottom.

```python
        # In cyto_utils/cells.py, load_compartment: read the table in
        # bounded chunks instead of a single pd.read_sql call
        dflist = []
        for chunk in pd.read_sql(sql=compartment_query, con=self.conn, chunksize=10000):
            dflist.append(chunk)
        df = pd.concat(dflist)
```
```
Traceback (most recent call last):
  File "profiling-recipe/profiles/profiling_pipeline.py", line 28, in <module>
    process_profile(batch=batch, plate=plate, pipeline=pipeline)
  File "/home/ubuntu/ebs_tmp/work/projects/workflow_demo/workspace/software/workflow_demo_data/profiling-recipe/profiles/profile.py", line 124, in process_profile
    aggregate_args=aggregate_args,
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pycytominer/cyto_utils/cells.py", line 661, in aggregate_profiles
    compute_counts=True,
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pycytominer/cyto_utils/cells.py", line 433, in aggregate_compartment
    self.load_compartment(compartment=compartment),
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pycytominer/cyto_utils/cells.py", line 392, in load_compartment
    df = pd.read_sql(sql=compartment_query, con=self.conn)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/io/sql.py", line 516, in read_sql
    chunksize=chunksize,
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/io/sql.py", line 1313, in read_query
    parse_dates=parse_dates,
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/io/sql.py", line 124, in _wrap_result
    frame = DataFrame.from_records(data, columns=columns, coerce_float=coerce_float)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/frame.py", line 1895, in from_records
    mgr = arrays_to_mgr(arrays, arr_columns, result_index, columns)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 95, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1701, in create_block_manager_from_arrays
    blocks = _form_blocks(arrays, names, axes)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1759, in _form_blocks
    float_blocks = _multi_blockify(items_dict["FloatBlock"])
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1852, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1880, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate 7.62 GiB for an array with shape (3463, 295388) and data type float64
```
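(Editor's note, not from the original report: the 7.62 GiB figure is exactly what the reported array shape implies, which suggests pandas was trying to consolidate all float columns into a single contiguous block. A quick check:)

```python
# The failed allocation: a (3463, 295388) array of float64 (8 bytes each).
rows, cols = 3463, 295388
size_gib = rows * cols * 8 / 2**30
print(f"{size_gib:.2f} GiB")  # -> 7.62 GiB
```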
```
certifi==2020.12.5
greenlet==1.1.0
importlib-metadata==4.0.1
joblib==1.0.1
numpy==1.20.2
pandas==1.2.0
pycytominer==0.1
python-dateutil==2.8.1
pytz==2021.1
PyYAML==5.3.1
scikit-learn==0.24.2
scipy==1.6.3
six==1.16.0
SQLAlchemy==1.4.13
threadpoolctl==2.1.0
typing-extensions==3.10.0.0
zipp==3.4.1
```
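(Editor's note: a self-contained sketch of the chunked-read workaround above, using an in-memory SQLite database as a stand-in for the real compartment table; the table and column names are illustrative, not pycytominer's schema.)

```python
import sqlite3

import pandas as pd

# Stand-in for a large compartment table (names are illustrative).
conn = sqlite3.connect(":memory:")
pd.DataFrame({"ObjectNumber": range(100), "Intensity": range(100)}).to_sql(
    "Cells", conn, index=False
)

# Reading in bounded chunks keeps peak memory during the read proportional
# to the chunk size; pd.concat still materializes the full frame at the end.
chunks = pd.read_sql(sql="SELECT * FROM Cells", con=conn, chunksize=10)
df = pd.concat(chunks)
print(df.shape)  # -> (100, 2)
```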
bethac07 added the bug label on May 6, 2021

gwaybio commented May 7, 2021

ouch, that's not great. Thanks for tracking this down and reporting!

It looks like this error comes from process_profile() in the profiling recipe. Do you have a sense if the leak originates from the recipe or the load_compartment() pycytominer call?


bethac07 commented May 7, 2021

This is the total amount of tracing I did, so I can't say for sure. My naive assumption is that, since load_compartment just returns its dataframe, it must happen during that step; otherwise adding the chunksize wouldn't have helped. But I don't have data to back that up.


bethac07 commented Dec 7, 2022

I think this can be closed.


gwaybio commented Dec 9, 2022

@d33bs - any objections to closing this?


d33bs commented Dec 9, 2022

> @d33bs - any objections to closing this?

Thanks @gwaybio - I really appreciated getting to read through these notes (thanks as well to @bethac07 for the great details here) in the context of some recent inspirations. I'm wondering whether chunked DuckDB SQL queries that join metadata and compartment data would outperform pandas dataframe merges of `select *` results. This may not cover all scenarios but could provide scalability for certain bottlenecks.

I imagine recent updates have resolved some, but maybe not all, of the resource-based failures. Even when we don't hit failures/exceptions, I feel reducing the time these actions take could be helpful for developers. That said, maybe we could close this issue and explore these aspects further within #198?
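(Editor's note: the pushed-down-join idea can be illustrated without DuckDB. A minimal stdlib-only sketch, using hypothetical metadata/compartment tables: the join runs inside the SQLite engine and rows stream out in bounded batches, instead of loading two full tables and merging them in pandas.)

```python
import sqlite3

# Hypothetical tables standing in for per-image metadata and per-object
# compartment data; the real pycytominer schema differs.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Image (TableNumber INTEGER PRIMARY KEY, Metadata_Well TEXT);
    CREATE TABLE Cells (TableNumber INTEGER, Intensity REAL);
    INSERT INTO Image VALUES (1, 'A01'), (2, 'A02');
    INSERT INTO Cells VALUES (1, 0.5), (1, 0.7), (2, 0.9);
    """
)

# The join happens inside the database engine; fetchmany() streams the
# result in bounded batches rather than materializing it all at once.
cur = conn.execute(
    "SELECT i.Metadata_Well, c.Intensity "
    "FROM Cells c JOIN Image i USING (TableNumber) "
    "ORDER BY i.TableNumber, c.Intensity"
)
rows = []
while batch := cur.fetchmany(2):
    rows.extend(batch)
print(rows)  # -> [('A01', 0.5), ('A01', 0.7), ('A02', 0.9)]
```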


gwaybio commented Dec 9, 2022

> maybe we could close this issue and explore these aspects further within #198?

I was thinking this too (and possibly Beth was as well) - I just wanted to check that everything is captured in that issue (which I think it is!)


gwaybio commented Dec 9, 2022

Thanks @d33bs !
