
Pycytominer fails when running aggregate from large SQLite tables #142

Closed
bethac07 opened this issue May 6, 2021 · 7 comments
Labels
bug Something isn't working

Comments


bethac07 commented May 6, 2021

Reading in one table from an ~25-30 GB SQLite file (so that single table should have been 8-10 GB), pycytominer ran completely through all 64 GB of memory on my machine, leading to the error below.

Happily, adding a chunksize parameter to load_compartment in cyto_utils/cells.py as below seems to avoid the worst of the issue (I still saw spikes into the 30s of GB in htop; I'm not sure whether they happened during pandas.concat or during the aggregation itself). Reporting this as an issue rather than just making a PR because
a) I'm not sure whether you think there is anything worth investigating; I'd say probably not, but it's your call, and
b) I'm not sure whether you want pycytominer to ALWAYS chunk its reads (and if so, whether to use this particular chunk size). I can also imagine exposing chunksize as a user-passed option, or having pycytominer assess the database size in some way and then decide whether and how to chunk.

Working in the profiling conda env (Python 3.7.1) on an Ubuntu 14 machine; pip freeze is at the bottom.

```python
        # In cyto_utils/cells.py, load_compartment: read the table in
        # bounded chunks instead of a single pd.read_sql call
        dflist = []
        for chunk in pd.read_sql(sql=compartment_query, con=self.conn, chunksize=10000):
            dflist.append(chunk)
        df = pd.concat(dflist)
```
```
Traceback (most recent call last):
  File "profiling-recipe/profiles/profiling_pipeline.py", line 28, in <module>
    process_profile(batch=batch, plate=plate, pipeline=pipeline)
  File "/home/ubuntu/ebs_tmp/work/projects/workflow_demo/workspace/software/workflow_demo_data/profiling-recipe/profiles/profile.py", line 124, in process_profile
    aggregate_args=aggregate_args,
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pycytominer/cyto_utils/cells.py", line 661, in aggregate_profiles
    compute_counts=True,
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pycytominer/cyto_utils/cells.py", line 433, in aggregate_compartment
    self.load_compartment(compartment=compartment),
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pycytominer/cyto_utils/cells.py", line 392, in load_compartment
    df = pd.read_sql(sql=compartment_query, con=self.conn)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/io/sql.py", line 516, in read_sql
    chunksize=chunksize,
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/io/sql.py", line 1313, in read_query
    parse_dates=parse_dates,
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/io/sql.py", line 124, in _wrap_result
    frame = DataFrame.from_records(data, columns=columns, coerce_float=coerce_float)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/frame.py", line 1895, in from_records
    mgr = arrays_to_mgr(arrays, arr_columns, result_index, columns)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 95, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1701, in create_block_manager_from_arrays
    blocks = _form_blocks(arrays, names, axes)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1759, in _form_blocks
    float_blocks = _multi_blockify(items_dict["FloatBlock"])
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1852, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "/home/ubuntu/miniconda3/envs/profiling/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1880, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate 7.62 GiB for an array with shape (3463, 295388) and data type float64
```
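(Editor's note, not from the original report: the 7.62 GiB figure is exactly what the reported array shape implies, which suggests pandas was trying to consolidate all float columns into a single contiguous block. A quick check:)

```python
# The failed allocation: a (3463, 295388) array of float64 (8 bytes each).
rows, cols = 3463, 295388
size_gib = rows * cols * 8 / 2**30
print(f"{size_gib:.2f} GiB")  # -> 7.62 GiB
```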
```
certifi==2020.12.5
greenlet==1.1.0
importlib-metadata==4.0.1
joblib==1.0.1
numpy==1.20.2
pandas==1.2.0
pycytominer==0.1
python-dateutil==2.8.1
pytz==2021.1
PyYAML==5.3.1
scikit-learn==0.24.2
scipy==1.6.3
six==1.16.0
SQLAlchemy==1.4.13
threadpoolctl==2.1.0
typing-extensions==3.10.0.0
zipp==3.4.1
```
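(Editor's note: a self-contained sketch of the chunked-read workaround above, using an in-memory SQLite database as a stand-in for the real compartment table; the table and column names are illustrative, not pycytominer's schema.)

```python
import sqlite3

import pandas as pd

# Stand-in for a large compartment table (names are illustrative).
conn = sqlite3.connect(":memory:")
pd.DataFrame({"ObjectNumber": range(100), "Intensity": range(100)}).to_sql(
    "Cells", conn, index=False
)

# Reading in bounded chunks keeps peak memory during the read proportional
# to the chunk size; pd.concat still materializes the full frame at the end.
chunks = pd.read_sql(sql="SELECT * FROM Cells", con=conn, chunksize=10)
df = pd.concat(chunks)
print(df.shape)  # -> (100, 2)
```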
bethac07 added the bug label on May 6, 2021

gwaybio commented May 7, 2021

ouch, that's not great. Thanks for tracking this down and reporting!

It looks like this error comes from process_profile() in the profiling recipe. Do you have a sense if the leak originates from the recipe or the load_compartment() pycytominer call?


bethac07 commented May 7, 2021

This is the total amount of tracing I did, so I can't say for sure. My naive assumption is that, since load_compartment just returns its dataframe, it must happen during that step; otherwise adding the chunksize wouldn't have helped. But I don't have data to back that up.


bethac07 commented Dec 7, 2022

I think this can be closed.


gwaybio commented Dec 9, 2022

@d33bs - any objections to closing this?


d33bs commented Dec 9, 2022

> @d33bs - any objections to closing this?

Thanks @gwaybio - I really appreciated getting to read through these notes (thanks as well to @bethac07 for the great details here) in the context of some recent inspirations. I'm wondering whether chunked DuckDB SQL queries that join metadata and compartment data would outperform pandas dataframe merges of `select *` results. This may not cover all scenarios but could provide scalability for certain bottlenecks.

I imagine recent updates have resolved some, but maybe not all, of the resource-based failures. Even when we don't hit failures/exceptions, I feel reducing the time these actions take could be helpful for developers. That said, maybe we could close this issue and explore these aspects further within #198?
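(Editor's note: the pushed-down-join idea can be illustrated without DuckDB. A minimal stdlib-only sketch, using hypothetical metadata/compartment tables: the join runs inside the SQLite engine and rows stream out in bounded batches, instead of loading two full tables and merging them in pandas.)

```python
import sqlite3

# Hypothetical tables standing in for per-image metadata and per-object
# compartment data; the real pycytominer schema differs.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Image (TableNumber INTEGER PRIMARY KEY, Metadata_Well TEXT);
    CREATE TABLE Cells (TableNumber INTEGER, Intensity REAL);
    INSERT INTO Image VALUES (1, 'A01'), (2, 'A02');
    INSERT INTO Cells VALUES (1, 0.5), (1, 0.7), (2, 0.9);
    """
)

# The join happens inside the database engine; fetchmany() streams the
# result in bounded batches rather than materializing it all at once.
cur = conn.execute(
    "SELECT i.Metadata_Well, c.Intensity "
    "FROM Cells c JOIN Image i USING (TableNumber) "
    "ORDER BY i.TableNumber, c.Intensity"
)
rows = []
while batch := cur.fetchmany(2):
    rows.extend(batch)
print(rows)  # -> [('A01', 0.5), ('A01', 0.7), ('A02', 0.9)]
```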


gwaybio commented Dec 9, 2022

> maybe we could close this issue and explore these aspects further within #198?

I was thinking this too (and possibly Beth was as well) - I just wanted to check that everything is captured in that issue (which I think it is!)


gwaybio commented Dec 9, 2022

Thanks @d33bs !
