Ensure reasonable performance with large CSV datasets #8

Closed
d33bs opened this issue Sep 24, 2022 · 9 comments
Assignees: d33bs
Labels: enhancement (New feature or request)

Comments

@d33bs
Member

d33bs commented Sep 24, 2022

In order to provide the best utility to Pycytominer users, it's crucial that we ensure reasonable performance when ingesting large amounts of CSV data (whether by number of files, size of files, or both). This issue is intended to help guide this repo toward reasonable performance expectations (computing resources and time).
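
For concreteness, a minimal timing sketch (not part of this repo) for reading a single CellProfiler CSV export with pyarrow; "Cells.csv" is a hypothetical local path to be substituted with any large CSV of interest.

# Minimal benchmarking sketch; assumes pyarrow is installed and "Cells.csv"
# is a placeholder path for a large CellProfiler-style CSV.
import time

import pyarrow.csv as pacsv

start = time.perf_counter()
table = pacsv.read_csv("Cells.csv")   # single-file read into an Arrow table
elapsed = time.perf_counter() - start
print(f"rows={table.num_rows} cols={table.num_columns} seconds={elapsed:.2f}")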

d33bs added the enhancement (New feature or request) label Sep 24, 2022
d33bs self-assigned this Sep 24, 2022
@d33bs
Member Author

d33bs commented Sep 24, 2022

Hi @gwaybio - would you have any suggestions in terms of large CSV dataset(s) to benchmark CSV data handling for pycytominer-transform? I've been using the "Human HT29" example CSV output from CellProfiler for early work here, but those files are relatively small in number and size.

@bethac07
Member

Hi Dave,

All the data you could ever want can be found here! Data sets range in size from a couple of plates to several hundred.

https://registry.opendata.aws/cellpainting-gallery/
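
For anyone browsing, a hedged sketch of listing the top-level dataset prefixes in the public bucket with boto3; it assumes anonymous (unsigned) access is allowed for this bucket, as is typical for the AWS Open Data registry.

# Hedged sketch: list top-level dataset prefixes in the public
# cellpainting-gallery bucket using anonymous (unsigned) S3 access.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
response = s3.list_objects_v2(Bucket="cellpainting-gallery", Delimiter="/")
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])   # one prefix per dataset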

@d33bs
Member Author

d33bs commented Sep 26, 2022

Fantastic, thank you so much @bethac07!

@d33bs
Member Author

d33bs commented Sep 29, 2022

Hi @bethac07 - thanks again for the cellpainting-gallery data references; they have been very helpful! Using those data as a reference, what is typically used as an input for pycytominer work? I notice there are analysis files broken down into compartments ("Cells", "Cytoplasm", etc.) and also backend files which appear to be aggregations. If both are used, is there a preference for one or the other for certain purposes? I'm wondering if I should target one format over the other for work in this repo. (cc @gwaybio)

@bethac07
Member

Great question!

  • The analysis files are what come directly from CellProfiler, and are what we hope will eventually be the direct target for pycytominer-transform in new experiments going forward. It is much more important to be performant on these.
  • The backend aggregated SQLite files we would like to eventually convert to whatever format we end up with here (Parquet or whatever). We have a lot of them from 15 years of experiments, so it would be nice if converting them all over were not terrible, but since they're already available in aggregated form, they are much less important. (A minimal read sketch contrasting the two formats follows below.)
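
A minimal read sketch contrasting the two formats, assuming local copies; "Cells.csv" and "backend.sqlite" are hypothetical example paths, and the "Cells" table name follows the per-compartment naming used in the backend files.

# Hedged sketch: reading one compartment from each format with pandas.
# "Cells.csv" and "backend.sqlite" are placeholder local paths.
import sqlite3

import pandas as pd

# Analysis output: one CSV per compartment, straight from CellProfiler.
cells_from_csv = pd.read_csv("Cells.csv")

# Backend output: a single SQLite file with per-compartment tables.
conn = sqlite3.connect("backend.sqlite")
cells_from_sqlite = pd.read_sql("SELECT * FROM Cells", conn)
conn.close()

print(cells_from_csv.shape, cells_from_sqlite.shape)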

@shntnu
Member

shntnu commented Jan 31, 2023

I was very excited to test this out! I've included my notes below – hope this helps!

Python setup

sudo apt remove pipenv                                        # remove the distro-packaged pipenv
install pipenv                                                # reinstall pipenv (exact method not recorded; e.g. pip install --user pipenv)
pipenv install --python /usr/bin/python3.10                   # create a Python 3.10 virtual environment
pipenv shell                                                  # activate the environment
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10    # ensure pip is available within the environment

Test dataset

aws s3 ls --recursive s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/|grep csv
2023-01-30 22:05:41    2497224 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Cells.csv
2023-01-30 22:05:42    2431302 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Cytoplasm.csv
2023-01-30 22:05:42     361721 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Experiment.csv
2023-01-30 22:05:42     105210 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Image.csv
2023-01-30 22:05:43    2378879 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Nuclei.csv
2023-01-30 22:05:44    8301088 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Cells.csv
2023-01-30 22:05:44    8026097 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Cytoplasm.csv
2023-01-30 22:05:45     361721 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Experiment.csv
2023-01-30 22:05:45     105251 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Image.csv
2023-01-30 22:05:45    7920455 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Nuclei.csv
2023-01-30 22:06:04    1980355 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Cells.csv
2023-01-30 22:06:04    1932478 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Cytoplasm.csv
2023-01-30 22:06:05     361721 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Experiment.csv
2023-01-30 22:06:05     105204 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Image.csv
2023-01-30 22:06:05    1886212 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Nuclei.csv
2023-01-30 22:06:06    8551391 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Cells.csv
2023-01-30 22:06:07    8274429 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Cytoplasm.csv
2023-01-30 22:06:07     361723 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Experiment.csv
2023-01-30 22:06:07     105383 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Image.csv
2023-01-30 22:06:08    8154002 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Nuclei.csv

Install package

git clone git@github.com:cytomining/pycytominer-transform.git
cd pycytominer-transform
pip install -e .

Run

from pycytominer_transform import convert

convert(
    source_path="s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/",
    source_datatype="csv",
    dest_path=".",
    dest_datatype="parquet",
    concat=True,
    no_sign_request=True,
)

Error log

22:07:21.433 | INFO    | prefect.engine - Created flow run 'hopping-snake' for flow 'to-parquet'
22:07:21.579 | INFO    | Flow run 'hopping-snake' - Created subflow run 'military-dingo' for flow 'gather-records'
22:07:21.631 | INFO    | Flow run 'military-dingo' - Created task run 'build_path-505e517e-0' for task 'build_path'
22:07:21.631 | INFO    | Flow run 'military-dingo' - Executing 'build_path-505e517e-0' immediately...
22:07:21.736 | INFO    | Task run 'build_path-505e517e-0' - Finished in state Completed()
22:07:21.762 | INFO    | Flow run 'military-dingo' - Created task run 'get_source_filepaths-b02baff7-0' for task 'get_source_filepaths'
22:07:21.763 | INFO    | Flow run 'military-dingo' - Executing 'get_source_filepaths-b02baff7-0' immediately...
22:07:22.887 | INFO    | Task run 'get_source_filepaths-b02baff7-0' - Finished in state Completed()
22:07:22.908 | INFO    | Flow run 'military-dingo' - Created task run 'infer_source_datatype-8e1a10f3-0' for task 'infer_source_datatype'
22:07:22.908 | INFO    | Flow run 'military-dingo' - Executing 'infer_source_datatype-8e1a10f3-0' immediately...
22:07:22.953 | INFO    | Task run 'infer_source_datatype-8e1a10f3-0' - Finished in state Completed()
22:07:22.976 | INFO    | Flow run 'military-dingo' - Created task run 'filter_source_filepaths-7fbae088-0' for task 'filter_source_filepaths'
22:07:22.977 | INFO    | Flow run 'military-dingo' - Executing 'filter_source_filepaths-7fbae088-0' immediately...
22:07:24.263 | INFO    | Task run 'filter_source_filepaths-7fbae088-0' - Finished in state Completed()
22:07:24.289 | INFO    | Flow run 'military-dingo' - Finished in state Completed()
22:07:24.383 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-2' for task 'read_file'
22:07:24.384 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-2' for execution.
22:07:24.403 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-6' for task 'read_file'
22:07:24.404 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-6' for execution.
22:07:24.431 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-2' for task 'write_parquet'
22:07:24.432 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-2' for execution.
22:07:25.220 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-4' for task 'read_file'
22:07:25.221 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-4' for execution.
22:07:28.333 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-10' for task 'read_file'
22:07:28.334 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-10' for execution.
22:07:31.568 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-1' for task 'read_file'
22:07:31.569 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-1' for execution.
22:07:34.815 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-10' for task 'write_parquet'
22:07:34.816 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-10' for execution.
22:07:41.747 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-6' for task 'write_parquet'
22:07:41.748 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-6' for execution.
22:07:53.852 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-0' for task 'read_file'
22:07:53.853 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-0' for execution.
22:07:55.301 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-1' for task 'write_parquet'
22:07:55.302 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-1' for execution.
22:08:06.304 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-15' for task 'read_file'
22:08:06.305 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-15' for execution.
22:08:12.840 | INFO    | Task run 'read_file-518a7dde-10' - Finished in state Completed()
22:08:16.472 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-8' for task 'read_file'
22:08:16.473 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-8' for execution.
22:08:21.693 | INFO    | Task run 'read_file-518a7dde-2' - Finished in state Completed()
22:08:24.641 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-0' for task 'write_parquet'
22:08:24.642 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-0' for execution.
22:08:24.880 | ERROR   | prefect.orion - Encountered exception in request:
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 108, in execute
    self._adapt_connection._handle_exception(error)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 236, in _handle_exception
    raise error
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 90, in execute
    self.await_(_cursor.execute(operation, parameters))
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 68, in await_only
    return current.driver.switch(awaitable)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 121, in greenlet_spawn
    value = await result
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/aiosqlite/cursor.py", line 37, in execute
    await self._execute(self._cursor.execute, sql, parameters)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/aiosqlite/cursor.py", line 31, in _execute
    return await self._conn._execute(fn, *args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/aiosqlite/core.py", line 137, in _execute
    return await future
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/aiosqlite/core.py", line 110, in run
    result = function()
sqlite3.OperationalError: database is locked

Related issue: PrefectHQ/prefect#7277
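
One possible (unverified) mitigation for the lock contention, assuming a Prefect 2.x "Orion"-era install like the one in the log above: point the Orion API database at something other than the default on-disk SQLite file, for example a local Postgres instance. The setting name below is the Orion-era one, and the connection URL and credentials are placeholders.

# Hedged sketch, not a verified fix: point Prefect's Orion database at Postgres
# so concurrent task runs don't contend for a single SQLite file lock.
# Assumes a Prefect 2.x "Orion"-era release and a locally running Postgres
# with a "prefect" database; the URL/credentials are placeholders.
import os

os.environ["PREFECT_ORION_DATABASE_CONNECTION_URL"] = (
    "postgresql+asyncpg://prefect:prefect@localhost:5432/prefect"
)

from pycytominer_transform import convert  # import after the setting is applied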

Config

python --version
# Python 3.10.9

git rev-parse --short HEAD
# e55424d

git branch -v
# * main e55424d [behind 1] Add Initial CSV to Parquet Conversion, Documentation, and Workflows

sqlite3 --version
# 3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837balt1

uname -r
# 5.15.0-1028-aws

@d33bs
Member Author

d33bs commented Jan 31, 2023

Thank you @shntnu for trying this out and for the great details! Very sorry to see the test failed. I'll look into this and follow up.

@shntnu
Member

shntnu commented Jan 31, 2023

> Very sorry to see the test failed. I'll look into this and follow up.

I am so excited to see what you've built here!

@gwaybio
Member

gwaybio commented Jul 26, 2023

Closing this issue as we've seen reasonable performance so far, and because our primary discussions about performance will continue in https://github.com/cytomining/CytoTable-benchmarks

gwaybio closed this as completed Jul 26, 2023