Ensure reasonable performance with large CSV datasets #8

Closed
d33bs opened this issue Sep 24, 2022 · 9 comments
Assignees: d33bs
Labels: enhancement (New feature or request)

Comments

@d33bs
Member

d33bs commented Sep 24, 2022

In order to provide the best utility to Pycytominer users, it's crucial that we ensure reasonable performance when ingesting large amounts of CSV data (whether by number of files, size of files, or both). This issue is intended to help guide this repo toward reasonable performance expectations (computing resources and time).
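
For concreteness, a minimal timing sketch (not part of this repo) for reading a single CellProfiler CSV export with pyarrow; "Cells.csv" is a hypothetical local path to be substituted with any large CSV of interest.

# Minimal benchmarking sketch; assumes pyarrow is installed and "Cells.csv"
# is a placeholder path for a large CellProfiler-style CSV.
import time

import pyarrow.csv as pacsv

start = time.perf_counter()
table = pacsv.read_csv("Cells.csv")   # single-file read into an Arrow table
elapsed = time.perf_counter() - start
print(f"rows={table.num_rows} cols={table.num_columns} seconds={elapsed:.2f}")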

d33bs added the enhancement (New feature or request) label Sep 24, 2022
d33bs self-assigned this Sep 24, 2022
@d33bs
Member Author

d33bs commented Sep 24, 2022

Hi @gwaybio - would you have any suggestions in terms of large CSV dataset(s) to benchmark CSV data handling for pycytominer-transform? I've been using the "Human HT29" example CSV output from CellProfiler for early work here, but those files are relatively small in number and size.

@bethac07
Member

Hi Dave,

All the data you could ever want can be found here! Data sets range in size from a couple of plates to several hundred.

https://registry.opendata.aws/cellpainting-gallery/
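
For anyone browsing, a hedged sketch of listing the top-level dataset prefixes in the public bucket with boto3; it assumes anonymous (unsigned) access is allowed for this bucket, as is typical for the AWS Open Data registry.

# Hedged sketch: list top-level dataset prefixes in the public
# cellpainting-gallery bucket using anonymous (unsigned) S3 access.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
response = s3.list_objects_v2(Bucket="cellpainting-gallery", Delimiter="/")
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])   # one prefix per dataset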

@d33bs
Member Author

d33bs commented Sep 26, 2022

Fantastic, thank you so much @bethac07!

@d33bs
Member Author

d33bs commented Sep 29, 2022

Hi @bethac07 - thanks again for the cellpainting-gallery data references; they have been very helpful! Using those data as a reference, what is typically used as an input for pycytominer work? I notice there are analysis files broken down into compartments ("Cells", "Cytoplasm", etc.) and also backend files which appear to be aggregations. If both are used, is there a preference for one or the other for certain purposes? I'm wondering if I should target one format over the other for work in this repo. (cc @gwaybio)

@bethac07
Member

Great question!

  • The analysis files are what come directly from CellProfiler, and are what we hope will eventually be the direct target for pycytominer-transform in new experiments going forward. It is much more important to be performant on these.
  • The backend aggregated SQLite files we would like to eventually convert to whatever format we end up with here (Parquet or whatever). We have a lot of them from 15 years of experiments, so it would be nice if converting them all over were not terrible, but since they're already available in aggregated form, they are much less important. (A minimal read sketch contrasting the two formats follows below.)
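
A minimal read sketch contrasting the two formats, assuming local copies; "Cells.csv" and "backend.sqlite" are hypothetical example paths, and the "Cells" table name follows the per-compartment naming used in the backend files.

# Hedged sketch: reading one compartment from each format with pandas.
# "Cells.csv" and "backend.sqlite" are placeholder local paths.
import sqlite3

import pandas as pd

# Analysis output: one CSV per compartment, straight from CellProfiler.
cells_from_csv = pd.read_csv("Cells.csv")

# Backend output: a single SQLite file with per-compartment tables.
conn = sqlite3.connect("backend.sqlite")
cells_from_sqlite = pd.read_sql("SELECT * FROM Cells", conn)
conn.close()

print(cells_from_csv.shape, cells_from_sqlite.shape)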

@shntnu
Member

shntnu commented Jan 31, 2023

I was very excited to test this out! I've included my notes below – hope this helps!

Python setup

sudo apt remove pipenv                                        # remove the distro-packaged pipenv
install pipenv                                                # reinstall pipenv (exact method not recorded; e.g. pip install --user pipenv)
pipenv install --python /usr/bin/python3.10                   # create a Python 3.10 virtual environment
pipenv shell                                                  # activate the environment
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10    # ensure pip is available within the environment

Test dataset

aws s3 ls --recursive s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/|grep csv
2023-01-30 22:05:41    2497224 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Cells.csv
2023-01-30 22:05:42    2431302 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Cytoplasm.csv
2023-01-30 22:05:42     361721 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Experiment.csv
2023-01-30 22:05:42     105210 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Image.csv
2023-01-30 22:05:43    2378879 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-1/Nuclei.csv
2023-01-30 22:05:44    8301088 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Cells.csv
2023-01-30 22:05:44    8026097 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Cytoplasm.csv
2023-01-30 22:05:45     361721 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Experiment.csv
2023-01-30 22:05:45     105251 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Image.csv
2023-01-30 22:05:45    7920455 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A01-2/Nuclei.csv
2023-01-30 22:06:04    1980355 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Cells.csv
2023-01-30 22:06:04    1932478 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Cytoplasm.csv
2023-01-30 22:06:05     361721 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Experiment.csv
2023-01-30 22:06:05     105204 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Image.csv
2023-01-30 22:06:05    1886212 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-1/Nuclei.csv
2023-01-30 22:06:06    8551391 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Cells.csv
2023-01-30 22:06:07    8274429 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Cytoplasm.csv
2023-01-30 22:06:07     361723 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Experiment.csv
2023-01-30 22:06:07     105383 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Image.csv
2023-01-30 22:06:08    8154002 test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/BR00117035-A02-2/Nuclei.csv

Install package

git clone git@github.com:cytomining/pycytominer-transform.git
cd pycytominer-transform
pip install -e .

Run

from pycytominer_transform import convert

convert(
    source_path="s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/",
    source_datatype="csv",
    dest_path=".",
    dest_datatype="parquet",
    concat=True,
    no_sign_request=True,
)

Error log

22:07:21.433 | INFO    | prefect.engine - Created flow run 'hopping-snake' for flow 'to-parquet'
22:07:21.579 | INFO    | Flow run 'hopping-snake' - Created subflow run 'military-dingo' for flow 'gather-records'
22:07:21.631 | INFO    | Flow run 'military-dingo' - Created task run 'build_path-505e517e-0' for task 'build_path'
22:07:21.631 | INFO    | Flow run 'military-dingo' - Executing 'build_path-505e517e-0' immediately...
22:07:21.736 | INFO    | Task run 'build_path-505e517e-0' - Finished in state Completed()
22:07:21.762 | INFO    | Flow run 'military-dingo' - Created task run 'get_source_filepaths-b02baff7-0' for task 'get_source_filepaths'
22:07:21.763 | INFO    | Flow run 'military-dingo' - Executing 'get_source_filepaths-b02baff7-0' immediately...
22:07:22.887 | INFO    | Task run 'get_source_filepaths-b02baff7-0' - Finished in state Completed()
22:07:22.908 | INFO    | Flow run 'military-dingo' - Created task run 'infer_source_datatype-8e1a10f3-0' for task 'infer_source_datatype'
22:07:22.908 | INFO    | Flow run 'military-dingo' - Executing 'infer_source_datatype-8e1a10f3-0' immediately...
22:07:22.953 | INFO    | Task run 'infer_source_datatype-8e1a10f3-0' - Finished in state Completed()
22:07:22.976 | INFO    | Flow run 'military-dingo' - Created task run 'filter_source_filepaths-7fbae088-0' for task 'filter_source_filepaths'
22:07:22.977 | INFO    | Flow run 'military-dingo' - Executing 'filter_source_filepaths-7fbae088-0' immediately...
22:07:24.263 | INFO    | Task run 'filter_source_filepaths-7fbae088-0' - Finished in state Completed()
22:07:24.289 | INFO    | Flow run 'military-dingo' - Finished in state Completed()
22:07:24.383 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-2' for task 'read_file'
22:07:24.384 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-2' for execution.
22:07:24.403 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-6' for task 'read_file'
22:07:24.404 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-6' for execution.
22:07:24.431 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-2' for task 'write_parquet'
22:07:24.432 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-2' for execution.
22:07:25.220 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-4' for task 'read_file'
22:07:25.221 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-4' for execution.
22:07:28.333 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-10' for task 'read_file'
22:07:28.334 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-10' for execution.
22:07:31.568 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-1' for task 'read_file'
22:07:31.569 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-1' for execution.
22:07:34.815 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-10' for task 'write_parquet'
22:07:34.816 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-10' for execution.
22:07:41.747 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-6' for task 'write_parquet'
22:07:41.748 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-6' for execution.
22:07:53.852 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-0' for task 'read_file'
22:07:53.853 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-0' for execution.
22:07:55.301 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-1' for task 'write_parquet'
22:07:55.302 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-1' for execution.
22:08:06.304 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-15' for task 'read_file'
22:08:06.305 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-15' for execution.
22:08:12.840 | INFO    | Task run 'read_file-518a7dde-10' - Finished in state Completed()
22:08:16.472 | INFO    | Flow run 'hopping-snake' - Created task run 'read_file-518a7dde-8' for task 'read_file'
22:08:16.473 | INFO    | Flow run 'hopping-snake' - Submitted task run 'read_file-518a7dde-8' for execution.
22:08:21.693 | INFO    | Task run 'read_file-518a7dde-2' - Finished in state Completed()
22:08:24.641 | INFO    | Flow run 'hopping-snake' - Created task run 'write_parquet-9d9b144f-0' for task 'write_parquet'
22:08:24.642 | INFO    | Flow run 'hopping-snake' - Submitted task run 'write_parquet-9d9b144f-0' for execution.
22:08:24.880 | ERROR   | prefect.orion - Encountered exception in request:
Traceback (most recent call last):
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 108, in execute
    self._adapt_connection._handle_exception(error)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 236, in _handle_exception
    raise error
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 90, in execute
    self.await_(_cursor.execute(operation, parameters))
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 68, in await_only
    return current.driver.switch(awaitable)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 121, in greenlet_spawn
    value = await result
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/aiosqlite/cursor.py", line 37, in execute
    await self._execute(self._cursor.execute, sql, parameters)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/aiosqlite/cursor.py", line 31, in _execute
    return await self._conn._execute(fn, *args, **kwargs)
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/aiosqlite/core.py", line 137, in _execute
    return await future
  File "/home/ubuntu/.local/share/virtualenvs/pycytominer-transform-ckiYrrK9/lib/python3.10/site-packages/aiosqlite/core.py", line 110, in run
    result = function()
sqlite3.OperationalError: database is locked

Related issue: PrefectHQ/prefect#7277
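
One possible (unverified) mitigation for the lock contention, assuming a Prefect 2.x "Orion"-era install like the one in the log above: point the Orion API database at something other than the default on-disk SQLite file, for example a local Postgres instance. The setting name below is the Orion-era one, and the connection URL and credentials are placeholders.

# Hedged sketch, not a verified fix: point Prefect's Orion database at Postgres
# so concurrent task runs don't contend for a single SQLite file lock.
# Assumes a Prefect 2.x "Orion"-era release and a locally running Postgres
# with a "prefect" database; the URL/credentials are placeholders.
import os

os.environ["PREFECT_ORION_DATABASE_CONNECTION_URL"] = (
    "postgresql+asyncpg://prefect:prefect@localhost:5432/prefect"
)

from pycytominer_transform import convert  # import after the setting is applied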

Config

python --version
# Python 3.10.9

git rev-parse --short HEAD
# e55424d

git branch -v
# * main e55424d [behind 1] Add Initial CSV to Parquet Conversion, Documentation, and Workflows

sqlite3 --version
# 3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837balt1

uname -r
# 5.15.0-1028-aws

@d33bs
Member Author

d33bs commented Jan 31, 2023

Thank you @shntnu for trying this out and for the great details! Very sorry to see the test failed. I'll look into this and follow up.

@shntnu
Member

shntnu commented Jan 31, 2023

> Very sorry to see the test failed. I'll look into this and follow up.

I am so excited to see what you've built here!

@gwaybio
Member

gwaybio commented Jul 26, 2023

Closing this issue as we've seen reasonable performance so far, and because our primary discussions about performance will continue in https://github.com/cytomining/CytoTable-benchmarks

gwaybio closed this as completed Jul 26, 2023