Ensure reasonable performance with large CSV datasets #8
Hi @gwaybio - would you have any suggestions in terms of large CSV dataset(s) to benchmark CSV data handling for pycytominer-transform? I've been using the "Human HT29" example CSV output from CellProfiler for early work here, but those files are relatively small in both number and size.
Hi Dave, all the data you could ever want can be found here! Data sets range in size from a couple of plates to several hundred.
Fantastic, thank you so much @bethac07!
Hi @bethac07 - thanks again for the
Great question!
I was very excited to test this out! I've included my notes below – hope this helps!

**Python setup**

```shell
sudo apt remove pipenv
pip install pipenv
pipenv install --python /usr/bin/python3.10
pipenv shell
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10
```

**Test dataset**

**Install package**

```shell
git clone git@github.com:cytomining/pycytominer-transform.git
cd pycytominer-transform
pip install -e .
```

**Run**

```python
from pycytominer_transform import convert

convert(
    source_path="s3://cellpainting-gallery/test-cpg0016-jump/source_4/workspace/analysis/2021_04_26_Batch1/BR00117035/analysis/",
    source_datatype="csv",
    dest_path=".",
    dest_datatype="parquet",
    concat=True,
    no_sign_request=True,
)
```

**Error log**

Related issue: PrefectHQ/prefect#7277

**Config**

```shell
python --version
# Python 3.10.9
git rev-parse --short HEAD
# e55424d
git branch -v
# * main e55424d [behind 1] Add Initial CSV to Parquet Conversion, Documentation, and Workflows
sqlite3 --version
# 3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837balt1
uname -r
# 5.15.0-1028-aws
```
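As context for the run above: with `concat=True`, many per-plate CSVs that share a header are merged into a single output. That merge step alone can be sketched with only the standard library (a minimal illustration, not the project's implementation; the column names and data are made up):

```python
import csv
import io


def concat_csvs(csv_texts):
    """Concatenate CSV files that share a header: keep the header row
    from the first file, then append only data rows from every file."""
    out = io.StringIO()
    writer = None
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        header = next(reader)
        if writer is None:
            writer = csv.writer(out)
            writer.writerow(header)
        for row in reader:
            writer.writerow(row)
    return out.getvalue()


# Two hypothetical per-plate CSV fragments with an identical header.
a = "ImageNumber,ObjectNumber\n1,1\n1,2\n"
b = "ImageNumber,ObjectNumber\n2,1\n"
combined = concat_csvs([a, b])
print(combined)
```

A real implementation would additionally need to verify that the headers actually match before appending rows.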
Thank you @shntnu for trying this out and for the great details! Very sorry to see the test failed. I'll look into this and follow up.
I am so excited to see what you've built here!
Closing this issue as we've seen reasonable performance so far, and because our primary discussions about performance will continue in https://github.com/cytomining/CytoTable-benchmarks.
In order to provide the best utility to Pycytominer users, it's crucial that we ensure reasonable performance when ingesting large amounts of CSV data (whether measured by number of files, size of files, or both). This issue is intended to help guide this repo toward reasonable performance expectations (computing resources and time).
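One way to anchor "reasonable performance expectations" in numbers is a small parse-throughput micro-benchmark. The sketch below uses only the standard library; the dataset shape and column names are invented for illustration and are not from the project:

```python
import csv
import io
import time


def benchmark_csv_parse(n_rows, n_cols):
    """Generate an in-memory CSV of the given shape, then time a full
    pass with the stdlib csv reader, returning (data rows, seconds)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([f"Feature_{i}" for i in range(n_cols)])
    for r in range(n_rows):
        writer.writerow([r * c for c in range(n_cols)])

    start = time.perf_counter()
    parsed = list(csv.reader(io.StringIO(buf.getvalue())))
    elapsed = time.perf_counter() - start
    return len(parsed) - 1, elapsed


rows, secs = benchmark_csv_parse(n_rows=10_000, n_cols=20)
print(f"parsed {rows} rows in {secs:.3f}s ({rows / secs:,.0f} rows/s)")
```

Scaling `n_rows`/`n_cols` toward realistic plate sizes would give a rough per-machine baseline to compare conversion runs against.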