
Switch to more stable test data path for CellLocations tests #375

Open · kenibrewer opened this issue Mar 11, 2024 · 5 comments

@kenibrewer (Member) commented Mar 11, 2024:
Erin's suggestion in https://github.com/broadinstitute/cellpainting-gallery/issues/80#issuecomment-1986375614 was to move away from JUMP, since it is the most likely to change in the future. Just checking whether you've seen this advice.

If you're looking for an alternative file, I think this one sounds reasonable (and likely more stable)

s3://cellpainting-gallery/cpg0017-rohban-pathways/broad/workspace/backend/2013_10_11_SIGMA2_Pilot/41744/41744.sqlite

Originally posted by @gwaybio in #374 (comment)

@d33bs mentioned this issue Mar 12, 2024
@kenibrewer (Member, Author) commented:

@d33bs Was this issue addressed as part of #374 or is it still outstanding?

@d33bs (Member) commented Apr 2, 2024:

@kenibrewer thanks for checking, this isn't yet resolved. We're using the following file:

$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite --no-sign-request --human-readable
2022-10-02 07:47:03    3.6 GiB BR00126114.sqlite

I also just noticed there's a reference to the previous file in the readme and some other spots (my apologies) (search link here).

@d33bs (Member) commented Apr 8, 2024:

I looked into this briefly today and noticed the file s3://cellpainting-gallery/cpg0017-rohban-pathways/broad/workspace/backend/2013_10_11_SIGMA2_Pilot/41744/41744.sqlite is bigger than that of s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite (see below):

$ aws s3 ls s3://cellpainting-gallery/cpg0016-jump/source_4/workspace/backend/2021_08_23_Batch12/BR00126114/BR00126114.sqlite --no-sign-request --human-readable
2022-10-02 07:47:03    3.6 GiB BR00126114.sqlite
$ aws s3 ls s3://cellpainting-gallery/cpg0017-rohban-pathways/broad/workspace/backend/2013_10_11_SIGMA2_Pilot/41744/41744.sqlite --no-sign-request --human-readable
2023-02-16 17:47:43    7.7 GiB 41744.sqlite

This would likely cause the GitHub Actions-based tests to take longer while downloading the data (testing this locally was noticeably slower). Looking through the other data within s3://cellpainting-gallery/cpg0017-rohban-pathways/broad/workspace/backend/2013_10_11_SIGMA2_Pilot, it looks like all of the .sqlite files are ~9-10 GB in storage size.

Would it make sense to instead produce a "shrunken" copy of 41744.sqlite for use in testing? It'd be possible to simulate the S3 object storage through moto as discussed in #374, but this may add complexity to the tests. Alternatively, I wonder if there's a smaller file we could reference instead as part of this work.

@d33bs (Member) commented Apr 24, 2024:

Adding more thoughts here: we could combine capabilities by performing a SQLite in-memory database reduction and using something like s3sqlite to make the data transfer quicker (by reading only the portions of the data we need). I don't know for sure whether this would work given SQLite's constraints. I double-checked BR00126114.sqlite and found it has a JOURNAL_MODE of delete, which means we'd theoretically have compatibility here (s3sqlite cannot process WAL-mode files).

Roughly, I think this would look something like the following. While this might involve less upfront complexity than moto, we'd still face the complexity of s3fs here. The nice part about either option is that we'd avoid having to store the whole file and could port this capability to other remote SQLite datasets as needed.

import apsw
import s3fs
import s3sqlite

# Create an S3 filesystem. Check the s3fs docs for more examples:
# https://s3fs.readthedocs.io/en/latest/
s3 = s3fs.S3FileSystem(
    ...
)

# Register a virtual filesystem that serves SQLite pages directly from S3
s3vfs = s3sqlite.S3VFS(name="s3-vfs", fs=s3)

# Define the S3 location
key_prefix = "bucketname/BR00126114.sqlite"

# Open the remote database read-only through the S3 VFS and query it
with apsw.Connection(
    key_prefix, vfs=s3vfs.name, flags=apsw.SQLITE_OPEN_READONLY
) as conn:

    # extract the table schema from the s3 database
    cursor = conn.execute("SELECT sql FROM sqlite_master WHERE type='table';")

    # qualify each CREATE TABLE statement so it targets the attached
    # in-memory database rather than the read-only remote one
    create_stmts = ";\n".join(
        row[0].replace("CREATE TABLE ", "CREATE TABLE test_BR00126114.", 1)
        for row in cursor.fetchall()
    )

    # form an in-memory database with a limited subset of data from the s3 database
    conn.execute(f"""
    ATTACH DATABASE ':memory:' AS test_BR00126114;

    /* we need to create identical table schema within the new database
    in order for the inserts below to process correctly */
    {create_stmts};

    /* we use limited selections from the S3 database to populate the in-mem database */
    INSERT INTO test_BR00126114.Image SELECT * FROM Image WHERE ImageNumber IN (1,2,3) LIMIT 1000;
    INSERT INTO test_BR00126114.Cells SELECT * FROM Cells WHERE ImageNumber IN (1,2,3) LIMIT 1000;
    ...

    /* leverage VACUUM INTO to export the in-mem data to a local file */
    VACUUM test_BR00126114 INTO 'test_BR00126114.sqlite';
    """)

@d33bs (Member) commented Apr 25, 2024:

Hi @gwaybio and @kenibrewer, could I ask for your thoughts and input on what might be best moving forward for this issue? Some of the options I feel might be possible are (open to others too):

  • Avoid S3-based testing, leaving a gap in test coverage for only that part of the capabilities
  • Find a smaller dataset in an S3 location for testing (< 3 GB)
  • Use a "shrink" script to create a small SQLite file and store it within the repo for testing (with possibly less realistic results), leveraging moto for simulated S3 API interactions
  • Use s3sqlite to perform on-the-fly dataset creation (possibly larger than 2 GB but smaller than 3 GB?) plus moto, balancing data size and dependencies (providing a more realistic dataset through size while avoiding large data storage in the repo)

@d33bs d33bs changed the title Switch to more stable test data path Switch to more stable test data path for CellLocations tests Apr 25, 2024