syncing an upstream gzip file with an expanded local version #45

Closed
cmungall opened this issue Jul 11, 2022 · 4 comments · Fixed by #47

@cmungall

pystow has methods for syncing a gzipped file from a URL and dynamically opening it,

but if my upstream file is a gzipped SQLite database (e.g. https://s3.amazonaws.com/bbop-sqlite/hp.db.gz), then I need it to be uncompressed in my ~/.data folder before I make a connection to it (the same may hold for things like OWL).

I can obviously do this trivially myself, but it would require introspecting paths and would seem to defeat the point of having an abstraction layer.

For now I am putting duplicative .db and .db.gz files on S3 and only using the former with pystow, but I would like to migrate away from distributing the uncompressed versions.

What I am imagining is:

import pystow
from sqlite3 import connect

url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
path = pystow.ensure("oaklib", "sqlite", url=url, decompress=True)  # decompress=True is the proposed option
conn = connect(f"file://{path}", uri=True)

Does that make sense?

As an aside, it may also be useful to have specific ensure methods for SQLite and/or SQLAlchemy, the same way you have for pandas.
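
For illustration, a minimal sketch of what such a SQLite helper could look like on top of the existing pystow.ensure (the name ensure_sqlite_connection is hypothetical, not part of pystow's API):

import sqlite3

import pystow

def ensure_sqlite_connection(*subkeys: str, url: str) -> sqlite3.Connection:
    # Hypothetical helper: download the database file if it is missing,
    # then open a standard sqlite3 connection to the cached local copy.
    path = pystow.ensure(*subkeys, url=url)
    return sqlite3.connect(path)

# e.g. conn = ensure_sqlite_connection("oaklib", "sqlite", url="https://s3.amazonaws.com/bbop-sqlite/hp.db")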

@cthoyt (Owner) commented Jul 12, 2022

There are several different requests here; it would be helpful to have separate discussions for each of:

  1. Auto-decompression
  2. Ensure + sqlite.connect (done in Add functions for ensuring SQLite #46)
  3. Ensure + some SQLAlchemy functionality (not clear what you're asking for)

I'd suggest taking a look at a place where I already implemented something like this for ChEMBL's tar.gz'd SQLite dump: https://github.com/cthoyt/chembl-downloader/blob/d7100ba316f496ee4c36a2a684a2d9434391eb9c/src/chembl_downloader/api.py#L164-L183. Unfortunately, connecting to a gzipped file from SQLite is a paid-only feature; otherwise it would be great to read them directly.
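
The pattern there is roughly the following sketch (a paraphrase under assumptions, not the actual chembl-downloader code; the function name, member name, and paths are illustrative):

import sqlite3
import tarfile
from pathlib import Path

def connect_from_targz(archive: Path, member: str, directory: Path) -> sqlite3.Connection:
    # Sketch: extract the .db member from the tar.gz archive once,
    # then connect to the extracted file on subsequent calls.
    target = directory / member
    if not target.is_file():
        with tarfile.open(archive, mode="r:gz") as tar:
            tar.extract(member, path=directory)
    return sqlite3.connect(target)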

There's also already https://pystow.readthedocs.io/en/latest/api/pystow.ensure_untar.html, so I guess I could make an analog for gzip.
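
A minimal sketch of such a gzip analog, assuming the same ensure-then-cache behavior as ensure_untar (the name ensure_gunzip matches the one discussed below, but the details here are assumptions rather than pystow's actual implementation; see #47):

import gzip
import shutil
from pathlib import Path

import pystow

def ensure_gunzip(*subkeys: str, url: str) -> Path:
    # Sketch: download the .gz file via pystow, then decompress it next to
    # the download if a decompressed copy does not already exist.
    gz_path = pystow.ensure(*subkeys, url=url)
    out_path = gz_path.with_suffix("")  # e.g. hp.db.gz -> hp.db
    if not out_path.is_file():
        with gzip.open(gz_path, "rb") as src, out_path.open("wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path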

@cmungall (Author) commented Jul 12, 2022 via email

@cthoyt (Owner) commented Jul 12, 2022

Yeah, okay, I think the solution is to have an ensure_gunzip function and then double-wrap the ensure-SQLite and ensure_gunzip functions to get what you wanted.
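
That composition could look something like this sketch (assuming an ensure_gunzip as above; the real implementation, ensure_open_sqlite_gz, is shown in the next comment):

import sqlite3
from contextlib import contextmanager

import pystow

@contextmanager
def open_sqlite_gz(*subkeys: str, url: str):
    # Sketch of the double wrap: ensure the gunzipped file exists locally,
    # then yield a SQLite connection and close it when the block exits.
    conn = sqlite3.connect(pystow.ensure_gunzip(*subkeys, url=url))
    try:
        yield conn
    finally:
        conn.close()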

@cthoyt (Owner) commented Jul 25, 2022

@cmungall the solution is now available like this:

import pandas as pd

import pystow

if __name__ == "__main__":
    sql = "SELECT * FROM entailed_edge LIMIT 10"
    url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
    with pystow.ensure_open_sqlite_gz("test", url=url) as conn:
        df = pd.read_sql(sql, conn)
    print(df)

cmungall added a commit to INCATools/ontology-access-kit that referenced this issue Oct 17, 2022
This will ensure faster downloads for large databases;
it also means we can start phasing out the direct .db deposits
and save s3 space.

Utilizes cthoyt/pystow#45

Note that the gzipped file is not cached in ~/.data, but the
gunzipped file is. It is necessary for the user to periodically
clean this cache.
See cthoyt/pystow#54