syncing an upstream gzip file with an expanded local version #45

Closed
cmungall opened this issue Jul 11, 2022 · 4 comments · Fixed by #47

@cmungall

pystow has methods for syncing a gzipped file from a URL and dynamically opening it,

but if my upstream file is a gzipped SQLite database (e.g. https://s3.amazonaws.com/bbop-sqlite/hp.db.gz), then I need it to be uncompressed in my ~/.data folder before I make a connection to it (the same may hold for things like OWL).

I can obviously do this trivially myself, but it would require introspecting paths and would seem to defeat the point of having an abstraction layer.

For now I am putting duplicative .db and .db.gz files on S3 and only using the former with pystow, but I would like to migrate away from distributing the uncompressed versions.

What I am imagining is:

import pystow
from sqlite3 import connect

url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
path = pystow.ensure("oaklib", "sqlite", url=url, decompress=True)  # decompress=True is the proposed option
conn = connect(f"file://{path}", uri=True)

Does that make sense?

As an aside, it may also be useful to have specific ensure methods for SQLite and/or SQLAlchemy, the same way you have for pandas.
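
For illustration, a minimal sketch of what such a SQLite helper could look like on top of the existing pystow.ensure (the name ensure_sqlite_connection is hypothetical, not part of pystow's API):

import sqlite3

import pystow

def ensure_sqlite_connection(*subkeys: str, url: str) -> sqlite3.Connection:
    # Hypothetical helper: download the database file if it is missing,
    # then open a standard sqlite3 connection to the cached local copy.
    path = pystow.ensure(*subkeys, url=url)
    return sqlite3.connect(path)

# e.g. conn = ensure_sqlite_connection("oaklib", "sqlite", url="https://s3.amazonaws.com/bbop-sqlite/hp.db")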

@cthoyt (Owner) commented Jul 12, 2022

There are several different requests here; it would be helpful to have separate discussions for each of:

  1. Auto-decompression
  2. Ensure + sqlite.connect (done in Add functions for ensuring SQLite #46)
  3. Ensure + some SQLAlchemy functionality (not clear what you're asking for)

I'd suggest taking a look at a place where I already implemented something like this for ChEMBL's tar.gz'd SQLite dump: https://github.com/cthoyt/chembl-downloader/blob/d7100ba316f496ee4c36a2a684a2d9434391eb9c/src/chembl_downloader/api.py#L164-L183. Unfortunately, connecting to a gzipped file from SQLite is a paid-only feature; otherwise it would be great to read them directly.
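
The pattern there is roughly the following sketch (a paraphrase under assumptions, not the actual chembl-downloader code; the function name, member name, and paths are illustrative):

import sqlite3
import tarfile
from pathlib import Path

def connect_from_targz(archive: Path, member: str, directory: Path) -> sqlite3.Connection:
    # Sketch: extract the .db member from the tar.gz archive once,
    # then connect to the extracted file on subsequent calls.
    target = directory / member
    if not target.is_file():
        with tarfile.open(archive, mode="r:gz") as tar:
            tar.extract(member, path=directory)
    return sqlite3.connect(target)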

There's also already https://pystow.readthedocs.io/en/latest/api/pystow.ensure_untar.html, so I guess I could make an analog for gzip.
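
A minimal sketch of such a gzip analog, assuming the same ensure-then-cache behavior as ensure_untar (the name ensure_gunzip matches the one discussed below, but the details here are assumptions rather than pystow's actual implementation; see #47):

import gzip
import shutil
from pathlib import Path

import pystow

def ensure_gunzip(*subkeys: str, url: str) -> Path:
    # Sketch: download the .gz file via pystow, then decompress it next to
    # the download if a decompressed copy does not already exist.
    gz_path = pystow.ensure(*subkeys, url=url)
    out_path = gz_path.with_suffix("")  # e.g. hp.db.gz -> hp.db
    if not out_path.is_file():
        with gzip.open(gz_path, "rb") as src, out_path.open("wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path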

@cmungall (Author) commented Jul 12, 2022 via email

@cthoyt (Owner) commented Jul 12, 2022

Yeah, okay, I think the solution is to have an ensure_gunzip function and then double-wrap the ensure-SQLite and ensure_gunzip functions to get what you wanted.
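
That composition could look something like this sketch (assuming an ensure_gunzip as above; the real implementation, ensure_open_sqlite_gz, is shown in the next comment):

import sqlite3
from contextlib import contextmanager

import pystow

@contextmanager
def open_sqlite_gz(*subkeys: str, url: str):
    # Sketch of the double wrap: ensure the gunzipped file exists locally,
    # then yield a SQLite connection and close it when the block exits.
    conn = sqlite3.connect(pystow.ensure_gunzip(*subkeys, url=url))
    try:
        yield conn
    finally:
        conn.close()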

@cthoyt (Owner) commented Jul 25, 2022

@cmungall the solution is now available like this:

import pandas as pd

import pystow

if __name__ == "__main__":
    sql = "SELECT * FROM entailed_edge LIMIT 10"
    url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
    with pystow.ensure_open_sqlite_gz("test", url=url) as conn:
        df = pd.read_sql(sql, conn)
    print(df)

cmungall added a commit to INCATools/ontology-access-kit that referenced this issue Oct 17, 2022
This will ensure faster downloads for large databases;
it also means we can start phasing out the direct .db deposits
and save s3 space.

Utilizes cthoyt/pystow#45

Note that the gzipped file is not cached in ~/.data, but the
gunzipped file is. It is necessary for the user to periodically
clean this cache.
See cthoyt/pystow#54