syncing an upstream gzip file with an expanded local version #45
There are several different requests here; it would be helpful to have separate discussions for each of:

1. Auto-decompression
2. Ensure + `sqlite.connect`
3. Ensure + some SQLAlchemy functionality (not clear what you're asking for)

I'd suggest taking a look at a place where I already implemented something like this for ChEMBL's tar.gz'd SQLite dump: https://github.com/cthoyt/chembl-downloader/blob/d7100ba316f496ee4c36a2a684a2d9434391eb9c/src/chembl_downloader/api.py#L164-L183. Unfortunately, connecting to a gzipped file from SQLite is a paid-only feature; otherwise it would be great to read them directly. There's also already https://pystow.readthedocs.io/en/latest/api/pystow.ensure_untar.html, so I guess I could make an analog for gzip.
Let's do anything SQLite-specific in another issue. Something analogous to `ensure_untar` would be great; I'll take a look at the ChEMBL downloader later.
yeah okay I think the solution is to have an
@cmungall solution is now available like:

```python
import pandas as pd
import pystow

if __name__ == "__main__":
    sql = "SELECT * FROM entailed_edge LIMIT 10"
    url = "https://s3.amazonaws.com/bbop-sqlite/hp.db.gz"
    with pystow.ensure_open_sqlite_gz("test", url=url) as conn:
        df = pd.read_sql(sql, conn)
        print(df)
```
This will ensure faster downloads for large databases; it also means we can start phasing out the direct .db deposits and save s3 space. Utilizes cthoyt/pystow#45 Note that the gzipped file is not cached in ~/.data - but the gunzipped file is. It is necessary for the user to periodically clean this cache. See cthoyt/pystow#54
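The periodic cache cleanup mentioned above might look like the following sketch. The `prune_cache` helper and the directory layout are assumptions for illustration, not part of pystow's API, and a temporary directory stands in for the real `~/.data`:

```python
# Hypothetical sketch of pruning decompressed databases that this style of
# caching leaves behind; prune_cache is an illustrative name only.
import tempfile
from pathlib import Path


def prune_cache(root: Path, pattern: str = "*.db") -> list[Path]:
    """Delete cached decompressed databases under ``root`` and report them."""
    removed = []
    for path in sorted(root.rglob(pattern)):
        path.unlink()
        removed.append(path)
    return removed


if __name__ == "__main__":
    # Simulate a cache directory instead of touching the real ~/.data
    root = Path(tempfile.mkdtemp())
    (root / "hp.db").write_bytes(b"stale decompressed dump")
    print([p.name for p in prune_cache(root)])  # ['hp.db']
```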
pystow has methods for syncing with a gzipped file from a URL and dynamically opening it, but if my upstream file is a gzipped SQLite database (e.g. https://s3.amazonaws.com/bbop-sqlite/hp.db.gz), then I need it to be uncompressed in my ~/.data folder before I make a connection to it (the same may hold for things like OWL).

I can obviously do this trivially, but it would require introspecting paths and would seem to defeat the point of having an abstraction layer.

For now I am putting duplicative .db and .db.gz files on S3, and only using the former with pystow, but I would like to migrate away from distributing the uncompressed versions.
What I am imagining is:
Does that make sense?
As an aside, it may also be useful to have specific ensure methods for sqlite and/or sqlalchemy the same way you have for pandas.
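The "ensure + sqlite.connect" idea could, as a minimal sketch, be a context manager that yields a connection to an already-ensured local file and closes it afterwards. The `ensure_sqlite` name is hypothetical, shown over an in-memory database rather than a real cached download:

```python
# Hedged sketch of an "ensure + sqlite.connect" helper, analogous to the
# existing ensure methods for pandas; ensure_sqlite is illustrative only.
import sqlite3
from contextlib import contextmanager


@contextmanager
def ensure_sqlite(path: str):
    """Yield a connection to an (already ensured) local SQLite file."""
    conn = sqlite3.connect(path)
    try:
        yield conn
    finally:
        conn.close()


if __name__ == "__main__":
    with ensure_sqlite(":memory:") as conn:  # stand-in for a cached .db path
        conn.execute("CREATE TABLE entailed_edge (s TEXT, o TEXT)")
        conn.execute("INSERT INTO entailed_edge VALUES ('a', 'b')")
        print(conn.execute("SELECT count(*) FROM entailed_edge").fetchone()[0])  # 1
```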