Switch query to ebrains-kg-core #33

Closed
mih opened this issue Nov 8, 2022 · 4 comments · Fixed by #35

Comments

@mih
Member

mih commented Nov 8, 2022

So far we have used hand-crafted queries. With ebrains-kg-core available, it makes sense to remove this complication and switch to the standard queries provided by this API wrapper. Here are some examples:

Establish query setup:

from kg_core.kg import kg

# Configure the client builder with token-based authentication, then build the client
k = kg()
k.with_token()
kg_client = k.build()

Get info on a dataset (by ID):

>>> from pprint import pprint
>>> pprint(list(kg_client.instances.get_by_id('e472a8c7-d9f9-4e75-9d0b-b137cecbc6a2').data.keys()))
['https://openminds.ebrains.eu/vocab/repository',
 '@type',
 'https://openminds.ebrains.eu/vocab/keyword',
 'https://openminds.ebrains.eu/vocab/description',
 'https://openminds.ebrains.eu/vocab/preparationDesign',
 'https://openminds.ebrains.eu/vocab/studyTarget',
 'https://core.kg.ebrains.eu/vocab/meta/revision',
 'https://openminds.ebrains.eu/vocab/studiedSpecimen',
 'http://schema.org/identifier',
 'https://openminds.ebrains.eu/vocab/behavioralProtocol',
 'https://openminds.ebrains.eu/vocab/versionIdentifier',
 'https://openminds.ebrains.eu/vocab/fullName',
 'https://openminds.ebrains.eu/vocab/shortName',
 'https://core.kg.ebrains.eu/vocab/meta/space',
 'https://openminds.ebrains.eu/vocab/releaseDate',
 '@id',
 'https://openminds.ebrains.eu/vocab/relatedPublication',
 'https://openminds.ebrains.eu/vocab/ethicsAssessment',
 'https://openminds.ebrains.eu/vocab/type',
 'https://openminds.ebrains.eu/vocab/funding',
 'https://openminds.ebrains.eu/vocab/author',
 'https://openminds.ebrains.eu/vocab/homepage',
 'https://openminds.ebrains.eu/vocab/license',
 'https://openminds.ebrains.eu/vocab/custodian',
 'https://openminds.ebrains.eu/vocab/accessibility',
 'https://openminds.ebrains.eu/vocab/isAlternativeVersionOf',
 'https://openminds.ebrains.eu/vocab/digitalIdentifier',
 'https://openminds.ebrains.eu/vocab/versionInnovation',
 'https://openminds.ebrains.eu/vocab/experimentalApproach',
 'https://openminds.ebrains.eu/vocab/dataType',
 'https://openminds.ebrains.eu/vocab/howToCite',
 'https://openminds.ebrains.eu/vocab/fullDocumentation',
 'https://openminds.ebrains.eu/vocab/technique',
 'https://core.kg.ebrains.eu/vocab/meta/lastReleasedAt',
 'https://core.kg.ebrains.eu/vocab/meta/firstReleasedAt']

Get info on a file repository (by ID, taken from the dataset version record):

>>> pprint(list(kg_client.instances.get_by_id('00932cbe-f90f-4968-a91f-da717c554320').data.keys()))
['https://core.kg.ebrains.eu/vocab/lastSync',
 '@type',
 'https://core.kg.ebrains.eu/vocab/bytes',
 'https://core.kg.ebrains.eu/vocab/meta/revision',
 'https://openminds.ebrains.eu/vocab/hostedBy',
 'https://openminds.ebrains.eu/vocab/IRI',
 'http://schema.org/identifier',
 'https://core.kg.ebrains.eu/vocab/lastModified',
 'https://core.kg.ebrains.eu/vocab/objectsCount',
 'https://core.kg.ebrains.eu/vocab/lastSyncIRI',
 'https://openminds.ebrains.eu/vocab/repositoryType',
 'https://core.kg.ebrains.eu/vocab/lastSeenCount',
 'https://core.kg.ebrains.eu/vocab/meta/space',
 'https://core.kg.ebrains.eu/vocab/public',
 'https://openminds.ebrains.eu/vocab/name',
 'https://openminds.ebrains.eu/vocab/contentTypePattern',
 '@id',
 'https://core.kg.ebrains.eu/vocab/meta/lastReleasedAt',
 'https://core.kg.ebrains.eu/vocab/meta/firstReleasedAt']

At present it is unclear to me how to get the actual content listing of a file repository. The only way I see is to visit the URL given under https://core.kg.ebrains.eu/vocab/lastSyncIRI, which yields an XML-formatted response containing the container's content list. There is likely a better way that does not require a different query/parsing paradigm.
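For reference, the XML route could be handled roughly as follows. This is a minimal sketch that assumes the response follows the OpenStack Swift container-listing layout (a `<container>` element with `<object>` children carrying `<name>` and `<bytes>`). That layout is an assumption, not something confirmed in this thread, and both the function name `parse_container_listing` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_container_listing(xml_text):
    """Return (name, size_in_bytes) pairs for each <object> in the listing.

    Assumes a Swift-style layout: <container><object><name/><bytes/></object>...
    """
    root = ET.fromstring(xml_text)
    entries = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        size = obj.findtext("bytes")
        # Size is optional in this sketch; keep None if the element is missing
        entries.append((name, int(size) if size is not None else None))
    return entries


# Hypothetical sample response, not real data from the KG
SAMPLE = """<container name="example_container">
  <object><name>64/maps.nii.gz</name><bytes>1024</bytes></object>
  <object><name>64/labels_64_dictionary.csv</name><bytes>2048</bytes></object>
</container>"""

listing = parse_container_listing(SAMPLE)
for name, size in listing:
    print(name, size)
```

In real use the input would be the body returned by the lastSyncIRI URL, fetched with whatever HTTP client is already in use.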

@apdavison

Hi Michael,

You might find fairgraph helpful. This is a Python library which builds on top of ebrains-kg-core but also knows about openMINDS schemas. Documentation here: https://fairgraph.readthedocs.io/en/latest

Example for listing repository contents:

In [1]: from fairgraph import KGClient

In [2]: import fairgraph.openminds.core as omcore

In [3]: client = KGClient(host="core.kg.ebrains.eu")

In [4]: dv = omcore.DatasetVersion.from_id("e472a8c7-d9f9-4e75-9d0b-b137cecbc6a2", client)

In [5]: files = omcore.File.list(client, file_repository=dv.repository)

In [6]: for file in files:
   ...:     print(file.iri)
   ...:
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/DataDescriptor-DiFuMo(64-dimensions).pdf
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/labels_64_dictionary.csv
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/Licence-CC-BY.pdf
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/maps.nii.gz
https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/3mm/resampled_maps.nii.gz

In [7]: files[0].download(".", client, accept_terms_of_use=True)
Out[7]: PosixPath('DataDescriptor-DiFuMo(64-dimensions).pdf')

@mih
Member Author

mih commented Nov 9, 2022

@apdavison that looks fantastic! I will take a closer look shortly! Thanks much!

@mih
Member Author

mih commented Nov 21, 2022

Hey @apdavison! I have now had a chance to try this out, and it works just as advertised. Really cool! I particularly like the readily accessible content of properties, e.g. the type of a file hash:

>>> f.hash
Hash(algorithm='MD5', digest='1dc869c088d4ebd615287fb79b5853b2')

I was hoping that you had also come up with a convention for deriving local file paths from the IRIs of files, but I could not find anything related to that. To be more specific:

I can know the repo a file is in:

>>> f.file_repository
KGProxy([<class 'fairgraph.openminds.core.data.file_repository.FileRepository'>], 'https://kg.ebrains.eu/api/instances/00932cbe-f90f-4968-a91f-da717c554320')

I can also know the IRI of the file, pointing into that repo:

>>> f.iri
IRI(https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/ext-d000017_DiFuMoAtlases_pub/64/DataDescriptor-DiFuMo(64-dimensions).pdf)

But there seems to be no standard implementation to derive from this information that a suitable local path could be 64/DataDescriptor-DiFuMo(64-dimensions).pdf.

It seems to require something like "longest-common-prefix". However, this would quickly become messy when a dataset incorporates files from different file repositories (which, as I understand it, is not done yet, but is possible).
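To make the "longest-common-prefix" idea concrete, here is a minimal sketch using only the Python standard library. The function name `relative_paths` and the example.org IRIs are hypothetical; this is not an existing fairgraph feature. It takes the deepest directory shared by all file IRIs as the root and ignores the host entirely, which is exactly where it would break down for files spread over several repositories.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def relative_paths(iris):
    """Map each IRI to a path relative to the deepest directory shared by all IRIs."""
    # Only the URL path is considered; scheme and host are ignored (an assumption)
    paths = [unquote(urlsplit(iri).path) for iri in iris]
    # Compare directories, so a single file does not become its own prefix
    common = posixpath.commonpath([posixpath.dirname(p) for p in paths])
    return {iri: posixpath.relpath(p, common) for iri, p in zip(iris, paths)}


# Hypothetical IRIs mirroring the layout of the listing shown earlier
iris = [
    "https://example.org/v1/AUTH_x/container/64/maps.nii.gz",
    "https://example.org/v1/AUTH_x/container/64/3mm/resampled_maps.nii.gz",
]
result = relative_paths(iris)
for iri, rel in result.items():
    print(rel)
```

Note that when every file sits below the same subdirectory, as with `64/` in the five-file listing above, that subdirectory becomes part of the common prefix and is stripped as well. Recovering `64/...` would need the repository root itself, not just the file IRIs.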

Are you aware of an implementation for that?

Thanks in advance!

Unrelated, but worth noting: fairgraph also exhibits the really slow file repository queries (e.g. 20-30s for a 5-file repo, such as the one demo'ed above). I had suspected that I was somehow doing it suboptimally with my custom queries, but now it looks like a more general issue.

@apdavison

We have some code that does this for CSCS Swift containers, which is the most commonly used file repository in EBRAINS, but nothing for the general case.
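For the Swift case specifically, a path can be derived from the URL structure alone. The following is a rough sketch, not the code referred to above: it assumes the standard OpenStack Swift object URL layout `/v1/{account}/{container}/{object}`, and the helper name `swift_object_path` is made up for illustration.

```python
from urllib.parse import unquote, urlsplit


def swift_object_path(iri):
    """Split a Swift object URL into (container, object path).

    Assumes the path has the form /v1/{account}/{container}/{object}.
    """
    parts = urlsplit(iri).path.lstrip("/").split("/", 3)
    if len(parts) < 4 or not parts[3]:
        raise ValueError(f"not a Swift object URL: {iri}")
    _version, _account, container, obj = parts
    # Decode percent-escapes only after splitting on the structural slashes
    return unquote(container), unquote(obj)


container, local_path = swift_object_path(
    "https://object.cscs.ch/v1/AUTH_4791e0a3b3de43e2840fe46d9dc2b334/"
    "ext-d000017_DiFuMoAtlases_pub/64/DataDescriptor-DiFuMo(64-dimensions).pdf"
)
print(container)
print(local_path)
```

Under that assumption, the object path within the container can serve directly as the local path, and the container name can be used to separate files from different repositories.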
