
Unrecognized file repository pointer for private dataset in ebrains #58

alexisthual opened this issue Feb 27, 2023 · 26 comments

@alexisthual
Contributor

Hi!

First, thank you for the nice extension 😊
We (@bthirion @ferponcem @man-shu @ymzayek) are interested in downloading this dataset from ebrains: https://search.kg.ebrains.eu/instances/07ab1665-73b0-40c5-800e-557bc319109d

Although we authenticated with ``export KG_AUTH_TOKEN=`datalad ebrains-authenticate` ``, we still could not get the following command to work: `datalad ebrains-clone 07ab1665-73b0-40c5-800e-557bc319109d ibc-test`

The traceback is the following:

[INFO   ] scanning for unlocked files (this may take some time)
ebrains-clone(impossible): [Unrecognized file repository pointer https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d]
save(ok): . (dataset)
action summary:
  ebrains-clone (impossible: 1)
  save (ok: 1)

Maybe we're missing something here. Happy to contribute to the docs if someone can help us find a solution to this!
Thanks

@alexisthual
Contributor Author

We tried the same command today and got a code 500 error: `[ERROR ] Error: code=500 message='Internal Server Error' uuid=None`

@mih
Member

mih commented Mar 6, 2023

Hey, thanks for giving it a go!

re #58 (comment): it looks a bit as if this was attempted with code prior to 736f542 -- if that is true, then updating to the most recent dev snapshot should fix this particular issue. Please let me know.

We tried the same command today and got a code 500 error:

Sadly, this looks like #36 -- there is no fix that I am aware of other than time. This situation typically lasts for a few days, and then the query endpoint (I assume) comes back to life.

I can replicate the behavior you are seeing. The error is happening here:

 94                 dv = omcore.DatasetVersion.from_id(id, self.client)
 95                 target_version = dv.uuid
 96                 # determine the Dataset from the DatasetVersion we got
 97  >>           ds = omcore.Dataset.list(self.client, versions=dv)[0]

where `dv` is a `DatasetVersion()` instance with `dv.uuid='07ab1665-73b0-40c5-800e-557bc319109d'`.

As you can see, both queries in the snippet run through fairgraph; the first one succeeds, the second one causes an HTTP 500.

However, as this is only happening occasionally (albeit still annoyingly frequently), neither datalad-ebrains nor fairgraph seems to be at fault here (at least given my superficial understanding).

Maybe you could consider bringing this up in https://github.com/HumanBrainProject/fairgraph/issues, or some ebrains support channel?

@mih
Member

mih commented Mar 6, 2023

Oh, looking at the test runs of #59 from Mar 3, it seems that the outage is already a few days long. That is the longest observed so far.

@alexisthual
Contributor Author

alexisthual commented Mar 6, 2023

Thank you for the pointers, I'll try using the latest version!

Just to let you know, we also tried accessing the aforementioned dataset today using siibra and it worked well. My shallow understanding is that it also uses fairgraph under the hood, so it was hard for us to fully understand what the real problem was here.

@apdavison

apdavison commented Mar 7, 2023

I've looked into this a bit with Oliver Schmid, the KG product owner. It seems likely that this problem originates because datalad is talking to the pre-production KG server (kg-ppd). This is the default for fairgraph (the motivation being that people should test their scripts against PPD before running against the production server), but this is not well documented, for which I apologise.

The fix would be here: https://github.com/datalad/datalad-ebrains/blob/main/datalad_ebrains/fairgraph_query.py#L34

`self.client = KGClient(host="core.kg.ebrains.eu")`

@alexisthual
Contributor Author

Oh great, thanks a lot for the investigation!
Should someone open a PR with this fix?

@mih
Member

mih commented Mar 7, 2023

Thx @apdavison for determining the cause!

@alexisthual If you want to prep a quick PR that would be much appreciated. I have put it on my TODO otherwise. TIA!

@alexisthual
Contributor Author

alexisthual commented Mar 7, 2023

Unfortunately, even when I use the latest commits pushed on main, the problem described in this issue is still there.

I tried different commands:

- using the dataset id directly: `datalad ebrains-clone 07ab1665-73b0-40c5-800e-557bc319109d ibc-test`
- using the dataset url: `datalad ebrains-clone https://search.kg.ebrains.eu/instances/07ab1665-73b0-40c5-800e-557bc319109d ibc-test`
- using the url which appears in the error message: `datalad ebrains-clone https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d ibc-test`

All 3 commands yield the same error as the one I reported in the first message of this issue:

[INFO   ] scanning for unlocked files (this may take some time) 
ebrains-clone(impossible): [Unrecognized file repository pointer https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d]
save(ok): . (dataset)
action summary:
  ebrains-clone (impossible: 1)
  save (ok: 1)

Moreover, trying to access the link present in the error from my browser yields

{"status_code":401,"detail":"You are not authenticated. This resource might require authenticated access - please retry with providing an authentication header."}

which is probably normal since I didn't explicitly provide a token.
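For what it's worth, the three invocations above differ only in how the dataset UUID is wrapped. A small helper (hypothetical, not part of datalad-ebrains) that normalizes any of the three forms to the bare UUID could look like this:

```python
import re

# matches a bare UUID anywhere in the input, including after a 'd-' bucket prefix
_UUID = re.compile(
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}')

def normalize_dataset_id(spec: str) -> str:
    """Extract the dataset UUID from a bare ID, KG search URL, or data-proxy bucket URL."""
    match = _UUID.search(spec)
    if match is None:
        raise ValueError(f"no dataset UUID found in {spec!r}")
    return match.group(0)

# all three forms from the commands above yield the same UUID
for spec in (
    "07ab1665-73b0-40c5-800e-557bc319109d",
    "https://search.kg.ebrains.eu/instances/07ab1665-73b0-40c5-800e-557bc319109d",
    "https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d",
):
    assert normalize_dataset_id(spec) == "07ab1665-73b0-40c5-800e-557bc319109d"
```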

@mih
Member

mih commented Mar 8, 2023

Thanks for looking into it. I had a closer look, and the dataset's files are hosted "behind" the human data gateway. To my knowledge, there is no programmatic way to access such data directly. It involves requesting access by clicking a button on the web UI, receiving an email, and clicking a link in that email.

Because of these complications, I had not attempted to check whether programmatic access is possible afterwards (also because the access permissions only last for 24h, so testing such functionality on CI is not easily possible).

I have now requested and received access to this dataset, and will have a look.

mih added a commit that referenced this issue Mar 8, 2023
This change is merely adding the ability to recognize and
process non-public dataset data-proxy URLs.

However, it is not enough to support such datasets, because
the underlying `fairgraph` query to get a dataset's file listing
returns no results.

The query is essentially this

```py
batch = omcore.File.list(
    self.client,
    file_repository=dvr,
    size=chunk_size,
    from_index=cur_index)
```

and for the dataset referenced in
#58 it returns an empty
list with

- a properly authenticated `client`
- `dvr`: `FileRepository(name='buckets/d-07ab1665-73b0-40c5-800e-557bc319109d', iri=IRI(https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d)...`
- `chunk_size`: 10000
- `cur_index`: 0

With the same requesting account, I can browser-visit
https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d
and see a file listing.
@mih
Member

mih commented Mar 8, 2023

I have posted #61 with a code sketch and my findings. After rectifying superficial issues in datalad-ebrains code, the next blocker is an empty fairgraph report on the files contained in this dataset.

If you happened to have any insight in this, please let me know. Thx!

@alexisthual
Contributor Author

Thanks for looking into this Michael!
I was actually able to fetch the dataset through siibra so I thought there should be a way for us to do this with datalad-ebrains.
In particular, I could get the list of available files and fetch one file.
Should we try to get some inspiration from what they do? I'm happy to schedule a call if you think that'd be of any help!

@mih
Member

mih commented Mar 8, 2023

AFAIK siibra also uses fairgraph. So it should be possible. I am currently not able to commit to a call, but if you can point people here, we should be able to figure it out asynchronously. thx!

@ymzayek

ymzayek commented Mar 8, 2023

Hello, I'm not sure how siibra depends on fairgraph, but it uses the method `siibra.fetch_ebrains_token()` to produce a link where you can authenticate; you can then pass the dataset id to `siibra.retrieval.repositories.EbrainsHdgConnector()` and use another method to search the files. When I did this a few days ago, I remember that after the second step I got an email where I had to click a link to get access. I just tried again now and didn't have to do this step, so I'm not sure how long access lasts after this step, but it seems to be more than 24 hours. I am looping @dickscheid in here because maybe he can clarify how siibra does this and its relationship with fairgraph and the datalad extension.

Full code to reproduce the data fetching described above:

import siibra
from siibra.retrieval.repositories import EbrainsHdgConnector

siibra.fetch_ebrains_token()

dataset_id = "07ab1665-73b0-40c5-800e-557bc319109d"  # the ID is the last part of the URL
conn = EbrainsHdgConnector(dataset_id)
conn.search_files()
data_file = "resulting_smooth_maps/sub-01/ses-14/sub-01_ses-14_task-MTTWE_dir-ap_space-MNI152NLin2009cAsym_desc-preproc_ZMap-we_all_event_response.nii.gz"
img = conn.get(data_file)

@mih
Member

mih commented Mar 10, 2023

@ymzayek Thanks for the code snippet. That is very helpful. We should be able to reuse the auth-setup in datalad-ebrains also for calling out to siibra. I will check what it is doing and if that does not inform a code change in the fairgraph call, we could simply employ siibra for this case.

I am not sure whether a non-public data-proxy bucket link always implies the human data gateway, but until we discover counter-evidence, this may be good enough.

@mih
Member

mih commented Mar 10, 2023

So looking at https://github.com/FZJ-INM1-BDA/siibra-python/blob/908f118f87ec83def2970d9a526f29f49482e2bc/siibra/retrieval/repositories.py#L354-L449 I see that siibra queries the data-proxy directly, and does not go through the knowledge graph! It does go through the KG for public datasets.

Now I am wondering: we could do the same thing. Moreover, doing it not only for non-public datasets, like the example here, but for any data-proxy-accessible dataset may actually solve #52. If that is true, it would boost overall performance by quite a bit!

@alexisthual
Contributor Author

@mih, @ymzayek and I are interested in looking into this further, but it feels a bit hard to dive into this codebase on our own.
However, we'd be happy to schedule a peer-coding session with you some day soon if you are interested in that too!
Otherwise, we can try to deal with this asynchronously, but we'll need some of your guidance haha

@mih
Member

mih commented Mar 14, 2023

@alexisthual @ymzayek That would be wonderful. We have a regular Zoom call for such things on Tuesdays at 8:30 CET. If this works for you, that would be the easiest, and @dickscheid would also be in that call.

@alexisthual
Contributor Author

Nice! 8:30 am might be a bit early (the office is rather far haha) but I think I can try and make it next Tuesday 🙂

@ymzayek

ymzayek commented Mar 14, 2023

I think I should be able to make it for next Tuesday as well.

@mih
Member

mih commented Mar 14, 2023

Awesome! Apologies for the timing. This is pretty much 11am if-there-would-be-nothing-stupid-to-do o'clock. Please shoot me an email at michael.hanke@gmail.com, and I will send you a zoom link. Thx for your interest!

@mih
Member

mih commented Mar 21, 2023

#61 has progressed a bit with today's meeting, but is not yet in a functional state.

@dickscheid pointed out that the HDG documentation should have all the missing information:
https://wiki.ebrains.eu/bin/view/Collabs/data-proxy/Human%20Data%20Gateway/

It might require a dedicated implementation of a downloader. This should be fairly straightforward with the UrlOperations framework from datalad-next.

@alexisthual
Contributor Author

Hi @mih and @dickscheid !
Do you have some time in the coming days or weeks to work some more on this? We'd happily participate in a new peer-coding session if it can help.

@alexisthual
Contributor Author

Hi!

We've (@man-shu @ferponce @bthirion) tried using datalad-next directly with URLs from the EBRAINS data-proxy (which, from our understanding, allows users to use URLs directly), but did not succeed in getting the data.
We have also tried to access a bucket directly from the data proxy, but it seems that our dataset does not have a bucket yet (and we don't know if it'd be useful).

We did not try to integrate these changes in datalad-ebrains but would happily participate in a peer-coding session if that sounds useful!

@alexisthual
Contributor Author

Hi @mih!
I see there has not been much movement on this repo for the past few weeks.
We are still interested in this feature but could not get it to work on our end 😊
The HBP is coming to an end, and I don't know if you'll spend more time on this repo; in any case, let us know if we can help with anything!

mih added a commit that referenced this issue Jul 14, 2023
@mih
Member

mih commented Jul 14, 2023

I had the chance to work on this again. #61 refactors the code to allow for interacting with the data proxy API directly. Moreover, it switches access for publicly hosted datasets that are accessible via the DP to use that API too.

I could not get the authentication flow for private dataset access via the HDG to work -- neither in code, nor with https://data-proxy.ebrains.eu/api/docs

I use my EBRAINS token to authenticate. When I POST to /datasets/{dataset_id} to initiate the HDG flow, as instructed at https://wiki.ebrains.eu/bin/view/Collabs/data-proxy/Human%20Data%20Gateway/, I get

{
  "status_code": 401,
  "detail": "User not authenticated properly. The token needs to access the 'roles', 'email', 'team' and 'profile' scope."
}

The corresponding GET request fails (as expected) with

{
  "status_code": 401,
  "detail": "Access has expired, please request access again",
  "can_request_access": true
}

This makes me think that either the EBRAINS session token is the wrong credential here, or that my particular account is insufficient, or I am missing a crucial step in the authorization flow.

@alexisthual if you can get a file listing of a HDG dataset via https://data-proxy.ebrains.eu/api/docs please let me know how, and I am confident that I can achieve the rest.
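For reference, the two data-proxy calls described above (POST to initiate the HDG flow, GET to list the files) can be sketched with only the standard library. This is an untested sketch that just constructs the requests; it does not send them, and it does not address the token-scope problem reported above:

```python
from urllib.request import Request

BASE = "https://data-proxy.ebrains.eu/api/v1"
DATASET = "07ab1665-73b0-40c5-800e-557bc319109d"

def hdg_request(method: str, token: str) -> Request:
    # POST initiates the HDG access flow (triggers the confirmation email);
    # GET on the same endpoint lists the dataset files once access is granted
    return Request(
        f"{BASE}/datasets/{DATASET}",
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )

req = hdg_request("POST", "<EBRAINS access token>")
# send with urllib.request.urlopen(req) once a token with the
# 'roles', 'email', 'team' and 'profile' scopes is available
```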

@ymzayek

ymzayek commented Jul 17, 2023

Not sure this is helpful, but I also tried this. From the browser I can access this private dataset: https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d (authorized through login and email link). Then, using that same token:

curl -X 'POST' \
  'https://data-proxy.ebrains.eu/api/v1/datasets/07ab1665-73b0-40c5-800e-557bc319109d' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d ''

I get

{
  "error": "Error accessing userinfo"
}

And I get the same response with a GET request.
