
Unrecognized file repository pointer for private dataset in ebrains #58

alexisthual opened this issue Feb 27, 2023 · 26 comments

@alexisthual
Contributor

Hi!

First, thank you for the nice extension 😊
We (@bthirion @ferponcem @man-shu @ymzayek) are interested in downloading this dataset from ebrains: https://search.kg.ebrains.eu/instances/07ab1665-73b0-40c5-800e-557bc319109d

Although we authenticated with ``export KG_AUTH_TOKEN=`datalad ebrains-authenticate` ``, we still could not get the following command to work: `datalad ebrains-clone 07ab1665-73b0-40c5-800e-557bc319109d ibc-test`

The traceback is the following:

[INFO   ] scanning for unlocked files (this may take some time)
ebrains-clone(impossible): [Unrecognized file repository pointer https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d]
save(ok): . (dataset)
action summary:
  ebrains-clone (impossible: 1)
  save (ok: 1)

Maybe we're missing something here. Happy to contribute to the docs if someone can help us find a solution to this!
Thanks

@alexisthual
Contributor Author

We tried the same command today and got a code 500 error: `[ERROR ] Error: code=500 message='Internal Server Error' uuid=None`

@mih
Member

mih commented Mar 6, 2023

Hey, thanks for giving it a go!

re #58 (comment): it looks a bit as if this was attempted with code prior to 736f542 -- if that is true, then updating to the most recent dev snapshot should fix this particular issue. Please let me know.

We tried the same command today and got a code 500 error:

Sadly, this looks like #36 -- there is no fix that I am aware of other than time. This situation typically lasts for a few days, and then the query endpoint (I assume) comes back to life.

I can replicate the behavior you are seeing. The error is happening here:

 94                 dv = omcore.DatasetVersion.from_id(id, self.client)
 95                 target_version = dv.uuid
 96                 # determine the Dataset from the DatasetVersion we got
 97  >>           ds = omcore.Dataset.list(self.client, versions=dv)[0]

where `dv` is a `DatasetVersion()` instance with `dv.uuid='07ab1665-73b0-40c5-800e-557bc319109d'`.

As you can see, both queries in the snippet run through fairgraph; the first one succeeds, the second one causes an HTTP 500.

However, as this is only happening occasionally (albeit still annoyingly frequently), neither datalad-ebrains nor fairgraph seems to be at fault here (at least given my superficial understanding).

Maybe you could consider bringing this up in https://github.com/HumanBrainProject/fairgraph/issues, or some ebrains support channel?

@mih
Member

mih commented Mar 6, 2023

Oh, looking at the test runs of #59 from Mar 3, it seems that the outage is already a few days long. That is the longest observed so far.

@alexisthual
Contributor Author

alexisthual commented Mar 6, 2023

Thank you for the pointers, I'll try using the latest version!

Just to let you know, we also tried accessing the aforementioned dataset today using siibra and it worked well. My shallow understanding is that it also uses fairgraph under the hood, so it was hard for us to fully understand what the real problem was here.

@apdavison

apdavison commented Mar 7, 2023

I've looked into this a bit with Oliver Schmid, the KG product owner. It seems likely that this problem originates because datalad is talking to the pre-production KG server (kg-ppd). This is the default for fairgraph (the motivation being that people should test their scripts against PPD before running against the production server), but this is not well documented, for which I apologise.

The fix would be here: https://github.com/datalad/datalad-ebrains/blob/main/datalad_ebrains/fairgraph_query.py#L34

`self.client = KGClient(host="core.kg.ebrains.eu")`

@alexisthual
Contributor Author

Oh great, thanks a lot for the investigation!
Should someone open a PR with this fix?

@mih
Member

mih commented Mar 7, 2023

Thx @apdavison for determining the cause!

@alexisthual If you want to prep a quick PR that would be much appreciated. I have put it on my TODO otherwise. TIA!

@alexisthual
Contributor Author

alexisthual commented Mar 7, 2023

Unfortunately, even when I use the latest commits pushed on main, the problem described in this issue is still there.

I tried different commands:

- using the dataset id directly: `datalad ebrains-clone 07ab1665-73b0-40c5-800e-557bc319109d ibc-test`
- using the dataset url: `datalad ebrains-clone https://search.kg.ebrains.eu/instances/07ab1665-73b0-40c5-800e-557bc319109d ibc-test`
- using the url which appears in the error message: `datalad ebrains-clone https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d ibc-test`

All 3 commands yield the same error as the one I reported in the first message of this issue:

[INFO   ] scanning for unlocked files (this may take some time) 
ebrains-clone(impossible): [Unrecognized file repository pointer https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d]
save(ok): . (dataset)
action summary:
  ebrains-clone (impossible: 1)
  save (ok: 1)

Moreover, trying to access the link present in the error from my browser yields

{"status_code":401,"detail":"You are not authenticated. This resource might require authenticated access - please retry with providing an authentication header."}

which is probably normal since I didn't explicitly provide a token.
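For what it's worth, the three invocations above differ only in how the dataset UUID is wrapped. A small helper (hypothetical, not part of datalad-ebrains) that normalizes any of the three forms to the bare UUID could look like this:

```python
import re

# matches a bare UUID anywhere in the input, including after a 'd-' bucket prefix
_UUID = re.compile(
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}')

def normalize_dataset_id(spec: str) -> str:
    """Extract the dataset UUID from a bare ID, KG search URL, or data-proxy bucket URL."""
    match = _UUID.search(spec)
    if match is None:
        raise ValueError(f"no dataset UUID found in {spec!r}")
    return match.group(0)

# all three forms from the commands above yield the same UUID
for spec in (
    "07ab1665-73b0-40c5-800e-557bc319109d",
    "https://search.kg.ebrains.eu/instances/07ab1665-73b0-40c5-800e-557bc319109d",
    "https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d",
):
    assert normalize_dataset_id(spec) == "07ab1665-73b0-40c5-800e-557bc319109d"
```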

@mih
Member

mih commented Mar 8, 2023

Thanks for looking into it. I had a closer look, and the dataset's files are hosted "behind" the human data gateway. To my knowledge, there is no programmatic way to access such data directly. It involves requesting access by clicking a button on the web UI, receiving an email, and clicking a link in that email.

Because of these complications, I had not attempted to check whether programmatic access is possible afterwards (also because the access permissions only last for 24h, so testing such functionality on CI is not easily possible).

I have now requested and received access to this dataset, and will have a look.

mih added a commit that referenced this issue Mar 8, 2023
This change is merely adding the ability to recognize and
process non-public dataset data-proxy URLs.

However, it is not enough to support such datasets, because
the underlying `fairgraph` query to get a dataset's file listing
returns no results.

The query is essentially this

```py
batch = omcore.File.list(
    self.client,
    file_repository=dvr,
    size=chunk_size,
    from_index=cur_index)
```

and for the dataset referenced in
#58 it returns an empty
list with

- a properly authenticated `client`
- `dvr`: `FileRepository(name='buckets/d-07ab1665-73b0-40c5-800e-557bc319109d', iri=IRI(https://data-proxy.ebrains.eu/api/v1/buckets/d-07ab1665-73b0-40c5-800e-557bc319109d)...`
- `chunk_size`: 10000
- `cur_index`: 0

With the same requesting account, I can browser-visit
https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d
and see a file listing.
@mih
Member

mih commented Mar 8, 2023

I have posted #61 with a code sketch and my findings. After rectifying superficial issues in datalad-ebrains code, the next blocker is an empty fairgraph report on the files contained in this dataset.

If you happened to have any insight in this, please let me know. Thx!

@alexisthual
Contributor Author

Thanks for looking into this Michael!
I was actually able to fetch the dataset through siibra so I thought there should be a way for us to do this with datalad-ebrains.
In particular, I could get the list of available files and fetch one file.
Should we try to get some inspiration from what they do? I'm happy to schedule a call if you think that'd be of any help!

@mih
Member

mih commented Mar 8, 2023

AFAIK siibra also uses fairgraph. So it should be possible. I am currently not able to commit to a call, but if you can point people here, we should be able to figure it out asynchronously. thx!

@ymzayek

ymzayek commented Mar 8, 2023

Hello, I'm not sure how siibra depends on fairgraph, but it uses the method `siibra.fetch_ebrains_token()` to produce a link where you can authenticate; you can then pass the dataset id to `siibra.retrieval.repositories.EbrainsHdgConnector()` and use another method to search the files. When I did this a few days ago, I remember that after the second step I got an email where I had to click a link to get access. I just tried again now and didn't have to do this step, so I'm not sure how long access lasts after this step, but it seems to be more than 24 hours. I am looping @dickscheid in here because maybe he can clarify how siibra does this and its relationship with fairgraph and the datalad extension.

Full code to reproduce the data fetching described above:

import siibra
from siibra.retrieval.repositories import EbrainsHdgConnector

siibra.fetch_ebrains_token()

dataset_id = "07ab1665-73b0-40c5-800e-557bc319109d"  # the ID is the last part of the URL
conn = EbrainsHdgConnector(dataset_id)
conn.search_files()
data_file = "resulting_smooth_maps/sub-01/ses-14/sub-01_ses-14_task-MTTWE_dir-ap_space-MNI152NLin2009cAsym_desc-preproc_ZMap-we_all_event_response.nii.gz"
img = conn.get(data_file)

@mih
Member

mih commented Mar 10, 2023

@ymzayek Thanks for the code snippet. That is very helpful. We should be able to reuse the auth-setup in datalad-ebrains also for calling out to siibra. I will check what it is doing and if that does not inform a code change in the fairgraph call, we could simply employ siibra for this case.

I am not sure whether a non-public data-proxy bucket link always implies the human data gateway, but until we discover counter-evidence, this may be good enough.

@mih
Member

mih commented Mar 10, 2023

So looking at https://github.com/FZJ-INM1-BDA/siibra-python/blob/908f118f87ec83def2970d9a526f29f49482e2bc/siibra/retrieval/repositories.py#L354-L449 I see that siibra queries the data-proxy directly, and does not go through the knowledge graph! It does go through the KG for public datasets.

Now I am wondering: we could do the same thing. Moreover, doing it not only for non-public datasets, like the example here, but for any data-proxy-accessible dataset may actually solve #52. If that is true, it would boost overall performance by quite a bit!

@alexisthual
Contributor Author

@mih, @ymzayek and I are interested in looking into this further, but it feels a bit hard to dive into this codebase on our own.
However, we'd be happy to schedule a peer-coding session with you some day soon if you are interested in that too!
Otherwise, we can try to deal with this asynchronously, but we'll need some of your guidance haha

@mih
Member

mih commented Mar 14, 2023

@alexisthual @ymzayek That would be wonderful. We have a regular Zoom call for such things on Tuesdays at 8:30 CET. If this works for you, that would be the easiest, and @dickscheid would also be in that call.

@alexisthual
Contributor Author

Nice! 8:30 am might be a bit early (the office is rather far haha) but I think I can try and make it next Tuesday 🙂

@ymzayek

ymzayek commented Mar 14, 2023

I think I should be able to make it for next Tuesday as well.

@mih
Member

mih commented Mar 14, 2023

Awesome! Apologies for the timing. This is pretty much 11am if-there-would-be-nothing-stupid-to-do o'clock. Please shoot me an email at michael.hanke@gmail.com, and I will send you a zoom link. Thx for your interest!

@mih
Member

mih commented Mar 21, 2023

#61 has progressed a bit with today's meeting, but is not yet in a functional state.

@dickscheid pointed out that the HDG documentation should have all the missing information:
https://wiki.ebrains.eu/bin/view/Collabs/data-proxy/Human%20Data%20Gateway/

It might require a dedicated implementation of a downloader. This should be fairly straightforward with the UrlOperations framework from datalad-next.

@alexisthual
Contributor Author

Hi @mih and @dickscheid !
Do you have some time in the coming days or weeks to work some more on this? We'd happily participate in a new peer-coding session if it can help.

@alexisthual
Contributor Author

Hi!

We've (@man-shu @ferponce @bthirion) tried using datalad-next directly with URLs from the EBRAINS data-proxy (which, from our understanding, allows users to use URLs directly), but did not succeed in getting the data.
We have also tried to access a bucket directly from the data proxy, but it seems that our dataset does not have a bucket yet (and we don't know if it'd be useful).

We did not try to integrate these changes in datalad-ebrains but would happily participate in a peer-coding session if that sounds useful!

@alexisthual
Contributor Author

Hi @mih!
I see there has not been much movement on this repo for the past few weeks.
We are still interested in this feature but could not get it to work on our end 😊
The HBP is coming to an end, and I don't know if you'll spend more time on this repo; in any case, let us know if we can help with anything!

mih added a commit that referenced this issue Jul 14, 2023
@mih
Member

mih commented Jul 14, 2023

I had the chance to work on this again. #61 refactors the code to allow for interacting with the data proxy API directly. Moreover, it switches access for publicly hosted datasets that are accessible via the DP to use that API too.

I could not get the authentication flow for private dataset access via the HDG to work -- neither in code, nor with https://data-proxy.ebrains.eu/api/docs

I use my EBRAINS token to authenticate. When I POST to /datasets/{dataset_id} to initiate the HDG flow, as instructed at https://wiki.ebrains.eu/bin/view/Collabs/data-proxy/Human%20Data%20Gateway/, I get

{
  "status_code": 401,
  "detail": "User not authenticated properly. The token needs to access the 'roles', 'email', 'team' and 'profile' scope."
}

The corresponding GET request fails (as expected) with

{
  "status_code": 401,
  "detail": "Access has expired, please request access again",
  "can_request_access": true
}

This makes me think that either the EBRAINS session token is the wrong credential here, or that my particular account is insufficient, or I am missing a crucial step in the authorization flow.

@alexisthual if you can get a file listing of a HDG dataset via https://data-proxy.ebrains.eu/api/docs please let me know how, and I am confident that I can achieve the rest.
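For reference, the two data-proxy calls described above (POST to initiate the HDG flow, GET to list the files) can be sketched with only the standard library. This is an untested sketch that just constructs the requests; it does not send them, and it does not address the token-scope problem reported above:

```python
from urllib.request import Request

BASE = "https://data-proxy.ebrains.eu/api/v1"
DATASET = "07ab1665-73b0-40c5-800e-557bc319109d"

def hdg_request(method: str, token: str) -> Request:
    # POST initiates the HDG access flow (triggers the confirmation email);
    # GET on the same endpoint lists the dataset files once access is granted
    return Request(
        f"{BASE}/datasets/{DATASET}",
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )

req = hdg_request("POST", "<EBRAINS access token>")
# send with urllib.request.urlopen(req) once a token with the
# 'roles', 'email', 'team' and 'profile' scopes is available
```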

@ymzayek

ymzayek commented Jul 17, 2023

Not sure this is helpful, but I also tried this. From the browser I can access this private dataset: https://data-proxy.ebrains.eu/datasets/07ab1665-73b0-40c5-800e-557bc319109d (authorized through login and email link). Then, using that same token:

curl -X 'POST' \
  'https://data-proxy.ebrains.eu/api/v1/datasets/07ab1665-73b0-40c5-800e-557bc319109d' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -d ''

I get

{
  "error": "Error accessing userinfo"
}

And I get the same response with a GET request.
