Towards more user-friendly navigation using aliases and ID concepts #430

Merged
merged 19 commits into from
Apr 27, 2024

jsheunis
Member

@jsheunis jsheunis commented Mar 4, 2024

This PR is in response to #423, the issue contains a detailed walk-through of the concepts being introduced here.

UPDATED 17 April:

1. alias and dataset-id-based navigation

This is the core contribution of this PR. The goal was to be able to have pretty and short URLs for a dataset (i.e. using dataset alias or dataset ID, as opposed to the previous dataset ID+VERSION), and to be able to specify tabs and keywords as query parameters, meaning that such a URL can easily be copied and used as a link to e.g. "all the <insert-category-here> datasets in the catalog". This change also addressed a long-standing open question of whether to specify tabs as route parameters (previous) or query parameters (new).

For navigation using aliases and dataset IDs, the alias metadata file and concept id metadata file were introduced. These files have the format:

{
   "type": "redirect",
   "dataset_id": "...",
   "dataset_version": "..."
}

and they are located at predictable locations based on the ID or ALIAS of the dataset being accessed. This allows the web app to easily navigate using a friendly URL (e.g. https://mycatalog.de/dataset/studyforrest_alias or https://mycatalog.de/dataset/abcd_id), which is then routed internally to the correct ID and VERSION metadata file. Note that the dataset_version field is optional for an alias metadata file, but required for a concept ID metadata file. This means that an alias metadata file can point to a dataset ID in general or to a specific version (i.e. versioned aliases are possible), but a concept ID metadata file should always point to the dataset version that is currently considered canonical. Consequently, the concept ID metadata file has to be updated whenever a new version of a dataset that is considered canonical is added to a catalog.
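The redirect lookup described above could be sketched roughly as follows. This is only an illustration: the in-memory `redirectFiles` map, its `alias/` and `id/` key prefixes, the function name `resolveTarget`, and the version string `v1.0` are hypothetical stand-ins, since the actual file locations and the web app's fetch logic are not shown in this description.

```javascript
// Hypothetical in-memory stand-in for the redirect files served by a catalog.
// The key layout ("alias/<name>", "id/<dataset_id>") is an assumption for this sketch.
const redirectFiles = {
  "alias/studyforrest_alias": { type: "redirect", dataset_id: "abcd_id" },
  "id/abcd_id": { type: "redirect", dataset_id: "abcd_id", dataset_version: "v1.0" },
};

// Resolve an alias or a dataset ID to a concrete { dataset_id, dataset_version }.
function resolveTarget(name, files) {
  const aliasFile = files["alias/" + name];
  if (aliasFile && aliasFile.dataset_version) {
    // Versioned alias: the alias file itself pins the version.
    return { dataset_id: aliasFile.dataset_id, dataset_version: aliasFile.dataset_version };
  }
  // Unversioned alias, or the name already is a dataset ID:
  // fall through to the concept ID file, which must carry a version.
  const datasetId = aliasFile ? aliasFile.dataset_id : name;
  const conceptFile = files["id/" + datasetId];
  if (!conceptFile || !conceptFile.dataset_version) {
    return null; // nothing to redirect to
  }
  return { dataset_id: conceptFile.dataset_id, dataset_version: conceptFile.dataset_version };
}
```

In this sketch an unversioned alias and the bare dataset ID both end up at the canonical version recorded in the concept ID file, while a versioned alias short-circuits that lookup.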

2. Extended query functionality

The core change that allows the URL query string to follow changes caused by the user, e.g. when selecting search tags (i.e. adding more keywords to filter subdatasets) or when selecting a tab to view, is the use of history.replaceState (keeping the path constant and only updating the query string) as opposed to route.push or route.replace. The reason is that the route actions result in a page reload, triggering unnecessary functions and resulting in bad UX, whereas the history.replaceState action (as implemented here) does not trigger a reload.
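A minimal sketch of this idea, assuming the query parameters are called `tab` and `keyword` and that multiple keywords are expressed as repeated `keyword` parameters (the repetition, the function names, and the example tab name are assumptions, not taken from the catalog code):

```javascript
// Build a URL that keeps the path constant and only changes the query string.
function buildUrl(path, selection) {
  const params = new URLSearchParams();
  if (selection.tab) {
    params.set("tab", selection.tab);
  }
  for (const keyword of selection.keywords || []) {
    params.append("keyword", keyword);
  }
  const query = params.toString();
  return query ? path + "?" + query : path;
}

// Update the address bar without triggering a route change or reload.
// Outside a browser (no global `history`) the URL is only computed and returned.
function syncQueryString(path, selection) {
  const url = buildUrl(path, selection);
  if (typeof history !== "undefined" && typeof history.replaceState === "function") {
    history.replaceState(history.state, "", url);
  }
  return url;
}
```

For example, `syncQueryString("/dataset/abcd_id", { tab: "content", keywords: ["fmri"] })` would leave the path untouched and set the query string to `?tab=content&keyword=fmri`.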

3. More changes

Apart from the javascript changes above, the PR also needed to consider how aliases will be assigned in bulk, which led to the need to iterate through the dataset-versions of a catalog. Hence, it includes changes to the following core modules:

  • webcatalog: new methods are added for extracting all IDs in a catalog, and all VERSIONs for respective IDs, including properties like name, alias, and more. This allows users to iterate through a list comprising all catalog datasets (which is necessary for assigning aliases to datasets, and creating alias and ID metadata files for all datasets in a catalog) and allows the 'report' functionality which summarizes some stats of a catalog.
  • node: adds a method for sorting metadata_sources by time, as a proxy for determining a last_updated value for a dataset version, which forms part of the output of the new webcatalog methods for reporting.
  • get: adds the property 'report', allowing a summary report of the catalog to be printed out.
  • utils: changes supporting above methods.
  • tests: minor changes to allow test success

Then, the dataset schema was updated to include the alias field, and name and alias fields were added to the existing metadata of all datasets in the catalog, in preparation for the next step. A script was added (and executed) that uses these new python methods to run through all datasets in the demo catalog to create associated concept ID and alias metadata files.

There are several aspects that would make this functionality compatible with the full pipeline of catalog creation and entry generation and maintenance, that have not been addressed and will not be addressed in this PR:

  • adding alias-related functionality to the catalog-add command, i.e. code that would add an alias metadata file and concept ID file when a dataset record is added to a catalog
  • adding a separate command that would create an alias metadata file and/or concept ID file for a specific dataset(-version) in a catalog
  • supporting a list of versions in the concept id metadata file, which would allow UI functionality like a dropdown list of available versions for the currently displayed dataset (the alternative would be to not maintain any list and let the client application determine this at runtime, e.g. using a last_updated_at field as a proxy for latest if no canonical version is specified via the concept id metadata file).

For the time being, the script at tools/create_alias_concept_metadata.py is provided for use when alias- and concept-id metadata files need to be created for dataset-versions in an existing catalog.

…n a catalog

This includes changes to the following core modules:
- webcatalog: new methods are added for extracting all IDs
in a catalog, and all VERSIONs for respective IDs, including
properties like name, alias, and more. This allows users to
iterate through a list comprising all catalog datasets (which
is necessary for the upcoming 'tree' command) and allows the
'report' functionality which summarizes some stats of a catalog.
- node: adds a method for sorting metadata_sources by time, as
a proxy for determining a 'last_updated' value for a dataset
version, which forms part of the output of the new webcatalog
methods.
- get: adds the property 'report', allowing a summary report
of the catalog to be printed out.
- utils: changes supporting above methods.
- tests: minor changes to allow test success, new tests are TODO.

These commits all contribute to the upcoming changes related
to navigating to a dataset-version via concept ID or via alias.

These are the web-side changes that accompany the previous commits
related to dataset aliases and dataset id concepts. This now allows
users to navigate to e.g. http://mycatalog.de/dataset/<dataset-id>
or to e.g. http://mycatalog.de/dataset/<dataset-alias>. The javascript
will first fetch the concept id file or the alias file, which contains a
redirect pointer to the correct dataset id AND version. Internal navigation
between datasets is still done based on the dataset-id AND dataset-version
URL parameters. The latter now becomes an optional parameter. One aspect
that this commit does not account for yet is the second optional URL parameter
'tab' and its use with the single required URL parameter.

netlify bot commented Mar 4, 2024

Deploy Preview for datalad-catalog ready!

Name Link
🔨 Latest commit 3f3963d
🔍 Latest deploy log https://app.netlify.com/sites/datalad-catalog/deploys/662d2bffdbd3a2000869b8ac
😎 Deploy Preview https://deploy-preview-430--datalad-catalog.netlify.app

@codecov-commenter

codecov-commenter commented Mar 4, 2024

Codecov Report

Attention: Patch coverage is 23.40426% with 36 lines in your changes are missing coverage. Please review.

Project coverage is 83.53%. Comparing base (d8cc3ec) to head (3f3963d).
Report is 3 commits behind head on main.

Files Patch % Lines
datalad_catalog/webcatalog.py 8.00% 23 Missing ⚠️
datalad_catalog/get.py 25.00% 9 Missing ⚠️
datalad_catalog/node.py 33.33% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #430      +/-   ##
==========================================
- Coverage   84.55%   83.53%   -1.03%     
==========================================
  Files          43       43              
  Lines        2862     2897      +35     
==========================================
  Hits         2420     2420              
- Misses        442      477      +35     


@jsheunis
Member Author

jsheunis commented Mar 4, 2024

With the current version of the script at https://github.com/datalad/datalad-catalog/blob/ab2fbeabb51696a534daeb0f43c2e969d8e1a1b7/tools/create_alias_concept_metadata.py, you can apply all necessary updates to a catalog's metadata to allow the new functionality provided by this PR.

For example:

python tools/create_alias_concept_metadata.py --catalog ../abcdj/data-catalog/catalog --aliases abcdj-aliases.tsv

where abcdj-aliases.tsv is a tsv file with the following content:

alias	dataset_id	dataset_version
ocr-PIRA-cohort	db7592d0-6206-5684-a29c-8059a6033241	0.1.0
jumax	3f8a45c0-08fc-479c-b561-cb6f744d2b5c	32f3303308c89b037fe1700547348470539a30cb
movies	0036b2c6-f131-4660-9ef9-945087ad02d3	1139b4c82d7ecd0d9ab25bf4c00ec2c06461d6da
ema_pilot	372e66b3-e654-4a0f-ba4c-6394bf314f2f	d468e9a9455b38b959db34bc5335fd66e42b2884
abcdj	1015ed7c-0a3d-4dfc-9c4f-11fe71673a41	c58f0f563011222c618a58951fa21b61c8eb189b
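To make the mapping from such a row to redirect files concrete, here is a rough JavaScript sketch. It does not reproduce the actual `tools/create_alias_concept_metadata.py` script (which is written in Python and whose internals are not shown here); the function name `redirectsFromTsv` and the `versionedAliases` switch are hypothetical, and only the redirect file format is taken from the PR description.

```javascript
// Turn TSV text (header: alias, dataset_id, dataset_version) into redirect objects.
// Illustration only; this is not the logic of the actual Python script.
function redirectsFromTsv(tsvText, versionedAliases) {
  const [headerLine, ...rows] = tsvText.trim().split("\n");
  const header = headerLine.split("\t");
  return rows.map((row) => {
    const cells = row.split("\t");
    const record = Object.fromEntries(header.map((name, i) => [name, cells[i]]));
    // The concept ID file always carries the (canonical) version.
    const conceptFile = {
      type: "redirect",
      dataset_id: record.dataset_id,
      dataset_version: record.dataset_version,
    };
    // The alias file may omit the version and defer to the concept ID file.
    const aliasFile = { type: "redirect", dataset_id: record.dataset_id };
    if (versionedAliases) {
      aliasFile.dataset_version = record.dataset_version;
    }
    return { alias: record.alias, aliasFile, conceptFile };
  });
}

const exampleTsv = [
  "alias\tdataset_id\tdataset_version",
  "ocr-PIRA-cohort\tdb7592d0-6206-5684-a29c-8059a6033241\t0.1.0",
  "abcdj\t1015ed7c-0a3d-4dfc-9c4f-11fe71673a41\tc58f0f563011222c618a58951fa21b61c8eb189b",
].join("\n");
```

Calling `redirectsFromTsv(exampleTsv, false)` yields one alias redirect and one concept ID redirect per row, with the version recorded only in the concept ID redirect.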

After (or before) these changes have been applied to a catalog, the only other changes necessary are those made to the web assets and schema at:

  • datalad_catalog/catalog/assets/app_component_dataset.js
  • datalad_catalog/catalog/assets/app_globals.js
  • datalad_catalog/catalog/assets/app_router.js
  • datalad_catalog/catalog/schema/jsonschema_dataset.json

@mslw: ☝️ for reference if we want to apply this to SFB1451 catalog

@jsheunis
Member Author

Some more thoughts after revisiting this PR:

  • It feels to me that it makes more sense to assign an alias to an ID concept (i.e. dataset) only, and not to the ID+VERSION. The idea is that the dataset retains an alias, irrespective of continued version bumps. If this is regarded as the way to go, then:
    • the alias metadata file should not be populated with a specific version as it is currently.
    • the ID concept metadata file seems like the appropriate place to include a list of the multiple versions of a dataset that are contained in a catalog
    • the alternative (actually ideal) is not to track available versions explicitly in a file, but rather to determine it at runtime by looking at the metadata directories and files in a dataset ID directory, but we only have a client side application and we're not working with a local filesystem
    • tracking versions in a file would mean that we would also need to update the file both when new versions are added and when existing versions are removed.
  • Still TODO:
    • add functionality to always create/update concept and ID metadata files when metadata is added/updated to the catalog

This falls into the same context as the work on aliasing and concept id
navigation. The point is that any number of combinations between using
an alias or concept id or dataset_id+version, and query tabs and keywords
as query parameters should be possible. An important change is the use of
history.replaceState to update the query string in the URL without updating
the route or reloading it (i.e. without using pushState). This seems to work
fine, but the process of keeping the query string and the selected keywords
and tabs in sync is still not ironed out completely, and hence some navigations
are still a bit buggy (e.g. after navigation to a subdataset, a keyword query
param might still remain, unexpectedly). Note that vue-router's router.push
is not aware of the history.replaceState call. This is not a problem per se,
but needs to be kept track of.
@mih
Member

mih commented Apr 12, 2024

I spot-checked the live demo and it worked nicely. Thanks!

I agree that aliasing for a dataset (unversioned) makes most sense. Conceptually, however, aliased dataset versions also make sense (they are tagged versions).

I have no good concept on the structural consequences of this PR. Would two uncoordinated, parallel updates of a dataset still work, or is there the chance to overwrite information?

If I understood correctly, there is now a need (or at least the ability) to enumerate catalog items. What would you think the maximum number of catalog records could be, in a real world scenario?

@mslw
Collaborator

mslw commented Apr 12, 2024

Some more thoughts after revisiting this PR:

It feels to me that it makes more sense to assign an alias to an ID concept (i.e. dataset) only, and not to the ID+VERSION. The idea is that the dataset retains an alias, irrespective of continued version bumps. If this is regarded as the way to go, then:

  • the alias metadata file should not be populated with a specific version as it is currently.

(...)

I think the "concept" URL would be the most valuable addition (but tagged versions could also be useful, as mih points out). Without looking into the meat of the PR, the file abcdj-aliases.tsv seems to be a reasonable way to get this done. Updates to a dataset (adding metadata for a new version) would require updating the file to keep it pointing at "latest" (unless the version is kept literally at "latest"...). However, this would not be a complicated operation, and we already follow the same procedure whenever the superdataset version progresses (catalog set-super, which changes super.json). I would be willing to pay the price.

I think figuring out at runtime is not necessary (consider this as potential complication: I don't typically update "older" dataset versions, but in general the metadata update time/order does not have to match version time/order).

@jsheunis
Member Author

@mih

I have no good concept on the structural consequences of this PR. Would two uncoordinated, parallel updates of a dataset still work, or is there the chance to overwrite information?

Two uncoordinated, parallel updates of a dataset could still work (I'm assuming "coordinated" would mean that the contributors have to speak/chat to each other in order to make both updates work successfully?) in the way that it currently works, by depending on a standard git-based collaborative workflow.

Also, there is as much chance of overwriting information as there was before this PR. E.g. if two agents have write permissions to the same catalog repo, and they both provide updates to the same property of a dataset, and if both metadata sources have a higher level of priority than the source that populated the existing property value, these two changes could overwrite each other. But this is expected behaviour and source priority is a configuration-level setting per catalog or per dataset.

If I understood correctly, there is now a need (or at least the ability) to enumerate catalog items. What would you think the maximum number of catalog records could be, in a real world scenario?

The ability: yes. The need stems from me wanting to run through all datasets in a catalog to assign aliases. I thought this functionality could be useful in future as well for any type of catalog-dataset-level update.

The maximum number of catalog records for a real world scenario is difficult to answer. It would depend on how the user interface (and code underneath it) deals with delays and whether that translates to user frustration. The bottleneck at the moment is the fetching of subdataset-level metadata when navigating to a dataset. So the limiting factor won't be the total number of datasets in the whole catalog, but rather whether a single dataset has many subdatasets. I think we need to test this to start having a better understanding of the practicalities, and the SFB1451 catalog revamping would be a good first test. But we should stress test it more.

This single-dataset-with-many-subdatasets scenario is kind of built into our current usage of a catalog home page, but isn't strictly required. The alternative is to have a more traditional "data portal" where the front page isn't also a (conceptual) dataset, and is rather used to query across all datasets in the catalog. This is something I think we should discuss when talking about the design of the new catalog toolset.

@jsheunis
Member Author

@mslw

Without looking into the meat of the PR, the file abcdj-aliases.tsv seems to be a reasonable way to get this done. Updates to a dataset (adding metadata for a new version) would require updating the file to keep it pointing at "latest" (unless the version is kept literally at "latest"...). However, this would not be a complicated operation, and we already follow the same procedure whenever the superdataset version progresses (catalog set-super, which changes super.json). I would be willing to pay the price.

Just to be clear, the file abcdj-aliases.tsv is not intended to serve as a continuously maintained and referenced source of aliases for dataset-id-versions of a catalog instance. I suggested its use only when running a script to assign aliases, i.e. once-off usage. Afterwards, normal catalog operation depends on two main aspects:

  1. The alias, which is not referenced from a catalog-central source (such as the 'super.json' file for catalog homepage) but which is found within the so-called "alias metadata file" (which follows the same structure as the concept-id metadata file) and which contains a pointer to the correct dataset ID (which is in turn the concept-id metadata file)
  2. The dataset version that needs to be navigated to. In the current state of the PR, this is specified in the alias metadata file, but I am thinking of making that at least optional (not required as it is currently), and maintaining the correct version in the concept-id metadata file.

I think figuring out at runtime is not necessary (consider this as potential complication: I don't typically update "older" dataset versions, but in general the metadata update time/order does not have to match version time/order).

This is a good point. At the moment the proxy for "latest" is "update time", but like you say these might not be equal. Thinking about the use case for multi-versioned datasets in a single catalog, there are two scenarios:

  1. A dataset is accessed as a specific version, either via a versioned subdataset relation or via direct navigation. Here we do not have to care about what version is considered latest, because the user is interested in this specific version.
  2. A dataset is accessed via the concept-ID or via an (unversioned) alias, meaning no version is specified and should be derived somehow. Currently, from my POV the most sensible way for this to be achieved is to have a kind of "canonical version" in the concept-id file, which could be updated every time a new dataset version is added to the catalog (but does not necessarily have to be). Alternatively, by proxy of update time (or whatever other time-based field we might prefer), the "latest" version could also be implicitly used as the canonical version. In addition to storing the canonical version, the concept-id metadata file could also store a list of other versions (that is, as opposed to determining all versions of a dataset at runtime), for example for easy navigation between versions.
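The two options for deriving a version can be contrasted in a small sketch. The function name `pickVersion`, the shape of the `versions` list, and the example timestamps are assumptions made for illustration; only the `dataset_version` field and the idea of a `last_updated` proxy come from this discussion.

```javascript
// Prefer an explicit canonical version from the concept ID file;
// otherwise fall back to the most recently updated version as a proxy.
function pickVersion(conceptFile, versions) {
  if (conceptFile && conceptFile.dataset_version) {
    return conceptFile.dataset_version;
  }
  if (versions.length === 0) {
    return null;
  }
  let newest = versions[0];
  for (const candidate of versions) {
    if (Date.parse(candidate.last_updated) > Date.parse(newest.last_updated)) {
      newest = candidate;
    }
  }
  return newest.dataset_version;
}

// Hypothetical case where the older version had its metadata updated more recently.
const knownVersions = [
  { dataset_version: "0.1.0", last_updated: "2024-03-01T00:00:00Z" },
  { dataset_version: "0.2.0", last_updated: "2024-02-01T00:00:00Z" },
];
```

With these example values the time-based proxy selects `0.1.0`, which illustrates the concern that update order need not match version order; an explicit canonical pointer to `0.2.0` avoids that.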

That is, an alias file can redirect to a dataset ID without specifying
a dataset version. Importantly, the concept ID metadata file should
always specify a dataset version. This lays the groundwork for allowing
an alias to identify either a dataset OR a dataset-version.
@jsheunis
Member Author

In order to spend the least amount of time necessary on this PR, I propose:

  • fixing the remaining issues with navigation using url query strings (main priority)
  • NOT (at least, not yet) adding alias-related functionality to the catalog-add command, i.e. not adding code that would add an alias metadata file and concept ID file when a dataset record is added to a catalog
  • NOT (at least, not yet) adding a separate command that would create an alias metadata file and/or concept ID file for a specific dataset(-version) in a catalog
  • proposing the already committed script for use when alias- and concept-id metadata files need to be created for dataset-versions in an existing catalog

These items are pending a discussion with @mslw about how the SFB1451 catalog restructuring will take place.

@jsheunis
Member Author

jsheunis commented Apr 15, 2024

fixing the remaining issues with navigation using url query strings (main priority)

I believe this is fixed now. It was a simple bug where a line of code tested for the existence of a variable without considering the possibility of a "falsy" value.

From my POV, this is ready to merge, and any functionality that would extend the command line or python api options for the purpose of dealing with aliases and concept ids more easily could form part of a new PR.

UPDATE:
Still discovering some bugs with inconsistent behaviour of the query string upon navigation to a subdataset.

@jsheunis
Member Author

Bringing a note from @mih into this PR:

Is it true that some kind of alias listing has to be done, and this needs a consolidation in case a new (latest) version of an existing resource is added to the catalog?

This is somewhat undecided. With the current state of the new functionality in this PR, a catalog instance (and its maintainers) can choose to completely ignore aliases and concept IDs if they want to, and carry on using the id+version-based navigation as supported in the version of SFB1451 that is currently in production. If they choose to support navigation based on aliases and concept ids (i.e. where the url does not contain a specified version), the necessary additional metadata files can be added once-off (using the script added in this PR), i.e. no listing has to be maintained. What has to be maintained though, is a pointer to the "current canonical" version of a dataset from (at least) the concept id metadata file (this can also be pointed to from the alias metadata file, thus supporting versioned tags). The alternative is to not maintain any pointer to the current canonical version and let the client application determine this at runtime from some field in the metadata of all available versions of the dataset being navigated to, e.g. using a last_updated_at field as a proxy for latest. But as @mslw pointed out, there isn't necessarily a 1-to-1 relationship between the last_updated_at and canonical version.

Somewhat related, if we want to support a dropdown list of versions available for a given dataset, this would necessitate running through all versions of a current dataset, which would mean that finding the latest one is a small additional step.

@jsheunis
Member Author

UPDATE: Since #440 has been merged, this PR needs some resolving first. And then the last bugs related to the query strings on navigation to subdatasets need to be fixed.

…e behaviour

This commit fixes bugs in the recently added functionality to update the URL query
string based on user selections. The problems included that a test for '>= 0'
incorrectly validated null or NaN, resulting in the incorrect part of an if statement
being executed and creating the '?tab=undefined' issue in the query string; and also
that the code for adding/removing search tags based on the 'keyword' parameter in the
query string had to move to the 'dataset_ready' watcher rather than the 'subdatasets_ready'
watcher to allow more time for asynchronous code to complete. To support debugging
during this and future steps, several 'console.debug' statements were added all
over the code base. Lastly, references in the html rendering to 'Subdatasets' were
replaced with 'Datasets' to conform to recent similar changes in other catalog instances.
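The kind of pitfall described in this commit message can be reproduced in isolation. The snippet below is not the code from the commit; the `tabs` list and both function names are hypothetical, and it only demonstrates how a bare `>= 0` check lets `null` through in JavaScript (because `null` is coerced to `0` in relational comparisons) and how a stricter check avoids that.

```javascript
// Hypothetical list of tab names, used only for this demonstration.
const tabs = ["content", "datasets", "publications"];

// Naive check: `null >= 0` evaluates to true, so tabs[null] (undefined) is used.
function naiveTabQuery(index) {
  if (index >= 0) {
    return "?tab=" + tabs[index];
  }
  return "";
}

// Stricter check: require an integer index that is within range.
function strictTabQuery(index) {
  if (Number.isInteger(index) && index >= 0 && index < tabs.length) {
    return "?tab=" + tabs[index];
  }
  return "";
}
```

Here `naiveTabQuery(null)` produces `?tab=undefined`, whereas `strictTabQuery(null)` and `strictTabQuery(NaN)` both return an empty string.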
@jsheunis
Member Author

Given that:

  • dataset_version is optional for an alias metadata/redirect file (it will then redirect to the dataset-id metadata/redirect file)
  • the canonical dataset_version is required for the dataset-id metadata/redirect file
  • new versions of a dataset, to serve as the canonical versions of that dataset, will likely be added to a catalog

I think the script that creates the redirect files (and takes a tsv with dataset_id | dataset_version | alias as input) should create the alias files without the dataset_version field. This is not currently the case.

@mslw thoughts?

@jsheunis
Member Author

I don't understand the linting errors:

...
would reformat /home/runner/work/datalad-catalog/datalad-catalog/tools/create_alias_concept_metadata.py

Oh no! 💥 💔 💥
1 file would be reformatted, 48 files would be left unchanged.

When I run black locally on that file, no changes are suggested. So I'll ignore these failures, meaning there are no blockers to merge anymore.

@mslw
Collaborator

mslw commented Apr 18, 2024

Given that:

* `dataset_version` is optional for an alias metadata/redirect file (it will then redirect to the dataset-id metadata/redirect file)

* the canonical `dataset_version` is required for the dataset-id metadata/redirect file

* new versions of a dataset, to serve as the canonical versions of that dataset, will likely be added to a catalog

I think the script that creates the redirect files (and takes a tsv with dataset_id | dataset_version | alias as input) should create the alias files without the dataset_version field. This is not currently the case.

@mslw thoughts?

I don't think there is a strict should - but I agree with your reasoning, and I'd personally be more inclined to create an alias for a dataset ID, and then use the dataset id metadata/redirect file to keep it pointing to "latest".

</b-row>
</div>
</span>
<h6>ID URL</h5>
Collaborator

I wonder if "Versioned URL" might be a better name?

Member Author

Wouldn't that make more sense for "full URL"?

Collaborator

@mslw mslw left a comment

The diff is too big for me to review today, but having played with the netlify preview (studyforrest) and also the catalog applied on top of the SFB1451 catalog, I think it gets the job done!

No unexpected behaviours with the SFB1451 catalog as far as I could see.

One thing I wanted to see (and you might want to look into) is what the share button would do when no alias-related files are in place (i.e. old metadata, no redirect files of any sort in place). The share pop-up does not show alias URL (great!) but it does show ID URL which 404s.

@jsheunis
Copy link
Member Author

Ah good catch, I didn't consider that. I will build in a check for that and will merge afterwards.
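A check of this kind might look roughly like the sketch below. It is not the code that was eventually merged; the function name `shareLinks`, the `redirects` flags, and the assumed URL layout (`<base>/dataset/<id>/<version>` for the full URL and `<base>/dataset/<name>` for ID and alias URLs) are illustrative assumptions.

```javascript
// Only offer ID and alias URLs when the corresponding redirect files exist,
// so that the share pop-up never shows a link that would 404.
function shareLinks(baseUrl, dataset, redirects) {
  const links = {
    full: baseUrl + "/dataset/" + dataset.dataset_id + "/" + dataset.dataset_version,
  };
  if (redirects.hasConceptFile) {
    links.id = baseUrl + "/dataset/" + dataset.dataset_id;
  }
  if (redirects.hasAliasFile && dataset.alias) {
    links.alias = baseUrl + "/dataset/" + dataset.alias;
  }
  return links;
}
```

For a catalog with old metadata and no redirect files, both flags would be false and only the full (ID plus version) URL would be offered.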

@jsheunis jsheunis merged commit 53ac02a into main Apr 27, 2024
11 of 13 checks passed