Towards more user-friendly navigation using aliases and ID concepts #430
…ted metadata_source
…n a catalog This includes changes to the following core modules: - webcatalog: new methods are added for extracting all IDs in a catalog, and all VERSIONs for respective IDs, including properties like name, alias, and more. This allows users to iterate through a list comprising all catalog datasets (which is necessary for the upcoming 'tree' command) and allows the 'report' functionality which summarizes some stats of a catalog. - node: adds a method for sorting metadata_sources by time, as a proxy for determining a 'last_updated' value for a dataset version, which forms part of the output of the new webcatalog methods. - get: adds the property 'report', allowing a summary report of the catalog to be printed out. - utils: changes supporting above methods. - tests: minor changes to allow test success, new tests are TODO. These commits all contribute to the upcoming changes related to navigating to a dataset-version via concept ID or via alias.
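The 'last_updated' proxy described for the node module could be sketched roughly as follows. This is a minimal illustration, not the code from these commits: the record layout (a metadata_sources object holding a sources list whose entries carry a numeric source_time) and the function name last_updated are assumptions.

```python
from typing import Optional


def last_updated(record: dict) -> Optional[float]:
    """Return the most recent source timestamp of a dataset-version record.

    Assumed (hypothetical) layout:
    {"metadata_sources": {"sources": [{"source_time": <float>}, ...]}}
    """
    sources = record.get("metadata_sources", {}).get("sources", [])
    # Sort by time, newest first; entries without a timestamp are skipped.
    timed = sorted(
        (s for s in sources if s.get("source_time") is not None),
        key=lambda s: s["source_time"],
        reverse=True,
    )
    return timed[0]["source_time"] if timed else None


# Made-up sample record for illustration only.
record = {
    "metadata_sources": {
        "sources": [
            {"source_name": "a", "source_time": 1652340647.0},
            {"source_name": "b", "source_time": 1675113291.5},
            {"source_name": "c"},
        ]
    }
}
print(last_updated(record))
```

The value is only a proxy: the time at which metadata was added to a catalog does not have to match the order of dataset versions.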
…tasets in a catalog
…ult of running the script added in 13355d0
These are the web-side changes that accompany the previous commits related to dataset aliases and dataset id concepts. This now allows users to navigate to e.g. http://mycatalog.de/dataset/<dataset-id> or to e.g. http://mycatalog.de/dataset/<dataset-alias>. The javascript will first fetch the concept id file or the alias file, which contains a redirect pointer to the correct dataset id AND version. Internal navigation between datasets is still done based on the dataset-id AND dataset-version URL parameters. The latter now becomes an optional parameter. One aspect that this commit does not account for yet is the second optional URL parameter 'tab' and its use with the single required URL parameter.
✅ Deploy Preview for datalad-catalog ready!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

Coverage Diff
           main     #430     +/-
Coverage   84.55%   83.53%   -1.03%
Files      43       43
Lines      2862     2897     +35
Hits       2420     2420
Misses     442      477      +35
With the current version of the script at https://github.com/datalad/datalad-catalog/blob/ab2fbeabb51696a534daeb0f43c2e969d8e1a1b7/tools/create_alias_concept_metadata.py, you can apply all necessary updates to a catalog's metadata to allow the new functionality provided by this PR. For example: python tools/create_alias_concept_metadata.py --catalog ../abcdj/data-catalog/catalog --aliases abcdj-aliases.tsv where
After (or before) these changes have been applied to a catalog, the only other changes necessary are those made to the web assets and schema at:
@mslw: ☝️ for reference if we want to apply this to SFB1451 catalog
Some more thoughts after revisiting this PR:
This falls into the same context as the work on aliasing and concept id navigation. The point is that any number of combinations between using an alias or concept id or dataset_id+version, and query tabs and keywords as query parameters should be possible. An important change is the use of history.replaceState to update the query string in the URL without updating the route or reloading it (i.e. without using pushState). This seems to work fine, but the process of keeping the query string and the selected keywords and tabs in sync is still not ironed out completely, and hence some navigations are still a bit buggy (e.g. after navigation to a subdataset, a keyword query param might still remain, unexpectedly). Note that vue-router's router.push is not aware of the history.replaceState call. This is not a problem per se, but needs to be kept track of.
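To make the intended combinations concrete, here is a small Python sketch (the web app itself is JavaScript) that builds the kinds of URLs described above. The helper name dataset_url, the path layout /dataset/<identifier>/<version>, the exact parameter spelling ('tab', 'keyword') and the sample values are assumptions for illustration only.

```python
from typing import Optional, Sequence
from urllib.parse import quote, urlencode


def dataset_url(
    base: str,
    identifier: str,
    version: Optional[str] = None,
    tab: Optional[str] = None,
    keywords: Sequence[str] = (),
) -> str:
    """Build a dataset URL from an alias or ID, an optional version and a query."""
    path = f"{base.rstrip('/')}/dataset/{quote(identifier, safe='')}"
    if version is not None:
        # The version is an optional second path parameter (assumed layout).
        path += f"/{quote(version, safe='')}"
    query = []
    if tab is not None:
        query.append(("tab", tab))
    # One 'keyword' entry per selected search tag.
    query.extend(("keyword", kw) for kw in keywords)
    return f"{path}?{urlencode(query)}" if query else path


print(dataset_url("https://mycatalog.de", "studyforrest_alias"))
print(dataset_url("https://mycatalog.de", "abcd_id", tab="content", keywords=["fmri"]))
```

Any identifier form (alias, concept id, or id plus version) goes through the same helper, so the query part stays independent of how the dataset was addressed.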
I spot-checked the live demo and it worked nicely. Thanks! I agree that aliasing for a dataset (unversioned) makes most sense. Conceptually, however, aliased dataset versions also make sense (they are tagged versions). I have no good concept of the structural consequences of this PR. Would two uncoordinated, parallel updates of a dataset still work, or is there the chance to overwrite information? If I understood correctly, there is now a need (or at least the ability) to enumerate catalog items. What would you think the maximum number of catalog records could be, in a real-world scenario?
I think the "concept" URL would be the most valuable addition (but tagged versions could also be useful, as mih points out). Without looking into the meat of the PR, the file I think figuring out at runtime is not necessary (consider this as potential complication: I don't typically update "older" dataset versions, but in general the metadata update time/order does not have to match version time/order).
Two uncoordinated, parallel updates of a dataset could still work (I'm assuming "coordinated" would mean that the contributors have to speak/chat to each other in order to make both updates work successfully?) in the way that it currently works, by depending on a standard git-based collaborative workflow. Also, there is as much chance of overwriting information as there was before this PR. E.g. if two agents have write permissions to the same catalog repo, and they both provide updates to the same property of a dataset, and if both metadata sources have a higher level of priority than the source that populated the existing property value, these two changes could overwrite each other. But this is expected behaviour and source priority is a configuration-level setting per catalog or per dataset.
The ability: yes. The need stems from me wanting to run through all datasets in a catalog to assign aliases. I thought this functionality could be useful in future as well for any type of catalog-dataset-level update.
The maximum number of catalog records for a real-world scenario is difficult to answer. It would depend on how the user interface (and code underneath it) deals with delays and whether that translates to user frustration. The bottleneck at the moment is the fetching of subdataset-level metadata when navigating to a dataset. So the issue won't be the maximum number of datasets for the whole catalog, but whether a single dataset has many subdatasets. I think we need to test this to start having a better understanding of the practicalities, and the SFB1451 catalog revamping would be a good first test. But we should stress test it more. This single-dataset-with-many-subdatasets scenario is kind of built into our current usage of a catalog home page, but isn't strictly required. The alternative is to have a more traditional "data portal" where the front page isn't also a (conceptual) dataset, and is rather used to query across all datasets in the catalog. This is something I think we should discuss when talking about the design of the new catalog toolset.
Just to be clear, the file
This is a good point. At the moment the proxy for "latest" is "update time", but like you say these might not be equal. Thinking about the use case for multi-versioned datasets in a single catalog, there are two scenarios:
That is, an alias file can redirect to a dataset ID without specifying a dataset version. Importantly, the concept ID metadata file should always specify a dataset version. This lays the groundwork for allowing an alias to identify either a dataset OR a dataset-version.
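The rule above can be sketched as a small validation step. This is an illustration under assumptions, not this PR's implementation: the dataset_version field follows the PR description, while the dataset_id field name, the kind labels, the function name resolve_redirect and the sample values are hypothetical.

```python
from typing import Optional, Tuple


def resolve_redirect(kind: str, content: dict) -> Tuple[str, Optional[str]]:
    """Return (dataset_id, dataset_version) from a redirect file's content.

    kind is "alias" or "concept_id" (hypothetical labels).
    """
    if kind not in ("alias", "concept_id"):
        raise ValueError(f"unknown redirect kind: {kind!r}")
    dataset_id = content.get("dataset_id")
    if not dataset_id:
        raise ValueError("redirect file must name a dataset_id")
    version = content.get("dataset_version")
    if kind == "concept_id" and not version:
        # A concept ID file must always pin a dataset version.
        raise ValueError("concept ID file must specify dataset_version")
    # For an alias file the version may be None (unversioned alias).
    return dataset_id, version


print(resolve_redirect("alias", {"dataset_id": "abcd_id"}))
print(resolve_redirect("concept_id", {"dataset_id": "abcd_id", "dataset_version": "v2"}))
```

An alias that returns no version would then need a second lookup, for example via the concept ID file, to find the version to display.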
In order to spend the least amount of time necessary on this PR, I propose:
These items are pending a discussion with @mslw about how the SFB1451 catalog restructuring will take place.
…, and outdated command syntax in readme
I believe this is fixed now. It was a simple bug where a line of code tested for the existence of a variable without considering the possibility of a "falsy" value. From my POV, this is ready to merge, and any functionality that would extend the command line or python api options for the purpose of dealing with aliases and concept ids more easily could form part of a new PR. UPDATE:
Bringing a note from @mih into this PR:
This is somewhat undecided. With the current state of the new functionality in this PR, a catalog instance (and its maintainers) can choose to completely ignore aliases and concept IDs if they want to, and carry on using the
Somewhat related, if we want to support a dropdown list of versions available for a given dataset, this would necessitate running through all versions of a current dataset, which would mean that finding the latest one is a small additional step.
UPDATE: Since #440 has been merged, this PR needs some resolving first. And then the last bugs related to the query strings on navigation to subdatasets need to be fixed.
…e behaviour This commit fixes bugs in the recently added functionality to update the URL query string based on user selections. The problems included that a test for '>= 0' incorrectly validated null or NaN, resulting in the incorrect part of an if statement being executed and creating the '?tab=undefined' issue in the query string; and also that the code for adding/removing search tags based on the 'keyword' parameter in the query string had to move to the 'dataset_ready' watcher rather than the 'subdatasets_ready' watcher to allow more time for asynchronous code to complete. To support debugging during this and future steps, several 'console.debug' statements were added all over the code base. Lastly, references in the html rendering to 'Subdatasets' were replaced with 'Datasets' to conform to recent similar changes in other catalog instances.
Given that:
I think the script that creates the redirect files (and takes a tsv with @mslw thoughts?
I don't understand the linting errors:
When I run
I don't think there is a strict should - but I agree with your reasoning, and I'd personally be more inclined to create an alias for a dataset ID, and then use the dataset id metadata/redirect file to keep it pointing to "latest".
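The two-step lookup suggested here (alias to dataset ID, then the dataset ID redirect file to the current version) could look roughly like this. It is a sketch only: the in-memory dictionaries stand in for the redirect files, and the function name resolve, the dataset_id field name and the sample values are hypothetical.

```python
from typing import Dict, Tuple


def resolve(
    name: str,
    alias_files: Dict[str, dict],
    concept_files: Dict[str, dict],
) -> Tuple[str, str]:
    """Resolve an alias or dataset ID to a (dataset_id, dataset_version) pair."""
    target = alias_files.get(name)
    if target is not None:
        dataset_id = target["dataset_id"]
        version = target.get("dataset_version")
        if version:
            # Versioned alias: no further lookup needed.
            return dataset_id, version
    else:
        # Not an alias, so treat the name as a dataset ID.
        dataset_id = name
    concept = concept_files.get(dataset_id)
    if concept is None:
        raise KeyError(f"no redirect file for dataset ID {dataset_id!r}")
    # The dataset ID redirect file decides which version counts as "latest".
    return concept["dataset_id"], concept["dataset_version"]


# Made-up redirect contents for illustration.
aliases = {"studyforrest_alias": {"dataset_id": "abcd_id"}}
concepts = {"abcd_id": {"dataset_id": "abcd_id", "dataset_version": "v2"}}
print(resolve("studyforrest_alias", aliases, concepts))
```

With this arrangement only the dataset ID redirect file has to change when a new version is added; unversioned aliases follow automatically.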
</b-row>
</div>
</span>
<h6>ID URL</h5>
I wonder if "Versioned URL" might be a better name?
Wouldn't that make more sense for "full URL"?
The diff is too big for me to review today, but having played with the netlify preview (studyforrest) and also the catalog applied on top of the SFB1451 catalog, I think it gets the job done!
No unexpected behaviours with the SFB1451 catalog as far as I could see.
One thing I wanted to see (and you might want to look into) is what the share button would do when no alias-related files are in place (i.e. old metadata, no redirect files of any sort in place). The share pop-up does not show alias URL (great!) but it does show ID URL which 404s.
Ah, good catch, I didn't consider that. I will build in a check for that and will merge afterwards.
This PR is in response to #423, the issue contains a detailed walk-through of the concepts being introduced here.
UPDATED 17 April:
1. alias and dataset-id-based navigation

This is the core contribution of this PR. The goal was to be able to have pretty and short URLs for a dataset (i.e. using dataset alias or dataset ID, as opposed to the previous dataset ID+VERSION), and to be able to specify tabs and keywords as query parameters, meaning that such a URL can easily be copied and used as a link to e.g. "all the <insert-category-here> datasets in the catalog". This change also addressed a long-standing open question of whether to specify tabs as route parameters (previous) or query parameters (new).

For navigation using aliases and dataset IDs, the alias metadata file and concept id metadata file were introduced. These files have the format:

and they are located at predictable locations based on the ID or ALIAS of the dataset being accessed. This allows the web app to easily navigate using a friendly URL (e.g. https://mycatalog.de/dataset/studyforrest_alias or https://mycatalog.de/dataset/abcd_id), which is then routed internally to the correct ID and VERSION metadata file. Note that the dataset_version field is optional for an alias metadata file, but required for a concept ID metadata file. This means that an alias metadata file can point to a dataset ID in general or to a specific version (i.e. versioned aliases are possible), but a concept ID metadata file should always point to the dataset version that is currently considered canonical. As a consequence, the concept ID metadata file must be updated whenever a new canonical version of a dataset is added to a catalog.

2. Extended query functionality
The core change that allows the URL query string to follow changes caused by the user, e.g. when selecting search tags (i.e. adding more keywords to filter subdatasets) or when selecting a tab to view, is the use of history.replaceState (keeping the path constant and only updating the query string) as opposed to router.push or router.replace. The reason is that the router actions result in a page reload, triggering unnecessary functions and resulting in bad UX, whereas the history.replaceState action (as implemented here) does not trigger a reload.

3. More changes
Apart from the JavaScript changes above, the PR also needed to consider how aliases will be assigned in bulk, which led to the need to iterate through the dataset-versions of a catalog. Hence, it includes changes to the following core modules:
- webcatalog: new methods are added for extracting all IDs in a catalog, and all VERSIONs for respective IDs, including properties like name, alias, and more. This allows users to iterate through a list comprising all catalog datasets (which is necessary for assigning aliases to datasets, and creating alias and ID metadata files for all datasets in a catalog) and allows the 'report' functionality which summarizes some stats of a catalog.
- node: adds a method for sorting metadata_sources by time, as a proxy for determining a last_updated value for a dataset version, which forms part of the output of the new webcatalog methods for reporting.
- get: adds the property 'report', allowing a summary report of the catalog to be printed out.
- utils: changes supporting above methods.
- tests: minor changes to allow test success

Then, the dataset schema was updated to include the alias field, and name and alias fields were added to the existing metadata of all datasets in the catalog, in preparation for the next step. A script was added (and executed) that uses these new python methods to run through all datasets in the demo catalog to create associated concept ID and alias metadata files.

There are several aspects that would make this functionality compatible with the full pipeline of catalog creation and entry generation and maintenance, that have not been addressed and will not be addressed in this PR:
- catalog-add command, i.e. code that would add an alias metadata file and concept ID file when a dataset record is added to a catalog
- last_updated_at field as a proxy for latest if no canonical version specification is done via the concept id metadata file.

For the time being, the script at tools/create_alias_concept_metadata.py is provided for use when alias- and concept-id metadata files need to be created for dataset-versions in an existing catalog.