
Implement server-side bookmarks #2843

Merged: tillprochaska merged 11 commits into develop from feature/2831-server-side-bookmarks on Jun 28, 2023

Conversation

@tillprochaska (Contributor) commented on Jan 30, 2023:

The current experimental bookmarks feature in Aleph stores bookmarks in local storage in the browser. This PR extends the feature to store bookmarks on the server, preventing a few common issues (for example, users losing their bookmarks after clearing their browser data). Closes #2831.

Specifically, there are three primary changes:

  1. API: A new /bookmarks API endpoint to retrieve, create, and delete bookmarks.
  2. Frontend: Changes to the frontend to use the API instead of storing bookmarks locally.
  3. Migration: New logic in the frontend to migrate bookmarks from client-side storage to the server.

API

  • Data is stored in a new bookmark table in Postgres. This table stores the role ID, the entity ID, and (to support efficient querying based on users’ permissions) the collection ID. A rough sketch of such a model follows after this list.
  • The API method to list bookmarks makes use of existing abstractions: it fetches data from the database (including flexible offsets and limits) using the DatabaseQuery class and merges it with entity data from the Elasticsearch index using the Serializer class. So it’s actually quite simple.
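For orientation, here is a rough sketch of what such a model could look like. The column types, import paths, and constraint details are assumptions for illustration, not the actual code from this PR:

```python
# Rough sketch of a bookmark model; names, types, and import paths are
# assumptions, not the actual code from this PR.
from datetime import datetime

from aleph.core import db  # assumed import path


class Bookmark(db.Model):
    __tablename__ = "bookmark"

    id = db.Column(db.BigInteger, primary_key=True)
    # The user who created the bookmark.
    role_id = db.Column(db.Integer, index=True, nullable=False)
    # The bookmarked entity, referenced by its ID in the search index.
    entity_id = db.Column(db.String(128), index=True, nullable=False)
    # Denormalized collection ID so bookmarks can be filtered efficiently
    # based on the collections a user is allowed to read.
    collection_id = db.Column(db.Integer, index=True, nullable=False)
    created_at = db.Column(db.DateTime, default=datetime.utcnow, nullable=False)

    __table_args__ = (
        # At most one bookmark per user and entity.
        db.UniqueConstraint("role_id", "entity_id"),
    )
```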

UI

  • Uses our existing Redux-based data fetching logic to handle data fetching and mutations. I had to extend this to support partial data invalidation; for details, see f5ac9d5.

Migration

  • There’s a separate endpoint to batch create bookmarks specifically for the purpose of migrating existing bookmarks stored client-side. The idea is to remove any code related to this migration after a couple of months.
  • When a user opens the bookmarks drawer and has local bookmarks that haven’t been migrated yet, a request is sent to the migration endpoint (sketched below). If errors occur while migrating bookmarks, users can download a list of the bookmarks that couldn’t be migrated.
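To illustrate the migration request from a client’s perspective, here is a hypothetical sketch. The endpoint path, payload shape, and authentication header are assumptions, not the PR’s actual API:

```python
# Hypothetical sketch of the one-time migration request; the endpoint path,
# payload shape, and auth header are assumptions, not the PR's actual API.
import requests

# Bookmarks as previously stored in the browser's local storage.
local_bookmarks = [
    {"entity_id": "entity-1", "created_at": "2023-01-15T12:00:00.000Z"},
    {"entity_id": "entity-2", "created_at": "2023-01-20T09:30:00.000Z"},
]

response = requests.post(
    "https://aleph.example.org/api/2/bookmarks/migrate",  # assumed path
    json=local_bookmarks,
    headers={"Authorization": "Token <session-token>"},  # assumed auth scheme
)

# The response is expected to report bookmarks that could not be migrated
# (for example because the entity was deleted), so the UI can offer the
# user a download of the affected bookmarks.
print(response.status_code, response.json())
```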

How to test the migration

Testing the happy path:

  1. Check out the develop branch.
  2. Start Aleph.
  3. Create a few entities.
  4. Bookmark a few entities.
  5. Stop Aleph.
  6. Check out this branch.
  7. Start Aleph.
  8. Open the bookmarks drawer.
  9. You should see a message notifying you that bookmarks now sync across devices. Confirm the message and you should see your bookmarks.
  10. Clear your local storage.
  11. Reload the page.
  12. Open the bookmarks drawer.
  13. Your bookmarks should still be there.

Testing migration errors:

  1. Check out the develop branch.
  2. Start Aleph.
  3. Create a few entities.
  4. Bookmark a few entities.
  5. Now delete one or two of the bookmarked entities, but not all of them.
  6. Stop Aleph.
  7. Check out this branch.
  8. Start Aleph.
  9. Open the bookmarks drawer.
  10. You should see a message about recent changes to bookmarks, noting that there was an issue with some of your bookmarks and prompting you to download a list of the affected bookmarks.
  11. Download and confirm.
  12. You should now see the bookmarks that could be migrated successfully.
  13. Clear your local storage.
  14. Reload the page.
  15. Open the bookmarks drawer.
  16. Your bookmarks should still be there.

Backend:

  • Decide whether we want to have a single POST /bookmarks endpoint that handles both creating a single bookmark and bulk creation (see comment below).
    • Decision: We want to have a separate endpoint for the purpose of migrating client-side bookmarks only. This endpoint can include specifics only relevant to this particular use case and we expect to remove it after some time.
  • GET /bookmarks: Endpoint should return bookmarks sorted by creation date, in descending order.
  • POST /bookmarks: Endpoint should validate entity_id when creating a new bookmark.
  • POST /bookmarks: Creating a bookmark for an entity that is already bookmarked shouldn’t raise an exception, but should silently ignore the request or return a semantic status code (see review comment below; a sketch of the upsert approach follows after this list).
  • Adjust GET /entities/:entity_id to return whether the entity is bookmarked if the user is authenticated.
  • Optional: We should consider using bulk operations when creating database records for new bookmarks (see review comment below).
  • Optional: Delete bookmarks if entity is deleted. (Check entity set implementation)
    • Update: Entity sets are very similar to bookmarks. I checked the current implementation, and right now, if you delete an entity, the corresponding entity set items aren’t deleted. They are obviously not returned from API responses and not displayed in the UI, but we keep the records in the database. I think it’s sensible to keep the bookmarks implementation similar for now, and I don’t think it’s a huge issue.
    • Update 2: Actually we do clean up related records after deleting an entity. This is now also implemented for bookmarks.
  • Optional: Add proper response schemata in docstrings for API routes
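Regarding the “silently ignore” item above, here is a minimal sketch of the upsert-style approach, assuming a SQLAlchemy Bookmark model with a unique (role_id, entity_id) pair; helper names and import paths are assumptions:

```python
# Minimal sketch of ignoring duplicate bookmarks via a Postgres upsert;
# model and import paths are assumptions, not the PR's actual code.
from sqlalchemy.dialects.postgresql import insert

from aleph.core import db          # assumed import path
from aleph.model import Bookmark   # assumed import path


def create_bookmark(role_id, entity_id, collection_id):
    stmt = insert(Bookmark.__table__).values(
        role_id=role_id,
        entity_id=entity_id,
        collection_id=collection_id,
    )
    # If this (role_id, entity_id) pair is already bookmarked, do nothing
    # instead of raising a unique constraint violation.
    stmt = stmt.on_conflict_do_nothing(
        index_elements=[Bookmark.role_id, Bookmark.entity_id],
    )
    db.session.execute(stmt)
    db.session.commit()
```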

Frontend:

  • Fetch bookmarks from backend API
  • Implement infinite scrolling for bookmarks list to support users with 100+ bookmarks. (We should expect users to accumulate many bookmarks over time.)
  • Send a one-time bulk create request for bookmarks stored in local storage.

Other:

  • Check whether we should use a foreign key constraint on the entity_id column. This will also help us get a better understanding of Aleph’s architecture.
    • The answer is no, and it makes sense: there can be situations where entity data exists only in the index and there is no corresponding record in the Postgres database.

@tillprochaska force-pushed the feature/2831-server-side-bookmarks branch 5 times, most recently from 716bb0e to 474da36 on January 31, 2023 17:08
@tillprochaska (Contributor Author):

@catileptic I’ve thought about the bulk endpoint: Right now, the default POST /bookmarks endpoint supports bulk creation because, this way, we can handle both creating a single bookmark and creating multiple bookmarks with a single endpoint.

Bulk creation is only required for the one-time migration of client-side bookmarks. Apart from that, the bulk endpoint will need to silently ignore entities that are already bookmarked or that do not exist or are not accessible to the user. It will need to support a user-provided created_at timestamp, even though that doesn’t make sense in any case except the one-time migration. And it may need a few more quirks, because we haven’t considered other edge cases yet.

Maybe it’s better to have the default POST /bookmarks endpoint create a single bookmark, handling requests in a simple and semantic way, without trying to support the use case of the one-time migration.

Then, we could have a separate POST /bookmarks/bulk (or similar) endpoint that handles bulk creation for the purpose of the one-time migration. This endpoint can be a bit weird and handle edge cases specific to the migration in a way that’s convenient for us (even though it may not be the most semantic or clean way). After some time, we can simply remove the endpoint and the complexity it introduces.
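To make the proposed split concrete, here is a compact sketch. The route paths, payloads, and response shapes are assumptions, not the PR’s actual code:

```python
# Compact sketch of the proposed endpoint split; paths, payloads, and
# response shapes are assumptions, not the PR's actual code.
from flask import Blueprint, jsonify, request

blueprint = Blueprint("bookmarks_sketch", __name__)


@blueprint.route("/api/2/bookmarks", methods=["POST"])
def create_bookmark():
    # Strict, single-resource semantics: one entity per request, validated,
    # with clear error responses for missing or inaccessible entities.
    data = request.get_json()
    return jsonify({"entity_id": data["entity_id"]}), 201


@blueprint.route("/api/2/bookmarks/migrate", methods=["POST"])
def migrate_bookmarks():
    # Lenient, migration-only semantics: accept a list, tolerate duplicates
    # and missing entities, honor a client-supplied created_at, and report
    # anything that could not be migrated so the UI can surface it.
    bookmarks = request.get_json()
    return jsonify({"errors": [], "migrated": len(bookmarks)}), 201
```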

@tillprochaska force-pushed the feature/2831-server-side-bookmarks branch 6 times, most recently from 69bc551 to fbcaad2 on April 6, 2023 07:37

for bookmark in data:
    try:
        entity = get_index_entity(bookmark.get("entity_id"))
@tillprochaska (Contributor Author):

Could instead use entities_by_ids from aleph.index.entities to get all entities in a single ES request, but would need to manually check permissions.
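For comparison, the batched variant might look roughly like the sketch below. It assumes entities_by_ids yields entity documents with id and collection_id fields and that permissions can be checked via authz.can; both are assumptions about the existing helpers:

```python
# Rough sketch of the batched alternative; field names and the exact
# behaviour of entities_by_ids are assumptions.
from aleph.index.entities import entities_by_ids


def accessible_entities(data, authz):
    ids = [bookmark.get("entity_id") for bookmark in data]
    entities = {}
    # One Elasticsearch request for all entities instead of one per bookmark.
    for entity in entities_by_ids(ids):
        # Manual permission check, since get_index_entity is bypassed here.
        if authz.can(entity.get("collection_id"), authz.READ):
            entities[entity.get("id")] = entity
    return entities
```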

Contributor:

Is this following similar conventions in other parts of the codebase? What I’m saying is: if other areas are manually checking the permissions, it might be worth following that convention; if not, that’s a good indicator that perhaps this is the way to go.

@tillprochaska (Contributor Author):

Yes and no. For singular resources, we do rely on these helpers to check permissions and throw appropriate errors in case of missing permissions etc.

For lists of many resources, we usually ensure that only resources the current user can access are returned at the query level, and we have another last-resort check when serializing resources in API responses.

This case is a little different though -- we’re not processing data from the database, but from a user. And we can’t rely on the serializer-level check because we actually need to know about the issues in order to show a message to the user.

The current solution is the least complex, but I’m not sure if it might lead to issues when migrating huge lists of bookmarks due to the O(n) ES requests. So far it hasn’t been a problem locally, and I’ll make sure to test it on staging once it’s out there.

Contributor:

I would try this approach, which seems easier to follow and only change it if we see issues with it.

@tillprochaska force-pushed the feature/2831-server-side-bookmarks branch from fbcaad2 to a890ead on April 6, 2023 07:48
@@ -165,6 +165,7 @@ def prune_entity(collection, entity_id=None, job_id=None):
    if doc is not None:
        doc.delete()
    EntitySetItem.delete_by_entity(entity_id)
    Bookmark.delete_by_entity(entity_id)
@tillprochaska (Contributor Author):

Currently, this isn’t covered by tests. I was trying to find existing tests for this method without success so far, and I guess it doesn’t make a lot of sense to test prune_entity in isolation? Does anyone know whether this is already covered as part of a bigger integration-style test somewhere?

Contributor:

Indeed, I'm not seeing any test coverage for this. And I agree it should go into an integration test. Shall we leave a TODO here?

Comment on lines -226 to +227:
- db.create_all()
+ flask_migrate.upgrade()
@tillprochaska (Contributor Author):

Previously, we created the DB schema based on the model classes. I know that some people use Alembic in a way where the model classes are the only source of truth and contain all information required to generate the schema, and then auto-generate Alembic revisions based on the model classes. We don’t seem to do that though (at least not consistently), as many indexes etc. are only defined in revision files, so we ended up with inconsistent DB schemata in test and dev/prod environments.

Not sure if there is any reason not to use the actual migrations (except maybe test performance and historic reasons)?

I tested if this has an effect on how long it takes to execute tests but couldn’t find a significant difference.

@tillprochaska force-pushed the feature/2831-server-side-bookmarks branch 3 times, most recently from 36a6865 to fbf4c27 on April 10, 2023 20:55
@tillprochaska force-pushed the feature/2831-server-side-bookmarks branch from 298fb4c to f5ac9d5 on April 14, 2023 13:20
@tillprochaska marked this pull request as ready for review on April 14, 2023 13:35
@tillprochaska changed the title from "[WIP] Implement basic operations for bookmarks" to "Implement basic operations for bookmarks" on Apr 14, 2023
@tillprochaska changed the title from "Implement basic operations for bookmarks" to "Implement server-side bookmarks" on Apr 19, 2023
@@ -176,5 +176,6 @@ def migrate():
        index_elements=[Bookmark.role_id, Bookmark.entity_id],
    )
    db.session.execute(stmt)
    db.session.commit()
@tillprochaska (Contributor Author):

I think I accidentally removed this line in a previous commit or rebase. I was a little surprised the existing test didn’t catch this. When running this in development, if the session is not committed, no bookmarks are created (expected). However, in a test environment, asserting that bookmarks exist does not fail.

Pretty sure the cause is related to SQLAlchemy’s session handling, autocommit/transaction isolation config, or something like that, but I wasn’t able to isolate it yet.

Contributor:

Hm, great point, and I see no immediate answer to this. Anything wrapped in a session should be running in a transaction, so I would expect this to require a commit. But perhaps the default isolation level is autocommit? 🤔
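For what it’s worth, here is a self-contained sketch (not from the PR) that reproduces the effect with plain SQLAlchemy: pending changes are autoflushed into the open transaction before queries run, so an assertion issued from the same session can see rows that were never committed, while a fresh session cannot:

```python
# Standalone illustration: uncommitted rows are visible to queries issued
# from the same session, but disappear once the transaction is rolled back.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Bookmark(Base):
    __tablename__ = "bookmark"
    id = Column(Integer, primary_key=True)
    entity_id = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Bookmark(entity_id="abc"))
    # No commit: autoflush pushes the INSERT into the open transaction,
    # so the row is visible to this query ...
    assert session.query(Bookmark).count() == 1

# ... but the transaction is rolled back when the session closes, so a
# fresh session no longer sees the row.
with Session(engine) as session:
    assert session.query(Bookmark).count() == 0
```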


try:
    entity = get_index_entity(entity_id, request.authz.READ)
except (NotFound, Forbidden):
    raise BadRequest(
        "Could not bookmark the given entity as the entity does not exist or you do not have access."
    )
Contributor:

Perhaps: “We’re sorry but we are unable to bookmark this entity for you.” But if this happens, are we logging/tracking it?

@tillprochaska linked an issue on May 26, 2023 that may be closed by this pull request.
@stchris (Contributor) left a comment:

Looks good to me, Till! Outside of the mysterious commit question, which I can't quite answer, everything seems good to go! 👏

@tillprochaska marked this pull request as draft on May 30, 2023 10:43
@tillprochaska force-pushed the feature/2831-server-side-bookmarks branch from d11036c to ff6a27d on May 31, 2023 16:02
@tillprochaska marked this pull request as ready for review on June 8, 2023 09:12
@tillprochaska merged commit 6c4411a into develop on Jun 28, 2023
5 checks passed
stchris pushed a commit that referenced this pull request Jul 17, 2023
* Implement basic CRUD operations for bookmarks

* Delete bookmarks when entity is deleted

* Run Alembic migrations in test environment

Before, we created the database schema based on model classes which meant we'd end up with a slightly different schema in test and dev/prod environments.

* Add endpoint to migrate bookmarks from client-side storage

* Return whether entity is bookmarked in entity API response

* Remove warning popover when using bookmarks feature for the first time

* Load bookmarks from API

* Automatically migrate local bookmarks

* Extend data fetching logic to support partial invalidations

As a reminder to my future self: In Aleph’s frontend, we use our own mini data fetching framework built on top of Redux. One thing it does is caching data from API responses. For example, when a user views their bookmarks, does something else and then views the bookmarks again, the bookmarks are only fetched once. When viewing the bookmarks the second time, we render them based on a runtime cache.

This can lead to outdated data being displayed. For example, when the user creates a new bookmark *after* the bookmarks have been loaded, the list of bookmarks would be outdated. Our mini framework does handle data invalidation, but only globally, for all cached data.

That works ok in most cases, but for bookmarks, it leads to a bad UX. When you view an entity, then click on the bookmarks button, it would cause the entire page (all the data about the entity) to reload, even though none of that data has changed. The only thing that has changed is the list of bookmarks.

We handle data invalidation by storing the timestamp when a data object was loaded and the timestamp of the last mutation. Whenever we render cached data, we check whether the cached data might be outdated (i.e. when it has been loaded before the latest mutation).

Until now, we only stored one global mutation timestamp. Whenever that timestamp was updated, all cached data became outdated. Now, in addition to the global mutation timestamp, we have an option to store mutation timestamps for specific subsets of the cached data. So when creating or deleting a bookmark, instead of updating the global mutation timestamp (which would invalidate all cached data), we can update the timestamp for the `bookmarks` mutation key. This invalidates only cached bookmarks, but no other data. (A minimal sketch of this bookkeeping follows after this commit list.)

* Actually commit ORM session to execute queries

* Update wording
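To make the partial-invalidation idea from the “Extend data fetching logic to support partial invalidations” commit message above more concrete, here is a minimal sketch of the timestamp bookkeeping. The real implementation lives in the TypeScript/Redux frontend; the names here are illustrative only, and a logical clock stands in for wall-clock timestamps:

```python
# Illustrative sketch of global vs. per-key mutation timestamps; the real
# code lives in the TypeScript/Redux frontend and uses different names.
import itertools

clock = itertools.count(1)  # logical clock standing in for wall-clock time


class MutationTracker:
    def __init__(self):
        self.global_mutation = 0
        self.key_mutations = {}  # e.g. {"bookmarks": 42}

    def mutate(self, key=None):
        # Record a mutation, scoped to a key if one is given, global otherwise.
        now = next(clock)
        if key is None:
            self.global_mutation = now
        else:
            self.key_mutations[key] = now

    def is_outdated(self, loaded_at, key=None):
        # Cached data is outdated if it was loaded before the latest global
        # mutation or before the latest mutation for its own key.
        latest = max(self.global_mutation, self.key_mutations.get(key, 0))
        return loaded_at < latest


tracker = MutationTracker()
entity_loaded_at = next(clock)     # entity page data fetched
bookmarks_loaded_at = next(clock)  # bookmark list fetched

tracker.mutate("bookmarks")        # the user creates a bookmark

print(tracker.is_outdated(bookmarks_loaded_at, "bookmarks"))  # True: refetch the list
print(tracker.is_outdated(entity_loaded_at, "entities"))      # False: keep cached entity data
```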
Successfully merging this pull request may close these issues.

FEATURE: Persist bookmarks server-side