
Split post table into notifiable posts and context #85

Merged · 67 commits into main · Apr 27, 2024
Conversation

@rossjrw (Member) commented Oct 4, 2023

Resolves #65.

  • Add migration that splits posts into notifiable posts and context, discarding other posts
  • Rewrite notification retrieval for new data structure
  • Verify no regressions in retrieved notifications
  • Verify that it's faster
  • Rewrite post lookup and download for new data structure
  • Remove non-notifiable posts - either as part of regular post download, or notification retrieval, or as a dedicated cleanup step
  • Rewrite everything that touches the post, thread, category, wiki, or thread_first_post tables to use the new context tables
  • Rewrite stored post cleanup to delete deleted posts and remove posts older than the threshold
  • Remove the bit in the digest that says how many things a user is autosubscribed to, as we no longer keep all of them (the 'still learning' part of the FAQ will need to be changed or removed too).

Requirements deemed out of scope for now:

@rossjrw (Member Author) commented Nov 10, 2023

For testing purposes using the local dataset, here are the user IDs that would be notified:

8hourly
1854139
5033208
5657010
8051787
8497838

daily
2277479
3628082
3695916
4187885
4754680
6361323
6477976
7267457
7566519
7666338
7785805
7823422
7857622
8116833
8140507
8164274
8166351
8481833
8546184

daily
2893766
3535580
4317225
7282391
7869175
8051387
8216454
8254640
8462856
8534171

monthly

weekly
3995762

I need to write the new data structure and scripts to query it. I need to answer the following questions:

  • How long does the old approach take?
  • How long does the new approach take, and is it faster?
  • The old approach uses a temporary cache table. Is the new approach faster with a cache table? (based on f067ac2, I assume the answer is definitely yes)
  • Do EXPLAIN and EXPLAIN ANALYZE give any useful insights about the new approach?
  • What indexes can I add to speed things up?

To run these tests I'll need to have both approaches available at the same time. That'll mean constructing the new tables without disrupting the old tables, for now, and I'll need to have both queries available too.

Measuring the old approach:

```shell
for freq in monthly hourly 8hourly weekly daily; do
  mysql -h127.0.0.1 -uroot -proot wikidot_notifier < <(echo "$(cat notifier/database/queries/cache_post_context.sql | sed 's/%(post_lower_timestamp_limit)s/1627277777/');$(cat notifier/database/queries/get_user_ids_for_frequency_with_notifications.sql | sed "s/%(frequency)s/\"$freq\"/")") > $freq.txt && echo "done $freq"
done
```
  1. 2m 12s
  2. 2m 25s
  3. 2m 31s

First test, with amendments to the cache table but no amendments to the selection query: completed in less than one second in all trials. However, some users are missing from the results, which is not acceptable.

@rossjrw (Member Author) commented Jan 23, 2024

Updating the old queries

Going to record any notes about updating queries here.

What I effectively need to do is, one by one for each query:

  1. Edit the query to use new tables, if it uses any of the old ones (thread, post, wiki, category, thread_first_post).
  2. Edit database methods that use this query to fit the new data format.
  3. Edit uses of those methods in the codebase to fit the new data format.

Once I'm done with that, maybe, this whole thing will be complete.

  • In principle, the old context tables (wiki and category) can be totally removed because the data is now going to be duplicated across rows of posts.
  • Hit check_post_exists.sql and I'm already thinking that I'm going about this the wrong way. There are probably lots of assumptions in the codebase, particularly around post acquisition, that are now incorrect. The changes I'm making can no longer be contained solely to the database layer. Maybe I need to get in there, bring the functionality up to date, and change the DB queries afterwards. Possibly I need to reevaluate which DB queries I even need; possibly, I should delete most of them and rewrite from scratch (most of them are really short, so that's not as bad as it sounds).
    • Example of a wrong assumption: check_post_exists.sql is called for each newly acquired post to check whether it's already in the database, presumably to skip it if so (I haven't verified this in the code). However, I now only want to store a post if it is notifiable. So, at bare minimum, this script should really be something like check_post_does_not_exist_and_is_notifiable.sql, or should drive some change in the codebase.

@rossjrw (Member Author) commented Jan 29, 2024

When should posts be checked for notifiability?

The migration from the old data structure to the new will bin all posts that aren't notifiable. But, right now, all posts are downloaded and stored exactly as naively as they were before.

The choice of when exactly to test for and purge posts that aren't notifiable is an important decision. There are some things to bear in mind:

  • Checking as early as possible, i.e. during the download stage, means we can skip unnecessary requests to Wikidot to get data for posts that will never be used.
  • Users will change their settings between runs, altering the set of posts that are considered notifiable for them.
  • Checking later in the run rather than earlier means notifications are dispatched earlier.
  • Checking later in the run rather than earlier risks the checks being cancelled if the lambda overruns.

The ideal scenario is to always be checking at all stages, as this will ensure that the database is always trimmed and no unnecessary requests are made. But is this always possible?

  • It might be expensive to check which posts are notifiable. (I don't know, I haven't done it yet.) It might be that I can only afford to do it in one batch once per run.
    • As part of the guiding axioms of this project, I consider requests to Wikidot to be expensive and requests to my database to be cheap. That being said, a long running query on my database is still pretty expensive.
  • It might only be possible to ascertain if a post is notifiable or not once all of its context has been acquired from the server; if this is the case, then it's impossible to cut down on the number of requests by skipping posts like this.
    • E.g. currently the only context that incurs an additional request is the thread context. That request yields data about the first post in the thread, including the username of the OP, which is necessary for determining whether a user is autosubscribed to it. Therefore, this request is required.

I like the idea of having one single method of establishing notifiability, and I like the idea of it being an idempotent operation which, if aborted, can be resumed by a later run with only a temporary loss in efficiency.

To this effect, the best way to go about this is to naively download and store all new posts. Then at the end of the run as part of the cleanup phase, any non-notifiable posts are removed. This will be performed in SQL using the algorithm outlined in the data structure migration script.
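That cleanup step can be sketched in miniature. This is a hypothetical Python sketch only - the real logic lives in SQL, using the algorithm from the data structure migration script - and every name here is illustrative, not the real schema:

```python
# Hypothetical sketch of the end-of-run cleanup phase: keep a stored post
# only if somebody would be notified about it. Running it twice in a row
# is a no-op the second time, which is the idempotency property wanted above.

def cleanup_non_notifiable(posts, is_notifiable):
    """Idempotently drop every stored post that nobody would be notified about."""
    return [post for post in posts if is_notifiable(post)]

# Toy notifiability rule standing in for the real subscription checks:
# a post is notifiable if it replies to some user's post.
posts = [
    {"id": 1, "parent_post_author": "alice"},
    {"id": 2, "parent_post_author": None},
]
kept = cleanup_non_notifiable(posts, lambda p: p["parent_post_author"] is not None)
```

If a run is aborted before this phase, the only cost is that the next run's cleanup has a little more to delete.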

@rossjrw (Member Author) commented Feb 1, 2024

Updating the digester

The digester fundamentally assumes that it's getting a list of thread replies and a list of post replies. Now it just gets a single list of posts, along with flags marking why each post is considered a notification.

The digester is going to need to be made more generic, but also I want to change its overall structure as little as possible because I don't want to deal with changing the translations.

I think what I'm going to try to do is implement this by only removing text from the English lexicon, where possible. Then I'll copy those changes to other languages once I'm done.

@rossjrw force-pushed the split-post-table branch 2 times, most recently from a3e9119 to 93ef5d2 (February 1, 2024 23:54)
@rossjrw (Member Author) commented Feb 5, 2024

Speed tests

Running a speed comparison again, AMD 7840U, 2023-06-07 dataset.

Reference: drop database wikidot_notifier, create database wikidot_notifier, mysql < test-dataset.sql, then

```shell
for freq in monthly hourly 8hourly weekly daily; do
  mysql -h127.0.0.1 -uroot -proot wikidot_notifier < <(echo "$(cat notifier/database/queries/cache_post_context.sql | sed 's/%(post_lower_timestamp_limit)s/1627277777/');$(cat notifier/database/queries/get_user_ids_for_frequency_with_notifications.sql | sed "s/%(frequency)s/\"$freq\"/")") > bchmrk-ref-$freq.txt && echo "done $freq"
done
```

Test: mysql < migrations/008.up.sql, then

```shell
for freq in monthly hourly 8hourly weekly daily; do
  mysql -h127.0.0.1 -uroot -proot wikidot_notifier < <(echo "$(cat notifier/database/queries/cache_notifiable_post_context.sql);$(cat notifier/database/queries/get_user_ids_for_frequency_with_notifications.sql | sed "s/%(frequency)s/\"$freq\"/")") > bchmrk-test-$freq.txt && echo "done $freq"
done
```

Compare results:

```shell
for freq in monthly hourly 8hourly weekly daily; do
  echo $freq
  diff bchmrk-ref-$freq.txt bchmrk-test-$freq.txt
done
```

Test: baseline

Reference @ main: 1m 17s
Test @ 7fef20c: ~0s

Results: Not accurate; test was missing users. Figured that's probably because the cutoff dates were different - service start timestamp vs last post minus 1y.


Test: added `AND post.posted_timestamp > ((SELECT MAX(posted_timestamp) FROM post) - (60 * 60 * 24 * 365))` to the main branch to match the filter on this branch

Reference @ main+1: 1m 1s
Test @ 7fef20c: ~0s.

Results: Fewer missing users than before. There are still 3 in the hourly channel and 7 in the daily channel. This feels very familiar - I'm sure I've gone through these exact tests. I should see if I can find those in order to work out where those missing users have gone. Failing that (which appears to be the case as of now), I should work it out from scratch.

I'm comparing the notifiable_post migration against the more recently written queries that pull notifiable posts/users from the database, and there are a few WHERE conditions in the migration that differ. In particular, users are filtered to only those with frequencies in a known set (which excludes Test and Never, for which some users do exist). Additionally, the old query attempts to avoid notifying users about posts they already responded to - but because the test didn't elicit any additional notifications, I don't believe this is a concern.

I need to:

  • Edit the migration to not do things that the cleanup task should be doing. Removing posts based on their timestamp and removing posts for users with invalid frequencies is not the migration's job.
  • Write the cleanup task.

Then amend the test procedure:

  1. Run the query on the reference data.
  2. Migrate the reference data to the test data.
  3. Run the query on the test data and compare accuracy.
  4. Run the cleanup on the test data.
  5. Run the query on the test data and compare speed.

New test @ 6f11aad: 1m 50s. Very surprised that it actually took longer! Added an index to speed it up.
New test @ 7327e8e: 1m 15s. About equivalent to prior performance now. Obviously it will be an order of magnitude smaller post-cleanup.

Accuracy @ 7327e8e: Pretty good! Just 2 extra results in the daily channel. I need to investigate why, but before I do, here's my guess: the new query resolves conflicting manual subscriptions differently. The old one gave weight to unsubscriptions; the new one just picks one in whatever order they come, and its behaviour in the case of conflicts is not defined. I think that's fine, but let's see if that's what's happening here.

  • They're both error users.
  • Neither have any manual subscriptions.
  • One has 1 notification, one has 3.
  • All the notifications are older than a year...

Oops! It's because I didn't re-run the reference without the year filter. There are actually no regressions from the migration alone. Yay!

Looks like the only 'regressions' caused by the delete script are the exact same two users as well.

@rossjrw (Member Author) commented Mar 14, 2024

Deleting posts

Checking for post deletion state is still something I need to do, to remove from my database posts that have been deleted in the remote.

But it's no longer enough to check a subset of posts for deletion. Previously I was iterating all posts that a weekly or monthly user would be notified about, which minimises the amount of work to do. The new priority is to minimise the absolute size of the database, and that means that I need to check all posts for deletion.

There are far, far fewer stored posts in the new structure, but still far too many to check on every run - even limited to infrequent runs.

So, I need some way to limit the number of posts that I check for deletion.

I figure I probably want to limit the check to some number of posts per run. The current deletion check takes a decent amount of time on average; I should check how many posts that is.

But I would also like to stagger when a post is checked for deletion. I'd like to check it frequently initially, as that's when a post is most likely to be deleted - probably trailing off after about 6 hours. I want to keep checking it after that, with the frequency of checks trailing off according to the post's age.

Initially I thought I'd need to keep track of how many times a post has been checked, store it along with the rest of the post, increment it over time and increase the timing of checks based on that. But I don't need to do that. If I know the post's age, I can define a formula to select ages for which a post should be checked.

E.g. I can define a list: check posts aged 0-6 hours old, 12 hours old, 24 hours, 48 hours etc. Provided that I round each age to an integer number of hours, and assuming the deletion runs successfully every hour, this should catch all posts with no stragglers.
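That schedule can be sketched as a predicate over a post's age in whole hours. The band values below are the illustrative ones from the example above (hourly for the first 6 hours, then 12h, 24h, 48h, doubling), not settled choices:

```python
def due_for_deletion_check(age_hours: int) -> bool:
    """Whether a post of the given age (rounded to a whole number of hours)
    should be checked for deletion on this run. Illustrative bands: every
    hour for the first 6 hours, then exactly at 12h, 24h, 48h, ... doubling."""
    if age_hours <= 6:
        return True
    band = 12
    while band <= age_hours:
        if band == age_hours:
            return True
        band *= 2
    return False
```

Provided ages are rounded consistently and the deletion run succeeds every hour, each post matches each band exactly once, so nothing slips through.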

There's still a risk that too many posts could be captured in one of these bands. It's not really a problem because I can just slap a limit on the query, but I don't much like the idea of some posts being masked and slipping through permanently. So it might actually help to track the number of times a post has been checked, and order by that. Still, I'd really like to avoid needing to do that if possible. It could be possible to define some ordering system that changes each time the deletion check happens. Honestly though, a random order would probably be fine.


It's also worth noting that I now need to check for context deletion too. The only contexts right now to which this applies are thread and parent post, which can both be acquired by a single request to the post.

Context deletion is actually a little more complex than it appears at first glance. If I want to delete a context, I also need to delete all posts associated with it. I then need to follow that up by deleting all contexts that now have zero posts associated with them.

Simplifying that, I can (unintuitively) skip the first step. I can just delete all posts with that context's ID registered, and then when I delete contexts that no longer have any posts associated with them, the context I wanted to delete in the first place will be wiped out too. I like this approach because it reduces duplication and effects an idempotent action (deleting unused context).
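A toy in-memory version of that two-step delete, with made-up field names (the real thing would be SQL against the context tables):

```python
def delete_context(context_id, posts, contexts):
    """Delete a context by (1) deleting every post that references it, then
    (2) sweeping all contexts that no longer have any posts. Step 2 alone
    removes the target context, so it never needs an explicit delete - and
    the sweep is idempotent, so it doubles as a general unused-context cleanup."""
    posts = [p for p in posts if p["context_id"] != context_id]
    live_context_ids = {p["context_id"] for p in posts}
    contexts = [c for c in contexts if c["id"] in live_context_ids]
    return posts, contexts

posts = [{"id": 1, "context_id": "t1"}, {"id": 2, "context_id": "t2"}]
contexts = [{"id": "t1"}, {"id": "t2"}]
posts, contexts = delete_context("t1", posts, contexts)
```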

I don't know what the best way to delete unused context is, though. I figure I'm going to need a query for each context, even if they're real similar. But is there anything I can do in the setup stage that would help? E.g. I could create a trigger that deletes contexts when a post is deleted. But I'd need to edit that trigger whenever I add a new context and that seems like something I don't need to learn how to do.

I'll brute force it using existing solutions in whatever is the easiest way.

I've decided that this is a pointless behaviour - personally I'd much rather have the guarantee that I'm going to be notified of a post. Otherwise a vigilant user might fear that the notifications system doesn't actually work.

There may well be users who depend on this behaviour, and I'm sure they'll pipe up.

If in the future I choose to reimplement this behaviour, I should run a migration to remove posts that are no longer relevant because of it.
Now also soft-deletes wikis instead of deleting their data on every run, preserving their contextual information for posts that request it. Unsure exactly where wiki_service_configured will come in handy.
The post existence check is only used when acquiring new posts, to check which ones are already known in order to skip them. This isn't necessary, because it's 60n times cheaper, where n is the number of configured wikis, to just get the latest stored timestamp and ignore anything that came earlier than it.
This means that there is no longer a page-by-page thread iterator in the codebase. That's truly a shame, because honestly this was one of my favourite functions I've ever written. But it is time for it to go.
@rossjrw (Member Author) commented Apr 21, 2024

I've come up against a wall about the interaction between post replies and thread unsubscriptions. The current behaviour is that a post reply will bypass a thread unsubscription, and will go directly to the user. I don't know if this is the correct approach or not.

Instinctively I'd like the new behaviour to be that a thread unsubscription blocks any notifications from the thread, post replies included.

I think I should probably check the FAQ to see how I describe the current behaviour, and that might give me some indication as to whether it's intentional...

> Automatic thread subscriptions are naïve, and err on the side of notifying you about too many replies rather than too few replies.

The documentation isn't specific about how the two kinds of manual unsubscriptions specifically interact with subscriptions; in fact throughout the documentation on the site, the ability to sub/unsub to/from a post is at best strongly implied but never specifically explained. That being said, the quoted line implies that it's more in line with notifier's philosophy to notify of replies even in unsubscribed threads.

So let's consider the use case for this feature. The following conditions are in play:

  1. There is a thread to which I am automatically subscribed. Therefore, at the time of writing, it must be a thread that I started. (For discussion pages, it is possible that I started it unintentionally.)
  2. I have unsubscribed from the thread. There are many reasons that I might do this but foremost is that I just don't care about it.
  3. Someone has replied directly to one of my posts - either the one I made as the OP, or another post I made later in the thread.

The distinction in that 3rd point is significant. If I made a post later on in the thread, it implies some level of engagement; to have unsubscribed after that implies that the unsubscription is very much intentional. In this case it would be better not to notify of a post reply. If I made only the OP and did not engage further, and I have unsubscribed, it probably means the thread isn't actually relevant to me at all, so I should not be notified of a post reply. Therefore I can justify removing this undocumented feature entirely.
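The decided precedence - a manual thread unsubscription blocks everything from that thread, post replies included - can be sketched as follows. Field names are illustrative, not the real schema:

```python
def is_notified(post, user):
    """New behaviour: a manual thread unsubscription wins over everything,
    blocking post replies as well as thread replies from that thread."""
    if post["thread_id"] in user["unsubscribed_thread_ids"]:
        return False
    is_post_reply = post["parent_post_author"] == user["name"]
    is_thread_reply = post["thread_id"] in user["subscribed_thread_ids"]
    return is_post_reply or is_thread_reply

user = {
    "name": "alice",
    "subscribed_thread_ids": {"t2"},
    "unsubscribed_thread_ids": {"t1"},
}
# A direct reply to alice in an unsubscribed thread is now suppressed...
reply_in_unsubbed = {"thread_id": "t1", "parent_post_author": "alice"}
# ...while a reply to her in any other thread still notifies her.
reply_elsewhere = {"thread_id": "t3", "parent_post_author": "alice"}
```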

@rossjrw (Member Author) commented Apr 23, 2024

Ran an integration test and no notifications were selected for me, despite me setting my timestamp to 0. Right now I suspect an error during the migration (are posts discarded for users on the test frequency?)


Reran the integration test and it looks pretty much fine :)

@rossjrw marked this pull request as ready for review, April 26, 2024 16:18
@rossjrw (Member Author) commented Apr 26, 2024

oh my god

[screenshot]

Graphs from CloudWatch:

[graphs]

DB size increased (????) no idea why - will give it a few runs and see if it cleans itself back down

Very very very surprised at the CPU usage going down. If that's consistent then I might be able to lower the DB server spec. Will give it a few runs to be sure.

@rossjrw merged commit 1e664c2 into main, Apr 27, 2024
1 check passed
@rossjrw deleted the split-post-table branch, April 27, 2024 18:24
Labels: optimisation (make an existing feature faster or smaller)
Successfully merging this pull request may close: Purge old data