
Split post table into notifiable posts and context #85

Merged · 67 commits into main · Apr 27, 2024
Conversation

@rossjrw (Member) commented Oct 4, 2023

Resolves #65.

  • Add migration that splits posts into notifiable posts and context, discarding other posts
  • Rewrite notification retrieval for new data structure
  • Verify no regressions in retrieved notifications
  • Verify that it's faster
  • Rewrite post lookup and download for new data structure
  • Remove non-notifiable posts - either as part of regular post download, or notification retrieval, or as a dedicated cleanup step
  • Rewrite everything that touches the post, thread, category, wiki, or thread_first_post tables to use the new context tables
  • Rewrite stored post cleanup to delete deleted posts and remove posts older than the threshold
  • Remove the bit in the digest that says how many things a user is autosubscribed to, as we no longer keep all of them (the 'still learning' part of the FAQ will need to be changed or removed too).

Requirements deemed out of scope for now:

@rossjrw (Member Author) commented Nov 10, 2023

For testing purposes using the local dataset, here are the user IDs that would be notified:

8hourly
1854139
5033208
5657010
8051787
8497838

daily
2277479
3628082
3695916
4187885
4754680
6361323
6477976
7267457
7566519
7666338
7785805
7823422
7857622
8116833
8140507
8164274
8166351
8481833
8546184

daily
2893766
3535580
4317225
7282391
7869175
8051387
8216454
8254640
8462856
8534171

monthly

weekly
3995762

I need to write the new data structure and scripts to query it. I need to answer the following questions:

  • How long does the old approach take?
  • How long does the new approach take, and is it faster?
  • The old approach uses a temporary cache table. Is the new approach faster with a cache table? (based on f067ac2, I assume the answer is definitely yes)
  • Do EXPLAIN and EXPLAIN ANALYZE give any useful insights about the new approach?
  • What indexes can I add to speed things up?

To run these tests I'll need to have both approaches available at the same time. That'll mean constructing the new tables without disrupting the old tables, for now, and I'll need to have both queries available too.

Measuring the old approach:

```shell
for freq in monthly hourly 8hourly weekly daily; do
  mysql -h127.0.0.1 -uroot -proot wikidot_notifier < <(echo "$(cat notifier/database/queries/cache_post_context.sql | sed 's/%(post_lower_timestamp_limit)s/1627277777/');$(cat notifier/database/queries/get_user_ids_for_frequency_with_notifications.sql | sed "s/%(frequency)s/\"$freq\"/")") > $freq.txt && echo "done $freq"
done
```
  1. 2m 12s
  2. 2m 25s
  3. 2m 31s

First test, with amendments to the cache table but no amendments to the selection query: completed in less than one second in all trials. However, some users are missing from the results, which is not acceptable.

@rossjrw (Member Author) commented Jan 23, 2024

Updating the old queries

Going to record any notes about updating queries here.

What I effectively need to do is, one by one for each query:

  1. Edit the query to use new tables, if it uses any of the old ones (thread, post, wiki, category, thread_first_post).
  2. Edit database methods that use this query to fit the new data format.
  3. Edit uses of those methods in the codebase to fit the new data format.

Once I'm done with that, maybe, this whole thing will be complete.

  • In principle, the old context tables (wiki and category) can be totally removed because the data is now going to be duplicated across rows of posts.
  • Hit check_post_exists.sql and I'm already thinking that I'm going about this the wrong way. There are probably lots of assumptions in the codebase, particularly around post acquisition, that are now incorrect. The changes I'm making can no longer be contained solely to the database layer. Maybe I need to get in there, bring the functionality up to date, and change the DB queries afterwards. Possibly I need to reevaluate which DB queries I even need; possibly, I should delete most of them and rewrite from scratch (most of them are really short, so that's not as bad as it sounds).
    • Example of a wrong assumption: check_post_exists.sql is called for each newly acquired post to check whether it's already in the database, presumably to skip it if so (I haven't verified this in the code). However, I now only want to store a post if it is notifiable. So, at bare minimum, this script should really be something like check_post_does_not_exist_and_is_notifiable.sql, or should drive some change in the codebase.

@rossjrw (Member Author) commented Jan 29, 2024

When should posts be checked for notifiability?

The migration from the old data structure to the new will bin all posts that aren't notifiable. But, right now, all posts are downloaded and stored exactly as naively as they were before.

The choice of when exactly to test for and purge posts that aren't notifiable is an important decision. There are some things to bear in mind:

  • Checking as early as possible, i.e. during the download stage, means we can skip unnecessary requests to Wikidot to get data for posts that will never be used.
  • Users will change their settings between runs, altering the set of posts that are considered notifiable for them.
  • Checking later in the run rather than earlier means notifications are dispatched earlier.
  • Checking later in the run rather than earlier risks the checks being cancelled if the lambda overruns.

The ideal scenario is to always be checking at all stages, as this will ensure that the database is always trimmed and no unnecessary requests are made. But is this always possible?

  • It might be expensive to check which posts are notifiable. (I don't know, I haven't done it yet.) It might be that I can only afford to do it in one batch once per run.
    • As part of the guiding axioms of this project, I consider requests to Wikidot to be expensive and requests to my database to be cheap. That being said, a long running query on my database is still pretty expensive.
  • It might only be possible to ascertain if a post is notifiable or not once all of its context has been acquired from the server; if this is the case, then it's impossible to cut down on the number of requests by skipping posts like this.
    • E.g. currently the only context that incurs an additional request is the thread context. That request yields data about the first post in the thread, including the username of the OP, which is necessary for determining whether a user is autosubscribed to it. Therefore, this request is required.

I like the idea of having one single method of establishing notifiability, and I like the idea of it being an idempotent operation which, if aborted, can be resumed by a later run with only a temporary loss in efficiency.

To this effect, the best way to go about this is to naively download and store all new posts. Then at the end of the run as part of the cleanup phase, any non-notifiable posts are removed. This will be performed in SQL using the algorithm outlined in the data structure migration script.
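That cleanup step can be sketched in miniature. This is a hypothetical Python sketch only - the real logic lives in SQL, using the algorithm from the data structure migration script - and every name here is illustrative, not the real schema:

```python
# Hypothetical sketch of the end-of-run cleanup phase: keep a stored post
# only if somebody would be notified about it. Running it twice in a row
# is a no-op the second time, which is the idempotency property wanted above.

def cleanup_non_notifiable(posts, is_notifiable):
    """Idempotently drop every stored post that nobody would be notified about."""
    return [post for post in posts if is_notifiable(post)]

# Toy notifiability rule standing in for the real subscription checks:
# a post is notifiable if it replies to some user's post.
posts = [
    {"id": 1, "parent_post_author": "alice"},
    {"id": 2, "parent_post_author": None},
]
kept = cleanup_non_notifiable(posts, lambda p: p["parent_post_author"] is not None)
```

If a run is aborted before this phase, the only cost is that the next run's cleanup has a little more to delete.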

@rossjrw (Member Author) commented Feb 1, 2024

Updating the digester

The digester fundamentally assumes that it's getting a list of thread replies and a list of post replies. Now it just gets a single list of posts, along with flags marking why each post is considered a notification.

The digester is going to need to be made more generic, but also I want to change its overall structure as little as possible because I don't want to deal with changing the translations.

I think what I'm going to try to do is implement this by only removing text from the English lexicon, where possible. Then I'll copy those changes to other languages once I'm done.

@rossjrw force-pushed the split-post-table branch 2 times, most recently from a3e9119 to 93ef5d2 (February 1, 2024 23:54)
@rossjrw (Member Author) commented Feb 5, 2024

Speed tests

Running a speed comparison again, AMD 7840U, 2023-06-07 dataset.

Reference: drop database wikidot_notifier, create database wikidot_notifier, mysql < test-dataset.sql, then

```shell
for freq in monthly hourly 8hourly weekly daily; do
  mysql -h127.0.0.1 -uroot -proot wikidot_notifier < <(echo "$(cat notifier/database/queries/cache_post_context.sql | sed 's/%(post_lower_timestamp_limit)s/1627277777/');$(cat notifier/database/queries/get_user_ids_for_frequency_with_notifications.sql | sed "s/%(frequency)s/\"$freq\"/")") > bchmrk-ref-$freq.txt && echo "done $freq"
done
```

Test: mysql < migrations/008.up.sql, then

```shell
for freq in monthly hourly 8hourly weekly daily; do
  mysql -h127.0.0.1 -uroot -proot wikidot_notifier < <(echo "$(cat notifier/database/queries/cache_notifiable_post_context.sql);$(cat notifier/database/queries/get_user_ids_for_frequency_with_notifications.sql | sed "s/%(frequency)s/\"$freq\"/")") > bchmrk-test-$freq.txt && echo "done $freq"
done
```

Compare results:

```shell
for freq in monthly hourly 8hourly weekly daily; do
  echo $freq
  diff bchmrk-ref-$freq.txt bchmrk-test-$freq.txt
done
```

Test: baseline

Reference @ main: 1m 17s
Test @ 7fef20c: ~0s

Results: Not accurate; test was missing users. Figured that's probably because the cutoff dates were different - service start timestamp vs last post minus 1y.


Test: added `AND post.posted_timestamp > ((SELECT MAX(posted_timestamp) FROM post) - (60 * 60 * 24 * 365))` to the main branch to match the filter on this branch

Reference @ main+1: 1m 1s
Test @ 7fef20c: ~0s.

Results: Fewer missing users than before. There are still 3 in the hourly channel and 7 in the daily channel. This feels very familiar - I'm sure I've gone through these exact tests. I should see if I can find those in order to work out where those missing users have gone. Failing that (which appears to be the case as of now), I should work it out from scratch.

I'm comparing the notifiable_post migration against the more recently written queries that pull notifiable posts/users from the database, and there are a few WHERE conditions in the migration that differ. In particular, users are filtered to only those with frequencies in a known set (which excludes Test and Never, for which some users do exist). Additionally, the old query attempts to avoid notifying users about posts they already responded to - but because the test didn't elicit any additional notifications, I don't believe this is a concern.

I need to:

  • Edit the migration to not do things that the cleanup task should be doing. Removing posts based on their timestamp and removing posts for users with invalid frequencies is not the migration's job.
  • Write the cleanup task.

Then amend the test procedure:

  1. Run the query on the reference data.
  2. Migrate the reference data to the test data.
  3. Run the query on the test data and compare accuracy.
  4. Run the cleanup on the test data.
  5. Run the query on the test data and compare speed.

New test @ 6f11aad: 1m 50s. Very surprised that it actually took longer! Added an index to speed it up.
New test @ 7327e8e: 1m 15s. About equivalent to prior performance now. Obviously it will be an order of magnitude smaller post-cleanup.

Accuracy @ 7327e8e: Pretty good! Just 2 extra results in the daily channel. I need to investigate why, but before I do, here's my guess: the new query resolves conflicting manual subscriptions differently. The old one gave weight to unsubscriptions; the new one just picks one in whatever order they come, and its behaviour in the case of conflicts is not defined. I think that's fine, but let's see if that's what's happening here.

  • They're both error users.
  • Neither have any manual subscriptions.
  • One has 1 notification, one has 3.
  • All the notifications are older than a year...

Oops! It's because I didn't re-run the reference without the year filter. There are actually no regressions from the migration alone. Yay!

Looks like the only 'regressions' caused by the delete script are the exact same two users as well.

@rossjrw (Member Author) commented Mar 14, 2024

Deleting posts

Checking for post deletion state is still something I need to do, to remove from my database posts that have been deleted in the remote.

But it's no longer enough to check a subset of posts for deletion. Previously I was iterating all posts that a weekly or monthly user would be notified about, which minimises the amount of work to do. The new priority is to minimise the absolute size of the database, and that means that I need to check all posts for deletion.

There are far, far fewer stored posts in the new structure, but still far too many to check on every run - even limited to infrequent runs.

So, I need some way to limit the number of posts that I check for deletion.

I figure I probably want to limit the check to some number of posts per run. The current deletion check takes a decent amount of time on average; I should check how many posts that is.

But I would also like to stagger when a post is checked for deletion. I'd like to check it frequently initially, as that's when a post is most likely to be deleted - probably trailing off after about 6 hours. I want to keep checking it after that, with the frequency of checks trailing off according to the post's age.

Initially I thought I'd need to keep track of how many times a post has been checked, store it along with the rest of the post, increment it over time and increase the timing of checks based on that. But I don't need to do that. If I know the post's age, I can define a formula to select ages for which a post should be checked.

E.g. I can define a list: check posts aged 0-6 hours old, 12 hours old, 24 hours, 48 hours etc. Provided that I round each age to an integer number of hours, and assuming the deletion runs successfully every hour, this should catch all posts with no stragglers.
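That schedule can be sketched as a predicate over a post's age in whole hours. The band values below are the illustrative ones from the example above (hourly for the first 6 hours, then 12h, 24h, 48h, doubling), not settled choices:

```python
def due_for_deletion_check(age_hours: int) -> bool:
    """Whether a post of the given age (rounded to a whole number of hours)
    should be checked for deletion on this run. Illustrative bands: every
    hour for the first 6 hours, then exactly at 12h, 24h, 48h, ... doubling."""
    if age_hours <= 6:
        return True
    band = 12
    while band <= age_hours:
        if band == age_hours:
            return True
        band *= 2
    return False
```

Provided ages are rounded consistently and the deletion run succeeds every hour, each post matches each band exactly once, so nothing slips through.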

There's still a risk that too many posts could be captured in one of these bands. It's not really a problem because I can just slap a limit on the query, but I don't much like the idea of some posts being masked and slipping through permanently. So it might actually help to track the number of times a post has been checked, and order by that. Still, I'd really like to avoid needing to do that if possible. It could be possible to define some ordering system that changes each time the deletion check happens. Honestly though, a random order would probably be fine.


It's also worth noting that I now need to check for context deletion too. The only contexts right now to which this applies are thread and parent post, which can both be acquired by a single request to the post.

Context deletion is actually a little more complex than it appears at first glance. If I want to delete a context, I also need to delete all posts associated with it. I then need to follow that up by deleting all contexts that now have zero posts associated with them.

Simplifying that, I can (unintuitively) skip the first step. I can just delete all posts with that context's ID registered, and then when I delete contexts that no longer have any posts associated with them, the context I wanted to delete in the first place will be wiped out too. I like this approach because it reduces duplication and effects an idempotent action (deleting unused context).
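A toy in-memory version of that two-step delete, with made-up field names (the real thing would be SQL against the context tables):

```python
def delete_context(context_id, posts, contexts):
    """Delete a context by (1) deleting every post that references it, then
    (2) sweeping all contexts that no longer have any posts. Step 2 alone
    removes the target context, so it never needs an explicit delete - and
    the sweep is idempotent, so it doubles as a general unused-context cleanup."""
    posts = [p for p in posts if p["context_id"] != context_id]
    live_context_ids = {p["context_id"] for p in posts}
    contexts = [c for c in contexts if c["id"] in live_context_ids]
    return posts, contexts

posts = [{"id": 1, "context_id": "t1"}, {"id": 2, "context_id": "t2"}]
contexts = [{"id": "t1"}, {"id": "t2"}]
posts, contexts = delete_context("t1", posts, contexts)
```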

I don't know what the best way to delete unused context is, though. I figure I'm going to need a query for each context, even if they're real similar. But is there anything I can do in the setup stage that would help? E.g. I could create a trigger that deletes contexts when a post is deleted. But I'd need to edit that trigger whenever I add a new context and that seems like something I don't need to learn how to do.

I'll brute force it using existing solutions in whatever is the easiest way.

I've decided that this is a pointless behaviour - personally I'd much rather have the guarantee that I'm going to be notified of a post. Otherwise a vigilant user might fear that the notifications system doesn't actually work.

There may well be users who depend on this behaviour, and I'm sure they'll pipe up.

If in the future I choose to reimplement this behaviour, I should run a migration to remove posts that are no longer relevant because of it.
Now also soft-deletes wikis instead of deleting their data on every run, preserving their contextual information for posts that request it. Unsure exactly where wiki_service_configured will come in handy.
The post existence check is only used when acquiring new posts, to check which ones are already known in order to skip them. This isn't necessary, because it's 60n times cheaper, where n is the number of configured wikis, to just get the latest stored timestamp and ignore anything that came earlier than it.
This means that there is no longer a page-by-page thread iterator in the codebase. That's truly a shame, because honestly this was one of my favourite functions I've ever written. But it is time for it to go.
@rossjrw (Member Author) commented Apr 21, 2024

I've come up against a wall about the interaction between post replies and thread unsubscriptions. The current behaviour is that a post reply will bypass a thread unsubscription, and will go directly to the user. I don't know if this is the correct approach or not.

Instinctively I'd like the new behaviour to be that a thread unsubscription blocks any notifications from the thread, post replies included.

I think I should probably check the FAQ to see how I describe the current behaviour, and that might give me some indication as to whether it's intentional...

> Automatic thread subscriptions are naïve, and err on the side of notifying you about too many replies rather than too few replies.

The documentation isn't specific about how the two kinds of manual unsubscriptions specifically interact with subscriptions; in fact throughout the documentation on the site, the ability to sub/unsub to/from a post is at best strongly implied but never specifically explained. That being said, the quoted line implies that it's more in line with notifier's philosophy to notify of replies even in unsubscribed threads.

So let's consider the use case for this feature. The following conditions are in play:

  1. There is a thread to which I am automatically subscribed. Therefore, at the time of writing, it must be a thread that I started. (For discussion pages, it is possible that I started it unintentionally.)
  2. I have unsubscribed from the thread. There are many reasons that I might do this but foremost is that I just don't care about it.
  3. Someone has replied directly to one of my posts - either the one I made as the OP, or another post I made later in the thread.

The distinction in that 3rd point is significant. If I made a post later on in the thread, it implies some level of engagement; to have unsubscribed after that implies that the unsubscription is very much intentional. In this case it would be better not to notify of a post reply. If I made only the OP and did not engage further, and I have unsubscribed, it probably means the thread isn't actually relevant to me at all, so I should not be notified of a post reply. Therefore I can justify removing this undocumented feature entirely.
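The decided precedence - a manual thread unsubscription blocks everything from that thread, post replies included - can be sketched as follows. Field names are illustrative, not the real schema:

```python
def is_notified(post, user):
    """New behaviour: a manual thread unsubscription wins over everything,
    blocking post replies as well as thread replies from that thread."""
    if post["thread_id"] in user["unsubscribed_thread_ids"]:
        return False
    is_post_reply = post["parent_post_author"] == user["name"]
    is_thread_reply = post["thread_id"] in user["subscribed_thread_ids"]
    return is_post_reply or is_thread_reply

user = {
    "name": "alice",
    "subscribed_thread_ids": {"t2"},
    "unsubscribed_thread_ids": {"t1"},
}
# A direct reply to alice in an unsubscribed thread is now suppressed...
reply_in_unsubbed = {"thread_id": "t1", "parent_post_author": "alice"}
# ...while a reply to her in any other thread still notifies her.
reply_elsewhere = {"thread_id": "t3", "parent_post_author": "alice"}
```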

@rossjrw (Member Author) commented Apr 23, 2024

Ran an integration test and no notifications were selected for me, despite me setting my timestamp to 0. Right now I suspect an error during the migration (are posts discarded for users on the test frequency?)


Reran the integration test and it looks pretty much fine :)

@rossjrw marked this pull request as ready for review, April 26, 2024 16:18
@rossjrw (Member Author) commented Apr 26, 2024

oh my god

[screenshot]

Graphs from CloudWatch:

[graphs]

DB size increased (????) no idea why - will give it a few runs and see if it cleans itself back down

Very very very surprised at the CPU usage going down. If that's consistent then I might be able to lower the DB server spec. Will give it a few runs to be sure.

@rossjrw merged commit 1e664c2 into main, Apr 27, 2024
1 check passed
@rossjrw deleted the split-post-table branch, April 27, 2024 18:24
Labels: optimisation (make an existing feature faster or smaller)
Successfully merging this pull request may close: Purge old data