Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PostRepository::getIndexForNumber() walks 2 time the posts table to get offset #3895

Open
DavidKeller opened this issue Oct 7, 2023 · 6 comments
Labels

Comments

@DavidKeller
Copy link

DavidKeller commented Oct 7, 2023

Current Behavior

When showing discussion with a large number of posts, flarum pressures the database when calling

    public function getIndexForNumber(int $discussionId, int $number, ?User $actor = null): int
    {
        if (! ($discussion = Discussion::find($discussionId))) {
            return 0;
        }

        $query = $discussion->posts()
            ->whereVisibleTo($actor)
            ->where('created_at', '<', function ($query) use ($discussionId, $number) {
                $query->select('created_at')
                    ->from('posts')
                    ->where('discussion_id', $discussionId)
                    ->whereNotNull('number')
                    ->take(1)

                    // We don't add $number as a binding because for some
                    // reason doing so makes the bindings go out of order.
                    ->orderByRaw('ABS(CAST(number AS SIGNED) - '.(int) $number.')');
            });

        return $query->count();
    }

with the following SQL request:

SELECT count(*) AS AGGREGATE
FROM `posts`
WHERE `posts`.`discussion_id` = 3096
  AND `posts`.`discussion_id` IS NOT NULL
  AND EXISTS
    (SELECT 1
     FROM `discussions`
     WHERE `discussions`.`id` = `posts`.`discussion_id`
       AND (`discussions`.`id` not in
              (SELECT `discussion_id`
               FROM `discussion_tag`
               WHERE `tag_id` not in
                   (SELECT `tags`.`id`
                    FROM `tags`
                    WHERE (`tags`.`id` in
                             (SELECT `perm_tags`.`id`
                              FROM `tags` AS `perm_tags`)
                           AND (`tags`.`parent_id` in
                                  (SELECT `perm_tags`.`id`
                                   FROM `tags` AS `perm_tags`)
                                OR `tags`.`parent_id` IS NULL)))))
       AND (`discussions`.`is_private` = 0))
  AND (`posts`.`is_private` = 0)
  AND `created_at` <
    (SELECT `created_at`
     FROM `posts`
     WHERE `discussion_id` = 3096
       AND `number` IS NOT NULL
     ORDER BY ABS(CAST(number AS SIGNED) - 31224)
     LIMIT 1)
  AND `type` in ('discussionTagged',
                 'discussionStickied',
                 'discussionLocked',
                 'comment',
                 'discussionRenamed');

which results in the following execution plan on MariaDB:

+------+--------------------+----------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------------+-------+-----------------------------+
| id   | select_type        | table          | type            | possible_keys                                                                                                                                        | key                                  | key_len | ref              | rows  | Extra                       |
+------+--------------------+----------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------------+-------+-----------------------------+
|    1 | PRIMARY            | discussions    | const           | PRIMARY                                                                                                                                              | PRIMARY                              | 4       | const            | 1     |                             |
|    1 | PRIMARY            | posts          | ref             | posts_discussion_id_number_unique,posts_discussion_id_number_index,posts_discussion_id_created_at_index,posts_type_index,posts_type_created_at_index | posts_discussion_id_number_unique    | 4       | const            | 66770 | Using where                 |
|    7 | SUBQUERY           | posts          | ref             | posts_discussion_id_number_unique,posts_discussion_id_number_index,posts_discussion_id_created_at_index                                              | posts_discussion_id_created_at_index | 4       | const            | 66770 | Using where; Using filesort |
|    3 | DEPENDENT SUBQUERY | discussion_tag | index_subquery  | PRIMARY                                                                                                                                              | PRIMARY                              | 4       | func             | 1     | Using where                 |
|    4 | DEPENDENT SUBQUERY | tags           | eq_ref          | PRIMARY,tags_parent_id_foreign,tags_last_posted_user_id_foreign,tags_last_posted_discussion_id_foreign                                               | PRIMARY                              | 4       | func             | 1     | Using where                 |
|    4 | DEPENDENT SUBQUERY | perm_tags      | eq_ref          | PRIMARY,tags_parent_id_foreign,tags_last_posted_user_id_foreign,tags_last_posted_discussion_id_foreign                                               | PRIMARY                              | 4       | forum_v4.tags.id | 1     |                             |
|    6 | DEPENDENT SUBQUERY | perm_tags      | unique_subquery | PRIMARY                                                                                                                                              | PRIMARY                              | 4       | func             | 1     |                             |
+------+--------------------+----------------+-----------------+------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+---------+------------------+-------+-----------------------------+

As you can see on the rows column from the last snippet: the posts table is scanned two times in order to retrieve only an offset.

Steps to Reproduce

Create a discussion with a large number of posts (> 50K).
Display it.

Expected Behavior

Expect the display duration to be agnostic of the number of posts within the discussion.

Screenshots

No response

Environment

  • Flarum version: 1.8.2
  • Website URL: https://dev.lesitedecuisine.fr
  • Webserver: nginx
  • Hosting environment: dedicated environment
  • PHP version: 8.1.24
  • Mysql: 11.1.2
  • Browser: Firefox 118.0.1

Output of php flarum info

Flarum core: 1.8.2
PHP version: 8.1.24
MySQL version: 11.1.2-MariaDB-1:11.1.2+maria~ubu2204-log
Loaded extensions: Core, date, libxml, openssl, pcre, sqlite3, zlib, ctype, curl, dom, fileinfo, filter, ftp, hash, iconv, json, mbstring, SPL, session, PDO, pdo_sqlite, standard, posix, readline, Reflection, Phar, SimpleXML, tokenizer, xml, xmlreader, xmlwriter, mysqlnd, gd, pdo_mysql, sodium, zip, Zend OPcache
+-------------------------------+---------+--------+
| Flarum Extensions             |         |        |
+-------------------------------+---------+--------+
| ID                            | Version | Commit |
+-------------------------------+---------+--------+
| flarum-tags                   | v1.8.0  |        |
| migratetoflarum-old-passwords | 1.0.0   |        |
| fof-nightmode                 | 1.5.3   |        |
| flarum-suspend                | v1.8.1  |        |
| flarum-subscriptions          | v1.8.0  |        |
| flarum-sticky                 | v1.8.0  |        |
| flarum-nicknames              | v1.8.0  |        |
| flarum-mentions               | v1.8.2  |        |
| flarum-markdown               | v1.8.0  |        |
| flarum-lock                   | v1.8.0  |        |
| flarum-likes                  | v1.8.0  |        |
| flarum-lang-english           | v1.8.0  |        |
| flarum-flags                  | v1.8.0  |        |
| flarum-emoji                  | v1.8.0  |        |
| flarum-bbcode                 | v1.8.0  |        |
| asile-ninja                   | 1.0.0   |        |
+-------------------------------+---------+--------+
Base URL: https://dev.lesitedecuisine.fr
Installation path: /var/www/dev.lesitedecuisine.fr
Queue driver: sync
Session driver: file
Mail driver: mailgun
Debug mode: off

Possible Solution

Indexes ?

Additional Context

_No response_je

@DavidKeller
Copy link
Author

I wonder why flarum introduced the number column on the post table as it can't be trusted because posts can be deleted and there is already the id/created_at attributes to provide post ordering ?

@DavidKeller DavidKeller changed the title PostRepository::getIndexForNumber() walks 2 time the posts table unindexed PostRepository::getIndexForNumber() walks 2 time the posts table to get offset Oct 7, 2023
@Spencer-Robertson-ANU
Copy link

Hi there, this looks like something I might be able to fix, could I please be assigned? Thankyou

@DavidKeller
Copy link
Author

DavidKeller commented Oct 13, 2023

As a side note, my proposed solution was to add indexes, but I didn't understood at the time the database structure and the execution plan output.

If I understand correctly, Mysql is already using the following two indexes:

  • posts_discussion_id_number_unique
  • posts_discussion_id_created_at_index

to skip unrelated posts, that's why the whole posts table isn't walked.
Still, quite a lot of posts are walked.

This function seems to be called in order to retrieve the current offset of unread post within a discussion.

I wonder if it might be preferable to save for each discussion/user combination:

  1. the post's id of the latest seen post (in order to start the fetch cursor from it)
  2. it's associated number (in order to compute unread posts count)

And replace deleted posts with a message flagging the post as such in order to keep the following posts number coherent ? (deletion becomes O(log P) where P is the total posts count)

OR

Each time a post is deleted, for each discussion/user combination, decrease the associated number if deleted post id is >= discussion/user combination associated id (deletion becomes O(C + log P) where C is the number of combination)

@DavidKeller
Copy link
Author

@Spencer-Robertson-ANU How is it going ?

@luceos
Copy link
Member

luceos commented Mar 2, 2024

A few comments here. From mobile so I might leave out or miss a few details.

First of we don't do assignment of issues unless you are a core developer. You could have provided a proposed solution and a pull request for us to review.

Secondly. Storing the post Id seems like a valid solution, but post Ids aren't sequences. I think introducing this into the last read state of a user would be detrimental and updating all user states upon deletion will be a big hit on performance for communities with millions of users.

Thirdly, thank you for the amazing amount of research on this topic @DavidKeller. I do feel this needs our attention, at least for 2.x.

@DavidKeller
Copy link
Author

DavidKeller commented Mar 7, 2024

In our case, we don't have that many users, but big topics where the current design causes delays up to multiple seconds.
Because of this, we haven't migrated to flarum yet.

But I get your point.

What about the first proposal, that is not really deleting post, but flagging them as deleted instead in order to keep a valid number ?
In a perfect world, all requests against database should be O(Log N) or O(1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants