
High CPU usage on MongoDB due to slow request when using a huge Queue #4

Open
Zer0-Tolerance opened this issue Jun 10, 2019 · 18 comments

Comments

@Zer0-Tolerance

Hi David,

Great module, very useful for handling a very large queue, but it seems to have an issue with collections of more than a million documents.
Here is the result of the slow queries, i.e. > 3 secs:
db.currentOp({"secs_running": {$gte: 3}})

/* 1 */
{
    "inprog" : [
        {
            "host" : "recon.local:27017",
            "desc" : "conn1096",
            "connectionId" : 1096,
            "client" : "127.0.0.1:51081",
            "clientMetadata" : {
                "driver" : {
                    "name" : "MongoDB Perl Driver",
                    "version" : "2.0.3"
                },
                "os" : {
                    "type" : "darwin"
                },
                "platform" : "Perl v5.24.4 darwin-thread-multi-2level"
            },
            "active" : true,
            "currentOpTime" : "2019-06-10T17:43:56.835+0200",
            "opid" : 18276052,
            "secs_running" : NumberLong(5),
            "microsecs_running" : NumberLong(5256346),
            "op" : "command",
            "ns" : "NScan.hostq",
            "command" : {
                "aggregate" : "hostq",
                "pipeline" : [
                    {
                        "$match" : {
                            "_r" : {
                                "$exists" : false
                            }
                        }
                    },
                    {
                        "$group" : {
                            "n" : {
                                "$sum" : 1
                            },
                            "_id" : null
                        }
                    }
                ],
                "cursor" : {},
                "$readPreference" : {
                    "mode" : "secondaryPreferred"
                },
                "$db" : "NScan"
            },
            "planSummary" : "COLLSCAN",
            "numYields" : 37202,
            "locks" : {},
            "waitingForLock" : false,
            "lockStats" : {
                "Global" : {
                    "acquireCount" : {
                        "r" : NumberLong(74408)
                    }
                },
                "Database" : {
                    "acquireCount" : {
                        "r" : NumberLong(37204)
                    }
                },
                "Collection" : {
                    "acquireCount" : {
                        "r" : NumberLong(37204)
                    }
                }
            }
        },
        ...
    ],
    "ok" : 1.0
}

After some testing, it's the $group stage in the aggregation that is very slow:

db.getCollection('hostq').aggregate([ {$match: {"_r" : {"$exists" : false}}} ]) -> 0.037 secs
db.getCollection('hostq').aggregate([ {$match: {"_r" : {"$exists" : false}}}, {"$group" : {"n" : {"$sum" : 1},"_id" : null}} ]) -> 13 secs

Any chance you can fix this and update the module?

@xdg
Contributor

xdg commented Jun 10, 2019

Hi, that should be fixable with an appropriate index. If you add an index {"_r":1, "_p":1}, does that get rid of the slow query? (Also, can you confirm whether you have an existing index on "_p"?)

@Zer0-Tolerance
Author

Already checked that and added an index on _r, but it's still very slow ...
"indexSizes" : { "_id_" : 226652160, "_p_1" : 46321664, "_r_1" : 21610496 },

Any other ideas? I'm using Mongo 3.6.

@xdg
Contributor

xdg commented Jun 10, 2019

I was suggesting a joint index on _r and _p, with _r first, following the rule of "equality, sort, range" for indexes. That should do better than an individual index as it should avoid the collection scan (COLLSCAN).
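For reference, a shell sketch of the suggested compound index and how to check the resulting plan (collection and field names taken from the report above; the exact explain output shape varies by server version):

```
// Compound index with the equality field (_r) first, per "equality, sort, range"
db.getCollection('hostq').createIndex({ "_r" : 1, "_p" : 1 })

// Ask the planner how it would run the counting pipeline
db.getCollection('hostq').explain().aggregate([
    { "$match" : { "_r" : { "$exists" : false } } },
    { "$group" : { "_id" : null, "n" : { "$sum" : 1 } } }
])
// Look for IXSCAN vs COLLSCAN in the winningPlan
```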

@Zer0-Tolerance
Author

Just tried the compound index, but same result: still 3 secs with 2.6 million docs.

@xdg
Contributor

xdg commented Jun 11, 2019

Ok, I'll have to run some experiments locally, but I'm slammed at work and then on vacation next week (which is why I'm slammed this week). So please be patient, but I will look into it. This should be possible without a collection scan.

@xdg
Contributor

xdg commented Jun 11, 2019

Looking again at your report, nothing in MongoDBx::Queue is issuing an aggregate or a $group pipeline stage -- so I think this must be some other query you're running outside MongoDBx::Queue.

I also found SERVER-29444 for improving performance of $group based on indexes. You might find some insight there and in linked tickets. For example, it's possible that indexing on the group field and sorting it first could help, but that's getting outside my area of expertise.

@Zer0-Tolerance
Author

I've just rechecked, but I'm not running any queries on the collection used by MongoDBx::Queue, so maybe it's coming from MongoDBx::Queue itself?

@xdg
Contributor

xdg commented Jun 11, 2019

Oh, I see now. That's the new underlying query generated by count_documents. Great! Now I know what to investigate.

@xdg
Contributor

xdg commented Jun 11, 2019

I take it you're calling waiting -- that would be the origin of the slow query. In the meantime, if you can avoid that call, you'll see performance improve.

@Zer0-Tolerance
Author

True, I've changed to size and it's better, but the underlying queries are using COLLSCAN, which is surprising given that we get better results than with waiting, which uses IXSCAN (an index).
Nevertheless, I'm still getting 500% CPU usage, but it's not spinning up the CPU fan anymore ...

@xdg
Contributor

xdg commented Jun 12, 2019

There's a design problem in that the "_r" field is created and removed, and that makes it hard to index -- even a sparse index doesn't always get picked up properly by the query planner. The fix would be to change the internals to always create "_r" with some value like 0 meaning "not reserved"; then indexing would work correctly. The problem is that's a breaking change, so it's a major version bump and I'd need to provide some way for people to upgrade their old data, too. That's what I'm currently thinking through.
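A shell sketch of the difference (the field values here are illustrative; the actual v2 schema may differ):

```
// Current schema: "_r" is absent until a task is reserved,
// so the queue has to match on absence, which the planner handles poorly:
db.getCollection('hostq').find({ "_r" : { "$exists" : false } })

// Proposed schema: "_r" always present, with 0 meaning "not reserved".
// A plain equality predicate can be served directly from an index:
db.getCollection('hostq').createIndex({ "_r" : 1, "_p" : 1 })
db.getCollection('hostq').find({ "_r" : 0 })
```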

@Zer0-Tolerance
Author

I fully understand your problem but it would be good to get better performance :)
Let me know if you need some beta testing.

@Zer0-Tolerance
Author

Any progress ?

@xdg
Contributor

xdg commented Jun 19, 2019

A bit. Conceptually, I'm going to rewrite MongoDBx::Queue as a facade in front of two implementation classes. Default will be legacy, but users can opt into v2. Then migration is just a matter of reserving tasks from the old queue and inserting them into a new, v2, queue (separate collection). That should avoid surprising users on update and allow users control over migration -- whether to shut down all old work processing and migrate or whether to keep old running while migrating to a new queue.

I'm not getting a lot of time to hack on it around family obligations, but at least I now have a plan and will keep chipping away. I'm working on migrating the legacy code to the facade model, and when that passes testing, I'll work on v2. Worst case, I'm back from vacation next week and can bang something out then.

@xdg
Contributor

xdg commented Jun 19, 2019

Here's a test tarball: https://www.dropbox.com/s/ypl3ia5v1ze92e2/MongoDBx-Queue-2.001-TRIAL.tar.gz?dl=1

I did some local testing (at 100k documents) and a query comparable to waiting uses an index rather than a collection scan.

You'll need to migrate/copy your jobs from the v1 queue to another collection/database name for the v2 queue (using version => 2 in the queue options) and test there. That's not well documented yet.

@Zer0-Tolerance
Author

Great, thanks! I'll do some tests this weekend and let you know.

@Zer0-Tolerance
Author

Zer0-Tolerance commented Jun 25, 2019

Hi David,
I still have some slow queries like this one with a very big queue:

{
    "inprog" : [
        {
            "host" : "recon.home:27017",
            "desc" : "conn14",
            "connectionId" : 14,
            "client" : "127.0.0.1:49381",
            "clientMetadata" : {
                "driver" : {
                    "name" : "MongoDB Perl Driver",
                    "version" : "2.0.3"
                },
                "os" : {
                    "type" : "darwin"
                },
                "platform" : "Perl v5.24.4 darwin-thread-multi-2level"
            },
            "active" : true,
            "currentOpTime" : "2019-06-25T07:55:50.672+0200",
            "opid" : 87194,
            "secs_running" : NumberLong(2),
            "microsecs_running" : NumberLong(2856433),
            "op" : "command",
            "ns" : "NScan.hostq",
            "command" : {
                "aggregate" : "hostq",
                "pipeline" : [ 
                    {
                        "$match" : {
                            "_x" : false
                        }
                    }, 
                    {
                        "$group" : {
                            "_id" : null,
                            "n" : {
                                "$sum" : 1
                            }
                        }
                    }
                ],
                "cursor" : {},
                "$readPreference" : {
                    "mode" : "secondaryPreferred"
                },
                "$db" : "NScan"
            },
            "planSummary" : "COUNT_SCAN { _x: 1, _p: 1 }",
            "numYields" : 25632,
            "locks" : {},
            "waitingForLock" : false,
            "lockStats" : {
                "Global" : {
                    "acquireCount" : {
                        "r" : NumberLong(51268)
                    }
                },
                "Database" : {
                    "acquireCount" : {
                        "r" : NumberLong(25634)
                    }
                },
                "Collection" : {
                    "acquireCount" : {
                        "r" : NumberLong(25634)
                    }
                }
            }
        }
    ],
    "ok" : 1.0
}

@xdg
Contributor

xdg commented Jun 25, 2019

I've discussed this with the query team. The COUNT_SCAN plan is the best the database is going to do -- it's walking the index and incrementing a counter without touching any underlying data and you've just got a lot of things to count. I suspect the only way to make it faster is to scale up the hardware the database is running on.
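If an exact filtered count isn't required, the collection metadata count is effectively free, at the cost of ignoring the filter entirely (shell sketch against the collection from the report; db.collection.stats() is available on MongoDB 3.6):

```
// Exact filtered count (what the driver issues): walks the whole index
// entry-by-entry (COUNT_SCAN), so cost grows with queue size
db.getCollection('hostq').aggregate([
    { "$match" : { "_x" : false } },
    { "$group" : { "_id" : null, "n" : { "$sum" : 1 } } }
])

// Approximate, unfiltered total from collection metadata: near-constant
// time, but cannot distinguish reserved from unreserved tasks
db.getCollection('hostq').stats().count
```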
