
Investigate performance of TRSS #3354

Closed
smlambert opened this issue Jan 25, 2024 · 9 comments

@smlambert
Contributor

As per adoptium/temurin#13 (comment), the performance of TRSS was sub-par in the January release. It needs investigation, as it has degraded compared to before the release period (it was too slow at filling out results info on the release pipelines to be useful during the release).

It is also very slow to load the main page and to generate the release summary report.

@sxa
Member

sxa commented Jan 25, 2024

FYI @Haroon-Khel since you dealt with #3335
The mongodb process is chewing up 2 CPU cores almost continuously at the moment.
The mongo database is 7.2GB in size and is on a 30GB filesystem, however /var/lib/docker currently has only 3GB free (after another docker system prune) on the 50GB filesystem. It's not obvious where the space has gone given this output - the container disk usage is very small, so nowhere near using 50GB:

# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          4         4         3.112GB   1.092GB (35%)
Containers      4         4         97.62MB   0B (0%)
Local Volumes   6         1         0B        0B
Build Cache     25        0         0B        0B

I'd quite like to try a complete shutdown of the docker service if we can schedule a time to do that, to see if that frees up the "lost" space. Let me know when would be appropriate.
Also noting that the current TRSS container images were generated two weeks ago, which may be consistent with when we started experiencing issues.
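
For reference, a rough sketch of how I'd chase down both the CPU usage and the missing space (assumes root on the docker host; paths and output will vary):

# Where under /var/lib/docker has the space gone?
du -h --max-depth=2 /var/lib/docker | sort -h | tail -n 15

# Which container is chewing the CPU?
docker stats --no-stream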

@sxa
Member

sxa commented Jan 29, 2024

Issued npm run docker-down but got the following message:

! Network aqa-test-tools_default  Resource is still in use

Followed that up with an explicit docker stop of the containers, but that did not result in any additional space being freed up in /var/lib/docker.
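
For reference, the sort of manual cleanup that can be tried when compose reports the network as still in use (the network name is the one from the error above; the container IDs are placeholders for whatever docker ps reports):

# See which containers are still attached to the network
docker network inspect aqa-test-tools_default

# Stop and remove whatever is left, then remove the network itself
docker ps -a
docker stop <container_id>
docker rm <container_id>
docker network rm aqa-test-tools_default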

@sxa
Member

sxa commented Jan 29, 2024

OK ... so it looks like the mongodb container has a log file on the host (outside the container, under `/var/lib/docker/containers/<mongo_container_uuid>`). This was 47GB and was filling up the docker filesystem.

Unclear if it's directly related, but I've gzipped it, shut down the server and restarted it. The log file is still increasing in size, so hopefully we'll be able to get some attention on this from the TRSS developers to understand the problem before it causes trouble again. I have pinged @llxia for advice in the AQA slack channel. I have started the processes under nohup, and the output from npm run docker is also producing a lot of similar debug output (including what looks like the full parameters from the jenkins jobs it's monitoring), so we may have significant space use in two locations.
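
Not something we have applied yet, but since docker's default json-file logging driver grows without bound, one possible mitigation (container id and rotation sizes below are illustrative only) would be:

# One-off: reclaim the space without deleting the still-open log file
truncate -s 0 /var/lib/docker/containers/<mongo_container_uuid>/<mongo_container_uuid>-json.log

# Longer term: cap the json-file driver when the container is (re)created,
# e.g. the equivalent of these docker run options in the compose setup
docker run --log-opt max-size=100m --log-opt max-file=3 ... mongo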

@smlambert Can you confirm if the performance issues you have described are now resolved by restarting the docker subsystem?

@sxa sxa self-assigned this Jan 29, 2024
@smlambert
Contributor Author

We should look at what level the DB profiler is running at. The MongoDB documentation says the profiler is off by default, but we should verify that, and ensure the server is started with it off or at level 1 with a filter so that it is only moderately active.
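
A quick way to check this (a sketch, assuming the mongo container is named mongodb and ships mongosh - older images may only have the legacy mongo shell - and that the commands are run against the relevant TRSS database):

# Check the current profiler level and slow-query threshold
docker exec -it mongodb mongosh --eval 'db.getProfilingStatus()'

# Turn the profiler off, or run it at level 1 with a high slowms threshold
docker exec -it mongodb mongosh --eval 'db.setProfilingLevel(0)'
docker exec -it mongodb mongosh --eval 'db.setProfilingLevel(1, { slowms: 200 })'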

@sxa
Member

sxa commented Jan 31, 2024

@Haroon-Khel Can I ask you to take a look at this please, since you've done more work on the setup and configuration of this server under #trss?

@Haroon-Khel
Contributor

Seeing as we are nearing the release, I propose we increase the number of cores on the machine to see if this helps. It's not a definitive solution, but it is one that may improve performance and thereby help with triage, etc.

@sxa
Member

sxa commented Apr 11, 2024

I propose we should increase the number of cores on the machine to see if this helps

Are we currently seeing high CPU load on the server that would be alleviated by this? The throttling I put in place at the nginx level should have eliminated some of the problems with the client requests. Fixes have gone into TRSS to resolve that now, although they have not been deployed on our server yet.
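
(As a rough check that the nginx throttling is actually biting, a loop like the one below can be run against the server; the URL is only a placeholder for whichever endpoint the dashboard hits, and throttled requests should come back as 429 or 503 depending on how limit_req_status is configured.)

# Fire a burst of requests and watch the status codes
for i in $(seq 1 50); do curl -s -o /dev/null -w "%{http_code}\n" https://trss.example.org/api/<endpoint>; done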

Ref:

@smlambert
Contributor Author

smlambert commented Apr 11, 2024

although they have not been deployed on our server yet.

They have been deployed on our server by my manually running the sync job on the server (related: adoptium/aqa-test-tools#856 (comment)). Noting that the rate limiting could/should now be adjusted.

@smlambert
Contributor Author

Performance has improved via the several fixes linked in the issues above. The last thing to do is to get the sync script running regularly, which is tracked under adoptium/aqa-test-tools#856.
