
Investigate performance of TRSS #3354

Closed
smlambert opened this issue Jan 25, 2024 · 9 comments

@smlambert
Contributor

As per adoptium/temurin#13 (comment), the performance of TRSS was sub-par in the January release. It needs investigation, as it has degraded compared to before the release period (it was too slow at filling out results info on the release pipelines to be useful during the release).

It is also very slow to load the main page and to generate the release summary report.

@sxa
Member

sxa commented Jan 25, 2024

FYI @Haroon-Khel since you dealt with #3335
The mongodb process is chewing up 2 CPU cores almost continuously at the moment.
The mongo database is 7.2GB in size and is on a 30GB filesystem, however /var/lib/docker currently has only 3GB free (after another docker system prune) on the 50GB filesystem. It's not obvious where the space has gone given this output - the container disk usage is very small, so nowhere near using 50GB:

# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          4         4         3.112GB   1.092GB (35%)
Containers      4         4         97.62MB   0B (0%)
Local Volumes   6         1         0B        0B
Build Cache     25        0         0B        0B

I'd quite like to try a complete shutdown of the docker service if we can schedule a time to do that, to see if that frees up the "lost" space. Let me know when would be appropriate.
Also noting that the current TRSS container images were generated two weeks ago, which may be consistent with when we started experiencing issues.
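
For reference, a rough sketch of how I'd chase down both the CPU usage and the missing space (assumes root on the docker host; paths and output will vary):

# Where under /var/lib/docker has the space gone?
du -h --max-depth=2 /var/lib/docker | sort -h | tail -n 15

# Which container is chewing the CPU?
docker stats --no-stream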

@sxa
Member

sxa commented Jan 29, 2024

Issued npm run docker-down but got the following message:

! Network aqa-test-tools_default  Resource is still in use

Followed that up with an explicit docker stop of the containers, but that did not result in any additional space being freed up in /var/lib/docker.
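
For reference, the sort of manual cleanup that can be tried when compose reports the network as still in use (the network name is the one from the error above; the container IDs are placeholders for whatever docker ps reports):

# See which containers are still attached to the network
docker network inspect aqa-test-tools_default

# Stop and remove whatever is left, then remove the network itself
docker ps -a
docker stop <container_id>
docker rm <container_id>
docker network rm aqa-test-tools_default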

@sxa
Member

sxa commented Jan 29, 2024

OK ... so it looks like the mongodb container has a log file on the host (outside the container, under `/var/lib/docker/containers/<mongo_container_uuid>`). This was 47GB and was filling up the docker filesystem.

Unclear if it's directly related, but I've gzipped it, shut down the server and restarted it. The log file is still increasing in size, so hopefully we'll be able to get some attention on this from the TRSS developers to understand the problem before it causes trouble again. I have pinged @llxia for advice in the AQA slack channel. I have started the processes under nohup, and the output from npm run docker is also producing a lot of similar debug output (including what looks like the full parameters from the jenkins jobs it's monitoring), so we may have significant space use in two locations.
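
Not something we have applied yet, but since docker's default json-file logging driver grows without bound, one possible mitigation (container id and rotation sizes below are illustrative only) would be:

# One-off: reclaim the space without deleting the still-open log file
truncate -s 0 /var/lib/docker/containers/<mongo_container_uuid>/<mongo_container_uuid>-json.log

# Longer term: cap the json-file driver when the container is (re)created,
# e.g. the equivalent of these docker run options in the compose setup
docker run --log-opt max-size=100m --log-opt max-file=3 ... mongo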

@smlambert Can you confirm if the performance issues you have described are now resolved by restarting the docker subsystem?

@sxa sxa self-assigned this Jan 29, 2024
@smlambert
Contributor Author

We should look at what level the DB profiler is running at. The MongoDB documentation says the profiler is off by default, but we should verify that, and ensure the server is started with it off or at level 1 with a filter so that it is only moderately active.
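
A quick way to check this (a sketch, assuming the mongo container is named mongodb and ships mongosh - older images may only have the legacy mongo shell - and that the commands are run against the relevant TRSS database):

# Check the current profiler level and slow-query threshold
docker exec -it mongodb mongosh --eval 'db.getProfilingStatus()'

# Turn the profiler off, or run it at level 1 with a high slowms threshold
docker exec -it mongodb mongosh --eval 'db.setProfilingLevel(0)'
docker exec -it mongodb mongosh --eval 'db.setProfilingLevel(1, { slowms: 200 })'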

@sxa
Member

sxa commented Jan 31, 2024

@Haroon-Khel Can I ask you to take a look at this please, since you've done more work on the setup and configuration of this server under #trss?

@Haroon-Khel
Contributor

Seeing as we are nearing the release, I propose we increase the number of cores on the machine to see if this helps. It's not a definitive solution, but it is one that may improve performance and thereby help with triage, etc.

@sxa
Member

sxa commented Apr 11, 2024

I propose we should increase the number of cores on the machine to see if this helps

Are we currently seeing high CPU load on the server that would be alleviated by this? The throttling I put in place at the nginx level should have eliminated some of the problems with the client requests. Fixes have gone into TRSS to resolve that now, although they have not been deployed on our server yet.
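
(As a rough check that the nginx throttling is actually biting, a loop like the one below can be run against the server; the URL is only a placeholder for whichever endpoint the dashboard hits, and throttled requests should come back as 429 or 503 depending on how limit_req_status is configured.)

# Fire a burst of requests and watch the status codes
for i in $(seq 1 50); do curl -s -o /dev/null -w "%{http_code}\n" https://trss.example.org/api/<endpoint>; done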

Ref:

@smlambert
Contributor Author

smlambert commented Apr 11, 2024

although they have not been deployed on our server yet.

They have been deployed on our server by my manually running the sync job on the server (related: adoptium/aqa-test-tools#856 (comment)). Noting that the rate limiting could/should now be adjusted.

@smlambert
Contributor Author

Performance has improved via the several fixes linked in the issues above. The last thing to do is to get the sync script running regularly, which is tracked under adoptium/aqa-test-tools#856.
