New Machine requirement: TRSS replacement #3116

sxa · 2023-06-26T14:34:43Z

I need to request a new machine:

New machine operating system (e.g. linux/windows/macos/solaris/aix): Linux
New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): Probably x64
Provider (leave blank if it does not matter):
Desired usage: Replacement for the TRSS server
Any unusual specification/setup required: New system will be running under docker, so will need docker to be available.
How many of them are required: 1

Please explain what this machine is needed for: Replacement for the existing TRSS server, which is running on AWS. AWS is not a sponsored provider, so that server should be decommissioned, as most of the other machines there already have been.

sxa · 2023-06-26T14:48:42Z

We have an existing playbook for setting up a TRSS server, but the current recommended way to set up TRSS is using docker compose. Some references:

sxa · 2023-08-08T14:11:37Z

Created at ~~172.187.145.103~~ [CHANGED - SEE LATER COMMENT] - infrastructure team's keys added

sxa · 2023-08-08T14:29:17Z

@llxia @smlambert Should the second link with the docker-compose docs work straight out of the box with the current aqa-test-tools repo? The package.json has references to docker compose instead of docker-compose, and if I change those to docker-compose it gets a bit further, then hits this problem:

Removing intermediate container 125d62fd987c
 ---> dc4be6de7db7
Step 4/13 : COPY package.json package-lock.json .
When using COPY with more than one source file, the destination must be a directory and end with a /
ERROR: Service 'client' failed to build : Build failed
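
For reference, the error above comes from the builder's rule that a COPY with more than one source must name a directory destination ending in `/`, so the usual fix is to write the destination as `./`. The following is a minimal Python sketch of that rule, not the change that was made in the repo. The helper name `fix_copy_destination` is hypothetical, and it only handles the plain shell form of COPY (no JSON-array form, no wildcard handling).

```python
def fix_copy_destination(line: str) -> str:
    """Append '/' to the destination of a multi-source COPY line.

    Illustrative only: handles the plain shell form, e.g.
    'COPY a b dest'; lines in JSON-array form are returned unchanged.
    """
    parts = line.split()
    if not parts or parts[0].upper() != "COPY" or "[" in line:
        return line
    # Ignore options such as --chown=... when counting sources.
    args = [p for p in parts[1:] if not p.startswith("--")]
    if len(args) < 3:  # one source plus a destination is always allowed
        return line
    dest = args[-1]
    if dest.endswith("/"):
        return line
    # Rebuild the line with the corrected destination (the last token).
    return " ".join(parts[:-1] + [dest + "/"])


print(fix_copy_destination("COPY package.json package-lock.json ."))
```

Applied to Step 4/13 above, this turns the destination `.` into `./`, which is the form the error message asks for.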

sxa · 2023-08-09T11:56:00Z

Also from taking a look at the video, is it storing the database in the directory where you run npm docker run as opposed to in the docker container itself (I need to know for setting up the file systems appropriately)?
The docs for docker-compose also says "Using Docker is a good way to test and development locally." Is it definitely also suitable and ready for production use this way?

llxia · 2023-08-16T02:55:26Z

docker compose should be used.
docker-compose, aka v1, has now been deprecated and development has moved to v2. Please see
https://stackoverflow.com/questions/66514436/difference-between-docker-compose-and-docker-compose
https://docs.docker.com/compose/#compose-v2-and-the-new-docker-compose-command
https://docs.docker.com/compose/migrate/#what-are-the-differences-between-compose-v1-and-compose-v2

Haroon-Khel · 2023-08-22T13:39:03Z

The new machine has 100g in /var/lib/docker. However, because of https://github.com/adoptium/aqa-test-tools/blob/6ffa683e929c20b6eb31eb2ecaf473d2e977544e/docker-compose.yml#L9C30-L9C30, mongo will store everything in a docker-host mounted volume in the top-level aqa-test-tools directory, because that's where the docker-compose file is. Right now on the new machine I have the service running from /home/jenkins/aqa-test-tools, which means mongo is storing everything in a docker-host mounted volume in /home/jenkins/aqa-test-tools, i.e. not in /var/lib/docker where all the space is.

I think a solution is to add a disk and mount it to a separate directory, like /data/, and launch docker compose from there
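
To illustrate why the launch directory matters: Compose resolves a relative host path in a `volumes:` entry against the project directory, which by default is the directory holding the compose file. A rough Python sketch of that resolution follows. The mount string `./data/db` and the `/data/aqa-test-tools` checkout path are made-up examples, not the actual line from docker-compose.yml, and the helper name is hypothetical.

```python
from pathlib import PurePosixPath


def host_path_for_bind_mount(compose_file: str, host_part: str) -> str:
    """Resolve the host side of a bind mount against the directory that
    holds the compose file. An absolute host path is returned as-is."""
    return str(PurePosixPath(compose_file).parent / host_part)


# Hypothetical mount './data/db' with the checkout in two different places.
for compose_file in (
    "/home/jenkins/aqa-test-tools/docker-compose.yml",
    "/data/aqa-test-tools/docker-compose.yml",
):
    print(host_path_for_bind_mount(compose_file, "./data/db"))
```

With the checkout on the data disk, the mongo files follow it there without any change to the compose file. This sketch ignores options such as `--project-directory`, which would change the base directory.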

Haroon-Khel · 2023-08-23T14:08:39Z

What I've done so far:

TRSS is running on http://172.187.145.103/
It is running as a non-root user
Port 80 is open on the machine via the Azure console
The mongo data from the current machine has been imported into the new machine

What needs to be done:

The back end may not be 'pointing' at our Adoptium Jenkins. I copied the credentials file from the current machine into a trssConf.json file on the new machine. The logs indicate that the credentials file is being used, yet the new TRSS instance did not initially look like it was getting any new test data from the Adoptium Jenkins server. On second look I think it may be pointing at the Adoptium Jenkins instance after all: some of the builds have a 23/08/2023 timestamp (I kicked off the service yesterday, 22/08).
Certificate files need to be copied from the current machine
The nginx conf files on the new machine are the bare minimum. I suggest the conf files from the current machine are copied over after the certificates get copied over

sxa · 2023-08-24T09:57:53Z

> I think a solution is to add a disk and mount it to a separate directory, like /data/, and launch docker compose from there

In that case I recommend that we redo the disk that has /var/lib/docker on it, creating a smaller file system for that and a larger one for the data - I'd probably prefer the latter to be mounted on /home or /home/jenkins or another path underneath those. It seems simpler not to have a completely new separate directory off / for this, and it's always good to have /home separate to avoid it filling up the root file system.

Haroon-Khel · 2023-08-24T16:03:24Z

I'll add the disk to /home/jenkins/trss/ and launch the service from there.

Regarding the import of the mongo data, I used exportMongo.sh on the existing machine. However, since mongo now runs in a docker container, the restore command (mentioned in that script) needs to be run in the container, and the archive needs to be copied into the running container first.
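
The two steps described above can be sketched in Python roughly as follows. This is an assumption-laden outline rather than the commands that were actually run: the container name `mongo`, the archive file name and the in-container path are placeholders, and the `--archive`/`--gzip` flags assume the dump was produced by `mongodump` with the matching options.

```python
import subprocess


def restore_commands(container: str, archive: str,
                     dest: str = "/tmp/trss-dump.gz") -> list:
    """Return the two argv lists needed to restore a dump in a container."""
    return [
        # Copy the archive from the host into the running container.
        ["docker", "cp", archive, f"{container}:{dest}"],
        # Run mongorestore inside the container against the copied archive.
        ["docker", "exec", container, "mongorestore",
         f"--archive={dest}", "--gzip"],
    ]


def restore(container: str, archive: str) -> None:
    """Run both steps, stopping at the first one that fails."""
    for argv in restore_commands(container, archive):
        subprocess.run(argv, check=True)


# Print the commands instead of running them (names are placeholders).
for argv in restore_commands("mongo", "trss-export.gz"):
    print(" ".join(argv))
```

Keeping the command construction separate from execution makes it easy to review what will be run before touching the live container.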

See https://adoptium.slack.com/archives/C5219G28G/p1692865247390769?thread_ts=1692109686.752019&cid=C5219G28G: at the moment the machine is extremely slow when it comes to loading data. I'll continue to monitor the performance after I've added the new disk.

Haroon-Khel · 2023-09-01T15:58:45Z

Update

The machine's IP has changed to 20.90.182.165. The TRSS service is running on port 80 on the machine. The data disk is 128g, with 30g for /var/lib/docker, 60g for /home/jenkins/trss (which is where mongo will be storing its data) and 4g swap.

More can be added if need be. At the moment mongo (and the other TRSS services) are taking up 4.9G (that's including the import of the data from the current TRSS machine).
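
As a quick check of the headroom, the arithmetic on the figures above can be written out as below. The assumption that all 4.9G sits on the /home/jenkins/trss file system is mine; some of it may be images under /var/lib/docker, in which case the percentage is an upper bound.

```python
# Sizes in GB, taken from the comment above.
disk_total = 128
allocated = {"/var/lib/docker": 30, "/home/jenkins/trss": 60, "swap": 4}

# Space on the data disk not yet assigned to any file system.
unallocated = disk_total - sum(allocated.values())

# Share of the TRSS file system used, assuming all 4.9G lives there.
mongo_used = 4.9
used_pct = round(100 * mongo_used / allocated["/home/jenkins/trss"], 1)

print(f"unallocated: {unallocated}G, trss file system in use: {used_pct}%")
```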

sxa · 2023-09-19T14:44:58Z

MongoDB was chewing up two full cores of CPU continuously which the original server was not doing. It appears ok today though - has a configuration change occurred to resolve that problem @Haroon-Khel ?

Haroon-Khel · 2023-09-19T15:41:57Z

I checked too. It still eats up the CPU when you go to http://20.90.182.165/. By this I mean mongod uses 0.7% when idle, and goes up to 200% when the service is accessed. Funnily enough I did the same thing for the current TRSS service and found it behaved the same way, i.e. on the current TRSS machine mongod eats up to 400% of the CPU, but only when accessed, and stays at around 0.3% when idle.

  PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
 9630 mongodb   20   0 10.439g 8.002g  6932 S 355.1 53.0 139390:51 mongod

Managed to capture the top output above after refreshing https://trss.adoptium.net/ (the current server). So it's not specific to the new TRSS instance.
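
For anyone reading the percentages above: in top's default (Irix) mode on Linux, 100% corresponds to one fully used core, so values above 100% mean several cores are busy. A small Python sketch of the conversion, assuming top's default column order; the helper name is hypothetical.

```python
def cores_in_use(top_line: str) -> float:
    """Convert the %CPU column of a top process line to a core count.

    Assumes the default column order (PID USER PR NI VIRT RES SHR S
    %CPU %MEM TIME+ COMMAND) and Irix mode, where 100% is one core.
    """
    fields = top_line.split()
    return float(fields[8]) / 100


# The mongod line captured on the current TRSS machine.
line = "9630 mongodb 20 0 10.439g 8.002g 6932 S 355.1 53.0 139390:51 mongod"
print(round(cores_in_use(line), 2))
```

On that reading, the captured line shows mongod using roughly three and a half cores at the moment of the refresh.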

Haroon-Khel · 2023-09-19T15:48:30Z

Using htop, I'm seeing mongodb hit up to 500% CPU usage on the current TRSS machine.

sxa · 2023-09-19T19:06:29Z

Interesting - I'm no longer seeing it getting stuck at high CPU load although it does sit at 200% when loading the initial TRSS homepage - I wonder if we could optimise some of the queries there or improve the indexing?

@smlambert Can you see if this new TRSS instance is working adequately for the JDK21 triage? Would be good to give it a bit of a workout before switching it to be the primary one.

smlambert · 2023-09-19T21:42:26Z

Something is not quite right with it, as I am struggling to add new pipelines to monitor, trying to add https://ci.adoptium.net/job/build-scripts/job/release-openjdk21-pipeline/ and the service seems to 'go away' (would be good to see the logs to see what is going on). I can give details when I am back from 3 day PTO and can share a screen. Let's keep the old one up and running.

sxa · 2023-09-28T09:34:01Z

Haroon has indicated that he's been able to replicate the problem that Shelley mentioned in the previous comment.

Given that we are having issues here with the new instance and are unable to switch over the production server we should perhaps consider creating a direct replica of the existing server for now and see if that behaves as expected and look at fixing underlying issues with the new deployment asynchronously to avoid delaying the switchover.

How long would it take to see if that works?

Haroon-Khel · 2023-09-28T10:37:34Z

Yes I have seen the data just disappear on occasion. Annoyingly there is no log file in the normal mongo log file location

root@5a6e8cab57d6:/# ls -la /var/log/mongodb/
total 12
drwxr-xr-x 1 mongodb mongodb 4096 Sep  2  2022 .
drwxr-xr-x 1 root    root    4096 Sep  2  2022 ..

Haroon-Khel · 2023-09-28T10:39:12Z

> Given that we are having issues here with the new instance and are unable to switch over the production server we should perhaps consider creating a direct replica of the existing server for now and see if that behaves as expected and look at fixing underlying issues with the new deployment asynchronously to avoid delaying the switchover.

I could increase the new trss server's cpu and ram to make it similar to the current server and then observe any differences. That would be quicker than creating a whole replica

sxa · 2023-09-28T10:56:51Z

> I could increase the new trss server's cpu and ram to make it similar to the current server and then observe any differences. That would be quicker than creating a whole replica

I'm sceptical as to whether that will make any difference but if you can quickly test it then feel free.

Haroon-Khel · 2023-09-28T11:50:17Z

I've resized the machine to 8 CPUs and 16g RAM. In htop I'm seeing CPU usage up to 500% when the TRSS service is accessed, and 0.7% when idle. So similar to the current TRSS machine. Data loads a lot faster than before.

@smlambert Could you try adding another pipeline again, to see if we hit the same error?

sxa · 2023-09-29T10:37:19Z

> Yes I have seen the data just disappear on occasion. Annoyingly there is no log file in the normal mongo log file location
>
> root@5a6e8cab57d6:/# ls -la /var/log/mongodb/
> total 12
> drwxr-xr-x 1 mongodb mongodb 4096 Sep  2  2022 .
> drwxr-xr-x 1 root    root    4096 Sep  2  2022 ..

Can we see if we can enable the logging? https://betterstack.com/community/questions/how-to-log-all-or-slow-mongodb-queries/ shows how to enable query logging, which might give us an idea of where the slowness is occurring. It also covers tracking slow-to-execute queries specifically, which might be useful. https://www.mongodb.com/docs/manual/reference/log-messages/#logging-slow-operations is the official documentation on logging slow queries.
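
One way to try the slow-operation logging described in those docs is MongoDB's `profile` command with level 0 and a `slowms` threshold, which leaves the profiler off but logs operations slower than the threshold. The Python sketch below assumes pymongo is available and that `db` is a pymongo `Database`; the 100 ms threshold, the connection URI and the database name in the comment are placeholders, and the helper name is hypothetical. As far as I know the setting applies to the running mongod only and does not survive a restart.

```python
def log_slow_operations(db, threshold_ms: int = 100):
    """Ask mongod to log operations slower than threshold_ms.

    `db` is expected to be a pymongo Database, for example
    MongoClient("mongodb://localhost:27017")["some_db"] (placeholder
    names). Level 0 keeps the profiler off, so slow operations go to
    the server log only, not to the system.profile collection.
    """
    return db.command("profile", 0, slowms=threshold_ms)
```

The return value is the server's reply to the command, which can be printed to confirm the call was accepted.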

Haroon-Khel · 2023-09-29T11:26:35Z

Looks like logging is already enabled in /etc/mongod.conf

# where to write logging data.
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log

Haroon-Khel · 2023-09-29T11:27:55Z

Running db.adminCommand( { getLog: "global" } ) in mongo shows me a somewhat readable log

Haroon-Khel · 2023-10-06T12:30:04Z

@smlambert The changes in https://github.com/adoptium/aqa-test-tools/pull/821/files have been applied to the new TRSS machine. Could you try again to add a new pipeline?

smlambert · 2023-10-06T15:11:39Z

I am now able to add new pipelines without issue, thanks @Haroon-Khel !

sxa · 2023-10-11T15:26:46Z

Let's see how it goes this week - from the comments being made it seems to be working well now (other than a problem when I configured the nginx frontend to rate limit more than it needed). We can then look at switching this to be the primary server next week. @Haroon-Khel Have you looked at copying the SSL certificate from the old machine across as a test to make sure we can enable it with that (It will show as an invalid certificate when you connect, but worth making sure the setup of nginx is correct)?

sxa · 2023-10-11T15:29:32Z

@Haroon-Khel It looks like the light rate limiting may have stopped the mongodb becoming overwhelmed with requests as it's not jumping up to 500% CPU in the way it was previously. Can you confirm this (i.e. it's not just me not looking at it properly!)? If that's the case we might be able to drop the server back down to 2 CPUs.

Haroon-Khel · 2023-10-11T15:36:19Z

> It looks like the light rate limiting may have stopped the mongodb becoming overwhelmed with requests as it's not jumping up to 500% CPU in the way it was previously.

Can confirm. At most it is hitting 100% CPU. If it stays like this we could certainly drop the CPUs down.

Haroon-Khel · 2023-10-11T15:43:09Z

> Have you looked at copying the SSL certificate from the old machine across as a test to make sure we can enable it with that (It will show as an invalid certificate when you connect, but worth making sure the setup of nginx is correct)

I did, but it is a bit more complicated than that. The certs on the existing TRSS server were set up using certbot, so I need to find out how to mimic that setup on the new server.

I did try copying the cert files (that the nginx config on the current machine points to) over to the new machine, along with the current nginx config, but I did not get anywhere. I was at least expecting http://20.90.182.165/ to have a faulty certs proceed with caution barrier but instead it just gave me nginx errors

sxa · 2023-10-11T15:51:00Z

> I was at least expecting http://20.90.182.165/ to have a faulty certs proceed with caution barrier but instead it just gave me nginx errors

Hmmm http would never give a faulty cert option (unless it redirected!) - you'd need to go to the HTTPS port for it to present the certificate. Do you know what the errors were? Copying the files across shouldn't have caused anything related to the HTTP connections to break.
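
To make the HTTP/HTTPS point concrete, here is a small Python sketch using only the standard library. `presents_certificate` simply checks the URL scheme, and `fetch_certificate_pem` retrieves whatever certificate the server offers on the HTTPS port without validating it, so an untrusted or mismatched certificate can still be inspected. Both helper names are hypothetical, and port 443 is an assumption about how nginx would be configured.

```python
import ssl
from urllib.parse import urlsplit


def presents_certificate(url: str) -> bool:
    """Only an https:// URL leads to a TLS handshake, and so to a
    certificate (and any browser warning about it)."""
    return urlsplit(url).scheme.lower() == "https"


def fetch_certificate_pem(host: str, port: int = 443) -> str:
    """Fetch the server certificate in PEM form without validating it."""
    return ssl.get_server_certificate((host, port))


print(presents_certificate("http://20.90.182.165/"))
print(presents_certificate("https://20.90.182.165/"))
```

Calling `fetch_certificate_pem("20.90.182.165")` from a machine that can reach the server would show whether nginx is serving the copied certificate at all; it raises an error if nothing is listening on the port.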

If there's no obvious reason, I suggest we take a look at this and the switch of CPUs tomorrow morning, as I doubt too many people will be using the server until the afternoon.

sxa · 2023-11-10T16:40:43Z

DNS entry has been updated so should propagate everywhere soon https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/3904

sxa · 2023-12-18T12:30:43Z

LetsEncrypt certbot is now set up and will auto-renew via a systemd timer

sxa · 2024-01-03T16:32:37Z

Old machine in AWS has now been decommissioned so will not be incurring future charges - closing.

sxa added the Machine Request label Jun 26, 2023

Haroon-Khel self-assigned this Aug 22, 2023

sxa pinned this issue Sep 20, 2023

sxa added this to the 2023-09 (September) milestone Sep 22, 2023

sxa unpinned this issue Oct 12, 2023

sxa modified the milestones: 2023-09 (September), 2023-11 (November) Nov 1, 2023

sxa modified the milestones: 2023-11 (November), 2023-12 (December) Jan 3, 2024

sxa closed this as completed Jan 3, 2024

sxa mentioned this issue Jan 31, 2024

Investigate performance of TRSS #3354

Closed
