
Nightly publish job with slack notifications no longer runs #258

Closed
jerboaa opened this issue Mar 2, 2022 · 22 comments · Fixed by #259
@jerboaa
Contributor

jerboaa commented Mar 2, 2022

We used to have a job that posted Slack messages to the build channel, detailing whether or not the latest nightly builds got published to GitHub. This no longer works. I'm guessing it's due to adoptium/temurin-build#2671.

Expected behaviour:
Publish stats posted to build slack channel.

Observed behaviour:
No publish stats.

Any other comments:
https://github.com/adoptium/ci-jenkins-pipelines/blob/master/tools/nightly_build_and_test_stats.groovy#L35 and the like would probably need to get updated. It used to have VARIANT=temurin; now it's VARIANT=hotspot.
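A minimal sketch of the kind of default-parameter update being suggested here; the parameter block below is hypothetical (the real code lives in tools/nightly_build_and_test_stats.groovy, and the actual fix is PR #259):

```groovy
// Hypothetical sketch of the job's default-parameter update, not the actual file contents.
parameters {
    // The build variant was renamed from 'hotspot' to 'temurin'
    // (adoptium/temurin-build#2671), so the job's default must follow.
    string(name: 'VARIANT', defaultValue: 'temurin',
           description: 'Build variant whose publish stats are reported')
}
```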

@jerboaa
Contributor Author

jerboaa commented Mar 2, 2022

Last successful run:
https://ci.adoptopenjdk.net/view/Tooling/job/nightlyBuildAndTestStats_hotspot/569

It has this in the console output:

-------------- Nightly pipeline health report ------------------
[Pipeline] echo
===> JDK 8 nightly pipeline publish status: healthy. Last published: 2 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 11 nightly pipeline publish status: healthy. Last published: 3 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 17 nightly pipeline publish status: healthy. Last published: 3 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
----------------------------------------------------------------

Most recent job doesn't:
https://ci.adoptopenjdk.net/view/Tooling/job/nightlyBuildAndTestStats_hotspot/587/

@sxa
Member

sxa commented Mar 2, 2022

Hmmm, switching it back to hotspot doesn't quite have the desired effect: we are not building a variant of "hotspot" any more, so wherever it's looking picks up the last builds from over two weeks ago. We need to understand how to make it pick up the temurin data. I suspect we want to create a nightlyBuildAndTestStats_temurin variant which picks up the right thing, although if it's querying the information from TRSS then it might need some change to the logic there too?

I've left the job running from my branch for now until we resolve this.

@sxa
Member

sxa commented Mar 2, 2022

We resolved a problem with disk space on the TRSS server, although if it's still not showing the correct results today, it's going to need some further investigation to understand whether the problem is TRSS not having the data available or the nightlyBuildAndTestStats_* jobs not querying it properly for the temurin case.
FYI @smlambert @llxia @sophia-guo in case you have any insight/input today in @andrew-m-leonard 's absence.

@sophia-guo
Contributor

sophia-guo commented Mar 2, 2022

Also update the job configuration with default VARIANT=temurin.

https://adoptium.slack.com/archives/C09NW3L2J/p1646240602313249

@sxa
Member

sxa commented Mar 2, 2022

@sophia-guo If I set the job to run from your variant1 branch and set a VARIANT of temurin in the parameters, it is still not retrieving the information as far as I can see. Is there still an issue somewhere with TRSS that needs to be addressed?

@sophia-guo
Contributor

The PR from the variant1 branch addresses the missing Slack notifications issue.

No information being retrieved from TRSS is a different issue. There has been no build information since Feb 16 (the disk issue you mentioned before?).

@sophia-guo
Contributor

Looks like it will only report the most recent 7 days' info (#230), so no report data is expected.

@sophia-guo
Contributor

Updated PR #259. It should now be fine: https://ci.adoptopenjdk.net/view/Tooling/job/nightlyBuildAndTestStats_hotspot/597/ with output:

[Pipeline] echo
-------------- Nightly pipeline health report ------------------
[Pipeline] echo
===> JDK 8 nightly pipeline publish status: healthy. Last published: 15 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 11 nightly pipeline publish status: healthy. Last published: 19 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 17 nightly pipeline publish status: healthy. Last published: 19 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>

@sxa
Member

sxa commented Mar 2, 2022

Looks like it will only report the most recent 7 days' info (#230), so no report data is expected.
There has been no build information since Feb 16 (the disk issue you mentioned before?).

That was resolved yesterday, so I would expect it to show data from last night's pipelines unless there's another issue stopping it from being picked up. We had JDK 11 builds last night that were successfully published as nightlies, so there's still an issue somewhere if they are not being picked up.

===> JDK 11 nightly pipeline publish status: healthy. Last published: 19 day(s) ago

That suggests it's still not picking up the data from the temurin pipelines correctly yet, and is getting the last published one from before the change.

@jerboaa jerboaa assigned sophia-guo and unassigned jerboaa Mar 2, 2022
@jerboaa
Contributor Author

jerboaa commented Mar 2, 2022

Thanks for the fix folks!

@jerboaa
Contributor Author

jerboaa commented Mar 2, 2022

Looks like it will only report the most recent 7 days' info (#230), so no report data is expected.
There has been no build information since Feb 16 (the disk issue you mentioned before?).

That was resolved yesterday, so I would expect it to show data from last night's pipelines unless there's another issue stopping it from being picked up. We had JDK 11 builds last night that were successfully published as nightlies, so there's still an issue somewhere if they are not being picked up.

===> JDK 11 nightly pipeline publish status: healthy. Last published: 19 day(s) ago

That suggests it's still not picking up the data from the temurin pipelines correctly yet, and is getting the last published one from before the change.

I concur. There is still something wrong. Also, why is it reported as healthy when the last publish was more than 15 days ago and the stale threshold is 4 days?
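For context, the health check presumably compares the days since the last publish against a stale threshold. A minimal Groovy sketch of that logic follows, with hypothetical variable names; this is not the pipeline's actual code:

```groovy
// Hypothetical sketch, assuming the report derives health from a day count.
def staleThresholdDays = 4
def daysSinceLastPublish = 15  // e.g. parsed from the TRSS query result

def healthy = daysSinceLastPublish <= staleThresholdDays
def status = healthy ? 'healthy' : 'unhealthy'
echo "===> JDK 8 nightly pipeline publish status: ${status}. " +
     "Last published: ${daysSinceLastPublish} day(s) ago. " +
     "Stale threshold: ${staleThresholdDays} days."
// 'good' renders green in Slack, 'warning' renders yellow
slackSend(channel: '#build', color: healthy ? 'good' : 'warning', message: status)
```

The bug described above would be consistent with the day count never reaching the comparison, e.g. the status being hard-coded or the threshold not being applied.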

@sophia-guo
Contributor

The new PR output is

[Pipeline] echo
-------------- Nightly pipeline health report ------------------
[Pipeline] echo
===> JDK 8 nightly pipeline publish status: unhealthy. Last published: 15 day(s) ago. Stale threshold: 4 days.
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: warning, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 11 nightly pipeline publish status: unhealthy. Last published: 19 day(s) ago. Stale threshold: 4 days.
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: warning, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 17 nightly pipeline publish status: unhealthy. Last published: 19 day(s) ago. Stale threshold: 4 days.
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: warning, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo

@sophia-guo
Contributor

sophia-guo commented Mar 2, 2022

The next step would be to check why there is no new data in TRSS. Should we restart TRSS? @sxa

@sophia-guo
Contributor

A no-data issue has been opened: adoptium/aqa-test-tools#623

@sophia-guo
Contributor

We resolved a problem with disk space on the TRSS server, although if it's still not showing the correct results today, it's going to need some further investigation to understand whether the problem is TRSS not having the data available or the nightlyBuildAndTestStats_* jobs not querying it properly for the temurin case

I believe that after the space cleanup TRSS needs to be restarted @sxa

@sxa
Member

sxa commented Mar 3, 2022

I believe that after the space cleanup TRSS needs to be restarted @sxa

I suggest @llxia or @Haroon-Khel take that action, as they have more experience with the server if it doesn't come back. Is the belief that a restart is needed a guess, or is there something specific about TRSS that you think means it requires a restart? The concern is obviously that with an out-of-space issue there may be files that have been truncated. It's also not currently clear, at least to me, why an out-of-space issue outside the /data file system would have caused the instability we are experiencing.

@sophia-guo
Contributor

Not sure what you mean by 'truncate'. Is it the issue mentioned in #254 (comment)? If so, maybe reopen #254.

@llxia
Contributor

llxia commented Mar 3, 2022

I would also suggest restarting the TRSS services.
It seems the method of running TRSS has changed. I cannot find the TRSS log anymore, and I do not know how to restart it.

And I think we may have a disk space issue soon:

Filesystem      Size  Used Avail Use% Mounted on
udev            7.6G  302M  7.3G   4% /dev
tmpfs           1.6G  816K  1.6G   1% /run
/dev/nvme1n1p1  7.7G  7.6G  148M  99% /
tmpfs           7.6G  828K  7.6G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.6G     0  7.6G   0% /sys/fs/cgroup
/dev/nvme0n1p2  167G  5.5G  153G   4% /data
/dev/loop6       34M   34M     0 100% /snap/amazon-ssm-agent/3552
/dev/loop4       25M   25M     0 100% /snap/amazon-ssm-agent/4046
/dev/loop1      100M  100M     0 100% /snap/core/11993
/dev/loop3       56M   56M     0 100% /snap/core18/2253
/dev/loop5       56M   56M     0 100% /snap/core18/2284
/dev/loop2      111M  111M     0 100% /snap/core/12603
tmpfs           1.6G     0  1.6G   0% /run/user/0
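Given that the root filesystem above is at 99%, a lightweight guard could warn before it fills completely. The step below is a hypothetical sketch (the channel, threshold, and wording are illustrative, not anything the project actually runs):

```groovy
// Hypothetical monitoring step: warn in Slack when root filesystem
// usage crosses a threshold, before writes start failing.
def usedPct = sh(script: "df --output=pcent / | tail -1 | tr -dc '0-9'",
                 returnStdout: true).trim() as Integer
if (usedPct >= 90) {
    slackSend(channel: '#build', color: 'warning',
              message: "Root filesystem on the TRSS host is ${usedPct}% full")
}
```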

@smlambert
Contributor

smlambert commented Mar 4, 2022

Reopening this. If the method for running TRSS has changed, it would be a change driven by the infra team, so needs infra input/assistance.

(and I must have inadvertently closed this issue merging some script change)

@smlambert smlambert reopened this Mar 4, 2022
@sxa
Member

sxa commented Mar 4, 2022

Not sure what do you mean by 'truncate' , is it the issue mentioned in #254 (comment)? If that is the case maybe reopen #254

I am NOT referencing any specific issue. I am saying, as a system administrator, that if the disk space fills up on any machine and an application runs on that file system, there is a risk that files are not correctly written and become corrupted or truncated due to the out-of-space condition. To understand how likely that is, we'd need to know whether TRSS writes anything to the file system other than in the /data directory (since that one was safe and did not fill up). I'm calling it out as a risk that there may be files in that state that will prevent restarts. It may be a non-issue. We could just restart it and see if it comes back, but there is a risk of such a condition, and it would need careful watching while it comes back up in order to minimise the risk of downtime.

If the method for running TRSS has changed, it would be a change driven by the infra team, so needs infra input/assistance.

NOTHING HAS CHANGED on the production server so far.

@smlambert
Contributor

We are in a broken state at the moment. I am presuming the only option is to restart it; someone please correct me if that assumption is incorrect. I am not sure why the all-caps shouting. From my read of what Lan shared in a comment above, there have been cores written to disk, so that is something we know is being written to disk.

/data is not the issue. The request during the community scrum to increase the amount of space (currently 8G?) where the server is running was denied (I can't remember the reason; because it is hard to do? because of Yeah Nah?).

I would like everyone to keep their eye on the end goal here and work towards a solution that can work for the project. I would like to be able to use TRSS for next week's project triage, and in its current state that will not be possible.

@sxa sxa pinned this issue Mar 7, 2022
@sxa
Member

sxa commented May 17, 2022

The publish job has now been publishing for a while; closing.

@sxa sxa closed this as completed May 17, 2022
@andrew-m-leonard andrew-m-leonard unpinned this issue May 18, 2022