
Nightly publish job with slack notifications no longer runs #258

Closed
jerboaa opened this issue Mar 2, 2022 · 22 comments · Fixed by #259
@jerboaa
Contributor

jerboaa commented Mar 2, 2022

We used to have a job that posted Slack messages to the build channel, detailing whether or not the latest nightly builds got published to GitHub. This no longer works. I'm guessing it's due to adoptium/temurin-build#2671.

Expected behaviour:
Publish stats posted to build slack channel.

Observed behaviour:
No publish stats.

Any other comments:
https://github.com/adoptium/ci-jenkins-pipelines/blob/master/tools/nightly_build_and_test_stats.groovy#L35 and the like would probably need to get updated. It used to have VARIANT=temurin; now it's VARIANT=hotspot.
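A minimal sketch of the kind of default-parameter update being suggested here; the parameter block below is hypothetical (the real code lives in tools/nightly_build_and_test_stats.groovy, and the actual fix is PR #259):

```groovy
// Hypothetical sketch of the job's default-parameter update, not the actual file contents.
parameters {
    // The build variant was renamed from 'hotspot' to 'temurin'
    // (adoptium/temurin-build#2671), so the job's default must follow.
    string(name: 'VARIANT', defaultValue: 'temurin',
           description: 'Build variant whose publish stats are reported')
}
```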

@jerboaa
Contributor Author

jerboaa commented Mar 2, 2022

Last successful run:
https://ci.adoptopenjdk.net/view/Tooling/job/nightlyBuildAndTestStats_hotspot/569

It has this in the console output:

-------------- Nightly pipeline health report ------------------
[Pipeline] echo
===> JDK 8 nightly pipeline publish status: healthy. Last published: 2 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 11 nightly pipeline publish status: healthy. Last published: 3 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 17 nightly pipeline publish status: healthy. Last published: 3 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
----------------------------------------------------------------

Most recent job doesn't:
https://ci.adoptopenjdk.net/view/Tooling/job/nightlyBuildAndTestStats_hotspot/587/

@sxa
Member

sxa commented Mar 2, 2022

Hmmm, switching it back to hotspot doesn't quite have the desired effect: we are not building a variant of "hotspot" any more, so wherever it's looking picks up the last builds from over two weeks ago. We need to understand how to make it pick up the temurin data. I suspect we want to create a nightlyBuildAndTestStats_temurin variant which picks up the right thing, although if it's querying the information from TRSS then it might need some change to the logic there too?

I've left the job running from my branch for now until we resolve this.

@sxa
Member

sxa commented Mar 2, 2022

We resolved a problem with disk space on the TRSS server, although if it's still not showing the correct results today, it's going to need some further investigation to understand whether the problem is TRSS not having the data available or the nightlyBuildAndTestStats_* jobs not querying it properly for the temurin case.
FYI @smlambert @llxia @sophia-guo in case you have any insight/input today in @andrew-m-leonard 's absence.

@sophia-guo
Contributor

sophia-guo commented Mar 2, 2022

Also update the job configuration with default VARIANT=temurin.

https://adoptium.slack.com/archives/C09NW3L2J/p1646240602313249

@sxa
Member

sxa commented Mar 2, 2022

@sophia-guo If I set the job to run from your variant1 branch and set a VARIANT of temurin in the parameters, it is still not retrieving the information as far as I can see. Is there still an issue somewhere with TRSS that needs to be addressed?

@sophia-guo
Contributor

The PR from the variant1 branch addresses the missing Slack notifications issue.

No information being retrieved from TRSS is a different issue. There has been no build information since Feb 16 (the disk issue you mentioned before?).

@sophia-guo
Contributor

Looks like it will only report the most recent 7 days' info (#230), so no report data is expected.

@sophia-guo
Contributor

Updated PR #259. It should now be fine: https://ci.adoptopenjdk.net/view/Tooling/job/nightlyBuildAndTestStats_hotspot/597/ with output:

[Pipeline] echo
-------------- Nightly pipeline health report ------------------
[Pipeline] echo
===> JDK 8 nightly pipeline publish status: healthy. Last published: 15 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 11 nightly pipeline publish status: healthy. Last published: 19 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 17 nightly pipeline publish status: healthy. Last published: 19 day(s) ago
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: good, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>

@sxa
Member

sxa commented Mar 2, 2022

Looks like it will only report the most recent 7 days' info (#230), so no report data is expected.
There has been no build information since Feb 16 (the disk issue you mentioned before?).

That was resolved yesterday, so I would expect it to show data from last night's pipelines unless there's another issue stopping it from being picked up. We had JDK 11 builds last night that were successfully published as nightlies, so there's still an issue somewhere if they are not being picked up.

===> JDK 11 nightly pipeline publish status: healthy. Last published: 19 day(s) ago

That suggests it's still not picking up the data from the temurin pipelines correctly yet, and is getting the last published one from before the change.

@jerboaa jerboaa assigned sophia-guo and unassigned jerboaa Mar 2, 2022
@jerboaa
Contributor Author

jerboaa commented Mar 2, 2022

Thanks for the fix folks!

@jerboaa
Contributor Author

jerboaa commented Mar 2, 2022

Looks like it will only report the most recent 7 days' info (#230), so no report data is expected.
There has been no build information since Feb 16 (the disk issue you mentioned before?).

That was resolved yesterday, so I would expect it to show data from last night's pipelines unless there's another issue stopping it from being picked up. We had JDK 11 builds last night that were successfully published as nightlies, so there's still an issue somewhere if they are not being picked up.

===> JDK 11 nightly pipeline publish status: healthy. Last published: 19 day(s) ago

That suggests it's still not picking up the data from the temurin pipelines correctly yet, and is getting the last published one from before the change.

I concur. There is still something wrong. Also, why is it reported as healthy when the last publish was more than 15 days ago and the stale threshold is 4 days?
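For context, the health check presumably compares the days since the last publish against a stale threshold. A minimal Groovy sketch of that logic follows, with hypothetical variable names; this is not the pipeline's actual code:

```groovy
// Hypothetical sketch, assuming the report derives health from a day count.
def staleThresholdDays = 4
def daysSinceLastPublish = 15  // e.g. parsed from the TRSS query result

def healthy = daysSinceLastPublish <= staleThresholdDays
def status = healthy ? 'healthy' : 'unhealthy'
echo "===> JDK 8 nightly pipeline publish status: ${status}. " +
     "Last published: ${daysSinceLastPublish} day(s) ago. " +
     "Stale threshold: ${staleThresholdDays} days."
// 'good' renders green in Slack, 'warning' renders yellow
slackSend(channel: '#build', color: healthy ? 'good' : 'warning', message: status)
```

The bug described above would be consistent with the day count never reaching the comparison, e.g. the status being hard-coded or the threshold not being applied.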

@sophia-guo
Contributor

The new PR output is

[Pipeline] echo
-------------- Nightly pipeline health report ------------------
[Pipeline] echo
===> JDK 8 nightly pipeline publish status: unhealthy. Last published: 15 day(s) ago. Stale threshold: 4 days.
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: warning, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 11 nightly pipeline publish status: unhealthy. Last published: 19 day(s) ago. Stale threshold: 4 days.
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: warning, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo
===> JDK 17 nightly pipeline publish status: unhealthy. Last published: 19 day(s) ago. Stale threshold: 4 days.
[Pipeline] slackSend
Slack Send Pipeline step running, values are - baseUrl: <empty>, teamDomain: adoptium, channel: #build, color: warning, botUser: false, tokenCredentialId: slack-token, notifyCommitters: false, iconEmoji: <empty>, username: <empty>, timestamp: <empty>
[Pipeline] echo

@sophia-guo
Contributor

sophia-guo commented Mar 2, 2022

The next step would be to check why there is no new data in TRSS. Should we restart TRSS? @sxa

@sophia-guo
Contributor

A no-data issue has been opened: adoptium/aqa-test-tools#623

@sophia-guo
Contributor

We resolved a problem with disk space on the TRSS server, although if it's still not showing the correct results today, it's going to need some further investigation to understand whether the problem is TRSS not having the data available or the nightlyBuildAndTestStats_* jobs not querying it properly for the temurin case

I believe that after the space cleanup TRSS needs to be restarted @sxa

@sxa
Member

sxa commented Mar 3, 2022

I believe that after the space cleanup TRSS needs to be restarted @sxa

I suggest @llxia or @Haroon-Khel take that action, as they have more experience with the server if it doesn't come back. Is the belief that a restart is needed a guess, or is there something specific about TRSS that you think means it requires a restart? The concern is obviously that with an out-of-space issue there may be files that have been truncated. It's also not currently clear, at least to me, why an out-of-space issue outside the /data file system would have caused the instability we are experiencing.

@sophia-guo
Contributor

Not sure what you mean by 'truncate'. Is it the issue mentioned in #254 (comment)? If so, maybe reopen #254.

@llxia
Contributor

llxia commented Mar 3, 2022

I would also suggest restarting the TRSS services.
It seems the method of running TRSS has changed. I cannot find the TRSS log anymore, and I do not know how to restart it.

And I think we may have a disk space issue soon:

Filesystem      Size  Used Avail Use% Mounted on
udev            7.6G  302M  7.3G   4% /dev
tmpfs           1.6G  816K  1.6G   1% /run
/dev/nvme1n1p1  7.7G  7.6G  148M  99% /
tmpfs           7.6G  828K  7.6G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.6G     0  7.6G   0% /sys/fs/cgroup
/dev/nvme0n1p2  167G  5.5G  153G   4% /data
/dev/loop6       34M   34M     0 100% /snap/amazon-ssm-agent/3552
/dev/loop4       25M   25M     0 100% /snap/amazon-ssm-agent/4046
/dev/loop1      100M  100M     0 100% /snap/core/11993
/dev/loop3       56M   56M     0 100% /snap/core18/2253
/dev/loop5       56M   56M     0 100% /snap/core18/2284
/dev/loop2      111M  111M     0 100% /snap/core/12603
tmpfs           1.6G     0  1.6G   0% /run/user/0
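Given that the root filesystem above is at 99%, a lightweight guard could warn before it fills completely. The step below is a hypothetical sketch (the channel, threshold, and wording are illustrative, not anything the project actually runs):

```groovy
// Hypothetical monitoring step: warn in Slack when root filesystem
// usage crosses a threshold, before writes start failing.
def usedPct = sh(script: "df --output=pcent / | tail -1 | tr -dc '0-9'",
                 returnStdout: true).trim() as Integer
if (usedPct >= 90) {
    slackSend(channel: '#build', color: 'warning',
              message: "Root filesystem on the TRSS host is ${usedPct}% full")
}
```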

@smlambert
Contributor

smlambert commented Mar 4, 2022

Reopening this. If the method for running TRSS has changed, it would be a change driven by the infra team, so needs infra input/assistance.

(and I must have inadvertently closed this issue merging some script change)

@smlambert smlambert reopened this Mar 4, 2022
@sxa
Member

sxa commented Mar 4, 2022

Not sure what do you mean by 'truncate' , is it the issue mentioned in #254 (comment)? If that is the case maybe reopen #254

I am NOT referencing any specific issue. I am saying, as a system administrator, that if the disk space fills up on any machine and an application runs on that file system, there is a risk that files are not correctly written and become corrupted or truncated due to the out-of-space condition. To understand how likely that is, we'd need to know whether TRSS writes anything to the file system other than in the /data directory (since that one was safe and did not fill up). I'm calling it out as a risk that there may be files in that state that will prevent restarts. It may be a non-issue. We could just restart it and see if it comes back, but there is a risk of such a condition, and it would need careful watching while it comes back up in order to minimise the risk of downtime.

If the method for running TRSS has changed, it would be a change driven by the infra team, so needs infra input/assistance.

NOTHING HAS CHANGED on the production server so far.

@smlambert
Contributor

We are in a broken state at the moment. I am presuming the only option is to restart it; someone please correct me if that assumption is incorrect. I am not sure why the all-caps shouting. From my read of what Lan shared in a comment above, there have been cores written to disk, so that is something we know is being written to disk.

/data is not the issue. The request during the community scrum to increase the amount of space (currently 8G?) where the server is running was denied (I can't remember the reason; because it is hard to do? because of Yeah Nah?).

I would like everyone to keep their eye on the end goal here and work towards a solution that can work for the project. I would like to be able to use TRSS for next week's project triage, and in its current state that will not be possible.

@sxa sxa pinned this issue Mar 7, 2022
@sxa
Member

sxa commented May 17, 2022

The publish job has now been publishing for a while; closing.

@sxa sxa closed this as completed May 17, 2022
@andrew-m-leonard andrew-m-leonard unpinned this issue May 18, 2022