Nightly publish job with slack notifications no longer runs #258
Comments
Last successful run: It has this in the console output:
Most recent job doesn't:
Hmmm, switching it back to hotspot doesn't quite have the desired effect: we are no longer building a variant called "hotspot", so wherever it's looking, it picks up the last builds from over two weeks ago. We need to understand how to make it pick up the temurin data. I suspect we want to create a I've left the job running from my branch just now until we resolve this.
We resolved the disk space problem on the TRSS server. If it's still not showing the correct results today, it will need further investigation to understand whether TRSS does not have the data available or whether the nightlyBuildAndTestStats_* jobs are not querying it properly for the temurin case.
Also update the job configuration with default VARIANT=temurin. https://adoptium.slack.com/archives/C09NW3L2J/p1646240602313249
@sophia-guo If I set the job to run from your
The PR from the variant1 branch addresses the missing Slack notifications. The lack of information retrieved from TRSS is a different issue: there has been no build information since Feb 16 (the disk issue you mentioned before?).
Looks like it only reports the most recent 7 days of information (#230), so no report data is expected.
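To illustrate why a 7-day reporting window would come back empty here, a minimal sketch follows. The function name `builds_in_window`, the parameter names, and the example dates are assumptions made for illustration; this is not the logic in nightly_build_and_test_stats.groovy.

```python
from datetime import datetime, timedelta, timezone

def builds_in_window(build_times, now, window_days=7):
    """Keep only build timestamps that fall inside the reporting window."""
    cutoff = now - timedelta(days=window_days)
    return [t for t in build_times if t >= cutoff]

# Hypothetical dates: last recorded builds in mid-February, report run on Mar 2.
now = datetime(2022, 3, 2, tzinfo=timezone.utc)
recorded = [
    datetime(2022, 2, 15, tzinfo=timezone.utc),
    datetime(2022, 2, 16, tzinfo=timezone.utc),
]
print(builds_in_window(recorded, now))  # prints []
```

With the newest recorded build older than the window, the filter returns nothing, which would match a report with no data.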
Updated PR #259. It should be fine now: https://ci.adoptopenjdk.net/view/Tooling/job/nightlyBuildAndTestStats_hotspot/597/ with output
That was resolved yesterday so I would expect it to show data from last night's pipelines unless there's another issue stopping it from being picked up - we had JDK11 builds last night that were successfully published as nightlies so there's still an issue somewhere if they are not being picked up.
That suggests it's still not picking up the data from the temurin pipelines correctly and is getting the last published one from before the change.
Thanks for the fix folks!
I concur. There is still something wrong. Also, why is it reporting it as healthy when the last publish was > 15 days ago and the stale threshold is 4 days?
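For reference, the behaviour expected from a stale threshold can be sketched as below. This is only an illustration under assumptions: the name `publish_status`, the status strings, and the example dates are hypothetical and are not taken from the job's script.

```python
from datetime import datetime, timedelta, timezone

def publish_status(last_published, now, stale_after_days=4):
    """Return 'healthy' if the last publish is within the threshold, else 'stale'."""
    age = now - last_published
    return "healthy" if age <= timedelta(days=stale_after_days) else "stale"

# Hypothetical dates: a publish 15 days before the check.
last = datetime(2022, 2, 16, tzinfo=timezone.utc)
now = datetime(2022, 3, 3, tzinfo=timezone.utc)
print(publish_status(last, now))  # prints stale
```

Under this reading, a publish more than 15 days old should never be reported as healthy with a 4-day threshold, so the healthy status would point to the comparison using a different timestamp or not being applied at all.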
The new PR output is
The next step would be to check why there is no new data in TRSS. Should we restart TRSS? @sxa
Opened adoptium/aqa-test-tools#623 for the no-data issue.
I believe TRSS needs to be restarted after the disk space cleanup. @sxa
I suggest @llxia or @Haroon-Khel take that action, as they have more experience of the server if it doesn't come back. Is the need for a restart a guess, or is there something specific about TRSS that you think means it requires a restart? The concern is obviously that with an out of space issue there may be files that have got truncated. It's also not currently clear - at least to me - why an out of space issue outside the
Not sure what you mean by 'truncate'. Is it the issue mentioned in #254 (comment)? If that is the case, maybe reopen #254.
I would also suggest restarting the TRSS services. And I think we may have a disk space issue soon:
Reopening this. If the method for running TRSS has changed, it would be a change driven by the infra team, so it needs infra input/assistance. (And I must have inadvertently closed this issue when merging some script change.)
I am NOT referencing any specific issue. I am saying, as a system administrator, that if the disk space fills up on any machine and an application runs on that file system, there is a risk that files are not correctly written and become corrupted/shortened due to the out of space conditions. In order to understand how likely that is, we'd need to know if TRSS writes anything to the file system other than in the
NOTHING HAS CHANGED on the production server so far.
We are in a broken state at the moment. I am presuming the only option is to restart it. Someone please correct me if that assumption is incorrect. I am not sure why the all-caps shouting. From my read of what Lan shared in a comment above, there have been cores written to disk, so that is something we know is being written to disk. /data is not the issue. The request during community scrum to increase the amount of space (currently 8G?) for where the server is running was denied (can't remember the reason: because it is hard to do? because of Yeah Nah?). I would like everyone to keep their eye on the end goal here and work towards a solution that can work for the project. I would like to be able to use TRSS for next week's project triage, and in its current state that will not be possible.
Publish job has now been publishing for a while - closing.
We used to have a job which posted Slack messages to the build channel, detailing whether or not the latest nightly builds got published to GitHub. This no longer works. I'm guessing this is due to adoptium/temurin-build#2671
Expected behaviour:
Publish stats posted to build slack channel.
Observed behaviour:
No publish stats.
Any other comments:
https://github.com/adoptium/ci-jenkins-pipelines/blob/master/tools/nightly_build_and_test_stats.groovy#L35 and the like would probably need to get updated. It used to have
VARIANT=temurin
now it's VARIANT=hotspot