Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track corrupt workspaces on Windows Machines #1396

Closed
Willsparker opened this issue Jun 16, 2020 · 18 comments
Closed

Track corrupt workspaces on Windows Machines #1396

Willsparker opened this issue Jun 16, 2020 · 18 comments
Assignees
Milestone

Comments

@Willsparker
Copy link
Contributor

Willsparker commented Jun 16, 2020

Ref: adoptium/temurin-build#1855

These have been regularly cropping up in 2 forms:

  • A file such as the one referred to in the issue referenced above (a file called ..the_) is unable to be deleted. When doing this manually through File Explorer, it errors saying the file can't be found and to verify it's location. It can be cleared off by removing it using the Cygwin terminal.

  • A file is locked by a process, and often both Administrator and Jenkins can't delete them due to not having permissions. To rectify this, the file path can be searched in the Associated Handles section, under the CPU tab, in the resource monitor. Once the processes that are using those files are found, they can be right clicked and forced exited. The file can then be deleted by the Administrator user.

This issue is going to act as a way of tracking those sorts of issues to determine if there's a pattern / cause and a way of fixing it so a cleanup isn't required.

@Willsparker
Copy link
Contributor Author

@karianna was attempting to run a build on build-softlayer-win2012r2-x64-1 in which it was failing with the following error:

Suppressed: java.nio.file.AccessDeniedException: C:\workspace\openjdk-build\makejdk-any-platform.sh

Both Admin and Jenkins didn't have permissions to delete or access the workspace as it apparently had Unknown+User as an owner:

Administrator@build-softlayer-win2012r2-x64-1 /cygdrive/c/workspace/openjdk-build
$ ls -la
total 16
drwxrwx---+ 1 jenkins      None             0 Jun 16 04:43 .
drwxrwx---+ 1 jenkins      None             0 Jun 10 07:02 ..
-rw-r-----  1 Unknown+User Unknown+Group 2137 Jun 10 07:02 makejdk-any-platform.                                                             sh
drwxrwxr-x+ 1 jenkins      None             0 Jun 16 04:43 sbin
drwxr-x---  1 Unknown+User Unknown+Group    0 Jun 16 04:44 workspace

The following processes were using those filesystems:
Screenshot 2020-06-16 at 11 20 09

Once those processes were stopped, makejdk-any-platform.sh and workspace disappeared on their own, and the sbin folder was able to be deleted.

@Willsparker
Copy link
Contributor Author

Ref: adoptium/temurin-build#1892
If the above PR works, it should stop the ..the_ file issues.

@karianna karianna added this to TODO in infrastructure via automation Jun 16, 2020
@Willsparker Willsparker changed the title Corrupt workspaces on Windows Machines Track corrupt workspaces on Windows Machines Jun 17, 2020
@Willsparker Willsparker self-assigned this Jun 18, 2020
@Willsparker
Copy link
Contributor Author

#1410 (comment)

^ This comment is a bit of a more in-depth look at what's stopping Jenkins from deleting a workspace. It appears to be due to a java process that is writing to a log file after a test that isn't being stopped properly by Jenkins.
@M-Davies , I know you have a fair amount of experience with tests & Jenkins, any ideas? (sorry to put you on the spot) :-)

@M-Davies
Copy link

At first I thought this may have been due to an aborted run on the affected machine (which usually causes problems like this) but this didn't appear the case as no grinders and only two unsuccessful test runs have been executed on the machine recently.

  • The first run on 20-Jun-2020 had one test failure that had some trouble cleaning itself up at various stages during the process:
00:26:18  Attempting to destroy all caches in cacheDir C:\Users\jenkins.TEST-WIN2012-2\AppData\Local\javasharedresources\
00:26:18  
00:26:18  JVMSHRC241E Error: unable to delete shared class cache file
00:26:18  JVMSHRC336E Port layer error code = -100
00:26:18  JVMSHRC337E Platform error message: (32) The process cannot access the file because it is being used by another process.
00:26:18  JVMSHRC430I Failed to remove current generation of shared class cache "sharedcc_jenkins"
00:26:18  
00:26:18  Attempting to destroy all caches in cacheDir C:\Users\jenkins.TEST-WIN2012-2\AppData\Local\javasharedresources\
00:26:18  
00:26:18  JVMSHRC241E Error: unable to delete shared class cache file
00:26:18  JVMSHRC336E Port layer error code = -100
00:26:18  JVMSHRC337E Platform error message: (32) The process cannot access the file because it is being used by another process.
00:26:18  JVMSHRC430I Failed to remove current generation of shared class cache "sharedcc_jenkins"
00:26:18  cache cleanup done
00:26:18  jdk_lang_j9_0 Finish Time: Sun Jun 21 00:26:17 2020 Epoch Time (ms): 1592717177854
00:26:18  
00:26:18  ===============================================
00:26:18  Running test jdk_math_j9_0 ...
00:26:18  ===============================================
00:26:18  jdk_math_j9_0 Start Time: Sun Jun 21 00:26:18 2020 Epoch Time (ms): 1592717178048
00:26:18  "C:/Users/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_x86-64_windows/openjdkbinary/j2sdk-image\\bin\\java" -Xshareclasses:destroyAll; "C:/Users/jenkins/workspace/Test_openjdk14_j9_sanity.openjdk_x86-64_windows/openjdkbinary/j2sdk-image\\bin\\java" -Xshareclasses:groupAccess,destroyAll; echo "cache cleanup done";
00:26:18  
00:26:18  Attempting to destroy all caches in cacheDir C:\Users\jenkins.TEST-WIN2012-2\AppData\Local\javasharedresources\
00:26:18  
00:26:18  JVMSHRC241E Error: unable to delete shared class cache file
00:26:18  JVMSHRC336E Port layer error code = -100
00:26:18  JVMSHRC337E Platform error message: (32) The process cannot access the file because it is being used by another process.
00:26:18  JVMSHRC430I Failed to remove current generation of shared class cache "sharedcc_jenkins"
00:26:18  
00:26:18  Attempting to destroy all caches in cacheDir C:\Users\jenkins.TEST-WIN2012-2\AppData\Local\javasharedresources\
00:26:18  
00:26:18  JVMSHRC241E Error: unable to delete shared class cache file
00:26:18  JVMSHRC336E Port layer error code = -100
00:26:18  JVMSHRC337E Platform error message: (32) The process cannot access the file because it is being used by another process.
00:26:18  JVMSHRC430I Failed to remove current generation of shared class cache "sharedcc_jenkins"

See the console log for more instances of this occurring and also at the end of the build, it fails to clean up (probably due to the above failures):

[WS-CLEANUP] Deleting project workspace...
[WS-CLEANUP] Deferred wipeout is disabled by the job configuration...
Cannot delete workspace: null
Option not to fail the build is turned on, so let's continue
  • However, the second run on 22-Jun-2020 did not see this error and failed after pumping it's own javacore out. In addition, the cleanws step passed this time which implies that it had been cleaned up previously so it's possible I have missed something here.
  • Then we come to the failure reported in issue 1410 which decided to die at the first step.

The troublesome java process that wasn't cleaned up was created in the first run as the file path in the test_ouput matches the one reported by @Willsparker in #1410 (comment) and I also see the exact same final words in the javacore file of the first run as I do in #1410 .

The only thing that confuses me is why didn't the second run fail with the error reported in #1410 ? Why did it take another build before we saw this error?

A temp solution for this could be to ensure we always force cleanup the workspaces on the test machines PRIOR to running any tests but if the file cannot be deleted by Jenkins or Admin then I'm not sure how we would go about doing that.

Hope this helps! :)

@M-Davies
Copy link

A temp solution for this could be to ensure we always force cleanup the workspaces on the test machines PRIOR to running any tests but if the file cannot be deleted by Jenkins or Admin then I'm not sure how we would go about doing that.

So we do actually clean up prior to running tests but it's not forced and I don't think there is a way for cleanws() to be forced

@Willsparker
Copy link
Contributor Author

Willsparker commented Jun 24, 2020

I think the way this needs to be resolved is to ensure all the Java processes have stopped after any given test run - no process, no locked file and cleanWS() should be able to do it's job. My concern is that Jenkins really should be doing it by itself due to ProcessTreeKiller.
This is linked to #770 - The outcome of which has led to this Jenkins Job : https://ci.adoptopenjdk.net/job/SXA-processCheck/

The problem is that this job only runs once every 7 days, which is evidently not enough to catch the rogue processes. I suppose we could just cause every test job to trigger a processCheck run once it's completed, whether it's finished or not.
ping @sxa

@Willsparker
Copy link
Contributor Author

Ref: #1415
The referenced issue will be a good way of keeping a track of the files that aren't able to be cleaned-up automatically by Jenkins

@Willsparker
Copy link
Contributor Author

Ref: https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk/job/jdk-windows-x64-openj9/69/console (from the issue @M-Davies opened above)

Affected build-softlayer-win2012r2-x64-1, directory C:/workspace/openjdk-build. I was unable to delete it manually with the Administrator user, but was able to with the Jenkins user, weirdly. Haven't seen that one before, as usually the permissions one is due to a leftover process still using the folder, which has to be stopped and then deleted.

@M-Davies
Copy link

M-Davies commented Aug 5, 2020

A file is locked by a process, and often both Administrator and Jenkins can't delete them due to not having permissions. To rectify this, the file path can be searched in the Associated Handles section, under the CPU tab, in the resource monitor. Once the processes that are using those files are found, they can be right clicked and forced exited. The file can then be deleted by the Administrator user.

@Willsparker https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_j9_sanity.openjdk_x86-32_windows/277/console sees test-softlayer-win2012r2-x64-2 reproduce this error

@Willsparker
Copy link
Contributor Author

@M-Davies Right you are- java.exe process has been stopped & the folder has been manually deleted- thanks :-)

@sxa Can we make it so your SXA-processCheck job kicks off for that machine after every test run?

@M-Davies
Copy link

M-Davies commented Aug 6, 2020

@Willsparker test-azure-win2012r2-x64-1 has the same problem as java.exe not being stopped https://ci.adoptopenjdk.net/view/Test_openjdk/job/Test_openjdk8_j9_sanity.openjdk_x86-64_windows_xl/256/console

@Willsparker
Copy link
Contributor Author

This is a new one - there's 10 java.exe processes running... Also I like this output:

Also:   java.nio.file.AccessDeniedException: **C:\Users\jenkins\workspace\Test_openjdk8_j9_sanity.openjdk_x86-64_windows_xl\openjdkbinary\j2sdk-image\bin\java.exe**

Highly useful 😁
I stopped a handful and then the rest stopped, and deleted C:\Users\jenkins\workspace\Test_openjdk8_j9_sanity.openjdk_x86-64_windows_xl\openjdkbinary\ manually 👍

@Willsparker
Copy link
Contributor Author

ref: adoptium/TKG#45

@Willsparker
Copy link
Contributor Author

Ref: adoptium/temurin-build#2076 (comment)
Machine was: build-azure-win2012r2-x64-2

Initial thoughts is that the run was trying to delete a directory that was in use in a different run- however there's only one executor on the machine so that couldn't happen unless someone was running a build locally on the machine.
When I got to it this morning, I was able to delete the folder without any issues, so whichever process was using those folders stopped over the weekend.

@andrew-m-leonard
Copy link
Contributor

https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-windows-x64-openj9/895/console

17:02:45  Compiling 2 files for BUILD_DEMO_APPLET_GraphicsTest
17:02:45  Compiling 1 files for BUILD_DEMO_APPLET_NervousText
17:02:45  Compiling 1 files for BUILD_DEMO_APPLET_SimpleGraph
17:02:45  Compiling 5 files for BUILD_DEMO_APPLET_SortDemo
17:02:45  Could not create temporary file name for at file!
17:02:45  make[2]: *** [CompileDemos.gmk:75: /cygdrive/e/jenkins/tmp/workspace/build/src/build/windows-x86_64-normal-server-release/jdk/demo/applets/SortDemo/_the.BUILD_DEMO_APPLET_SortDemo_batch] Error 127
17:02:45  make[2]: *** Waiting for unfinished jobs....
17:02:45  Compiling 1 files for BUILD_DEMO_APPLET_SpreadSheet
17:02:47  make[1]: *** [BuildJdk.gmk:93: demos] Error 2
17:02:47  make: *** [/cygdrive/e/jenkins/tmp/workspace/build/src//make/Main.gmk:133: demos-only] Error 2

Error comes from here: https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/f9b33244a76dfd671907872ad6800f3e26e6a4a7/common/src/fixpath.c#L254

@andrew-m-leonard
Copy link
Contributor

I would suggest putting some extra debug into: https://github.com/ibmruntimes/openj9-openjdk-jdk8/blob/f9b33244a76dfd671907872ad6800f3e26e6a4a7/common/src/fixpath.c#L254
in a fork and building to recreate, we could do with the "rc" and the file "name"

@sxa
Copy link
Member

sxa commented Nov 18, 2020

We did some experimentation on a call earlier today and the atfile issue seems to have been caused by a lot of leftover atfile_ in the jenkins user's %TEMP% directory - possibly from previous runs with the fixpath debug option enabled. Clearing these out on the machine (ibmcloud build machine 1) appears to have resolved that particular issue.

The changes we are implementing for #1573 may well mean we can close this soon (Fingers crossed...)

@karianna karianna moved this from To do to Review in progress in Top Priorities Nov 18, 2020
@karianna karianna moved this from Review in progress to In progress in Top Priorities Nov 18, 2020
@Willsparker
Copy link
Contributor Author

Okay- with #1573 & adoptium/aqa-tests#2059 , I haven't seen anymore leftover Java processes on Windows Boxes, issues (thank god). Closing :-)

infrastructure automation moved this from In Progress to Done Dec 21, 2020
Top Priorities automation moved this from In progress to Done Dec 21, 2020
@karianna karianna added this to the December 2020 milestone Dec 21, 2020
@gdams gdams removed this from Done in Top Priorities May 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

No branches or pull requests

5 participants