submit command results in memory errors and rejects with <EOFError: end of file reached> invalid SIP descriptor errors and defunct processes #777
Comments
After stopping and restarting DAITSS, ruby is still spawning all sorts of processes:
[lydiam@fclnx30 submit-direct]$ ps -ef | grep daitss
Ingest seems to be processing. Once all submitted packages are ingested, I will attempt to submit one package at a time.
Now the AIP descriptor validation failed with the error we were getting on Friday:
descriptor fails daitss aip xml validation (1 errors)
I reset the package and started it manually. That worked on Friday, but it failed again with the same error. I will restart daitss again.
I restarted daitss and stopped pulse. I restarted another package manually and it appears to be succeeding. Archived E30MZ6Y4F_63VOVM. Other packages failed repeatedly at AIP descriptor validation.
Gerald suggested that collect-fixities be killed. I did that and issued a stop-cron command:
sudo -u daitss touch /opt/fda/etc/stop-cron
I stopped and started daitss again, and the rack processes still show up.
Packages are failing repeatedly with multiple errors such as:
2015-12-14 15:55:14 -0500 descriptor fails daitss aip xml validation (1 errors)
trace /opt/web-services/sites/core/releases/20141117161651/lib/daitss/proc/template/descriptor.rb:49:in
Since packages aren't ingesting, I tried to submit SIPs one at a time and got the Java runtime error and a reject message:
It has been a while, but I seem to remember we set up daitss to spawn multiple instances of specific web services to help with load, such as more than one description service. Some of the other processes listed above are spawned by daitss, such as a mailer daemon and the connection to the database. Also, any time a user logs into daitss, Apache will spawn a core process for that user. I'm not saying that there isn't something wrong with the number of daitss processes, but it's too hard for me to tell.

I would focus on the JVM error. One of two things must have occurred for that error to be generated: either the system ran out of swap after going through the other forms of memory first, or the JVM settings need to be tweaked in the daitss config file. Daitss relies on the JRE (Java Runtime Environment), through the Ruby gem RJB https://rubygems.org/gems/rjb/versions/1.5.4, to do XML validation. We use it here: https://github.com/daitss/core/blob/master/lib/daitss/proc/xmlvalidation.rb#L54

In the daitss config you will need to look at this line: Xms2G - the starting heap size is 2GB. Basically, every JVM spawned will take up 2GB of memory regardless of what is needed. You may want to decrease the starting heap size to Xms256m and increase the max heap size to Xmx4G. This could help with memory allocation: it will allow a single JVM process to grow larger than 2GB if needed but keep other JVM processes smaller. Keep in mind this is specific to the use of the JRE in daitss and not to other processes on the daitss server using a JVM.
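To see which heap flags are actually configured before changing them, a small helper like the one below may be enough. It is only a sketch: the function name `show_heap_flags` is made up for illustration, and since the thread does not show how the jvm_options entries are laid out in the config file, it simply searches the whole file for anything that looks like an -Xms or -Xmx flag.

```shell
# Print every JVM heap flag (-Xms... / -Xmx...) found in a config file.
# show_heap_flags is a hypothetical helper name, not part of daitss.
# Usage (path as quoted elsewhere in this thread):
#   show_heap_flags /opt/web-services/conf.d/daitss-config.yml
show_heap_flags() {
  # -o prints only the matching text; "--" stops grep from reading
  # the pattern's leading dash as an option.
  grep -o -E -- '-Xm[sx][0-9]+[kKmMgG]?' "$1"
}
```

If only -Xmx values come back, no starting heap size is being set and the JVM will choose its own.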
Something to think about: if there are other memory errors not relating to the JRE, that can indicate a host of other potential problems. You will need a baseline. Stop daitss and look at the memory usage. Does it look normal? Fire things up and see what memory looks like. It may be a good idea to disable GUI submission or, better yet, disable the UI completely. Submit packages from the command line and see what's going on with memory.
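For taking that baseline from the command line, here is a rough sketch that summarizes /proc/meminfo. It assumes the usual reading of the two figures quoted in this thread: "physical" counts everything that is not free, while "actual" also subtracts buffers and page cache. The function name `mem_summary` is made up for illustration.

```shell
# Summarize memory use from /proc/meminfo-style input on stdin.
# mem_summary is a hypothetical helper name.
# Usage: mem_summary < /proc/meminfo
mem_summary() {
  awk '
    $1 == "MemTotal:" { total = $2 }
    $1 == "MemFree:"  { memfree = $2 }
    $1 == "Buffers:"  { buffers = $2 }
    $1 == "Cached:"   { cached = $2 }
    END {
      if (total == 0) exit 1   # no MemTotal line: input was not meminfo
      printf "physical %.1f%% actual %.1f%%\n",
        100 * (total - memfree) / total,
        100 * (total - memfree - buffers - cached) / total
    }'
}
```

Running it once with daitss stopped and again after starting it gives two numbers to compare. A large gap between "physical" and "actual" mostly reflects file cache, which the kernel can give back when processes need memory.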
Current settings in /opt/web-services/conf.d/daitss-config.yml:
jvm_options:
describe.fda.fcla.edu:
(no Xms setting at all)
NWRDC has just rebooted fclnx30, and daitss processes and pulse started up. Pulse started ingest on
That's a 7GB+ package, so it will take some time to process. I've stopped pulse. I will not attempt submission of any new packages right now. No new packages were FTP-ed to eclipsep yesterday. Actual memory use is 6% and swap is 0%, with physical at 44.9%. I'm guessing that it's just trying to store the package, because all other steps were completed yesterday:
make aip descriptor 2015-12-14T14:03:49-05:00 112.63
I'm seeing some Passenger processes that are unfamiliar to me:
This package errored out when DAITSS was shut down while it was storing.
The Xms setting is optional. Leaving it off lets the JVM pick its own default starting heap size instead of a fixed one. This is okay as long as there is memory to be allocated when the JVM needs it. It will only allocate up to the Xmx setting. I would run some of the packages that failed before and monitor the memory. If they succeed, then it's not really a problem with the JVM. It probably means the JVM cannot allocate memory because there is none to be had. I'd be willing to bet fclnx30 needs more memory. If that is not an option, I would look at increasing the swap size to compensate. It still may be a good idea to disable the GUI for a time.
There doesn't appear to be any current option to stop only the daitss GUI, so we'd require a developer to provide that option.
There is an option for the operator to disable GUI submission from the UI. Affiliates may still try to download/disseminate packages. A sysadmin should be able to set up a type of redirect for anyone outside of your organization. That would keep people from being able to connect to the GUI.
All packages currently in the workspace were manually restarted and successfully processed.
Retried submission of a single package that had failed yesterday. Failed again, with the same symptoms:
Xymon reports memory usage at 6% actual. Attempted to submit one of the FDA_benchmarks packages with the same results:
[lydiam@fclnx30 ETDs]$ #
GUI submission for the above benchmark package succeeded: E4P2B1PWE_5W0LH1
I am downloading SIPs and submitting them via the GUI, where possible, and starting them manually. There appears to be some problem with the command-line submit and pulse processes. What's the difference between a package started manually and one started by pulse?
Bill K. experimented with submit, submitting as Stephen's user and starting pulse as Stephen's user, and also removed the memory limit from my user. I resubmitted the batch of records and they appear to have submitted successfully and are now being processed successfully. So my biggest problem appears to have been the memory limit on my user. Bill thinks that there are multiple problems, including a possible memory leak. More details to follow.
Bill's assessment of the problem: We actually have 3 different issues.

One is the memory problems upon submission. I think we have resolved this memory issue. There was a parameter in Lydia’s .bash_profile that was limiting her memory usage. Removing it seems to have resolved this issue, but I want to wait a little while longer before declaring victory.

The second issue, the EOF error, has been an intermittent problem for a while now according to Lydia. There are no filesystem, CPU or memory constraints on the server during the time the error occurs. It does not seem to depend on who submitted the package to the app, who started the app, or how long the server has been up.

The third is another memory issue, in which all of the server memory is consumed by the daitss processes. This looks to be a memory leak in the application.

I do not think the second and third problems are hardware or OS related. I have exhausted my limited Ruby skills looking at the problem.
I’ve mostly cleaned up the mess created from multiple submissions/rejects of UF records and submitted a batch of FIU records that were FTP-ed on Friday and am getting the same submit memory error again. Only 2 packages submitted successfully, and 18 got the same error: sip path: /var/daitss/ops/incoming/nonUF_UFAD/FIU/FI15042500; Actual memory is at 21% and physical at 79% [lydiam@fclnx30 ops]$ sudo -b -u daitss submit --username lydiam --password fclad00d2 --source FTPDL --path /var/daitss/ops/incoming/nonUF_UFAD/FIU --batch FTPDL_20151212T090803_FIU |
Assuming the same packages fail again and again with the same Java error, and physical memory doesn't appear to have capped out, maybe it's time to try increasing the JVM memory setting. Instead of -Xmx4096m try -Xmx8G. That will allow the JVM to grab more memory if needed. If it has no impact, then switch it back. Also, do these packages have something in common? Stephen seemed to indicate one of the affiliates is sending large/unusual packages. Are there a lot of .mov files in these packages?
Retried the submit command with a new session. I likely used an old session to do the submission, one that didn't register the memory changes that Bill K. made for my user.
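A quick way to confirm whether a given session still carries a per-user memory cap is to ask the shell itself. This is a sketch under one assumption: the thread does not say which parameter was set in .bash_profile, so a `ulimit -v` (virtual memory) cap is only one plausible candidate. The function name `report_vmem_limit` is made up for illustration.

```shell
# Report the virtual memory cap inherited by processes started from this shell.
# report_vmem_limit is a hypothetical helper name.
report_vmem_limit() {
  limit=$(ulimit -v)
  if [ "$limit" = "unlimited" ]; then
    echo "no virtual memory cap in this shell"
  else
    echo "virtual memory capped at ${limit} kB"
  fi
}

report_vmem_limit

# To look for the setting itself (assumes it is a ulimit line in the profile):
#   grep -n 'ulimit' ~/.bash_profile
```

Limits are inherited when a session starts, so a shell opened before the profile was changed keeps the old cap. That would be consistent with the JVM failing to allocate memory while the server as a whole showed plenty free.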
It may be a worthwhile attempt to find out what user Apache runs as and determine if it has any of those same memory restrictions. The user will typically be "apache" or "nobody" and is usually configured in an httpd.conf file. It could even be configured to run as the "daitss" user.
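One way to look that up is sketched below. The function name `apache_user` is made up for illustration, and the config path in the usage comment is an assumption (the stock Red Hat location); the real file on this server may live elsewhere.

```shell
# Print the account named by the User directive in an Apache config file.
# apache_user is a hypothetical helper name.
# Usage (path is an assumption, the Red Hat default):
#   apache_user /etc/httpd/conf/httpd.conf
apache_user() {
  # Directive names are case-insensitive; commented-out lines start with "#"
  # and therefore never have "user" as their first field.
  awk 'tolower($1) == "user" { print $2 }' "$1"
}

# Once the account is known, its limits can be inspected with, for example:
#   sudo -u apache sh -c 'ulimit -a'
```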
I remember the EOF error is what the XML parser returns whenever it has a problem parsing an XML file. We used to get this error with those invalid characters in the descriptors. The work done in submit has fixed most of the EOF errors; however, there could be some additional scenarios that cause XML parsing to fail.
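To check a descriptor for one class of invalid characters, something like the following could be tried. This is only a sketch: the thread says "invalid characters" without naming them, so it assumes they are the control characters that XML 1.0 does not allow (everything below 0x20 except tab, line feed and carriage return). The function name `count_invalid_xml_bytes` is made up for illustration.

```shell
# Count bytes in a file that XML 1.0 forbids: control characters
# 0x00-0x08, 0x0B, 0x0C and 0x0E-0x1F (tab, LF and CR are allowed).
# count_invalid_xml_bytes is a hypothetical helper name.
count_invalid_xml_bytes() {
  # tr -cd keeps only the listed bytes; wc -c then counts what is left.
  LC_ALL=C tr -cd '\000-\010\013\014\016-\037' < "$1" | wc -c | tr -d ' '
}
```

A result of 0 rules out this particular cause only. Encoding problems or truncated files would need a different check.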
I submitted the following command on fclnx30:
The result was that 6-8 of about 144 SIPs submitted successfully and the remainder rejected with the following error:
The submit log contained the "standard" log messages, with the majority of packages being rejected and some being submitted:
At the same time I got memory errors in STDOUT and logs named hs_err_pid*.log, with the following contents:
[lydiam@fclnx30 ftpdl]$ more hs_err_pid15104.log
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 32744 bytes for ChunkPool::allocate
# Possible reasons:
# The system is out of physical RAM or swap space
# In 32 bit mode, the process size limit was hit
# Possible solutions:
# Reduce memory load on the system
# Increase physical memory or swap space
# Check if swap backing store is full
# Use 64 bit Java on a 64 bit OS
# Decrease Java heap size (-Xmx/-Xms)
# Decrease number of Java threads
# Decrease Java thread stack sizes (-Xss)
# Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (allocation.cpp:214), pid=15104, tid=47784611424576
#
# JRE version: OpenJDK Runtime Environment (7.0_85-b01) (build 1.7.0_85-mockbuild_2015_07_13_18_00-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.85-b03 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 2.6.1
# Distribution: Built on Red Hat Enterprise Linux Server release 5.11 (Tikanga) (Mon Jul 13 18:00:16 EDT 2015)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
--------------- T H R E A D ---------------
Current thread (0x000000001a667000): JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=15130, stack(0x00002b75b8ba2000,0x00002b75b8ca3000)]
Stack: [0x00002b75b8ba2000,0x00002b75b8ca3000], sp=0x00002b75b8c9d440, free space=1005k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x9e3ece]
V [libjvm.so+0x9e45fb]
V [libjvm.so+0x4dcf85]
V [libjvm.so+0x2fdea0]
....
These logs can be found in: /var/daitss/ops/incoming/ftpdl
There are also lots of defunct processes:
daitss 27706 27690 0 12:53 pts/0 00:00:00 [ruby]
daitss 27727 27690 0 12:53 pts/0 00:00:00 [ruby]
daitss 27745 27690 0 12:53 pts/0 00:00:00 [ruby]
daitss 27766 27690 0 12:53 pts/0 00:00:00 [ruby]
daitss 27787 27690 0 12:53 pts/0 00:00:00 [ruby]
daitss 27809 27690 0 12:53 pts/0 00:00:00 [ruby]
And other daitss processes from Dec. 13:
daitss 23705 1 0 Dec13 ? 00:03:44 Rack: /opt/web-services/sites/core/current
daitss 23713 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/core/current
daitss 23719 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/core/current
daitss 23725 1 0 Dec13 ? 00:00:21 Rack: /opt/web-services/sites/core/current
daitss 23736 1 0 Dec13 ? 00:00:49 Rack: /opt/web-services/sites/actionplan/current
daitss 23743 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/actionplan/current
daitss 23749 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/actionplan/current
daitss 23755 1 0 Dec13 ? 00:00:17 Rack: /opt/web-services/sites/actionplan/current
daitss 23763 1 0 Dec13 ? 00:00:02 Rack: /opt/web-services/sites/storage-master/current
daitss 23773 1 0 Dec13 ? 00:00:02 Rack: /opt/web-services/sites/storage-master/current
daitss 23787 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/transform/current
I will kill the defunct processes and attempt to restart DAITSS.
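A note on the defunct entries: a defunct (zombie) process has already exited, so sending it a signal does nothing. It disappears once its parent collects its exit status, or once the parent itself exits. In the listing above, all six defunct ruby processes share parent PID 27690, so that is the process to look at. A small sketch for grouping zombies by parent follows; the function name `defunct_by_parent` is made up for illustration.

```shell
# Read `ps -eo pid,ppid,stat,comm` output on stdin and print, for each
# parent PID, how many of its children are in state Z (zombie/defunct).
# defunct_by_parent is a hypothetical helper name.
# Usage: ps -eo pid,ppid,stat,comm | defunct_by_parent
defunct_by_parent() {
  awk '$3 ~ /^Z/ { count[$2]++ }
       END { for (p in count) print p, count[p] }'
}
```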
Gerald suggested attempting submission of one package at a time.