redundant CUPS servers dropping print jobs when 1 is down #581

Closed
michaelrsweet opened this Issue Feb 18, 2004 · 4 comments

Comments

Projects
None yet
1 participant
Collaborator

michaelrsweet commented Feb 18, 2004

Version: 1.1.20
CUPS.org User: jeff.layton.bowebellhowell

I posted this to the cups.bugs newsgroup yesterday, but perhaps this is the better place to post it:

I have 2 CUPS servers in my environment that are configured in a redundant setup. Both machines have the exact same printers set up on them, and my clients all use the 'BrowsePoll' option to get a list of print queues from both servers. I'm running 1.1.20 on both the clients and the servers, and they're all Solaris 8 machines.

In the past I've noticed some problems with dropped print jobs when I
bring one of the CUPS servers down, but have had little info to go on.
Today, I tested it in a somewhat controlled environment and have some
info.

Today at around 7:55AM, I brought down one of my servers and watched the
log from the client as several print jobs were submitted soon afterward.
Here's an excerpt from the client's error_log file during this period:

E [17/Feb/2004:07:55:57 +0500] [cups-polld 192.168.xxx.yy:631] get-printers failed: server-error-service-unavailable
E [17/Feb/2004:07:55:57 +0500] [cups-polld 192.168.xxx.yy:631] get-classes failed: server-error-service-unavailable
I [17/Feb/2004:07:55:59 +0500] Job 9642 queued on 'pap3' by 'applprd'.
I [17/Feb/2004:07:55:59 +0500] Started backend /usr/local/lib/cups/backend/ipp (PID 27001) for job 9642.
I [17/Feb/2004:07:56:00 +0500] Job 9643 queued on 'pabom' by 'applprd'. I [17/Feb/2004:07:56:00 +0500] Started backend /usr/local/lib/cups/backend/ipp (PID 27006) for job 9643.
I [17/Feb/2004:07:56:01 +0500] Job 9644 queued on 'pareceiving3' by
'applprd'.
I [17/Feb/2004:07:56:01 +0500] Started backend /usr/local/lib/cups/backend/ipp (PID 27014) for job 9644.
I [17/Feb/2004:07:56:01 +0500] Job 9645 queued on 'pareceiving3' by
'applprd'.
I [17/Feb/2004:07:56:01 +0500] Started backend /usr/local/lib/cups/backend/ipp (PID 27038) for job 9645.
I [17/Feb/2004:07:56:02 +0500] Job 9646 queued on 'pabom' by 'applprd'. I [17/Feb/2004:07:56:02 +0500] Started backend
/usr/local/lib/cups/backend/ipp (PID 27039) for job 9646.
E [17/Feb/2004:07:56:04 +0500] PID 27001 stopped with status 1!
I [17/Feb/2004:07:56:04 +0500] Hint: Try setting the LogLevel to "debug" to find out more.
I [17/Feb/2004:07:56:04 +0500] Job 9647 queued on 'pap3' by 'applprd'.
I [17/Feb/2004:07:56:04 +0500] Started backend
/usr/local/lib/cups/backend/ipp (PID 27152) for job 9647.
E [17/Feb/2004:07:56:05 +0500] PID 27006 stopped with status 1!
I [17/Feb/2004:07:56:05 +0500] Hint: Try setting the LogLevel to "debug" to find out more.
E [17/Feb/2004:07:56:06 +0500] PID 27038 stopped with status 1!
I [17/Feb/2004:07:56:06 +0500] Hint: Try setting the LogLevel to "debug" to find out more.
E [17/Feb/2004:07:56:27 +0500] [cups-polld 192.168.xxx.yy:631]
get-printers failed: server-error-service-unavailable
E [17/Feb/2004:07:56:27 +0500] [cups-polld 192.168.xxx.yy:631] get-classes failed: server-error-service-unavailable
I [17/Feb/2004:07:56:31 +0500] Job 9648 queued on 'pap3' by 'applprd'.
I [17/Feb/2004:07:56:31 +0500] Started backend /usr/local/lib/cups/backend/ipp (PID 27326) for job 9648.
I [17/Feb/2004:07:56:32 +0500] Job 9649 queued on 'pareceiving3' by
'applprd'.
I [17/Feb/2004:07:56:32 +0500] Started backend /usr/local/lib/cups/backend/ipp (PID 27365) for job 9649.
E [17/Feb/2004:07:56:36 +0500] PID 27326 stopped with status 1!
I [17/Feb/2004:07:56:36 +0500] Hint: Try setting the LogLevel to "debug" to find out more.
E [17/Feb/2004:07:56:57 +0500] [cups-polld 192.168.xxx.yy:631] get-printers failed: server-error-service-unavailable
E [17/Feb/2004:07:56:57 +0500] [cups-polld 192.168.xxx.yy:631] get-classes failed: server-error-service-unavailable
E [17/Feb/2004:07:57:23 +0500] [cups-polld 192.168.xxx.zz:631] get-printers failed: client-error-bad-request
I [17/Feb/2004:07:58:05 +0500] Job 9650 queued on 'pinnfin4' by 'applprd'.
I [17/Feb/2004:07:58:05 +0500] Started backend /usr/local/lib/cups/backend/ipp (PID 28094) for job 9650.
I [17/Feb/2004:07:58:46 +0500] Job 9651 queued on 'pap4' by 'applprd'.

The thing to notice is that several jobs (9642, 9643, 9645, and 9648), all had their backend ipp processes die (presumably because they were
attempting to send to the down server), but they never resubmitted the
jobs.

Also when I look at the page_log file, there is no record at all of any of the 4 jobs that got dropped.

I then changed the LogLevel on the client to debug and bounced cupsd
there. Once it had polled all the printers, I killed cupsd on one of the
servers, and then immediately queued 3 small print jobs (9668, 9669,
9670). It dropped one of them (9670).

Here is a small excerpt from the log with the debug info in it. I've
attached all the error log messages from 02/17/2004 to this STR.

I [17/Feb/2004:08:31:02 +0500] Started backend /usr/local/lib/cups/backend/ipp (PID 10487) for job 9670.
E [17/Feb/2004:08:31:07 +0500] [cups-polld 192.168.100.42:631]
get-printers failed: server-error-service-unavailable
E [17/Feb/2004:08:31:07 +0500] [cups-polld 192.168.100.42:631] get-classes failed: server-error-service-unavailable
D [17/Feb/2004:08:31:07 +0500] UpdateJob: job 9670, file 0 is complete. D [17/Feb/2004:08:31:07 +0500] CancelJob: id = 9670
D [17/Feb/2004:08:31:07 +0500] StopJob: id = 9670, force = 0
D [17/Feb/2004:08:31:07 +0500] StopJob: printer state is 3
E [17/Feb/2004:08:31:07 +0500] PID 10487 stopped with status 1!

The status 1 error message seems to be the last mention in the error_log
of anything to do with job 9670. Again, there is no listing for 9670 in
the page_log.

Job 9668, seems to have failed on the first attempt as well, but did get resubmitted correctly:

I [17/Feb/2004:08:30:56 +0500] Started backend
/usr/local/lib/cups/backend/ipp (PID 10371) for job 9668.
E [17/Feb/2004:08:31:02 +0500] PID 10371 stopped with status 1!
D [17/Feb/2004:08:31:02 +0500] UpdateJob: job 9668, file 0 is complete. D [17/Feb/2004:08:31:02 +0500] StopJob: id = 9668, force = 0
D [17/Feb/2004:08:31:02 +0500] StopJob: printer state is 3
D [17/Feb/2004:08:31:02 +0500] StartJob(9668, 231218)
D [17/Feb/2004:08:31:02 +0500] StartJob() id = 9668, file = 0/1
D [17/Feb/2004:08:31:02 +0500] StartJob: Sending job to queue tagged as
raw...
D [17/Feb/2004:08:31:02 +0500] job-sheets=none,none
D [17/Feb/2004:08:31:02 +0500] banner_page = 0
D [17/Feb/2004:08:31:02 +0500] StartJob: argv = "pinncopy@bohr","9668","laytonj","hosts","1","job-sheets=none,none
job-priority=50 job-hold-until=no-hold","/var/spool/cups/d09668-001"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[0]="PATH=/usr/local/lib/cups/filter:/bin:/usr/bin"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[1]="SOFTWARE=CUPS/1.1"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[2]="USER=root"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[3]="CHARSET=iso-8859-1"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[4]="LANG=en"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[5]="TZ=US/Eastern"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[6]="PPD=/etc/cups/ppd/pinncopy@bohr.ppd"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[7]="CUPS_SERVERROOT=/etc/cups"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[8]="RIP_MAX_CACHE=8m"
D [17/Feb/2004:08:31:02 +0500] StartJob:
envp[9]="TMPDIR=/var/spool/cups/tmp"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[10]="CONTENT_TYPE=text/plain"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[11]="DEVICE_URI=ipp://bohr:631/printers/pinncopy"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[12]="PRINTER=pinncopy@bohr"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[13]="CUPS_DATADIR=/usr/local/share/cups"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[14]="CUPS_FONTPATH=/usr/local/share/cups/fonts"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[15]="CUPS_SERVER=localhost"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[16]="IPP_PORT=631"
D [17/Feb/2004:08:31:02 +0500] StartJob: envp[17]="CLASS=pinncopy"
D [17/Feb/2004:08:31:02 +0500] StartJob: statusfds = [ 8 9 ]
D [17/Feb/2004:08:31:02 +0500] StartJob: filterfds[1] = [ 12 -1 ]

Just for chuckles, here's the page_log from that period:

pinncopy@bohr laytonj 9669 [17/Feb/2004:08:30:57 +0500] 1 1 - localhost
pinncopy@bohr laytonj 9669 [17/Feb/2004:08:30:57 +0500] total 0 - localhost
pinncopy@bohr laytonj 9668 [17/Feb/2004:08:31:02 +0500] 1 1 - localhost
pinncopy@bohr laytonj 9668 [17/Feb/2004:08:31:02 +0500] total 0 - localhost
pinncopy@bohr laytonj 9669 [17/Feb/2004:08:31:07 +0500] total 1 - localhost
pinncopy@bohr laytonj 9668 [17/Feb/2004:08:31:12 +0500] total 1 - localhost

Some other (anecdotal) info that may be useful:

I did a more controlled test yesterday where I had a very idle client
browsepoll one of the main cups servers and another (fairly idle) cups
server. I then brought down the idle cups server and submitted 3 jobs.
Those all went through. So perhaps this is a corner-case type of problem, and is somewhat dependent on how busy the client or server is.

I'm willing to help in about any way possible! I'll be happy to test
patches, or do more testing if requested to do so!

Collaborator

michaelrsweet commented Feb 18, 2004

CUPS.org User: mike

I believe this was caused by the issue in STR #448; can you check out the patch that is attached to the STR:

http://www.cups.org/str.php?L448

and then let me know if it does not resolve the issue?

Thanks!

Collaborator

michaelrsweet commented Feb 18, 2004

CUPS.org User: jeff.layton.bowebellhowell

Will do, it'll probably take some time (a week or more) for me to get a new version built, packaged and deployed. I'll let you know once I've tested it again.

Thanks!
Jeff

Collaborator

michaelrsweet commented Feb 24, 2004

CUPS.org User: jeff.layton.bowebellhowell

The patch seems to have done the trick. The jobs are being resubmitted correctly now when one server goes down.

Will this patch be in 1.1.21?

Collaborator

michaelrsweet commented Feb 24, 2004

CUPS.org User: mike

Yes, it will be part of 1.1.21 and 1.2.

michaelrsweet added this to the Stable milestone Mar 17, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment