1.0.2 + SSL breaks everything #691

Closed
Borkason opened this Issue Mar 24, 2013 · 58 comments

@Borkason
Cherokee Project member

Original author: hcarvalh...@gmail.com (June 17, 2010 07:24:47)

What steps will reproduce the problem?
1. Update to 1.0.2-1
2. Configure a vServer with FCGI and Media serving under SSL
3. Browse around

What is the expected output?
Expected things to work.

What do you see instead?
Random timeouts, partial content transfer and cherokee-worker using 100% CPU is seen instead.

What version of the product are you using? On what operating system?
Cherokee 1.0.2-1~karmic~ppa / Ubuntu Karmic 64bit

Please provide any additional information below.
Site was working perfectly with 1.0, the same config file.

Changing keep-alive or server timeout just leads to different rates of breakage.

NOT related to the FCGI source or the media served; it works fine with other httpds (apache, lighty).

Original issue: http://code.google.com/p/cherokee/issues/detail?id=909

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 17, 2010 07:45:50
For the record, ldconfig output:

libssl.so.0.9.8 -> libssl.so.0.9.8
libgnutls-openssl.so.26 -> libgnutls-openssl.so.26.14.10

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 17, 2010 15:56:52
Still getting inconsistent SSL behavior across the browsers of my user base. Some requests simply refuse to transfer any content (blank page in the browser), and some versions of Safari even crash when SSL is accessed.

What's going on with this release? Anyone with related problems?

@Borkason
Cherokee Project member

From ste...@konink.de on June 17, 2010 18:24:44
You are not the only one but pinpointing the exact cause is currently problematic.

http://code.google.com/p/cherokee/issues/detail?id=594 (Github: #575)

@Borkason
Cherokee Project member

From alobbs on June 20, 2010 12:58:47
http://svn.cherokee-project.com/changeset/5210 should fix part of the issue.

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 20, 2010 21:52:03
Looking forward to 1.0.3 to test this. For now rolling back to 0.9x. Thanks for the feedback

@Borkason
Cherokee Project member

From lnu...@gmail.com on June 21, 2010 14:45:50
Just tested the latest svn tarball cherokee-1.0.3b5215 and still hangs on SSL

@Borkason
Cherokee Project member

From alobbs on June 21, 2010 14:54:34
Leonel, are you sure of that? Doing what?

@Borkason
Cherokee Project member

From lnu...@gmail.com on June 21, 2010 17:37:28
Downloaded the latest svn tarball,
compiled, built and tested it with a self-signed cert,
then tried to access https://localhost/

The browser just sits there waiting ..

@Borkason
Cherokee Project member

From lnu...@gmail.com on June 21, 2010 17:45:46
If I open https://localhost/ it responds OK the first 2-5 times,
then stops responding.

If I run ab -c 10 -n 100 https://localhost/ the browser stops responding.
If I run ab -c 100 -n 1000 http://localhost/ the browser works fine.
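
A rough Python counterpart to the two ab runs above, for anyone who wants to see which kind of failure each request hits rather than only a total. This is a sketch under assumptions: `hammer` and `fetch_https` are illustrative names, and certificate checks are switched off only because the test used a self-signed cert on localhost.

```python
from concurrent.futures import ThreadPoolExecutor
import ssl
import urllib.request


def fetch_https(url, timeout=10.0):
    """Fetch url once and return the body length; raises on any failure."""
    # Verification is disabled because the server under test uses a
    # self-signed certificate. Do not do this against real hosts.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
        return len(resp.read())


def hammer(url, total=100, concurrency=10, fetch=fetch_https):
    """Issue `total` requests, at most `concurrency` at a time.

    Returns a dict mapping "ok" or an exception class name to a count.
    """
    def one(_):
        try:
            fetch(url)
            return "ok"
        except Exception as exc:  # timeouts, resets, truncated reads, ...
            return type(exc).__name__

    counts = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for outcome in pool.map(one, range(total)):
            counts[outcome] = counts.get(outcome, 0) + 1
    return counts
```

Calling `hammer("https://localhost/", total=100, concurrency=10)` roughly mirrors the first ab run. Unlike ab it opens a fresh connection per request, so it does not exercise keep-alive.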

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 21, 2010 20:28:47
Thanks lnunez, testing with ab may show the problem for me too:

If we get many concurrent requests, cherokee-worker hits 100% CPU and freezes. It looks like it's somehow related to SSL.

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 23, 2010 01:23:55
Updated to 1.0.4 today and couldn't reproduce easily with Firefox as before. I'll test on production for the next day with real-world load (~10 req/s) and tracing enabled. I'll report the result if we have any problems, otherwise I guess this one can be marked as solved ;)

@Borkason
Cherokee Project member

From ste...@konink.de on June 23, 2010 01:34:28
I benchmarked it today; we should be getting about 100x more requests through than we do in this release. Anyway, it is way better than before. Still not off the radar, though.

@Borkason
Cherokee Project member

From lnu...@gmail.com on June 23, 2010 02:38:25
Still on 1.0.4: I can reload the https://localhost/ default page only 2 times before the server closes the connection.

@Borkason
Cherokee Project member

From lnu...@gmail.com on June 23, 2010 02:41:17
I can't put this version into production. I need https, and this bug makes me hold off on the upgrade.

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 23, 2010 22:57:38
Still having cherokee-worker hanging at 100% CPU sporadically when accessing SSL with 1.0.4

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 23, 2010 23:00:26
Serving 2 SSL requests in a row makes cherokee-worker hang and consume all CPU. It's easy to reproduce the bug by reloading an SSL page twice in the browser.

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 23, 2010 23:06:40
Also, I noticed that having keep-alive disabled renders SSL unusable: the server just drops the connection. With it enabled, SSL works sometimes, and sometimes it hangs cherokee-worker.

@Borkason
Cherokee Project member

From lnu...@gmail.com on June 24, 2010 00:21:41
This is why I took the time yesterday to build the PPA packages. But I read on IRC that the SSL bugs were gone.

For me, 1.0.4 with SSL still drops connections after the 3rd page reload, both with the Ubuntu packages from the PPA and with a build from the tar.gz.

@Borkason
Cherokee Project member

From alobbs on June 24, 2010 08:02:58
The issue is partially fixed in trunk now. I have managed to get Cherokee to work fine with Chrome, although for some reason it's still misbehaving while serving content to Firefox.

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 25, 2010 20:42:19
Is anyone working on this bug? I tried a build from the latest source and still have the same issue (hanging cherokee-worker); I can always reproduce it.

Built with tracing enabled. Are the trace messages of any use?

Anything else that can be done from my side to help?

@Borkason
Cherokee Project member

From davisd.davisd@gmail.com on June 30, 2010 21:19:10
I'm having the same problem with 1.0.4, 1.0.3, and 1.0.2

Oddly, Chrome browser works fine... Using firefox causes problems.

1.0.1 works fine, I'll be running that until this is fixed.

-David

@Borkason
Cherokee Project member

From davisd.davisd@gmail.com on June 30, 2010 22:55:33
I should note that with 1.0.1, I'm getting periodic (Error code: sec_error_bad_signature) as in http://code.google.com/p/cherokee/issues/detail?id=594 (Github: #575) and I've got to restart cherokee.

I've had SSL problems since I started using Cherokee way back with 0.99.42 in February...

I've duplicated both problems on several Ubuntu servers (9.10, 10.04) and several Arch Linux servers... I've run different versions of cherokee and openssl, on different machines, different virtual machines and different Linux distributions.

I assume this bug 909 is directly related to the fixes for bug 594 ?

@Borkason
Cherokee Project member

From alobbs on July 01, 2010 05:57:53
David, I see the same behavior at my end. Chrome and Opera work alright, but FF fails when a connection is unexpectedly closed.

I've been struggling to find a consistent way to reproduce the issue. If you find some, please let me know. It'd be of great help.

About bug 594. I guess we fixed it along the way - at the same time that we introduced the regression we are currently talking about.

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on July 01, 2010 06:21:57
@alobbs

I cannot always reproduce Firefox reporting sec_error_bad_signature, but I can always reproduce the closed connections and cherokee-worker crashing.

I'm configuring cherokee for both http and https with default keep-alive settings (enabled, default timeouts). Making 2 successive requests from Firefox will then always hang cherokee-worker at 100%, and connections start timing out. From there, sometimes Firefox will report sec_error_bad_sig, sometimes it will load everything, and sometimes it will truncate the response. I guess it depends on the state in which cherokee-worker crashes.

Try with an HTML document linking to many stylesheets and images, with everything served through SSL, instead of just a blank page. The behavior may be different with more concurrent requests.

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on July 01, 2010 06:34:19
@alobbs

Forgot to say that the behavior in my last comment is with 1.0.4 and SVN.

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on July 23, 2010 22:30:44
Did 1.0.5 fix this one?

@Borkason
Cherokee Project member

From alobbs on July 27, 2010 09:14:12
I'm afraid it did not.

@Borkason
Cherokee Project member

From prudhvik...@gmail.com on August 05, 2010 21:28:59
Is this issue fixed in 1.0.6?

@Borkason
Cherokee Project member

From lukasz.k...@gmail.com on August 10, 2010 06:05:02
1.0.6 still has problems on FF; IE 8 works fine.

@Borkason
Cherokee Project member

From prudhvik...@gmail.com on August 10, 2010 16:45:19
This is a major blocker for us to start running cherokee in production. Is this fixed in 1.0.7? Where can I find changelogs?

@Borkason
Cherokee Project member

From alobbs on August 10, 2010 16:47:59
We are currently working on it. A few hours ago a related patch made it to trunk, although the problem isn't fully solved yet: http://svn.cherokee-project.com/changeset/5363

@Borkason
Cherokee Project member

From lnu...@gmail.com on August 11, 2010 00:20:41
@prudhvikrishna The Ubuntu packages at Launchpad now come in 2 repositories:
the current PPA, which NOW has cherokee 1.0.1, which works perfectly with SSL,

and the NEW PPA repo named i-tse.

You can read all about it here:
http://lists.octality.com/pipermail/cherokee/2010-August/013274.html

So I recommend you use 1.0.1 in production, as I do. Once the SSL bug gets fixed, this 1.0.1 will be upgraded to the version that fixes the problem.

This is why there are 2 PPA repos ;)

Saludos

@Borkason
Cherokee Project member

From go.on....@googlemail.com on August 11, 2010 02:20:26
@alobbs
Do you know where this issue comes from? Or do you need more information about it, or any help with it? My problem is that our site launches in a few days and I would like to provide an SSL cert on it. But with version 1.0.4 that's impossible. Version 0.74 (I think) in the Debian repos does not support webm streaming, and compiling 1.0.1 is impossible for me because of some ffmpeg problems that are somehow unsolvable on Lenny.

So far I'm really happy with Cherokee and the performance tests are great, but this (and the lack of additional header support) is a real problem for our production environment. If you say the problem might be solved in a few days or weeks, we will wait for it and use the time to optimize our webpage. If not, I might need to take a look at lighty. And I really don't want to do that, as I think cherokee is much better.

@Borkason
Cherokee Project member

From alobbs on August 11, 2010 06:45:19
@go.on.joe: I'm still investigating the issue. Hopefully it'll be fixed soon, although I couldn't tell you for sure when that will be. It has already taken much more time than I'd have expected.

@Borkason
Cherokee Project member

From alobbs on August 11, 2010 09:42:56
I believe the issue has been fixed up. Could you guys please give r5368 (or later) a try?

@Borkason
Cherokee Project member

From lnu...@gmail.com on August 11, 2010 10:40:02
Tested r5369

first set of tests and .... https on firefox it's working! \o/ YES !!
I'll do more testing later

Thank you

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on August 11, 2010 18:08:40
That's good news. When I have time, I'll try this SVN rev on our staging server with all major browsers.

@Borkason
Cherokee Project member

From go.on....@googlemail.com on August 11, 2010 22:36:42
Ok, I recompiled X264, ffmpeg with shared libs and managed to compile cherokee. The firefox issue seems to be gone, but now I got another problem. I can output files with php up to about 20k - everything above just shows a blank file with no error message. :-(

Big static files work fine and the same php files without ssl also. Any idea, what the problem might be?

@Borkason
Cherokee Project member

From ste...@konink.de on August 11, 2010 22:48:51
@go.on.joe

Please log a separate bug for this.

@Borkason
Cherokee Project member

From go.on....@googlemail.com on August 11, 2010 22:51:23
Chrome shows "Error 100 (net::ERR_CONNECTION_CLOSED): Unknown Error" after a few reloads

Firefox shows a blank page nearly every time.

Opera 10.6 works fine most of the time, but sometimes cuts off parts of the file

@Borkason
Cherokee Project member

From ste...@konink.de on August 11, 2010 22:56:37
@go.on.joe

Please open another bug if this does not refer to the SSL bug. Are you using 1.0.8?

@Borkason
Cherokee Project member

From davisd.davisd@gmail.com on August 15, 2010 00:02:03
I upgraded two servers to 1.0.8 today and so far so good! Thanks for all of the hard work! I'll post back if there are problems.

@Borkason
Cherokee Project member

From alobbs on August 15, 2010 07:27:27
Sweet. Thanks for the feedback @davisd.davisd!

@Borkason
Cherokee Project member

From skar...@gmail.com on August 15, 2010 15:55:18
Only to say: Congratulations @alobbs

Tough bug fix... ;)

@Borkason
Cherokee Project member

From kallist...@gmail.com on September 24, 2010 15:25:28
uh-oh.. I just started seeing this in 1.0.8:

Chrome shows "Error 100 (net::ERR_CONNECTION_CLOSED): Unknown Error" after a few reloads

Firefox shows a blank page nearly every time.

@Borkason
Cherokee Project member

From woll...@gmail.com on October 16, 2010 00:31:26
We have been using Cherokee (1.0.8) since the start of the term on a quite busy Moodle installation (eLearning) for our university, and at first we were quite happy about the improved smoothness...
But we use HTTPS for authentication, and we also encounter this bug. We ran some experiments and analysis of our own to discover its nature, so perhaps I can contribute some useful information and get some help (?)

1) The bug appears here with SSL/fcgi-php5/chunked-encoding/content-compression/keep-alive. Static content gets delivered. PHP content isn't delivered reliably. If only a small amount of data is sent, the delivery probability is high. With just a few bytes exchanged it's nearly 99%, but it drops to maybe 70% for larger outputs.

2) I'm not sure that the bug is really browser dependent. It's just that different browsers act differently when data that was once accessible is no longer available -
some just present a cached version instead.
The bug is easily detectable in our server logs (e.g. zero bytes delivered successfully; with larger files sometimes only a truncated response) and it seems every browser type gets its share of errors at random. In our case we were getting an unusually high rate of complaints about faulty logins from our students.

3) Symptoms in Cherokee's error log:
Some of these problems are visible in the error log. It's mostly lines like this one
15/10/2010 16:07:10.508] (error) fdpoll-epoll.c:140 - epoll_ctl: ep_fd 18, fd 107: 'No such file or directory'
with the time corresponding directly to a login failure, which then results in one defunct thread (connection?). If that happens too often in a short time, the server can't deliver any PHP-generated content for a while, not even via plain HTTP, showing "Gateway 504" (timeout) messages.
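
To line these error-log entries up with login failures, one could bucket them per minute. The sketch below assumes every entry follows the shape of the single sample line quoted above; the function name is illustrative, and the opening "[" is treated as optional because the sample starts without it.

```python
import re
from collections import Counter

# Pattern derived from the one sample line in this comment; other Cherokee
# log entries may be shaped differently.
EPOLL_ERR = re.compile(
    r"^\[?(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}):\d{2}\.\d+\] "
    r"\(error\) fdpoll-epoll\.c:\d+ - epoll_ctl: ep_fd \d+, fd (?P<fd>\d+):"
)


def epoll_errors_per_minute(lines):
    """Count epoll_ctl error lines, keyed by 'DD/MM/YYYY HH:MM'."""
    counts = Counter()
    for line in lines:
        m = EPOLL_ERR.match(line.strip())
        if m:
            counts[f"{m['date']} {m['time']}"] += 1
    return counts
```

Feeding it an open error-log file object gives a per-minute histogram that can be compared against the timestamps of the failed logins.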

The total blackout obviously happens when lots of students are logging in. But otherwise it's not a really heavy load situation - plenty of free processor power, RAM and idling php-fcgis - especially after a while ;¬)

I didn't want to file a new bug, because this bug report seems to address the same problem - and so does Issue 954 (Github: #707).
(I'm not sure whether it suggests disabling "chunked encoding" globally just to use SSL?)

By the way: the server is running OpenSuse 11.3 (x86_64) with kernel 2.6.34.7, Cherokee 1.0.8-10.2, php 5.3.3-0.1.2.

@Borkason
Cherokee Project member

From ste...@konink.de on October 16, 2010 00:52:35
@wollatz how many concurrent php-cgi clients are you running? Is it possible that all your clients are saturated?

@Borkason
Cherokee Project member

From woll...@gmail.com on October 16, 2010 13:26:51
@konink.de
Big thanks for your answer! I'm surprised about the swiftness.

12 concurrent php-cgi clients ... and yes, saturation was the case, due to an overloaded authentication server. The SSL problem took all the blame, but it wasn't directly responsible and is only loosely related:
The new students tried different passwords at random after their real passwords got them a white page, which triggered a delay feature of the password server... ;o(


I've just tested the SSL error again and it always shows up with a delivered HTTP header and no timeout, but also no (full) content.
So, in the access.log it shows up as:
xxx.xxx.116.62 - - [16/Oct/2010:13:24:02 +0200] "GET /css.php HTTP/1.1" 200 360 (..)
instead of the correct:
xxx.xxx.116.62 - - [16/Oct/2010:13:23:04 +0200] "GET /css.php HTTP/1.1" 200 14946 (..)
Checked via liveHTTP: at least the delivered header seems identical in both cases.
I've checked that error with different browsers - only the messages seem to differ, but some try to cope by presenting an older cached version of the data.
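
Since the symptom is a 200 response with a much smaller byte count than usual, a quick scan of the access log can flag candidates. This is only a heuristic sketch: the regex is derived from the two sample lines above, `suspicious_sizes` and the 0.5 ratio are arbitrary choices, and pages whose size legitimately varies will produce false positives.

```python
import re
from collections import defaultdict

# Request line, status and byte count, as laid out in the sample lines above.
LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)


def suspicious_sizes(lines, ratio=0.5):
    """Return {path: [sizes]} for 200 responses that are smaller than
    `ratio` times the largest 200 response seen for the same path."""
    sizes = defaultdict(list)
    for line in lines:
        m = LINE.search(line)
        if m and m["status"] == "200":
            sizes[m["path"]].append(int(m["size"]))

    flagged = {}
    for path, seen in sizes.items():
        biggest = max(seen)
        small = [s for s in seen if s < biggest * ratio]
        if small:
            flagged[path] = small
    return flagged
```

Run against the two /css.php lines above, it flags the 360-byte response, because 14946 bytes was seen for the same path.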

It seems every SSL-delivered page has a different but individually fixed "truncation point". Our user-profile page is truncated after 4973 bytes for one user, but is normally 17672 bytes big (gzipped); the same script showing a different user profile is 17692 bytes big and, if truncated, just 4970 bytes.
The truncation always happens at the same HTML tag! Inside PHP there seems to be nothing special at this point - no explicit "flush()" and not even a newline behind that last delivered HTML tag; the following content is just generated by the next print command. So perhaps some internal PHP buffer got filled up at about that position, so that the next (big) print() always goes into the next chunk?

So that script always shows the same kind of truncated page, and the css.php script never shows any content - when they show faulty content at all.
The probability of a corrupted page is much higher (40-60%) for the more complex user-profile page than for the login (which produces a redirect) or the css output.

I wasn't able to get faulty pages with the wget command. But then I haven't tested with keep-alive, gzip and cookies...


Now I'll try to rewrite the authentication process so it is better suited to handle this situation at high-traffic times. ;o)
By the way, what is the main drawback of globally disabling "chunked encoding"? I just found the notion that it is bad for "keep-alive connections". Would a big download and parallel surfing on the same site create a problem in this case?

@Borkason
Cherokee Project member

From ste...@konink.de on October 16, 2010 13:35:00
Can you disable things like GZIP/Deflate? The big pain is of course that to debug it you would have to do very dirty things, and I strongly advise against doing this in production. Maybe you can try out the svn version and report back whether you have the same issues there.

@Borkason
Cherokee Project member

From ste...@konink.de on October 16, 2010 14:04:41
I strongly feel that you are bitten by something additional...

1) the global network timeout (it basically kills the connection to the backend, PHP, and the connection to your client)
2) your php timeout is probably higher than the network timeout (15s), so try to increase the timeout; that should reduce the 504 messages (and make php time out before that error)

@Borkason
Cherokee Project member

From woll...@gmail.com on October 16, 2010 17:30:16
@konink.de

Thank you for your advice and tips. I'm testing all this on an almost identical parallel installation - our test/development server - not on the real thing ... ;¬)

How we got those 504 Gateway errors seems fairly obvious in retrospect: an increasing number of PHP threads waiting for answers from the authentication server, until no more PHP threads were ready for work...
I should change this behaviour by setting the timeout of the auth request lower than the PHP timeout and delivering a meaningful error message back - or by switching to another LDAP server...
Cherokee's network timeout was set to 35s and (you are right) the PHP timeout was much higher - I've set it back to 34s. Scripts needing lots of execution time should set it higher for themselves...


Issue 909 (Github: #691) remains, even with gzip/deflate support switched off.
The user-profile page got delivered truncated more often (>60% of the tries).
The main difference: the truncation point is still mostly at about 4800 bytes, but since the content is no longer compressed, the delivered HTML page is a lot shorter than before. And now the truncation seems to vary between three values; the later truncations are less frequent than the ~4800 byte truncation. With just "deflate" (and no "gzip") enabled the behaviour is just like with gzip. In all of the above scenarios a truncation always takes place after the corresponding PHP output command has finished. Output buffering inside PHP is set to 4096, which seemed somehow suspicious, but 8192 made no difference.

I still wasn't able to reproduce the problem with wget (cookies + keepalive enabled).

After globally deactivating "chunked encoding" (Issue 954 (Github: #707)) the error is gone - but pages now refresh less smoothly. Are there other drawbacks?
Is it possible to deactivate chunked encoding just for https - something like: server!ssl!chunked_encoding = 0
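
For context on why a cut-off chunked response is detectable at all: in HTTP/1.1 chunked transfer encoding each chunk is prefixed with its size in hex, and the body ends with a zero-size chunk. A body that stops without that final chunk is incomplete. The decoder below is a minimal sketch for inspecting captured response bodies; it ignores chunk extensions and trailers, and it is not a claim about what Cherokee does internally.

```python
def decode_chunked(raw: bytes):
    """Decode an HTTP/1.1 chunked body.

    Returns (payload, complete). `complete` is True only if the terminating
    zero-size chunk was seen.
    """
    payload = bytearray()
    pos = 0
    while True:
        eol = raw.find(b"\r\n", pos)
        if eol == -1:
            return bytes(payload), False      # size line missing or cut off
        size_field = raw[pos:eol].split(b";", 1)[0].strip()
        try:
            size = int(size_field, 16)
        except ValueError:
            return bytes(payload), False      # not a valid hex chunk size
        if size == 0:
            return bytes(payload), True       # last-chunk marker reached
        start = eol + 2
        chunk = raw[start:start + size]
        payload += chunk
        if len(chunk) < size:
            return bytes(payload), False      # chunk data cut off
        pos = start + size + 2                # skip the CRLF after the data
```

For example, `decode_chunked(b"4\r\nWiki\r\n5\r\npe")` returns `(b"Wikipe", False)`: the data received so far plus a flag saying the body never finished.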

@Borkason
Cherokee Project member

From ste...@konink.de on October 16, 2010 17:37:59

"Scripts needing lots of execution time should set it higher for themselves..."

NO! NO! NO! This is a TOTAL misconception. Your maximum timeout is limited by the network timeout (1), the FastCGI timeout (2) and the interpreter timeout (3). Do you really think Cherokee will wait for PHP until it's done?
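
Put differently, a request lives only as long as the shortest of the limits in the chain, so raising one of them inside a script does nothing once another limit is lower. A tiny sketch of that rule; the 35s and 34s figures are the ones reported in this thread, while the FastCGI value is purely hypothetical:

```python
def first_timeout(limits):
    """Return (name, seconds) of the limit that expires first."""
    name = min(limits, key=limits.get)
    return name, limits[name]


limits = {
    "network": 35,      # Cherokee network timeout reported in this thread
    "fastcgi": 60,      # hypothetical value, not stated anywhere in the thread
    "interpreter": 34,  # PHP timeout the reporter says they configured
}
print(first_timeout(limits))  # ('interpreter', 34)
```

If a script raised its own interpreter limit to, say, 120s, the result would become `('network', 35)`: the web server gives up first.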

I think Alvaro should be the one commenting on this chunked encoding stuff.

@Borkason
Cherokee Project member

From woll...@gmail.com on October 16, 2010 20:15:43
Hmm, it looks to me as if PHP threads that are listening/waiting/hanging on a secondary open TCP/IP socket can't be persuaded to do something else, even when Cherokee's network timeout is due. At least that's how I interpret the logs. They got freed at about the old timeout from fastcgi/php.ini, or perhaps due to a global TCP/IP timeout, but not after just 35 seconds.
This is not really related to the SSL bug, but you helped me a lot in understanding the underlying mechanics, which hopefully solved the secondary problem. Many thanks for the unexpectedly thorough support!
The commercial company that programmed the university's student database takes days to notice a request and tons of gold to fix something... ;¬þ

@Borkason
Cherokee Project member

From ste...@konink.de on October 16, 2010 20:37:14
You can also get commercial Cherokee support to really get your bug fixes prioritized; maybe something to consider if you are tight on time. http://www.octality.com/support.html

I think some things, like timeouts, are not very clear to most users; we should fix that and maybe make a screencast.

@Borkason
Cherokee Project member

From woll...@gmail.com on October 18, 2010 15:09:01
I've tested this SSL behaviour today on different computers at our institute. It seems to occur (mostly?) under Linux (but there with different distributions, kernel versions and browser types) and more seldom with high-speed internet connections (it's easier to trigger with my desktop PC at home or my laptop on WLAN). So I think the priority for a fix is a lot lower than I first thought.

@Borkason
Cherokee Project member

From ste...@konink.de on April 23, 2011 19:34:36
Please test latest SVN

@Borkason
Cherokee Project member

From hcarvalh...@gmail.com on June 24, 2011 23:27:49
Has this been fixed in any recent stable release that I can test?

@Borkason
Cherokee Project member

From ste...@konink.de on October 21, 2011 17:22:37
The most recent release should have addressed a lot of things, yes.

@Borkason Borkason closed this Mar 24, 2013