New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

abort() when trying to revalidate expired output resource #1553

Closed
cellscape opened this Issue Jun 5, 2017 · 16 comments

Comments

Projects
None yet
4 participants
@cellscape
Copy link

cellscape commented Jun 5, 2017

I am using custom integration of PSOL and recently all instances started to crash with the same message:

[0605/094417:FATAL:output_resource.cc(223)] Output resources shouldn't be loaded via LoadAsync, but rather through FetchResource

Backtrace:

[Switching to Thread 0x7ffea2ffd700 (LWP 3018)]      
base::debug::(anonymous namespace)::DebugBreak () at third_party/chromium/src/base/debug/debugger_posix.cc:246                                                                                                       
246     }                                            
(gdb) bt                                             
#0  base::debug::(anonymous namespace)::DebugBreak () at third_party/chromium/src/base/debug/debugger_posix.cc:246                                                                                                   
#1  0x0000000000b4a83d in base::debug::BreakDebugger () at third_party/chromium/src/base/debug/debugger_posix.cc:259                                                                                                 
#2  0x000000000071cf7e in logging::LogMessage::~LogMessage (this=0x7ffea2ffbe20, __in_chrg=<optimized out>) at third_party/chromium/src/base/logging.cc:657                                                          
#3  0x00000000007a90c7 in net_instaweb::OutputResource::LoadAndCallback (this=0x7ffe94015868, not_cacheable_policy=net_instaweb::Resource::kReportFailureIfNotCacheable, request_context=...,                        
    callback=0x7ffe94018498) at net/instaweb/rewriter/output_resource.cc:223                              
#4  0x00000000007add85 in net_instaweb::Resource::LoadAsync (this=0x7ffe94015868, not_cacheable_policy=net_instaweb::Resource::kReportFailureIfNotCacheable, request_context=..., callback=0x7ffe94018498)           
    at net/instaweb/rewriter/resource.cc:166         
#5  0x0000000000bfd692 in net_instaweb::RewriteContext::OutputCacheRevalidate (this=0x7ffe9c038f68, to_revalidate=std::vector of length 1, capacity 1 = {...}) at net/instaweb/rewriter/rewrite_context.cc:1934      
#6  0x0000000000bfbe96 in net_instaweb::RewriteContext::OutputCacheDone (this=0x7ffe9c038f68, cache_result=0x7ffe94016da8) at net/instaweb/rewriter/rewrite_context.cc:1674                                          
#7  0x0000000000c10a9d in net_instaweb::MemberFunction1<net_instaweb::RewriteContext, net_instaweb::RewriteContext::CacheLookupResult*>::Run (this=0x7ffe94016f38) at ./pagespeed/kernel/base/function.h:171         
#8  0x000000000089428f in net_instaweb::Function::CallRun (this=0x7ffe94016f38) at pagespeed/kernel/base/function.cc:51                                                                                              
#9  0x00000000006dde28 in net_instaweb::QueuedWorkerPool::Run (this=0x169b538, sequence=0x17884b8, worker=0x7ffe9c038528) at pagespeed/kernel/thread/queued_worker_pool.cc:157                                       
#10 0x00000000006e5388 in net_instaweb::MemberFunction2<net_instaweb::QueuedWorkerPool, net_instaweb::QueuedWorkerPool::Sequence*, net_instaweb::QueuedWorker*>::Run (this=0x17afe08)                                
    at ./pagespeed/kernel/base/function.h:202        
#11 0x000000000089428f in net_instaweb::Function::CallRun (this=0x17afe08) at pagespeed/kernel/base/function.cc:51                                                                                                   
#12 0x00000000006e9850 in net_instaweb::Worker::WorkThread::Run (this=0x7ffe9c03a2f8) at pagespeed/kernel/thread/worker.cc:84                                                                                        
#13 0x00000000006ecbab in net_instaweb::PthreadThreadImpl::InvokeRun (self_ptr=0x7ffe9c03a3c8) at pagespeed/kernel/thread/pthread_thread_system.cc:122                                                               
#14 0x00007ffff735c374 in start_thread (arg=0x7ffea2ffd700) at pthread_create.c:333                       
#15 0x00007ffff49d760f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105   

I've found metadata key it was using:

#5  0x0000000000bfd692 in net_instaweb::RewriteContext::OutputCacheRevalidate (this=0x7ffe9c038f68, to_revalidate=std::vector of length 1, capacity 1 = {...}) at net/instaweb/rewriter/rewrite_context.cc:1934
1934            new ResourceRevalidateCallback(this, resource, input_info));
(gdb) p partition_key_ 
$1 = "rname/ce_JfZ4rtPZUfry0kIJ6v3S/http://XXX/images/xpage1_img1.jpg.pagespeed.ic.965L1HZqE1.jpg@@_"

I've decoded InputInfo from that cache key and it was recently expired after being year in cache.

I'm using latest stable version of psol (1.11.33.4) but believe that last dev version still has the issue. It's also custom integration but I believe it doesn't matter. Well perhaps only except of long-living cache which helped to trigger the issue.

@oschaaf

This comment has been minimized.

Copy link
Member

oschaaf commented Jun 5, 2017

So far no luck hitting this code path with vanilla mod_pagespeed.

It looks like somehow ResourcePtr resource = slots_[input_info->index()]->resource(); returns an OutputResource, which I think should not happen: I think that should only resturn InputResources.
So I wonder: how did the OutputResource end up there?

@cellscape

This comment has been minimized.

Copy link

cellscape commented Jun 5, 2017

I've tried to modify pagespeed to force return expired for every metadata cache lookup:

diff --git a/net/instaweb/rewriter/rewrite_context.cc b/net/instaweb/rewriter/rewrite_context.cc
index 484a14499..d90f9aa72 100644
--- a/net/instaweb/rewriter/rewrite_context.cc
+++ b/net/instaweb/rewriter/rewrite_context.cc
@@ -396,15 +396,15 @@ class RewriteContext::OutputCacheCallback : public CacheInterface::Callback {
         if (!input_info.has_expiration_time_ms()) {
           return false;
         }
-        int64 ttl_ms = input_info.expiration_time_ms() - now_ms;
-        if (ttl_ms > 0) {
-          return true;
-        } else if (
-            !rewrite_context_->has_parent() &&
-            ttl_ms + options->metadata_cache_staleness_threshold_ms() > 0) {
-          *stale_rewrite = true;
-          return true;
-        }
+        // int64 ttl_ms = input_info.expiration_time_ms() - now_ms;
+        // if (ttl_ms > 0) {
+        //   return true;
+        // } else if (
+        //     !rewrite_context_->has_parent() &&
+        //     ttl_ms + options->metadata_cache_staleness_threshold_ms() > 0) {
+        //   *stale_rewrite = true;
+        //   return true;
+        // }

This way both my server and pagespeed_automatic_test abort() with the same backtrace. Change itself is silly one but seems to do the job at least.

% ./pagespeed_automatic_test --gtest_filter=CacheHtmlFlowTest.TestCacheHtmlFlushSubresources                                                                                   [src/tags/1.11.33.4]
Running main() from gtest_main.cc
Note: Google Test filter = CacheHtmlFlowTest.TestCacheHtmlFlushSubresources
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CacheHtmlFlowTest
[ RUN      ] CacheHtmlFlowTest.TestCacheHtmlFlushSubresources
[0605/174716:FATAL:output_resource.cc(223)] Output resources shouldn't be loaded via LoadAsync, but rather through FetchResource
./pagespeed_automatic_test() [0x11c1ecc]
./pagespeed_automatic_test() [0x11c32c1]
./pagespeed_automatic_test() [0x1313f8a]
./pagespeed_automatic_test() [0x13169f0]
./pagespeed_automatic_test() [0x100718d]
./pagespeed_automatic_test() [0x1005a5d]
./pagespeed_automatic_test() [0x101b9a3]
./pagespeed_automatic_test() [0x136deb6]
./pagespeed_automatic_test() [0x13bcafa]
./pagespeed_automatic_test() [0x13c2df8]
./pagespeed_automatic_test() [0x136deb6]
./pagespeed_automatic_test() [0x13c9462]
./pagespeed_automatic_test() [0xefddae]
/lib64/libpthread.so.0(+0x7374) [0x7f583083e374]
/lib64/libc.so.6(clone+0x5f) [0x7f583058960f]

zsh: abort (core dumped)  ./pagespeed_automatic_test 

This happens with both vanilla and patched pagespeed for me. I do have some modifications for PSOL and obviously there is integration step but changes do not seem to be related. I didn't touch rewriting at all, only cache/system stuff.

Can you reproduce above steps to confirm the issue on your side?

What happens in that test case I think is that cache extend filter requires already rewritten resource as input. It seems that OutputResource is being assigned to slot in RewriteContext::Propagate. After that RewriteContext::RunSuccessors starts new rewrite with OutputResource in cache. In test case above pagespeed first runs image compression filter and then cache extends it.

RewriteContext::FetchInputs has comment saying about chained rewrites and requiring pagespeed resources as inputs. FetchInputs is called when metadata cache is missing at all for some input and it checks if URL of input is pagespeed URL, avoiding LoadAsync for such inputs and instead using FetchResource.

It seems that RewriteContext::OutputCacheRevalidate should do similar: if input is already rewritten then use FetchResource, otherwise fallback to LoadAsync.

@cellscape

This comment has been minimized.

Copy link

cellscape commented Jun 5, 2017

This triggers the issue with vanilla pagespeed without modification above.

TEST_F(CacheHtmlFlowTest, TestCacheHtmlRevalidateInputs) {
  GoogleString text;
  RequestHeaders request_headers;
  ResponseHeaders response_headers;
  FetchFromProxy("http://test.com/flush_subresources.html"
                 "?PageSpeedFilters=+extend_cache_css,-inline_css", true,
                 request_headers, &text, &response_headers, NULL, false);
  SetTimeMs(MockTimer::kApr_5_2010_ms + 2 * Timer::kYearMs);
  FetchFromProxy("http://test.com/flush_subresources.html"
                 "?PageSpeedFilters=+extend_cache_css,-inline_css", true,
                 request_headers, &text, &response_headers, NULL, false);
}
% ./pagespeed_automatic_test --gtest_filter=CacheHtmlFlowTest.TestCacheHtmlRevalidateInputs                                                                                    [src/tags/1.11.33.4]
Running main() from gtest_main.cc
Note: Google Test filter = CacheHtmlFlowTest.TestCacheHtmlRevalidateInputs
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CacheHtmlFlowTest
[ RUN      ] CacheHtmlFlowTest.TestCacheHtmlRevalidateInputs
[0605/191504:FATAL:output_resource.cc(223)] Output resources shouldn't be loaded via LoadAsync, but rather through FetchResource
./pagespeed_automatic_test() [0x11c22da]
./pagespeed_automatic_test() [0x11c36cf]
./pagespeed_automatic_test() [0x1314398]
./pagespeed_automatic_test() [0x1316dfe]
./pagespeed_automatic_test() [0x10074e7]
./pagespeed_automatic_test() [0x1005db7]
./pagespeed_automatic_test() [0x101bdb1]
./pagespeed_automatic_test() [0x136e2c4]
./pagespeed_automatic_test() [0x13bcefa]
./pagespeed_automatic_test() [0x13c31f8]
./pagespeed_automatic_test() [0x136e2c4]
./pagespeed_automatic_test() [0x13c9862]
./pagespeed_automatic_test() [0xefe108]
/lib64/libpthread.so.0(+0x7374) [0x7fb43e66d374]
/lib64/libc.so.6(clone+0x5f) [0x7fb43e3b860f]

zsh: abort (core dumped)  ./pagespeed_automatic_test 
@jmarantz

This comment has been minimized.

Copy link
Contributor

jmarantz commented Jun 5, 2017

@cellscape

This comment has been minimized.

Copy link

cellscape commented Jun 5, 2017

CacheHtmlFlow was just the first place where I observed the crash in tests with modification above. I just modified it to be able to reproduce quickly. It's in no way specific and I believe the issue is general one.

Basically this test does rewrite (with image compression to cache extender filter chain), expires stored metadata for InputInfo and then aborts when trying to revalidate output of image compression. I might be wrong though, don't fully understand rewriting process.

As for integration, I already described it some time ago on mailing list: https://groups.google.com/forum/#!msg/mod-pagespeed-discuss/_LJJ0UmQ8SU/sw5cvqedjmMJ. There probably were some changes but overall architecture is the same. As I said I don't think integration is to blame in that case though, we don't touch any rewriting or filters.

@oschaaf

This comment has been minimized.

Copy link
Member

oschaaf commented Jun 5, 2017

This reproduces on master:

oschaaf@ubuntu:~/code/google/mps-abo/mod_pagespeed⟫ git diff
diff --git a/pagespeed/automatic/proxy_interface_test.cc b/pagespeed/automatic/proxy_interface_test.cc
index bc900b84a..89f7d32f8 100644
--- a/pagespeed/automatic/proxy_interface_test.cc
+++ b/pagespeed/automatic/proxy_interface_test.cc
@@ -381,6 +381,37 @@ TEST_F(ProxyInterfaceTest, LoggingInfo) {
   EXPECT_TRUE(logging_info()->is_request_disabled());
 }
 
+TEST_F(ProxyInterfaceTest, TestAbort) {
+  GoogleString text;
+  RequestHeaders request_headers;
+  ResponseHeaders response_headers;
+  response_headers.Add(HttpAttributes::kContentType, kContentTypeHtml.mime_type());
+  response_headers.SetStatusAndReason(HttpStatus::kOK);
+  GoogleString url = "http://test.com/flush_subresources.html"
+    "?PageSpeedFilters=+extend_cache_css,-inline_css";
+  mock_url_fetcher_.SetResponse("http://test.com/flush_subresources.html", response_headers,
+                                "<html><head><link rel='stylesheet' type='text/css' href='test.css'></head></html>");
+
+  response_headers.Replace(HttpAttributes::kContentType, kContentTypeCss.mime_type());
+  mock_url_fetcher_.SetResponse("http://test.com/test.css", response_headers,
+                                kCssContent);
+
+  FetchFromProxy(url,
+                 request_headers,
+                 true,  /* expect_success */
+                 &text,
+                 &response_headers,
+                 false  /* proxy_fetch_property_callback_collector_created */);
+  SetTimeMs(MockTimer::kApr_5_2010_ms + 2 * Timer::kYearMs);
+
+  FetchFromProxy(url,
+                 request_headers,
+                 true,  /* expect_success */
+                 &text,
+                 &response_headers,
+                 false  /* proxy_fetch_property_callback_collector_created */);
+}
+
 TEST_F(ProxyInterfaceTest, SkipPropertyCacheLookupIfOptionsNotEnabled) {
   GoogleString url = "http://www.example.com/";
   GoogleString text;
@morlovich

This comment has been minimized.

Copy link
Contributor

morlovich commented Jun 5, 2017

@morlovich

This comment has been minimized.

Copy link
Contributor

morlovich commented Jun 5, 2017

@oschaaf

This comment has been minimized.

Copy link
Member

oschaaf commented Jun 5, 2017

I want to note that in release mode there is no abort, and with the simple test that excites this the results seem to look good

@morlovich

This comment has been minimized.

Copy link
Contributor

morlovich commented Jun 5, 2017

@cellscape

This comment has been minimized.

Copy link

cellscape commented Jun 5, 2017

Looks like it's matter of having NDEBUG defined? This explains why nobody triggered that with apache or nginx then.

@jmarantz

This comment has been minimized.

Copy link
Contributor

jmarantz commented Jun 5, 2017

@oschaaf

This comment has been minimized.

Copy link
Member

oschaaf commented Jun 5, 2017

It sounds like in the short term what we could do is to remove the VLOG(DFATAL) with a warning for those running debug builds, and add a TODO for modifying OutputResource to indicate it is always valid as @morlovich suggested above? It would also be nice to include the new unit test.

@cellscape Would you like to create a PR for this? :)

@cellscape

This comment has been minimized.

Copy link

cellscape commented Jun 5, 2017

Is CLA still required? If yes then I pass otherwise will prepare one tomorrow.

@oschaaf

This comment has been minimized.

Copy link
Member

oschaaf commented Jun 5, 2017

@cellscape a CLA is required.. so I'll create the PR then

oschaaf added a commit that referenced this issue Jun 6, 2017

Debug builds may abort() when revalidating expired output resources
As a workaround, remove the LOG(DFATAL) and replace it with a TODO.

#1553

oschaaf added a commit that referenced this issue Jun 6, 2017

Debug builds may abort() when revalidating expired output resources
As a workaround, remove the LOG(DFATAL) and replace it with a TODO.

#1553

oschaaf added a commit that referenced this issue Jun 7, 2017

Debug builds may abort() when revalidating expired output resources (#…
…1554)

* Debug builds may abort() when revalidating expired output resources

As a workaround, remove the LOG(DFATAL) and replace it with a TODO.

#1553
@oschaaf

This comment has been minimized.

Copy link
Member

oschaaf commented Jul 7, 2017

Closing, as the fix for this was merged

@oschaaf oschaaf closed this Jul 7, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment