Improve encoding detection in WebsiteAgent
Previously, WebsiteAgent always assumed that content with no charset
specified in the Content-Type header should be encoded in UTF-8.  This
enhancement makes use of the encoding detector implemented in Nokogiri
for HTML/XML documents, instead of blindly falling back to UTF-8.

When the document `type` is `html` or `xml`, WebsiteAgent tries to
detect the encoding of a fetched document from the presence of a BOM,
an XML declaration, or an HTML `meta` tag.

This fixes #1742.
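
For illustration, a minimal sketch of the kind of detection Nokogiri performs when given no explicit encoding (the example document is invented, not part of this commit):

    require 'nokogiri'

    # The charset declared in the HTML meta tag is picked up by
    # Nokogiri's detector when no encoding is passed in.
    html = "<html><head><meta charset='Shift_JIS'></head><body>...</body></html>"
    doc = Nokogiri::HTML(html)
    doc.encoding  # => "Shift_JIS"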
knu committed Oct 25, 2016
1 parent 950975d commit f157ad2
Showing 3 changed files with 296 additions and 93 deletions.
5 changes: 5 additions & 0 deletions app/concerns/web_request_concern.rb
@@ -46,6 +46,8 @@ def call(env)
             # Never try to transcode a binary content
             next
           end
+          # Return body as binary if default_encoding is nil
+          next if encoding.nil?
         end
         body.encode!(Encoding::UTF_8, encoding)
       end
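
For context, a minimal sketch (not Huginn code) of what the transcode-or-skip step above amounts to: a known source encoding is converted to UTF-8 in place, while a nil encoding now leaves the bytes untouched.

    # Illustrative only: body arrives as ISO-8859-1 bytes.
    body = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
    encoding = Encoding::ISO_8859_1  # nil here would mean "return as binary"
    body.encode!(Encoding::UTF_8, encoding) unless encoding.nil?
    body  # => "café", now valid UTF-8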
@@ -89,6 +91,9 @@ def validate_web_request_options!
       end
     end

+    # The default encoding for text content with no `charset`
+    # specified in the Content-Type header. Override this and make it
+    # return nil if you want to detect the encoding on your own.
     def default_encoding
       Encoding::UTF_8
     end
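
A hypothetical override following that comment (the class is invented for illustration): returning nil opts out of the UTF-8 fallback, so the undecoded bytes reach whatever parses them next.

    # Hypothetical sketch, not part of this commit:
    class MyDetectingAgent < Agent
      include WebRequestConcern

      def default_encoding
        nil  # no charset in Content-Type => keep the body as binary
      end
    end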
17 changes: 16 additions & 1 deletion app/models/agents/website_agent.rb
@@ -94,7 +94,12 @@ class WebsiteAgent < Agent
       Set `uniqueness_look_back` to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of #{UNIQUENESS_LOOK_BACK} or #{UNIQUENESS_FACTOR}x the number of detected received results.

-      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
+      Set `force_encoding` to an encoding name (such as `UTF-8` or `ISO-8859-1`) if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Below are the steps used to detect the encoding of fetched content:
+
+      1. If `force_encoding` is given, use the value.
+      2. If the Content-Type header contains a charset parameter, use the value.
+      3. When `type` is `html` or `xml`, check for the presence of a BOM, an XML declaration with an "encoding" attribute, or an HTML `meta` tag with charset information.
+      4. Fall back to UTF-8 (not ISO-8859-1).

       Set `user_agent` to a custom User-Agent name if the website does not like the default value (`#{default_user_agent}`).
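
As a usage sketch, a hypothetical set of WebsiteAgent options (the URL and extract rule are made up) showing step 1 above, which short-circuits the rest of the chain:

    {
      'url' => 'http://example.com/legacy.html',
      'type' => 'html',
      # Wins over the Content-Type charset, document declarations,
      # and the UTF-8 fallback:
      'force_encoding' => 'ISO-8859-1',
      'extract' => {
        'title' => { 'css' => 'title', 'value' => 'string(.)' }
      }
    }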
@@ -307,6 +312,16 @@ def check_url(url, existing_payload = {})
       error "Error when fetching url: #{e.message}\n#{e.backtrace.join("\n")}"
     end

+    def default_encoding
+      case extraction_type
+      when 'html', 'xml'
+        # Let Nokogiri detect the encoding
+        nil
+      else
+        super
+      end
+    end
+
     def handle_data(body, url, existing_payload)
       doc = parse(body)
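A companion sketch of the "Let Nokogiri detect the encoding" branch (again with invented input): given no explicit encoding, Nokogiri honors the XML declaration's encoding attribute.

    require 'nokogiri'

    xml = "<?xml version='1.0' encoding='EUC-JP'?><root>...</root>"
    doc = Nokogiri::XML(xml)
    doc.encoding  # => "EUC-JP"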
