Improve encoding detection in WebsiteAgent
Previously, WebsiteAgent always assumed that content with no charset
specified in the Content-Type header should be encoded in UTF-8.  This
enhancement makes use of the encoding detector implemented in Nokogiri
for HTML/XML documents, instead of blindly falling back to UTF-8.

When the document `type` is `html` or `xml`, WebsiteAgent tries to
detect the encoding of a fetched document from the presence of a BOM,
an XML declaration, or an HTML `meta` tag.

This fixes #1742.
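
For illustration, a minimal sketch of the kind of detection Nokogiri performs when given no explicit encoding (the example document is invented, not part of this commit):

    require 'nokogiri'

    # The charset declared in the HTML meta tag is picked up by
    # Nokogiri's detector when no encoding is passed in.
    html = "<html><head><meta charset='Shift_JIS'></head><body>...</body></html>"
    doc = Nokogiri::HTML(html)
    doc.encoding  # => "Shift_JIS"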
knu committed Oct 25, 2016
1 parent 950975d commit f157ad2
Showing 3 changed files with 296 additions and 93 deletions.
5 changes: 5 additions & 0 deletions app/concerns/web_request_concern.rb
@@ -46,6 +46,8 @@ def call(env)
             # Never try to transcode a binary content
             next
           end
+          # Return body as binary if default_encoding is nil
+          next if encoding.nil?
         end
         body.encode!(Encoding::UTF_8, encoding)
       end
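
For context, a minimal sketch (not Huginn code) of what the transcode-or-skip step above amounts to: a known source encoding is converted to UTF-8 in place, while a nil encoding now leaves the bytes untouched.

    # Illustrative only: body arrives as ISO-8859-1 bytes.
    body = "caf\xE9".dup.force_encoding(Encoding::ISO_8859_1)
    encoding = Encoding::ISO_8859_1  # nil here would mean "return as binary"
    body.encode!(Encoding::UTF_8, encoding) unless encoding.nil?
    body  # => "café", now valid UTF-8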
@@ -89,6 +91,9 @@ def validate_web_request_options!
       end
     end

+    # The default encoding for text content with no `charset`
+    # specified in the Content-Type header. Override this and make it
+    # return nil if you want to detect the encoding on your own.
     def default_encoding
       Encoding::UTF_8
     end
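
A hypothetical override following that comment (the class is invented for illustration): returning nil opts out of the UTF-8 fallback, so the undecoded bytes reach whatever parses them next.

    # Hypothetical sketch, not part of this commit:
    class MyDetectingAgent < Agent
      include WebRequestConcern

      def default_encoding
        nil  # no charset in Content-Type => keep the body as binary
      end
    end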
17 changes: 16 additions & 1 deletion app/models/agents/website_agent.rb
@@ -94,7 +94,12 @@ class WebsiteAgent < Agent
       Set `uniqueness_look_back` to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of #{UNIQUENESS_LOOK_BACK} or #{UNIQUENESS_FACTOR}x the number of detected received results.

-      Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
+      Set `force_encoding` to an encoding name (such as `UTF-8` or `ISO-8859-1`) if the website is known to respond with a missing, invalid, or wrong charset in the Content-Type header. Below are the steps used to detect the encoding of fetched content:
+
+      1. If `force_encoding` is given, use the value.
+      2. If the Content-Type header contains a charset parameter, use the value.
+      3. When `type` is `html` or `xml`, check for the presence of a BOM, an XML declaration with an "encoding" attribute, or an HTML `meta` tag with charset information.
+      4. Fall back to UTF-8 (not ISO-8859-1).

       Set `user_agent` to a custom User-Agent name if the website does not like the default value (`#{default_user_agent}`).
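
As a usage sketch, a hypothetical set of WebsiteAgent options (the URL and extract rule are made up) showing step 1 above, which short-circuits the rest of the chain:

    {
      'url' => 'http://example.com/legacy.html',
      'type' => 'html',
      # Wins over the Content-Type charset, document declarations,
      # and the UTF-8 fallback:
      'force_encoding' => 'ISO-8859-1',
      'extract' => {
        'title' => { 'css' => 'title', 'value' => 'string(.)' }
      }
    }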
@@ -307,6 +312,16 @@ def check_url(url, existing_payload = {})
       error "Error when fetching url: #{e.message}\n#{e.backtrace.join("\n")}"
     end

+    def default_encoding
+      case extraction_type
+      when 'html', 'xml'
+        # Let Nokogiri detect the encoding
+        nil
+      else
+        super
+      end
+    end
+
     def handle_data(body, url, existing_payload)
       doc = parse(body)
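A companion sketch of the "Let Nokogiri detect the encoding" branch (again with invented input): given no explicit encoding, Nokogiri honors the XML declaration's encoding attribute.

    require 'nokogiri'

    xml = "<?xml version='1.0' encoding='EUC-JP'?><root>...</root>"
    doc = Nokogiri::XML(xml)
    doc.encoding  # => "EUC-JP"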
