Skip to content

Commit

Permalink
Implement charset handling in WebRequestConcern
Browse files Browse the repository at this point in the history
- The `force_encoding` option in WebsiteAgent is moved to
  WebRequestConcern so other users of the concern such as RssAgent can
  benefit from it.

- WebRequestConcern detects a charset specified in the Content-Type
  header to decode the content properly, and if it is missing the
  content is assumed to be encoded in UTF-8 unless it has a binary MIME
  type.  Not all Faraday adopters handle character encodings, and
  Faraday passes through what is returned from the backend, so we need
  to do this on our own. (cf. lostisland/faraday#139)

- WebRequestConcern now converts text contents to UTF-8, so agents can
  handle non-UTF-8 data without having to deal with encodings
  themselves.  Previously, WebsiteAgent in "json"/"text" modes and
  RssAgent would suffer from encoding errors when dealing with non-UTF-8
  contents.  WebsiteAgent in "html"/"xml" modes did not have this
  problem because Nokogiri would always return results in UTF-8
  independent of the input encoding.

This should fix #608.
  • Loading branch information
knu committed Aug 1, 2015
1 parent 77ec022 commit 2b5c7dd
Show file tree
Hide file tree
Showing 3 changed files with 57 additions and 1 deletion.
55 changes: 55 additions & 0 deletions app/concerns/web_request_concern.rb
Expand Up @@ -14,6 +14,40 @@ def self.decode(val)
end
end

class CharacterEncoding < Faraday::Middleware
def initialize(app, force_encoding: nil, default_encoding: nil)
super(app)
@force_encoding = force_encoding
@default_encoding = default_encoding
end

def call(env)
@app.call(env).on_complete do |env|
body = env[:body]

case
when @force_encoding
encoding = @force_encoding
when body.encoding == Encoding::ASCII_8BIT
# Not all Faraday adapters support automatic charset
# detection, so we do that.
case env[:response_headers][:content_type]
when /;(?:\s*,)?\s*charset\s*=\s*([^()<>@,;:\\\"\/\[\]?={}\s]+)/i
encoding = Encoding.find($1) rescue nil
when /\A\s*(?:text\/[^\s;]+|application\/[^\s;]+(?<=\b(?:xml|json))/i
encoding = @default_encoding
else
# Never try to transcode a binary content
return
end
end
body.encode!(Encoding::UTF_8, encoding) unless body.encoding == Encoding::UTF_8
end
end
end

Faraday::Response.register_middleware character_encoding: CharacterEncoding

extend ActiveSupport::Concern

def validate_web_request_options!
Expand All @@ -34,6 +68,23 @@ def validate_web_request_options!
rescue ArgumentError => e
errors.add(:base, e.message)
end

if (encoding = options['force_encoding']).present?
case encoding
when String
begin
Encoding.find(encoding)
rescue ArgumentError
errors.add(:base, "Unknown encoding: #{encoding.inspect}")
end
else
errors.add(:base, "force_encoding must be a string")
end
end
end

def default_encoding
Encoding::UTF_8
end

def faraday
Expand All @@ -44,6 +95,10 @@ def faraday
}

@faraday ||= Faraday.new(faraday_options) { |builder|
builder.response :character_encoding,
force_encoding: options['force_encoding'].presence,
default_encoding: default_encoding

builder.headers = headers if headers.length > 0

builder.headers[:user_agent] = user_agent
Expand Down
1 change: 1 addition & 0 deletions app/models/agents/rss_agent.rb
Expand Up @@ -29,6 +29,7 @@ class RssAgent < Agent
* `basic_auth` - Specify HTTP basic auth parameters: `"username:password"`, or `["username", "password"]`.
* `disable_ssl_verification` - Set to `true` to disable ssl verification.
* `disable_url_encoding` - Set to `true` to disable url encoding.
* `force_encoding` - Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
* `user_agent` - A custom User-Agent name (default: "Faraday v#{Faraday::VERSION}").
* `max_events_per_run` - Limit number of events created (items parsed) per run for feed.
Expand Down
2 changes: 1 addition & 1 deletion app/models/agents/website_agent.rb
Expand Up @@ -87,7 +87,7 @@ class WebsiteAgent < Agent
Set `uniqueness_look_back` to limit the number of events checked for uniqueness (typically for performance). This defaults to the larger of #{UNIQUENESS_LOOK_BACK} or #{UNIQUENESS_FACTOR}x the number of detected received results.
Set `force_encoding` to an encoding name if the website does not return a Content-Type header with a proper charset.
Set `force_encoding` to an encoding name if the website is known to respond with a missing, invalid or wrong charset in the Content-Type header. Note that a text content without a charset is taken as encoded in UTF-8 (not ISO-8859-1).
Set `user_agent` to a custom User-Agent name if the website does not like the default value (`#{default_user_agent}`).
Expand Down

0 comments on commit 2b5c7dd

Please sign in to comment.