WebsiteAgent: Introduce per-node XPath evaluation in extraction. #412

Merged
merged 6 commits into cantino:master from knu:website_agent-per_node_xpath

4 participants

@knu
Collaborator

The new extraction sub-hash key `value` holds an XPath expression which
should be applied to each node to get the result value.

The previous directives `"text": true` and `"attr": "src"` are now
written as `"value": "text()"` and `"value": "@src"`, respectively.

With this enhancement, it is now possible for this agent to perform some
basic text processing on its own by making use of XPath functions like
normalize-space, substring-after and translate.
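
For instance, here is a rough sketch of what an extract hash could look like with these functions, written as a Ruby options hash in the style of the specs (the selectors and field names are invented for illustration):

  'extract' => {
    # Strip and squeeze whitespace around the headline text.
    'title' => { 'css' => 'h1.title', 'value' => 'normalize-space(.)' },
    # Keep only the part of the text after the "Price:" label.
    'price' => { 'css' => 'span.price', 'value' => 'substring-after(., "Price:")' },
    # Drop thousands separators from a formatted number such as "1,234".
    'count' => { 'css' => 'span.count', 'value' => 'translate(., ",", "")' }
  }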

@knu
Collaborator

@cantino This branch is based on my provide_access_to_agent branch just because there was a merge involved. This is just for my personal convenience, sorry. Please deal with this PR after you have dealt with #410.

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling f3535e2 on knu:website_agent-per_node_xpath into 67fcfef on cantino:master.

app/models/agents/event_formatting_agent.rb
@@ -104,11 +106,11 @@ def working?
def receive(incoming_events)
incoming_events.each do |event|
- payload = perform_matching(event.payload)
+ agent = Agent.find(event.agent_id)
@cantino Owner
cantino added a note

Can't this just be event.agent?

@knu Collaborator
knu added a note

Oh, yup. It was in the original code. :smiley:

@cantino Owner
cantino added a note

Interesting... I wonder why.

@knu Collaborator
knu added a note

This line will be gone when #410 is merged.

app/models/agents/website_agent.rb
@@ -168,14 +170,11 @@ def check_url(in_url)
return
end
result = nodes.map { |node|
- if extraction_details['attr']
- node.attr(extraction_details['attr'])
- elsif extraction_details['text']
- node.text()
- else
- error '"attr" or "text" is required on HTML or XML extraction patterns'
- return
+ value, = node.xpath(extraction_details['value'])
+ if value.is_a?(Float) && value.to_i == value
@cantino Owner
cantino added a note

Interesting; why are you casting 1.0 to 1?

@knu Collaborator
knu added a note

#xpath evaluates any numeric return value as a float, so you need to convert it to an integer if you want count(//li.item) to expand to, for example, "5" instead of "5.0". I'll add a comment.
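
A minimal standalone Nokogiri sketch of the behavior described above (the markup is made up for the example):

  require 'nokogiri'

  doc = Nokogiri::HTML('<ul id="items"><li>a</li><li>b</li><li>c</li></ul>')
  node = doc.at('#items')

  value = node.xpath('count(./li)')   # numeric XPath results come back as Float => 3.0
  value = value.to_i if value.is_a?(Float) && value.to_i == value
  value.to_s                          # => "3" instead of "3.0"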

app/models/agents/website_agent.rb
@@ -168,14 +170,11 @@ def check_url(in_url)
return
end
result = nodes.map { |node|
- if extraction_details['attr']
- node.attr(extraction_details['attr'])
- elsif extraction_details['text']
- node.text()
- else
- error '"attr" or "text" is required on HTML or XML extraction patterns'
- return
+ value, = node.xpath(extraction_details['value'])
@cantino Owner
cantino added a note

Are you using the , here to throw away extra values? If so, I'd suggest either of these for clarity:

node.xpath(extraction_details['value']).first

or

value, status = node.xpath(extraction_details['value'])

(assuming the other value is status)

@knu Collaborator
knu added a note

#xpath may return a single value (string, float, boolean, ...) or a node set. This is a shortcut for getting the single value itself, or the first node when a node set is returned.

@knu Collaborator
knu added a note

On second thought, if a node set is returned, the result should include all nodes, so text() expands to all the texts concatenated. I'll fix it.

@knu Collaborator
knu added a note

Done in 6d97513.
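
As a rough illustration of the fixed behavior (markup invented for the example; this relies on to_s on a Nokogiri node set concatenating its nodes, which is what the merged code does):

  require 'nokogiri'

  doc = Nokogiri::HTML('<p id="slogan">A webcomic of <em>romance</em> and math.</p>')
  node = doc.at('#slogan')

  node.xpath('.//text()').to_s      # => "A webcomic of romance and math." (all text nodes joined)
  node.xpath('normalize-space(.)')  # => a plain String; no node set is involved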

@coveralls

Coverage Status

Coverage increased (+0.09%) when pulling 6d97513 on knu:website_agent-per_node_xpath into 67fcfef on cantino:master.

@knu
Collaborator

Here's the diff from #410 for convenience in review:
knu@knu:provide_access_to_agent...website_agent-per_node_xpath

@cantino
Owner

To get the migration to run, I had to use send on extraction_type:

next if agent.send(:extraction_type) == 'json'

This migration will fail in the future if WebsiteAgent changes. Can we make it more robust, perhaps like:

  class Agent < ActiveRecord::Base
    include JSONSerializedField
    include LiquidInterpolatable
    json_serialize :options

    def extraction_type
      (interpolated['type'] || begin
        if interpolated['url'] =~ /\.(rss|xml)$/i
          "xml"
        elsif interpolated['url'] =~ /\.json$/i
          "json"
        else
          "html"
        end
      end).to_s
    end
  end

  def up
    Agent.where(type: 'Agents::WebsiteAgent').each do |agent|
      next if agent.extraction_type == 'json'

      agent.options_will_change!
      agent.options['extract'].each { |name, extraction|
        case
        when extraction.delete('text')
          extraction['value'] = './/text()'
        when attr = extraction.delete('attr')
          extraction['value'] = "@#{attr}"
        end
      }
      agent.save!
    end
  end

It's not perfect, but should be a bit more robust against changes to the codebase. What do you think?

app/models/agents/website_agent.rb
((14 lines not shown))
}
+ "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and "//text()" is to extract all the enclosed texts. You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-text(.//text())` should actually be `normalize-text(.)`.
@cantino Owner
cantino added a note

//text() or .//text()?

@knu Collaborator
knu added a note

It's ".//text()", thanks for pointing it out!

@cantino
Owner

I like the direction you're taking this; XPath is very powerful, and now people can do transformations inside the WebsiteAgent.

I'd like to make sure this transition is approachable for everyone. What do other Huginn users think? /cc @alias1 @dsander @CloCkWeRX @albertsun @snicker etc.

@knu
Collaborator

@cantino Indeed, the call to extraction_type was wrong, and your code will work. I just thought pulling in the current implementation of extraction_type, with interpolation, might be a little bit of overkill, so I made it look into options['extract'] instead. I hope this is enough.

@coveralls

Coverage Status

Coverage increased (+0.56%) when pulling 79fca69 on knu:website_agent-per_node_xpath into 67fcfef on cantino:master.

@cantino
Owner

Sounds good. Testing on my setup for a day or two.

@cantino
Owner

If you re-merge master, this diff should get simpler.

knu added some commits
@knu knu WebsiteAgent: Introduce per-node XPath evaluation in extraction.
The new extraction sub-hash key `value` holds an XPath expression which
should be applied to each node to get the result value.

The previous directives `"text": true` and `"attr": "src"` are now
written as `"value": "text()"` and `"value": "@src"`, respectively.

With this enhancement, it is now possible for this agent to perform some
basic text processing on its own by making use of XPath functions like
`normalize-space`, `substring-after` and `translate`.
4d623c5
@knu knu WebsiteAgent: Add a spec for XPath returning an integer value. a800342
@knu knu WebsiteAgent: Improve readability with regard to XPath evaluation. 76ecced
@knu knu `"text": true` should have meant ".//text()", not "text()". 7b6119f
@knu knu Correct an XPath in description. fa0723e
@knu knu Look into options['extract'] for checking the extraction type.
WebsiteAgent#extraction_type was private and uncallable, and may not
exist in the future.
70157dc
@knu
Collaborator

Thanks! I've rebased this branch on master.

@cantino cantino merged commit 47eee57 into cantino:master from knu:website_agent-per_node_xpath
@cantino
Owner

@knu, I've been running this for a few days and it seems to work well. The migration worked on my setup. Merging!

@cantino
Owner

Nice addition!

@knu knu deleted the branch
Commits on Jul 29, 2014
  1. @knu

    WebsiteAgent: Introduce per-node XPath evaluation in extraction.

    knu authored
    The new extraction sub-hash key `value` holds an XPath expression which
    should be applied to each node to get the result value.
    
    The previous directives `"text": true` and `"attr": "src"` are now
    written as `"value": "text()"` and `"value": "@src"`, respectively.
    
    With this enhancement, it is now possible for this agent to perform some
    basic text processing on its own by making use of XPath functions like
    `normalize-space`, `substring-after` and `translate`.
  2. @knu

    WebsiteAgent: Add a spec for XPath returning an integer value.

    knu authored
  3. @knu

    WebsiteAgent: Improve readability with regard to XPath evaluation.

    knu authored
  4. @knu

    `"text": true` should have meant ".//text()", not "text()".

    knu authored
  5. @knu

    Correct an XPath in description.

    knu authored
  6. @knu

    Look into options['extract'] for checking the extraction type.

    knu authored
    WebsiteAgent#extraction_type was private and uncallable, and may not
    exist in the future.
39 app/models/agents/website_agent.rb
@@ -23,14 +23,16 @@ class WebsiteAgent < Agent
To tell the Agent how to parse the content, specify `extract` as a hash with keys naming the extractions and values of hashes.
- When parsing HTML or XML, these sub-hashes specify how to extract with either a `css` CSS selector or a `xpath` XPath expression and either `"text": true` or `attr` pointing to an attribute name to grab. An example:
+ When parsing HTML or XML, these sub-hashes specify how each extraction should be done. The Agent first selects a node set from the document for each extraction key by evaluating either a CSS selector in `css` or an XPath expression in `xpath`. It then evaluates an XPath expression in `value` on each node in the node set, converting the result into a string. Here's an example:
"extract": {
- "url": { "css": "#comic img", "attr": "src" },
- "title": { "css": "#comic img", "attr": "title" },
- "body_text": { "css": "div.main", "text": true }
+ "url": { "css": "#comic img", "value": "@src" },
+ "title": { "css": "#comic img", "value": "@title" },
+ "body_text": { "css": "div.main", "value": ".//text()" }
}
+ "@_attr_" is the XPath expression to extract the value of an attribute named _attr_ from a node, and ".//text()" is to extract all the enclosed texts. You can also use [XPath functions](http://www.w3.org/TR/xpath/#section-String-Functions) like `normalize-space` to strip and squeeze whitespace, `substring-after` to extract part of a text, and `translate` to remove comma from a formatted number, etc. Note that these functions take a string, not a node set, so what you may think would be written as `normalize-text(.//text())` should actually be `normalize-text(.)`.
+
When parsing JSON, these sub-hashes specify [JSONPaths](http://goessner.net/articles/JsonPath/) to the values that you care about. For example:
"extract": {
@@ -70,9 +72,9 @@ def default_options
'type' => "html",
'mode' => "on_change",
'extract' => {
- 'url' => { 'css' => "#comic img", 'attr' => "src" },
- 'title' => { 'css' => "#comic img", 'attr' => "alt" },
- 'hovertext' => { 'css' => "#comic img", 'attr' => "title" }
+ 'url' => { 'css' => "#comic img", 'value' => "@src" },
+ 'title' => { 'css' => "#comic img", 'value' => "@alt" },
+ 'hovertext' => { 'css' => "#comic img", 'value' => "@title" }
}
}
end
@@ -152,20 +154,21 @@ def check_url(in_url)
error '"css" or "xpath" is required for HTML or XML extraction'
return
end
- unless Nokogiri::XML::NodeSet === nodes
+ case nodes
+ when Nokogiri::XML::NodeSet
+ result = nodes.map { |node|
+ case value = node.xpath(extraction_details['value'])
+ when Float
+ # Node#xpath() returns any numeric value as float;
+ # convert it to integer as appropriate.
+ value = value.to_i if value.to_i == value
+ end
+ value.to_s
+ }
+ else
error "The result of HTML/XML extraction was not a NodeSet"
return
end
- result = nodes.map { |node|
- if extraction_details['attr']
- node.attr(extraction_details['attr'])
- elsif extraction_details['text']
- node.text()
- else
- error '"attr" or "text" is required on HTML or XML extraction patterns'
- return
- end
- }
log "Extracting #{extraction_type} at #{xpath || css}: #{result}"
end
output[name] = result
30 db/migrate/20140723110551_adopt_xpath_in_website_agent.rb
@@ -0,0 +1,30 @@
+class AdoptXpathInWebsiteAgent < ActiveRecord::Migration
+ class Agent < ActiveRecord::Base
+ include JSONSerializedField
+ json_serialize :options
+ end
+
+ def up
+ Agent.where(type: 'Agents::WebsiteAgent').each do |agent|
+ extract = agent.options['extract']
+ next unless extract.is_a?(Hash) && extract.all? { |name, detail|
+ detail.key?('xpath') || detail.key?('css')
+ }
+
+ agent.options_will_change!
+ agent.options['extract'].each { |name, extraction|
+ case
+ when extraction.delete('text')
+ extraction['value'] = './/text()'
+ when attr = extraction.delete('attr')
+ extraction['value'] = "@#{attr}"
+ end
+ }
+ agent.save!
+ end
+ end
+
+ def down
+ raise ActiveRecord::IrreversibleMigration, "Cannot revert this migration"
+ end
+end
8 spec/fixtures/agents.yml
@@ -10,8 +10,8 @@ jane_website_agent:
:expected_update_period_in_days => 2,
:mode => :on_change,
:extract => {
- :title => {:css => "item title", :text => true},
- :url => {:css => "item link", :text => true}
+ :title => {:css => "item title", :value => './/text()'},
+ :url => {:css => "item link", :value => './/text()'}
}
}.to_json.inspect %>
@@ -27,8 +27,8 @@ bob_website_agent:
:expected_update_period_in_days => 2,
:mode => :on_change,
:extract => {
- :url => {:css => "#comic img", :attr => "src"},
- :title => {:css => "#comic img", :attr => "title"}
+ :url => {:css => "#comic img", :value => "@src"},
+ :title => {:css => "#comic img", :value => "@title"}
}
}.to_json.inspect %>
4 spec/models/agent_spec.rb
@@ -768,8 +768,8 @@ def interpolate(string, agent)
url: 'http://dilbert.com/',
mode: 'on_change',
extract: {
- url: { css: '[id^=strip_enlarged_] img', attr: 'src' },
- title: { css: '.STR_DateStrip', text: true },
+ url: { css: '[id^=strip_enlarged_] img', value: '@src' },
+ title: { css: '.STR_DateStrip', value: './/text()' },
},
},
schedule: 'every_12h',
55 spec/models/agents/website_agent_spec.rb
@@ -11,9 +11,9 @@
'url' => "http://xkcd.com",
'mode' => 'on_change',
'extract' => {
- 'url' => { 'css' => "#comic img", 'attr' => "src" },
- 'title' => { 'css' => "#comic img", 'attr' => "alt" },
- 'hovertext' => { 'css' => "#comic img", 'attr' => "title" }
+ 'url' => { 'css' => "#comic img", 'value' => "@src" },
+ 'title' => { 'css' => "#comic img", 'value' => "@alt" },
+ 'hovertext' => { 'css' => "#comic img", 'value' => "@title" }
}
}
@checker = Agents::WebsiteAgent.new(:name => "xkcd", :options => @valid_options, :keep_events_for => 2)
@@ -256,8 +256,7 @@
'url' => "http://xkcd.com",
'mode' => "on_change",
'extract' => {
- 'url' => {'css' => "#topLeft a", 'attr' => "href"},
- 'title' => {'css' => "#topLeft a", 'text' => "true"}
+ 'url' => {'css' => "#topLeft a", 'value' => "@href"},
}
}
rel = Agents::WebsiteAgent.new(:name => "xkcd", :options => rel_site)
@@ -268,6 +267,44 @@
event.payload['url'].should == "http://xkcd.com/about"
end
+ it "should return an integer value if XPath evaluates to one" do
+ rel_site = {
+ 'name' => "XKCD",
+ 'expected_update_period_in_days' => 2,
+ 'type' => "html",
+ 'url' => "http://xkcd.com",
+ 'mode' => "on_change",
+ 'extract' => {
+ 'num_links' => {'css' => "#comicLinks", 'value' => "count(./a)"}
+ }
+ }
+ rel = Agents::WebsiteAgent.new(:name => "xkcd", :options => rel_site)
+ rel.user = users(:bob)
+ rel.save!
+ rel.check
+ event = Event.last
+ event.payload['num_links'].should == "9"
+ end
+
+ it "should return all texts concatenated if XPath returns many text nodes" do
+ rel_site = {
+ 'name' => "XKCD",
+ 'expected_update_period_in_days' => 2,
+ 'type' => "html",
+ 'url' => "http://xkcd.com",
+ 'mode' => "on_change",
+ 'extract' => {
+ 'slogan' => {'css' => "#slogan", 'value' => ".//text()"}
+ }
+ }
+ rel = Agents::WebsiteAgent.new(:name => "xkcd", :options => rel_site)
+ rel.user = users(:bob)
+ rel.save!
+ rel.check
+ event = Event.last
+ event.payload['slogan'].should == "A webcomic of romance, sarcasm, math, and language."
+ end
+
describe "JSON" do
it "works with paths" do
json = {
@@ -389,9 +426,9 @@
'url' => "http://www.example.com",
'mode' => 'on_change',
'extract' => {
- 'url' => { 'css' => "#comic img", 'attr' => "src" },
- 'title' => { 'css' => "#comic img", 'attr' => "alt" },
- 'hovertext' => { 'css' => "#comic img", 'attr' => "title" }
+ 'url' => { 'css' => "#comic img", 'value' => "@src" },
+ 'title' => { 'css' => "#comic img", 'value' => "@alt" },
+ 'hovertext' => { 'css' => "#comic img", 'value' => "@title" }
},
'basic_auth' => "user:pass"
}
@@ -421,7 +458,7 @@
'mode' => 'on_change',
'headers' => { 'foo' => 'bar' },
'extract' => {
- 'url' => { 'css' => "#comic img", 'attr' => "src" },
+ 'url' => { 'css' => "#comic img", 'value' => "@src" },
}
}
@checker = Agents::WebsiteAgent.new(:name => "ua", :options => @valid_options)