Nikkou

Extract useful data from HTML and XML with ease!

Description

Nikkou adds additional methods to Nokogiri to make extracting commonly-used data from HTML and XML easier. It lets you transform HTML into structured data very quickly, and it integrates nicely with Mechanize.

Installation

Add Nikkou to your Gemfile:

gem 'nikkou'

Method Overview

Here's a summary of the methods Nikkou provides (see "Methods" for details):

Formatting

parse_text - Parses the node's text as XML and returns it as a Nokogiri::XML::NodeSet

time(options={}) - Intelligently parses the time (relative or absolute) of either the text or a specified attribute; accepts a time_zone option

url(attribute='href') - Converts the href (or other specified attribute) into an absolute URL using the document's URI; <a href="/p/1">Link</a> yields http://mysite.com/p/1

Searching

attr_equals(attribute, string) - Finds nodes where the attribute equals the string

attr_includes(attribute, string) - Finds nodes where the attribute includes the string

attr_matches(attribute, pattern) - Finds nodes where the attribute matches the pattern

*drill(methods) - Nil-safe method chaining

find(path) - Same as search but returns the first matched node

text_equals(string) - Finds nodes where the text equals the string

text_includes(string) - Finds nodes where the text includes the string

text_matches(pattern) - Finds nodes where the text matches the pattern

Methods

Formatting

time(options={})

Returns a Time object (in UTC) by automatically parsing the text or specified attribute of the node.

# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time

Options

attribute

The attribute to parse:

# <a href="/p/1" data-published-at="2013-05-22 02:42:34">My link</a>
doc.search('a').first.time(attribute: 'data-published-at')

time_zone

The document's time zone (the time will be converted from that to UTC):

# <a href="/p/1">3 hours ago</a>
doc.search('a').first.time(time_zone: 'America/New_York')

url(attribute='href')

Returns an absolute URL; useful for parsing relative hrefs. The document's uri needs to be set for Nikkou to know what domain to add to relative paths.

# <a href="/p/1">My link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url # "http://mysite.com/p/1"

If Mechanize is being used, the uri doesn't need to be manually set.

Options

attribute

The attribute to parse:

# <a href="/p/1" data-comments-url="/p/1#comments">My Link</a>
doc.uri = 'http://mysite.com/mypage'
doc.search('a').first.url('data-comments-url') # "http://mysite.com/p/1#comments"

Searching

attr_equals(attribute, string)

Selects nodes where the specified attribute equals the string.

# <div data-type="news">My Text</div>
doc.attr_equals('data-type', 'news').first.text # "My Text"

attr_includes(attribute, string)

Selects nodes where the specified attribute includes the string.

# <div data-type="major-news">My Text</div>
doc.attr_includes('data-type', 'news').first.text # "My Text"

attr_matches(attribute, pattern)

Selects nodes with an attribute matching a pattern. The pattern's matches are available in Node#matches.

# <span data-tooltip="3 Comments">My Text</span>
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.text # "My Text"
doc.attr_matches('data-tooltip', /(\d+) comments/i).first.matches # ["3 Comments", "3"]

drill(*methods)

Nil-safe method chaining. Replaces this:

node = doc.find('.count')
if node
  attribute = node.attr('data-count')
  if attribute
    return attribute.to_i
  end
end

With this:

return doc.drill([:find, '.count'], [:attr, 'data-count'], :to_i)

find(path)

Same as search, but returns the first matched node. Replaces this:

nodes = node.search('h4')
if nodes
  return nodes.first
end

With this:

return node.find('h4')

text_includes(string)

Selects nodes where the text includes the string.

# <div data-type="news">My Text</div>
doc.text_includes('Text').first.text # "My Text"

text_matches(pattern)

Selects nodes with text matching a pattern. The pattern's matches are available in Node#matches.

# <a href="/p/1">3 Comments</a>
doc.text_matches(/^(\d+) comments$/i).first.attr('href') # "/p/1"
doc.text_matches(/^(\d+) comments$/i).first.matches # ["3 Comments", "3"]

License

Nikkou is released under the MIT License. Please see the MIT-LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
lib		lib
spec		spec
.gitignore		.gitignore
.travis.yml		.travis.yml
Gemfile		Gemfile
MIT-LICENSE		MIT-LICENSE
README.md		README.md
Rakefile		Rakefile
nikkou.gemspec		nikkou.gemspec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Nikkou

Description

Installation

Method Overview

Formatting

Searching

Methods

Formatting

time(options={})

Options

url(attribute='href')

Options

Searching

attr_equals(attribute, string)

attr_includes(attribute, string)

attr_matches(attribute, pattern)

drill(*methods)

find(path)

text_includes(string)

text_matches(pattern)

License

About

Releases

Packages

Languages

License

asiniy/nikkou

Folders and files

Latest commit

History

Repository files navigation

Nikkou

Description

Installation

Method Overview

Formatting

Searching

Methods

Formatting

time(options={})

Options

url(attribute='href')

Options

Searching

attr_equals(attribute, string)

attr_includes(attribute, string)

attr_matches(attribute, pattern)

drill(*methods)

find(path)

text_includes(string)

text_matches(pattern)

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages