Commit

merge peterc and fix tests
bborn committed Oct 12, 2012
2 parents da2c4c5 + 77fe57e commit 88527c2
Showing 5 changed files with 657 additions and 129 deletions.
37 changes: 33 additions & 4 deletions lib/pismo/internal_attributes.rb
@@ -3,7 +3,7 @@ module Pismo
module InternalAttributes
@@phrasie = Phrasie::Extractor.new

MONTHS_REGEX = %r{(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)}i
MONTHS_REGEX = %r{(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)\.?}i
DATETIME_REGEXEN = [
/#{MONTHS_REGEX}\b\s+\d+\D{1,10}\d{4}/i,
/(on\s+)?\d+\s+#{MONTHS_REGEX}\s+\D{0,10}\d+/i,
@@ -168,13 +168,12 @@ def datetime
DATETIME_REGEXEN.detect {|r| datetime = @doc.to_html[r] }

return unless datetime and datetime.length > 4

# Clean up the string for use by Chronic
datetime.strip!
datetime.gsub!(/(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|mon|tues|tue|weds|wed|thurs|thur|thu|fri|sat|sun)[^\w]*/i, '')
datetime.sub!(/(on\s+|\,)/, '')
datetime.sub!(/(on\s+|\,|\.)/, '')
datetime.sub!(/(\d+)(th|st|rd)/, '\1')
Chronic.parse(datetime) || datetime
Chronic.parse(datetime, :context => :past) || datetime
end
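
A minimal sketch of what the trailing `\.?` in MONTHS_REGEX and the extra `\.` in the clean-up change in practice. It uses only the standard library. The names `day_month_first`, `clean_datetime`, `MONTH_NAMES`, `OLD_MONTHS`, `NEW_MONTHS` and `WEEKDAYS` are hypothetical and introduced here for illustration; the method bodies are a non-mutating restatement of the lines in this hunk, and the final `Chronic.parse(datetime, :context => :past)` call is left out so the snippet runs without the chronic gem.

```ruby
MONTH_NAMES = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|' \
              'March|April|May|June|July|August|September|October|November|December'

OLD_MONTHS = /(#{MONTH_NAMES})/i      # before this commit
NEW_MONTHS = /(#{MONTH_NAMES})\.?/i   # after this commit: optional trailing period

WEEKDAYS = /(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|mon|tues|tue|weds|wed|thurs|thur|thu|fri|sat|sun)[^\w]*/i

# Builds the second DATETIME_REGEXEN pattern from the diff around a given month regex.
def day_month_first(months)
  /(on\s+)?\d+\s+#{months}\s+\D{0,10}\d+/i
end

# Restates the clean-up steps from the datetime method without mutating the input.
def clean_datetime(raw)
  s = raw.strip
  s = s.gsub(WEEKDAYS, '')           # drop weekday names and trailing punctuation
  s = s.sub(/(on\s+|\,|\.)/, '')     # sub removes only the FIRST match
  s.sub(/(\d+)(th|st|rd)/, '\1')     # "25th" becomes "25"
end

text = 'Published 25 Jul. 2012 by admin'

p text[day_month_first(OLD_MONTHS)]             # nil
p text[day_month_first(NEW_MONTHS)]             # "25 Jul. 2012"
p clean_datetime('25 Jul. 2012')                # "25 Jul 2012"
p clean_datetime('Wednesday, 25th Jul 2012')    # "25 Jul 2012"
```

Because `sub` replaces only the first match, a string that still starts with "on " loses that prefix but keeps its period: `clean_datetime('on 25 Jul. 2012')` yields "25 Jul. 2012".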

# Returns the author of the page/content
@@ -264,6 +263,36 @@ def videos(limit = 1)
reader_doc && !reader_doc.videos.empty? ? reader_doc.videos(limit) : nil
end

# Returns the tags or categories of the page/content
def tags
css_selectors = [
'.watch-info-tag-list a', # YouTube
'.entry .tags a', # Livejournal
'a[rel~=tag]', # Wordpress and many others
'a.tag', # Tumblr
'.tags a',
'.labels a',
'.categories a',
'.topics a'
]

tags = []

# grab the first one we get results from
css_selectors.each do |css_selector|
tags += @doc.css(css_selector)
break if tags.any?
end

# convert from Nokogiri Element objects to strings
tags.map!(&:inner_text)

# remove "#" from hashtag-like tags
tags.map! { |t| t.gsub(/^#/, '') }

tags
end
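
The "first selector with results wins" behaviour of `tags` can be sketched without Nokogiri. Everything below is an illustration under stated assumptions: `FakeElement`, `FakeDoc`, `TAG_SELECTORS` and `extract_tags` are hypothetical stand-ins (not part of Pismo or Nokogiri), and the selector list is a subset of the one in the diff.

```ruby
# Hypothetical stand-ins for a Nokogiri document and its elements,
# so the sketch runs with the standard library only.
FakeElement = Struct.new(:inner_text)

class FakeDoc
  # matches: { css_selector => [tag text, ...] }
  def initialize(matches)
    @matches = matches
  end

  def css(selector)
    @matches.fetch(selector, []).map { |text| FakeElement.new(text) }
  end
end

# Subset of the selectors from the diff, most specific first.
TAG_SELECTORS = ['a[rel~=tag]', 'a.tag', '.tags a', '.categories a'].freeze

def extract_tags(doc, selectors = TAG_SELECTORS)
  selectors.each do |selector|
    elements = doc.css(selector)
    next if elements.empty?

    # Stop at the first selector that matched; strip a leading "#".
    return elements.map { |el| el.inner_text.sub(/\A#/, '') }
  end
  []
end

doc = FakeDoc.new({ 'a.tag' => ['#ruby', '#parsing'], '.categories a' => ['Misc'] })

p extract_tags(doc)               # ["ruby", "parsing"]
p extract_tags(FakeDoc.new({}))   # []
```

The sketch anchors with `\A` rather than `^`. In Ruby, `^` matches at the start of every line, so the two differ only for tag text that contains newlines.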

# Returns the "keyword phrases" in the document (not the meta keywords - they're next to useless now)
DEFAULT_KEYWORD_OPTIONS = { :limit => 20, :minimum_score => "1%" }
def keywords(options = {})
12 changes: 9 additions & 3 deletions test/corpus/metadata_expected.yaml
@@ -5,8 +5,9 @@
:lede: I'm just aching to know if the new Apple tablet (insert caveats, weasel words and qualifiers here) is a potential Cintiq competitor. I don't think it will be, but you never know. It may also have a built in barometer and bird call generator. I'm never sure if Apple does themselves more good than harm with the secrecy and anticipation that surrounds the run-up to these announcements.
:feeds:
- http://www.readwriteweb.com/rss.xml
- http://www.readwriteweb.com/archives/2010/01/cartoon_apple_tablet_now_with_barometer_and_bird_c.xml
:briancray:
- http://www.readwriteweb.com/archives/2010/01/cartoon_apple_tablet_now_with_barometer_and_bird_c.xml
:tags: [apple tablet, steve jobs]
:briancray:
:title: 5 great examples of popular blog posts that you should know
:feed: http://feeds.feedburner.com/briancray/blog
:lede: "This is a mock post."
@@ -54,6 +55,7 @@
:author: Peter Cooper
:lede: CoffeeScript (GitHub repo) is a new programming language with a pure Ruby compiler.
:feed: http://www.rubyinside.com/feed/
:tags: [Cool]
:zefrank:
:sentences: If there's anyone who knows how to marshal an online audience, it's Ze Frank. Ze is best-known for his 2006 program "The Show," in which he made a new 2-3 minute video every day for 1 year. Topics ranged from "fingers in food" to the mysteries of airport signage to a tour de force summary of creatives' addiction to un-executed ideas, aka brain crack.
:title: "Ze Frank on Imaginary Audiences :: Articles :: The 99 Percent"
@@ -69,4 +71,8 @@
:title: 18 Incredible CSS3 Effects You Have Never Seen Before
:lede: "CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it. Also, I have started to implement it to my own project as well and I really love it!"
:sentences: CSS3 is hot these days and will soon be available in most modern browser. Just recently, I started to become aware to the present of CSS3 around the web. I can see some of the websites such as twitter and designer portfolios websites are using it.
:datetime: 2010-02-17 12:00:00 -07:00
:datetime: 2010-02-17 12:00:00 +00:00
:thegoodbookblog:
:title: Signs Of Life
:datetime: 2012-07-25 12:00:00 +00:00
:tags: [Church Life, Marriage and Family, Ministry and Leadership]
122 changes: 0 additions & 122 deletions test/corpus/metadata_expected.yaml.old

This file was deleted.
